Regular expressions (regex) are powerful tools used for pattern matching and text manipulation. They are widely used in programming, data validation, and text processing tasks.
Regular expressions are sequences of characters defining a search pattern. They are often used to validate input, search for specific patterns in text, and perform text replacements.
`abc` matches the string “abc” exactly.

These characters have special meanings in regex:

- `.`: Matches any single character except a newline.
- `^`: Matches the start of a string.
- `$`: Matches the end of a string.
- `\`: Escapes a special character to match it literally.

Quantifiers define the number of times an element can occur:
- `*`: 0 or more times.
- `+`: 1 or more times.
- `?`: 0 or 1 time.
- `{n}`: Exactly n times.
- `{n,}`: At least n times.
- `{n,m}`: Between n and m times.

Character classes specify a set of characters to match:
- `[abc]`: Matches any character `a`, `b`, or `c`.
- `[^abc]`: Matches any character except `a`, `b`, or `c`.
- `[a-z]`: Matches any lowercase ASCII letter.
- `\d`: Matches any digit (equivalent to `[0-9]` for ASCII text).
- `\w`: Matches any word character (alphanumeric or `_`).
- `\s`: Matches any whitespace character (spaces, tabs, etc.).

Anchors and boundaries match positions rather than characters:

- `^`: Matches the start of a string.
- `$`: Matches the end of a string.
- `\b`: Matches a word boundary.
  For example, `\bcat\b` matches “cat” but not “catapult”.
- `\B`: Matches a position that is not a word boundary.

- Use `()` to create capturing groups. `(abc)` captures the substring “abc”.
- Use `(?:...)` for groups you don’t want to capture. `(?:abc)` matches “abc” without storing it.
- Use a backreference such as `\1` to match the same text as an earlier group. `(\w+)\s\1` matches repeated words like “hello hello”.

Lookaheads assert that a pattern follows the current position:
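- Positive lookahead `(?=...)`: `\d(?= dollars)` matches a digit that is followed by " dollars".
- Negative lookahead `(?!...)`: `\d(?! dollars)` matches a digit that is not followed by " dollars" (for example, the `1` in "15 dollars").

Lookbehinds assert that a pattern precedes the current position:

- Positive lookbehind `(?<=...)`: `(?<=USD )\d+` matches digits preceded by "USD ".
- Negative lookbehind `(?<!...)`: `(?<!USD )\d+` matches digits not preceded by "USD ". Because a match can start at a later digit, it still finds "00" in "USD 100".

Flags modify the behavior of the regex engine: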
- `i`: Case-insensitive matching.
- `g`: Global search (matches all occurrences).
- `m`: Multiline mode (`^` and `$` match line boundaries).
- `s`: Dot-all mode (`.` matches newlines).
- `x`: Ignore whitespace in the pattern for readability.

- Use `|` to match one of several patterns. `cat|dog` matches “cat” or “dog”.
- Use named groups to label captures. `(?<year>\d{4})-(?<month>\d{2})` captures `year` and `month`.
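The grouping, lookaround, and flag syntax described above can be tried out with a short script. This is a minimal sketch using Python's `re` module; the sample strings and variable names are illustrative assumptions, and Python writes named groups as `(?P<name>...)` rather than `(?<name>...)`.

```python
import re

# Backreference: \1 must repeat the text captured by group 1.
repeated = re.search(r"\b(\w+)\s\1\b", "say hello hello again")

# Lookahead and lookbehind check context without consuming it.
dollar_amounts = re.findall(r"\d+(?= dollars)", "15 dollars and 20 euros")
usd_amounts = re.findall(r"(?<=USD )\d+", "USD 100, EUR 200")

# Named groups: Python spells them (?P<name>...).
date = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})", "Released 2024-03")

# Flags are passed as constants. There is no g flag in Python because
# re.findall already returns every match.
cats = re.findall(r"^cat$", "Cat\ndog\ncat", re.IGNORECASE | re.MULTILINE)

print(repeated.group(1))   # hello
print(dollar_amounts)      # ['15']
print(usd_amounts)         # ['100']
print(date.groupdict())    # {'year': '2024', 'month': '03'}
print(cats)                # ['Cat', 'cat']
```

Other engines spell some of these features differently, so it is worth checking the documentation of the engine you target before reusing a pattern.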
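The following patterns cover common validation tasks:

- Email address: `^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$`
- Phone number: `^\+?\d{1,3}[-.\s]?\(?\d{1,4}\)?[-.\s]?\d{1,4}[-.\s]?\d{1,9}$`
- URL: `https?:\/\/[^\s]+`
- Password (at least 8 characters, with an uppercase letter, a digit, and a special character): `^(?=.*[A-Z])(?=.*\d)(?=.*[@$!%*?&])[A-Za-z\d@$!%*?&]{8,}$`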
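As a rough illustration of how such validation patterns might be used, the sketch below wraps the email and password expressions above in two helper functions. The function names and sample inputs are hypothetical, and the email pattern is a pragmatic approximation rather than a full address validator.

```python
import re

# Patterns copied from the list above.
EMAIL_RE = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")
PASSWORD_RE = re.compile(
    r"^(?=.*[A-Z])(?=.*\d)(?=.*[@$!%*?&])[A-Za-z\d@$!%*?&]{8,}$"
)


def is_valid_email(text: str) -> bool:
    """Return True if text looks like an email address (hypothetical helper)."""
    return EMAIL_RE.match(text) is not None


def is_strong_password(text: str) -> bool:
    """Return True if text has 8+ allowed characters including an
    uppercase letter, a digit, and a special character (hypothetical helper)."""
    return PASSWORD_RE.match(text) is not None


print(is_valid_email("user@example.com"))  # True
print(is_valid_email("user@example"))      # False
print(is_strong_password("Passw0rd!"))     # True
print(is_strong_password("password1"))     # False
```

Compiling each pattern once at module level avoids recompiling it on every call.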
- Avoid `.*` unless necessary, as it is greedy and can cause inefficiency.
- Use `^` and `$` to restrict searches to relevant parts of the text.
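The effect of anchoring can be seen in a small sketch, assuming Python's `re` module and an illustrative input string: an unanchored pattern happily matches inside a longer number, while the anchored version accepts only an exact match.

```python
import re

text = "order 123456"

# Unanchored: finds the first four digits inside the longer number.
loose = re.search(r"\d{4}", text)

# Anchored: the whole string must be exactly four digits.
strict = re.search(r"^\d{4}$", text)
exact = re.search(r"^\d{4}$", "2024")

print(loose.group())  # 1234
print(strict)         # None
print(exact.group())  # 2024
```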
- Python: Use the `re` module.
- JavaScript: Use `/pattern/flags` syntax.
- Java: Use `java.util.regex`.
- C#: Use `System.Text.RegularExpressions`.
- `.*` matches everything, including unwanted content.
- Escape special characters: use `\.` to match a literal period.
- Keep complex patterns readable with the `x` flag.

| Pattern | Description |
|---|---|
| `.` | Any character except newline. |
| `\d` | Digit (0-9). |
| `\D` | Non-digit. |
| `\w` | Word character (alphanumeric or `_`). |
| `\W` | Non-word character. |
| `\s` | Whitespace (space, tab, newline). |
| `\S` | Non-whitespace. |
| `[abc]` | Any of `a`, `b`, or `c`. |
| `[^abc]` | Not `a`, `b`, or `c`. |
| `a\|b` | `a` or `b`. |
| `(abc)` | Capturing group. |
| `(?:abc)` | Non-capturing group. |
| `(?=abc)` | Positive lookahead. |
| `(?!abc)` | Negative lookahead. |
| `(?<=abc)` | Positive lookbehind. |
| `(?<!abc)` | Negative lookbehind. |
Regular expressions are versatile tools that simplify pattern matching and text manipulation tasks. By mastering regex syntax and leveraging tools effectively, you can handle a wide range of applications, from input validation to complex data extraction. With practice, regex becomes an invaluable skill for developers, data analysts, and IT professionals.
Regular expressions (regex) are an indispensable tool for text processing, offering unmatched power and flexibility for pattern matching and text manipulation. Below is an expanded guide covering additional concepts, best practices, and advanced use cases to provide a complete understanding of regex.
Understanding how regex engines process patterns can help write more efficient expressions.
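- DFA-based engines are typically used by tools such as `grep` and `awk`, while most programming-language libraries use backtracking (NFA-based) engines.
- Greedy quantifiers (the default) match as much as possible: `a.*b` matches from the first `a` to the last `b`.
- Lazy quantifiers match as little as possible: use `*?`, `+?`, or `??`. `a.*?b` matches from the first `a` to the next `b`.
- `\p{L}` matches any Unicode letter.
- Use the `u` flag in JavaScript or equivalent to enable Unicode.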
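The difference between greedy and lazy matching described above is easy to check with a short sketch. It assumes Python's `re` module and an illustrative input string. The Unicode property escape `\p{L}` is left out because the standard `re` module does not support it.

```python
import re

text = "a1b a2b"

# Greedy: .* runs to the last b in the string.
greedy = re.search(r"a.*b", text).group()

# Lazy: .*? stops at the first b that completes a match.
lazy = re.search(r"a.*?b", text).group()

# With findall, the lazy pattern yields each short match in turn.
all_lazy = re.findall(r"a.*?b", text)

print(greedy)    # a1b a2b
print(lazy)      # a1b
print(all_lazy)  # ['a1b', 'a2b']
```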
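- Atomic groups use the syntax `(?>...)`. `(?>a*)b` matches as many `a`s as possible, followed by `b`, without backtracking into the group.
- Conditionals use the syntax `(?(condition)yes-pattern|no-pattern)`. `(?(1)abc|def)` matches `abc` if group 1 took part in the match, otherwise it matches `def`.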
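A conditional can be sketched in Python, whose `re` module supports the `(?(1)...)` form. The pattern and sample strings below are illustrative assumptions: a closing `>` is required only when an opening `<` was matched. Atomic groups are omitted here because they are only available in newer versions of `re`.

```python
import re

# Group 1 optionally matches "<"; the conditional (?(1)>) then
# requires ">" only when group 1 took part in the match.
bracketed = re.compile(r"^(<)?\w+(?(1)>)$")

samples = ["<tag>", "tag", "<tag", "tag>"]
results = {s: bool(bracketed.match(s)) for s in samples}

print(results)
# {'<tag>': True, 'tag': True, '<tag': False, 'tag>': False}
```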
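The following patterns illustrate common matching and extraction tasks:

- URL with optional `www.`: `https?:\/\/(www\.)?[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}`
- Date in YYYY-MM-DD format: `^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$`
- IPv4-style address (does not check that each number is at most 255): `\b\d{1,3}(\.\d{1,3}){3}\b`
- ISO 8601-style timestamp: `\b\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}`
- Hashtag: `#\w+`
- Double-quoted string: `"(.*?)"`
- CSV field: `(?:^|,)(\"(?:[^\"]+|\"\")*\"|[^,]*)`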
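A few of these patterns can be exercised with `re.findall`, as in the sketch below. The input strings are made up for illustration, and the IP pattern is rewritten with a non-capturing group because `re.findall` returns only the captured group when a pattern contains one.

```python
import re

# Hashtags: "#" followed by one or more word characters.
hashtags = re.findall(r"#\w+", "Learning #regex with #Python3!")

# Quoted strings: the lazy group captures the text between the quotes.
quoted = re.findall(r'"(.*?)"', 'say "hi" and "bye"')

# IPv4-style addresses, using (?:...) so findall returns the whole match.
ips = re.findall(r"\b\d{1,3}(?:\.\d{1,3}){3}\b", "hosts: 10.0.0.1, 192.168.1.254")

# Date validation with the anchored YYYY-MM-DD pattern.
date_re = re.compile(r"^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$")

print(hashtags)                            # ['#regex', '#Python3']
print(quoted)                              # ['hi', 'bye']
print(ips)                                 # ['10.0.0.1', '192.168.1.254']
print(bool(date_re.match("2024-02-30")))   # True (format only, not calendar-aware)
print(bool(date_re.match("2024-13-01")))   # False
```

The date result shows a limit of the pattern: it checks the shape of the date, not whether that day exists in the given month.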
Python: Use the `re` module.

```python
import re

pattern = r"\d{3}-\d{2}-\d{4}"
match = re.search(pattern, "SSN: 123-45-6789")
if match:  # re.search returns None when nothing matches
    print(match.group())  # Output: 123-45-6789
```
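JavaScript: Use `/pattern/flags` syntax.

```javascript
const regex = /\d{3}-\d{2}-\d{4}/;
const match = "SSN: 123-45-6789".match(regex);
if (match) { // match() returns null when nothing matches
  console.log(match[0]); // Output: 123-45-6789
}
```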
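Java: Use `java.util.regex`.

```java
import java.util.regex.*;

// Hypothetical wrapper class so the snippet compiles on its own.
public class SsnExample {
    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("\\d{3}-\\d{2}-\\d{4}");
        Matcher matcher = pattern.matcher("SSN: 123-45-6789");
        if (matcher.find()) {
            System.out.println(matcher.group()); // Output: 123-45-6789
        }
    }
}
```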
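Command line: Use `grep`, `awk`, or `sed`.

```shell
# POSIX extended regex (-E) has no \d shorthand, so use [0-9] instead.
echo "SSN: 123-45-6789" | grep -oE '[0-9]{3}-[0-9]{2}-[0-9]{4}'
```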
R: Use `grepl`, `gsub`, or `regmatches`.

```r
text <- "SSN: 123-45-6789"
pattern <- "\\d{3}-\\d{2}-\\d{4}"
regmatches(text, gregexpr(pattern, text))
```
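Use verbose mode (the `x` flag) to document complex patterns:

```python
import re

pattern = r"""
    ^                       # Start of string
    \d{4}                   # Year
    -                       # Separator
    (0[1-9]|1[0-2])         # Month
    -                       # Separator
    (0[1-9]|[12]\d|3[01])   # Day
    $                       # End of string
"""
# re.VERBOSE makes the engine ignore the whitespace and comments above.
date_re = re.compile(pattern, re.VERBOSE)
```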
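- Avoid nested quantifiers that can cause catastrophic backtracking (e.g., `(a+)+`).
- Prefer specific patterns (e.g., `\d{3}`) over general ones (e.g., `.*`).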
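One way to act on the first tip is to check whether a nested quantifier is needed at all. The sketch below, using Python's `re` module and a handful of made-up short inputs, suggests that `(a+)+b` accepts the same strings as the simpler `a+b`, which gives a backtracking engine far fewer ways to retry on a failed match. The inputs are kept short on purpose so the nested version stays fast.

```python
import re

nested = re.compile(r"(a+)+b")  # nested quantifier, risky on long inputs
flat = re.compile(r"a+b")       # simpler pattern for the same strings

samples = ["ab", "aaab", "b", "aaa", "aabx"]

# Compare the verdict of both patterns on every sample.
nested_results = [bool(nested.fullmatch(s)) for s in samples]
flat_results = [bool(flat.fullmatch(s)) for s in samples]

print(nested_results)                  # [True, True, False, False, False]
print(nested_results == flat_results)  # True
```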
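A few more practical patterns:

- Currency amount: `\$\d{1,3}(,\d{3})*(\.\d{2})?`
- Two or more consecutive whitespace characters: `\s{2,}`
- HTML tag: `<[^>]+>`
- Reformat a 10-digit phone number: replace `(\d{3})(\d{3})(\d{4})` with `(\1) \2-\3`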
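These patterns map naturally onto search-and-replace calls. The following sketch uses Python's `re.sub` and `re.findall` with made-up input strings. The currency pattern is written with non-capturing groups so that `re.findall` returns whole matches.

```python
import re

# Collapse runs of two or more whitespace characters into one space.
collapsed = re.sub(r"\s{2,}", " ", "too   many    spaces")

# Strip simple HTML tags (not a substitute for a real HTML parser).
stripped = re.sub(r"<[^>]+>", "", "<p>Hello <b>world</b></p>")

# Reformat a 10-digit phone number using backreferences in the replacement.
phone = re.sub(r"(\d{3})(\d{3})(\d{4})", r"(\1) \2-\3", "5551234567")

# Extract currency amounts.
amounts = re.findall(r"\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?", "Total: $1,234.56 or $99")

print(collapsed)  # too many spaces
print(stripped)   # Hello world
print(phone)      # (555) 123-4567
print(amounts)    # ['$1,234.56', '$99']
```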
Regular expressions are versatile and powerful tools for solving a variety of text-processing challenges. By mastering their syntax, understanding their limitations, and leveraging the right tools, you can effectively tackle tasks ranging from simple validations to complex data transformations. Practice, combined with a solid understanding of regex engines and performance considerations, ensures you can write efficient and maintainable patterns for any application.