Regular expressions are an indispensable tool for text processing and data manipulation. They allow you to search, match, and manipulate text based on specific patterns.
We will cover the basics and some advanced features of regular expressions in Python.
The re module provides support for regular expressions, making it straightforward to incorporate regex functionality into your Python scripts.
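Let's first look at a simple example of matching patterns in a string using the re.match function. Note that re.match only attempts a match at the beginning of the string, so the example string below starts with the digits we are looking for.

    import re

    pattern = r'\d+'  # Matches one or more digits
    string = "123 apples are here"

    # re.match succeeds only if the pattern matches at the start of the string
    match = re.match(pattern, string)
    if match:
        print("Match found:", match.group())
    else:
        print("No match found")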
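Because re.match is anchored at the start of the string, it is worth seeing how it behaves on input where the digits appear later. The following is a small sketch; the two input strings and the variable names are illustrative only.

```python
import re

pattern = r'\d+'

# Hypothetical inputs: only the first one starts with digits
starts_with_digits = re.match(pattern, "123 apples")
digits_in_middle = re.match(pattern, "There are 123 apples")

print(starts_with_digits.group())  # 123
print(starts_with_digits.span())   # (0, 3)
print(digits_in_middle)            # None
```

A result of None here does not mean the string contains no digits, only that it does not begin with them.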
The re.search function helps you find the first location where the regex pattern matches in the string, while re.findall returns all matches.

    import re

    pattern = r'\d+'  # Matches one or more digits
    string = "There are 123 apples and 456 oranges"

    # Search for the first match
    search_result = re.search(pattern, string)
    if search_result:
        print("Search result:", search_result.group())

    # Find all matches
    findall_result = re.findall(pattern, string)
    print("Find all result:", findall_result)
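In practice the results of these two functions usually need a little handling: re.findall returns strings, and re.search returns None when nothing matches. A minimal sketch, assuming the same sample sentence as above (the variable names are illustrative):

```python
import re

text = "There are 123 apples and 456 oranges"

# re.findall returns strings, so convert before doing arithmetic
numbers = [int(n) for n in re.findall(r'\d+', text)]
total = sum(numbers)

# re.search returns None when nothing matches, so guard before calling .group()
missing = re.search(r'\d+', "no digits here")

print(numbers)  # [123, 456]
print(total)    # 579
print(missing)  # None
```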
The re.sub function allows you to replace matched patterns with a specified string.

    import re

    pattern = r'\d+'  # Matches one or more digits
    string = "There are 123 apples and 456 oranges"
    replacement = "X"

    # Replace all matches with 'X'
    result = re.sub(pattern, replacement, string)
    print("Substitution result:", result)
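re.sub also accepts a count argument and, instead of a replacement string, a function that is called for each match. The sketch below illustrates both; the helper name double is made up for this example.

```python
import re

text = "There are 123 apples and 456 oranges"

# count=1 limits the substitution to the first match
first_only = re.sub(r'\d+', "X", text, count=1)

# A function replacement receives each match object and returns the new text
def double(match):
    return str(int(match.group()) * 2)

doubled = re.sub(r'\d+', double, text)

print(first_only)  # There are X apples and 456 oranges
print(doubled)     # There are 246 apples and 912 oranges
```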
In regular expressions, grouping allows you to capture specific parts of a match using parentheses (). This makes it possible to extract and reuse portions of the matched pattern, which is useful for extracting data from structured text.
    import re

    pattern = r'(\d+)\s(\w+)'  # Matches "number word" pairs
    string = "123 apples"

    match = re.search(pattern, string)
    if match:
        print("Full match:", match.group(0))
        print("First group:", match.group(1))
        print("Second group:", match.group(2))
Let's break down the above code example:
- pattern = r'(\d+)\s(\w+)':
  - (\d+): The first group captures one or more digits (\d+). This will match the numeric part of the string, "123".
  - \s: This matches a single whitespace character, ensuring that the number and the word are separated by whitespace.
  - (\w+): The second group captures one or more word characters (letters, digits, or underscores). This will match the word part of the string, "apples".
- string = "123 apples": This is the string being searched. It contains a number followed by a word, matching the pattern.
- re.search(pattern, string): This function searches the string for the first match of the pattern. If a match is found, it returns a match object; otherwise, it returns None.
- match.group(0): Returns the entire matched string. In this case, it returns "123 apples".
- match.group(1): Returns the first captured group, which is the portion matched by the first set of parentheses (\d+). In this case, it returns "123".
- match.group(2): Returns the second captured group, which is the portion matched by the second set of parentheses (\w+). In this case, it returns "apples".
This technique is commonly used in text processing tasks such as extracting data from structured text.
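When a pattern contains groups, re.findall returns one tuple of captured groups per match, which makes it convenient for pulling several records out of one string. A minimal sketch, assuming an illustrative input sentence and a made-up inventory dictionary:

```python
import re

pattern = r'(\d+)\s(\w+)'
text = "123 apples, 456 oranges and 7 pears"  # illustrative input

# With two groups, re.findall returns a list of (number, word) tuples
pairs = re.findall(pattern, text)

# Build a word -> count mapping from the captured groups
inventory = {word: int(number) for number, word in pairs}

print(pairs)      # [('123', 'apples'), ('456', 'oranges'), ('7', 'pears')]
print(inventory)  # {'apples': 123, 'oranges': 456, 'pears': 7}
```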
Lookaheads and lookbehinds are part of zero-width assertions in regular expressions, meaning they match patterns based on the context around them without including the surrounding characters in the final match result. These assertions allow you to check for the presence or absence of a pattern before or after the part you want to match, without consuming those characters in the match itself.
Lookaheads and lookbehinds are especially useful when you need to ensure a match occurs in a specific context but don’t want to include that context in the final result.
You can use lookaheads and lookbehinds in scenarios where you want to capture data with specific boundaries or conditions without including those boundaries in your results. For instance:
- Positive lookahead ((?=...)): Ensures that a certain pattern follows the current position but doesn't include it in the match.
- Negative lookahead ((?!...)): Ensures that a certain pattern does not follow the current position.
- Positive lookbehind ((?<=...)): Ensures that a certain pattern precedes the current position but doesn't include it in the match.
- Negative lookbehind ((?<!...)): Ensures that a certain pattern does not precede the current position.
In the example below, we use a lookbehind to match digits that are preceded by a dollar sign ($), but without including the dollar sign in the result.
    import re

    pattern = r'(?<=\$)\d+'  # Matches digits preceded by a dollar sign
    string = "The price is $123"

    search_result = re.search(pattern, string)
    if search_result:
        print("Lookbehind result:", search_result.group())
- pattern = r'(?<=\$)\d+':
  - (?<=...): This is a positive lookbehind assertion. It asserts that what immediately precedes the current position in the string is a dollar sign (\$), but the dollar sign itself will not be part of the match result.
  - \d+: Matches one or more digits. These digits are the part of the pattern we want to capture and return.
- string = "The price is $123": This is the input string, which contains a dollar sign followed by a number. The goal is to match the number (123) that comes after the dollar sign.
- re.search(pattern, string): This searches the string for a match based on the pattern. In this case, it will look for digits (\d+) that are preceded by a dollar sign ($), but the dollar sign will not be included in the match result.
- search_result.group(): If a match is found, this returns the matched text that satisfies the lookbehind condition. In this case, it returns "123".

A positive lookahead works in the other direction, checking what follows the match:

    import re

    pattern = r'\w+(?=\sis)'  # Matches any word that is followed by " is"
    string = "This is a test."

    search_result = re.search(pattern, string)
    if search_result:
        print("Lookahead result:", search_result.group())
(?=\sis) asserts that the matched word (\w+) must be followed by the phrase " is", but " is" is not included in the match. The result is "This".
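Lookaheads combine naturally with re.findall when you want every value that carries a particular suffix. The following sketch uses made-up sample data to collect only the numbers followed by "kg":

```python
import re

text = "Boxes: 12kg, 7kg, 30cm, 5kg"  # hypothetical sample data

# Match digits only when "kg" follows; the unit itself is not consumed
weights = re.findall(r'\d+(?=kg)', text)

print(weights)  # ['12', '7', '5']
```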
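    import re

    # Matches a run of digits preceded by neither a dollar sign nor another digit
    pattern = r'(?<![\$\d])\d+'
    string = "Price: $123 or 456"

    search_result = re.search(pattern, string)
    if search_result:
        print("Negative lookbehind result:", search_result.group())

(?<![\$\d]) ensures that the digits are not preceded by a dollar sign or by another digit, so it will match "456" and not "123". Excluding digits in the lookbehind matters: with (?<!\$) alone, the search would return "23", because the "2" in "$123" is preceded by "1" rather than by a dollar sign.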
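The list of assertion types above also includes the negative lookahead, (?!...), which has no example of its own. As a rough sketch, the pattern below keeps whole numbers that are not followed by a percent sign; the sample sentence is invented for illustration.

```python
import re

text = "Discount 20% on 3 items, 15% on 10"  # hypothetical sample data

# \b...\b keeps each number whole; (?!%) rejects numbers followed by "%"
plain_numbers = re.findall(r'\b\d+\b(?!%)', text)

print(plain_numbers)  # ['3', '10']
```

The word boundaries prevent the engine from backtracking into a partial number such as the "2" in "20%".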
    import re

    # Matches digits preceded by $ and followed by " USD"
    pattern = r'(?<=\$)\d+(?=\sUSD)'
    string = "The price is $123 USD."

    search_result = re.search(pattern, string)
    if search_result:
        print("Combined lookahead/lookbehind result:", search_result.group())

(?<=\$) ensures the digits are preceded by $, and the lookahead (?=\sUSD) ensures the digits are followed by " USD". The final match will only include the digits: 123.
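To see that both conditions must hold at once, the same pattern can be run over several candidates with re.findall. This is only a sketch, and the price list is invented for the example.

```python
import re

pattern = r'(?<=\$)\d+(?=\sUSD)'
text = "Items: $123 USD, $45 CAD, 67 USD, $8 USD"  # hypothetical sample data

# Only amounts with both a leading "$" and a trailing " USD" are returned
usd_amounts = re.findall(pattern, text)

print(usd_amounts)  # ['123', '8']
```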
Flags modify the behavior of the regex. Common flags include re.IGNORECASE, re.MULTILINE, and re.DOTALL.

    import re

    pattern = r'apple'
    string = "APPLE pie"

    # Case-insensitive search
    search_result = re.search(pattern, string, re.IGNORECASE)
    if search_result:
        print("Case-insensitive search result:", search_result.group())
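The other two flags mentioned above change how anchors and the dot treat newlines. A minimal sketch with an illustrative two-line string:

```python
import re

text = "first line\nsecond line"

# Without re.MULTILINE, ^ matches only at the very start of the string
starts_default = re.findall(r'^\w+', text)
# With re.MULTILINE, ^ also matches right after each newline
starts_multiline = re.findall(r'^\w+', text, re.MULTILINE)

# By default . does not match a newline; re.DOTALL lets it
dot_default = re.search(r'first.*second', text)
dot_all = re.search(r'first.*second', text, re.DOTALL)

print(starts_default)         # ['first']
print(starts_multiline)       # ['first', 'second']
print(dot_default)            # None
print(repr(dot_all.group()))  # 'first line\nsecond'
```

Flags can be combined with the | operator, for example re.IGNORECASE | re.MULTILINE.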
For better performance and readability, especially if the same pattern is used multiple times, you can compile the regex once with re.compile and reuse the resulting pattern object.

    import re

    pattern = r'\d+'
    compiled_pattern = re.compile(pattern)

    string1 = "123 apples"
    string2 = "456 oranges"

    # Use the compiled pattern
    match1 = compiled_pattern.search(string1)
    match2 = compiled_pattern.search(string2)

    if match1:
        print("Compiled search result 1:", match1.group())
    if match2:
        print("Compiled search result 2:", match2.group())
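A compiled pattern is most useful inside a loop, and flags can be passed to re.compile so they do not have to be repeated on every call. The sketch below assumes a made-up list of input lines; the names fruit_count and totals are illustrative.

```python
import re

# Compile once, outside the loop; flags are stored in the compiled object
fruit_count = re.compile(r'(\d+)\s(apples|oranges)', re.IGNORECASE)

lines = ["123 apples", "456 ORANGES", "no fruit here"]  # illustrative input

totals = {}
for line in lines:
    match = fruit_count.search(line)
    if match:
        # Normalize the captured word so "ORANGES" and "oranges" share a key
        totals[match.group(2).lower()] = int(match.group(1))

print(totals)  # {'apples': 123, 'oranges': 456}
```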
Regular expressions offer a wide range of features and syntax for pattern matching, including character classes, quantifiers, anchors, and more. You can refer to the Python documentation for more information on regular expressions and their syntax.