If you’re a Linux or a Mac user, you’ve probably used grep at the command line to search through files by matching patterns. Regular expressions (regex) allow you to search, match, and manipulate text based on patterns, which makes them powerful tools for text processing and data cleaning.
For regular expression matching operations in Python, you can use the built-in re module. In this tutorial, we’ll look at how you can use regular expressions to clean data. We’ll look at removing unwanted characters, extracting specific patterns, finding and replacing text, and more.
1. Remove Unwanted Characters
String fields (almost) always require extensive cleaning before you can analyze them. Unwanted characters, often the result of varying formats, can make your data difficult to analyze. Regex can help you remove these efficiently.
Before we go ahead, let’s import the built-in re module:
import re
You can use the sub() function from the re module to replace or remove all occurrences of a pattern or special character. Suppose you have strings with phone numbers that include dashes and parentheses. You can remove them as shown:
text = "Contact info: (123)-456-7890 and 987-654-3210."
cleaned_text = re.sub(r'[()-]', '', text)
print(cleaned_text)
Here, re.sub(pattern, replacement, string) replaces all occurrences of the pattern in the string with the replacement. We use the r'[()-]' pattern to match any occurrence of (, ), or -, giving us the output:
Output >>> Contact info: 123-456-7890 and 987-654-3210.
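Going a step further, re.sub() also supports capture groups and backreferences, which let you rewrite matches into a consistent format instead of just deleting characters. Here's a minimal sketch, assuming US-style 3-3-4 digit numbers like those in the example above:

```python
import re

text = "Contact info: (123)-456-7890 and 987-654-3210."

# Capture the three digit groups (with optional parentheses around the
# area code) and rewrite every match in a uniform dash-separated format
normalized = re.sub(r'\(?(\d{3})\)?-(\d{3})-(\d{4})', r'\1-\2-\3', text)
print(normalized)
# Contact info: 123-456-7890 and 987-654-3210.
```

This normalizes both numbers to the same format, which is usually more useful downstream than stripping the separators entirely.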
2. Extract Specific Patterns
Extracting email addresses, URLs, or phone numbers from text fields is a common task, as these are often the most relevant pieces of information. To extract all occurrences of a specific pattern of interest, you can use the findall() function.
You can extract email addresses from a text like so:
text = "Please reach out to us at support@example.org or help@example.org."
emails = re.findall(r'\b[\w.-]+?@\w+?\.\w+?\b', text)
print(emails)
The re.findall(pattern, string) function finds and returns (as a list) all occurrences of the pattern in the string. We use the pattern r'\b[\w.-]+?@\w+?\.\w+?\b' to match all email addresses:
Output >>> ['support@example.org', 'help@example.org']
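The same function works for other patterns, such as URLs. The pattern below is a deliberately simple sketch (robust URL matching is considerably more involved); it assumes a scheme, a dotted host, and an optional path:

```python
import re

text = "Docs at https://example.com/docs and the blog at http://blog.example.org."

# scheme, then a dotted hostname, then an optional /path;
# requiring word chars after each dot keeps the sentence-ending
# period out of the match
urls = re.findall(r'https?://[\w-]+(?:\.[\w-]+)+(?:/[\w./-]*)?', text)
print(urls)
# ['https://example.com/docs', 'http://blog.example.org']
```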
3. Replace Patterns
We’ve already used the sub() function to remove unwanted special characters. You can also replace one pattern with another to make a field more consistent for analysis.
Here’s an example of collapsing multiple spaces into one:
text = "Using   regular    expressions."
cleaned_text = re.sub(r'\s+', ' ', text)
print(cleaned_text)
The r'\s+' pattern matches one or more whitespace characters. The replacement string is a single space, giving us the output:
Output >>> Using regular expressions.
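Replacement strings can also use backreferences to reorder captured groups, which is handy for normalizing formats. As one possible sketch, here's a way to rewrite MM/DD/YYYY dates as ISO-style YYYY-MM-DD:

```python
import re

text = "Start: 03/14/2024, End: 12/01/2024"

# Capture month, day, and year, then reorder them via backreferences
iso = re.sub(r'(\d{2})/(\d{2})/(\d{4})', r'\3-\1-\2', text)
print(iso)
# Start: 2024-03-14, End: 2024-12-01
```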
4. Validate Data Formats
Validating data formats ensures data consistency and correctness. Regex can validate formats like emails, phone numbers, and dates.
Here’s how you can use the match() function to validate email addresses:
email = "test@example.com"
if re.match(r'^\b[\w.-]+?@\w+?\.\w+?\b$', email):
    print("Valid email")
else:
    print("Invalid email")
In this example, the email string is valid:
Output >>> Valid email
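For stricter validation you can use re.fullmatch(), which succeeds only if the entire string matches the pattern. Here's a small sketch that checks the shape of YYYY-MM-DD dates; note it validates the format only, not calendar correctness:

```python
import re

def is_valid_date(s):
    # fullmatch anchors the pattern to the whole string; the alternations
    # constrain months to 01-12 and days to 01-31
    return bool(re.fullmatch(r'\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])', s))

print(is_valid_date("2024-06-15"))  # True
print(is_valid_date("2024-13-01"))  # False
```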
5. Split Strings by Patterns
Sometimes you may want to split a string into multiple strings based on patterns or the occurrence of specific separators. You can use the split() function to do that.
Let’s split the text string into sentences:
text = "This is sentence one. And this is sentence two! Is this sentence three?"
sentences = re.split(r'[.!?]', text)
print(sentences)
Here, re.split(pattern, string) splits the string at all occurrences of the pattern. We use the r'[.!?]' pattern to match periods, exclamation marks, or question marks:
Output >>> ['This is sentence one', ' And this is sentence two', ' Is this sentence three', '']
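Notice the trailing empty string and leading spaces in the output above. One simple way to clean these up is to filter and strip the results:

```python
import re

text = "This is sentence one. And this is sentence two! Is this sentence three?"

# Split on sentence-ending punctuation, then drop empty pieces and
# strip the leading spaces left over from the split
sentences = [s.strip() for s in re.split(r'[.!?]', text) if s.strip()]
print(sentences)
# ['This is sentence one', 'And this is sentence two', 'Is this sentence three']
```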
Clean Pandas Data Frames with Regex
Combining regex with pandas allows you to clean data frames efficiently.
To remove non-alphabetic characters from names and validate email addresses in a data frame:
import pandas as pd
data = {
    'names': ['Alice123', 'Bob!@#', 'Charlie$$$'],
    'emails': ['alice@example.com', 'bob_at_example.com', 'charlie@example.com']
}
df = pd.DataFrame(data)
# Remove non-alphabetic characters from names
df['names'] = df['names'].str.replace(r'[^a-zA-Z]', '', regex=True)
# Validate email addresses
df['valid_email'] = df['emails'].apply(lambda x: bool(re.match(r'^\b[\w.-]+?@\w+?\.\w+?\b$', x)))
print(df)
In the above code snippet:
- df['names'].str.replace(pattern, replacement, regex=True) replaces occurrences of the pattern in the series.
- lambda x: bool(re.match(pattern, x)) applies the regex match to each email and converts the result to a boolean.
The output is as shown:
names emails valid_email
0 Alice alice@example.com True
1 Bob bob_at_example.com False
2 Charlie charlie@example.com True
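Pandas offers other regex-aware string methods as well. For instance, Series.str.extract() pulls out a capture group, returning NaN for rows that don't match. Here's a small sketch that extracts the domain from each address:

```python
import pandas as pd

df = pd.DataFrame({
    'emails': ['alice@example.com', 'bob_at_example.com', 'charlie@example.com']
})

# Extract the first capture group (everything after '@');
# expand=False returns a Series, and non-matching rows become NaN
df['domain'] = df['emails'].str.extract(r'@([\w.-]+)', expand=False)
print(df)
```

The malformed address (no '@') yields NaN, which you can then filter or flag as needed.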
Wrapping Up
I hope you found this tutorial helpful. Let’s review what we’ve learned:
- Use re.sub to remove unnecessary characters, such as dashes and parentheses in phone numbers.
- Use re.findall to extract specific patterns from text.
- Use re.sub to replace patterns, such as converting multiple spaces into a single space.
- Validate data formats with re.match to ensure data adheres to specific formats, like valid email addresses.
- To split strings based on patterns, apply re.split.
In practice, you’ll combine regex with pandas for efficient cleaning of text fields in data frames. It’s also good practice to comment your regular expressions to explain their purpose, which improves readability and maintainability. To learn more about data cleaning with pandas, read 7 Steps to Mastering Data Cleaning with Python and Pandas.
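On the point of commenting your regex: the re.VERBOSE flag lets you add whitespace and inline comments directly inside a pattern. A small sketch (the email pattern here is illustrative, not exhaustive):

```python
import re

# re.VERBOSE ignores unescaped whitespace in the pattern and treats
# '#' as the start of an inline comment, so complex patterns stay readable
email_pattern = re.compile(r"""
    ^[\w.-]+       # local part: word chars, dots, hyphens
    @              # literal at sign
    [\w-]+         # domain name
    (?:\.[\w-]+)+  # one or more dot-separated labels, e.g. .example .com
    $
""", re.VERBOSE)

print(bool(email_pattern.match("test@example.com")))  # True
print(bool(email_pattern.match("not-an-email")))      # False
```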
Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she’s working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.