5 Tips for Using Regular Expressions in Data Cleaning
Image by Author | Created on Canva

 

If you’re a Linux or Mac user, you’ve probably used grep at the command line to search through files by matching patterns. Regular expressions (regex) let you search, match, and manipulate text based on patterns, which makes them powerful tools for text processing and data cleaning.

For regular expression matching operations in Python, you can use the built-in re module. In this tutorial, we’ll look at how you can use regular expressions to clean data: removing unwanted characters, extracting specific patterns, finding and replacing text, and more.

 

1. Remove Unwanted Characters

 

Before we go ahead, let’s import the built-in re module:

import re

 

String fields (almost) always require extensive cleaning before you can analyze them. Unwanted characters—often resulting from varying formats—can make your data difficult to analyze. Regex can help you remove these efficiently.

You can use the sub() function from the re module to replace or remove all occurrences of a pattern or special character. Suppose you have strings with phone numbers that include dashes and parentheses. You can remove them as shown:

text = "Contact info: (123)-456-7890 and 987-654-3210."
cleaned_text = re.sub(r'[()-]', '', text)
print(cleaned_text) 

 

Here, re.sub(pattern, replacement, string) replaces all occurrences of the pattern in the string with the replacement. We use the r'[()-]' pattern to match any occurrence of (, ), or -, giving us the output:

Output >>> Contact info: 1234567890 and 9876543210.
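As a variation on the same idea, if you want to keep only the digits of a single phone number, you can match everything that is not a digit with \D instead of listing each punctuation character. A minimal sketch:

```python
import re

phone = "(123)-456-7890"

# \D matches any character that is NOT a digit, so this strips
# parentheses, dashes, and any other punctuation in one pass
digits = re.sub(r'\D', '', phone)
print(digits)  # 1234567890
```

This is handy when you don’t know in advance which punctuation characters appear in the field.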

 

2. Extract Specific Patterns

 

Extracting email addresses, URLs, or phone numbers from text fields is a common task, as these are often the most relevant pieces of information. To extract all occurrences of a specific pattern, you can use the findall() function.

You can extract email addresses from a text like so:

text = "Please reach out to us at support@example.org or help@example.org."
emails = re.findall(r'\b[\w.-]+?@\w+?\.\w+?\b', text)
print(emails)

 

The re.findall(pattern, string) function finds and returns (as a list) all occurrences of the pattern in the string. We use the pattern r'\b[\w.-]+?@\w+?\.\w+?\b' to match all email addresses:

Output >>> ['support@example.org', 'help@example.org']
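The same function works for phone numbers. Here’s a hedged sketch that assumes US-style numbers like the ones in the earlier example (an optional parenthesized area code followed by dash-separated groups); real-world phone formats vary far more:

```python
import re

text = "Call (123)-456-7890 or 987-654-3210 for assistance."

# \(?\d{3}\)? allows an optional parenthesized area code;
# -? allows an optional dash after it
phones = re.findall(r'\(?\d{3}\)?-?\d{3}-\d{4}', text)
print(phones)  # ['(123)-456-7890', '987-654-3210']
```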

 

3. Replace Patterns

 

We’ve already used the sub() function to remove unwanted special characters. You can also replace one pattern with another to make a field more consistent and easier to analyze.

Here’s an example of removing unwanted spaces:

text = "Using     regular     expressions."
cleaned_text = re.sub(r'\s+', ' ', text)
print(cleaned_text) 

 

The r'\s+' pattern matches one or more whitespace characters. The replacement string is a single space, giving us the output:

Output >>> Using regular expressions.
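Replacement also helps standardize inconsistent separators. Here’s a small sketch with hypothetical dates, using lookarounds so that only separators sitting between digits are touched:

```python
import re

dates = "Dates: 2024/01/15, 2024.02.20, 2024-03-25"

# (?<=\d) and (?=\d) ensure we only replace a / or . that
# appears between two digits, leaving other punctuation alone
standardized = re.sub(r'(?<=\d)[/.](?=\d)', '-', dates)
print(standardized)  # Dates: 2024-01-15, 2024-02-20, 2024-03-25
```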

 

4. Validate Data Formats

 

Validating data formats ensures data consistency and correctness. Regex can validate formats like emails, phone numbers, and dates.

Here’s how you can use the match() function to validate email addresses:

email = "test@example.com"
if re.match(r'^\b[\w.-]+?@\w+?\.\w+?\b$', email):
    print("Valid email")  
else:
    print("Invalid email")

 

In this example, the email string matches the pattern, so we get:

Output >>> Valid email
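The same idea extends to other formats. Here’s a minimal sketch of a date-format check using re.fullmatch, which requires the entire string to match the pattern (the helper name is just for illustration). Note that it only checks the YYYY-MM-DD shape, not calendar validity:

```python
import re

# Hypothetical helper: checks the YYYY-MM-DD shape only
def is_iso_date(value):
    return bool(re.fullmatch(r'\d{4}-\d{2}-\d{2}', value))

print(is_iso_date("2024-01-15"))  # True
print(is_iso_date("15/01/2024"))  # False
```

fullmatch avoids a subtle pitfall of match(), which only anchors at the start of the string.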

 

5. Split Strings by Patterns

 

Sometimes you may want to split a string into multiple strings based on patterns or the occurrence of specific separators. You can use the split() function to do that.

Let’s split the text string into sentences:

text = "This is sentence one. And this is sentence two! Is this sentence three?"
sentences = re.split(r'[.!?]', text)
print(sentences) 

 

Here, re.split(pattern, string) splits the string at all occurrences of the pattern. We use the r'[.!?]' pattern to match periods, exclamation marks, or question marks:

Output >>> ['This is sentence one', ' And this is sentence two', ' Is this sentence three', '']
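If you need to keep the punctuation, wrapping the pattern in a capturing group makes re.split include the delimiters in the result:

```python
import re

text = "This is sentence one. And this is sentence two! Is this sentence three?"

# A capturing group in the pattern makes re.split keep the delimiters
parts = re.split(r'([.!?])', text)
print(parts)
# ['This is sentence one', '.', ' And this is sentence two', '!', ' Is this sentence three', '?', '']
```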

 

Clean pandas DataFrames with Regex

 

Combining regex with pandas allows you to clean DataFrames efficiently.

To remove non-alphabetic characters from names and validate email addresses in a data frame:

import re
import pandas as pd

data = {
	'names': ['Alice123', 'Bob!@#', 'Charlie$$$'],
	'emails': ['alice@example.com', 'bob_at_example.com', 'charlie@example.com']
}
df = pd.DataFrame(data)

# Remove non-alphabetic characters from names
df['names'] = df['names'].str.replace(r'[^a-zA-Z]', '', regex=True)

# Validate email addresses
df['valid_email'] = df['emails'].apply(lambda x: bool(re.match(r'^\b[\w.-]+?@\w+?\.\w+?\b$', x)))

print(df)

 

In the above code snippet:

  • df['names'].str.replace(pattern, replacement, regex=True) replaces occurrences of the pattern in the series.
  • lambda x: bool(re.match(pattern, x)): This lambda function applies the regex match and converts the result to a boolean.

 

The output is as shown:

     names               emails  valid_email
0    Alice    alice@example.com         True
1      Bob   bob_at_example.com        False
2  Charlie  charlie@example.com         True
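Beyond replacing and validating, you can also pull substructures out of a column with str.extract. A small sketch extracting the domain part of each email address (rows without an @ become NaN):

```python
import pandas as pd

df = pd.DataFrame({'emails': ['alice@example.com', 'bob_at_example.com']})

# str.extract returns the first match of the capture group per row;
# expand=False gives a Series, and rows without a match become NaN
df['domain'] = df['emails'].str.extract(r'@([\w.-]+)', expand=False)
print(df)
```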

 

Wrapping Up

 

I hope you found this tutorial helpful. Let’s review what we’ve learned:

  • Use re.sub to remove unnecessary characters, such as dashes and parentheses in phone numbers and the like.
  • Use re.findall to extract specific patterns from text.
  • Use re.sub to replace patterns, such as converting multiple spaces into a single space.
  • Validate data formats with re.match to ensure data adheres to specific formats, like validating email addresses.
  • To split strings based on patterns, apply re.split.

In practice, you’ll combine regex with pandas for efficient cleaning of text fields in data frames. It’s also good practice to comment your regex patterns to explain their purpose, improving readability and maintainability. To learn more about data cleaning with pandas, read 7 Steps to Mastering Data Cleaning with Python and Pandas.

 
 

Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she’s working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.




