Data Science Posts and Resources

Articles on Data Science

Python - Regular Expressions

Introduction to Regular Expressions

Laxmi K Soni

REGULAR EXPRESSIONS

In programming, you will oftentimes have to deal with long texts from which we want to extract specific information. Also, when we want to process certain inputs, we need to check for a specific pattern. For example, think about emails. They need to have some text, followed by an @ character, then again some text and finally a dot and again some little text.

In order to make the validations easier, more efficient and more compact, we use so-called regular expressions .

The topic of regular expressions is very huge and you could write a whole book only about it. This is why we are not going to focus too much on the various placeholders and patterns of the expressions themselves but on the implementation of RegEx in Python. So in order to confuse you right in the beginning, let’s look at a regular expression that checks if the format of an email-address is valid.

^[a-zA-Z0-9.!#$%&'*+/=?^_{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$

Now you can see why this is a huge field to learn. We are going to focus on quite simple examples and how to properly implement them in Python.

IDENTIFIER

Let’s get started with some basic knowledge first. So-called identifiers define what kind of character should be at a certain place. Here you have some examples:

| |REGEX IDENTIFIERS |                                      |
|--------------------|--------------------------------------|
| IDENTIFIER         | DESCRIPTION                          |
| \d                 | Some digit                           |
| \D                 | Everything BUT a digit               |
| \s                 | White space                          |
| \S                 | Everything BUT a white space         |
| \w                 | Some letter                          |
| \W                 | Everything BUT a letter              |
| .                  | Every character except for new lines |
| \b                 | White spaces around a word           |
| \.                 | A dot                                |

MODIFIER

The modifiers extend the regular expressions and the identifiers. They might be seen as some kind of operator for regular expressions.

| REGEX MODIFIERS |                                            |
|-----------------|--------------------------------------------|
| MODIFIER        | DESCRIPTION                                |
| {x,y}           | A number that has a length between x and y |
| +               | At least one                               |
| ?               | None or one                                |
| *               | Everything                                 |
| $| At the end of a string | | ^ | At the beginning of a string | | | | Either Or | | | Example: x | y = either x or y | | [] | Value range | | {x} | x times | | {x,y} | x to y times | ESCAPE CHARACTERS | REGEX ESCAPE CHARATCERS | | |-------------------------|-------------| | CHARACTER | DESCRIPTION | | \n | New Line | | \t | Tab | | \s | White Space | APPLYING REGULAR EXPRESSIONS FINDING STRINGS In order to apply these regular expressions in Python, we need to import the module re . import re Now we can start by trying to find some patterns in our strings. text = ''' Mike is 20 years old and George is 29! My grandma is even 104 years old! ''' ages = re.findall( r'\d{1,3}' , text) print (ages) ## ['20', '29', '104'] In this example, we have a text with three ages in it. What we want to do is to filter these out and print them separately. As you can see, we use the function findall in order to apply the regular expression onto our string. In this case, we are looking for numbers that are one to three digits long. Notice that we are using an r character before we write our expression. This indicates that the given string is a regular expression. At the end, we print our result and get the following output: [‘20’, ‘29’, ‘104’] MATCHING STRINGS What we can also do is to check if a string matches a certain regular expression. For example, we can apply our regular expression for mails here. import re text = 'test@mail.com' result = re.fullmatch( r'^[a-zA-Z0-9.!#$%&'*+/=?^_{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*\$' , text)

if result != None :
print ( 'VALID!' )
else :
print ( 'INVALID!' )

We are not going to talk about the regular expression itself here. It is very long and complicated. But what we see here is a new function called fullmatch . This function returns the checked string if it matches the regular expression. In this case, this happens when the string has a valid mail format.

If the expression doesn’t match the string, the function returns None . In our example above, we get the message “VALID!” since the expression is met. If we enter something like “Hello World!”, we will get the other message.

MANIPULATING STRINGS

Finally, we are going to take a look at manipulating strings with regular expressions. By using the function sub we can replace all the parts of a string that match the expression by something else.

import re
text = '''
Mike is 20 years old and George is 29!
My grandma is even 104 years old!
'''
text = re.sub( r'\d{1,3}' , '100' , text)
print (text)
##
## Mike is 100 years old and George is 100!
## My grandma is even 100 years old!

In this example, we replace all ages by 100 . This is what gets printed: Mike is 100 years old and George is 100! My grandma is even 100 years old!

These are the basic functions that we can operate with in Python when dealing with regular expressions. If you want to learn more about regular expressions just google and you will find a lot of guides. Play around with the identifiers and modifiers a little bit until you feel like you understand how they work.

Nothing yet.