Regular Expressions in Python

In the last session we saw how to read files and search for text by line number, word number, column number or by using “find” to search for specific text. This is all great, but it is not very flexible. For example, imagine searching for all surnames and titles from the below text…

Dear Mr. Johnson, 
  Dear Miss. Jameson,
Dear   Ms.   Jackson, 
Dear Mrs.    Peterson, 
  Dear    Mr. Sampson
Dear Dr.Johanson,
Dear Rev Richardson,

How would you go about trying to write a program that can do this?

Searching and extracting text from files is remarkably complicated. Fortunately, computer scientists have solved this problem. The solution has been adopted by nearly all programming languages. The solution is to use what are called “regular expressions”.

Regular Expressions in Python

Regular expressions can look scary, but are pretty simple once you understand the rules. The syntax was popularised by the Perl language, and nearly all programming languages now support some form of “Perl Compatible Regular Expressions” (PCRE). Python provides the “re” module, which supports most of the PCRE syntax. Let’s take a look using “re”

$ ipython
$ import re

We import the “re” module using the “import” statement. You can get help using “help”

$ help(re)
Help on module re:

NAME
    re - Support for regular expressions (RE).

FILE
    /System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py

MODULE DOCS
    http://docs.python.org/library/re

DESCRIPTION
    This module provides regular expression matching operations similar to
    those found in Perl.  It supports both 8-bit and Unicode strings; both
    the pattern and the strings being processed can contain null bytes and
    characters outside the US ASCII range.

Regular expressions can be used for three things: searching, pattern extraction and replacing.

Regular Expression Searching

Searching is when you want to look for some text in a file.

$ ipython
$ import re
$ lines = open("textfile", "r").readlines()
$ for line in lines:
$     if re.search(r"dream", line):
$         print line,

For in that sleep of death, what dreams may come,

“re.search” is used to search, in this case for the string “dream” in the string “line”. If the text is found, re.search returns a match object, which counts as true in an “if” test; otherwise it returns None, which counts as false. Note that we put an “r” in front of the search string. This tells Python that this is a raw string, in which backslashes are left exactly as typed (more about this later).
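
To see what re.search actually hands back, here is a minimal sketch. It assumes a hypothetical three-line sample (copied from the text used in this session) standing in for “textfile”, and it uses the print(...) form so that it also runs on Python 3.

```python
import re

# Hypothetical sample standing in for the contents of "textfile"
lines = [
    "To be, or not to be, that is the question:\n",
    "To sleep, perchance to Dream; Aye, there's the rub,\n",
    "For in that sleep of death, what dreams may come,\n",
]

hit = re.search(r"dream", lines[2])   # the pattern is present in this line
miss = re.search(r"dream", lines[0])  # the pattern is absent from this line

print(hit is not None)  # True: a match object was returned
print(miss is None)     # True: nothing matched, so None was returned
print(hit.group(0))     # the text that matched
print(hit.start())      # index in the line where the match begins
```

Because a match object counts as true and None counts as false, the result can be used directly in an “if” test.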

The above was a simple, case-sensitive regular expression search. To perform a case-insensitive search, you use re.IGNORECASE,

$ for line in lines:
$     if re.search(r"dream", line, re.IGNORECASE):
$         print line,

To sleep, perchance to Dream; Aye, there's the rub,
For in that sleep of death, what dreams may come,
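
The effect of re.IGNORECASE can be checked on a single line. This is a small sketch using one line of the sample text; the variable names are only illustrative.

```python
import re

line = "To sleep, perchance to Dream; Aye, there's the rub,\n"

# Without the flag, "dream" does not match "Dream"
case_sensitive = re.search(r"dream", line)

# With the flag, upper and lower case letters are treated as equal
case_insensitive = re.search(r"dream", line, re.IGNORECASE)

print(case_sensitive)             # None
print(case_insensitive.group(0))  # Dream
```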

So far, this is the same as line.find(). Regular expressions are powerful because they provide a sub-language to control the search. Let’s say you want to find all lines containing “the” as a word. You can do that using the special sequence “\s”, which means “any whitespace character” (space, tab or newline), e.g.

$ for line in lines:
$     if re.search(r"\sthe\s", line):
$         print line,

To be, or not to be, that is the question:
Whether 'tis Nobler in the mind to suffer
The Heart-ache, and the thousand Natural shocks
To sleep, perchance to Dream; Aye, there's the rub,
Must give us pause. There's the respect
For who would bear the Whips and Scorns of time,
The Oppressor's wrong, the proud man's Contumely,
The pangs of despised Love, the Law’s delay,
The insolence of Office, and the Spurns
That patient merit of the unworthy takes,
But that the dread of something after death,
No Traveller returns, Puzzles the will,
And thus the Native hue of Resolution
Is sicklied o'er, with the pale cast of Thought,
And lose the name of Action. Soft you now,

Now, let’s search for all lines where “the” is followed by further letters, i.e. where it is the start or middle of a longer word. We can do this by using “\w”, which means “any word character” (a letter, digit or underscore), e.g.

$ for line in lines:
$     if re.search(r"the\w", line):
$         print line,

Whether 'tis Nobler in the mind to suffer
And by opposing end them: to die, to sleep
To sleep, perchance to Dream; Aye, there's the rub,
And makes us rather bear those ills we have,
Than fly to others that we know not of.
With this regard their Currents turn awry,

Combining these, we can find lines containing words that start with “the”

$ for line in lines:
$     if re.search(r"\sthe\w", line):
$         print line,

And by opposing end them: to die, to sleep
To sleep, perchance to Dream; Aye, there's the rub,
With this regard their Currents turn awry,
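
The three patterns used so far can be compared side by side. The sketch below assumes a hypothetical three-line sample taken from the text, and a small helper function (matching, not part of the re module) that reports which lines each pattern matches.

```python
import re

# Hypothetical sample standing in for the contents of "textfile"
lines = [
    "To be, or not to be, that is the question:\n",
    "Whether 'tis Nobler in the mind to suffer\n",
    "And by opposing end them: to die, to sleep\n",
]

def matching(pattern):
    """Return the indexes of the sample lines that the pattern matches."""
    return [i for i, line in enumerate(lines) if re.search(pattern, line)]

print(matching(r"\sthe\s"))  # [0, 1]: "the" with whitespace on both sides
print(matching(r"the\w"))    # [1, 2]: "Whether" and "them"
print(matching(r"\sthe\w"))  # [2]: only "them" starts a word with "the"
```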

There are a lot of special characters. They are

  • \d Match any digit (number)
  • \s Match any whitespace character (space, tab or newline)
  • \w Match any word character (alphanumeric and “_”)
  • \S Match any non-whitespace character
  • \D Match any non-digit character
  • . Match any character
  • \t Match a tab
  • \n Match a newline
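
As a quick check of the special characters listed above, the following sketch runs each one against a short, made-up string and records the first piece of text it matches.

```python
import re

# Hypothetical sample text containing a tab and a newline
text = "Room 42\tis booked\n"

patterns = [
    r"\d",        # first digit
    r"\d\d",      # two digits in a row
    r"\w\w\w\w",  # four word characters in a row
    r"\s\d",      # whitespace followed by a digit
    r"\D",        # first non-digit
    r"\S",        # first non-whitespace character
    r"2\ti",      # "2", a tab, then "i"
    r"d\n",       # "d" followed by a newline
    r"R..m",      # "R", any two characters, then "m"
]

# For each pattern, record the first piece of text it matches
results = [re.search(p, text).group(0) for p in patterns]

for p, found in zip(patterns, results):
    print(p + " matched " + repr(found))
```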

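Note that the backslash is a special character in ordinary Python strings: sequences such as “\t”, “\n” and “\b” are converted into single characters (escaped) before the regular expression is ever seen. The “r” in front of the string tells Python to leave the backslashes alone. Some patterns happen to work without the “r”, but others will silently match the wrong thing, so you should always include it.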
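
The difference the “r” makes is easiest to see with “\b”. In an ordinary Python string this is the backspace character, whereas in a regular expression it means “word boundary” (a sequence not covered in the list above, used here only because it is the classic case). A minimal sketch:

```python
import re

plain = "\bthe\b"  # Python turns each "\b" into a single backspace character
raw = r"\bthe\b"   # backslashes are kept, so re sees two word boundaries

print(len(plain))  # 5: backspace, t, h, e, backspace
print(len(raw))    # 7: both backslashes are still there

line = "that is the question"
plain_result = re.search(plain, line)  # looks for literal backspaces
raw_result = re.search(raw, line)      # looks for the whole word "the"

print(plain_result)         # None: the line contains no backspaces
print(raw_result.group(0))  # the
```

Note that the non-raw pattern does not raise an error. It simply fails to match, which is why this mistake can be hard to spot.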

As well as matching single characters, you can match collections of characters. For example, to match “th” followed by a, i or y, you would use square brackets,

$ for line in lines:
$     if re.search(r"th[aiy]", line):
$         print line,

To be, or not to be, that is the question:
For in that sleep of death, what dreams may come,
When we have shuffled off this mortal coil,
That patient merit of the unworthy takes,
But that the dread of something after death,
Than fly to others that we know not of.
With this regard their Currents turn awry,
The fair Ophelia? Nymph, in thy Orisons

Other examples of square brackets are

  • [abc] Match a, b or c
  • [a-z] Match any character from a to z
  • [A-Z] Match any character from A to Z
  • [a-zA-Z] Match any character from a to z and A to Z (any letter)
  • [0-9] Match any digit
  • [02468] Match any even digit
  • [^0-9] Match any character that is NOT a digit (^ as the first character inside the brackets means NOT)
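
The square bracket forms can be tried out on a few short strings. This sketch uses a made-up word list; the variable names are illustrative only.

```python
import re

samples = ["that", "this", "thy", "the", "thou"]

# "th" followed by a, i or y
with_aiy = [w for w in samples if re.search(r"th[aiy]", w)]

# "th" followed by any character EXCEPT a, i or y
without_aiy = [w for w in samples if re.search(r"th[^aiy]", w)]

# An even digit followed by a lowercase letter
even_then_letter = re.search(r"[02468][a-z]", "7x 8y 9z").group(0)

print(with_aiy)          # ['that', 'this', 'thy']
print(without_aiy)       # ['the', 'thou']
print(even_then_letter)  # 8y
```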

You can also use repetition in your matching.

  • * Match 0 or more times, e.g. \w* means match 0 or more word characters
  • + Match 1 or more times, e.g. \w+ means match 1 or more word characters
  • ? Match 0 or 1 times, e.g. \w? means match 0 or 1 word characters
  • {n} Match exactly n times, e.g. \w{3} means match exactly 3 word characters
  • {n,} Match at least n times, e.g. \w{5,} means match at least 5 word characters
  • {m,n} Match between m and n times, e.g. \w{5,7} means match 5-7 word characters
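
The repetition rules above can be checked against a short, made-up string in which each “word” has a different length. The sketch records the first text that each pattern matches.

```python
import re

text = "a bb ccc dddd"

patterns = [
    r"a\w*",     # "a" then 0 or more word characters
    r"\w+",      # 1 or more word characters
    r"b\w?",     # "b" then 0 or 1 word characters
    r"\w{3}",    # exactly 3 word characters
    r"\w{4,}",   # at least 4 word characters
    r"\w{2,3}",  # between 2 and 3 word characters
]

# For each pattern, record the first piece of text it matches
results = [re.search(p, text).group(0) for p in patterns]

for p, found in zip(patterns, results):
    print(p + " matched " + repr(found))
```

The results are “a”, “a”, “bb”, “ccc”, “dddd” and “bb”: re.search always reports the leftmost place where the pattern can match.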

We can use this to find all lines that contain words with 10-12 characters. Note that longer words match too, because their first 12 characters satisfy the pattern.

$ for line in lines:
$     if re.search(r"\w{10,12}", line):
$         print line,

The Slings and Arrows of outrageous Fortune,
That Flesh is heir to? 'Tis a consummation
The undiscovered Country, from whose bourn
Thus Conscience does make Cowards of us all,
And thus the Native hue of Resolution
And enterprises of great pitch and moment,
Be all my sins remembered.
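
A range of repetitions is written with a comma inside the curly brackets. A common slip is to use a hyphen instead, which Python does not treat as a repetition at all. The sketch below shows the difference on one line of the text, plus a made-up long word to show that longer words still match.

```python
import re

line = "The undiscovered Country, from whose bourn\n"

correct = re.search(r"\w{10,12}", line)  # comma: 10 to 12 repetitions
wrong = re.search(r"\w{10-12}", line)    # hyphen: "{10-12}" is taken as literal text

print(correct.group(0))  # undiscovered
print(wrong)             # None

# A word longer than 12 characters still matches on its first 12 characters
longer = re.search(r"\w{10,12}", "incomprehensibilities").group(0)
print(longer)            # incomprehens
```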

Finally, the match can be anchored to a position in the line. To match only at the beginning of the line use a caret

$ for line in lines:
$     if re.search(r"^the\s", line, re.IGNORECASE):
$         print line,

The Slings and Arrows of outrageous Fortune,
The Heart-ache, and the thousand Natural shocks
The Oppressor's wrong, the proud man's Contumely,
The pangs of despised Love, the Law’s delay,
The insolence of Office, and the Spurns
The undiscovered Country, from whose bourn
The fair Ophelia? Nymph, in thy Orisons

To match at the end of the line, use a dollar

$ for line in lines:
$     if re.search(r"on$", line):
$         print line,

That Flesh is heir to? 'Tis a consummation
And thus the Native hue of Resolution
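
Both anchors can be tested together. This sketch assumes two sample lines from the text, each still ending in the newline that readlines() leaves in place; the dollar matches just before that final newline.

```python
import re

# Hypothetical sample standing in for the contents of "textfile"
lines = [
    "The Slings and Arrows of outrageous Fortune,\n",
    "And thus the Native hue of Resolution\n",
]

results = []
for line in lines:
    starts_with_the = re.search(r"^the\s", line, re.IGNORECASE) is not None
    ends_with_on = re.search(r"on$", line) is not None
    results.append((starts_with_the, ends_with_on))

print(results)  # [(True, False), (False, True)]
```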

Pattern extraction

Searching is great, but substring matching is the real power of regular expressions. You can group parts of the regular expression to let you extract the matching part of the string. You do this using round brackets.

$ line = lines[0]
$ print line

To be, or not to be, that is the question:

$ m = re.search(r"the\s(\w+)", line)

This matches “the” followed by a whitespace character, followed by 1 or more word characters. The returned object, m, contains information about the match.

$ print m.group(0)
the question

m.group(0) returns the entire matched substring, in this case “the question”. However, we put “\w+” into parentheses, and so this part is available as a sub-match, in m.group(1)

$ print m.group(1)
question

If we had added extra groups, these would be available as m.group(2), m.group(3) etc., e.g.

$ m = re.search(r"to (\w+), or not (\w+) (\w+)", line, re.IGNORECASE)
$ print m.group(0)

To be, or not to be

$ print m.group(1)

be

$ print m.group(2)

to

$ print m.group(3)

be
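
Two practical points about groups are sketched below, using the first line of the text. The groups() method of a match object returns every numbered group at once, and re.search returns None when nothing matches, so it is safest to test the result before asking for a group. The pattern itself is just an illustration.

```python
import re

line = "To be, or not to be, that is the question:\n"

# Three words separated by single whitespace characters, ending in a colon
m = re.search(r"(\w+)\s(\w+)\s(\w+):", line)
if m:                  # only use the groups if something matched
    print(m.group(0))  # is the question:
    print(m.groups())  # ('is', 'the', 'question')

# Nothing in the line matches this pattern
no_match = re.search(r"(\w+) banana", line)
print(no_match)        # None, so no_match.group(1) would raise an error
```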

For example, we could use this to extract all of the words that follow “the” in the text, e.g.

$ for line in lines:
$     m = re.search(r"\sthe\s(\w+)", line, re.IGNORECASE)
$     if m:
$         print line,
$         print m.group(1)

To be, or not to be, that is the question:
question
Whether 'tis Nobler in the mind to suffer
mind
The Heart-ache, and the thousand Natural shocks
thousand
To sleep, perchance to Dream; Aye, there's the rub,
rub
Must give us pause. There's the respect
respect
For who would bear the Whips and Scorns of time,
Whips
The Oppressor's wrong, the proud man's Contumely,
proud
The pangs of despised Love, the Law’s delay,
Law
The insolence of Office, and the Spurns
Spurns
That patient merit of the unworthy takes,
unworthy
But that the dread of something after death,
dread

Exercise

Exercise 2a

Here is the list of surnames from above. Can you write a regular expression that will match each line, extracting the title and surname for each person? The names are in a file called greetings.txt.

Dear Mr. Johnson, 
  Dear Miss. Jameson,
Dear   Ms.   Jackson, 
Dear Mrs.    Peterson, 
  Dear    Mr. Sampson
Dear Dr.Johanson,
Dear Rev Richardson,

Note that you can match the “.” character using “\.”, e.g. to match Dr. use re.search(r"Dr\.", line)

If you get stuck, an example output is here

Pattern Replacing

As well as using regular expressions for searching for text, you can also use them to replace text. You do this using re.sub

$ line = lines[0]
$ print line

To be, or not to be, that is the question:

$ line = re.sub(r"be", "code", line)
$ print line

To code, or not to code, that is the question:

As you can see, every match is replaced by “code”. We can limit the number of replacements to “n” by passing that in as the extra “count” argument

$ line = lines[0]
$ line = re.sub(r"be", "code", line, count=1)
$ print line

To code, or not to be, that is the question:

In this case, we only replace “1” time, hence only the first match is replaced.
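
The effect of the count argument can be compared directly. This is a small sketch using the first line of the text; count is passed by keyword here to make its meaning explicit.

```python
import re

line = "To be, or not to be, that is the question:\n"

replace_all = re.sub(r"be", "code", line)             # default: every match
replace_one = re.sub(r"be", "code", line, count=1)    # only the first match
replace_two = re.sub(r"be", "code", line, count=2)    # at most two matches

print(replace_all)
print(replace_one)
print(replace_two)
```

Since the line contains “be” only twice, replace_two gives the same result as replace_all.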

We can add some logic to the replacement, e.g. replace “be” or “question” with “code”

$ line = lines[0]
$ line = re.sub(r"be|question", "code", line)
$ print line

To code, or not to code, that is the code:
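
The “|” means “or”, and it works in searches as well as in replacements. One detail worth knowing, sketched below on the first line of the text, is that the match found is the leftmost one in the line, whichever alternative is written first in the pattern.

```python
import re

line = "To be, or not to be, that is the question:\n"

# "question" is listed first, but "be" occurs earlier in the line
first = re.search(r"question|be", line)
print(first.group(0))  # be

# In a replacement, every occurrence of either alternative is replaced
replaced = re.sub(r"be|question", "code", line)
print(replaced)
```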

If you want to do a case-insensitive replacement, you can compile the pattern first with re.IGNORECASE (in Python 2.7 and later you can instead pass flags=re.IGNORECASE to re.sub), e.g.

$ line = lines[0]
$ line = re.sub( re.compile(r"to be", re.IGNORECASE), "ice-cream", line )
$ print line

ice-cream, or not ice-cream, that is the question:
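
There are two ways to make a replacement case-insensitive, sketched below on the first line of the text: compiling the pattern first, or (assuming Python 2.7 or later) passing flags by keyword to re.sub. The flag must be passed as flags=..., because a fourth positional argument to re.sub is taken as the count.

```python
import re

line = "To be, or not to be, that is the question:\n"

# Option 1: compile the pattern with the flag, then call sub on it
pattern = re.compile(r"to be", re.IGNORECASE)
via_compile = pattern.sub("ice-cream", line)

# Option 2: pass the flag by keyword to re.sub
via_flags = re.sub(r"to be", "ice-cream", line, flags=re.IGNORECASE)

print(via_compile)
print(via_flags)
```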

You can also nest re.sub calls together if you want to perform multiple substitutions

$ line = lines[0]
$ line = re.sub( re.compile(r"to", re.IGNORECASE), "go", re.sub(r"be", "home", line) )
$ print line

go home, or not go home, that is the question:
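
Nested calls are evaluated from the inside out, which can be hard to read. As an alternative sketch, the same two substitutions can be written as separate steps; the names step1 and step2 are only illustrative, and flags=re.IGNORECASE assumes Python 2.7 or later.

```python
import re

line = "To be, or not to be, that is the question:\n"

# The inner substitution from the nested call runs first
step1 = re.sub(r"be", "home", line)

# The outer substitution is then applied to its result
step2 = re.sub(r"to", "go", step1, flags=re.IGNORECASE)

print(step1)
print(step2)
```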

Exercise

Exercise 2b

Find all words that follow “the” in “textfile” and replace them with “banana”.

If you get stuck, take a look at the example output here

Health Warning

Regular expressions are very powerful. You can use them to search for specific output from your programs and to do powerful text manipulation. However, as you have seen, they are very “write-only”. They are extremely difficult for non-experts to understand, and complex regular expressions can be difficult even for your future self to understand (i.e. “what was I thinking when I wrote that last year? What does it mean and what does it do?”). You should ALWAYS comment your regular expressions and explain in English exactly what you intended to match when you wrote them. Once you have memorised the rules, you will find regular expressions are easy to read and use, and extremely powerful. However, without comments, they will be completely unintelligible to everyone else who looks at or relies on your code.
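
One possible way to follow this advice is sketched below. As well as an ordinary comment above the pattern, the re module offers the re.VERBOSE flag, which lets you spread a pattern over several lines and put a comment next to each part. The pattern and the variable name word_after_the are only an illustration.

```python
import re

# Find the word "the" (any case) and capture the word that follows it.
word_after_the = re.compile(r"""
    \sthe\s   # the word "the" with whitespace on both sides
    (\w+)     # group 1: the word that follows
    """, re.VERBOSE | re.IGNORECASE)

m = word_after_the.search("To be, or not to be, that is the question:\n")
print(m.group(1))  # question
```

With re.VERBOSE, whitespace inside the pattern is ignored, which is why the spaces in the pattern are written as “\s”.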