Cleaning up files and making CSVs with regular expressions
What we want to do
From my point of view, the real magic of regular expressions is using them in your text editor to clean up files and prepare them as CSVs. Let’s say I have a file of people introducing themselves and lying about how tall they are.
My name is Jeremy and I am 40.3cm tall
My name is Patricia and I am 900cm tall
My name is Anton and I am 3cm tall
But we want this to be a CSV file that looks like this:
Jeremy,40.3
Patricia,900
Anton,3
While I know we use Python for everything, I find a lot of this easier in our text editor. Let’s use Find and Replace in our editor (Atom or Sublime or anything else!) to make this happen.
Opening up Replace in your text editor
First you need to open up Find and Replace in your text editor, and make sure regular expressions are turned on. It’s usually a little icon that looks like .* (a period followed by an asterisk).
Pretty simple so far!
Building our matcher
We’re going to need to CAPTURE the important parts, and then insert them later. Let’s make a fill-in-the-blanks for what we want:
My name is ______ and I am ____cm tall
You could match a thousand ways, but here’s an example. You can play around with different versions by watching what your text editor highlights!
We ended up using this one:
My name is .* and I am [\d.]+cm tall
In this case, the regex is matching the following:
My name is - literally that phrase
.* - anything!
and I am - literally that phrase
[\d.]+ - at least one digit or period, so it matches both 40.3 and 900
cm tall - literally that phrase
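If you'd rather check the matcher outside your editor, here's a minimal Python sketch of the same idea. It uses the standard `re` module and the three sample lines from above; the fourth line is made up just to show something that shouldn't match. Treat it as a quick sanity check, not the only way to test a pattern.

```python
import re

# Same pattern as above, compiled once so we can reuse it.
pattern = re.compile(r"My name is .* and I am [\d.]+cm tall")

lines = [
    "My name is Jeremy and I am 40.3cm tall",
    "My name is Patricia and I am 900cm tall",
    "My name is Anton and I am 3cm tall",
    "I'd rather not say how tall I am",  # made-up line that should NOT match
]

# fullmatch() only succeeds when the whole line fits the pattern,
# which is roughly what a fully highlighted line means in the editor.
matched = [line for line in lines if pattern.fullmatch(line)]

for line in lines:
    status = "match" if pattern.fullmatch(line) else "no match"
    print(f"{status:>8}: {line}")
```

The first three lines should print as matches and the made-up one should not.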
Building the replacer
To rearrange the important parts - the parts we’re matching - we need to use CAPTURE GROUPS. This means using parentheses to say “hey, we want to use these later!”
My name is (.*) and I am ([\d.]+)cm tall
Now instead of losing those parts forever when we replace, we get to use them, too! We can use them by typing \1 and \2 in the replacement box, and Replace will automatically fill in the first and second capture groups. (Some editors write these as $1 and $2 instead.)
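To see what the capture groups actually hold, here's a small Python sketch using the standard `re` module. The variable names `name` and `height` are my own labels for groups 1 and 2, not something the editor knows about.

```python
import re

# The parentheses mark the two pieces we want to keep.
pattern = re.compile(r"My name is (.*) and I am ([\d.]+)cm tall")

match = pattern.fullmatch("My name is Jeremy and I am 40.3cm tall")

# groups() returns the captured pieces in order: group 1, then group 2.
name, height = match.groups()

print(name)    # Jeremy
print(height)  # 40.3
```

Group 1 is whatever `.*` matched and group 2 is whatever `[\d.]+` matched, which is exactly what \1 and \2 refer to in the replacement.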
When you’re trying to save things as a CSV, you normally just separate each captured group with a comma (and sometimes throw quotes around it). In this case we’re going to do exactly that - we’re going to do a fill-in-the-blanks with "___","____", and use \1 and \2 to fill them in. (Leave the quotes out and use \1,\2 if you want the unquoted version shown at the top.) Our replacement is:
"\1","\2"