Regex in Excel

Download the demo workbook to follow along. You should have QueryStorm installed, but a paid license is not necessary.

If you’re reading this article, chances are you’re the type of person who enjoys an occasional expression of the regular kind. If so, I’m sure you’ll find this article interesting, regardless of how much you use Excel.

Aside from the ability to write SQL and C# code in Excel, QueryStorm also introduces some nifty ways to run regex operations in Excel.

Regex replace in (table) columns

Suppose you have an Excel file that contains data about products and you’d like import this data into a database. The data was entered by someone particularly careless about their punctuation. For example, some times they forget to put a full stop at the end of product descriptions.

img1

Entering product data is not really an exciting job, so who can blame them. Still, it’s now up to us to fix the data. No problem; with a bit of regex-fu, we can fix this easily! Here’s how to do it with QueryStorm. Select any cell in the target column, and open the Replace in table column quick-query.

Note that this the data must be in an explicitly defined Excel table for this quick-query to be available.

Replace inside table column.

We’ll search for descriptions where the last character is a letter or a digit: (\w)$. We need to capture the last character (via parentheses) so we can use it in the replacement string (alternatively we could have used a look-behind pattern). We’re going to replace the character we matched with itself followed by a full stop ($1.). After clicking Run transformation, we can confirm that the dots have been added where they were missing.

Fixed text.

Well, that was easy. Our regex skills have once again paid off and the crisis has been deactivated. Except, that wasn’t the only mistake our careless data entry person made: multiple spaces between words, spacing before punctuation, no spacing after punctuation… We’ll need to do multiple replacements on this column. While the UI for the quick-query doesn’t (yet) allow entering multiple pattern/replacement pairs, we can easily do so in code. The RegexReplace function can take an arbitrary number of pattern/replacement pairs. Here’s our final query that takes care of all the mentioned issues:

SELECT
    RegexReplace(Description, 
        '(\w)$', '$1.',  -- add a full stop at the end of the description (if it's not already there)
        '\s{2,}', ' ',   -- replace multiple spaces with a single space
        '\s*([.,;:!?])\s*', '$1 ' --ensure no spaces before punctuation, one space after it
    ) as Description
FROM
    Table2;

Here’s the query in action:

Fixing other typos.

Once we’re happy with the results, we can fix the texts in the Excel table with an update statement:

UPDATE Table2 set Description = 
    RegexReplace(Description, 
        '(\w)$', '$1.',  -- add a full stop at the end if it's not there 
        '\s{2,}', ' ',   -- replace multiple spaces with a single space
        '\s*([.,;:!?])\s*', '$1 ' --ensure no spaces before punctuation, one space after it
        );

Text normalized.

Note: one thing we can’t fix with this function is character casing e.g. sentences that begin with a lowercase letter. We could do that with a line or two of C# – writing this I just realized that text normalization would make for a good quick-query, but I’ll leave that for another time.

Regex replace in selected cells

In the example above, all the texts we wanted to fix were in a table column. But what if we wanted to fix text in the entire sheet or in a scattered selection of cells? There’s a quick-query for that too!

Select the cells you want to fix, and open the Regex replace in selection quick-query.

Replace in selection menu.

We can run the transformation, or preview the SQL code and results, or modify the query as we see fit.

Note: Doing the regex replace operation this way (on selected cells rather than on table columns) is much slower, since the update is done one cell at a time and Excel is slow when working that way. Be careful when doing this on a large selection of cells because it can freeze your Excel for several minutes.

Splitting columns based on regex matches

Some times there are multiple kinds of information packed into a single column. For example, in the movies table (in the demo workbook), the director’s first and last name are together in one column. But what if we want to sort by last name? To easily do operations on the last name, we’ll need to extract it into its own column.

To split the full name into first name and last name, we can use the Regex split (columns) quick-query:

Splitting columns

In the pattern, we need to place the first name and last name sub-patterns into groups.

We’ll use the following pattern: (?'firstname'\w+)\s(?'lastname'\w+). For each named group in the pattern (in this case firstname and lastname), an extra column will be included in the results. The name of the new column will be the name of the group. Keep in mind, though, that only alphanumeric characters are allowed in group names. As before, we can run the transformation on the data in Excel – or – preview the query and results in QueryStorm.

Splitting columns results

Here’s what the entire procedure looks like:
img1

To be fair, this was a simple scenario, and one we could do with Excel’s string formulas alone, but I wanted to illustrate the procedure without getting too fancy with the example. A slightly more complicated example could be splitting URLs into protocol/domain/port/path, especially if some of the elements are optional (e.g. querystorm.com, www.querystorm.com, https://querystorm.com:80/api/download).

Splitting rows based on regex matches

Splitting columns usually makes sense when we have a fixed number of possible elements contained in the original column, and each element has a different meaning (e.g. first name and last name). Some times, though, a cell can contain an arbitrarily long list of elements of the same kind. For example, in the movies table, we have a column that contains a list of genres that each movie belongs to. This presents a problem: how do we, for example, find out the average rating for each genre? To do any meaningful analysis on genres, we have to split each row into multiple rows: one for each genre a movie belongs to.

To illustrate what I mean, consider the example in the image below. The data on the left side doesn’t allow us to do any meaningful analysis on genres. Once we split rows however, we can easily perform analysis based on genres.

Splitting columns

The table on the right was created using the Regex split (rows) quick-query:

Splitting rows

Splitting columns

We can write this result as a new table, but it’s also pretty easy to tweak the query so that we group by genre and calculate aggregates (e.g. average IMDB movie rating).

SELECT
    regex_split.value AS genre, avg(imdb_score)
FROM
    movies
    , regex_split(genres, '[\w-]+')
GROUP BY
    genre

Once we’ve got our results, we can write them into Excel. With a bit of formatting we can get a good sense of the characteristics of different genres.

Aggregate by genre

Conclusion

The ability to use regular expressions in Excel can come in very handy. Excel files often contain dirty data or data in a form that’s not suitable for analysis. When cleaning data, transforming it, extracting values from text or doing exploratory data analysis, regular expressions can be a valuable tool to get the job done in an easier and faster way. Even if you don’t use Excel all that often, with support for regular expressions, Excel becomes a useful scratchpad for fiddling with tabular data.

If you haven’t already, you can download the demo workbook and QueryStorm using the provided links.

Leave a Comment

Your email address will not be published. Required fields are marked *