June 17, 2020

Regular Expressions (RegEx) for Looker Studio & Analytics

In this post, I will explain how to use regular expressions (regex) in Google Looker Studio, Google Analytics, and Tag Manager. I will also share some advanced examples and use-cases. And last but not least, I will update this post whenever I stumble upon new ways to put regex to use.

This article is written with a focus on Google Looker Studio. However, the regex functions can also be applied in Google Analytics as well as Tag Manager.

Let’s commence.

Regex symbol meaning

The definition of regular expression according to Wikipedia is:

A regular expression, regex or regexp (sometimes called a rational expression) is a sequence of characters that define a search pattern.

There are many different types of regex used by many different programming languages and programs. Google supports a simple yet powerful variant of regex for Looker Studio (formerly Data Studio) and Analytics.

In the table below you find the symbols used in regular expressions. Click on the name of the symbol to jump to the detailed information.

Regex – Dot < . >

The dot matches any single character. Basically it’s a wildcard. For example:

  • ‘.age’ matches e.g. “page”, “rage” or “sage”
  • ‘l..k’ matches e.g. “look”, “link” or “lock”
  • ‘100.’ matches e.g. “1000”, “1001”, “1009”, “100a” or “100@”

The number of dots defines the number of characters. By combining the dot symbol with the asterisk symbol you can match any character and any number of characters.
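Want to try this yourself? Below is a minimal sketch in Python, using its built-in `re` module. Keep in mind that this is only a stand-in for testing: Google's tools run on their own regex engine, but simple patterns like these should behave the same. The sample strings are made up.

```python
import re

# The dot stands for exactly one character (any character except a line break).
samples = ["page", "rage", "age", "stage"]

# fullmatch() requires the whole string to fit the pattern.
dot_matches = [s for s in samples if re.fullmatch(r".age", s)]
print(dot_matches)  # ['page', 'rage']

# Two dots means exactly two characters between "l" and "k".
two_dots = [s for s in ["link", "leak", "lk"] if re.fullmatch(r"l..k", s)]
print(two_dots)  # ['link', 'leak']
```

“age” has no character in front of it and “stage” has two, so neither fits ‘.age’ as a whole.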

Regex – Question mark < ? >

The question mark sets the preceding character as optional. This can be useful for catching misspellings or to manipulate numbers:

  • ‘miss?pelling’ matches “misspelling” and “mispelling”
  • ‘101?’ matches “101” and “10”

A common use-case for the question mark symbol is in handling IP-addresses. Check it out in the section on handling IP-addresses below.

Regex – Plus < + >

The plus means the previous character can repeat itself 1 or more times. It’s very similar to the asterisk which means 0 or more times.

An example of a plus in action:

  • ‘10+’ matches e.g. “10”, “100”, “1000”, “10000” etc.
  • ‘ballo+n’ matches e.g. “balloon”, “ballooon”, “balloooon”, do I need to continue? 🙂

Regex – Asterisk < * >

The asterisk matches the preceding character 0 or more times. Like the plus symbol, but instead of requiring a minimum of 1 character, the asterisk also matches if the character before it is not part of the result at all.

So:

  • ‘10*’ matches e.g. “1”, “10”, “100”, “1000” etc.
  • ‘ballo*n’ matches e.g. “balln”, “ballon”, “balloon”, you got it right?

Awesome combo: Dot & Asterisk to match any character, any number of times.
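The question mark, plus and asterisk are easy to mix up, so here is a small side-by-side comparison. It is a sketch in Python's `re` module (my choice for testing, not what Google runs under the hood), and the helper `matches` plus the word list are made up for this example.

```python
import re

def matches(pattern, candidates):
    """Return the candidates that fit the pattern from start to end."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

words = ["balln", "ballon", "balloon", "ballooon"]

optional = matches(r"ballo?n", words)      # "o" zero or one time
one_or_more = matches(r"ballo+n", words)   # "o" one or more times
zero_or_more = matches(r"ballo*n", words)  # "o" zero or more times

print(optional)      # ['balln', 'ballon']
print(one_or_more)   # ['ballon', 'balloon', 'ballooon']
print(zero_or_more)  # ['balln', 'ballon', 'balloon', 'ballooon']
```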

Regex – Pipe < | >

The pipe symbol means OR. If you need to match more variations in one single query, the pipe will come in handy!

An example: ‘/gift/|/giftcard/’

Matches any URL which contains either /gift/ or /giftcard/.
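A quick way to check which URLs would get caught is the Python sketch below. It uses `re.search`, which looks for the pattern anywhere in the string (a "contains" match). The URLs are invented for the example.

```python
import re

pattern = re.compile(r"/gift/|/giftcard/")

urls = ["/gift/", "/shop/giftcard/", "/gifts/", "/giftcard"]

# search() returns a match if either alternative occurs anywhere in the URL.
hits = [u for u in urls if pattern.search(u)]
print(hits)  # ['/gift/', '/shop/giftcard/']
```

“/gifts/” and “/giftcard” fall through because neither alternative appears in them exactly, slashes included.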

Regex – Caret < ^ >

The caret symbol is used to define the beginning of any query. So if you start with a caret, the character(s) that come after only match if it’s the beginning of the result.

For example: ‘^/giftcard/’ only matches pages that begin with /giftcard/. If your website also contains URLs like /buy/giftcard/, these will not match the query.

Regex – Dollar < $ >

Like the caret but the other way around 🙂 The dollar sign defines the ending of any query. In other words, nothing comes after the dollar sign. Like when Gandalf puts his staff down, you ain’t getting past.

Example: ‘giftcard$’ matches ‘giftcard’, ‘buy-giftcard’ or ‘shop giftcard’. It does not match ‘giftcards’ or ‘/giftcard/’.
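The caret and the dollar sign can be checked together with a short Python sketch. Python's `re` module is only my test bench here, and the page paths and search terms are made up.

```python
import re

pages = ["/giftcard/", "/giftcard/blue", "/buy/giftcard/"]

# ^ pins the pattern to the start of the string.
starts_with = [p for p in pages if re.search(r"^/giftcard/", p)]
print(starts_with)  # ['/giftcard/', '/giftcard/blue']

terms = ["giftcard", "buy-giftcard", "shop giftcard", "giftcards", "/giftcard/"]

# $ pins the pattern to the end of the string.
ends_with = [t for t in terms if re.search(r"giftcard$", t)]
print(ends_with)  # ['giftcard', 'buy-giftcard', 'shop giftcard']
```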

Regex – Parentheses < ( ) >

Matches enclosed characters in exact order in a string. Parentheses are also used to group expressions.

‘AB|C’ = ‘AB’ or ‘C’
‘A(B|C)’ = ‘AB’ or ‘AC’

Parentheses are very handy in handling IP-addresses. Read all about it in the section on handling IP-addresses below.
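The difference between ‘AB|C’ and ‘A(B|C)’ is easy to verify with a few lines of Python. This is just a sketch with the `re` module, assuming the whole value has to match.

```python
import re

candidates = ["AB", "AC", "C"]

# Without parentheses the pipe splits the whole expression: "AB" or "C".
without_group = [c for c in candidates if re.fullmatch(r"AB|C", c)]
print(without_group)  # ['AB', 'C']

# With parentheses the pipe only applies inside the group: "A" then "B" or "C".
with_group = [c for c in candidates if re.fullmatch(r"A(B|C)", c)]
print(with_group)  # ['AB', 'AC']
```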

Regex – Square brackets < [ ] >

With square brackets you can create lists. Put multiple characters inside the square brackets and the expression will match 1 of those.

So for example ‘product[ABC]’ matches:

  • ‘productA’
  • ‘productB’
  • ‘productC’

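It does not match ‘productAB’ or ‘productAC’ when the whole value has to match the expression, because the brackets stand for exactly one character.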
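Whether ‘productAB’ matches depends on how the tool applies the expression: to the whole value, or to any part of it. As far as I know, Looker Studio offers both flavours (REGEXP_MATCH for the whole value, REGEXP_CONTAINS for a part of it). The Python sketch below mimics the two with `re.fullmatch` and `re.search`; treat it as an illustration, not as the exact behaviour of every Google tool.

```python
import re

pattern = r"product[ABC]"

# Whole-value match: the brackets stand for exactly one character.
full_a = bool(re.fullmatch(pattern, "productA"))
full_ab = bool(re.fullmatch(pattern, "productAB"))
print(full_a, full_ab)  # True False

# "Contains" match: "productA" is found inside "productAB".
contains_ab = bool(re.search(pattern, "productAB"))
print(contains_ab)  # True
```

If you need the strict behaviour in a "contains" setting, wrap the expression in a caret and dollar sign: ‘^product[ABC]$’.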

Regex – Curly brackets < { } >

Curly brackets (or braces) repeat the preceding character a specific number of times. For some reason, these brackets are not mentioned in any Google Analytics documentation. Curly brackets can contain 2 numbers, separated with a comma like so: {2,5}. The first number states the minimum number of times a character is repeated and the second number defines the maximum. The maximum is optional: ‘{2,}’ means 2 or more times, and a single number like ‘{2}’ means exactly 2 times.
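Since the curly brackets come without an example, here is a small Python sketch showing the three forms. The letter “a” and the sample strings are placeholders, and Python's `re` module is only used as a test bench.

```python
import re

samples = ["a", "aa", "aaaaa", "aaaaaa"]

# {2,5}: at least 2 and at most 5 repetitions.
between = [s for s in samples if re.fullmatch(r"a{2,5}", s)]
print(between)  # ['aa', 'aaaaa']

# {2,}: at least 2 repetitions, no maximum.
at_least = [s for s in samples if re.fullmatch(r"a{2,}", s)]
print(at_least)  # ['aa', 'aaaaa', 'aaaaaa']

# {2}: exactly 2 repetitions.
exactly = [s for s in samples if re.fullmatch(r"a{2}", s)]
print(exactly)  # ['aa']
```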

Regex – Dash/hyphen < - >

The dash symbol is used in combination with the square brackets. Put them together and you can create powerful lists. For example ‘[A-Z]’ means every capital letter in the alphabet. It works the same for digits: ‘[0-9]’.
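A range in square brackets still stands for a single character. The Python sketch below combines a letter range and a digit range; the two-character codes are invented for the example.

```python
import re

codes = ["A1", "Z9", "a1", "AA", "11"]

# One capital letter followed by one digit.
valid = [c for c in codes if re.fullmatch(r"[A-Z][0-9]", c)]
print(valid)  # ['A1', 'Z9']
```

Note that ‘[A-Z]’ is case-sensitive here, which is why “a1” does not match.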

Regex – Backslash < \ >

The backslash functions as an escape in a regex. Need to use a symbol from the list above but not as a regex function? Escape it using the backslash. For example, if you want to search for a string containing dots, like the IP-address 123.456.10.10, you can escape the function of the dot with the backslash like so: 123\.456\.10\.10
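Why bother escaping? Because an unescaped dot happily matches other characters too. The Python sketch below shows the difference; the look-alike string is made up, and `re.escape` is a Python helper that adds the backslashes for you (it is not something Looker Studio or Analytics offers, as far as I know).

```python
import re

lookalike = "123x456y10z10"

# Unescaped dots act as wildcards, so the look-alike slips through.
unescaped_hit = bool(re.fullmatch(r"123.456.10.10", lookalike))
print(unescaped_hit)  # True

# Escaped dots only match a literal dot.
escaped_hit = bool(re.fullmatch(r"123\.456\.10\.10", lookalike))
print(escaped_hit)  # False

# Let Python do the escaping.
escaped_pattern = re.escape("123.456.10.10")
print(escaped_pattern)  # 123\.456\.10\.10
```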

Below you will find a few Shorthand Character Classes (Regex Shortcuts) that are supported in Google Analytics and Google Data Studio. Shorthands are simple, more human-readable shortcuts for specific regex character lists.
Check out the combos to see examples.

Regex – Any word < \w >

The ‘\w’ matches any ASCII word character: letters, digits, and the underscore. It is the shortcut for ‘[A-Za-z0-9_]’.

Regex – Any number < \d >

The ‘\d’ matches any single digit. It’s short for ‘[0-9]’.

Regex – Space < \s >

The ‘\s’ matches a whitespace character such as a space or a tab. It can be used to separate or count words.

Check out the combo on counting words below to learn how to use these shortcuts.
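To get a feel for the three shortcuts, you can pull a search query apart with them. The sketch below uses Python's `re` module with the `re.ASCII` flag, which I added so that ‘\w’ and ‘\d’ stick to ASCII characters as described above (without it, Python also accepts other alphabets). The search query is invented.

```python
import re

query = "blue gift_card 2020"

# \w+ grabs runs of letters, digits and underscores.
words = re.findall(r"\w+", query, flags=re.ASCII)
print(words)  # ['blue', 'gift_card', '2020']

# \d grabs one digit at a time.
digits = re.findall(r"\d", query, flags=re.ASCII)
print(digits)  # ['2', '0', '2', '0']

# \s grabs each whitespace character.
space_count = len(re.findall(r"\s", query))
print(space_count)  # 2

# The long form gives the same result as \w.
long_form = re.findall(r"[A-Za-z0-9_]+", query)
print(long_form == words)  # True
```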

Regex combos

And now, brace yourself for the real power of Regex: combos!

Dot Asterisk < .* >

This is the Regex combo I use the most in Data Studio or Analytics. The dot means any character and the asterisk says that the dot can happen 0 or more times. So basically this means any character and any number of characters. For example:

‘/sign-up/.*/thank-you’ matches both
/sign-up/newsletter/thank-you
as well as
/sign-up/mailing/thank-you
or any other text or numbers in between the forward slashes.
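Here is the same example as a Python sketch, so you can test your own paths before building a filter. The third path is made up to show what does not match.

```python
import re

pattern = re.compile(r"/sign-up/.*/thank-you")

paths = [
    "/sign-up/newsletter/thank-you",
    "/sign-up/mailing/thank-you",
    "/sign-up/thank-you",
]

# .* fills the gap between the two fixed parts, whatever its length.
hits = [p for p in paths if pattern.search(p)]
print(hits)  # ['/sign-up/newsletter/thank-you', '/sign-up/mailing/thank-you']
```

The last path is skipped because the expression asks for two slashes around the middle part, and “/sign-up/thank-you” only has one.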

Number of words with regex

Say you want to filter on any number of words used as a search query on your website. Using the regex shortcuts in combination with the caret and dollar sign you can do just that!

Match 1 word:

^\w*$

I’ll skip two words because I’m a rebel! Matching 3 words:

^\w*\s\w*\s\w*$

Now you can create badass segments filtering short- and longtail search queries and be able to see the impact on your website goals.
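The word-count expressions above can be tried out with the Python sketch below. One thing it brings to light: because the asterisk also allows zero characters, ‘^\w*$’ matches an empty search query as well. The stricter variant with a plus sign is my own suggestion, not something the Google tools require. The search queries are made up.

```python
import re

one_word = re.compile(r"^\w*$")
three_words = re.compile(r"^\w*\s\w*\s\w*$")

queries = ["giftcard", "buy a giftcard", "cheap giftcard", ""]

single = [q for q in queries if one_word.search(q)]
print(single)  # ['giftcard', '']

triple = [q for q in queries if three_words.search(q)]
print(triple)  # ['buy a giftcard']

# Stricter variant: + demands at least one character, so "" drops out.
strict_one_word = re.compile(r"^\w+$")
strict_single = [q for q in queries if strict_one_word.search(q)]
print(strict_single)  # ['giftcard']
```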

Handling IP-addresses

There are a few tricks to matching IP-addresses with regex. First, you will need to escape the dots which are used in every IP-address.

123\.456\.789\.10

If you need to match a range of different IP-addresses you can use the list and the curly brackets quantifier together for a powerful combo. For example, say you want to match all IP-addresses from 123.456.789.0 up to 123.456.789.99:

123\.456\.789\.[0-9]{1,2}

That was easy right? But what if you only need to match 0-25 and 55-70?

123\.456\.789\.([0-9]|1[0-9]|2[0-5]|5[5-9]|6[0-9]|70)

Finally, if you’re looking to match any number in the last octet of the IP-address you can also use the shortcut ‘\d’. A single ‘\d’ stands for one digit, so add a quantifier to cover 1 to 3 digits:

123\.456\.789\.\d{1,3}

Using regex to filter IP-addresses? Check if you are using IP anonymization. If you are, always add ‘0’ at the end of every IP-address. Or use ‘\d’ in the last octet.
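Range expressions like the 0-25 and 55-70 one are easy to get wrong, so it pays to test them against every possible last octet. The Python sketch below does that with `re.fullmatch`, which assumes the whole value has to match. It also shows that with a "contains" match the same expression catches more than intended, because “123.456.789.1” is found inside “123.456.789.100”.

```python
import re

pattern = re.compile(
    r"123\.456\.789\.([0-9]|1[0-9]|2[0-5]|5[5-9]|6[0-9]|70)"
)

# Try every last octet from 0 to 255 and keep the ones that match in full.
matched = [n for n in range(256) if pattern.fullmatch(f"123.456.789.{n}")]
expected = list(range(0, 26)) + list(range(55, 71))
print(matched == expected)  # True

# A "contains" match is looser: the first digits of 100 are enough.
loose_hit = bool(pattern.search("123.456.789.100"))
print(loose_hit)  # True
```

If your tool matches on a part of the value, adding a dollar sign at the end of the expression closes that gap.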

Matching e-mail addresses

Most common e-mail addresses can be matched with the regex below.

^([\w-]+(?:\.[\w-]+)*)@((?:[\w-]+\.)*\w[\w-]{0,66})\.([a-z]{2,6}(?:\.[a-z]{2})?)$
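Before relying on an expression this long, it helps to run a few addresses through it. The Python sketch below does that; the addresses are invented, and Python's `re` module is only a stand-in for Google's regex engine.

```python
import re

email = re.compile(
    r"^([\w-]+(?:\.[\w-]+)*)@((?:[\w-]+\.)*\w[\w-]{0,66})"
    r"\.([a-z]{2,6}(?:\.[a-z]{2})?)$"
)

addresses = [
    "jane.doe@example.com",
    "jane@shop.example.co.uk",
    "jane@example",          # no top-level domain
    "jane doe@example.com",  # space in the name
    "jane@example.COM",      # upper-case top-level domain
]

valid = [a for a in addresses if email.search(a)]
print(valid)  # ['jane.doe@example.com', 'jane@shop.example.co.uk']
```

Two limits worth knowing: the last part only accepts lower-case letters, and only 2 to 6 of them, so upper-case or longer top-level domains are rejected.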
