Content & chapters
In this post, I will explain how to use regular expressions (regex) in Google Data Studio, Google Analytics, and also Tag Manager. I will also share some advanced examples and use-cases. And last but not least, I will update this post whenever stumbling upon new methods to put regex to use.
Chapters
This article is written with a focus on Google Data Studio. However, the regex functions can also be applied in Google Analytics as well as Tag Manager.
Let’s commence.
Regex symbol meaning
The definition of regular expression according to Wikipedia is:
A regular expression, regex or regexp (sometimes called a rational expression) is a sequence of characters that define a search pattern.
There are many different types of regex used by many different programming languages and programs. Google supports a simple yet powerful variant of regex for Data Studio and Analytics.
In the table below you find the symbols used in regular expressions. Click on the name of the symbol to jump to the detailed information.
Symbol | Name | Meaning |
. | dot | Matches any character |
? | question mark | The preceding character is optional |
+ | plus | Matches the previous character 1 or more times |
* | asterisk | Matches the previous character 0 or more times |
| | pipe | Creates an OR |
^ | caret | Must begin with next characters(s) |
$ | dollar | Must end with preceding character(s) |
( ) | parentheses | Matches enclosed characters in exact order in a string or groups expressions |
[ ] | square brackets | Matches enclosed characters in any order in a string |
{ } | curly brackets | Matches preceding character a specific number of times. |
– | dash | Used to create a range of characters |
\ | backslash | Sets the next character to be interpreted literally |
\w | any word | Matches any word, containing letters and/or numbers |
\d | any number | Matches any one or multiple consecutive numbers |
\s | a space | Matches a space |
You can also find a list on the Google support page about regex here.
Let’s dive a little deeper into every symbol.
Regex – Dot < . >
The dot matches any single character. Basically it’s a wildcard. For example:
- .age‘ matches e.g. “page”, “rage” or “sage”
- ‘l..k’ matches e.g. “look”, “link” or “lock”
- ‘100.’ matches e.g. “1000”, “1001”, “1009”, “100a” or “100@”
The number of dots define the number of characters. By combining the dot symbol with the asterisk symbol you can match any character and any number of characters.
Regex – Question mark < ? >
The question mark sets the preceding character as optional. This can be useful for catching misspellings or to manipulate numbers:
- ‘miss?pelling’ matches “misspelling” and “mispelling”
- ‘101?’ matches “101” and “10”
A common use-case for the question mark symbol is in handling IP-addresses. Check it out here.
Regex – Plus < + >
The
An example of a plus in action:
- ’10+’ matches e.g. “100”, “1000”, “10000” etc.
- ‘ballo+n’ matches e.g. “balloon”, “ballooon”, “balloooon”, do I need to continue? 🙂
Regex – Asterisk < * >
The asterisk matches the preceding character 0 or more times. Like the plus symbol but in stead of a minimum of 1 character, the asterisk also matches if the character before is not a part of the result.
So:
- ‘10*’ matches e.g. “10”, “100”, “1000” etc.
- ‘ballo+n’ matches e.g. “ballon”, “balloon”, you got it right?
Awesome combo: Dot & Asterisk to mach any character, any number of times.
Regex – Pipe < | >
The pipe symbol means OR. If you need to match more variations in one single query, the pipe will come in handy!
An example: ‘/gift/|/giftcard/’
Matches any URL which contains either /gift/ or /giftcard/.
Regex – Caret < ^ >
The caret symbol is used to define the beginning of any query. So if you start with a caret, the character(s) that come after only match if it’s the beginning of the result.
For example: ‘^/giftcard/’ only matches pages that begin with /giftcard/. If your website also contains URL’s like /buy/giftcard/, these will not match the query.
Regex – Dollar < $ >
Like the caret but the other way around 🙂 The dollar sign defines the ending of any query. In other words, nothing comes after the dollar sign. Like when Gandalf puts his staff down, you ain’t getting past.
Example: ‘giftcard$’ matches ‘giftcard’, ‘buy-giftcard’ or ‘shop giftcard’. It does not match ‘giftcards‘ or ‘/giftcard/‘.
Regex – Parentheses < ( ) >
Matches enclosed characters in exact order in a string. Parentheses are also used to group expressions.
‘AB|C’ = ‘AB’ or ‘C’
‘A(B|C)’ = ‘AB’ or ‘AC’
Parentheses are very handy in handling IP-addresses. Read all about it.
Regex – Square brackets < [ ] >
With square brackets you can create lists. Put multiple character inside the square brackets and the expression will match 1 of those.
So for example ‘product[ABC]’ matches:
- ‘productA’
- ‘productB’
- ‘productC’
It does not match ‘productAB’ or ‘productAC’.
Regex – Curly brackets < { } >
Curly brackets (or braces) repeat the preceding character a specific number of times. For some reason, these brackets are not mentioned in any Google Analytics documentation. Curly brackets can contain 2 numbers, separated with a comma like so {2,5}. The first number states the minimum times a character is repeated and the second number defines the maximum. The maximum setting is optional.
Regex – Dash/ hyphen < – >
The dash symbol is used in combination with the square brackets. Put them together and you can create powerful lists. For example ‘[A-Z]’ means every capital letter in the alphabet. It works the same for digits: ‘[0-9]’.
Regex – Backslash < \ >
The backslash functions as an escape in a regex. Need to use a symbol from the list above but not as a regex function? Escape it using de backslash. For example, you want to search for a string containing dots like an IP-address 123.456.10.10 you can escape the function of the dot with the backslash like so: 123\.456\.10\.10
Below you will find a few Shorthand Character Classes (Regex Shortcuts) that are supported in Google Analytics and Google Data Studio. Shorthands are a simple, more human-readable shortcuts for specific regex characters.
Check out the combo’s to see examples.
Regex – Any word < \w >
The ‘\w‘ matches any ASCII character so letters, digits, and the underscore. It is the shortcut for ‘[A-Za-z0-9_]‘.
Regex – Any number < \d >
The ‘\d‘ matches any number (digit). It’s short for ‘[0-9]’.
Regex – Space < \s >
The ‘\s’ matches a space. It can be used to separate or count words.
Check out this combo to learn how to use these shortcuts.
Regex combo’s
And now, brace yourself for the real power of Regex: combo’s!
Dot Asterisk < .* >
This is the Regex combo I use the most in Data Studio or Analytics. The dot means any character and the asterisk says that the dot can happen 0 or more times. So basically this means any character and any number of characters. For example:
‘/sign-up/.*/thank-you’ matches both
/sign-up/newsletter/thank-you
as well as
/sign-up/mailing/thank-you
or any other text or numbers in between the forward slashes.
Number of words with regex
Say you want to filter on any number of words used as a search query on your website. Using the regex shortcuts in combination with the carrot and dollar sign you can do just that!
Match 1 word:
^\w*$
I’ll skip two words because I’m a rebel! Matching 3 words:
^\w*\s\w*\s\w*$
Now you can create badass segments filtering short- and longtail search queries and be able to see the impact on your website goals.
Handling IP-addresses
There are a few tricks to matching IP-addresses with regex. First, you will need to escape the dots which are used in every IP-address.
123\.456\.789\.10
If you need to match a range of different IP-addresses you can use the list and the curly brackets quantifier together for a powerful combo. For example, you want to match all IP-addresses from 123.456.789.0 up to 123.456.789.99
123\.456\.789\.[0-9]{1,2}
That was easy right? But what if you only need to match 0-25 and 55-70?
123\.456\.789\.([0-9]|1[0-9]|2[0-5]|5[5-9]|6[0-9]|70)
Finally, if you’re looking to match any number in the last octet of the IP-address you can also use the shortcut ‘\d’
123\.456\.789\.\d
Using regex to filter IP-addresses? Check if you are using IP anonymization. If you are, always block ‘0’ at the end of every IP-address. Or use ‘\d’ in the last octet.
Matching e-mail addresses
Correctly matching e-mail addresses can be done with the regex below.
^([\w-]+(?:\.[\w-]+)*)@((?:[\w-]+\.)*\w[\w-]{0,66})\.([a-z]{2,6}(?:\.[a-z]{2})?)$
28 comments
Hi Mark, how are you doing?
I need some help with Data Studio and REGEXP…
Is there a way to get additive conditions using regexp_match?
Ex: lets pretend I have the Path an user follow navigating on my website stored in the field “Path”.
Can I filter only the users who visited BOTH page 1 and page 3?
Path A = page1/page4/page3/page6 –> MATCH
Path B = page2/page4/page3/page6 –> DOESN’T MATCH
Plus: Can I filter only the users who visited page 1 before page 3? I mean… Is there any way to filter only people who followed a specific page order?
Path C = page1/page4/page3/page6 –> MATCH
Path D = page3/page4/page1/page6 –> DOESN’T MATCH
Thank you in advance,
Henrique
Hey Henrique,
I’m not entirely sure I understand what you mean by additive conditions, but here’s an answer to your question that I think might work just fine.
To match only page 1 and 3, and in that specific order, you could simply filter using the condition CONTAINS:
page1.*page3
You could extend this for example to only people who entered on page 1 and visited page 3 or 6 after:
^page1.*page(3|6)
Hope this helps!
Hi! I’m extremely new to this Regex stuff. Here is what I’m wanting to do:
I have a dashboard in data studio with Page as a dimension. It’s displaying a long URL with the name of products. I’m wanting to create a custom dimension that just shows the product name in the url. Here is an example of the URl when I have Page as a dimension
products/peanut-butter-chunky
products/peanut-butter-chunky/
I’m not sure why sometimes a product URL shows with a / at the end and other times it doesn’t. Here is the code I’m trying to write and not getting the results I want:
REGEXP_EXTRACT(Page, “([*products])”)
Any help would be much appreciated. Thank you!
Hey Cody,
I think you have a few options.
Hope this helps!
Hi Mark! Hope you are doing well! Great article 🙂
I have this data studio problem at hand, where I have custom weeks : My week starts on Friday and ends on Thursday. I tried using custom date ranges using CASE WHEN, however, is there a way I can use a between clause or any other clause with CASE WHEN so that I can select all the dates within a date range and group them as WK1, WK2 and so on , instead of having to write 365 days individually? Please let me know 🙂
Thanks!
Hi Sasmita,
Maybe this article will help you.
Let me know if this works.
Hi,
I need help regarding REGEXPR_EXTRACT
I want to extract data speparated by “;;” like: ;;;;;;
I’m try the following regular expression but without success:
For the first: REGEXP_EXTRACT(Event Param Value (String),'(.*);;.*’)
For the second: REGEXP_EXTRACT(Event Param Value (String),’.*;;(.*);;.*’)
For the third: REGEXP_EXTRACT(Event Param Value (String),’.*;;.*;;.*;;(.*)’)
can you help me to find the right regular expression?
Thanks,
Philippe
Hi Philippe,
Try these:
1. REGEXP_EXTRACT(Event Param Value , ‘(.*);;(.*);;.*$’)
2. REGEXP_EXTRACT(Event Param Value , ‘;;(.*);;.*$’)
3. REGEXP_EXTRACT(Event Param Value , ‘^.*;;(.*)+’)
Hope these work for you!
Hi Mark,
The expressions provided does not work better.
Doing more test I have seen that the expression matchs from the end
For example with data “Field 1;;Field 2;;Field 3;;Field 4;;Field 5”
The expression REGEXP_EXTRACT(Event Param Value (String),'(.*);;.*’) return “Field 1;;Field 2;;Field 3;;Field 4”
I want to extract “Field 1” and the number of fields are variable.
How to have the expression doing matching from the start instead of from the end?
Thansk in advance,
Philippe
Hi Mark,
For your information I have found how to get the first field separated by “;;” whatever the number of field is: REGEXP_EXTRACT(Event Param Value (String),'(.*?);;.*’)
Thanks for your support,
Philippe
Awesome! Thx for sharing!
Hey Mark! Is there a way to create filtered metrics / fields in Data Studio? Say I have transactions as a metric, but I also want to create a filtered version that reflects the number of transactions that are a certain type, such as subscription transactions. This way I can then analyze percent of total and measure take-rates or upsell rates, etc.
Thanks!
Hi Aaron,
Sorry for the late response. Maybe you already found a solution, if not, I think blending data and filter the transactions there should help you out. In a way, you create a new data source, based on the existing one. You add the existing data source two times, each with transactions as a metric. Then you filter one of them using standard filtering options. That leaves you with a data source containing two metrics, total transaction (unfiltered) and subscription transactions (filtered). You can do calculations with these metrics to get ratios.
Hope this helps.
Hi
In google analytics, what is the maximum no of products that can go in Regex if we need to apply ceratin filter.
Thanks
Hi Sanchi,
‘Products’ meaning characters? If so, I think it differs on where you input the regex in GA. I’ve read that some people experience a max. of 255 characters.
If you mean how many products (within a dimension) a regex can return; this could vary from 0 to 100%. There is no limit.
Hi Mark,
I’m trying to rename Event Labels being pulled from GA into Data Studio but any event label that has “+” in the label doesn’t register. How would I account for the “+” in my expression so it registers? Sample below.
Thanks in advance
WHEN REGEXP_MATCH(Event Label, “.*(?i)(Why FastSpring: Eliminate Complexity + Risk).*”) THEN “Eliminate Complexity + Risk”
Hi Hamilton,
Try escaping the plus with the backslash symbol. This sets the next character to be interpreted literally.
So:
WHEN REGEXP_MATCH(Event Label, “.*(?i)(Why FastSpring: Eliminate Complexity \+ Risk).*”) THEN “Eliminate Complexity + Risk”
Hi Mark,
I’ve got a bit of a puzzle. I’m trying to pull GA sessions to specific pages and directories using a calculated field in GDS. Here’s the formula I’m using in GDS:
CASE
WHEN REGEXP_MATCH(Page, “/pricing.*”) THEN “/pricing”
WHEN REGEXP_MATCH(Page, “/integrations.*”) THEN “/integrations/”
WHEN REGEXP_MATCH(Page, “/accounting.*”) THEN “/accounting/”
WHEN REGEXP_MATCH(Page, “/blog.*”) THEN “/blog/”
WHEN REGEXP_MATCH(Page, “^/$|^/\?.*”) THEN “home page”
END
*Note: I’m also applying the following filter in GDS:
Include Source/Medium Contains organic
When I check the results of that formula (a stacked column graph) against GA, all of them work except the home page. In data studio, it shows 962 Sessions. In GA the following shows 799 sessions:
Acquisition > All Traffic > Source Medium
-Secondary Dimension: Page
-Filter: Include Source/Medium Containing organic
-Filter: Include Page MatchingRegExp (^/$|^/\?.*$)
Any idea what might be causing the discrepancy?
Thanks,
-Jared
Hi Jared,
It’s difficult to test/verify these solutions remotely but the first thing I noticed is the difference in de Regex you use in GA vs. GDS:
In theory it should not create this discrepancy but you could try to match the two or leave the querystring out of the equation and see if your problem lies there?
I also experienced these type of issues with capital vs. lower case. You could try to set the source/medium filter to a regex (instead of ‘containing’).
Let me know! If it doesn’t work I will try to reproduce your issue.
Hi Mark,
I’m all new to Regex. Last month I created a regex in GDS to group urls for different product categories that worked well at start. This month however, it seems that only the first 5 categories are displaying and the rest have just disappeared. Could you possibly help me with why this is?
CASE
WHEN REGEXP_MATCH(Page Title,”.*(?i)Atopisk.*”) THEN “Atopic Skin”
WHEN REGEXP_MATCH(Page Title,”.*(?i)kliande.*”) THEN “Atopic Skin”
WHEN REGEXP_MATCH(Page Title,”.*(?i)Atocontrol.*”) THEN “Atopic Skin”
WHEN REGEXP_MATCH(Page Title,”.*(?i)Akne.*”) THEN “Acne”
WHEN REGEXP_MATCH(Page Title,”.*(?i)Dermopurifyer.*”) THEN “Acne”
WHEN REGEXP_MATCH(Page Title,”.*(?i)Dermopurifier.*”) THEN “Acne”
WHEN REGEXP_MATCH(Page Title,”.*(?i)Hyperpigmentering.*”) THEN “Hyperpigmentation”
WHEN REGEXP_MATCH(Page Title,”.*(?i)pigmentfläckar.*”) THEN “Hyperpigmentation”
WHEN REGEXP_MATCH(Page Title ,”.*(?i)Anti-Pigment.*”) THEN “Hyperpigmentation”
WHEN REGEXP_MATCH(Page Title,”.*(?i)Hyperkänslig.*”) THEN “Hypersensitive Skin”
WHEN REGEXP_MATCH(Page Title,”.*(?i)AntiREDNESS.*”) THEN “Hypersensitive Skin”
WHEN REGEXP_MATCH(Page Title,”.*(?i)UltraSensitive.*”) THEN “Hypersensitive Skin”
WHEN REGEXP_MATCH(Page Title,”.*(?i)rosacea.*”) THEN “Hypersensitive Skin”
WHEN REGEXP_MATCH(Page Title,”.*(?i)åldrande.*”) THEN “Anti-age”
WHEN REGEXP_MATCH(Page Title,”.*(?i)anti-age.*”) THEN “Anti-age”
WHEN REGEXP_MATCH(Page Title,”.*(?i)Hyaluron.*”) THEN “Anti-age”
WHEN REGEXP_MATCH(Page Title,”.*(?i)Elasticity.*”) THEN “Anti-age”
WHEN REGEXP_MATCH(Page Title,”.*(?i)C-vitamin.*”) THEN “Anti-age”
WHEN REGEXP_MATCH(Page Title,”.*(?i)åldersbekämpande|.*”) THEN “Anti-age”
WHEN REGEXP_MATCH(Page Title,”.*(?i)hudsprickor.*”) THEN “Aquaphor”
WHEN REGEXP_MATCH(Page Title,”.*(?i)laserbehandling.*”) THEN “Aquaphor”
WHEN REGEXP_MATCH(Page Title,”.*(?i)Aquaphor.*”) THEN “Aquaphor”
WHEN REGEXP_MATCH(Page Title,”.*(?i)spruck.*”) THEN “Aquaphor”
WHEN REGEXP_MATCH(Page Title,”.*(?i)brännskador.*”) THEN “Aquaphor”
WHEN REGEXP_MATCH(Page Title,”.*(?i)torr.*”) THEN “Dry Skin”
WHEN REGEXP_MATCH(Page Title,”.*(?i)torra.*”) THEN “Dry Skin”
WHEN REGEXP_MATCH(Page Title,”.*(?i)uttorkad.*”) THEN “Dry Skin”
WHEN REGEXP_MATCH(Page Title,”.*(?i)uttorkade.*”) THEN “Dry Skin”
WHEN REGEXP_MATCH(Page Title,”.*(?i)pH5.*”) THEN “Dry Skin”
WHEN REGEXP_MATCH(Page Title,”.*(?i)keratosis.*”) THEN “Dry Skin”
WHEN REGEXP_MATCH(Page Title,”.*(?i)UreaRepair.*”) THEN “Dry Skin”
WHEN REGEXP_MATCH(Page Title,”.*(?i)xeros.*”) THEN “Dry Skin”
WHEN REGEXP_MATCH(Page Title,”.*(?i)psoriasis.*”) THEN “Dry Skin”
WHEN REGEXP_MATCH(Page Title,”.*(?i)aquaporin.*”) THEN “Dry Skin”
WHEN REGEXP_MATCH(Page Title,”.*(?i)sun.*”) THEN “Sun”
WHEN REGEXP_MATCH(Page Title,”.*(?i)sol.*”) THEN “Sun”
WHEN REGEXP_MATCH(Page Title,”.*(?i)högenergiljus.*”) THEN “Sun”
WHEN REGEXP_MATCH(Page Title,”.*(?i)solskydd.*”) THEN “Sun”
WHEN REGEXP_MATCH(Page Title,”.*(?i)Om huden.*”) THEN “General Skin”
WHEN REGEXP_MATCH(Page Title,”.*(?i)Hudvård.*”) THEN “General Skin”
WHEN REGEXP_MATCH(Page Title,”.*(?i)rengöring.*”) THEN “Cleansing”
WHEN REGEXP_MATCH(Page Title,”.*(?i)DermatoCLEAN.*”) THEN “Cleansing”
WHEN REGEXP_MATCH(Page Title,”.*(?i)läppvård.*”) THEN “Other Product”
WHEN REGEXP_MATCH(Page Title,”.*(?i)lip.*”) THEN “Other Product”
WHEN REGEXP_MATCH(Page Title,”.*(?i)svettning.*”) THEN “Other Product”
WHEN REGEXP_MATCH(Page Title,”.*(?i)deo.*”) THEN “Other Product”
ELSE “Misc”
END
Thank you in advance! Anna
Hi Anna,
You’re on a roll 🙂 Nice work with the CASE statement! I’ve heard of someone that had 1,200 rows in one CASE, don’t know if it true or not…
Anyhow, I don’t think your REGEX is the problem here, it looks fine to me. You say that it worked well at the start, is it still functioning when you select this first period?
Please let me know!
HI,
Im constantly looking for Datastudio Regex Articles, and there is still a lot to be done 🙂
first of all, thank you very much for this great work.
i want to change the names of my “video” fields, into shorter ones.
my video names include: text, numbers and charcters.
i created this regex, to make this happen:
CASE
WHEN REGEXP_MATCH(Video, “.*Piko 1.*”) THEN “Video Piko 1 – xxx”
WHEN REGEXP_MATCH(Video, “.*Piko 11.*”) THEN “Video Piko 11 – xxx”
Problem: “piko 11” is recognized as Piko 1, since the logic doenst consider that there are 2 digits
Looking forward to your support
Best Regards
Poli
Hi Poli,
Thanks for reaching out! I get your issue, I’ve dealt with this also.
I’m guessing there is more in de Video name after ‘Piko 1’ because of the dot-asterisk? Is there a way to recognizes this? Maybe a space (‘\s’), or any letter (‘[a-zA-Z]’) after the space.
So the regex for Piko 1 would be: “.*Piko 1\s.*” or “.*Piko 1\s[a-zA-Z].*”
Hi
I need to extract the date from text with this format “MEX | SJD | 2020/04/26 | 2020/04/29 | 1 | 2 | 1”. I tried using REGEXP_EXTRACT but I didn’t know to set the parameter that I have to use. Do you have any idea?
Hi Juan,
I think you could use CONCAT in combination with REGEXP_EXTRACT. The CONCAT combines the three dimensions to a date.
CONCAT(,"^.*\|.*\|.([0-9]{4})/[0-9]{2}/[0-9]{2}.*"), ,"^.*\|.*\|.[0-9]{4}/([0-9]{2})/[0-9]{2}.*"), ,"^.*\|.*\|.[0-9]{4}/[0-9]{2}/([0-9]{2}).*")
REGEXP_EXTRACT(
REGEXP_EXTRACT(
REGEXP_EXTRACT(
)
If you need the second date, add a
.*\|
I didn’t put it to the test so it might need some tweaking. But I’m sure you’ll manage! If not, let me know.
Many thanks it works.
Hi there,
I’m looking for a workaround to solve this problem.
I want to categorize my internal search queries.
WHEN REGEXP_MATCH(Search Term, “(book|toys)” THEN “Products”
How can I match the query “which book to buy” within “products” but don’t match “which bookstore?”
Do you know a solution?
Tom
Hi Tom,
I think one solution is to account for a space after ‘book’ (or ‘books’) which would not match ‘bookstore’.
WHEN REGEXP_MATCH(Search Term, “(books?\s|toys)” THEN “Products”
I’ve seen long lists of When/Then to account for a lot of search queries. Hope this helps!