Fixing Spam Referrals and Ghost Sessions within Google Analytics and Google Tag Manager

vickeryhill | Analytics

“HOUSTON, WE’VE HAD A PROBLEM HERE!”

In early 2015, we identified a surge in malware infecting some of our clients’ websites that run on WordPress. WordPress is an open-source tool that allows developers to build simple content management websites quickly and efficiently. Like any open-source tool, WordPress is prone to malicious attacks. The majority of these attacks are quickly identified by the online community and updates to WordPress and plugins are released.

At the same time, and perhaps coincidentally, there were spikes in spam referral sessions recorded in these same sites’ Google Analytics accounts (as well as in just about all others, for that matter). In some instances this fake web traffic was as much as 2x the sites’ total actual visits, and up to 12x the total actual website referrals.

Blue represents all sessions. Orange represents custom segment created to view actual traffic. 

While we have yet to prove with 100% certainty that the malware infections and spam bot traffic were directly tied together, we have definitively concluded that this traffic is 100% fraudulent. This fraudulent traffic, if read as actual traffic, could steer business owners, web developers and marketers in the wrong direction.

In this article we’ll review:

  • the scope of the problem including what spam and ghost referrals are,
  • why they are a problem,
  • who is to blame,
  • a multi-level solution using filters in Google Analytics (GA),
  • an advanced solution using Google Tag Manager (GTM) to block Ghost referrers,
  • and some ideas on how Google could protect and authenticate your tracking code better for the future.

 



 

BOT BUZZ-WORDS. DEFINED.

We’re all familiar with Spambots, Twitterbots and Bloggerbots that invade our digital worlds with unwanted spam content, spam links, and spam posts. Now let’s define a few new players: Referralbots and Ghost Referralbots.

Referralbots & Ghost Referralbots: What do they do?

Why these spammers use this method:

  1. To get their sites, or the sites that they are getting paid to advertise for, registered/recorded in websites’ public access logs that can then be crawled by legitimate search engine bots (like Google or Bing). The hope is that this will improve their (or their clients’) organic search placement and performance.
  2. To advertise to unsuspecting Google Analytics users by appearing among legitimate sites in GA’s referral reports.
  3. To lure you onto their sites. After kneecapping your data, they count on you visiting their website to figure out who is sending you so much traffic, and this is the scary part. Once you are on one of these sites, malware could be downloaded onto your computer. If you have access to a website’s analytics account, it’s probable that you also have some type of FTP connection to that website, and now you’ve created a backdoor for a crafty program to do who knows what. Never click on unknown referral URLs without researching them first.

What are Referralbots

Referralbots, aka spam referrers, are computer programs that generate fake HTTP referrer header data using fabricated URLs.

There are sophisticated and automated ways to mock this data, but anyone with a little time on their hands could clear their browser cache and manually send 300 fake referrals to your website in the time it would take you to watch one 55-minute episode of Game of Thrones.

Without writing the cookbook for bored teenage boys to mess with ‘the man’, let me hint and say that I was able to set up and send 6 fake referrals from a fake domain (i-am-spam.com) in less than a minute using just a web browser and without touching a single line of code.

This scares me because it’s so easy to fake the same amount of referral traffic that you pay good money to earn legitimately through link building and paid advertising.

What are Ghost Referralbots

Ghost Referralbots take advantage of how Google Analytics stores and passes your data to its servers. By plugging randomly generated account IDs into GA scripts on their own servers (shown below in red), these bots can fake sessions with the same HTTP header data without ever requesting HTML from your web server or visiting your website. It’s also possible for them to randomly select Google Tag Manager container IDs and use the same method to mess with your data.

Below is a diagram that simplifies how these bots operate.

Spam Referralbots (in orange) transmit session data in the same way that real referral traffic (in green) does. Ghost Referralbots (in red) make no connection to your site and request no HTML from your web server. They pass data directly to Google’s data centers.

 

Example scripts that Ghosts use to plunder your analytics data.

Universal Google Analytics Script

<!-- Start Universal Analytics Tracking Script -->
<script>
  (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
  m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
  })(window,document,'script','//www.google-analytics.com/analytics.js','ga');

  ga('create', 'UA-XXXXXXXX-1', 'i-am-spam.com');
  ga('send', 'pageview');
</script>
<!-- End Universal Analytics Tracking Script -->

Classic Google Analytics Tracking Script

<!-- Start Classic Analytics Tracking Script -->
<script type="text/javascript">
  var _gaq = _gaq || [];
  _gaq.push(['_setAccount', 'UA-XXXXXXXX-1']);
  _gaq.push(['_setDomainName', 'vickeryhill.com']);
  _gaq.push(['_trackPageview']);

  (function() {
    var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
    ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
    var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
  })();
</script>
<!-- End Classic Analytics Tracking Script -->

Google Tag Manager Script, which can contain either GA Script

<!-- Start Google Tag Manager Script -->
<noscript><iframe src="//www.googletagmanager.com/ns.html?id=GTM-XXXX"
height="0" width="0" style="display:none;visibility:hidden"></iframe></noscript>
<script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':
new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],
j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src=
'//www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);
})(window,document,'script','dataLayer','GTM-XXXX');</script>
<!-- End Google Tag Manager Script -->

How do you know the difference?

An easy way to tell the difference between Ghost Referrers and Spam Referrers is to look at the hostname recorded for the session. Since Ghosts are just randomly running analytics account IDs, they don’t have a clue what hostname (your site) they are attempting to mimic.

This is a major flaw in the way that GA transfers and authenticates data and approved hosts.

The first column is a sample of spam referrers and the second column is the hostname recorded at the session. Note that the Ghosts (in red) have no idea what hostname they should transmit to Google and they don’t attempt to hide that.
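The hostname check described above can be sketched in a few lines of JavaScript. This is only an illustration, not something GA runs for you: the hostname list, the spam list, the sample rows, and the ‘(not set)’ hostname value are all assumptions made up for the example.

```javascript
// Hypothetical list of hostnames that legitimately carry your tracking code.
const validHostnames = ['vickeryhill.com', 'www.vickeryhill.com'];

// Hypothetical list of referrer domains already confirmed as spam.
const knownSpamReferrers = ['i-am-spam.com'];

// Classify one (referrer, hostname) row from a referral report.
function classifyReferral(referrer, hostname) {
  const host = hostname.toLowerCase();
  if (!validHostnames.includes(host)) {
    return 'ghost'; // the hit was never recorded on one of our hostnames
  }
  if (knownSpamReferrers.includes(referrer.toLowerCase())) {
    return 'spam'; // the hit reached our site but carried a spam referrer
  }
  return 'legitimate';
}

// Made-up sample rows for illustration only.
const rows = [
  { referrer: 'i-am-spam.com', hostname: 'vickeryhill.com' },
  { referrer: 'i-am-spam.com', hostname: '(not set)' },
  { referrer: 'google.com', hostname: 'www.vickeryhill.com' },
];

const labels = rows.map((row) => classifyReferral(row.referrer, row.hostname));
console.log(labels); // [ 'spam', 'ghost', 'legitimate' ]
```

The hostname test runs first on purpose: a hit with an unknown hostname is a Ghost no matter which referrer it claims.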


WHAT TO DO? WHERE TO START?

Scan your referral data for spikes in referral traffic over the last 12 months. I’ve noticed that many sites had peaks in August of 2014 and March of 2015. If you see lots of sites you don’t recognize, ‘Google’ each domain along with the words ‘spam referral’ (‘badwebsite.com spam referral’). If the search turns up even a few results, the domain is probably no good. Some of these sites appear to be reputable, but if they are mucking up your data, they can go screw.

Record a list of these domains for you to exclude with filters in Google Analytics. If there are more than 20, do what you can and start with the ones that generate the most sessions.

Take your list and add it to a list of other known offenders. Ben Travis’s article, Removing Referral Spam from Google Analytics, is a good read and has a great list of these domains, which I used as the starting point for my own sites’ list.

Generate regular expression rules from this list and test them using a RegEx tester like RegExPal.com. You’ll break the list into rules of 255 characters max, which you’ll use to create exclude filters in all your GA views. It’s very important that you test these rules to make sure that you’re not excluding any real referral sources like your own domains, Google, Facebook, news sites, etc.
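If your list is long, splitting it by hand gets tedious. Below is a minimal JavaScript sketch that packs domains into patterns of at most 255 characters and checks them against referrers you want to keep. The function names and both domain lists are hypothetical. It also assumes that plain alternation with escaped dots behaves the same in JavaScript as in GA’s filter field, so still try the output in your ‘Testing’ view.

```javascript
const MAX_FILTER_LENGTH = 255;

// Escape characters that have special meaning in a regular expression.
function escapeForRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Pack escaped domains into "a|b|c" patterns no longer than maxLength.
// (A single domain longer than maxLength would still get its own pattern.)
function buildExcludePatterns(domains, maxLength = MAX_FILTER_LENGTH) {
  const patterns = [];
  let current = '';
  for (const domain of domains) {
    const escaped = escapeForRegex(domain);
    const candidate = current ? current + '|' + escaped : escaped;
    if (candidate.length > maxLength && current) {
      patterns.push(current); // current pattern is full, start a new one
      current = escaped;
    } else {
      current = candidate;
    }
  }
  if (current) patterns.push(current);
  return patterns;
}

// Return the real referrers that any pattern would wrongly exclude.
function findFalsePositives(patterns, realReferrers) {
  const regexes = patterns.map((pattern) => new RegExp(pattern, 'i'));
  return realReferrers.filter((ref) => regexes.some((re) => re.test(ref)));
}

// Made-up lists for illustration only.
const spamDomains = ['i-am-spam.com', 'badwebsite.com'];
const realReferrers = ['google.com', 'facebook.com', 'vickeryhill.com'];

const patterns = buildExcludePatterns(spamDomains);
console.log(patterns); // one pattern: i-am-spam\.com|badwebsite\.com
console.log(findFalsePositives(patterns, realReferrers)); // []
```

Each entry in `patterns` would become the pattern of one exclude filter. An empty false-positive list is what you want to see before any rule goes near your ‘Real Data’ view.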

Once you’ve tested in your favorite regular expression tester, it’s time to create your filters.
Make sure you have a ‘Testing’ view and a ‘Wide Open’ view (totally unfiltered) alongside your ‘Real Data’ view.
By having at least these three views you’ll be best placed to maintain the integrity of your real user data. Try rules in your ‘Testing’ view before adding them to your ‘Real Data’ view. Keep your ‘Wide Open’ view unfiltered at all times.
 

GOOGLE ANALYTICS FILTERS

GA Filter #1: Include Only My Domain (no ghost hosts)

Here you want to make sure that you are including every hostname your analytics script runs on. This can include third-party sites that have different domains than yours. It’s crucial that you include these in your RegEx rule.
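As a rough sketch, the include pattern for this filter could be generated and sanity-checked like this. The second hostname (shop.example.com) is a made-up stand-in for a third-party site, and the optional ‘www.’ prefix is my own assumption about how your hostnames appear in reports.

```javascript
// Hypothetical hostnames that legitimately run your tracking code.
const trackedHostnames = ['vickeryhill.com', 'shop.example.com'];

// Escape characters that have special meaning in a regular expression.
function escapeForRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Build one anchored pattern matching any tracked hostname,
// with or without a leading "www.".
function buildHostnameIncludePattern(hostnames) {
  const escaped = hostnames.map(escapeForRegex);
  return '^(www\\.)?(' + escaped.join('|') + ')$';
}

const includePattern = buildHostnameIncludePattern(trackedHostnames);
const includeRegex = new RegExp(includePattern, 'i');

console.log(includePattern);
console.log(includeRegex.test('www.vickeryhill.com')); // true
console.log(includeRegex.test('i-am-spam.com')); // false
```

The `^` and `$` anchors matter: without them, a hostname such as vickeryhill.com.some-other-site.net would slip through the include filter.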

GA Filter #2: Exclude Referral Spam Sources 1

Since GA only allows a rule to be 255 characters max, this is where you’ll add your first list. If your list is less than 255 characters including the RegEx syntax, you’re probably missing many.

GA Filter #3: Exclude Referral Spam Sources 2

This filter holds the remainder of whatever you couldn’t fit into your first exclusion rule. Repeat this until you have all your (known) spammers accounted for. I had 3 exclude filters with over 50 known spammers by the time I finished writing this. I’ll likely be on my 4th or 5th by the time many of you read this post.

GA Settings: Exclude all hits from known bots and spiders

In July of 2014, GA introduced this new setting. Essentially it does what its name says, BUT it doesn’t do it well enough (at least not yet). Don’t get me wrong, I’m grateful for the product and for access to a spammer list that could cost thousands a year, but the one-click solution to nixing spammers isn’t a total solution. And as you would with any other filter, you should test it in your ‘Testing’ view before rolling it out into your ‘Real Data’ view.

Don’t forget that you only have one chance to record your data properly. Once data is in, you’ll have no chance of fixing it if it’s lost or attributed to incorrect sources.


ADVANCED GHOST REFERRAL FILTERING USING A CONTROLLER CUSTOM DEFINITION AND GTM

Credit for this idea goes to a post by Sayf Sharif, who creates a cookie to validate the session. I’ve simplified the method a little so that it only requires a specific dataLayer variable to match in order for the data to track in your analytics account.
GTM dataLayer #1: Create a Custom Definition in your GA Account as a Controller
I used ‘GA_controller’ so it was obvious to the reader, but I would make this something more ambiguous. Not that a bot is likely reading and interpreting your dataLayer, but an ambiguous name makes it a little harder to interpret.
A sophisticated bot could scan your code and mimic this variable. In this case, we’re only really adding another layer of protection from ‘Dumb Ghost Referral Traffic’, as Sayf Sharif coined it.
 
GTM dataLayer #2: Create Rule in GA Requiring dataLayer Value for User Session to Track
 
This value of ‘controller-c5Kz5’ we’ll use again in step #5. Again, you should also make this value cryptic if you can. I added ‘controller’ to the name so you could follow clearly in this example process.
 
GTM dataLayer #3: Create Macro in GTM Matching Name of Custom Definition
 
 
GTM dataLayer #4: Add Custom Dimension via Macro to GTM Page View Tag
 
 
GTM dataLayer #5: Add Custom Dimension to dataLayer Above GTM Script
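For this step, the push might look something like the sketch below. It reuses the ‘GA_controller’ name and the ‘controller-c5Kz5’ value from the earlier steps. The `root` fallback is my own addition so the snippet also runs outside a browser. On a real page the two dataLayer lines would sit in a script tag above the GTM container script.

```javascript
// Use the browser's window when present; fall back to globalThis so this
// sketch also runs outside a browser (an assumption for illustration).
const root = typeof window !== 'undefined' ? window : globalThis;

// Create the dataLayer if it does not exist yet, then push the controller
// value. This must run before the GTM container script loads so the macro
// can read the value when the page view tag fires.
root.dataLayer = root.dataLayer || [];
root.dataLayer.push({ GA_controller: 'controller-c5Kz5' });

console.log(root.dataLayer);
```

Swap both the key and the value for something less obvious in production, as suggested in steps #1 and #2.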

GTM dataLayer #6: Apply Filter to Testing View, Roll into Main GA View When Ready, & Done

While this method is not foolproof, it does add an extra layer of protection against Ghost referrers who blindly attempt to use your Google Tag Manager ID or your Google Analytics ID.
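To make the logic of the controller filter concrete, here is a small JavaScript simulation. It is not how GA applies filters internally. The hit objects and their field names are hypothetical, and only the ‘controller-c5Kz5’ value comes from the steps above.

```javascript
// Value from step #2; treat it as a placeholder and pick your own.
const EXPECTED_CONTROLLER = 'controller-c5Kz5';

// Mimic an include filter: keep a hit only when its custom dimension
// exactly matches the expected controller value.
function keepHit(hit) {
  return hit.controller === EXPECTED_CONTROLLER;
}

// Hypothetical hits: one sent through your GTM container, two Ghost hits
// that never loaded your page and so never saw the dataLayer value.
const hits = [
  { source: 'google.com', controller: 'controller-c5Kz5' },
  { source: 'i-am-spam.com' },
  { source: 'i-am-spam.com', controller: '' },
];

const keptHits = hits.filter(keepHit);
console.log(keptHits.map((hit) => hit.source)); // [ 'google.com' ]
```

A Ghost that blindly guesses your account ID has no way of knowing the value, so its hits fall out. A bot that actually loads your page and reads the dataLayer would still get through.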


CUSTOM EMAIL & SMS ALERT: POTENTIAL SPAM REFERRAL THREAT

In this final step, we’re going to create a custom alert in your main GA views that sends you a text message and/or email when your referral traffic rises more than 30% over the previous week. For large websites that get lots of traffic this might not be applicable, but for small and medium size sites it could be a saving grace.

You may want to raise or lower this threshold based on your existing referral traffic patterns, but 30% is a good place to start. Either way, it will alert you to both the good and the bad sources sending a surge of visitors to your website.
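The arithmetic behind the alert is simple week-over-week percentage change. GA evaluates the condition for you once the alert is configured. The sketch below only shows the calculation, using made-up session counts and hypothetical function names.

```javascript
const ALERT_THRESHOLD_PERCENT = 30;

// Percentage change from the previous week's referral sessions to this week's.
function percentChange(previous, current) {
  if (previous === 0) return current > 0 ? Infinity : 0;
  return ((current - previous) / previous) * 100;
}

// True when referral sessions grew by more than the threshold.
function shouldAlert(previousWeek, currentWeek, threshold = ALERT_THRESHOLD_PERCENT) {
  return percentChange(previousWeek, currentWeek) > threshold;
}

// Made-up weekly referral session counts.
console.log(percentChange(200, 300)); // 50
console.log(shouldAlert(200, 300)); // true, a 50% rise
console.log(shouldAlert(200, 250)); // false, a 25% rise
```

A site that normally sees only a handful of referrals will cross 30% on ordinary noise, which is one reason to tune the threshold to your own traffic.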


SO WHAT ABOUT ALL THE OLD INFECTED DATA?

As I mentioned above, once the data is processed in GA it CANNOT be reprocessed. Fortunately, in this case the data is not attributed to incorrect sources, which is the most common problem. The issue here is that it’s not real user data, and therefore it can be easily filtered out. View our post about Custom Segmenting for Google Analytics to learn how, or get our latest segment here to apply and modify for your views.

Once the segment is applied to your reports, you can compare actual referral traffic to your spam data.


NEXT STEPS & THOUGHTS

Google needs to implement a better method of tracking by API token and hostname to exclude malicious data like this. I’ve been racking my brain for months and don’t have the answers. Plus I’m fairly certain that Google has a crack team developing the next phase of analytics security as I type. But one could speculate that if they used the same method as they do for their Maps API, some of this could be avoided. Until then we’ll keep updating our exclusion filters and potentially work on more complex methods of authenticating real user sessions using Custom Dimensions that do not have static values.

NEED HELP WITH YOUR ANALYTICS?

Give us a call or email us for a free 30-minute consultation. Seriously. We love this stuff!