websites/privacy-sensitive-analytics#11: Capture information about 404s



Issue Information

Issue Type: issue
Status: closed
Reported By: btasker

Milestone: V0.3
Created: 29-Mar-22 07:37



Description

Currently, we don't load the analytics agent on error pages.

That does mean, though, that we currently have no visibility of where there might be broken links (internal or external).

So, we should add a means to record (and report on) 404s (in particular).




Activity


At its simplest, we probably want to use the standard collection mechanism and just use a different state than normal (i.e. not PageView).

It would mean some of the graphing needs adjusting: some of the queries simply exclude video play events, so we'd need to swap those over to being inclusive rather than exclusive.

Equally, there's an argument for introducing a status_code tag - whilst we're only really focused on 404s in this ticket, it'd allow easy future expansion into collecting for other status codes.

We'd still want to review graphing queries, but that's likely to be true either way (and we probably do want some of the graphs to include 404s).

The bit that does concern me though, is the cardinality implications.

page is a tag, so its values are part of the series key. With that only containing valid page paths, there's a finite (even if high) limit to the cardinality it can cause. If we include page for 404s then we're essentially opening that up, because the path of a 404 is whatever the client chose to request.
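
To make the concern concrete, here's a minimal sketch of how distinct tag values turn into distinct series. The measurement name, domain and paths are made up for illustration, and the series key is simplified to the measurement plus its sorted tag set.

```javascript
// Build a simplified series key: measurement plus sorted tag pairs.
// Field values and timestamps do not contribute to the key.
function seriesKey(measurement, tags) {
  const tagPart = Object.keys(tags)
    .sort()
    .map((k) => `${k}=${tags[k]}`)
    .join(",");
  return tagPart ? `${measurement},${tagPart}` : measurement;
}

// Series cardinality is the number of distinct series keys.
function cardinality(points) {
  return new Set(points.map((p) => seriesKey(p.measurement, p.tags))).size;
}

// Hypothetical data: valid pages form a bounded set...
const validHits = [
  { measurement: "pageviews", tags: { domain: "example.com", page: "/" } },
  { measurement: "pageviews", tags: { domain: "example.com", page: "/about" } },
  { measurement: "pageviews", tags: { domain: "example.com", page: "/" } },
];

// ...whereas 404 paths are unbounded, each new one creating a new series.
const notFoundHits = ["/wp-login.php", "/.env", "/abuot", "/old/link"].map((page) => ({
  measurement: "pageviews",
  tags: { domain: "example.com", page },
}));

console.log(cardinality(validHits)); // 2
console.log(cardinality(validHits.concat(notFoundHits))); // 6
```

Three valid hits only produce two series, but every previously unseen 404 path adds another one.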

Conversely, there really isn't an awful lot of point in collecting 404s if we don't record the path they were for, as that prevents us from investigating and fixing links, implementing redirects etc.

What we might want to do then, is to record 404s under a different measurement - the downside of that is adding more complexity to the server side Lua.

There are advantages on either side.

If we mix the 404s in, then it's easier to run off a graph showing response statuses.

But, if we mix them in, we then have to update a bunch of graphs and add complexity to the downsampling script.

I think it'd be better to write into a separate measurement. Beyond counts, I can't see it being information that we'd want to keep long term, so I don't really want it mixed in with the standard downsample.
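
A rough sketch of what that routing could look like follows. The real collector is server side Lua, so this JavaScript only illustrates the idea: the report shape, the isError flag and the pageview measurement name are assumptions, whilst pf_analytics_test_404s, domain, page and response_time are the names the queries on this ticket use.

```javascript
// Illustrative only: not the actual server side implementation.
const PAGEVIEW_MEASUREMENT = "pf_analytics_test"; // assumed name
const NOT_FOUND_MEASUREMENT = "pf_analytics_test_404s"; // name used by this ticket's queries

// Escape a tag value for InfluxDB line protocol (commas, equals signs, spaces).
function escapeTag(value) {
  return String(value).replace(/([,= ])/g, "\\$1");
}

// Build one line of line protocol, choosing the measurement by report type.
// timestampNs is passed as a string because nanosecond epochs exceed
// JavaScript's safe integer range.
function toLineProtocol(report, timestampNs) {
  const measurement = report.isError ? NOT_FOUND_MEASUREMENT : PAGEVIEW_MEASUREMENT;
  const tags = `domain=${escapeTag(report.domain)},page=${escapeTag(report.page)}`;
  return `${measurement},${tags} response_time=${report.responseTime} ${timestampNs}`;
}

const line = toLineProtocol(
  { isError: true, domain: "example.com", page: "/old link", responseTime: 120 },
  "1648557347000000000"
);
console.log(line);
// pf_analytics_test_404s,domain=example.com,page=/old\ link response_time=120 1648557347000000000
```

Keeping the unbounded page values in their own measurement means the existing graphs and the standard downsample don't need to filter them back out.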


mentioned in commit 5f3b3dcb01fab03339407c5e30ff0a4f11b39973

Commit: 5f3b3dcb01fab03339407c5e30ff0a4f11b39973 
Author: B Tasker
Date: 2022-03-29T13:29:07.000+01:00 

Message

Add server side support for logging 404s for websites/privacy-sensitive-analytics#11

+115 -12 (127 lines changed)

mentioned in commit ac1620d25415e7a3a293f1f6ecb367cbf5acdf7f

Commit: ac1620d25415e7a3a293f1f6ecb367cbf5acdf7f 
Author: B Tasker
Date: 2022-03-29T13:35:21.000+01:00 

Message

Add agent support for reporting 404s for websites/privacy-sensitive-analytics#11

Error pages should set

window.is_error = true;

After the agent script has been embedded

+8 -1 (9 lines changed)

This can be enabled by including the following in the <head> of error pages

<script type="text/javascript" src="https://pfanalytics.bentasker.co.uk/agent.js"></script>
<script type="text/javascript">window.is_error = true;</script>
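
The flag can be set after the agent has been embedded because it only needs to be present when the report is built, not when the script is parsed. A minimal sketch of that idea, where everything other than window.is_error (the function name, the report shape and the type values) is hypothetical rather than the agent's actual code:

```javascript
// Hypothetical agent-side logic: only window.is_error comes from the ticket.
function buildReport(win) {
  return {
    domain: win.location.hostname,
    page: win.location.pathname,
    // Read at report time, so a later <script> tag can still set the flag.
    type: win.is_error === true ? "404" : "pageview",
  };
}

// Stand-in for a browser window so the sketch also runs under Node.
const fakeWindow = {
  location: { hostname: "example.com", pathname: "/missing" },
  is_error: true,
};
console.log(buildReport(fakeWindow).type); // 404
```

In a browser this would be called with the real window once the page has finished loading, for example from a load event handler.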

A count over time can be extracted with

from(bucket: "telegraf/autogen")
  |> range(start: v.timeRangeStart)
  |> filter(fn: (r) => r._measurement == "pf_analytics_test_404s")
  |> filter(fn: (r) => r._field == "response_time")
  |> group(columns: ["domain"])
  |> aggregateWindow(every: v.windowPeriod, fn: count)

The last 50 404s can be listed with

// Sort newest first so that limit() returns the most recent 50 entries
from(bucket: "telegraf/autogen")
  |> range(start: v.timeRangeStart)
  |> filter(fn: (r) => r._measurement == "pf_analytics_test_404s")
  |> filter(fn: (r) => r._field == "response_time")
  |> filter(fn: (r) => r.domain == v.domain)
  |> keep(columns: ["_time","domain","page"])
  |> group()
  |> sort(columns: ["_time"], desc: true)
  |> limit(n: 50)

mentioned in commit 8f08e25d19ff8c0c70b60cefbdf37c78ac541e3c

Commit: 8f08e25d19ff8c0c70b60cefbdf37c78ac541e3c 
Author: B Tasker
Date: 2022-03-29T16:27:52.000+01:00 

Message

Downsample 404 stats for websites/privacy-sensitive-analytics#11

+13 -0 (13 lines changed)

404 counts are now included in downsampling and reflected in the historic dashboard