Project: Websites / Privacy Sensitive Analytics

websites/privacy-sensitive-analytics#13: Downsample Search Information



Issue Information

Issue Type: issue
Status: closed
Reported By: btasker
Assigned To: btasker

Milestone: V0.3
Created: 29-Mar-22 07:48



Description

We implemented search term capture in #9, but the information isn't currently downsampled.

It probably isn't possible to downsample the search terms themselves, but we should at least record a count of searches over time.




Activity


assigned to @btasker

It probably isn't possible to downsample the search terms themselves

Actually, on reflection, I'd go further. It isn't really desirable to downsample the terms themselves - some users search for some horrible terms, and I don't really want those in a long-lived database.

What'd be nice is if we could find a good way to categorise those terms and record a count for each. Of course, that's easier said than done (at least in any kind of reliable manner), so it may be better left as an aspiration.

So, yeah, I think we probably want to capture the following (per domain):

  • Number of searches
  • Average number of results per search

We can pull the number of searches per domain with:

from(bucket: "telegraf/autogen")
  |> range(start: v.timeRangeStart)
  |> filter(fn: (r) => r._measurement == "pf_analytics_test_search_terms")
  |> filter(fn: (r) => r._field == "search_term")
  |> drop(columns: ["sess_id"])
  |> aggregateWindow(every: 15m, fn: count)
  |> map(fn: (r) => ({r with _field: "search_count", action: "search"}))
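
To run this on a schedule, the query above would typically be wrapped in an InfluxDB task that writes the counts into a longer-lived bucket. The task name, schedule, and destination bucket below are assumptions for illustration - they aren't taken from the actual config:

```flux
// Hypothetical downsampling task: the name, schedule and destination
// bucket are assumptions, not from the real deployment.
option task = {name: "downsample_search_counts", every: 15m}

from(bucket: "telegraf/autogen")
  |> range(start: -task.every)
  |> filter(fn: (r) => r._measurement == "pf_analytics_test_search_terms")
  |> filter(fn: (r) => r._field == "search_term")
  |> drop(columns: ["sess_id"])
  |> aggregateWindow(every: 15m, fn: count)
  |> map(fn: (r) => ({r with _field: "search_count", action: "search"}))
  |> to(bucket: "analytics_downsampled/autogen")  // assumed destination
```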


mentioned in commit 3bd0662599f124b73f0cd5c3cd1c50ac9c152ad6

Commit: 3bd0662599f124b73f0cd5c3cd1c50ac9c152ad6
Author: B Tasker
Date: 2022-03-29T15:54:24.000+01:00

Message

Downsample search terms into a count of searches over time for websites/privacy-sensitive-analytics#13

+20 -2 (22 lines changed)

Result counts are much simpler:

from(bucket: "telegraf/autogen")
  |> range(start: v.timeRangeStart)
  |> filter(fn: (r) => r._measurement == "pf_analytics_test_search_terms")
  |> filter(fn: (r) => r._field == "result_count")
  |> drop(columns: ["sess_id"])
  |> aggregateWindow(every: 15m, fn: mean)
  |> group()

mentioned in commit 124ee63effbe739afe0a4bcd39689cd3e24ef97b

Commit: 124ee63effbe739afe0a4bcd39689cd3e24ef97b
Author: B Tasker
Date: 2022-03-29T15:58:42.000+01:00

Message

Record average number of search results for websites/privacy-sensitive-analytics#13

+12 -0 (12 lines changed)

We now have search counts and average result counts - I don't think there's really anything else we want to keep long term.

I'll keep this open for now, on the off-chance I feel inspired to try some kind of categorisation, but I suspect I won't (if nothing else, the categories are likely to be quite site specific).

One option might be to "downsample" them into a dedicated DB whilst I try and decide what to do; that way, they can easily be dropped if I decide I don't want to do anything with them.
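
A minimal sketch of that approach: copy the raw search terms into a dedicated bucket as part of the downsampling run, so they can later be dropped wholesale. The time range here is illustrative, and the destination bucket name is taken from the implementation noted later in this issue:

```flux
// Copy raw search terms into a dedicated bucket while their long-term
// fate is undecided; the -15m window is an assumption.
from(bucket: "telegraf/autogen")
  |> range(start: -15m)
  |> filter(fn: (r) => r._measurement == "pf_analytics_test_search_terms")
  |> filter(fn: (r) => r._field == "search_term")
  |> to(bucket: "search_terms/autogen")
```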


mentioned in commit a570a6b783c43d03f81f51c6483329c6759557a2

Commit: a570a6b783c43d03f81f51c6483329c6759557a2
Author: B Tasker
Date: 2022-03-29T17:28:32.000+01:00

Message

Capture search terms into a separate bucket for websites/privacy-sensitive-analytics#13

This is so that we can preserve search information whilst trying to find a reliable way to categorise search inputs

+24 -0 (24 lines changed)

I've implemented capture of search terms into bucket search_terms/autogen.

There's a wiki page detailing how to list them.
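
The wiki page isn't reproduced here, but listing the captured terms is presumably a simple query along these lines. The measurement name is assumed to match the source data, and the `domain` tag is an assumption based on the per-domain goal above:

```flux
// List captured search terms from the dedicated bucket
// (time range, and the "domain" column, are assumptions).
from(bucket: "search_terms/autogen")
  |> range(start: -7d)
  |> filter(fn: (r) => r._measurement == "pf_analytics_test_search_terms")
  |> filter(fn: (r) => r._field == "search_term")
  |> keep(columns: ["_time", "_value", "domain"])
```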