Project: Websites / Privacy Sensitive Analytics

websites/privacy-sensitive-analytics#13: Downsample Search Information



Issue Information

Issue Type: issue
Status: closed
Reported By: btasker
Assigned To: btasker

Milestone: V0.3
Created: 29-Mar-22 07:48



Description

We implemented search term capture in #9, but the information isn't currently downsampled.

It probably isn't possible to downsample the search terms themselves, but we should at least record a count of searches over time.




Activity


assigned to @btasker

It probably isn't possible to downsample the search terms themselves

Actually, on reflection, I'd go further. It isn't really desirable to downsample the terms themselves - some users search for some horrible terms, and I don't really want those in a long-lived database.

What'd be nice is if we could find a good way to categorise those terms and record a count for each. Of course, that's easier said than done (at least in any kind of reliable manner), so it may be better left as an aspiration.

So, yeah, I think we probably want to capture the following (per domain):

  • Number of searches
  • Average number of results per search

We can pull the number of searches per domain with:

from(bucket: "telegraf/autogen")
  |> range(start: v.timeRangeStart)
  |> filter(fn: (r) => r._measurement == "pf_analytics_test_search_terms")
  |> filter(fn: (r) => r._field == "search_term")
  |> drop(columns: ["sess_id"])
  |> aggregateWindow(every: 15m, fn: count)
  |> map(fn: (r) => ({r with _field: "search_count", action: "search"}))
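
To run this on a schedule, the query above would typically be wrapped in an InfluxDB task that writes the counts into a longer-lived bucket. The task name, schedule, and destination bucket below are assumptions for illustration - they aren't taken from the actual config:

```flux
// Hypothetical downsampling task: the name, schedule and destination
// bucket are assumptions, not from the real deployment.
option task = {name: "downsample_search_counts", every: 15m}

from(bucket: "telegraf/autogen")
  |> range(start: -task.every)
  |> filter(fn: (r) => r._measurement == "pf_analytics_test_search_terms")
  |> filter(fn: (r) => r._field == "search_term")
  |> drop(columns: ["sess_id"])
  |> aggregateWindow(every: 15m, fn: count)
  |> map(fn: (r) => ({r with _field: "search_count", action: "search"}))
  |> to(bucket: "analytics_downsampled/autogen")  // assumed destination
```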


mentioned in commit 3bd0662599f124b73f0cd5c3cd1c50ac9c152ad6

Commit: 3bd0662599f124b73f0cd5c3cd1c50ac9c152ad6
Author: B Tasker
Date: 2022-03-29T15:54:24.000+01:00

Message

Downsample search terms into a count of searches over time for websites/privacy-sensitive-analytics#13

+20 -2 (22 lines changed)

Result counts are much simpler:

from(bucket: "telegraf/autogen")
  |> range(start: v.timeRangeStart)
  |> filter(fn: (r) => r._measurement == "pf_analytics_test_search_terms")
  |> filter(fn: (r) => r._field == "result_count")
  |> drop(columns: ["sess_id"])
  |> aggregateWindow(every: 15m, fn: mean)
  |> group()

mentioned in commit 124ee63effbe739afe0a4bcd39689cd3e24ef97b

Commit: 124ee63effbe739afe0a4bcd39689cd3e24ef97b
Author: B Tasker
Date: 2022-03-29T15:58:42.000+01:00

Message

Record average number of search results for websites/privacy-sensitive-analytics#13

+12 -0 (12 lines changed)

We now have search counts and average result counts - I don't think there's really anything else we want to keep long term.

I'll keep this open for now, on the off-chance I feel inspired to try some kind of categorisation, but I suspect I won't (if nothing else, the categories are likely to be quite site specific).

One option might be to "downsample" them into a dedicated DB whilst I try and decide what to do; that way, they can easily be dropped if I decide I don't want to do anything with them.
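
A minimal sketch of that approach: copy the raw search terms into a dedicated bucket as part of the downsampling run, so they can later be dropped wholesale. The time range here is illustrative, and the destination bucket name is taken from the implementation noted later in this issue:

```flux
// Copy raw search terms into a dedicated bucket while their long-term
// fate is undecided; the -15m window is an assumption.
from(bucket: "telegraf/autogen")
  |> range(start: -15m)
  |> filter(fn: (r) => r._measurement == "pf_analytics_test_search_terms")
  |> filter(fn: (r) => r._field == "search_term")
  |> to(bucket: "search_terms/autogen")
```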


mentioned in commit a570a6b783c43d03f81f51c6483329c6759557a2

Commit: a570a6b783c43d03f81f51c6483329c6759557a2
Author: B Tasker
Date: 2022-03-29T17:28:32.000+01:00

Message

Capture search terms into a separate bucket for websites/privacy-sensitive-analytics#13

This is so that we can preserve search information whilst trying to find a reliable way to categorise search inputs

+24 -0 (24 lines changed)

I've implemented capture of search terms into bucket search_terms/autogen.

There's a wiki page detailing how to list them.
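
The wiki page isn't reproduced here, but listing the captured terms is presumably a simple query along these lines. The measurement name is assumed to match the source data, and the `domain` tag is an assumption based on the per-domain goal above:

```flux
// List captured search terms from the dedicated bucket
// (time range, and the "domain" column, are assumptions).
from(bucket: "search_terms/autogen")
  |> range(start: -7d)
  |> filter(fn: (r) => r._measurement == "pf_analytics_test_search_terms")
  |> filter(fn: (r) => r._field == "search_term")
  |> keep(columns: ["_time", "_value", "domain"])
```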