
websites/privacy-sensitive-analytics#9: Search term capture



Issue Information

Issue Type: issue
Status: closed
Reported By: btasker
Assigned To: btasker

Milestone: 0.2
Created: 15-Mar-22 08:30



Description

I'd like to give the agent (and database) the ability to record terms entered into site search.

The agent shouldn't attempt to automatically bind to any fields; it should just provide a function which can be called to record the search term.

This will allow better targeting of content, as it'll help indicate what users are actively looking for.




Activity


assigned to @btasker

There are 2 different "modes" where this might be used.

At the top of www.bentasker.co.uk is a search form - this is actually just a form that submits to www.duckduckgo.com in order to do a site search. So, in this mode, we'd want submission to happen before the form submits properly (with appropriate care to make sure that exceptions do not break submission).

On snippets.bentasker.co.uk search is built into the site (though it all runs client-side), so rather than submitting search terms at time of form submission, it can be done when the results have loaded (potentially, in this mode we could even include a count of the number of results displayed).

I don't think we need anything overly complex in terms of schema, something along the lines of

pf_analytics_searches,domain=www.bentasker.co.uk search_term="analytics implementation"
pf_analytics_searches,domain=snippets.bentasker.co.uk search_term="LUA",num_results=5

Given it's a free text field, I don't think it makes sense to have the search term be a tag - it's unlikely there'll be much commonality between searches, so we'd just be looking at insane cardinality for no gain.
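As a rough illustration of that schema, the sketch below builds one point with the domain as a tag and the search term as a string field. It assumes the examples above are InfluxDB-style line protocol; the function names are hypothetical, and it's written in JavaScript purely for illustration (the real server side is LUA).

```javascript
// Escape a tag value: commas, equals signs and spaces need a backslash.
function escapeTag(value) {
    return String(value).replace(/([,= ])/g, '\\$1');
}

// Escape a string field value: double quotes and backslashes need a backslash.
function escapeStringField(value) {
    return String(value).replace(/(["\\])/g, '\\$1');
}

// Build one point for the pf_analytics_searches measurement.
// buildSearchPoint is a hypothetical helper, not part of the project.
function buildSearchPoint(domain, searchTerm, numResults) {
    // The free-text term is a field, not a tag, to keep cardinality down.
    var fields = ['search_term="' + escapeStringField(searchTerm) + '"'];
    if (Number.isInteger(numResults)) {
        // Written bare to match the example above. In InfluxDB line protocol
        // a bare number is a float; an integer field would need an "i" suffix.
        fields.push('num_results=' + numResults);
    }
    return 'pf_analytics_searches,domain=' + escapeTag(domain) + ' ' + fields.join(',');
}

console.log(buildSearchPoint('snippets.bentasker.co.uk', 'LUA', 5));
```

Escaping matters here precisely because the term is free text: a stray double quote in a search would otherwise produce an invalid point.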

This isn't a particularly small change - on the server side we'd need to bypass validate_write() (and implement validation specific to this) as well as switching the measurement depending on whether it's a search terms write or not.

I think the best way to go at this would be to submit search terms to a different endpoint - whilst the underlying LUA file would still be the same one, we can then switch based on a variable set in the nginx conf rather than having to parse the request path.

However, that has implications for the agent - it currently writes everything to /write. Changing that, though, should just be a case of updating calls to submit() (there should only be one) to include a path.
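A minimal sketch of what a path-aware submit() might look like. The real agent's signature isn't shown in this issue, so buildUrl, the send transport argument and BASE_URL are all assumptions (the host is the one used in the curl example later in this issue).

```javascript
// Assumed base URL for the analytics endpoint.
var BASE_URL = 'https://pfanalytics.bentasker.co.uk';

// Join base and path without doubling or dropping the slash between them.
function buildUrl(base, path) {
    return base.replace(/\/+$/, '') + '/' + String(path).replace(/^\/+/, '');
}

// Hypothetical submit(): the caller chooses the path rather than it being
// hardcoded to /write. "send" is an injected transport (in a browser it might
// wrap XMLHttpRequest or fetch), which keeps this sketch testable.
function submit(path, body, send) {
    return send(buildUrl(BASE_URL, path), body);
}

// Usage with a stand-in transport that just logs the request
submit('/write_search_term', 'domain=www.bentasker.co.uk', function (url, body) {
    console.log('POST ' + url + ' ' + body);
});
```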

It's not like there aren't going to need to be other changes to the agent anyway, so this sounds like the right route.

OK, we're going to have writes go into /write_search_term

Within the LUA, we'll switch behaviour based on the value of Nginx config var "$mode" - it should be one of write, write_search.
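The switch itself is simple enough to sketch. This is JavaScript for illustration only (the real handler is LUA), and dispatch plus the handler functions are hypothetical names; only the two mode values come from this issue.

```javascript
// Route a request based on the mode set in the Nginx config.
// Anything other than a known mode is rejected rather than guessed at.
function dispatch(mode, params, handlers) {
    if (!Object.prototype.hasOwnProperty.call(handlers, mode)) {
        return { status: 400, error: 'unknown mode: ' + mode };
    }
    return handlers[mode](params);
}

// Hypothetical handlers: "write" keeps the existing validate_write() route,
// "write_search" uses search-specific validation and its own measurement.
var handlers = {
    write: function (params) {
        return { status: 200, measurement: 'existing measurement' };
    },
    write_search: function (params) {
        return { status: 200, measurement: 'pf_analytics_searches' };
    }
};

console.log(dispatch('write_search', {}, handlers));
```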


mentioned in commit 09efd7416426fa807ac74216db34c2a98aa05cc1

Commit: 09efd7416426fa807ac74216db34c2a98aa05cc1
Author: B Tasker
Date: 2022-03-15T08:50:37.000+00:00

Message

Prepare Nginx config for accepting search term submissions (see websites/privacy-sensitive-analytics#9)

+35 -0 (35 lines changed)

mentioned in commit ed0bb2aebcd9e7882d4a7ee2b63080de8062e674

Commit: ed0bb2aebcd9e7882d4a7ee2b63080de8062e674
Author: B Tasker
Date: 2022-03-15T08:53:45.000+00:00

Message

Write search terms into a different measurement websites/privacy-sensitive-analytics#9

+1 -1 (2 lines changed)

mentioned in commit 581296ada3758f8649ce97a12f68472d4176e384

Commit: 581296ada3758f8649ce97a12f68472d4176e384
Author: B Tasker
Date: 2022-03-15T09:01:11.000+00:00

Message

Implement search term submission processing for websites/privacy-sensitive-analytics#9

+60 -3 (63 lines changed)

The POST body should contain

  • domain
  • search_term
  • result_count (integer, optional)
  • sess_id (optional)
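A sketch of the sort of validation those fields imply, written in JavaScript for illustration (the real check lives in the LUA). parseSearchSubmission and its return shape are hypothetical; only the field names and their optional/integer rules come from the list above.

```javascript
// Parse and validate a form-encoded search term submission.
function parseSearchSubmission(body) {
    var params = new URLSearchParams(body);
    var domain = params.get('domain');
    var term = params.get('search_term');

    // domain and search_term are the only required fields
    if (!domain || !term) {
        return { ok: false, error: 'domain and search_term are required' };
    }

    var out = { ok: true, domain: domain, search_term: term };

    // result_count is optional, but must be a whole number when present
    if (params.has('result_count')) {
        var raw = params.get('result_count');
        if (!/^\d+$/.test(raw)) {
            return { ok: false, error: 'result_count must be an integer' };
        }
        out.result_count = parseInt(raw, 10);
    }

    // sess_id is optional and passed through untouched
    if (params.has('sess_id')) {
        out.sess_id = params.get('sess_id');
    }
    return out;
}

console.log(parseSearchSubmission('domain=www.bentasker.co.uk&search_term=testing+analytics'));
```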

It should (if the LUA were live) now be possible to do

curl -X POST \
https://pfanalytics.bentasker.co.uk/write_search_term \
-d 'domain=www.bentasker.co.uk&search_term=testing+analytics'

mentioned in commit 9d7ea5fd30bfca4bba559138787427d4bab90f9c

Commit: 9d7ea5fd30bfca4bba559138787427d4bab90f9c
Author: B Tasker
Date: 2022-03-15T11:58:25.000+00:00

Message

Adjust agent so that the write path is provided when calling submit(). This is in prep for websites/privacy-sensitive-analytics#9

+6 -7 (13 lines changed)

mentioned in commit 60725650534f4ecb4ad1e99cd66072ed23a93079

Commit: 60725650534f4ecb4ad1e99cd66072ed23a93079
Author: B Tasker
Date: 2022-03-15T12:07:40.000+00:00

Message

Create function to record a search term and (optional) result count - for websites/privacy-sensitive-analytics#9

+15 -0 (15 lines changed)

Have implemented a function in the agent

recordSearchTerm(term, resultcount)

The param resultcount is optional, but if provided will be treated as an integer.
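The actual implementation is in the commit above; as a rough sketch of the payload side only, something like the following would produce a body matching the POST fields listed earlier. buildSearchPayload is a hypothetical helper name, and taking the domain as an argument is an assumption.

```javascript
// Build the form-encoded body for a search term submission.
function buildSearchPayload(domain, term, resultcount) {
    var params = new URLSearchParams();
    params.set('domain', domain);
    params.set('search_term', term);

    // resultcount is optional; when given, coerce it to an integer and
    // silently drop it if it isn't numeric
    if (resultcount !== undefined && resultcount !== null) {
        var n = parseInt(resultcount, 10);
        if (!isNaN(n)) {
            params.set('result_count', String(n));
        }
    }
    return params.toString();
}

console.log(buildSearchPayload('www.bentasker.co.uk', 'testing analytics'));
```

For the arguments shown, this yields the same body as the curl example.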

So, it should be possible to hook into it with something along the following lines (for www.bentasker.co.uk)

var eles = document.getElementsByTagName('form');
for (var i=0; i<eles.length; i++){
    eles[i].addEventListener('submit', function(){
        // A failure in the agent must never block the search itself
        try {
            console.log("yeehaw");
            recordSearchTerm(document.forms[0].q.value);
            return true;
        }
        catch(err){
            return true;
        }
    });
}

For snippets (and recipes) we'd need to update search.js to do something like

try {
    recordSearchTerm(getWin().srcht, counter);
} catch {}

at the end of writeResultsTable (counter would need adding and incrementing every time a result is written in).
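Roughly, the counter change would look like the sketch below. The real writeResultsTable in search.js isn't shown in this issue, so the signature here (with the row writer and recordSearchTerm passed in) is an assumption made to keep the example self-contained.

```javascript
// Hypothetical shape for writeResultsTable: write each result, count as we
// go, then report the term and count once everything has been written.
function writeResultsTable(term, results, writeRow, recordSearchTerm) {
    var counter = 0;
    results.forEach(function (result) {
        writeRow(result);
        counter++;   // incremented every time a result is written in
    });

    // Analytics must never break search, so swallow any failure
    try {
        recordSearchTerm(term, counter);
    } catch (err) {}

    return counter;
}

// Usage with stand-ins for the row writer and the agent function
writeResultsTable('LUA', ['a', 'b'], function (r) {}, function (term, count) {
    console.log(term + ' -> ' + count + ' results');
});
```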

Technically I could set the hooks live before a release is made - the try would prevent it breaking anything - but I'd still prefer not to.

The other end of this process, then, is the longer term storage of search terms - whatever gets entered will be written into a database with a 7 day retention policy.

Do we want longer storage than that? There isn't really a good way to downsample the data, so we'd be storing points at whatever regularity searches happen (which might include someone trying to fuzz).

What'd be nice - though we probably can't implement it yet - is if we could build something which would "downsample" some stuff into categories, a bit like I did when I did the analysis at https://www.bentasker.co.uk/posts/blog/general/695-an-analysis-of-search-terms-used-on-bentasker-co-uk.html
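To give a feel for what that kind of category "downsampling" could look like, here's a minimal sketch using simple keyword matching. The categories and keywords are made up for illustration, and this isn't the method used in the linked analysis.

```javascript
// Hypothetical category -> keyword map; real categories would need to be
// designed from the terms actually captured.
var CATEGORIES = {
    dns: ['dns', 'unbound', 'bind'],
    web: ['nginx', 'apache', 'http'],
    scripting: ['lua', 'python', 'bash']
};

// Return the first category with a keyword appearing in the term, else "other".
function categorise(term) {
    var t = String(term).toLowerCase();
    for (var name in CATEGORIES) {
        var hit = CATEGORIES[name].some(function (kw) {
            return t.indexOf(kw) !== -1;
        });
        if (hit) {
            return name;
        }
    }
    return 'other';
}

// Collapse a list of raw terms into per-category counts. Only the counts
// would be kept long term; the free-text terms themselves can expire.
function countByCategory(terms) {
    var counts = {};
    terms.forEach(function (term) {
        var c = categorise(term);
        counts[c] = (counts[c] || 0) + 1;
    });
    return counts;
}

console.log(countByCategory(['LUA', 'nginx lua', 'unbound', 'cheese']));
```

Counts per category have bounded cardinality, which is what would make them safe to keep beyond the 7 day retention window.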

I think the answer is probably to do nothing for now, and see what we actually end up capturing - we can then design an approach based on actual data.

In which case, let's do an initial deployment with an eye to doing a release.

With a little bit of playing around, the hook we want on www.bentasker.co.uk is

for (var i=0; i<document.forms.length; i++){
    document.forms[i].addEventListener('submit', function(e){
        e.preventDefault();
        try {
            recordSearchTerm(e.target[4].value);
            document.forms[0].q.value = e.target[4].value;
            setTimeout(function() {document.forms[0].submit()}, 500);
            return false;
        } catch {
            return e.target.submit();
        }
    });
}

mentioned in commit sysconfigs/domains.d@6be8fc07ea2a6a0967908fe35ff3bf877f50e9ed

Commit: sysconfigs/domains.d@6be8fc07ea2a6a0967908fe35ff3bf877f50e9ed
Author: root
Date: 2022-03-15T09:09:48.000+00:00

Message

Set live the new PFanalytics code for testing in websites/privacy-sensitive-analytics#9

+95 -5 (100 lines changed)

mentioned in commit BEN@2230bd1d7cd1c59ec18c45c1f714c40e47a93067

Commit: BEN@2230bd1d7cd1c59ec18c45c1f714c40e47a93067
Author: root
Date: 2022-03-15T14:17:53.000+00:00

Message

+17 -0 (17 lines changed)

All looks good, so closing as Done and including in the next release.

mentioned in issue #13

mentioned in issue Gitlab-Issue-Listing-Script#41

mentioned in issue BEN#11