jira-projects/ADBLK#1: Replacement of lists



Issue Information

Issue Type: issue
Status: closed
Reported By: btasker
Assigned To: btasker

Milestone: v2
Created: 08-Jun-22 16:00



Description

I've been toying with the idea of discontinuing the original lists and instead releasing new simpler ones.

This issue is being raised to lay out the options and track the decision.




Activity


assigned to @btasker

Currently, I provide quite a range of options at https://www.bentasker.co.uk/adblock/:

  • Full Autolist (Unbound format)
  • Manually blocked zones / Unbound format
  • ABP/Ublock compatible list of blocked domains/zones
  • add Manual Blocks to ABP
  • ABP/Ublock compatible list of blocked domains/zones without Social Media tracker domains
  • add Manual Blocks (no SM) to ABP
  • Modified version of EasyList (ABP/Ublock compatible)
  • add Modified Easylist Blocks to ABP
  • Modified version of EasyList (ABP/Ublock compatible) without Social Media tracker domains
  • add Modified Easylist Blocks (no SM) to ABP
  • ABP compatible list of Social Media tracker domains
  • Add Social Media Trackers to ABP
  • Pi-Hole compatible blocklist
  • Pi-Hole compatible blocklist with Social Media tracking domains

(There are some Greasemonkey scripts too, but I'll ignore those as they're largely static).


Workflow

These lists are refreshed, and often compiled, by a fairly clunky workchain, initially triggered by update_addomains.sh.

The workflow was quickly hacked together quite some time ago, and hasn't really had the attention it needs to be less crap.

It has had some minor improvements, like the amendment to allow blocks to be broken out into dedicated files (allowing categorisation of domains), but as workflows go it's still pretty shoddy.


Delivery

My adblock lists are delivered via www.bentasker.co.uk using my standard CDN resources. This is unusual, in that most adblock lists tend to be delivered straight from Github.

The problem with delivering via www.bentasker.co.uk is that the cache needs to be invalidated whenever an update is made to the lists. This makes automation hard (as the server either needs creds to interact with the CDN, or you have to have quite short TTLs).


Management

Blocks are managed via a number of config files, and then compiled into the published files.

But, there's some legacy cruft from the old management approach, so lines may not always be consistently added in the correct place.

This could probably be addressed independently if needed, but worth including in the context of wider changes.

Option 1: Discontinue Lists entirely

Had it been just me using the lists, then it's possible that I might have considered this option (and instead locally hosted something for pihole to consume).

However, looking at my access log, there are thousands upon thousands of requests a month for the adblock lists.

Whilst I don't want to continue the lists in their current form, it feels like offering an alternative would be the decent thing to do.

Option 2: Discontinue (most) automation

This is currently my preferred option.

In this option, we'd spin out a new set of adblock lists, but with certain elements of the automation removed.

In particular, we'd no longer retrieve and rewrite easylist lists (so the config files easylist_append_lines.txt, easylist_strip.txt and easylist_strip_absolute.txt would be deprecated and removed).

Configuration would be revised, but wouldn't be that dissimilar to now.

However, the project would no longer be reliant on a cron script somewhere - individual lists would be compiled, on commit, by a git hook.

This change in build process means that external lists (such as the cname-trackers list) shouldn't be included. Whilst they could trivially be pulled in during a hook run, it doesn't make logical sense to present a "complete" list knowing that it'll only be updated when I add something to my own lists - inclusion of third party scripts needs to have a regular refresh cadence, which I don't want to commit to in this project.

Delivery of the lists would be via Github rather than my CDN.

The idea here is that I should be able to more quickly publish updates to lists without changes in my own infra necessarily impacting it.

The existing adblock files would be left where they are - where possible redirects would be added to the new version, but only where the new version is directly compatible with what the user-agent thinks they're requesting (i.e. if they're requesting something with no social media domains in it, we can't direct to a generic list).

Option 3: Do Nothing

The final option is to carry on as we are now - do nothing.

But, it's likely to impact the cadence of updates: I've had a few infra changes this year, and the cron to retrieve and publish updates needs revising (in part due to my move to using nikola for my main site).

My intention is to look at moving to Option 2.

Ideally, I'll continue to track changes in this project, and given the option I'd like commits to be made into this project too (potentially with some sort of dual-remote setup in order to publish into Github).

The logical approach would be to fork the existing repo to make the necessary changes so that (where possible) a record of when domains were added (and why) is retained in the commit history.


mentioned in commit 81eeba99bd0de0ac00ed36edd9f3cd8f88f91195

Commit: 81eeba99bd0de0ac00ed36edd9f3cd8f88f91195 
Author: B Tasker                            
                            
Date: 2022-06-08T17:31:02.000+01:00 

Message

Create hook to update hooks after git-pull in preparation for jira-projects/ADBLK#1

+4 -0 (4 lines changed)

mentioned in commit 9900a5f2641dd6d4e8ab7371c42b5075025dfbb1

Commit: 9900a5f2641dd6d4e8ab7371c42b5075025dfbb1 
Author: B Tasker                            
                            
Date: 2022-06-08T17:33:18.000+01:00 

Message

Add a post-commit hook in support of jira-projects/ADBLK#1

This will ensure that when a commit is made, it's pushed to both Gitlab and Github.

This may be refined later to only trigger when on the master branch - so that complex changes can be performed in a branch and only pushed once they're merged

+13 -0 (13 lines changed)
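
As a rough sketch (not the hook from the commit itself), a post-commit hook that pushes to two remotes, with the master-only refinement mentioned above, might look like the following. The remote names gitlab and github are assumptions for illustration and would need to match whatever `git remote -v` reports.

```shell
#!/bin/bash
# Sketch of .git/hooks/post-commit: push each commit to two remotes.
# The remote names "gitlab" and "github" are assumptions.

push_to_remotes() {
    local branch remote
    branch=$(git rev-parse --abbrev-ref HEAD) || return 1

    # Only publish from master, so work in other branches stays local
    if [ "$branch" != "master" ]; then
        return 0
    fi

    for remote in gitlab github; do
        git push "$remote" "$branch" || echo "push to $remote failed" >&2
    done
}

# Only run automatically when installed as the post-commit hook
if [ "${0##*/}" = "post-commit" ]; then
    push_to_remotes
fi
```

A failed push is reported but doesn't abort the loop, so one remote being unavailable doesn't stop the other from being updated.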

mentioned in commit 13b3199017936d9c237bf5aefd67ed0259928930

Commit: 13b3199017936d9c237bf5aefd67ed0259928930 
Author: B Tasker                            
                            
Date: 2022-06-08T17:37:18.000+01:00 

Message

Remove files that are defunct under jira-projects/ADBLK#1

These files will continue to be hosted at https://www.bentasker.co.uk/adblock/ but are not part of v2 of this project

+0 -446 (446 lines changed)

OK, the easy bit is done - the next stage is to look at writing a pre-commit hook that can build lists for us.

We first need to define what lists we want to continue to provide. To a certain extent, that's going to be defined by which are actually in use.

  • The unbound compatible autolist.txt gets a few hundred requests a month, should probably port that over
  • adblock_compiled.txt (ABP/Ublock compatible list of blocked domains/zones)
  • blockeddomains.txt Pihole compatible list
  • manualzones.txt: Used for regex blocks in Pihole

Not being ported:

  • zoneblocks.unbound.txt: Unbound format version of manualzones.txt
  • adblock_compiled_no_sm.txt
  • easylist_modified.txt - Modified easylist support is being deprecated
  • easylist_modified_no_sm.txt - Modified easylist support is being deprecated
  • social_media_trackers.txt - very limited use

It doesn't look like there's any particular current interest in lists that identify social media trackers separately (or those that exclude them).

So, V2 will generate a much simpler subset of lists, which can be summarised as:

  • Unbound format list of domains
  • ABP/Ublock compatible list of domains
  • Pihole compatible list of domains
  • List of zones blocked for use in Pihole regexes

mentioned in commit af75c5b9128769922004031e09c89f4a9e040621

Commit: af75c5b9128769922004031e09c89f4a9e040621 
Author: B Tasker                            
                            
Date: 2022-06-08T18:03:49.000+01:00 

Message

Start creating new list building script for jira-projects/ADBLK#1

This'll eventually be triggered as part of a hook.

Although I'm tweaking a little as I go, initially it probably won't be much less clunky than the original as I'm using that as the basis

+68 -0 (68 lines changed)

mentioned in commit 58202238c9736382c38d1f32c73b0069b8c56ddb

Commit: 58202238c9736382c38d1f32c73b0069b8c56ddb 
Author: B Tasker                            
                            
Date: 2022-06-08T18:14:50.000+01:00 

Message

Add support for list of zones for jira-projects/ADBLK#1

This is missing some functionality from the original, should probably add that later

+28 -2 (30 lines changed)

mentioned in commit 60feb062857d8ab6069dce2029a0d899b4d4e700

Commit: 60feb062857d8ab6069dce2029a0d899b4d4e700 
Author: B Tasker                            
                            
Date: 2022-06-08T18:20:37.000+01:00 

Message

Add AdblockPlus compatability for jira-projects/ADBLK#1

This generates a file compatible with ABP and UBlock Origin

+34 -0 (34 lines changed)

We now have our 4 formats - the script is much simplified compared to its predecessor (largely because we're not having to mess about with modifying easylist).
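
To illustrate how one list of domains can be rendered in several formats, here's a minimal sketch. It is not the project's build script: the function names are hypothetical, the Unbound output follows the local-zone/local-data form used elsewhere in this ticket, and the ABP output assumes the usual `||domain^` filter syntax.

```shell
#!/bin/bash
# Sketch: render a list of domains (one per line on stdin) in each format.
# Function names are hypothetical.

# Unbound: a redirect zone plus an A record pointing at localhost
to_unbound() {
    local domain
    while IFS= read -r domain; do
        [ -n "$domain" ] || continue
        printf 'local-zone: "%s" redirect\n' "$domain"
        printf 'local-data: "%s A 127.0.0.1"\n' "$domain"
    done
}

# ABP/uBlock: wrap each domain as ||domain^
to_abp() {
    sed -e '/^$/d' -e 's/^/||/' -e 's/$/^/'
}

# Pi-hole: plain list of domains with blank lines dropped
to_pihole() {
    sed -e '/^$/d'
}
```

Each function reads stdin and writes stdout, so the same source file can be fed through all of them in turn (for example `to_abp < domains.txt`, where domains.txt is a placeholder name).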


mentioned in commit a2c9cfc1e38e24a85446360c08290f90b1706838

Commit: a2c9cfc1e38e24a85446360c08290f90b1706838 
Author: B Tasker                            
                            
Date: 2022-06-08T18:22:37.000+01:00 

Message

As of jira-projects/ADBLK#1 social media trackers will no longer be accounted for in seperate blocklists.

Move into the general block config

+0 -0 (0 lines changed)

mentioned in commit eefcd76f6abad9f6c0503991c6e1ca6401296bbc

Commit: eefcd76f6abad9f6c0503991c6e1ca6401296bbc 
Author: B Tasker                            
                            
Date: 2022-06-08T18:23:16.000+01:00 

Message

The easylist overrides are defunct as of jira-projects/ADBLK#1

Remove them

+0 -13 (13 lines changed)

mentioned in commit 5ea126941f5ea5c8c3507e592c3b71b34e95405a

Commit: 5ea126941f5ea5c8c3507e592c3b71b34e95405a 
Author: B Tasker                            
                            
Date: 2022-06-08T18:34:07.000+01:00 

Message

Update script to install/publish the generated lists for jira-projects/ADBLK#1

+40 -35 (75 lines changed)

mentioned in commit cae4954b315ff09e295996158c63b41201308d4c

Commit: cae4954b315ff09e295996158c63b41201308d4c 
Author: B Tasker                            
                            
Date: 2022-06-08T19:36:59.000+01:00 

Message

Have commit hooks rebuild and publish the lists for jira-projects/ADBLK#1

The process is a little contrived: pre-commit will let you stage files, but not add them into the current commit.

So, instead, we write to a lockfile which post-commit checks for. If present, it'll rebuild the lists and amend the commit to include them.

The lockfile is used because otherwise the commit --amend will re-trigger post-commit giving an infinite loop

+25 -2 (27 lines changed)
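
The lockfile arrangement described in that commit message can be sketched roughly as below. This is an illustration of the idea rather than the committed hooks: LOCKFILE, build_lists and build_lists.sh are hypothetical names, and the use of --no-verify on the amend (so that pre-commit doesn't recreate the lockfile) is an assumption.

```shell
#!/bin/bash
# Sketch of the pre-commit / post-commit pairing.
# LOCKFILE, build_lists and build_lists.sh are hypothetical names.

LOCKFILE="${LOCKFILE:-.git/rebuild.lock}"

# Stand-in for whatever script compiles the published lists
build_lists() {
    ./build_lists.sh
}

# pre-commit: can't add files to the commit, so just leave a marker
pre_commit() {
    touch "$LOCKFILE"
}

# post-commit: rebuild and amend, but only if pre-commit left the marker
post_commit() {
    [ -f "$LOCKFILE" ] || return 0

    # Remove the marker first: the amend below fires post-commit again,
    # and without this the hooks would loop forever
    rm -f "$LOCKFILE"

    build_lists || return 1
    git add lists/
    # --no-verify skips pre-commit, so the marker isn't recreated
    git commit --amend --no-edit --no-verify
}

# Dispatch on the name the script was invoked as
case "${0##*/}" in
    pre-commit)  pre_commit ;;
    post-commit) post_commit ;;
esac
```

The same file could be symlinked as both .git/hooks/pre-commit and .git/hooks/post-commit, with the case statement picking the right behaviour.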

The list directory in the repository contains more or less a single adblock list, published in a number of different formats.

The list of blocked zones can be used with a parser to generate regexes to feed into PiHole.
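
As an example of such a parser, the sketch below turns each zone into a regex that matches the zone itself and any subdomain of it. The `(^|\.)zone$` pattern shape and the function name are assumptions for illustration, not something defined by this project.

```shell
#!/bin/bash
# Sketch: turn a list of zones (one per line) into Pi-hole style regexes.
# The function name and pattern shape are assumptions.

zones_to_regex() {
    # Drop blank lines and comments, escape the dots, then anchor the
    # pattern so it matches the zone itself and any subdomain of it
    grep -v -e '^[[:space:]]*$' -e '^#' "$1" |
        sed -e 's/\./\\./g' -e 's/^/(^|\\.)/' -e 's/$/$/'
}
```

Something like `zones_to_regex lists/zones.txt` would then print one regex per zone, ready to be added to Pi-hole's regex filters.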

Still TODO

  • Update original page/README
  • Site news post?
  • Add redirects for old lists to new

Need to think carefully before doing any redirects, as we don't want to accidentally remove people's existing protection.

For example, because the original rewrites easylist (and pulls in a list of miner domains and cname trackers), there are quite a few entries in blockeddomains.txt

$ wc -l files/adblock/blockeddomains.txt 
30664 files/adblock/blockeddomains.txt

Whereas the repo version has far fewer:

$ wc -l lists/blockeddomains.txt 
755 lists/blockeddomains.txt

So, it'd be unwise to redirect blockeddomains.txt away, as it'd have the effect of unexpectedly removing ~30K domains from people's blocklists.

The same logic applies to the ABP format (5676 vs 1019).

The unbound format (previously autolist.txt) is less clear - the previous version had 394, the current has 390. Should look into why that is.

Similarly, the list of blocked zones is close (195 vs 197). Again, we should look into why.

The list of regex blocks can be redirected - they're identical.

Looking into the difference in the unbound format (394 lines previously, 390 now), the missing lines are:

local-zone: "pecult.com" redirect
local-data: "pecult.com A 127.0.0.1"
local-zone: "vibuin.com" redirect
local-data: "vibuin.com A 127.0.0.1"
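
A comparison like this can be produced by printing the lines that exist in the old file but not in the new one. A minimal sketch follows; the function name is hypothetical and the paths in the usage comment are assumptions.

```shell
#!/bin/bash
# Sketch: print lines that appear in the old list but not in the new one.

missing_lines() {
    local old="$1" new="$2"
    # -F fixed strings, -x whole-line match, -v invert, -f patterns from file
    grep -vxF -f "$new" -- "$old"
}

# Example (both paths are assumptions):
# missing_lines files/adblock/autolist.txt lists/unbound.txt
```

Swapping the two arguments shows lines that are only in the new file, which is useful when the new list is the longer one.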

It looks like the build script is skipping them because they'll be blocked as a zone, but then the zone block never makes it into the file...

That happens because the check runs

egrep -v -e "^${domain#*.}|^$domain" $blocked_zones

Which returns true.

There's no good reason for the inversion (-v) there, so I'm removing that.

It returns true because the statement evaluates to

$ egrep -e "^com|^pecult.com" lists/zones.txt 
commoncannon.com

The initial check was intended to strip a subdomain off (so that a block for foo.bar.com wouldn't collide with a zone block for bar.com). Obviously, that doesn't work well if there is no subdomain in the name.

It's not trivially addressable either: we could check for a depth of 2, but foo.co.uk would fail that check and still return the same result.
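
One possible way around both problems is to compare whole names rather than prefixes: test the name itself, then each parent in turn, against the zone list using exact line matches. The sketch below only illustrates that idea and is not necessarily the fix that was eventually applied; is_zone_blocked is a hypothetical helper.

```shell
#!/bin/bash
# Sketch: is this name, or any parent of it, listed in the zones file?
# is_zone_blocked is a hypothetical helper.

is_zone_blocked() {
    local name="$1" zones_file="$2"
    while :; do
        # -x whole line, -F fixed string: "com" no longer matches
        # "commoncannon.com", and dots aren't treated as wildcards
        if grep -qxF -- "$name" "$zones_file"; then
            return 0
        fi
        case "$name" in
            *.*) name="${name#*.}" ;;   # strip the leftmost label and retry
            *)   return 1 ;;            # no labels left to strip
        esac
    done
}
```

With this approach, pecult.com is only treated as covered if pecult.com (or com) is itself a line in the zones file, and foo.co.uk is handled the same way as any other name without needing a depth check.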

I'll spin out a separate ticket to track that one (#2).

mentioned in issue #2

#2 is resolved. It also accounts for (and has corrected) the difference between blockedzones.txt and zones.txt

The new lists now have the same content as their predecessors, except for the third-party entries pulled in. So, I think we're good in that respect.

It's occurred to me that implementing any redirects is probably unwise. A number of my scripts do something like

curl -s "https://www.bentasker.co.uk/adblock/regex_blocks.txt"

Those would be broken by a redirect (because curl won't follow it by default). I can fix my own scripts, but that'd still leave the risk of breaking other people's deployments.

So, no redirects to implement.

Writeup published at https://www.bentasker.co.uk/posts/blog/general/replacing-my-ad-block-lists-with-a-newer-version.html.

I've disabled all automation around version 1 - I think we're done.