FKAMP-5: See if we can find a way to handle AMP on Google News



Issue Information

Issue Type: Improvement
 
Priority: Major
Status: Closed

Reported By:
Ben Tasker
Assigned To:
Ben Tasker
Project: Anti-AMP Scripts (FKAMP)
Resolution: Done (2019-06-11 18:18:55)
Target version: v1.4.21,

Created: 2019-06-11 17:40:06
Time Spent Working


Description
FKAMP-4 ultimately implemented a script to redirect Google News to Bing news as a workaround for Google having made Google News extremely hostile.

It's not a great long-term solution though, so we really need to look at getting AMP detection working on Google News.


Issue Links


Activity


So, looking at this some more (as I don't think redirects to alternate services are a good long-term solution), I've noticed a couple of things:

- If you force a link to open in a new tab, Google then redirects you to the proper page
- Just like in the search results (FKAMP-2) the actual AMP content is served in an iframe directed at ampproject.org

So, there are potentially two options here.

- We could iterate over all links on a page and add target=_blank to them (remembering to also add rel="noopener noreferrer" so that the new page doesn't have access to the Google search tab via window.opener)
- Or, as was originally implemented as a possible fix for FKAMP-2, we could search for the iframe and then try and trigger a redirect if that's present
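As a rough illustration of the first option, a minimal sketch might look like the following. The function name `openLinksInNewTab` is hypothetical and is not taken from the RemoveAMP scripts; it accepts any list of anchor-like objects so that it can also be exercised outside a browser.

```javascript
// Hypothetical helper (not from the RemoveAMP repo): force every link to open
// in a new tab, without giving the new page access to window.opener.
function openLinksInNewTab(links) {
    let changed = 0;
    for (const link of Array.from(links)) {
        link.target = "_blank";
        link.rel = "noopener noreferrer";
        changed += 1;
    }
    return changed;
}

// In a user script this would be run against the live DOM.
if (typeof document !== "undefined") {
    openLinksInNewTab(document.getElementsByTagName("a"));
}
```

Note that this only covers links present when the script runs; links injected later by the page's own JavaScript would need the function to be called again.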

The first feels a bit messier, but the second has a couple of additional drawbacks. Firstly, it means a request still has to go out to foo-bar-sed.cdn.ampproject.org, so there's additional latency. Secondly, if *.cdn.ampproject.org has been blocked in the user's browser, we'll never get the info back.

For some reason, even with all my scripts turned off, the iframe removes itself after a few seconds if I open developer tools. Anyway, the outer HTML for the iframe is:
<iframe allowfullscreen="true" allow="autoplay" class="YQnTXe" src="https://www-bbc-co-uk.cdn.ampproject.org/v/s/www.bbc.co.uk/news/amp/uk-politics-48598760?amp_js_v=0.1#origin=https%3A%2F%2Fnews.google.com&amp;prerenderSize=1&amp;visibilityState=visible&amp;paddingTop=0&amp;history=0&amp;p2r=0&amp;horizontalScrolling=0&amp;storage=1&amp;development=0&amp;log=0&amp;cap=cid&amp;csi=0&amp;cid=1" width="100%" height="100%"></iframe>


So, what we may want to do is check for iframe elements and, for any that are found, check whether their src contains cdn.ampproject. If it does, rewrite the window location to be that value (so the normal triggers can fire).

That's not perfect, but should work in principle
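The check described above could be sketched roughly as follows. This is an illustration only, not the code that was committed: the function name `findAmpIframeSrc` is an assumption, and the lookup is kept separate from the redirect so that it can be tried against plain objects as well as real iframe elements.

```javascript
// Hypothetical helper: return the src of the first iframe pointing at the
// AMP project CDN, or null if there isn't one.
function findAmpIframeSrc(iframes) {
    for (const frame of Array.from(iframes)) {
        const src = frame.src || "";
        if (src.includes("cdn.ampproject")) {
            return src;
        }
    }
    return null;
}

// In a user script: send the tab to the AMP URL so the normal triggers can fire.
if (typeof document !== "undefined") {
    const ampSrc = findAmpIframeSrc(document.getElementsByTagName("iframe"));
    if (ampSrc !== null) {
        window.location.href = ampSrc;
    }
}
```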

Repo: RemoveAMP
Commit: c104941e180d239ee9cfa53b250dd67f3a6dbd12
Author: B Tasker <github@<Domain Hidden>>

Date: Tue Jun 11 17:41:07 2019 +0100
Commit Message: FKAMP-5 Update the Googlesearch hook to also work (sort of) with Google News

Introduces logic into AMPCheck to look for iframe's referencing the AMP project CDN. If found, it updates the page to point to that URL so that the normal anti-AMP scripts can fire.

The downside of this is it means there are a couple of page loads before you eventually land on the full-fat page, so there's definitely some room for improvement



Modified (-)(+)
-------
greasemonkey_hook_googlesearch.user.js




Webhook User-Agent

GitHub-Hookshot/d408d22



The downside of the implementation in c104941 is that we end up with several page loads:

- Google News page (technically rewritten with JavaScript)
- AMP CDN page
- (optional) Publisher's own AMP page
- Proper HTML page

That's very far from ideal, especially given the reasons noted above for why I don't really want requests to have to go out to cdn.ampproject.org at all.

It's a pity it isn't as simple to work around as the search results, but only the AMP paths are written into the markup sent back from their servers.

It seems my earlier result was wrong: opening in a new tab isn't enough to force the page to redirect you to a proper result.
Looking at the URLs in FKAMP-2, along with clicking around Google News, it does look like the URL structure for the AMP Project is fairly consistent:

- https://www-theregister-co-uk.cdn.ampproject.org/v/s/www.theregister.co.uk/AMP/2017/05/19/open_source_insider_google_amp_bad_bad_bad/
- https://www-bbc-co-uk.cdn.ampproject.org/v/s/www.bbc.co.uk/news/amp/uk-politics-48598760

We can boil that down to
https://{domain_name with '.' replaced by '-'}.cdn.ampproject.org/v/s/{domain_name}/{page path}


So, we could skip the hop via the AMP cdn by parsing the relevant sections out of the URL. It'll break if they change their URL structure, but we'll burn that bridge when we come to it
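To make the idea concrete, here is a minimal sketch of parsing the publisher's URL out of an AMP CDN URL, based only on the two example URLs above. The function name `ampCdnToPublisherUrl` is hypothetical, and the sketch assumes the publisher is reachable over HTTPS and that the `/v/s/` prefix is always present; it is not the code from the repository.

```javascript
// Hypothetical helper: map an AMP CDN URL of the form
//   https://<host-with-dashes>.cdn.ampproject.org/v/s/<host>/<path>
// back to https://<host>/<path>. Returns null if the URL doesn't match.
function ampCdnToPublisherUrl(ampUrl) {
    const pattern = /^https:\/\/[^\/]+\.cdn\.ampproject\.org\/v\/s\/([^\/?#]+)(\/[^?#]*)?/;
    const match = pattern.exec(ampUrl);
    if (match === null) {
        return null;
    }
    const host = match[1];
    const path = match[2] || "/";
    // Everything from the first "?" or "#" onwards is dropped in this sketch.
    return "https://" + host + path;
}
```

Returning null for anything that doesn't fit the pattern means a caller can fall back to the slower route via the AMP CDN if the URL structure ever changes.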

Repo: RemoveAMP
Commit: 791707a121b5c66b1a354e51e7749057bd82355c
Author: B Tasker <github@<Domain Hidden>>

Date: Tue Jun 11 18:03:12 2019 +0100
Commit Message: FKAMP-5 Remove need to go to ampproject CDN before being redirected onto the original publisher

This removes one hop from the redirect chain, and subsequent ones are much faster as you tend to speak to the same domain name for a publishers copy of the AMP as you would for the real page, so DNS is already done and there's a connection open already.

This change means that the Anti-AMP functionality still works on Google News with cdn.ampproject.org blocked in my adblocker



Modified (-)(+)
-------
greasemonkey_hook_googlesearch.user.js





Although this works, it doesn't account for situations where the origin page requires a query-string.

But, clicking through Google News, I've not been able to locate any examples of that to see how the AMP Project encodes it into their URLs.

Google's documentation though - https://developers.google.com/amp/cache/overview#query-parameter-example - seems to specify that it'll be passed through as part of the query string. Problem is, if we include the original (taken from the iframe) there's all sorts of gumph in there, some of which we may specifically not want to send to the origin server. Let's break the QS down:
amp_js_v=0.1#origin=https%3A%2F%2Fnews.google.com&amp;prerenderSize=1&amp;visibilityState=visible&amp;paddingTop=0&amp;history=0&amp;p2r=0&amp;horizontalScrolling=0&amp;storage=1&amp;development=0&amp;log=0&amp;cap=cid&amp;csi=0&amp;cid=1


Ah, actually, that's not so bad.

Looks like all the AMP-specific stuff is pushed into the URL fragment (so is never seen by the AMP CDN, and must just be handled in JS). So, we could just split at the fragment rather than at the start of the query string, leaving the original query string in place.
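A minimal sketch of splitting at the fragment might look like this. As before, the function name `ampCdnToPublisherUrlKeepQuery` is hypothetical, the HTTPS scheme is an assumption, and this is not the committed code.

```javascript
// Hypothetical helper: like the earlier URL parsing, but only the fragment
// (everything from the first "#") is discarded, so any query string that sits
// before it is passed on to the publisher.
function ampCdnToPublisherUrlKeepQuery(ampUrl) {
    const withoutFragment = ampUrl.split("#")[0];
    const marker = ".cdn.ampproject.org/v/s/";
    const idx = withoutFragment.indexOf(marker);
    if (idx === -1) {
        return null;
    }
    // What follows the marker is "<host>/<path>?<query>".
    return "https://" + withoutFragment.slice(idx + marker.length);
}
```

Anything that appears before the `#`, including `amp_js_v=0.1` in the iframe URL above, is kept as part of the query string by this sketch.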

Repo: RemoveAMP
Commit: 14f9a8e250e383c35146d56fe5fcbf08a590a1ab
Author: B Tasker <github@<Domain Hidden>>

Date: Tue Jun 11 18:13:43 2019 +0100
Commit Message: FKAMP-5 Split on the fragment rather than the start of the query string



Modified (-)(+)
-------
greasemonkey_hook_googlesearch.user.js





OK, as this is now working, I'm going to remove Google News from the redirect script, and then look at doing a release (so that I can bump version numbers in the scripts).
btasker changed status from 'Open' to 'Resolved'
btasker added 'Done' to resolution
btasker changed status from 'Resolved' to 'Closed'