JILS-41: Add support for downstream re-validation



Issue Information

Issue Type: New Feature
 
Priority: Major
Status: Closed

Reported By:
Ben Tasker
Assigned To:
Ben Tasker
Project: Jira Issue Listing Script (JILS)
Resolution: Done (2016-04-29 14:58:32)
Affects Version: 0.01b,
Target version: 0.01b,
Labels: Caching, Headers,

Created: 2016-04-20 12:23:49
Time Spent Working
Estimated:
  
90 minutes
Remaining:
 
0 minutes
Logged:
 
154 minutes


Description
The aim is to add support for a downstream client/cache to revalidate a copy of a page they've previously obtained.

As an initial step, should add the following headers to page responses

- Last-Modified
- E-Tag

(These two headers have recently been added to Issue pages and attachments/thumbs)

With the ultimate aim being to add support for conditional GETs.


Issue Links

Adding headers to Issue Page (Github)
Adding headers to Attachments and Thumbs (Github)
RFC 7232
Toggle State Changes

Activity


btasker changed timespent from '0 minutes' to '10 minutes'
A little bit of background to help shape requirements:

I currently have an install of Sphider (http://www.sphider.eu/) which crawls JILS (and other sites) periodically. Sphider's bot is, being generous, less than clever.

The way it performs revalidations is to fetch the entire page, generate an MD5 of the content and then compare that to a stored hash to see whether it needs to reprocess.

This is less than efficient as it means the origin has to serve the entire page either way, which really starts to matter when your JIRA database grows beyond a certain (arbitrary) size.

Before it places a GET, the bot places a GET in order to check for non-200 results. So I've patched it to also extract Last-Modified from the headers (may well add ETag later). This is now compared against a date stored in the database (no changes there, the date was already stored), and the GET is changed if the date remains the same.

So, for Sphider's purposes, as of commits 81c7055 and 5d22f2d revalidation is now possible on Issue pages and attachments (and thumbs, though it doesn't fetch those).

However - true revalidation still isn't possible. If (for an example) a caching NGinx reverse proxy were to be placed downstream, it'd periodically need to re-fetch the entire content, as it (correctly) uses conditional requests to revalidate the content rather than placing a HEAD first.

So, the final step in this issue is to add support for a more normal use-case - conditional GETS (i.e. If-Modified-Since etc).

As no page currently returns a cache-control header, also need to implement a configuration option to allow one to be set if desired (the alternative is forcing a caching age either at Apache or on the downstream proxy). The configuration option should allow a per-class setting to be used, so that different ages can be set for Issue pages, Attachments, Issue indexes (i.e. project home pages, versions, components etc).
The other element that needs considering is whether to add Last-Modified/ETag headers to Project, Version and Component indexes.

If so, what's the best way to do so? The idea of adding the headers is to make revalidation's cheap, so the SQL statement used to get the necessary data needs to be as simple as possible, otherwise it may well be cheaper to simply re-enumerate the issues for that page.

But, there's also the question of how smart to be. Which of the following questions do we ask?

- Has any issue linked to from this page changed in some way?
- Has any issue linked to from this page changed in such a way it'll have caused this page to change?

The difference being that the former is (largely) just a raw check of the jiraaction table. The second involves looking for certain events (change of Issue status/resolution, change of issue title, change of assignee, Change of priority, Change of Type, new issue creation etc).

The latter would give a more accurate result, but comes at a cost

- The SQL query will be more complex
- It ties the query to the current layout, making future changes to the layout more complex (will have to remember to update the query)

Going with the simpler route, though, means that the page will need to re-indexed whenever any change is made to an issue within a project - even if that "change" is simply that a comment has been added, or an additional watcher has been added (not even displayed on the issue page currently). But, it does avoid the risk of forgetting to update the query in the future and having new changes not be picked up when they should.
btasker changed timespent from '10 minutes' to '26 minutes'
I think the above needs some additional thought put into it, so for the time being will look at getting support for Conditional Requests dropped into the Issue and Attachment pages.

So will look for and honour

- If-Modified-Since
- If-None-Match
btasker changed status from 'Open' to 'In Progress'

Repo: Jira-Issue-Listing
Commit: 3d3684c14f127bb11fde441bcc8558b6d3c9f7e7
Author: Ben Tasker <github@<Domain Hidden>>

Date: Wed Apr 20 13:30:41 2016 +0100
Commit Message: Added support for conditional requests to Issues and Attachments, see JILS-41



Modified (-)(+)
-------
attachment.php
issue_page.php
utils.class.php




Webhook User-Agent

GitHub-Hookshot/cd33156


View Commit

Support for conditional requests has been introduced by means of two new utility functions:
function evaluateConditionalRequest($mtime,$etag)
function returnNotModified($etag,$lastmod=false)


As required by the RFC, If-None-Match takes precedence over If-Modified-Since.

However, strictly speaking, the implementation isn't technically RFC compliant:
A recipient MUST ignore the If-Modified-Since header field if the received field-value is not a valid HTTP-date, or if the request method is neither GET nor HEAD.


We're not currently checking that it's a valid HTTP date, simply that it's a valid date. Also not currently checking the request method - the system only uses GET/HEAD (ideally Apache should be configured to return a 405 for anything else), but really should insert a check just to be safe

Repo: Jira-Issue-Listing
Commit: 8680b6694c143562f44a197ebd7817d2141cb965
Author: Ben Tasker <github@<Domain Hidden>>

Date: Wed Apr 20 13:37:50 2016 +0100
Commit Message: Conditionals only honoured for GET/HEAD. JILS-41



Modified (-)(+)
-------
utils.class.php




Webhook User-Agent

GitHub-Hookshot/cd33156


View Commit


Repo: Jira-Issue-Listing
Commit: 6f5ea106d4dca2298ed724bb6cdb66126f2d0174
Author: Ben Tasker <github@<Domain Hidden>>

Date: Wed Apr 20 13:44:48 2016 +0100
Commit Message: Implemented validation of If-Modified-Since headers. See JILS-41



Modified (-)(+)
-------
utils.class.php




Webhook User-Agent

GitHub-Hookshot/cd33156


View Commit


Repo: Jira-Issue-Listing
Commit: f064384a7bc8d74094b5542939c861daad6dbdb0
Author: Ben Tasker <github@<Domain Hidden>>

Date: Wed Apr 20 13:45:13 2016 +0100
Commit Message: Removed debug headers used in JILS-41



Modified (-)(+)
-------
utils.class.php




Webhook User-Agent

GitHub-Hookshot/cd33156


View Commit

Have implemented most of the missing checks to make the implementation compliant with the requirements in the RFC. For example
$ GET -Ssed -H "If-Modified-Since: Fri, 05 Dec 2014 19:35:11 GMT" http://testbox/browse/attachments/thumbs/11514/TESTPROJ-120:_thumb_11514.png
GET http://testbox/browse/attachments/thumbs/11514/TESTPROJ-120:_thumb_11514.png
304 Not Modified


But changing the date format:
$ GET -Ssed -H "If-Modified-Since:  05 Dec 2014 19:35:11 GMT" http://testbox/browse/attachments/thumbs/11514/TESTPROJ-120:_thumb_11514.png
GET http://testbox/browse/attachments/thumbs/11514/TESTPROJ-120:_thumb_11514.png
200 OK


The only thing it doesn't do correctly yet is check the timezone, so
$ GET -Ssed -H "If-Modified-Since: Fri, 05 Dec 2014 20:35:11 BST" http://testbox/browse/attachments/thumbs/11514/TESTPROJ-120:_thumb_11514.png
GET http://testbox/browse/attachments/thumbs/11514/TESTPROJ-120:_thumb_11514.png
304 Not Modified


So, it's not 2616 (https://tools.ietf.org/html/rfc2616#section-3.3) compliant yet
This now works as expected
$ GET -Ssed -H "If-Modified-Since: Fri, 05 Dec 2014 20:35:11 BST" http://testbox/browse/attachments/thumbs/11514/TESTPROJ-120:_thumb_11514.png
GET http://testbox/browse/attachments/thumbs/11514/TESTPROJ-120:_thumb_11514.png
200 OK


And with the correct TZ
$ GET -Ssed -H "If-Modified-Since: Fri, 05 Dec 2014 19:35:11 GMT" http://testbox/browse/attachments/thumbs/11514/TESTPROJ-120:_thumb_11514.png
GET http://testbox/browse/attachments/thumbs/11514/TESTPROJ-120:_thumb_11514.png
304 Not Modified

btasker changed status from 'In Progress' to 'Open'
btasker changed timespent from '26 minutes' to '86 minutes'

Repo: Jira-Issue-Listing
Commit: 3f911cf48a1eaadd4365fb52698360858a941799
Author: Ben Tasker <github@<Domain Hidden>>

Date: Wed Apr 20 13:55:19 2016 +0100
Commit Message: Forced HTTP Dates to be GMT. See JILS-41



Modified (-)(+)
-------
utils.class.php




Webhook User-Agent

GitHub-Hookshot/cd33156


View Commit

Have verified that a NGinx downstream cache can now revalidate pages.

Used the following in the NGinx configuration to allow revalidation and force a short cache-age (to make testing quicker)
proxy_cache_revalidate on;
proxy_cache static-cache;
proxy_cache_valid  200 302  1m;


The ran a couple of timed requests against the NGinx box whilst tailing logs on the Apache origin
$ GET -Ssed http://testnginxbox/browse/JILS-41.html; sleep 65; GET -Ssed http://testnginxbox/browse/JILS-41.html;

# Resulting Log entries
192.168.4.5 - - [21/Apr/2016:11:03:14 +0100] "GET /browse/JILS-41.html HTTP/1.0" 200 34958
192.168.4.5 - - [21/Apr/2016:11:04:19 +0100] "GET /browse/JILS-41.html HTTP/1.0" 304 -


And for attachments
$ GET -Ssed http://testnginxbox/browse/attachments/11606/EXAMPLE-137:20150101_example.pdf; sleep 65; GET -Ssed http://testnginxbox/browse/attachments/11606/EXAMPLE-137:20150101_example.pdf

# Resulting Log Entries
192.168.4.5 - - [21/Apr/2016:11:07:01 +0100] "GET /browse/attachments/11606/EXAMPLE-137:20150101_example.pdf HTTP/1.0" 200 167450
192.168.4.5 - - [21/Apr/2016:11:08:07 +0100] "GET /browse/attachments/11606/EXAMPLE-137:20150101_example.pdf HTTP/1.0" 304 -


So it looks like that's working well.

Interestingly, although NGinx received a strong indicator (the E-Tag) in it's original CACHE_MISS:
HTTP/1.1 200 OK
Date: Thu, 21 Apr 2016 10:07:01 GMT
Server: Apache/2.2.15 (CentOS)
Content-Dispostion: attachment; filename=20150101_example.pdf
Content-Length: 167450
Last-Modified: Fri, 02 Jan 2015 02:25:10 GMT
E-tag: F-ae2d92d87e7ccc8f860226ec3c31665efd961bd3
Connection: close
Content-Type: application/pdf


It used the weaker indicator (Last-Modified) when going upstream to revalidate
GET /browse/attachments/11606/EXAMPLE-137:20150101_example.pdf HTTP/1.0
X-Forwarded-For: 10.16.8.4
Connection: close
If-Modified-Since: Fri, 02 Jan 2015 02:25:10 GMT
User-Agent: xxxxtest


So worth keeping in mind when making changes in the future

I think the best way forward for the project/version/component listing pages is to take a raw view of "has any issue displayed changed, at all".

It avoids future complications if the pages are changed, and also means that anyone wanting to tweak the layout of the pages in their own install won't need to go off and look up field names to do so.

The downside being that a page will be re-indexed if any change has occurred on an issue to linked to there (even if that's simply attaching a file), but also means that pages for old versions (for example) can be revalidated. If we explicitly avoid checking the jiraaction table, then this won't be true when comments are added.

At the moment, if a project is sitting there untouched, the listings pages will still need to be retrieved in full. This way we can get revalidation working for those to reduce overhead on the origin server.
btasker changed status from 'Open' to 'In Progress'

Repo: Jira-Issue-Listing
Commit: 7f9b54f99e0a0d8a56876865bcc310df30720be9
Author: Ben Tasker <github@<Domain Hidden>>

Date: Fri Apr 22 12:39:20 2016 +0100
Commit Message: Started implementing E-Tag/Last-Mod support for version pages. See JILS-41



Modified (-)(+)
-------
version-issues.php




Webhook User-Agent

GitHub-Hookshot/cd33156


View Commit

I think I've potentially chosen poorly by starting with the versions page.

JILS-36 broke the page into two sections - issues that are (or will be) fixed in this version, and "known issues" (i.e. issues that affect this version but are/will be fixed in a later version).

The current ETag/Last-mod implementation takes into account the former, but not the latter. So, once a version has been marked as released, any known issues will display as they were at time of release (so if they're fixed a week later, they'll still show as "open" on that page).

Simply adjusting it to include the latter is relatively simple, but means that most of the work involved in generating the page is already done. So aside from bandwidth savings we don't gain very much.

I think a better route would be to adjust the "Last-Modified" query to look more like the pre JILS-36 version than the current incarnation, so that a single query checks anything linked to the current version.


Repo: Jira-Issue-Listing
Commit: f333c2155e3aed45a9c68d8c9f04e6af313811eb
Author: Ben Tasker <github@<Domain Hidden>>

Date: Fri Apr 22 12:54:17 2016 +0100
Commit Message: Adjusted Last-Mod query to factor in 'Known Issues'. See JILS-41



Modified (-)(+)
-------
version-issues.php




Webhook User-Agent

GitHub-Hookshot/cd33156


View Commit


Repo: Jira-Issue-Listing
Commit: e554645b0e203f0c6f38cba233d0198bd7b8112c
Author: Ben Tasker <github@<Domain Hidden>>

Date: Fri Apr 22 12:56:50 2016 +0100
Commit Message: Simplified ETag generation. See JILS-41



Modified (-)(+)
-------
version-issues.php




Webhook User-Agent

GitHub-Hookshot/cd33156


View Commit

Next job, then, is implementing support for sending 304's

Repo: Jira-Issue-Listing
Commit: 55842ed1122a80d3079276ddce981e86e4b03e38
Author: Ben Tasker <github@<Domain Hidden>>

Date: Fri Apr 22 12:59:55 2016 +0100
Commit Message: Implemented revalidation support on versions page. See JILS-41



Modified (-)(+)
-------
version-issues.php




Webhook User-Agent

GitHub-Hookshot/cd33156


View Commit


Repo: Jira-Issue-Listing
Commit: b2c15d7226c346dcc1da416167f0832c1f745b77
Author: Ben Tasker <github@<Domain Hidden>>

Date: Fri Apr 22 13:08:42 2016 +0100
Commit Message: Added Etag/last-Mod/Reval support to component pages. See JILS-41



Modified (-)(+)
-------
component-issues.php




Webhook User-Agent

GitHub-Hookshot/cd33156


View Commit

I've added support for Last-Mod/Etag/Revalidation to component listing pages. Tests all run OK against them, and the component page doesn't have quite the same considerations that the versions page does.

Which leaves two classes of page to consider

- Project Pages
- Projects Index

As the latter is just a list of projects, it's currently pretty inexpensive to generate. Adding a query to check whether any issues have changed is going to increase cost significantly, so for the time being, I'm inclined to leave that one alone.

There's definitely some value to adding to the Project Pages though. As has been done with Components and Versions, need to take any changes to the project itself into account.
btasker changed status from 'In Progress' to 'Open'
btasker changed timespent from '86 minutes' to '146 minutes'

Repo: Jira-Issue-Listing
Commit: b249ebf714e127f699db54d6d04955bafd2b090a
Author: Ben Tasker <github@<Domain Hidden>>

Date: Fri Apr 22 13:18:20 2016 +0100
Commit Message: Introduced Etag/Last-mod/Reval support to project indexes. See JILS-41



Modified (-)(+)
-------
project-index.php




Webhook User-Agent

GitHub-Hookshot/cd33156


View Commit

Have added revalidation support to the project indexes.

The next task affects Projects/Version/Component pages:

At the moment, if the Project/Version/Component properties are changed (say an update to the description), the E-Tag will change accordingly. Last-Modified, however, will remain unchanged, which is clearly going to cause an issue if there's an NGinx cache downstream (given it only uses IMS).

Need to have a poke around the database and see whether the change gets recorded anywhere
btasker changed timespent from '146 minutes' to '154 minutes'
Irritatingly it doesn't look like it is. Running a diff on two database dumps taken either side of updating a version description, the only difference is the new value being inserted into projectversion.

Not sure if there's a way around it, but it's going to cause an issue. Where a downstream cache relies on IMS (it shouldn't, but does) it will incorrectly revalidate a page despite the information about that version/project/component having been updated.

The workaround, for the time being, is to ensure that an issue within that project/version/component gets manually updated (perhaps toggling to a different priority, then changing back).

I'm going to raise a separate bug for that so it can be tracked/listed as a known issue - JILS-42
As headers have been added to all applicable views, I'm going to mark this issue as complete
btasker changed status from 'Open' to 'Resolved'
btasker added 'Done' to resolution
btasker changed status from 'Resolved' to 'Closed'
Re-opening to assign to a version
btasker removed 'Done' from resolution
btasker changed status from 'Closed' to 'Reopened'
Assigning to v0.01b
btasker added '0.01b' to Version
btasker added '0.01b' to Fix Version
Re-Closing
btasker changed status from 'Reopened' to 'Resolved'
btasker added 'Done' to resolution
btasker changed status from 'Resolved' to 'Closed'

Work log


Ben Tasker
Permalink
2016-04-20 12:25:36

Time Spent: 10 minutes
Log Entry: Adding headers to Issue pages and Attachments, testing

Ben Tasker
Permalink
2016-04-20 12:42:04

Time Spent: 16 minutes
Log Entry: Theorising and waffling

Ben Tasker
Permalink
2016-04-20 13:56:56

Time Spent: 60 minutes
Log Entry: Initial implementation of support for Conditional Requests

Ben Tasker
Permalink
2016-04-22 13:12:28

Time Spent: 60 minutes
Log Entry: Adding revalidation support to Component/Version pages

Ben Tasker
Permalink
2016-04-22 13:21:11

Time Spent: 8 minutes
Log Entry: Adding Etag/Last-Mod/Reval support to project indexes. Testing