BUGGER-1: Decommission BUGGER



Issue Information

Issue Type: Task
 
Priority: Major
Status: Closed

Reported By:
Ben Tasker
Assigned To:
Ben Tasker
Project: BUGGER (BUGGER)
Resolution: Done (2017-05-06 11:14:37)

Created: 2014-11-12 02:42:47
Time Spent Working


Description
BUGGER has been moved to http://projects.bentasker.co.uk/archived_content/ until such a time as data can be extracted.

Once that's complete, the system needs to be decommissioned and removed.


Attachments

download_and_make_static.sh.txt

Subtasks

BUGGER-2: Create Data extraction script
BUGGER-3: Ensure codebase is preserved
BUGGER-4: Find and disable dependent modules
BUGGER-7: Extract and publish data

Activity


Been looking at whether any of the projects can be removed prior to data extraction. There are a few that are either empty, non-public and not needed, or now redundant. These are:

- bentasker.co.uk. Project ID 22
- LottoPredict. Project ID 34
- ModGooPlusFeed. Project ID 41
- Property Monkey. Project ID 33
- RAuth. Project ID 36
- Whitepapers. Project ID 26

Removing the above with the following

mysql> DELETE FROM Documents WHERE Project IN (22,34,41,33,36,26);
Query OK, 11 rows affected (0.00 sec)

mysql> DELETE FROM Bugs WHERE Project IN (22,34,41,33,36,26);
Query OK, 3 rows affected (0.01 sec)

mysql> DELETE FROM Changelog WHERE Project IN (22,34,41,33,36,26);
Query OK, 32 rows affected (0.00 sec)

mysql> DELETE FROM Comments WHERE Project IN (22,34,41,33,36,26);
Query OK, 0 rows affected (0.01 sec)

mysql> DELETE FROM Releases WHERE Project IN (22,34,41,33,36,26);
Query OK, 4 rows affected (0.01 sec)

mysql> DELETE FROM Tasks WHERE Project IN (22,34,41,33,36,26);
Query OK, 4 rows affected (0.00 sec)

mysql> DELETE FROM Projects WHERE ProjectRef IN (22,34,41,33,36,26);
Query OK, 6 rows affected (0.00 sec)


Finally found a little bit of time to think about taking a look at this. Tinkering about to see whether we can get away with just creating a mirror using wget and various text manipulation tools:

ben@milleniumfalcon:/tmp/$ wget -k -p -m http://projects.bentasker.co.uk/archived_content/


There are several base page types to process

ben@milleniumfalcon:/tmp/projects.bentasker.co.uk/archived_content$ ls -1 | grep -o -P "(.*)\.php" | sort | uniq
buglist.php
docview.php
Download.php
License.php
projsummary.php
relview.php
viewbug.php


Each, in reality, has a querystring defining the content:
Download.php?id=102&type=rel


So, need to design a naming scheme for each page type that doesn't use a query string, rename the files to match, and then update all references to them

For example:
buglist.php?Project=1 should become buglist_Project_1.html
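
As a rough illustration of that naming scheme, the mapping could be sketched as below. This is only a sketch: static_name is a hypothetical helper (not part of the attached script), and it assumes the page is keyed on a single Project number.

```shell
# Hypothetical helper: map a mirrored filename such as
# "buglist.php?Project=1" to its static equivalent.
static_name() {
    # Strip ".php" and everything after it (the query string)
    page=${1%%.php*}
    # Take the first integer found in the name as the project number
    num=$(printf '%s\n' "$1" | grep -o -E '[0-9]+' | head -n 1)
    printf '%s_Project_%s.html\n' "$page" "$num"
}

static_name 'buglist.php?Project=1'
```

Pages keyed on something other than Project (License.php?ref=N, for instance) would need their own pattern.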


Making a sandbox to tinker in
ben@milleniumfalcon:/tmp/projects.bentasker.co.uk$ cp -r archived_content/ archived_content_adjusted
ben@milleniumfalcon:/tmp/projects.bentasker.co.uk$ cd archived_content_adjusted/


Renaming buglists:
for fname in buglist.php*
do
projnum=$(echo "$fname" | grep -o -P "[0-9]+")
mv "$fname" buglist_Project_${projnum}.html
# Update all references
sed -i -e "s/buglist\.php?Project=${projnum}\"/buglist_Project_${projnum}.html\"/g" *.php* *.html
sed -i -e "s/buglist\.php?Project=${projnum}\'/buglist_Project_${projnum}.html\'/g" *.php* *.html
done

Could probably have avoided using the double-seds by using regex, but it doesn't take long to run so couldn't see much point wasting the effort.
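
For reference, a single-pass version might look something like the sketch below. It is not what was actually run: rewrite_buglist_refs is a hypothetical name, it reads from stdin rather than editing files in place, and it assumes every link is terminated by either a double or a single quote.

```shell
# Hypothetical helper: rewrite buglist links for every project in one
# expression. The second capture group matches either quote character
# and is put back unchanged after the new filename.
rewrite_buglist_refs() {
    sed -E "s/buglist\.php\?Project=([0-9]+)([\"'])/buglist_Project_\1.html\2/g"
}

printf '%s\n' "<a href='buglist.php?Project=12'>Bugs</a>" | rewrite_buglist_refs
```

Because the project number is captured rather than interpolated, this would not need to sit inside the per-file loop.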

Need to work through the others now. Downloads etc will need a slightly different approach, but will deal with those later
Exactly the same approach for the docview pages
for fname in docview.php*
do
projnum=$(echo "$fname" | grep -o -P "[0-9]+")
mv "$fname" docview_Project_${projnum}.html
# Update all references
sed -i -e "s/docview\.php?Project=${projnum}\"/docview_Project_${projnum}.html\"/g" *.php* *.html
sed -i -e "s/docview\.php?Project=${projnum}\'/docview_Project_${projnum}.html\'/g" *.php* *.html
done


And release views
for fname in relview.php*
do
projnum=$(echo "$fname" | grep -o -P "[0-9]+")
mv "$fname" relview_Project_${projnum}.html
# Update all references
sed -i -e "s/relview\.php?Project=${projnum}\"/relview_Project_${projnum}.html\"/g" *.php* *.html
sed -i -e "s/relview\.php?Project=${projnum}\'/relview_Project_${projnum}.html\'/g" *.php* *.html
done
License page
for fname in License.php*
do
projnum=$(echo "$fname" | grep -o -P "[0-9]+")
mv "$fname" License_${projnum}.html
# Update all references
sed -i -e "s/License\.php?ref=${projnum}\"/License_${projnum}.html\"/g" *.php* *.html
sed -i -e "s/License\.php?ref=${projnum}\'/License_${projnum}.html\'/g" *.php* *.html
done
Project Summary pages
for fname in projsummary.php*
do
projnum=$(echo "$fname" | grep -o -P "[0-9]+")
mv "$fname" Project_${projnum}.html
# Update all references
sed -i -e "s/projsummary\.php?Project=${projnum}\"/Project_${projnum}.html\"/g" *.php* *.html
sed -i -e "s/projsummary\.php?Project=${projnum}\'/Project_${projnum}.html\'/g" *.php* *.html
done
A quick test of progress so far in a browser looks hopeful.

There are, however, some pages that are only referenced via JavaScript, so wget hasn't included them - tasks.php being one example (there will almost certainly be others). Once the current mirror has been reformatted, will need to think of a good means to grab the missing pages

Next up, then, are the bug pages, which means extracting two different integers from the query string
Where link methods have changed over time, there are a handful of duplicated bug pages, removing the older versions
for i in 112 114 116 137 139 148 157 186 187 188 243; do rm viewbug.php\?BUGID\=${i}; done

There's a bug ID that doesn't have a corresponding copy with project number attached, so manually move it
mv viewbug.php\?BUGID\=230 viewbug.php\?BUGID\=230\&Project=4
sed -i -e "s/viewbug\.php?BUGID=230\"/viewbug\.php?BUGID=230\&Project=4\"/g" *.php* *.html
OK, so as a quick hack, with viewbug I'm doing a replacement so that the project number stays part of the querystring. It won't actually be used, but keeps my sed statement simple
for fname in viewbug.php*
do
projnum=$(echo "$fname" | grep -o -P "BUGID=[0-9]+" | sed 's/=/_/g')
mv "$fname" viewbug_${projnum}.html
exactnum=$(echo $projnum | grep -o -P "[0-9]+" )
# Update all references
sed -i -e "s/viewbug\.php?BUGID=${exactnum}\&/viewbug_${projnum}.html?/g" *.php* *.html
sed -i -e "s/viewbug\.php?BUGID=${exactnum}\"/viewbug_${projnum}.html\"/g" *.php* *.html
sed -i -e "s/viewbug\.php?BUGID=${exactnum}\'/viewbug_${projnum}.html\'/g" *.php* *.html
done


Based on a quick clickthrough in the browser, that seems to have worked
Just found an issue with one of the seds in each loop: it doesn't work with the single quotes escaped (GNU sed treats \' as an end-of-pattern-space anchor rather than a literal quote), so the new block is

(Scrubbed - see below for the updated block)
OK, so of the files currently downloaded by wget, the only type left to do are the downloads themselves, which is decidedly non-trivial.

I think I'll take a look at identifying the pages that the mirror will have missed first
OK, so the projsummary loop now needs to be run first (as it fetches new files)
for fname in projsummary.php*
do

projnum=$(echo "$fname" | grep -o -P "[0-9]+")

mv "$fname" Project_${projnum}.html

# Update all references

sed -i -e "s/projsummary\.php?Project=${projnum}\"/Project_${projnum}.html\"/g" *.php* *.html
sed -i -e "s/projsummary\.php?Project=${projnum}'/Project_${projnum}.html'/g" *.php* *.html


# Grab the Task and Changelog views
wget http://projects.bentasker.co.uk/archived_content/changelogview.php?Project=${projnum} -O changelog_project_${projnum}.html
sed -i -e "s/changelogview\.php?Project=${projnum}'/changelog_project_${projnum}.html'/g" *.php* *.html
wget http://projects.bentasker.co.uk/archived_content/tasks.php?Project=${projnum} -O tasks_project_${projnum}.html
sed -i -e "s/tasks\.php?Project=${projnum}'/tasks_project_${projnum}.html'/g" *.php* *.html
done
New process (so far) then is

(scrubbed, see later comments for new block)

Clicking around, it looks like that's everything but the Downloads/Releases. So need to start thinking about how to address them
Because this is going to be a static archive, the filename extension for Downloads (and releases) needs to be correct, otherwise the content-type header is going to be wrong.

So, the first thing to do is to find out the correct filetype (from the Content-Disposition header). Unfortunately the backend doesn't handle HEAD requests (it just sits trying to send the data anyway).

Curl can be told to use Content-Disposition when naming a downloaded file:
ben@milleniumfalcon:/tmp$ curl -JLO "http://projects.bentasker.co.uk/archived_content/Download.php?id=45&type=doc"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  3475  100  3475    0     0  12058      0 --:--:-- --:--:-- --:--:-- 12065
curl: Saved to filename '20100327_2_BUGGER_AutoUpdate_API_V1.0.txt'


So that's probably the route to take, something like
curl -JLO "http://projects.bentasker.co.uk/archived_content/Download.php?id=45&type=doc" 2>&1 | grep Saved | grep -o -P "'[^\']+'" | sed "s/'//g"
20100327_2_BUGGER_AutoUpdate_API_V1.0.txt

To get the filename
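
The same extraction can be sketched with a single sed call instead of the grep/grep/sed chain. This is an alternative, not what the script uses: saved_name is a hypothetical helper, and it assumes curl reports the file in exactly the "Saved to filename '...'" form shown above (the wording may differ between curl versions).

```shell
# Hypothetical helper: read curl's combined output on stdin and print
# only the filename it reported saving to.
saved_name() {
    sed -n "s/.*Saved to filename '\(.*\)'\$/\1/p"
}

printf '%s\n' "curl: Saved to filename '20100327_2_BUGGER_AutoUpdate_API_V1.0.txt'" | saved_name
```

In the real loop it would be fed from curl -JLO "..." 2>&1 rather than printf.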
The following seems to have done it
mkdir downloads

# Downloads
for fname in Download.php*
do

realfname=$(curl -JLO "http://projects.bentasker.co.uk/archived_content/$fname" 2>&1 | grep Saved | grep -o -P "'[^\']+'" | sed "s/'//g")

mv "$realfname" downloads/
sed -i -e "s/$fname\"/downloads\/$realfname\"/g" *.php* *.html
sed -i -e "s/$fname'/downloads\/$realfname'/g" *.php* *.html

htmld=$(echo "$fname" | sed 's/\&/\&amp;/g')
sed -i -e "s/$htmld\"/downloads\/$realfname\"/g" *.php* *.html
sed -i -e "s/$htmld'/downloads\/$realfname'/g" *.php* *.html

done


There're a couple of files that don't have a file extension, so they just download in the browser. Looking at them, they're both text files, so it'd probably be helpful to update their references.

The loop above doesn't help (at all) with redirecting old URLs though, so need to think about how to expand it.
OK, the process so far is
for fname in projsummary.php*
do

projnum=$(echo "$fname" | grep -o -P "[0-9]+")

mv "$fname" Project_${projnum}.html

# Update all references

sed -i -e "s/projsummary\.php?Project=${projnum}\"/Project_${projnum}.html\"/g" *.php* *.html
sed -i -e "s/projsummary\.php?Project=${projnum}'/Project_${projnum}.html'/g" *.php* *.html


# Grab the Task and Changelog views
wget http://projects.bentasker.co.uk/archived_content/changelogview.php?Project=${projnum} -O changelog_project_${projnum}.html
sed -i -e "s/changelogview\.php?Project=${projnum}'/changelog_project_${projnum}.html'/g" *.php* *.html
wget http://projects.bentasker.co.uk/archived_content/tasks.php?Project=${projnum} -O tasks_project_${projnum}.html
sed -i -e "s/tasks\.php?Project=${projnum}'/tasks_project_${projnum}.html'/g" *.php* *.html
done




for fname in buglist.php*
do
projnum=$(echo "$fname" | grep -o -P "[0-9]+")
mv "$fname" buglist_Project_${projnum}.html
# Update all references
sed -i -e "s/buglist\.php?Project=${projnum}\"/buglist_Project_${projnum}.html\"/g" *.php* *.html
sed -i -e "s/buglist\.php?Project=${projnum}'/buglist_Project_${projnum}.html'/g" *.php* *.html
done


for fname in docview.php*
do
projnum=$(echo "$fname" | grep -o -P "[0-9]+")
mv "$fname" docview_Project_${projnum}.html
# Update all references
sed -i -e "s/docview\.php?Project=${projnum}\"/docview_Project_${projnum}.html\"/g" *.php* *.html
sed -i -e "s/docview\.php?Project=${projnum}'/docview_Project_${projnum}.html'/g" *.php* *.html
done

for fname in relview.php*
do
projnum=$(echo "$fname" | grep -o -P "[0-9]+")
mv "$fname" relview_Project_${projnum}.html
# Update all references
sed -i -e "s/relview\.php?Project=${projnum}\"/relview_Project_${projnum}.html\"/g" *.php* *.html
sed -i -e "s/relview\.php?Project=${projnum}'/relview_Project_${projnum}.html'/g" *.php* *.html
done



for fname in License.php*
do
projnum=$(echo "$fname" | grep -o -P "[0-9]+")
mv "$fname" License_${projnum}.html
# Update all references
sed -i -e "s/License\.php?ref=${projnum}\"/License_${projnum}.html\"/g" *.php* *.html
sed -i -e "s/License\.php?ref=${projnum}'/License_${projnum}.html'/g" *.php* *.html
done







for i in 112 114 116 137 139 148 157 186 187 188 243; do rm viewbug.php\?BUGID\=${i}; done
mv viewbug.php\?BUGID\=230 viewbug.php\?BUGID\=230\&Project=4


sed -i -e "s/viewbug\.php?BUGID=230\"/viewbug\.php?BUGID=230\&Project=4\"/g" *.php* *.html



for fname in viewbug.php*
do
projnum=$(echo "$fname" | grep -o -P "BUGID=[0-9]+" | sed 's/=/_/g')
mv "$fname" viewbug_${projnum}.html
exactnum=$(echo $projnum | grep -o -P "[0-9]+" )
# Update all references
sed -i -e "s/viewbug\.php?BUGID=${exactnum}\&/viewbug_${projnum}.html?/g" *.php* *.html
sed -i -e "s/viewbug\.php?BUGID=${exactnum}\"/viewbug_${projnum}.html\"/g" *.php* *.html
sed -i -e "s/viewbug\.php?BUGID=${exactnum}'/viewbug_${projnum}.html'/g" *.php* *.html
done


#Replace references to the old index
sed -i -e 's/index.php/index.html/g' *.php* *.html


# Start work on the downloads

mkdir downloads

# Downloads
for fname in Download.php*
do

realfname=$(curl -JLO "http://projects.bentasker.co.uk/archived_content/$fname" 2>&1 | grep Saved | grep -o -P "'[^\']+'" | sed "s/'//g")

mv "$realfname" downloads/
sed -i -e "s/$fname\"/downloads\/$realfname\"/g" *.php* *.html
sed -i -e "s/$fname'/downloads\/$realfname'/g" *.php* *.html

htmld=$(echo "$fname" | sed 's/\&/\&amp;/g')
sed -i -e "s/$htmld\"/downloads\/$realfname\"/g" *.php* *.html
sed -i -e "s/$htmld'/downloads\/$realfname'/g" *.php* *.html



done


Calling it a day for now
OK, the download block now generates redirects to be dropped into the old system
# Start work on the downloads

mkdir downloads
> downloads/index.html

# Downloads
for fname in Download.php*
do


echo "Updating $fname"
realfname=$(curl -JLO "http://projects.bentasker.co.uk/archived_content/$fname" 2>&1 | grep Saved | grep -o -P "'[^\']+'" | sed "s/'//g")
doc_id=$(echo "$fname" | grep -o -P "id=[0-9]+" | sed 's/id=//g' )



mv "$realfname" downloads/${doc_id}_${realfname}
sed -i -e "s/$fname\"/downloads\/${doc_id}_$realfname\"/g" *.php* *.html
sed -i -e "s/$fname'/downloads\/${doc_id}_$realfname'/g" *.php* *.html

htmld=$(echo "$fname" | sed 's/\&/\&amp;/g')
sed -i -e "s/$htmld\"/downloads\/${doc_id}_$realfname\"/g" *.php* *.html
sed -i -e "s/$htmld'/downloads\/${doc_id}_$realfname'/g" *.php* *.html

echo "redirect 301 /archived_content/$fname /archived_content/downloads/${doc_id}_${realfname}" >> redirects.txt

done
Attaching a copy of the eventual script used to build a static archive. It's not particularly refined or pretty, but based on testing so far, does the job.

It's a shame really, I remember the original CGI based BUGGER looked pretty good, and worked pretty well. I just never really got there with the template in the rebuild (though the functionality worked fairly well). Times change though, and I've got better tools available, so it's not really worth the effort anymore.
btasker added 'download_and_make_static.sh.txt' to Attachments
btasker removed 'download_and_make_static.sh.txt' from Attachment
btasker added 'download_and_make_static.sh.txt' to Attachments
Thinking about it, the generated redirects likely won't work: I don't think Apache's Redirect directive can match on a query string, as it only looks at the URL path.

It's either going to be a case of generating an Apache rewrite, or not bothering to redirect those old pages. Taking a quick skim over the access logs, it does, unfortunately, look like the old URLs see some requests (why??), so I guess I'm going to have to have it generate the rewrites. Shouldn't actually be too hard to do.
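
A per-download rewrite block could be generated along these lines. This is a hedged sketch rather than the block from the attached script: emit_download_rewrite is a hypothetical helper, the id=...&type=... parameter order is assumed from the example URL earlier in the ticket, and the /archived_content_static/downloads/ target path is an assumption.

```shell
# Hypothetical helper: print a mod_rewrite block that sends one old
# Download.php URL to its static file.
#   $1 = document id, $2 = type (e.g. doc or rel), $3 = static filename
emit_download_rewrite() {
    doc_id=$1
    doc_type=$2
    target=$3
    cat << EOM
RewriteCond %{REQUEST_URI}  ^/archived_content/Download\.php$
RewriteCond %{QUERY_STRING} ^id=${doc_id}&type=${doc_type}$
RewriteRule ^(.*)$ /archived_content_static/downloads/${target}? [R=301,L]

EOM
}

emit_download_rewrite 45 doc 45_20100327_2_BUGGER_AutoUpdate_API_V1.0.txt
```

The trailing ? on the substitution drops the original query string from the redirect target. Filenames containing spaces or regex metacharacters would need extra escaping that this sketch doesn't attempt.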
I've adjusted the downloads block to generate rewrite blocks instead, and as it was bugging me, added file extensions to the two files which were missing them

mkdir downloads
> downloads/index.html

# Downloads
for fname in Download.php*
do


echo "Updating $fname"
realfname=$(curl -JLO "http://projects.bentasker.co.uk/archived_content/$fname" 2>&1 | grep Saved | grep -o -P "'[^\']+'" | sed "s/'//g")
doc_id=$(echo "$fname" | grep -o -P "id=[0-9]+" | sed 's/id=//g' )
doc_type=$(echo "$fname" | grep -o -P "type=[a-z]+" | sed 's/type=//g' )


EOM


done

rm Download.php*

# Tidy up the two file-extension less files

sed -i 's/31_Protocol_Documentation/31_Protocol_Documentation.txt/g' *.html *.txt
sed -i 's/57_INSTALL/57_INSTALL.txt/g' *.html *.txt

mv downloads/57_INSTALL downloads/57_INSTALL.txt
mv downloads/31_Protocol_Documentation downloads/31_Protocol_Documentation.txt
The only thing left, then is to look at generating rewrites for the old URLs.
Rewrites are in place
RewriteCond %{REQUEST_URI}  ^/archived_content/index\.php$
RewriteRule ^(.*)$ /archived_content_static/? [R=301,L]

RewriteCond %{REQUEST_URI}  ^/archived_content/buglist\.php$
RewriteCond %{QUERY_STRING} ^Project=([0-9]+)
RewriteRule ^(.*)$ /archived_content_static/buglist_Project_%1.html? [R=301,L]

RewriteCond %{REQUEST_URI}  ^/archived_content/changelogview\.php$
RewriteCond %{QUERY_STRING} ^Project=([0-9]+)
RewriteRule ^(.*)$ /archived_content_static/changelog_project_%1.html? [R=301,L]

RewriteCond %{REQUEST_URI}  ^/archived_content/docview\.php$
RewriteCond %{QUERY_STRING} ^Project=([0-9]+)
RewriteRule ^(.*)$ /archived_content_static/docview_Project_%1.html? [R=301,L]

RewriteCond %{REQUEST_URI}  ^/archived_content/License\.php$
RewriteCond %{QUERY_STRING} ^ref=([0-9]+)
RewriteRule ^(.*)$ /archived_content_static/License_%1.html? [R=301,L]

RewriteCond %{REQUEST_URI}  ^/archived_content/projsummary\.php$
RewriteCond %{QUERY_STRING} ^Project=([0-9]+)
RewriteRule ^(.*)$ /archived_content_static/Project_%1.html? [R=301,L]

RewriteCond %{REQUEST_URI}  ^/archived_content/relview\.php$
RewriteCond %{QUERY_STRING} ^Project=([0-9]+)
RewriteRule ^(.*)$ /archived_content_static/relview_Project_%1.html? [R=301,L]

RewriteCond %{REQUEST_URI}  ^/archived_content/tasks\.php$
RewriteCond %{QUERY_STRING} ^Project=([0-9]+)
RewriteRule ^(.*)$ /archived_content_static/tasks_project_%1.html? [R=301,L]

RewriteCond %{REQUEST_URI}  ^/archived_content/viewbug\.php$
RewriteCond %{QUERY_STRING} ^BUGID=([0-9]+)
RewriteRule ^(.*)$ /archived_content_static/viewbug_BUGID_%1.html? [R=301,L]
That should now be the old site decommissioned. I've made sure that 301s are cacheable on the edge.

Will give it a few days and then check whether there have been requests for any other pages on the old version (i.e. anything not resulting in a 301).
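
That check could be sketched as a simple log filter. The helper name not_redirected is hypothetical, and the field positions assume Apache's combined log format (request path in field 7, status in field 9); adjust for whatever format the logs actually use.

```shell
# Hypothetical helper: read access log lines on stdin and print the
# status and path of any request under /archived_content/ that was
# answered with something other than a 301.
not_redirected() {
    awk '$7 ~ /^\/archived_content\// && $9 != "301" { print $9, $7 }'
}

# Example usage (the log path is an assumption):
# not_redirected < /var/log/apache2/access.log | sort | uniq -c
```

Requests for /archived_content_static/ are not matched, since the pattern requires a slash straight after archived_content.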

Will also upload the updated conversion script.
btasker removed 'download_and_make_static.sh.txt' from Attachment
btasker added 'download_and_make_static.sh.txt' to Attachments
In the logs, I can see some requests to auto.php, but that'll primarily be because I haven't finished BUGGER-4 yet.

Unrelated to the changes made so far, but I can see some requests for projsummary et al in the document root - that'll be from old links (as BUGGER used to occupy the entire subdomain). Probably worth adjusting the redirects to account for those too, they'll have been broken for quite a while, but given they're obviously still actively linked to somewhere it'd be nice to unbreak those links.

I should probably also remove any links to login.php from the static archive, as it's always going to be a 404
I've added a redirect for the auto.php's in case I miss a module somewhere (also useful to know the last state)
RewriteCond %{REQUEST_URI}  ^/archived_content/auto\.php$
RewriteCond %{QUERY_STRING} ^action=changelog&Project=([0-9]+)
RewriteRule ^(.*)$ /archived_content_static/auto_Proj_%1.json? [R=301,L]
I've dropped a redirect in to catch requests for the old, old location
# There are also legacy links pointing at the docroot
RewriteCond %{REQUEST_URI}  ^/projsummary\.php$
RewriteCond %{QUERY_STRING} ^Project=([0-9]+)
RewriteRule ^(.*)$ /archived_content_static/Project_%1.html? [R=301,L]
I can't see any more requests going into the old install and not getting redirected, so I'm going to move the codebase out of the way in preparation for completing BUGGER-3 and BUGGER-7
OK redirects are in place, the codebase has been preserved in a git repo, and a static archive of the private projects has been created.

All that remains now, then, is to remove the dynamic code from the origin server... Done

BUGGER is now decommissioned.
btasker changed status from 'Open' to 'Resolved'
btasker added 'Done' to resolution
btasker changed status from 'Resolved' to 'Closed'