Been looking at whether any of the projects can be removed prior to data extraction. There are a few that are either empty, non-public and not needed, or now redundant. These are:
- bentasker.co.uk. Project ID 22
- LottoPredict. Project ID 34
- ModGooPlusFeed. Project ID 41
- Property Monkey. Project ID 33
- RAuth. Project ID 36
- Whitepapers. Project ID 26
Removing the above with the following:
mysql> DELETE FROM Documents WHERE Project IN (22,34,41,33,36,26);
Query OK, 11 rows affected (0.00 sec)
mysql> DELETE FROM Bugs WHERE Project IN (22,34,41,33,36,26);
Query OK, 3 rows affected (0.01 sec)
mysql> DELETE FROM Changelog WHERE Project IN (22,34,41,33,36,26);
Query OK, 32 rows affected (0.00 sec)
mysql> DELETE FROM Comments WHERE Project IN (22,34,41,33,36,26);
Query OK, 0 rows affected (0.01 sec)
mysql> DELETE FROM Releases WHERE Project IN (22,34,41,33,36,26);
Query OK, 4 rows affected (0.01 sec)
mysql> DELETE FROM Tasks WHERE Project IN (22,34,41,33,36,26);
Query OK, 4 rows affected (0.00 sec)
mysql> DELETE FROM Projects WHERE ProjectRef IN (22,34,41,33,36,26);
Query OK, 6 rows affected (0.00 sec)
Finally found a little bit of time to take a look at this. Tinkering about to see whether we can get away with just creating a mirror using wget and various text manipulation tools.
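The mirroring step would look roughly like the below. The exact wget flags used aren't recorded in these notes, so the ones shown are assumptions; the command is printed rather than run, since the dynamic site is no longer up:

```shell
# Sketch of the mirroring step. The exact wget flags aren't recorded in
# these notes, so the ones below are assumptions. DRYRUN just prints the
# command rather than running it, as the dynamic site is no longer up.
DRYRUN="echo"
$DRYRUN wget --mirror --page-requisites --convert-links --adjust-extension \
    "http://projects.bentasker.co.uk/"
```

--mirror turns on recursion and timestamping, --page-requisites pulls in CSS/JS/images, --convert-links rewrites links for local browsing, and --adjust-extension appends .html where the server didn't.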
A quick test of progress so far in a browser looks hopeful.
There are, however, some pages that are only referenced via Javascript, so wget hasn't included them - tasks.php being one example (there will almost certainly be others). Once the current mirror has been reformatted, will need to think of a good means to grab the missing pages
Next up, then, are the Bug pages, which means extracting two different integers from the query string.
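Before committing to a rename scheme, the two integers can be teased out individually; a rough sketch (the PROJID parameter name is a guess — only BUGID actually appears in the sed statements):

```shell
# Sketch: pull the two integers out of a viewbug URL. The second
# parameter name (PROJID) is a guess; only BUGID is confirmed.
qs='viewbug.php?BUGID=345&PROJID=12'
bugid=$(echo "$qs" | grep -o -P 'BUGID=[0-9]+' | cut -d= -f2)
projid=$(echo "$qs" | grep -o -P 'PROJID=[0-9]+' | cut -d= -f2)
echo "bug=$bugid project=$projid"
```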
OK, so as a quick hack with viewbug, I'm doing a replacement so that the project number stays part of the query string. It won't actually be used, but it keeps my sed statements simple:
for fname in viewbug.php*
do
    # Extract "BUGID_<n>" from the filename for use in the new name
    projnum=$(echo "$fname" | grep -o -P "BUGID=[0-9]+" | sed 's/=/_/g')
    mv "$fname" "viewbug_${projnum}.html"
    exactnum=$(echo "$projnum" | grep -o -P "[0-9]+")
    # Update all references, leaving any trailing query string intact
    sed -i -e "s/viewbug\.php?BUGID=${exactnum}\&/viewbug_${projnum}.html?/g" *.php* *.html
    sed -i -e "s/viewbug\.php?BUGID=${exactnum}\"/viewbug_${projnum}.html\"/g" *.php* *.html
    sed -i -e "s/viewbug\.php?BUGID=${exactnum}\'/viewbug_${projnum}.html\'/g" *.php* *.html
done
Based on a quick clickthrough in the browser, that seems to have worked
Because this is going to be a static archive, the filename extension for Downloads (and Releases) needs to be correct, otherwise the Content-Type header is going to be wrong.
So, the first thing to do is to find out the correct filetype (from the Content-Disposition header). Unfortunately the backend doesn't handle HEAD requests (it just sits trying to send the data anyway).
Curl can be told to use Content-Disposition when naming a downloaded file:
ben@milleniumfalcon:/tmp$ curl -JLO "http://projects.bentasker.co.uk/archived_content/Download.php?id=45&type=doc"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  3475  100  3475    0     0  12058      0 --:--:-- --:--:-- --:--:-- 12065
curl: Saved to filename '20100327_2_BUGGER_AutoUpdate_API_V1.0.txt'
So that's probably the route to take, something like
mkdir downloads
# Downloads
for fname in Download.php*
do
    # Let curl name the download from the Content-Disposition header,
    # then capture that name from its output
    realfname=$(curl -JLO "http://projects.bentasker.co.uk/archived_content/$fname" 2>&1 | grep Saved | grep -o -P "'[^']+'" | sed "s/'//g")
    mv "$realfname" downloads/
    sed -i -e "s/$fname\"/downloads\/$realfname\"/g" *.php* *.html
    sed -i -e "s/$fname'/downloads\/$realfname'/g" *.php* *.html
    # The pages HTML-encode the ampersand, so catch those references too
    htmld=$(echo "$fname" | sed 's/&/\&amp;/g')
    sed -i -e "s/$htmld\"/downloads\/$realfname\"/g" *.php* *.html
    sed -i -e "s/$htmld'/downloads\/$realfname'/g" *.php* *.html
done
There are a couple of files that don't have a file extension, so they just download in the browser. Looking at them, they're both text files, so it'd probably be helpful to update their references.
The loop above doesn't help (at all) with redirecting old URLs though, so need to think about how to expand it.
Attaching a copy of the eventual script used to build a static archive. It's not particularly refined or pretty, but based on testing so far, does the job.
It's a shame really, I remember the original CGI based BUGGER looked pretty good, and worked pretty well. I just never really got there with the template in the rebuild (though the functionality worked fairly well). Times change though, and I've got better tools available, so it's not really worth the effort anymore.
Thinking about it, the generated redirects likely won't work - I don't think you can include a query string in a path passed to Apache's redirect.
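For reference, something like the first (commented) line below never fires, because Redirect compares against the URL-path only; mod_rewrite has to be used to test the query string as a separate condition (the id/type values and target filename here are illustrative):

```apache
# Won't work: Redirect matches the URL-path only, the query string is ignored
# Redirect 301 "/archived_content/Download.php?id=45&type=doc" "/archived_content_static/downloads/45_example.txt"

# Works: mod_rewrite can test the query string as a separate condition
RewriteCond %{REQUEST_URI} ^/archived_content/Download\.php$
RewriteCond %{QUERY_STRING} ^id=45&type=doc
RewriteRule ^(.*)$ /archived_content_static/downloads/45_example.txt? [R=301,L]
```

The trailing `?` on the RewriteRule target discards the incoming query string so it isn't appended to the redirect target.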
It's either going to be a case of generating an Apache rewrite, or not bothering to redirect those old pages. Taking a quick skim over the access logs, it does, unfortunately, look like the old URLs still see some requests (why??), so I guess I'm going to have to have it generate the rewrites. Shouldn't actually be too hard to do.
I've adjusted the downloads block to generate rewrite blocks instead and, as it was bugging me, added file extensions to the two files which were missing them:
mkdir downloads
> downloads/index.html
# Downloads
for fname in Download.php*
do
    echo "Updating $fname"
    realfname=$(curl -JLO "http://projects.bentasker.co.uk/archived_content/$fname" 2>&1 | grep Saved | grep -o -P "'[^']+'" | sed "s/'//g")
    doc_id=$(echo "$fname" | grep -o -P "id=[0-9]+" | sed 's/id=//g')
    doc_type=$(echo "$fname" | grep -o -P "type=[a-z]+" | sed 's/type=//g')
    # Prefix with the doc id to avoid collisions between identically named files
    mv "$realfname" "downloads/${doc_id}_${realfname}"
    sed -i -e "s/$fname\"/downloads\/${doc_id}_$realfname\"/g" *.php* *.html
    sed -i -e "s/$fname'/downloads\/${doc_id}_$realfname'/g" *.php* *.html
    # The pages HTML-encode the ampersand, so catch those references too
    htmld=$(echo "$fname" | sed 's/&/\&amp;/g')
    sed -i -e "s/$htmld\"/downloads\/${doc_id}_$realfname\"/g" *.php* *.html
    sed -i -e "s/$htmld'/downloads\/${doc_id}_$realfname'/g" *.php* *.html
    #echo "redirect 301 /archived_content/$fname /archived_content/downloads/${doc_id}_${realfname}" >> redirects.txt
    cat << EOM >> redirects.txt
RewriteCond %{REQUEST_URI} ^/archived_content/Download\.php\$
RewriteCond %{QUERY_STRING} ^id=${doc_id}\&type=${doc_type}
RewriteRule ^(.*)$ /archived_content_static/downloads/${doc_id}_${realfname}? [R=301,L]
EOM
done
rm Download.php*
# Tidy up the two files missing a file extension
sed -i 's/31_Protocol_Documentation/31_Protocol_Documentation.txt/g' *.html *.txt
sed -i 's/57_INSTALL/57_INSTALL.txt/g' *.html *.txt
mv downloads/57_INSTALL downloads/57_INSTALL.txt
mv downloads/31_Protocol_Documentation downloads/31_Protocol_Documentation.txt
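A quick way to confirm the loop caught everything is to grep for leftover references; a sketch against a couple of illustrative sample pages:

```shell
# Sanity check (sketch): after the loop has run, make sure no page still
# references the old Download.php URLs. The sample files are illustrative.
cd "$(mktemp -d)"
printf '<a href="downloads/45_20100327_2_BUGGER_AutoUpdate_API_V1.0.txt">doc</a>\n' > Project_1.html
printf '<a href="Download.php?id=46&amp;type=doc">doc</a>\n' > Project_2.html
leftovers=$(grep -l 'Download\.php' ./*.html)
echo "Still referenced in: ${leftovers:-none}"
```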
In the logs, I can see some requests to auto.php, but that'll primarily be because I haven't finished BUGGER-4 yet.
Unrelated to the changes made so far, but I can see some requests for projsummary et al in the document root - that'll be from old links (as BUGGER used to occupy the entire subdomain). It's probably worth adjusting the redirects to account for those too. They'll have been broken for quite a while, but given they're obviously still actively linked to from somewhere, it'd be nice to unbreak those links.
I should probably also remove any links to login.php from the static archive, as it's always going to be a 404
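That could be another sed pass over the archive; a rough sketch, assuming the login link markup looks like the below:

```shell
# Sketch: remove anchors pointing at login.php from the static pages.
# The exact markup of the login link (and the sample page) is assumed.
cd "$(mktemp -d)"
printf '<p><a href="login.php">Login</a> | <a href="FrontPage.html">Home</a></p>\n' > sample.html
sed -i -e 's/<a href="login\.php">[^<]*<\/a>//g' sample.html
cat sample.html
```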
I've dropped a redirect in to catch requests for the old, old location:
# There are also legacy links pointing at the docroot
RewriteCond %{REQUEST_URI} ^/projsummary\.php$
RewriteCond %{QUERY_STRING} ^Project=([0-9]+)
RewriteRule ^(.*)$ /archived_content_static/Project_%1.html? [R=301,L]
I can't see any more requests going into the old install and not getting redirected, so I'm going to move the codebase out of the way in preparation for completing BUGGER-3 and BUGGER-7
Activity
2014-11-28 05:01:57
2017-05-01 17:28:20
There are several base page types to process
Each, in reality, has a querystring defining the content:
So, need to design a naming scheme for each, so pages can be renamed to something not using a query string, and then update all references to them
For example:
Making a sandbox to tinker in
Renaming buglists:
Could probably have avoided the double seds by using a single regex, but it doesn't take long to run, so there didn't seem much point wasting the effort.
Need to work through the others now. Downloads etc will need a slightly different approach, but will deal with those later
2017-05-01 17:30:42
And release views
2017-05-01 18:22:14
(Scrubbed - see below for the updated block)
2017-05-01 18:30:04
I think I'll take a look at identifying the pages that the mirror will have missed first
2017-05-01 18:49:37
(scrubbed, see later comments for new block)
Clicking around, it looks like that's everything but the Downloads/Releases. So need to start thinking about how to address them
2017-05-01 19:37:28
Calling it a day for now
2017-05-02 23:44:56
Will give it a few days and then check whether there have been requests for any other pages on the old version (i.e. anything not resulting in a 301).
Will also upload the updated conversion script.
2017-05-06 11:13:46
All that remains now, then, is to remove the dynamic code from the origin server... Done
BUGGER is now decommissioned.