The project doesn't currently rely on an external database.
During the initial design it was decided that, to get up and running quickly, rather than worrying about which database to use, we'd just implement something - the idea being that it'd be better to switch once the needs of the project were better understood.
However, the need for that switch hasn't (yet) arisen. Plus, messing about building a storage engine has probably helped keep me engaged.
This page details the inbuilt storage engine.
File Structure
The database relies on an indexed and hashed structure.
The primary key for a resource is it's URL.
In order to navigate the file structure, a hex-encoded SHA256 has of the URL is generated. The subdirectory path is calculated by taking the first and second pairs of characters (for example a080df7b239d9792fb041906cbcf563e095400a1e974f9c6e7766ccea85f32e8
would be found under a0/80
).
Storage directories are created in a directory called store
.
For resource types that support thumbnails, a thumbnail will be stored under the thumbs
directory, using the same subdirectory structure:
/
index
index.tomb
store/
a0/
80/
a080df7b239d9792fb041906cbcf563e095400a1e974f9c6e7766ccea85f32e8
tags
thumbs/
a0/
80/
a080df7b239d9792fb041906cbcf563e095400a1e974f9c6e7766ccea85f32e8.jpg
On Disk Index
The index is a simple text file, with one entry per line.
Each entry consists of 4 fields separated by the string (,!,
)
The format is
KEY,!,HASH,!,MIMETYPE,!,TITLE
For example
https://projects.bentasker.co.uk/,!,73518e88ec4fabb0f24c62f79a45e29479e5d278061fd824064f0af951ea1255,!,text/html,!,BTasker Projects - projects.bentasker.co.uk
Although the hash can (obviously) be derived from the key, it's calculated and stored when building indexes to ensure that index loads do not incur unnecessary overhead.
Index Tombstone
The index tombstone is an append-only list of keys.
If an item has been deleted (whether with the delete utility, or as the result of a crawl), its key will be written into the tombstone. The next time search loads the index it will ignore any entries with keys that exist in the tombstone.
This allows deletions to have (almost) instant effect, rather than needing to wait for the index to be rebuilt.
Tags index
The Tags index is gzipped text, introduced in Issue 10
Each line is an entry for a single tag, with the second field being a JSON encapsulated array of keys using that tag.
tag!#![json array]
For example:
facebook!#!["https://www.bentasker.co.uk/posts/blog/privacy/84-social-networking-and-children.html", "https://www.bentasker.co.uk/posts/blog/opinion/481-scre
enshot-don-t-embed.html", "https://www.bentasker.co.uk/posts/blog/privacy/103-are-companies-being-too-loose-with-our-privacy.html", "https://www.bentasker.co
.uk/posts/blog/general/playing-around-with-automating-syndication.html", "https://www.bentasker.co.uk/posts/blog/opinion/instagrams-account-appeal-process-re
ally-isnt-great.html", "https://www.bentasker.co.uk/posts/blog/116-republished-freedom4all/639-religious-group-harnesses-power-of-facebook-to-ban-gig.html",
"https://www.bentasker.co.uk/posts/blog/privacy/713-so-long-whatsapp.html", "https://www.bentasker.co.uk/posts/blog/116-republished-freedom4all/646-judge-rul
es-privacy-controls-on-facebook-insufficient.html", "https://www.bentasker.co.uk/posts/blog/security/playing-around-with-bings-ai-chatbot.html"]
There is no Tombstoning of the tags
index - the way that it's merged into the main index at load time renders them unnecessary.
Indexing of tags can be disabled by setting environment variable TAGS_ENABLED
to n
(to enable, set to y
or leave empty).
In Memory Index
At index load time, the on-disk indexes are read and transformed into the in-memory index.
First, the tags index (if active) is read in, with two actions being performed for each tag
- The tag is assigned a numerical identifier
- An index mapping tag names to numerical identifiers is updated
- Index keys associated with the tag are retrieved from the second field of the entry
- A temporary index mapping main index keys to numerical identifiers is built
After this stage, we essentially have two indexes:
# Tag to id
INDEXES['TAGS'] = {
"foo" : 1,
"bar" : 2,
"sed" : 3,
}
# Site to ids
url_tags = {
"https://example.com/somepage" : [1, 3],
"https://example.com/anotherpage" : [2]
}
If the tags index isn't active, both of these indexes will be empty.
Next, the main index tombstone (if any) is read in.
The main index is then read in, skipping any entry who's key appears in the loaded tombstone.
Each entry has the following form
idxitem = {
"u" : index_l[0], # url
"h" : index_l[1], # index hash
"t" : index_l[2], # content-type
"n" : index_l[3], # title/name
"i" : idx_num, # Index number
"p" : pth, # path
"d" : dom, # domain
"k" : [], # tag IDs
}
The idx_num
is a calculated number used to shard the index, allowing parallelisation at search time.
pth
and dom
are derived from the key, allowing searches to perform fast prefix and domain matching the index alone.
The index item's key is used to look up associated tag ids (if any) in url_tags
so that attribute k
can be populated.
The result is something like
idxitem = {
"u" : "https://example.com/somepage", # url
"h" : "cdbe3bcae7beadf19f97c277ab08fe7a716e482eff43d5a21cacbdf5e2a04721", # index hash
"t" : "text/html", # content-type
"n" : "My example Page", # title/name
"i" : 1, # Index number
"p" : "/somepage", # path
"d" : "example.com", # domain
"k" : [1, 3], # tag IDs
}
Once the index has been fully loaded, the in-memory copy of the tombstone and url_tags
are discarded.
Storage Files
Storage files are gzipped text files containing headers and a payload:
KEY:<key>
HASH:<hash>
DOMAIN:<domain>
PATH:<path>
FILE:<filename>
CONTENTTYPE:<mimetype>
TITLE:<title>
<JSON payload>
Headers are primarily consumed when building the index and allow index builds to occur without needing to unnecessarily consume potentially large payloads.
If, during, reindexing, it's considered that a file is corrupt, it will be moved to have a file extension of .invalid