Wiki: Database/Utilities / File Location Listing

The project doesn't currently rely on an external database.

During the initial design it was decided that, to get up and running quickly, rather than worrying about which database to use, we'd just implement something - the idea being that it'd be better to switch once the needs of the project were better understood.

However, the need for that switch hasn't (yet) arisen. Plus, messing about building a storage engine has probably helped keep me engaged.

This page details the inbuilt storage engine.

File Structure

The database relies on an indexed and hashed structure.

The primary key for a resource is it's URL.

In order to navigate the file structure, a hex-encoded SHA256 has of the URL is generated. The subdirectory path is calculated by taking the first and second pairs of characters (for example a080df7b239d9792fb041906cbcf563e095400a1e974f9c6e7766ccea85f32e8 would be found under a0/80).

Storage directories are created in a directory called store.

For resource types that support thumbnails, a thumbnail will be stored under the thumbs directory, using the same subdirectory structure:

/
  index
  index.tomb
  store/
    a0/
      80/
       a080df7b239d9792fb041906cbcf563e095400a1e974f9c6e7766ccea85f32e8
  tags
  thumbs/
    a0/
      80/
       a080df7b239d9792fb041906cbcf563e095400a1e974f9c6e7766ccea85f32e8.jpg

On Disk Index

The index is a simple text file, with one entry per line.

Each entry consists of 4 fields separated by the string (,!,)

The format is

KEY,!,HASH,!,MIMETYPE,!,TITLE

For example

https://projects.bentasker.co.uk/,!,73518e88ec4fabb0f24c62f79a45e29479e5d278061fd824064f0af951ea1255,!,text/html,!,BTasker Projects - projects.bentasker.co.uk

Although the hash can (obviously) be derived from the key, it's calculated and stored when building indexes to ensure that index loads do not incur unnecessary overhead.

Index Tombstone

The index tombstone is an append-only list of keys.

If an item has been deleted (whether with the delete utility, or as the result of a crawl), its key will be written into the tombstone. The next time search loads the index it will ignore any entries with keys that exist in the tombstone.

This allows deletions to have (almost) instant effect, rather than needing to wait for the index to be rebuilt.

Tags index

The Tags index is gzipped text, introduced in Issue 10

Each line is an entry for a single tag, with the second field being a JSON encapsulated array of keys using that tag.

tag!#![json array]

For example:

facebook!#!["https://www.bentasker.co.uk/posts/blog/privacy/84-social-networking-and-children.html", "https://www.bentasker.co.uk/posts/blog/opinion/481-scre
enshot-don-t-embed.html", "https://www.bentasker.co.uk/posts/blog/privacy/103-are-companies-being-too-loose-with-our-privacy.html", "https://www.bentasker.co
.uk/posts/blog/general/playing-around-with-automating-syndication.html", "https://www.bentasker.co.uk/posts/blog/opinion/instagrams-account-appeal-process-re
ally-isnt-great.html", "https://www.bentasker.co.uk/posts/blog/116-republished-freedom4all/639-religious-group-harnesses-power-of-facebook-to-ban-gig.html", 
"https://www.bentasker.co.uk/posts/blog/privacy/713-so-long-whatsapp.html", "https://www.bentasker.co.uk/posts/blog/116-republished-freedom4all/646-judge-rul
es-privacy-controls-on-facebook-insufficient.html", "https://www.bentasker.co.uk/posts/blog/security/playing-around-with-bings-ai-chatbot.html"]

There is no Tombstoning of the tags index - the way that it's merged into the main index at load time renders them unnecessary.

Indexing of tags can be disabled by setting environment variable TAGS_ENABLED to n (to enable, set to y or leave empty).

In Memory Index

At index load time, the on-disk indexes are read and transformed into the in-memory index.

First, the tags index (if active) is read in, with two actions being performed for each tag

The tag is assigned a numerical identifier
An index mapping tag names to numerical identifiers is updated
Index keys associated with the tag are retrieved from the second field of the entry
A temporary index mapping main index keys to numerical identifiers is built

After this stage, we essentially have two indexes:

# Tag to id
INDEXES['TAGS'] = {
  "foo" : 1, 
  "bar" : 2, 
  "sed" : 3, 
}

# Site to ids
url_tags = {
   "https://example.com/somepage" : [1, 3],
   "https://example.com/anotherpage" : [2]
}

If the tags index isn't active, both of these indexes will be empty.

Next, the main index tombstone (if any) is read in.

The main index is then read in, skipping any entry who's key appears in the loaded tombstone.

Each entry has the following form

idxitem = {
    "u" : index_l[0], # url
    "h" : index_l[1], # index hash
    "t" : index_l[2], # content-type
    "n" : index_l[3], # title/name
    "i" : idx_num, # Index number
    "p" : pth, # path
    "d" : dom, # domain
    "k" : [], # tag IDs
}

The idx_num is a calculated number used to shard the index, allowing parallelisation at search time.

pth and dom are derived from the key, allowing searches to perform fast prefix and domain matching the index alone.

The index item's key is used to look up associated tag ids (if any) in url_tags so that attribute k can be populated.

The result is something like

idxitem = {
    "u" : "https://example.com/somepage", # url
    "h" : "cdbe3bcae7beadf19f97c277ab08fe7a716e482eff43d5a21cacbdf5e2a04721", # index hash
    "t" : "text/html", # content-type
    "n" : "My example Page", # title/name
    "i" : 1, # Index number
    "p" : "/somepage", # path
    "d" : "example.com", # domain
    "k" : [1, 3], # tag IDs
}

Once the index has been fully loaded, the in-memory copy of the tombstone and url_tags are discarded.

Storage Files

Storage files are gzipped text files containing headers and a payload:

KEY:<key>
HASH:<hash>
DOMAIN:<domain>
PATH:<path>
FILE:<filename>
CONTENTTYPE:<mimetype>
TITLE:<title>

<JSON payload>

Headers are primarily consumed when building the index and allow index builds to occur without needing to unnecessarily consume potentially large payloads.

If, during, reindexing, it's considered that a file is corrupt, it will be moved to have a file extension of .invalid