
utilities/file_location_listing#1: Design Project



Issue Information

Issue Type: issue
Status: closed
Reported By: btasker
Assigned To: btasker

Milestone: proof-of-concept
Created: 28-Dec-23 12:09



Description

This project is the follow-on from misc/Python_Web_Crawler#12.

I discontinued that project because I no longer had need for full-text search.

What I do continue to have a need for, though, is identifying where I stored a file - i.e. searching by filename, path, etc.

The aim of this project is to stand up a simple crawler and web portal which allow me to search for files by location:

  • Crawler should be designed to run as a Kube Cron
  • Portal should be extremely lightweight

Although I'm not sure that it'll scale (in fact, I'm certain that it won't), I'd like the initial implementation to function without reliance on a traditional database - the focus should be on getting the crawler and information collection up and running.

The crawler should read a list of predefined domains from config and crawl pages on those domains. For each URL it should store (a rough storage sketch follows this list):

  • Scheme
  • Domain
  • Path
  • Filename
  • Last modified (if provided)
  • Content-type (if provided)
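
As a rough illustration of that record, here's a minimal sketch that reads the domain list from a file, grabs headers for each URL and appends one JSON line per result to a flat file (so no database involved). The file locations, field names and use of requests are assumptions for the sketch, not the actual implementation.

```python
#!/usr/bin/env python3
# Sketch only: paths, field names and HTTP client are assumptions.
import json
import os
from urllib.parse import urlparse

import requests

STORE_PATH = os.environ.get("STORE_PATH", "store/index.jsonl")        # hypothetical
DOMAINS_FILE = os.environ.get("DOMAINS_FILE", "config/domains.txt")   # hypothetical


def load_domains(path=DOMAINS_FILE):
    """Read the list of domains to crawl, one per line (ignoring comments)."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip() and not line.startswith("#")]


def record_for(url):
    """Fetch headers for a URL and build the per-URL record."""
    resp = requests.head(url, allow_redirects=True, timeout=10)
    parsed = urlparse(resp.url)
    return {
        "scheme": parsed.scheme,
        "domain": parsed.netloc,
        "path": os.path.dirname(parsed.path),
        "filename": os.path.basename(parsed.path),
        "last_modified": resp.headers.get("Last-Modified"),  # may be absent
        "content_type": resp.headers.get("Content-Type"),    # may be absent
    }


def store(record, path=STORE_PATH):
    """Append the record to a flat JSON-lines file."""
    if os.path.dirname(path):
        os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")


if __name__ == "__main__":
    for domain in load_domains():
        store(record_for(f"https://{domain}/"))
```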

Nice to haves

  • image support (including thumbnailing)



Activity


assigned to @btasker

Raised utilities/file_location_listing#2 for the crawler

OK, we have a working proof of concept.

The next thing to do will be to productise it:

  • Crawler needs to be able to read in a list of domains
  • System needs wrapping in a container
  • The /search/ API should be documented so that I can call it from CLI utilities (a rough sketch of the sort of call I have in mind follows this list)
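
To sketch the sort of CLI wrapper I have in mind for that last point - the portal address, endpoint parameters and response shape below are guesses until the API is actually documented:

```python
#!/usr/bin/env python3
# Sketch of a CLI wrapper for the /search/ API; details are assumptions.
import argparse
import json
import urllib.parse
import urllib.request

PORTAL = "http://127.0.0.1:8080"  # hypothetical portal address


def search(term):
    """Call the portal's /search/ endpoint and return the decoded response."""
    url = f"{PORTAL}/search/?q={urllib.parse.quote(term)}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Search the file location index")
    parser.add_argument("term", help="filename or path fragment to search for")
    args = parser.parse_args()
    for hit in search(args.term):
        # Assumes each result is a dict carrying at least a URL
        print(hit.get("url", hit))
```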

So far, it performs pretty well using the text storage and indexes - I suspect that'll cease to be true once we've done a full crawl, though.