#1 Design Project : utilities/file_location

Issue Type: issue

Status: closed

Reported By: btasker

Assigned To: btasker

Project: Utilities / File Location Listing

Milestone: proof-of-concept

Created: 28-Dec-23 12:09

Labels: Fixed/Done Task

Description

This project is the follow-on from misc/Python_Web_Crawler#12.

I discontinued that project because I no longer had a need for full-text search.

What I do continue to have a need for, though, is identifying where I stored a file - i.e. searching by filename, path etc.

The aim of this project is to stand up a simple crawler and web portal which allows me to search for files by location.

Crawler should be designed to run as a Kube Cron
Portal should be extremely lightweight

Although I'm not sure that it'll scale (in fact, I'm certain that it won't), I'd like the initial implementation to function without reliance on a traditional database - the focus should be on getting the crawler and information collection up and running.

The crawler should read a list of predefined domains from config and crawl pages on those domains. For each URL that it finds, it should store:

scheme
domain
path
Filename
Last modified (if provided)
Content-type (if provided)
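
As a rough illustration of the fields listed above, a stored record could look something like the sketch below. The names `FileRecord` and `record_from_response` are hypothetical (nothing in this ticket defines them), and it assumes the crawler has the URL and the response headers to hand as a simple mapping.

```python
from dataclasses import dataclass
from posixpath import basename
from typing import Mapping, Optional
from urllib.parse import urlsplit


@dataclass(frozen=True)
class FileRecord:
    """One crawled URL (hypothetical structure, not the project's actual one)."""
    scheme: str
    domain: str
    path: str
    filename: str
    last_modified: Optional[str] = None  # only set if the server provided it
    content_type: Optional[str] = None   # only set if the server provided it


def record_from_response(url: str, headers: Mapping[str, str]) -> FileRecord:
    """Build a record from a URL and its response headers."""
    parts = urlsplit(url)
    # Header names are case-insensitive, so normalise them before lookup
    lowered = {k.lower(): v for k, v in headers.items()}
    return FileRecord(
        scheme=parts.scheme,
        domain=parts.netloc,
        path=parts.path,
        filename=basename(parts.path),
        last_modified=lowered.get("last-modified"),
        content_type=lowered.get("content-type"),
    )
```

The two optional fields default to `None` so that a response without those headers still produces a usable record.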

Nice to haves

image support (including thumbnailing)

Activity

btasker
28-Dec-23 12:09

assigned to @btasker

btasker
28-Dec-23 12:11

Raised utilities/file_location_listing#2 for the crawler

btasker
28-Dec-23 17:52

OK, we have a working proof of concept.

The next thing to do will be to productise it:

Crawler needs to be able to read in a list of domains
System needs wrapping in a container
The /search/ API should be documented so that I can call it from CLI utilities
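
For the first item, one possible shape for reading the domain list is sketched below. The config format is an assumption (one domain per line, with `#` comments), and `parse_domains` and `in_scope` are hypothetical helper names rather than anything that exists in the crawler yet.

```python
from typing import Iterable, Set
from urllib.parse import urlsplit


def parse_domains(lines: Iterable[str]) -> Set[str]:
    """Parse a domain list: one domain per line, '#' starts a comment (assumed format)."""
    domains = set()
    for line in lines:
        # Drop anything after a '#', then trim whitespace and normalise case
        entry = line.split("#", 1)[0].strip().lower()
        if entry:
            domains.add(entry)
    return domains


def in_scope(url: str, domains: Set[str]) -> bool:
    """Return True if the URL's hostname is one of the configured domains."""
    host = urlsplit(url).hostname  # urlsplit lowercases the hostname
    return host is not None and host in domains
```

Usage would be something along the lines of `parse_domains(open(path))`, with the crawler calling `in_scope()` on each discovered link so that it doesn't wander off the configured domains.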

So far, it performs pretty well using the text storage and indexes - I suspect that'll cease to be true once we've done a full crawl though.
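
To illustrate the scaling concern, here is a minimal sketch of a search over flat text storage. The layout is purely an assumption for illustration (one tab-separated line per URL, with the URL in the first column), as is the name `search_index`; it is not the format the proof of concept actually uses.

```python
from typing import Iterable, List


def search_index(lines: Iterable[str], term: str) -> List[str]:
    """Case-insensitive substring search over tab-separated index lines."""
    needle = term.lower()
    matches = []
    for line in lines:
        # The URL is assumed to be the first tab-separated column
        url = line.rstrip("\n").split("\t", 1)[0]
        if needle in url.lower():
            matches.append(url)
    return matches
```

Every query in a scheme like this reads every line, so lookup time grows with the number of files crawled.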
