project Utilities / File Location Listing

utilities/file_location_listing#25: Prefix dork



Issue Information

Issue Type: issue
Status: closed
Reported By: btasker
Assigned To: btasker

Milestone: v0.2.1
Created: 31-Dec-23 11:03



Description

Should add a new prefix operator to fu to allow results to be filtered by the beginning of the filepath.

So, if we have

  • https://www.example.com/foo/bar/sed.html
  • https://www.example.com/aaa/sed.html

Searching for

sed prefix:/foo/bar

Should only return the first.
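The intended behaviour can be sketched in Python (the helper name here is illustrative, not the project's actual code):

```python
from urllib.parse import urlparse

def matches_prefix(url, prefix):
    # Compare the URL's path to the requested prefix, tolerating
    # a missing leading slash and differences in case
    path = urlparse(url).path.lstrip("/").lower()
    return path.startswith(prefix.lstrip("/").lower())

urls = [
    "https://www.example.com/foo/bar/sed.html",
    "https://www.example.com/aaa/sed.html",
]
matches = [u for u in urls if matches_prefix(u, "/foo/bar")]
print(matches)  # only the first URL survives the filter
```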




Activity


assigned to @btasker

The filepath isn't in the index as an individual field, so we have a few options here:

  1. Handle it like we currently do with the domain dork - check that the index key contains the string, then check the exact prefix after loading candidates from disk
  2. Parse the key in order to extract the path and then check the prefix
  3. Extract the path when reading the index in and store as part of the in-memory index
  4. Add the path to the index

Thoughts:

  1. I don't think opt 2. is really a practical option - we'd need to parse every identified key, adding a lot of overhead to searches.
  2. Opt 3. achieves the same effect, but adds quite a lot of expense at index load time.
  3. Opt 4. would add quite a lot of size to the index - arguably unnecessarily, given that the information is already there, just in a less accessible form.

I suspect the answer, ultimately, will be opt 3. (as that'll also improve performance for domain checks), but to begin with I think we should go with opt 1. rather than reinventing the wheel.
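To make the tradeoff concrete, opt 3. might look something like this in miniature (the `p`/`d` field names mirror the shape used later in this ticket; the rest is a sketch, not the project's actual code):

```python
from urllib.parse import urlparse

# Opt 3.: parse each key once when loading the index, so the
# search loop only has to do a cheap startswith() per entry
raw_keys = [
    "https://www.example.com/foo/bar/sed.html",
    "https://www.example.com/aaa/sed.html",
]

in_memory = []
for key in raw_keys:
    parsed = urlparse(key)
    in_memory.append({
        "u": key,
        "d": parsed.netloc.lower(),            # domain
        "p": parsed.path.lower().lstrip("/"),  # path
    })

# Search time: no parsing, just a string comparison per candidate
hits = [e["u"] for e in in_memory if e["p"].startswith("foo/bar")]
```

The parsing cost moves from every search to a one-off at load time, which is exactly the tradeoff being weighed here.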


mentioned in commit 664f3857349a41fd15d0305f416321cf8f2528f2

Commit: 664f3857349a41fd15d0305f416321cf8f2528f2
Author: B Tasker
Date: 2023-12-31T11:20:16.000+00:00

Message

feat: implement prefix operator (utilities/file_location_listing#25)

+21 -3 (24 lines changed)

mentioned in commit 311ea6fe2b162bba3e47d58236c6064d8310210c

Commit: 311ea6fe2b162bba3e47d58236c6064d8310210c
Author: B Tasker
Date: 2023-12-31T11:21:03.000+00:00

Message

docs: update help text (utilities/file_location_listing#25)

+1 -0 (1 line changed)

OK, so option 1 is in place.

I want to take a look at response times so that we can then put together a rough implementation of opt 3. and see how that does.

Searching without prefix filter

curl 'http://127.0.0.1:5000/search/' --data-raw '{"term":"a","type":"DOC"}'

Processed search
    Filters: {'AND': ['a'], 'OR': [], 'NOT': [], 'fu': {'matchtype': 'ANY', 'mode': 'AND'}}
    Search Type: DOC
    Search Time: 1.004521868s

Processed search
    Filters: {'AND': ['a'], 'OR': [], 'NOT': [], 'fu': {'matchtype': 'ANY', 'mode': 'AND'}}
    Search Type: DOC
    Search Time: 0.794224538s

Processed search
    Filters: {'AND': ['a'], 'OR': [], 'NOT': [], 'fu': {'matchtype': 'ANY', 'mode': 'AND'}}
    Search Type: DOC
    Search Time: 0.575431128s

mean: 0.791392511333s

Adding a prefix filter

curl 'http://127.0.0.1:5000/search/' --data-raw '{"term":"a prefix:World_Domination","type":"DOC"}'

Processed search
    Filters: {'AND': ['a'], 'OR': [], 'NOT': [], 'fu': {'matchtype': 'ANY', 'prefix': 'world_domination', 'mode': 'AND'}}
    Search Type: DOC
    Search Time: 0.930143945s

Processed search
    Filters: {'AND': ['a'], 'OR': [], 'NOT': [], 'fu': {'matchtype': 'ANY', 'prefix': 'world_domination', 'mode': 'AND'}}
    Search Type: DOC
    Search Time: 0.596798093s

Processed search
    Filters: {'AND': ['a'], 'OR': [], 'NOT': [], 'fu': {'matchtype': 'ANY', 'prefix': 'world_domination', 'mode': 'AND'}}
    Search Type: DOC
    Search Time: 1.35197711s

mean: 0.959639716s

It being a little longer makes sense - it still has to load the same number of entries from storage, even if it then returns fewer results.
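In outline, opt 1.'s flow is something like the following (names are illustrative) - the prefix test only runs after the candidate has already been pulled from storage, so the disk I/O is unchanged:

```python
def search_opt1(index_keys, term, prefix, load_item):
    # load_item stands in for the disk read of a stored candidate
    results = []
    for key in index_keys:
        # Cheap in-memory check first: does the key contain the term?
        if term not in key.lower():
            continue
        item = load_item(key)  # disk read happens regardless of prefix
        path = item["path"].lstrip("/").lower()
        if path.startswith(prefix.lstrip("/").lower()):
            results.append(item)
    return results
```

With a stub `load_item` you can confirm that every matching candidate is read from storage even when only a handful survive the prefix filter.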

Ahead of hacking in opt 3. I've added a timings print to index loading:

diff --git a/lib/storage.py b/lib/storage.py
index 0bf354e..4171399 100644
--- a/lib/storage.py
+++ b/lib/storage.py
@@ -47,7 +47,7 @@ def loadIndex():
             "IMAGE" : [],
             "DOC" : [],
         }
-        
+    start = time.time_ns()
     tombstones = {}
     if os.path.exists(f"{DB_PATH}/index.tomb"):
         with open(f"{DB_PATH}/index.tomb", 'r') as f:
@@ -106,6 +106,9 @@ def loadIndex():

     LOADED_AT = time.time()
     INDEXES["last_mod"] = getIndexLastMod()
+    stop = time.time_ns()
+    print(f"Index loaded in {stop-start}ns")
+    


 def deleteStoredItem(url):

Restarted a few times to get a mean:

Index loaded in 14212539ns
Index loaded in 13492333ns
Index loaded in 13538847ns

Mean load time is 13747906.3333 nanoseconds (~13.7ms; the index is pretty small - it has 5257 entries)

Hacking opt 3. in with

diff --git a/lib/dosearch.py b/lib/dosearch.py
index d8e6834..95f352c 100644
--- a/lib/dosearch.py
+++ b/lib/dosearch.py
@@ -48,6 +48,7 @@ def searchIndexChunk(idx, filters, matchtypes, idx_num, limit):
                     and checkTerms(url, filters)
                     and checkNot(url, filters['NOT']) 
                     and checkContentType(e, filters['fu'])
+                    and checkIndexConstraint("prefix", e, filters['fu'])
                     and checkIndexConstraint("hastitle", e, filters['fu'])
                     and checkIndexConstraint("ext", e, filters['fu'])
                     ):
@@ -61,9 +62,7 @@ def searchIndexChunk(idx, filters, matchtypes, idx_num, limit):
                         break

                     info['matchtype'] = m[1]
-                    if (checkItemConstraint("domain", info, filters['fu'])
-                        and checkItemConstraint("prefix", info, filters['fu'])
-                        ):
+                    if checkItemConstraint("domain", info, filters['fu']):
                         if not limit or count <= limit:
                             results.append(info)
                             count += 1
@@ -221,6 +220,13 @@ def checkIndexConstraint(field, idxitem, fu):
         p_sp = idxitem["u"].split(".")
         ext = p_sp[-1].lower()
         return (ext == fu["ext"].lstrip("."))
+    
+    if field == "prefix":
+        # Strip the leading slash - the user might or might not
+        # have omitted it
+        pref = fu['prefix'].lstrip("/")
+        return (idxitem['p'].startswith(pref))
+    

 def checkItemConstraint(field, item, fu):
     ''' Check whether the stored item's recorded domain matches the 
@@ -229,13 +235,6 @@ def checkItemConstraint(field, item, fu):
     if field not in fu:
         # No filter applied
         return True
-            
-    if field == "prefix":
-        # Strip the leading slash - the user might or might not
-        # have omitted it
-        pth = item["path"].lstrip("/").lower()
-        pref = fu['prefix'].lstrip("/")
-        return (pth.startswith(pref))

     # Otherwise, there's a filter, see if it matches
     return (item[field].lower() == fu[field])
diff --git a/lib/storage.py b/lib/storage.py
index 0bf354e..8d7f8c6 100644
--- a/lib/storage.py
+++ b/lib/storage.py
@@ -47,7 +47,7 @@ def loadIndex():
             "IMAGE" : [],
             "DOC" : [],
         }
-        
+    start = time.time_ns()
     tombstones = {}
     if os.path.exists(f"{DB_PATH}/index.tomb"):
         with open(f"{DB_PATH}/index.tomb", 'r') as f:
@@ -69,6 +69,12 @@ def loadIndex():
             if index_l[0] in tombstones:
                 continue

+            
+            # Extract the file path from the key
+            linkparsed = urlparse(index_l[0])
+            dom = linkparsed.netloc.lower()
+            pth = linkparsed.path.lower().lstrip("/")
+            
             # Should we be populating the "ALL" index
             #
             # This is off by default - generally speaking we
@@ -81,6 +87,8 @@ def loadIndex():
                     "t" : index_l[2], # content-type
                     "n" : index_l[3], # title/name
                     "i" : idx_num, # Index number
+                    "p" : pth, # path
+                    "d" : dom, # domain
                     })

             if index_l[2] in ["image/jpeg", "image/png", "image/gif"]:
@@ -90,6 +98,8 @@ def loadIndex():
                     "t" : index_l[2], # content-type
                     "n" : index_l[3], # title/name
                     "i" : idx_num, # Index number
+                    "p" : pth, # path
+                    "d" : dom, # domain                    
                     })
             else:
                 INDEXES["DOC"].append({
@@ -98,6 +108,8 @@ def loadIndex():
                     "t" : index_l[2], # content-type
                     "n" : index_l[3], # title/name 
                     "i" : idx_num, # Index number
+                    "p" : pth, # path
+                    "d" : dom, # domain                    
                     })

             idx_num += 1
@@ -106,6 +118,9 @@ def loadIndex():

     LOADED_AT = time.time()
     INDEXES["last_mod"] = getIndexLastMod()
+    stop = time.time_ns()
+    print(f"Index loaded in {stop-start}ns")
+    


 def deleteStoredItem(url):

Same index file, load times:

Index loaded in 33773051ns
Index loaded in 33167107ns
Index loaded in 33147481ns

mean: 33362546.3333 nanoseconds

That's an increase of 19614640 nanoseconds (about a 143% increase - load time has more than doubled, which is fairly substantial).
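Double-checking that figure against the raw numbers above:

```python
before = 41243719 / 3    # mean of the three pre-change loads, in ns
after = 100087639 / 3    # mean of the three post-change loads, in ns

increase = after - before
# The relative increase works out at roughly 143%
print(f"{increase:.0f}ns slower ({increase / before:.0%} increase)")
```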

What's the impact on query time though?

Searching without prefix filter

curl 'http://127.0.0.1:5000/search/' --data-raw '{"term":"a","type":"DOC"}'

Processed search
    Filters: {'AND': ['a'], 'OR': [], 'NOT': [], 'fu': {'matchtype': 'ANY', 'mode': 'AND'}}
    Search Type: DOC
    Search Time: 0.800105365s

Processed search
    Filters: {'AND': ['a'], 'OR': [], 'NOT': [], 'fu': {'matchtype': 'ANY', 'mode': 'AND'}}
    Search Type: DOC
    Search Time: 0.92191595s

Processed search
    Filters: {'AND': ['a'], 'OR': [], 'NOT': [], 'fu': {'matchtype': 'ANY', 'mode': 'AND'}}
    Search Type: DOC
    Search Time: 0.613058197s

mean: 0.778359837333s

Adding a prefix filter

curl 'http://127.0.0.1:5000/search/' --data-raw '{"term":"a prefix:World_Domination","type":"DOC"}'

Processed search
    Filters: {'AND': ['a'], 'OR': [], 'NOT': [], 'fu': {'matchtype': 'ANY', 'prefix': 'world_domination', 'mode': 'AND'}}
    Search Type: DOC
    Search Time: 0.038807647s

Processed search
    Filters: {'AND': ['a'], 'OR': [], 'NOT': [], 'fu': {'matchtype': 'ANY', 'prefix': 'world_domination', 'mode': 'AND'}}
    Search Type: DOC
    Search Time: 0.034738052s

Processed search
    Filters: {'AND': ['a'], 'OR': [], 'NOT': [], 'fu': {'matchtype': 'ANY', 'prefix': 'world_domination', 'mode': 'AND'}}
    Search Type: DOC
    Search Time: 0.031598047s

mean: 0.0350479153333s

That's a pretty significant improvement for that search type.

The index load time has increased significantly, but we're still talking much less than a second.

The question, of course, is: how long does a larger index take to load?

I've built a (much) larger index:

$ wc -l index
1090880 index
$ ls -sh index
212M index

Load times

Index loaded in 7783185645ns
Index loaded in 6952869913ns
Index loaded in 6919514444ns

mean: 7218523334ns (7.22s).

Realistically, that's probably much larger than we expect the DB to grow and we'd want index format changes/improvements before we got there anyway.

Search time at that size isn't as bad as it could be, though:

Processed search
    Filters: {'AND': ['a'], 'OR': [], 'NOT': [], 'fu': {'matchtype': 'ANY', 'prefix': 'world_domination', 'mode': 'AND'}}
    Search Type: DOC
    Search Time: 4.377456318s

So, the question is: is it worth the tradeoff in index load times?

Realistically, I'm probably not going to be using prefix all that regularly. But most of that extra time is spent parsing the key - and we get more than just the path from that, so the domain dork can also benefit. I can well imagine I'll be using that fairly regularly.

At the extreme end: although a 7s index load (really) isn't great, without the hack in place that query likely wouldn't have completed in a meaningful timeframe.

I think, with a bit of tidying, it's probably worth keeping in place.


mentioned in commit 9401635de9513e43ee2d93c96c0594ff33bf12cb

Commit: 9401635de9513e43ee2d93c96c0594ff33bf12cb
Author: B Tasker
Date: 2023-12-31T12:10:29.000+00:00

Message

feat: extract url components when reading index (utilities/file_location_listing#25)

+24 -11 (35 lines changed)

Closing as done - I'll raise a separate ticket to track moving domain over to using it.

mentioned in issue #26