#30 Apply least expensive search criteria first : utilities/file_location

Issue Type: issue

Status: closed

Reported By: btasker

Assigned To: btasker

Project: Utilities / File Location Listing

Milestone: v0.2.1

Created: 01-Jan-24 12:54

Labels: Fixed/Done Improvement

Description

We currently apply search criteria in the following order:

Does it contain necessary terms (via checkTerms())
Does it not contain excluded terms (via checkTerms())
Does it match any content-type filter (via _checkConstraints())
Does it match any domain filter (via _checkConstraints())
Does it match any prefix filter (via _checkConstraints())
Does it match any hastitle filter (via _checkConstraints())
Does it match any ext filter (via _checkConstraints())

Although logical, this is potentially less efficient than it can be.

Most of the constraints checked by _checkConstraints() perform simple logical comparisons and they all perform one match.

The same cannot be said for the constraints checked in checkTerms() - if multiple search terms have been provided, it'll run multiple substring searches.

So, if we take the following search string:

foo bar domain:www.somedomain.invalid

And assume we've indexed the following URLs

https://www.somedomain.invalid/foo/bar.html
https://sub1.somedomain.invalid/foo/blah/bar.html
https://sub2.somedomain.invalid/foo/blah/bar.html

We'll see the following operations get run for each URL

if foo in srchterm (via checkTerms())
if bar in srchterm (via checkTerms())
if domain == www.somedomain.invalid

Whatever order we do things in, the number of operations applied to https://www.somedomain.invalid/foo/bar.html will always be 3.

But, we've performed 3 operations against each of https://sub1.somedomain.invalid/foo/blah/bar.html and https://sub2.somedomain.invalid/foo/blah/bar.html when they could have been excluded with just one cheap one.

Toggle State Changes

Activity

btasker Permalink
01-Jan-24 12:54

assigned to @btasker

btasker Permalink
01-Jan-24 13:08

I was curious what the performance difference was, so kicked together a quick script to time applying a few functions to a list of 1 million strings:

#!/usr/bin/env python3

import time


def buildStrings():
    ''' Build a list of 1M strings
    '''
    n = 10000000000000000000000000
    l = []
    for x in range(1000000):
        n += 1
        l.append(str(n))
    return l

# Get some strings
l = buildStrings()

in_start = time.time_ns()
# run "in" checks
for i in l:
    if "42" in i:
        continue

in_stop = time.time_ns()

# run simple comparisons
c_start = time.time_ns()
for i in l:
    if i == "10000000000099999":
        continue
c_stop = time.time_ns()


# run startswith()
s_start = time.time_ns()
for i in l:
    if i.startswith("1"):
        continue
s_stop = time.time_ns()

e_start = time.time_ns()
for i in l:
    if i.endswith("9"):
        continue
e_stop = time.time_ns()


sp_start = time.time_ns()
for i in l:
    x = i.split(".")
    if x[0].lower() == "4":
        continue
sp_stop = time.time_ns()

print((
    f"in          : {in_stop - in_start}\n"
    f"==          : {c_stop - c_start}\n"
    f"startswith(): {s_stop - s_start}\n"
    f"endswith()  : {e_stop - e_start}\n"
    f"split()     : {sp_stop - sp_start}"
))

It's far from scientific, but works as a rough approximation.

They're all pretty fast, but there are clear differences

$ python3 ~/test.py 
in          : 38845076
==          : 30361763
startswith(): 77421262
endswith()  : 76586864
split()     : 157511522

The numbers obviously fluctuate between runs but, as expected, == is always quicker than in

btasker Permalink
01-Jan-24 13:15

I'm somewhat surprised at how slow startswith is in comparison to in. I thought that maybe it was because we were checking for quite a long prefix, but adjusting to test startswith("1") doesn't really change the numbers.

We use startswith() for the prefix dork, so we want to make sure that _checkConstraints() applies that after other constraints.

Extension checks use split() which is significantly more expensive, we definitely want that constraint applied last (and, actually, probably want it applied after checkTerms() has been called).

btasker Permalink
01-Jan-24 13:19

verified

mentioned in commit e511159b5cbbdddf4ba9241d3bc664203bc443b1

Commit: e511159b5cbbdddf4ba9241d3bc664203bc443b1 
Author: B Tasker                            
                            
Date: 2024-01-01T13:19:23.000+00:00

Message

fix: optimise order in which constraints are applied (utilities/file_location_listing#30)

+13 -4 (17 lines changed)

btasker Permalink
01-Jan-24 13:21

The order of application has been updated:

            if (filters['fu']['matchtype'] in ['ANY', m[1]] 
                    and _checkConstraints(e, filters)
                    and checkTerms(url, filters)
                    and checkIndexConstraint("ext", e, filters['fu'])

With _checkConstraints() having been updated to no longer trigger ext constraints and trigger prefix constraints last.

Where dorks have been used, this should allow us to exclude results as cheaply as possible.

utilities/file_location_listing#30: Apply least expensive search criteria first

Issue Information

Activity