We currently apply search criteria in the following order:
checkTerms()
checkTerms()
_checkConstraints()
_checkConstraints()
_checkConstraints()
_checkConstraints()
_checkConstraints()

Although logical, this is potentially less efficient than it can be.
Most of the constraints checked by _checkConstraints() perform simple logical comparisons and they all perform one match.
The same cannot be said for the constraints checked in checkTerms() - if multiple search terms have been provided, it'll run multiple substring searches.
So, if we take the following search string:
foo bar domain:www.somedomain.invalid
And assume we've indexed the following URLs
https://www.somedomain.invalid/foo/bar.html
https://sub1.somedomain.invalid/foo/blah/bar.html
https://sub2.somedomain.invalid/foo/blah/bar.html

We'll see the following operations get run for each URL:
if foo in srchterm (via checkTerms())
if bar in srchterm (via checkTerms())
if domain == www.somedomain.invalid
Whatever order we do things in, the number of operations applied to https://www.somedomain.invalid/foo/bar.html will always be 3.
But we've performed 3 operations against each of https://sub1.somedomain.invalid/foo/blah/bar.html and https://sub2.somedomain.invalid/foo/blah/bar.html when a single cheap domain comparison would have excluded them.
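To make the counting concrete, here is a minimal sketch that tallies the operations for the example search under both orderings. The count_ops() helper and its domain_first flag are hypothetical and exist only for illustration; they are not the project's checkTerms() or _checkConstraints(), and deriving the host with urlsplit() stands in for whatever the real index stores.

```python
from urllib.parse import urlsplit

URLS = [
    "https://www.somedomain.invalid/foo/bar.html",
    "https://sub1.somedomain.invalid/foo/blah/bar.html",
    "https://sub2.somedomain.invalid/foo/blah/bar.html",
]
TERMS = ["foo", "bar"]
DOMAIN = "www.somedomain.invalid"


def count_ops(url, domain_first):
    """Return (matched, number of checks performed) for one URL.

    Hypothetical helper: it only models the order of checks.
    """
    ops = 0
    # Assumption: the real index already knows each entry's domain,
    # so extracting it here is not counted as an operation.
    host = urlsplit(url).netloc

    if domain_first:
        ops += 1
        if host != DOMAIN:
            return False, ops

    for term in TERMS:
        ops += 1
        if term not in url:
            return False, ops

    if not domain_first:
        ops += 1
        if host != DOMAIN:
            return False, ops

    return True, ops


terms_first = [count_ops(u, domain_first=False) for u in URLS]
domain_first = [count_ops(u, domain_first=True) for u in URLS]

for url, before, after in zip(URLS, terms_first, domain_first):
    print(url, "terms first:", before[1], "domain first:", after[1])
```

With the domain comparison moved to the front, the two sub-domain URLs are rejected after one check each, while the matching URL still costs three.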
Activity
01-Jan-24 12:54
assigned to @btasker
01-Jan-24 13:08
I was curious what the performance difference was, so kicked together a quick script to time applying a few functions to a list of 1 million strings:
It's far from scientific, but works as a rough approximation.
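The script itself is not reproduced in this ticket. As a rough sketch of that kind of comparison (not the original script), the following times four string checks over 1 million strings. The test data, the needle and the time_checks() helper are all assumptions made up for illustration. Each check is wrapped in a lambda, which adds the same call overhead to every measurement.

```python
import time


def time_checks(strings, needle):
    """Time a few string checks over every item in strings.

    Hypothetical helper: returns a dict of check name -> elapsed seconds.
    """
    checks = {
        "==": lambda s: s == needle,
        "in": lambda s: "somedomain" in s,
        "startswith": lambda s: s.startswith("https://www.somedomain.invalid"),
        "split": lambda s: s.split(".")[-1] == "html",
    }
    timings = {}
    for name, check in checks.items():
        start = time.perf_counter()
        for s in strings:
            check(s)
        timings[name] = time.perf_counter() - start
    return timings


# Assumed test data: 1 million made-up URLs
strings = [f"https://www.somedomain.invalid/path/{i}.html" for i in range(1_000_000)]
timings = time_checks(strings, strings[-1])

# Print the checks from fastest to slowest
for name, secs in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name:<12}{secs:.4f}s")
```

Absolute numbers depend on the machine and the Python version, so only the relative ordering between checks is worth reading from the output.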
They're all pretty fast, but there are clear differences
The numbers obviously fluctuate between runs but, as expected, == is always quicker than in
01-Jan-24 13:15
I'm somewhat surprised at how slow startswith is in comparison to in. I thought that maybe it was because we were checking for quite a long prefix, but adjusting to test startswith("1") doesn't really change the numbers.

We use startswith() for the prefix dork, so we want to make sure that _checkConstraints() applies that after other constraints.

Extension checks use split(), which is significantly more expensive, so we definitely want that constraint applied last (and, actually, probably want it applied after checkTerms() has been called).
01-Jan-24 13:19
mentioned in commit e511159b5cbbdddf4ba9241d3bc664203bc443b1
Commit: e511159b5cbbdddf4ba9241d3bc664203bc443b1
Author: B Tasker
Date: 2024-01-01T13:19:23.000+00:00
Message
fix: optimise order in which constraints are applied (utilities/file_location_listing#30)
01-Jan-24 13:21
The order of application has been updated:
_checkConstraints() has been updated to no longer trigger ext constraints and to trigger prefix constraints last.

Where dorks have been used, this should allow us to exclude results as cheaply as possible.
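The exact updated order is not listed in this ticket, so the following is only a sketch of the ordering the comments imply: cheap equality constraints first, the prefix constraint last among the constraints, then the search terms, and the extension check at the very end. Every function and the entry layout (a dict with precomputed "url" and "domain" keys) is a hypothetical stand-in, not the code from the commit.

```python
def check_constraints(entry, domain=None, prefix=None):
    """Hypothetical stand-in for the cheap constraint checks."""
    # Cheapest first: a single equality comparison
    if domain is not None and entry["domain"] != domain:
        return False
    # startswith() timed slower, so the prefix constraint goes last here
    if prefix is not None and not entry["url"].startswith(prefix):
        return False
    return True


def check_terms(entry, terms):
    """Hypothetical stand-in: one substring search per search term."""
    return all(term in entry["url"] for term in terms)


def check_ext(entry, ext):
    """Hypothetical stand-in: split() is the most expensive check."""
    return entry["url"].split(".")[-1] == ext


def matches(entry, terms, domain=None, prefix=None, ext=None):
    """Apply the checks from cheapest to most expensive, bailing out early."""
    if not check_constraints(entry, domain, prefix):
        return False
    if not check_terms(entry, terms):
        return False
    if ext is not None and not check_ext(entry, ext):
        return False
    return True


# Assumed index layout: the domain is stored alongside each URL
index = [
    {"url": "https://www.somedomain.invalid/foo/bar.html",
     "domain": "www.somedomain.invalid"},
    {"url": "https://sub1.somedomain.invalid/foo/blah/bar.html",
     "domain": "sub1.somedomain.invalid"},
    {"url": "https://sub2.somedomain.invalid/foo/blah/bar.html",
     "domain": "sub2.somedomain.invalid"},
]

by_domain = [e["url"] for e in index
             if matches(e, ["foo", "bar"], domain="www.somedomain.invalid", ext="html")]
by_prefix = [e["url"] for e in index
             if matches(e, ["foo"], prefix="https://sub1", ext="html")]
print(by_domain)
print(by_prefix)
```

Because each stage returns as soon as a check fails, an entry on the wrong domain never reaches the substring searches or the split() call.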