Searches are performed by POSTing a JSON payload to /search:
{
  "term": "a search term",
  "type": "DOC",
  "limit": 300
}
Notes:
- type should be one of DOC or IMAGE
- limit can be omitted (no limit will be applied if not present)
- term can be a word, or multiple words. It can also include dorks (see below)
- The default operation is an AND search (see here for why we don't use OR by default); the mode can be changed with a dork
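As a rough sketch of the request described above, the payload can be assembled and sent from Python using only the standard library. The helper names `build_payload` and `run_search` are hypothetical and not part of the project, and the base URL passed to `run_search` is whatever address your Search Portal listens on.

```python
import json
import urllib.request

VALID_TYPES = ("DOC", "IMAGE")


def build_payload(term, search_type="DOC", limit=None):
    """Build the JSON body for a POST to /search (hypothetical helper)."""
    if search_type not in VALID_TYPES:
        raise ValueError(f"type must be one of {VALID_TYPES}")
    payload = {"term": term, "type": search_type}
    # limit is optional: leaving it out means no limit is applied
    if limit is not None:
        payload["limit"] = limit
    return payload


def run_search(base_url, payload):
    """POST the payload to <base_url>/search and return the decoded JSON."""
    req = urllib.request.Request(
        f"{base_url}/search",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Show the body that would be sent; no network call is made here
print(json.dumps(build_payload("a search term", limit=300)))
```

With a portal running locally, `run_search("http://127.0.0.1:5000", build_payload("a search term"))` would perform the actual request.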
Count Only
Since v0.2.5 it's possible to have the API return a count of results rather than the results themselves - this allows for much faster responses because it doesn't require loads from storage.
This mode is enabled by including a count-only attribute in the POST'd payload and setting its value to true:
$ curl -d '{"term":"SD", "type": "DOC", "count-only": true}' http://127.0.0.1:5000/search/
{
  "results": {
    "result_count": 56
  }
}
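The same count-only exchange can be handled from Python. This is a minimal sketch: `count_payload` and `extract_count` are hypothetical helper names, and the response shape is assumed to match the curl output shown above.

```python
def count_payload(term, search_type="DOC"):
    """Build a count-only request body (hypothetical helper)."""
    return {"term": term, "type": search_type, "count-only": True}


def extract_count(response):
    """Pull result_count out of a decoded count-only response."""
    return int(response["results"]["result_count"])


# Response body copied from the curl example above
sample = {"results": {"result_count": 56}}
print(extract_count(sample))  # prints 56
```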
Dorks
When needed, your search-fu can be improved by using dorks within your search term:
- content-type:<content-type> (example: content-type:text/html)
- domain:<domain>
- ext:<filename extension>
- hastitle:<y|n|true|false|0|1> (whether results must have a title)
- matchtype:<title|url|any> (field that results should match on)
- mode:<and|exact|or> (set matching mode)
- prefix:<path>
For example, the following search would find HTML matches with foo and bar in the title or URL, but only if they have a title and only if they're under a path which starts with /docs:
foo bar hastitle:y prefix:/docs content-type:text/html
The following would enforce the same constraints but would return results with foo OR bar:
foo bar hastitle:y prefix:/docs content-type:text/html mode:or
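When search terms are built programmatically, dorks can be appended to the words with a small helper. The sketch below uses a hypothetical function, `build_term`, which is not part of the project. It does no quoting or escaping, because how the API treats dork values containing spaces is not described here.

```python
def build_term(words, dorks=None):
    """Join search words and dork filters into a single term string."""
    parts = list(words)
    # Each dork is rendered as name:value, in the order supplied
    for name, value in (dorks or {}).items():
        parts.append(f"{name}:{value}")
    return " ".join(parts)


term = build_term(
    ["foo", "bar"],
    {"hastitle": "y", "prefix": "/docs", "content-type": "text/html", "mode": "or"},
)
print(term)  # prints: foo bar hastitle:y prefix:/docs content-type:text/html mode:or
```

A plain dict is used rather than keyword arguments because content-type contains a hyphen and so cannot be a Python keyword argument name.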
Example
There is an example CLI script in examples/search_cli.py which communicates with the Search Portal in order to run searches and print the results to a shell.
The initial version of this CLI is reproduced below:
#!/usr/bin/env python3
#
# Run a search against file-listing and print the results in the CLI
#
#
import os
import requests
import sys
import shutil
SEARCH_URL = os.getenv("SEARCH_URL", False)
RESULT_LIMIT = int(os.getenv("RESULT_LIMIT", 0))
# Optional - setting these will add dorks to the search
SEARCH_PREFIX = os.getenv("SEARCH_PREFIX", False)
SEARCH_DOMAIN = os.getenv("SEARCH_DOMAIN", False)
def doSearch(query, url):
    ''' Make a call to file-location
    '''
    query_obj = {
        "term" : query,
        "type": "DOC"
    }

    if RESULT_LIMIT:
        query_obj["limit"] = RESULT_LIMIT

    if SEARCH_PREFIX:
        query_obj["term"] += f" prefix:{SEARCH_PREFIX}"

    if SEARCH_DOMAIN:
        query_obj["term"] += f" domain:{SEARCH_DOMAIN}"

    r = requests.post(url, json=query_obj)
    return r.json()


def printResults(j):
    ''' Iterate through results and print them
    '''
    end_esc='\033[0m'
    seen=[]

    colwidth = shutil.get_terminal_size().columns
    if colwidth > 90:
        colwidth = 90

    for r in j['results']:
        url = r['key']
        size = round(int(r['bytes']) / 1024 / 1024, 3)

        if url in seen:
            continue

        if len(r['title']) > 0:
            print(f'\33[1m\033[94m{r["title"]}{end_esc}')

        print(f'\33[3m\033[92m{url}{end_esc}')
        print('')
        print(f'{size} MiB\n')
        print(f'\33[90mIndexed At: {r["valid-at"]} {end_esc}')
        print(f'\33[90mLast Mod: {r["last-mod"]} {end_esc}')
        print('-' * colwidth)
        seen.append(url)


res = doSearch(sys.argv[1], SEARCH_URL)
printResults(res)
Monitoring Result Counts
It can sometimes be useful to monitor the number of results returned for a specific search (for example, to track the number of files indexed on a given domain).
This can be achieved using Telegraf's HTTP input plugin:
[[inputs.http]]
  interval = "15m"
  # Set term to the search string to use
  body = '{"count-only":true, "type":"DOC", "term":"mastodon"}'
  urls = ["http://127.0.0.1:5000/search"]
  method = "POST"
  tagexclude = ["host"]
  name_override = "file_location_results"
  data_format = "json"
  json_query = "results"
  tag_keys = ["term"]
This will result in line protocol like the following:
file_location_results,term=mastodon,url=http://127.0.0.1:5000/search result_count=1 1705753147000000000