Wiki: Searching/Utilities / File Location Listing

Searches are performed by POSTing a JSON payload to /search:

{
   "term":"a search term",
   "type":"DOC",
   "limit" : 300
}

Notes:

type must be one of DOC or IMAGE.
limit can be omitted (no limit is applied if it is not present).
term can be a single word or multiple words. It can also include dorks (see below).
The default operation is an AND search (see here for why we don't use OR by default). The mode can be changed with a dork.
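
As a rough sketch of the rules above, the payload could be assembled in Python along these lines. The helper name build_payload is hypothetical and not part of the project; it only encodes what the notes state (type restricted to DOC or IMAGE, limit left out when no limit is wanted).

```python
import json

VALID_TYPES = ("DOC", "IMAGE")


def build_payload(term, doc_type="DOC", limit=None):
    """Build the dict to POST to /search (hypothetical helper)."""
    if doc_type not in VALID_TYPES:
        raise ValueError(f"type must be one of {VALID_TYPES}, got {doc_type!r}")

    payload = {"term": term, "type": doc_type}

    # Leaving limit out entirely means no limit is applied
    if limit is not None:
        payload["limit"] = limit

    return payload


if __name__ == "__main__":
    print(json.dumps(build_payload("a search term", "DOC", 300)))
```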

Count Only

Since v0.2.5, the API can return a count of results rather than the results themselves. This allows for much faster responses because the results do not need to be loaded from storage.

This mode is enabled by including a count-only attribute in the POSTed payload and setting its value to true:

$ curl -d '{"term":"SD", "type": "DOC", "count-only": true}' http://127.0.0.1:5000/search/
{
  "results": {
    "result_count": 56
  }
}
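
The same request can be sketched in Python using only the standard library. The function names below are hypothetical, and sending a Content-Type of application/json is an assumption about what the API expects. extract_count relies only on the response shape shown above.

```python
import json
import urllib.request


def count_payload(term, doc_type="DOC"):
    """Payload asking the API for a result count only."""
    return {"term": term, "type": doc_type, "count-only": True}


def extract_count(body):
    """Pull result_count out of a decoded count-only response."""
    return int(body["results"]["result_count"])


def fetch_count(url, term, doc_type="DOC"):
    """POST a count-only search to url and return the count.

    Needs a running API. The Content-Type header is an assumption.
    """
    req = urllib.request.Request(
        url,
        data=json.dumps(count_payload(term, doc_type)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return extract_count(json.load(resp))
```

With an instance listening locally, fetch_count("http://127.0.0.1:5000/search", "SD") would return the count as an integer.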

Dorks

When needed, your search-fu can be improved by using dorks within your search term:

content-type:<content-type> (example: content-type:text/html)
domain:<domain>
ext:<filename extension>
hastitle:<y|n|true|false|0|1> (whether results must have a title)
matchtype:<title|url|any> (field that results should match on)
mode:<and|exact|or> (set matching mode)
prefix:<path>

For example, the following search finds HTML matches with foo and bar in the title or URL, but only if they have a title and only if they are under a path which starts with /docs:

foo bar hastitle:y prefix:/docs content-type:text/html

The following enforces the same constraints but returns results matching foo OR bar:

foo bar hastitle:y prefix:/docs content-type:text/html mode:or
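
Since dorks are simply appended to the search term, a term like the ones above can be composed programmatically. The following is a minimal sketch; build_term is a hypothetical helper, and it does no quoting or escaping, so it assumes the dork values contain no spaces.

```python
def build_term(words, **dorks):
    """Join plain search words and dork name/value pairs into one term.

    Hyphenated dork names such as content-type can be passed with an
    underscore (content_type), which is converted back to a hyphen.
    """
    parts = list(words)
    for name, value in dorks.items():
        parts.append(f"{name.replace('_', '-')}:{value}")
    return " ".join(parts)


if __name__ == "__main__":
    print(build_term(["foo", "bar"], hastitle="y", prefix="/docs",
                     content_type="text/html", mode="or"))
```

The resulting string is used as the value of term in the POSTed payload.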

Example

There is an example CLI script in examples/search_cli.py which communicates with the Search Portal to run searches and print the results to a shell.

The initial version of this CLI is reproduced below:

#!/usr/bin/env python3
#
# Run a search against file-listing and print the results in the CLI
#

import os
import requests
import sys
import shutil

SEARCH_URL = os.getenv("SEARCH_URL", False)
RESULT_LIMIT = int(os.getenv("RESULT_LIMIT", 0))

# Optional - setting these will add dorks to the search
SEARCH_PREFIX = os.getenv("SEARCH_PREFIX", False)
SEARCH_DOMAIN = os.getenv("SEARCH_DOMAIN", False)

def doSearch(query, url):
    ''' POST a search for query to the search endpoint at url
    and return the decoded JSON response
    '''
    query_obj = {
        "term": query,
        "type": "DOC"
    }

    if RESULT_LIMIT:
        query_obj["limit"] = RESULT_LIMIT

    if SEARCH_PREFIX:
        query_obj["term"] += f" prefix:{SEARCH_PREFIX}"

    if SEARCH_DOMAIN:
        query_obj["term"] += f" domain:{SEARCH_DOMAIN}"

    r = requests.post(url, json=query_obj)
    # Fail loudly on HTTP errors rather than trying to parse an error body
    r.raise_for_status()

    return r.json()


def printResults(j):
    ''' Iterate through results and print them, skipping duplicate URLs
    '''
    end_esc = '\033[0m'

    seen = set()
    # Cap the separator width at 90 columns
    colwidth = min(shutil.get_terminal_size().columns, 90)

    for r in j['results']:
        url = r['key']

        if url in seen:
            continue

        # Convert bytes to MiB
        size = round(int(r['bytes']) / 1024 / 1024, 3)

        # Title in bold blue, when there is one
        if r.get('title'):
            print(f'\033[1m\033[94m{r["title"]}{end_esc}')

        # URL in italic green
        print(f'\033[3m\033[92m{url}{end_esc}')
        print('')
        print(f'{size} MiB\n')
        print(f'\033[90mIndexed At: {r["valid-at"]} {end_esc}')
        print(f'\033[90mLast Mod: {r["last-mod"]} {end_esc}')
        print('-' * colwidth)

        seen.add(url)


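if __name__ == "__main__":
    # Both the search URL and a search term are required
    if not SEARCH_URL or len(sys.argv) < 2:
        print(f"Usage: SEARCH_URL=<url> {sys.argv[0]} <search term>", file=sys.stderr)
        sys.exit(1)

    res = doSearch(sys.argv[1], SEARCH_URL)
    printResults(res)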

Monitoring Result Counts

It can sometimes be useful to monitor the number of results returned for a specific search (for example, to track the number of files indexed on a given domain).

This can be achieved using Telegraf's HTTP input plugin:

[[inputs.http]]
  interval = "15m"

  # Set term to the search string to use
  body = '{"count-only":true, "type":"DOC", "term":"mastodon"}'

  urls = ["http://127.0.0.1:5000/search"]

  method = "POST"
  tagexclude = ["host"]
  name_override = "file_location_results"
  data_format = "json"
  json_query = "results"
  tag_keys = ["term"]

This will result in line protocol like the following:

file_location_results,term=mastodon,url=http://127.0.0.1:5000/search result_count=1 1705753147000000000