I put a text file up overnight to act as a list of hints to paths on a domain that's indexable but doesn't have a navigable structure.
The text file got indexed, but none of the content it points towards was.
After a little digging, I found a mistake within a try block which prevents extraction of these URLs:
```python
def extractUrlsFromText(plaintext):
    ''' Regex out any URLs in a block of text
    '''
    try:
        urls = re.findall("(https?://[^\s^\]^\)]+)", plaintext.decode())
        outgoinglinks = {}
        for link in urls:
            if not shouldCrawlURL(url, quiet=True):
```
`url` isn't defined; that call should be using `link`.
The `except` section for this simply returns an empty dict, so none of the URLs within the file get added for crawling.
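The silent failure is easy to reproduce in isolation: the `NameError` raised by the undefined `url` is swallowed by the bare except, so the caller just gets an empty dict back and sees "no URLs" rather than an error. A minimal sketch (the `shouldCrawlURL` stub and the dict values are assumptions, not the crawler's real code):

```python
import re

def shouldCrawlURL(url, quiet=False):
    # Hypothetical stand-in for the crawler's real eligibility check
    return True

def extractUrlsFromText(plaintext):
    ''' Regex out any URLs in a block of text '''
    try:
        urls = re.findall(r"(https?://[^\s^\]^\)]+)", plaintext.decode())
        outgoinglinks = {}
        for link in urls:
            # Bug: `url` was never assigned, so this raises NameError
            if not shouldCrawlURL(url, quiet=True):
                continue
            outgoinglinks[link] = True
        return outgoinglinks
    except Exception:
        # The NameError lands here and vanishes
        return {}

print(extractUrlsFromText(b"hints at https://example.com/hidden-page"))  # -> {}
```

Every call returns `{}`, which is exactly why the hint file itself got indexed but nothing it pointed at was ever queued.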
Activity
21-Jul-24 09:23
assigned to @btasker
21-Jul-24 09:24
Fixed by commit dade86cf0b32d9a34961b334b19d3ca9993cbe92
I've also updated the `except` to print a warning if it triggers (along with the exception which triggered it).
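The shape of the fix, for reference: test the loop variable `link`, and make the except noisy instead of silent. This is a sketch under the same assumptions as above (`shouldCrawlURL` stubbed, dict values guessed), not the committed code:

```python
import re

def shouldCrawlURL(url, quiet=False):
    # Hypothetical stand-in for the crawler's real eligibility check
    return True

def extractUrlsFromText(plaintext):
    ''' Regex out any URLs in a block of text '''
    try:
        urls = re.findall(r"(https?://[^\s^\]^\)]+)", plaintext.decode())
        outgoinglinks = {}
        for link in urls:
            # Fixed: check the loop variable, not the undefined `url`
            if not shouldCrawlURL(link, quiet=True):
                continue
            outgoinglinks[link] = True
        return outgoinglinks
    except Exception as exc:
        # Warn rather than failing silently
        print(f"Warning: URL extraction failed: {exc}")
        return {}
```

With that change, a hint file like the one above yields its URLs instead of an empty dict, and any future failure inside the try block at least leaves a trace in the logs.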