I put a text file up overnight to act as a list of hints to paths on a domain that's indexable but doesn't have a navigable structure.
The text file got indexed, but none of the content it points towards was.
After a little digging, I found a mistake within a `try` block which prevents extraction of these URLs:
```python
def extractUrlsFromText(plaintext):
    ''' Regex out any URLs in a block of text
    '''
    try:
        urls = re.findall("(https?://[^\s^\]^\)]+)", plaintext.decode())
        outgoinglinks = {}
        for link in urls:
            if not shouldCrawlURL(url, quiet=True):
```
`url` isn't defined, that call should be using `link`.
The `except` section for this simply returns an empty dict, so none of the URLs within the file get added for crawling.
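To illustrate the failure mode, here's a minimal runnable sketch (not the crawler's actual code: the bare `except` shape and the `shouldCrawlURL` stub are assumptions). The first loop iteration raises a `NameError`, which the catch-all handler silently converts into an empty dict:

```python
import re

def shouldCrawlURL(url, quiet=True):
    # Hypothetical stand-in for the crawler's real filter
    return True

def extractUrlsFromText(plaintext):
    ''' Regex out any URLs in a block of text
    '''
    try:
        urls = re.findall(r"(https?://[^\s^\]^\)]+)", plaintext.decode())
        outgoinglinks = {}
        for link in urls:
            # `url` was never assigned: this raises NameError on the first URL
            if not shouldCrawlURL(url, quiet=True):
                continue
            outgoinglinks[link] = True
        return outgoinglinks
    except:
        # The catch-all hides the NameError and returns nothing to crawl
        return {}

# Any input containing a URL silently comes back empty
print(extractUrlsFromText(b"hints: https://example.com/page"))  # {}
```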
Activity
21-Jul-24 09:23
assigned to @btasker
21-Jul-24 09:24
Fixed by commit dade86cf0b32d9a34961b334b19d3ca9993cbe92
I've also updated the `except` to print a warning if it triggers (along with the exception which triggered it).
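For reference, a sketch of what the fixed function plausibly looks like under the same assumptions, applying the two changes described (using `link`, and warning from the `except`); the actual change is in the commit referenced above:

```python
import re

def shouldCrawlURL(url, quiet=True):
    # Hypothetical stand-in for the crawler's real filter
    return True

def extractUrlsFromText(plaintext):
    ''' Regex out any URLs in a block of text
    '''
    try:
        urls = re.findall(r"(https?://[^\s^\]^\)]+)", plaintext.decode())
        outgoinglinks = {}
        for link in urls:
            # Use the loop variable rather than the undefined `url`
            if not shouldCrawlURL(link, quiet=True):
                continue
            outgoinglinks[link] = True
        return outgoinglinks
    except Exception as e:
        # Warn instead of failing silently
        print(f"WARN: URL extraction failed: {e}")
        return {}
```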