utilities/file_location_listing#61: URLs are not extracted from plaintext files



Issue Information

Issue Type: issue
Status: closed
Reported By: btasker
Assigned To: btasker

Milestone: v0.2.8
Created: 21-Jul-24 09:23



Description

I put a text file up overnight to act as a list of hints to paths on a domain that's indexable but doesn't have a navigable structure.

The text file got indexed, but none of the content it points towards was.

After a little digging, I found a mistake inside a try block that prevents these URLs from being extracted:

def extractUrlsFromText(plaintext):
    ''' Regex out any URLs in a block of text
    '''
    try:
        urls = re.findall("(https?://[^\s^\]^\)]+)", plaintext.decode())
        outgoinglinks = {}
        for link in urls:

            if not shouldCrawlURL(url, quiet=True):

url isn't defined at that point, so the call raises a NameError. It should be passing link instead.

The except section for this try block simply returns an empty dict, so the exception is swallowed silently and none of the URLs within the file get added for crawling.
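The failure mode can be reproduced in isolation. The snippet below is a minimal sketch, not the project's code: extract_links_buggy and its startswith check are hypothetical stand-ins that only mirror the shape of the bug (an undefined name used inside a try block whose handler returns an empty dict).

```python
def extract_links_buggy(items):
    """Illustrative only: mirrors the shape of the bug, not the project's code."""
    try:
        found = {}
        for link in items:
            # BUG: 'url' is never defined, so this raises NameError
            # on the first iteration. It should reference 'link'.
            if url.startswith("http"):
                found[link] = 1
        return found
    except Exception:
        # The broad handler hides the NameError and returns an empty result,
        # which looks the same as "no links were found".
        return {}


result = extract_links_buggy(["https://example.com/a"])
print(result)
```

Because the handler gives no indication that it fired, the caller can't tell an empty file apart from a crashed extraction, which is why the text file was indexed but nothing it pointed to was.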



Activity


assigned to @btasker

Fixed by commit dade86cf0b32d9a34961b334b19d3ca9993cbe92

I've also updated the except to print a warning if it triggers (along with the exception which triggered it), so a failure here is no longer silent.
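For reference, the corrected function might look roughly like the sketch below. This is not the content of the commit, just an illustration of the two changes described above. Assumptions: shouldCrawlURL is replaced by a hypothetical stand-in (its real logic isn't shown in this issue), the body of the loop after the check and the values stored in outgoinglinks are guesses, and the wording of the warning is invented. The regex is the one quoted in the description, written as a raw string.

```python
import re


def shouldCrawlURL(url, quiet=False):
    # Hypothetical stand-in: the real function's logic isn't shown in the issue.
    return url.startswith("http")


def extractUrlsFromText(plaintext):
    ''' Regex out any URLs in a block of text
    '''
    try:
        # Same pattern as in the issue, as a raw string so the backslash
        # escapes reach the regex engine unchanged.
        urls = re.findall(r"(https?://[^\s^\]^\)]+)", plaintext.decode())
        outgoinglinks = {}
        for link in urls:
            # Fixed: pass 'link' (the loop variable) rather than undefined 'url'
            if not shouldCrawlURL(link, quiet=True):
                continue
            # Assumption: the stored value is a placeholder; the real
            # structure of outgoinglinks isn't shown in the issue.
            outgoinglinks[link] = 1
        return outgoinglinks
    except Exception as e:
        # Report the failure instead of silently returning nothing
        print(f"Warning: failed to extract URLs from text: {e}")
        return {}


links = extractUrlsFromText(b"Hints: https://example.com/a and (http://example.org/b)")
print(sorted(links))
```

Passing something that isn't bytes (for example a str, which has no decode method) now prints the warning with the underlying exception before returning the empty dict.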