I put a text file up overnight to act as a list of hints to paths on a domain that's indexable but doesn't have a navigable structure.
The text file got indexed, but none of the content it points towards was.
After a little digging, I found a mistake within a try block which prevents extraction of these URLs:
```python
def extractUrlsFromText(plaintext):
    ''' Regex out any URLs in a block of text
    '''
    try:
        urls = re.findall("(https?://[^\s^\]^\)]+)", plaintext.decode())
        outgoinglinks = {}
        for link in urls:
            if not shouldCrawlURL(url, quiet=True):
```
`url` isn't defined; that call should be using `link`.
The `except` section for this simply returns an empty dict, so none of the URLs within the file get added for crawling.
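The silent failure is easy to reproduce in isolation: the `NameError` raised by the undefined `url` is swallowed by the bare except, so the caller just gets an empty dict back and sees "no URLs" rather than an error. A minimal sketch (the `shouldCrawlURL` stub and the dict values are assumptions, not the crawler's real code):

```python
import re

def shouldCrawlURL(url, quiet=False):
    # Hypothetical stand-in for the crawler's real eligibility check
    return True

def extractUrlsFromText(plaintext):
    ''' Regex out any URLs in a block of text '''
    try:
        urls = re.findall(r"(https?://[^\s^\]^\)]+)", plaintext.decode())
        outgoinglinks = {}
        for link in urls:
            # Bug: `url` was never assigned, so this raises NameError
            if not shouldCrawlURL(url, quiet=True):
                continue
            outgoinglinks[link] = True
        return outgoinglinks
    except Exception:
        # The NameError lands here and vanishes
        return {}

print(extractUrlsFromText(b"hints at https://example.com/hidden-page"))  # -> {}
```

Every call returns `{}`, which is exactly why the hint file itself got indexed but nothing it pointed at was ever queued.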
Activity
21-Jul-24 09:23
assigned to @btasker
21-Jul-24 09:24
Fixed by commit dade86cf0b32d9a34961b334b19d3ca9993cbe92
I've also updated the `except` to print a warning if it triggers (along with the exception which triggered it).
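The shape of the fix, for reference: test the loop variable `link`, and make the except noisy instead of silent. This is a sketch under the same assumptions as above (`shouldCrawlURL` stubbed, dict values guessed), not the committed code:

```python
import re

def shouldCrawlURL(url, quiet=False):
    # Hypothetical stand-in for the crawler's real eligibility check
    return True

def extractUrlsFromText(plaintext):
    ''' Regex out any URLs in a block of text '''
    try:
        urls = re.findall(r"(https?://[^\s^\]^\)]+)", plaintext.decode())
        outgoinglinks = {}
        for link in urls:
            # Fixed: check the loop variable, not the undefined `url`
            if not shouldCrawlURL(link, quiet=True):
                continue
            outgoinglinks[link] = True
        return outgoinglinks
    except Exception as exc:
        # Warn rather than failing silently
        print(f"Warning: URL extraction failed: {exc}")
        return {}
```

With that change, a hint file like the one above yields its URLs instead of an empty dict, and any future failure inside the try block at least leaves a trace in the logs.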