I'm in the process of deploying a public instance of this (jira-projects/CDN#65) and ran into an exception:
Traceback (most recent call last):
  File "/app/crawler/app/crawler.py", line 764, in <module>
    crawlPage(site, override = True)
  File "/app/crawler/app/crawler.py", line 577, in crawlPage
    if not shouldCrawlURL(url):
           ^^^^^^^^^^^^^^^^^^^
  File "/app/crawler/app/crawler.py", line 538, in shouldCrawlURL
    if config.shouldSkipURL(url, parsed.netloc):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/crawler/app/../../lib/config.py", line 213, in shouldSkipURL
    SITE_PERMIT_RULES = getPermitRules()
                        ^^^^^^^^^^^^^^^^
  File "/app/crawler/app/../../lib/config.py", line 176, in getPermitRules
    for line in l:
                ^
UnboundLocalError: cannot access local variable 'l' where it is not associated with a value
It may be that I've missed some config rather than it being a code bug. Either way, it should be handled better.
Activity
31-May-24 15:25
assigned to @btasker
31-May-24 15:26
Yeah, it's a bug: the file site-allowregexes.txt doesn't exist, so we don't open it, but we then try to iterate through the variable we would have used.
31-May-24 15:27
mentioned in commit 4d3d8e40ef2e2134619e0d2319f18eef9c510b29
Message
fix: don't error out if allow regexes aren't provided (utilities/file_location_listing#52)
31-May-24 15:28
The workaround for released versions is simply to create the file site-allowregexes.txt.
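
For anyone hitting the same traceback, here is a rough sketch of the failure mode and one way to guard against it. It is not the code from the commit: the function names, the `path` parameter, and the assumption that the file holds one regex per line are all illustrative.

```python
import os
import re


def get_permit_rules_buggy(path):
    # Hypothetical reproduction: 'lines' is only bound when the file exists.
    if os.path.exists(path):
        with open(path) as f:
            lines = f.readlines()
    rules = []
    for line in lines:  # raises UnboundLocalError if the file was missing
        rules.append(line.strip())
    return rules


def get_permit_rules(path):
    # One possible guard: treat a missing file as "no allow rules".
    rules = []
    try:
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line:  # skip blank lines
                    # Assumption: each non-blank line is a regular expression.
                    rules.append(re.compile(line))
    except FileNotFoundError:
        pass
    return rules
```

With the guarded version, a missing site-allowregexes.txt yields an empty rule list rather than an exception, which is the same end state the workaround reaches by creating the file.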
31-May-24 15:29
mentioned in issue jira-projects/CDN#65