MISC-36: Set up a Yacy instance



Issue Information

Issue Type: Task
 
Priority: Major
Status: Closed

Reported By:
Ben Tasker
Assigned To:
Ben Tasker
Project: Miscellaneous (MISC)
Resolution: Done (2019-12-31 17:03:48)
Labels: engine, indexing, search, yacy

Created: 2019-12-22 09:35:28
Time Spent Working


Description
In LAN-119 I've been playing with Yacy on a Raspberry Pi as a potential replacement for Sphider.

I've fairly mixed feelings about it at the moment as I keep finding issues. What isn't clear, though, is how many of those are caused by

- Being on a Raspberry Pi, and/or
- my having fumbled around to set it up

So, what I'd like to do is to get a "clean" public Yacy instance up and running to index the properties on the CDN. The idea in part being to get them into the SE results of anyone using a public Yacy instance, but also to see how it performs in that environment.

Longer term it could potentially be a small value-add - getting properties on the CDN automatically indexed (with fewer eyeballs, but more control than when just firing a sitemap at Google/Bing etc).


Issue Links

Notes (Extnotes)
Yacy Homepage

Activity


-------------------------
From: git@rimmer.home
To: jira@chaos.home
Date: None
Subject: CDN-28 Add RR for yacy.bentasker.co.uk
-------------------------


Repo: chaos_dns
Host:hiyori

commit b273b4d0074806dc241a8a2029bb7e0379e84622
Author: root <root@gerbil.it>
Date: Sun Dec 22 10:08:58 2019 +0000

Commit Message: CDN-28 Add RR for yacy.bentasker.co.uk

bentasker.co.uk.zone | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)



Notes for this are in extnotes.

I have a public instance up and running, and with a bit of hackery it should reindex my stuff daily.

I'm not altogether sure I'm going to keep it though to be honest. Yacy is a nice concept with a lot of potential, but the implementation really leaves a lot to be desired.

It just seems like constant what-the-fucks and aggravation.

The Debian install instructions are here - https://wiki.yacy.net/index.php/En:DebianInstall - but you can't use them because their repo has been broken since 2017: its Release file has an invalid signature - https://github.com/yacy/yacy_search_server/issues/124
W: GPG error: http://debian.yacy.net ./ Release: The following signatures were invalid: 8BD752501CB62448A30EA3EA1F968B3903D886E7
W: The repository 'http://debian.yacy.net ./ Release' is not signed.
N: Data from such a repository can't be authenticated and is therefore potentially dangerous to use.
N: See apt-secure(8) manpage for repository creation and user configuration details.


So, a manual install is required instead (meaning you get no init/unit files or "best-practice" preconfiguration).

Once you've got it installed, a lot of the concerns/criticisms I found in LAN-119 apply here too. Although I've formed a better understanding of how Yacy works in a P2P setting, I just don't have enough confidence in the results it gives.

The network design will, as far as I can make out, favour the most popular search terms (copies of those indexes will be well distributed), but searching for more niche terms will be extremely hit and miss - when doing a "remote" search, the local peer picks a handful of peers at random to ask. If they don't have it, you'll get no results even if the content is actually indexed within the network.

It also means that relatively new stuff may take a disproportionate amount of time to begin appearing in search results at the network level. Whenever a peer fetches results from a remote, it stores a copy of the index for that page - so in theory at least relevant pages should ultimately end up distributed across most of the network.

But, there are (currently) 392 peers in the network - each will pick just a small handful of peers for each remote query, so propagation is likely to take some time - particularly as the Peer-to-Peer Network reports the network as doing just 50 queries per hour (I don't know the granularity though, is it lower because it's the weekend?).
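To put rough numbers on the hit-and-miss concern, here's a back-of-envelope sketch. It assumes (and this is an assumption, not something confirmed against Yacy's code) that a remote search asks a fixed number of peers chosen uniformly at random, without replacement. The 392 is the peer count quoted above; the fan-out of 5 and the holder counts are made-up illustrative values.

```python
from math import comb


def hit_probability(total_peers, holders, asked):
    """Chance that at least one of `asked` randomly chosen peers holds the term.

    total_peers: peers that could be asked
    holders:     how many of those peers hold an index entry for the term
    asked:       how many peers a single remote search queries (hypothetical)
    """
    # Probability that every chosen peer is a non-holder, then invert it.
    all_miss = comb(total_peers - holders, asked) / comb(total_peers, asked)
    return 1 - all_miss


# Illustrative values only: 392 peers (from the network page), fan-out of 5.
for holders in (1, 5, 20, 100):
    p = hit_probability(392, holders, 5)
    print(f"{holders:>3} holders -> {p:.1%} chance of a hit")
```

The shape matters more than the exact figures: with few holders the chance of a hit stays small, which matches the intuition that niche terms will be hit and miss.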

None of this is to say that yacy doesn't work, just that I may see very little benefit from having it online (versus the cost of maintaining the instance). I'm not going to be using it to search my stuff (I'll be using Sphider because that has access to my internal docs too), and if its ability to export results out to the rest of the network is limited, then I'm unlikely to see much change in traffic levels to my properties.

Either way, I'll let it run for a week or so and keep tinkering, but my early impression on this is much like that under LAN-119 - it's a nice idea but doesn't really translate very well into the real world.

Reindexing paused overnight, with the only notification of this being on the Crawl status page (even then it's quite unobtrusive):
pause reason: resource observer: not enough memory space


Easily resolved, of course, but I can imagine it could quite easily sit there for days not indexing because the state isn't clearly visible.

Also a little concerning - that's quite a lot of memory used (insert java joke here). The default setting is to reserve 600MB for Java, and I've really not indexed all that much yet. Bumped to 1200, anyway.

The real concern though is that it means indexing just stops. Ideally, if a specific page/resource is going to require more RAM than can be allocated, you'd hope the crawler (or its parent) would trap the exception and skip that page so that the crawl could continue (preferably raising an alert specifying the URL that was skipped).

This was a major concern with my LAN-119 experiment with using it internally too:
Crawls are incredibly unreliable because Yacy will silently (well, as good as, the notification is on one page and very unobtrusive) pause the crawler when it decides there's not enough RAM to index a document. A sane system would probably skip that document and continue on, but the real issue is that it means you cannot assume that crawls will actually be taking place as they should - they may very well be being abandoned part-way through. There doesn't appear to be a good way to monitor this other than to sit and look at the web interface either.
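One crude way to avoid sitting and watching the interface would be to scrape the status text and alert on the pause message. The sketch below only covers the matching step: it assumes the status page has already been fetched and reduced to plain text, and that the message keeps the "pause reason: ..." form seen above. How the page is fetched (URL, auth) is deliberately left out, as that isn't covered in these notes.

```python
import re

# Matches the message format seen on the Crawl status page, e.g.
#   pause reason: resource observer: not enough memory space
PAUSE_RE = re.compile(r"pause reason:\s*(.+)", re.IGNORECASE)


def find_pause_reason(page_text):
    """Return the pause reason if the crawler reports being paused, else None."""
    match = PAUSE_RE.search(page_text)
    return match.group(1).strip() if match else None


status_text = "pause reason: resource observer: not enough memory space"
reason = find_pause_reason(status_text)
if reason:
    print(f"ALERT: crawler paused ({reason})")
```

Run from cron against the fetched page text, this would at least turn a silent pause into a notification.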
btasker changed Project from 'CDN' to 'Miscellaneous'
btasker changed Key from 'CDN-28' to 'MISC-36'
I've moved this from CDN-28 to MISC-36

I'll let the instance run a little longer so I can learn more about it, but realistically I'm not going to keep this instance in the longer term, and I'm certainly not going to even attempt integration into the CDN.

It's a nice idea, but Yacy just doesn't seem to be quite ready for headless unattended operation yet, and I don't really want the maintenance burden of keeping the node running and making sure it's actually indexing stuff. Especially given that the network design appears to mean that indexed pages may still not end up in the results of relevant searches (see comment above).

Admittedly, some of this might be operator error, but the project's wiki regularly times out and returns errors rather than content, so it's difficult to search for and find things within the (fairly limited) documentation. Lack of reliable doc searchability isn't a great look for a search engine project :( If it wasn't for the commit log in GitHub, I'd be inclined to think the project had been abandoned.

I'm also not particularly a fan of the fact that the web/search service is on the same port as the Peer-to-peer service (seems I'm not the only one - https://github.com/yacy/yacy_search_server/issues/315). I'd feel much more comfortable if there was some semblance of service isolation so that access could be restricted to the search service (reducing the attack surface there) whilst keeping open the bit that actually needs to be open to the world.
btasker added 'engine indexing search yacy' to labels
Actually on the topic of the Search and P2P being on the same port, there's a vaguely related tangent that strikes me as odd.

- Yacy supports performing its P2P operations over HTTPS. But, it doesn't validate certificates.
- As a result, there's no protection against MiTM, but it does prevent casual observers from seeing the payloads

However, use of HTTPS for P2P is off by default. A vanilla install of yacy doesn't do HTTPS at all, you need to provide it with a cert and enable it. Presumably because users will get a cert error from their browser every time they go to the portal to do a search.

If P2P were on a separate port then a yacy install could do HTTPS by default, by generating itself a snake-oil cert on first run - it's not like other peers are going to validate the cert anyway (the wisdom of that is somewhat dubious too, particularly as there's no option to allow an admin to require validation - but I guess it probably relates to the usual java headaches about its keystore when playing with HTTPS).

Under Index Export/Import there's the option to export the indexes in a few formats (including JSON so that you can post it straight into Elasticsearch - handy).

One of the other options it'll give you though is a HTML list of all the URLs in the index. That's potentially handy for identifying non-SEF (or otherwise incorrect) paths that are being exposed by CMSs (like Joomla) - for a URL to get into the index it's got to be linked to from somewhere, particularly as the instance only indexes my sites. There'll be a couple of others in there because I played around with search, but we can extract those.

Figured I'd have a play around with it

It's actually pretty sizeable:
ben@milleniumfalcon:~/tmp/MISC-36-Yacy_instance_setup$ ls -sh
total 41M
41M yacy_dump_f200710060000_l201912260527_n201912261114_c000000417800_tc.html


OK, let's strip out the stuff that isn't under bentasker.co.uk
ben@milleniumfalcon:~/tmp/MISC-36-Yacy_instance_setup$ egrep -e "html>|bentasker.co.uk" yacy_dump_f200710060000_l201912260527_n201912261114_c000000417800_tc.html > my_domains.html
ben@milleniumfalcon:~/tmp/MISC-36-Yacy_instance_setup$ ls -sh
total 50M
9.1M my_domains.html   41M yacy_dump_f200710060000_l201912260527_n201912261114_c000000417800_tc.html


OK.... maybe there are more than a few that came in as a result of my searches then.

Let's list out which domains are in there
ben@milleniumfalcon:~/tmp/MISC-36-Yacy_instance_setup$ grep -o -P 'href="https://[^\.]+\.bentasker.co.uk' my_domains.html | sed 's~href="https://~~g' | sort | uniq
dns.bentasker.co.uk
mailarchives.bentasker.co.uk
projects.bentasker.co.uk
projectsstatic.bentasker.co.uk
recipebook.bentasker.co.uk
snippets.bentasker.co.uk
www.bentasker.co.uk


Now, how many of each
ben@milleniumfalcon:~/tmp/MISC-36-Yacy_instance_setup$ grep -o -P 'href="https://[^\.]+\.bentasker.co.uk' my_domains.html | sed 's~href="https://~~g' | sort | uniq -c | sort -nr
  58732 mailarchives.bentasker.co.uk
   2658 www.bentasker.co.uk
    138 snippets.bentasker.co.uk
     88 projectsstatic.bentasker.co.uk
     84 recipebook.bentasker.co.uk
     10 dns.bentasker.co.uk
      5 projects.bentasker.co.uk
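For anyone wanting to repeat the per-host count without the grep/sed/sort/uniq chain, a rough Python equivalent is sketched below. It isn't byte-for-byte identical to the pipeline: the pattern here escapes the dots and also refuses to match across quotes or slashes, and the sample HTML is invented for illustration.

```python
import re
from collections import Counter

# Capture "<label>.bentasker.co.uk" from https links in the exported URL list.
HOST_RE = re.compile(r'href="https://([^./"]+\.bentasker\.co\.uk)')


def count_hosts(html):
    """Count how many links in the dump point at each bentasker.co.uk subdomain."""
    return Counter(HOST_RE.findall(html))


# Invented sample input, standing in for my_domains.html
sample = (
    '<a href="https://www.bentasker.co.uk/blog">a</a>\n'
    '<a href="https://www.bentasker.co.uk/tags">b</a>\n'
    '<a href="https://dns.bentasker.co.uk/">c</a>\n'
)
for host, count in count_hosts(sample).most_common():
    print(f"{count:>7} {host}")
```

Against the real file it would be something like `count_hosts(open("my_domains.html").read())`.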


Only 5 under projects.bentasker.co.uk?

Actually, under Index Browser I don't see it in the Host list at all (there are a lot of other domains there though, which explains why the URL list was so bloody big). Searching for it shows it's stored 4 pages. It's definitely listed for crawling: I can see this morning's crawl being triggered in the Process Scheduler.

Triggering a manual crawl of it shows all the expected paths, and using Target Analysis on various URLs shows no sign of issue. That's really weird, there's probably a reason for it, but on the face of it it really reinforces some of my earlier concerns about the reliability of indexing/searches.

Anyway, back to my URL lists
ben@milleniumfalcon:~/tmp/MISC-36-Yacy_instance_setup$ egrep -e 'html>|href="https://www.bentasker.co.uk' my_domains.html > www_bentasker_co_uk.html


Which creates this - https://projectsstatic.bentasker.co.uk/MISC/MISC-36-Yacy_instance_setup/www_bentasker_co_uk.html?a

Let's see where the majority of URLs are pointing
ben@milleniumfalcon:~/tmp/MISC-36-Yacy_instance_setup$ grep -o -P '<a href="https://www.bentasker.co.uk/([^/,^\?,^%,^"]+)' www_bentasker_co_uk.html  | sed 's~<a href="https://www.bentasker.co.uk~~g' | sort | uniq -c | sort -nr
    567 /images
    544 /tags
    527 /component
    431 /blog
    331 /documentation
     41 /portfolio
     38 /projects
     30 /videos
     30 /shoparchives
     21 /all-whitepapers
     18 /projectnews
     11 /adblock
      9 /feeds
      9 /about-me
      7 /search
      7 /links
      6 /cookies
      4 /privacy-policy
      4 /14-site-information
      3 /your-stored-data
      3 /login
      3 /index.php
      2 /services
      2 /attachments
      2 /52-site-information
      1 /sitemaphtml
      1 /puzzles
      1 /photos-archive
      1 /misc
      1 /licensedetails
      1 /58-uncategorised
      1 /27-site-information

Makes sense, the highest numbers are for things that get re-used across posts: tags and images.
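The same grouping can be sketched in Python, which avoids the slightly awkward character class in the grep pattern by letting `urlsplit` separate the path from the query string. This is an approximation rather than a drop-in replacement: percent-encoded characters stay part of the segment here, whereas the grep pattern stops at `%`. The `depth` parameter is an addition for drilling into `/component/...` and the sample links are illustrative.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

HREF_RE = re.compile(r'<a href="(https://www\.bentasker\.co\.uk/[^"]*)"')


def top_level_paths(html, depth=1):
    """Count links by their first `depth` path segments."""
    counts = Counter()
    for url in HREF_RE.findall(html):
        segments = [s for s in urlsplit(url).path.split("/") if s]
        if len(segments) >= depth:
            counts["/" + "/".join(segments[:depth])] += 1
    return counts


# Illustrative input standing in for www_bentasker_co_uk.html
html = (
    '<a href="https://www.bentasker.co.uk/tags/135-republished">x</a>\n'
    '<a href="https://www.bentasker.co.uk/tags/323-bitfi">y</a>\n'
    '<a href="https://www.bentasker.co.uk/component/content/article?id=471">z</a>\n'
)
for path, count in top_level_paths(html).most_common():
    print(f"{count:>7} {path}")
```

Calling `top_level_paths(html, depth=2)` gives the per-component breakdown.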

Those component paths are a sign of Joomla's SEF not rewriting stuff though, that's the sort of thing we're really interested in. Which components are we pointing at?
ben@milleniumfalcon:~/tmp/MISC-36-Yacy_instance_setup$ grep -o -P '<a href="https://www.bentasker.co.uk/component/([^/,^\?,^%,^"]+)' www_bentasker_co_uk.html  | sed 's~<a href="https://www.bentasker.co.uk~~g' | sort | uniq -c | sort -nr
    515 /component/content
     11 /component/jshopping
      1 /component/tags


jshopping is an extension I used to use, it shouldn't even be installed now. Content should be linked to via SEF URLs but isn't always. And a singular link to tags....
ben@milleniumfalcon:~/tmp/MISC-36-Yacy_instance_setup$ grep -o -P '<a href="https://www.bentasker.co.uk/component/(tags|jshopping)/([^"]+)' www_bentasker_co_uk.html  | sed 's~<a href="https://www.bentasker.co.uk~~g' | sort | uniq -c | sort -nr
      1 /component/tags/tag/135-republished
      1 /component/jshopping/%3Fcontroller=product%26task=view%26category_id=1%26product_id=9
      1 /component/jshopping/%3Fcontroller=product%26task=view%26category_id=1%26product_id=8
      1 /component/jshopping/%3Fcontroller=product%26task=view%26category_id=1%26product_id=7
      1 /component/jshopping/%3Fcontroller=product%26task=view%26category_id=1%26product_id=3
      1 /component/jshopping/%3Fcontroller=product%26task=view%26category_id=1%26product_id=24
      1 /component/jshopping/%3Fcontroller=product%26task=view%26category_id=1%26product_id=23
      1 /component/jshopping/%3Fcontroller=product%26task=view%26category_id=1%26product_id=22
      1 /component/jshopping/%3Fcontroller=product%26task=view%26category_id=1%26product_id=21
      1 /component/jshopping/%3Fcontroller=product%26task=view%26category_id=1%26product_id=20
      1 /component/jshopping/%3Fcontroller=product%26task=view%26category_id=1%26product_id=2
      1 /component/jshopping/%3Fcontroller=product%26task=view%26category_id=1%26product_id=1

There are 2 jobs then really, finding where those URLs are referenced and correcting, and making sure there's a redirect in place for them. I won't do that under this issue, as it belongs in the site-specific project.

But, continuing on in the spirit of this issue - we can find some information out by using Yacy itself. Using the front-end search to search for /component/tags/tag/135-republished we can see that it's referred to by the site's search results page when searching for republished - https://www.bentasker.co.uk/search?q=republished

It's curious though that that URL has been stored, because although the search results embed a link to /component/tags/tag/135-republished that URL correctly redirects to the SEF one: https://www.bentasker.co.uk/tags/135-republished

When using Target Analysis on it, the interface reports a 200. That turns out to be misleading: the edge logs show that it got a 301 to the correct URL, and then got a 200
root@mikasa:~# tail -F /var/log/nginx/access.log | grep republished
209.97.139.101	-	-	[26/Dec/2019:11:53:53 +0000]	"GET /component/tags/tag/135-republished HTTP/1.1"	301	191	"-"	"yacybot (/global; amd64 Linux 4.9.0-11-amd64; java 1.8.0_232; Etc/en) http://yacy.net/bot.html"	"-"	"www.bentasker.co.uk"	CACHE_-	0.000	mikasa	-	"-"	"-"	"-"
209.97.139.101	-	-	[26/Dec/2019:11:53:53 +0000]	"GET /tags/135-republished HTTP/1.1"	200	11598	"-"	"yacybot (/global; amd64 Linux 4.9.0-11-amd64; java 1.8.0_232; Etc/en) http://yacy.net/bot.html"	"-"	"www.bentasker.co.uk"	CACHE_HIT	0.001	mikasa	-	"-"	"-"	"-"
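Pulling the request and status out of those log lines is easy to script. The sketch below assumes the tab-separated layout visible in the two lines above (request in the fifth field, status in the sixth); those field positions are inferred from the output rather than from the log_format definition, so treat them as an assumption.

```python
def request_and_status(line):
    """Extract (method, path, status) from one tab-separated access log line."""
    fields = line.rstrip("\n").split("\t")
    # Field 5 is the quoted request line, e.g. "GET /path HTTP/1.1"
    method, path, _protocol = fields[4].strip('"').split(" ", 2)
    # Field 6 is the response status code
    return method, path, int(fields[5])


# Shortened copies of the two log lines above
lines = [
    '209.97.139.101\t-\t-\t[26/Dec/2019:11:53:53 +0000]\t'
    '"GET /component/tags/tag/135-republished HTTP/1.1"\t301\t191',
    '209.97.139.101\t-\t-\t[26/Dec/2019:11:53:53 +0000]\t'
    '"GET /tags/135-republished HTTP/1.1"\t200\t11598',
]
for line in lines:
    method, path, status = request_and_status(line)
    print(status, method, path)
```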


Looking in Yacy's admin interface (http://yacy.bentasker.co.uk:8090/ViewFile.html?url=https://www.bentasker.co.uk/component/tags/tag/135-republished), we can see that the page was indexed under the original URL. So, Yacy (or maybe Solr under it) doesn't correctly handle indexing pages that result in a 301 - the new destination is stored under the original URL. Doing a quick check, the actual URL (i.e. the destination of the 301) doesn't exist in the index. That's a pretty significant screw-up :(

That rather negates what I've been doing so far too - although some sources (like search pages) might refer to a non-SEF URL, the index won't actually represent whether that non-SEF version has been correctly redirected to the SEF version. With 515 possibly affected URLs under /component/content that's a lot of cruft to have to sift out manually.
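Since the index can't say whether a non-SEF URL redirects, the only reliable check is to request each one without following redirects. A minimal sketch of that is below. `first_hop` assumes https URLs, and the "/component/" substring test in `classify` is a simplistic heuristic of my own for "still non-SEF", not anything Joomla or Yacy provides.

```python
import http.client
from urllib.parse import urlsplit

REDIRECT_CODES = {301, 302, 303, 307, 308}


def first_hop(url, timeout=10):
    """HEAD the URL and return (status, Location) without following redirects."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    # http.client never follows redirects, so the first response is what we get
    conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    try:
        conn.request("HEAD", target)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def classify(status, location):
    """Describe what the first response says about a non-SEF URL."""
    if status in REDIRECT_CODES:
        if not location:
            return "redirect without a Location header"
        if "/component/" in location:
            return "redirects, but target is still non-SEF"
        return "redirects to a SEF-looking URL"
    if status == 200:
        return "served directly (no redirect)"
    return f"unexpected status {status}"
```

For each URL pulled out of the list, `classify(*first_hop(url))` gives a one-line verdict, which should make sifting the 515 `/component/content` entries less manual.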

I'm going to take 1 of those URLs just to see if it's affected.

- https://www.bentasker.co.uk/component/content/article?id=471:the-curious-case-of-bitfi-and-secret-persistence
- Redirects to https://www.bentasker.co.uk/component/content/article/471-the-curious-case-of-bitfi-and-secret-persistence

Which is actually incorrect, as it should end up at https://www.bentasker.co.uk/blog/security/471-the-curious-case-of-bitfi-and-secret-persistence. Search results claim that the HTML sitemap cites this URL, but it doesn't appear to:
ben@milleniumfalcon:~/tmp/MISC-36-Yacy_instance_setup$ curl -s https://www.bentasker.co.uk/sitemaphtml | grep bitfi
<li><a href="/blog/security/471-the-curious-case-of-bitfi-and-secret-persistence" title="The Curious Case of BitFi and Secret Persistence">The Curious Case of BitFi and Secret Persistence</a></li>
</li><li><a href="/tags/323-bitfi" title="Bitfi">Bitfi</a>
<li><a href="/blog/security/471-the-curious-case-of-bitfi-and-secret-persistence" title="The Curious Case of BitFi and Secret Persistence">The Curious Case of BitFi and Secret Persistence</a></li>
<li><a href="/blog/security/471-the-curious-case-of-bitfi-and-secret-persistence" title="The Curious Case of BitFi and Secret Persistence">The Curious Case of BitFi and Secret Persistence</a></li>
<li><a href="/blog/security/471-the-curious-case-of-bitfi-and-secret-persistence" title="The Curious Case of BitFi and Secret Persistence">The Curious Case of BitFi and Secret Persistence</a></li>


Although it returns a bunch of citations for this URL, I can't find a reference to it anywhere. Also, one of the citing URLs listed is /blog/security?1'=1 - I'm curious now to see where that test SQLi attempt comes from.... Can't seem to extract anything from Yacy though - it won't return it in search results because even when quoted it seems to treat it as a pattern.

I'm not sure where those original component/content URLs are getting referenced from then as the information Yacy is giving doesn't appear to align with test results, but getting redirects pointing to the correct place belongs under the project for that site.

The recrawl of projects.bentasker.co.uk has finished, so dumping out another URL list to see whether the paths make it in
root@debian-yacy:~# grep projects.bentasker.co.uk /home/yacyuser/yacy/DATA/EXPORT/yacy_dump_f200710060000_l201912261207_n201912261218_c000000427096_tc.html | wc -l
1008

That'd suggest there's no reason the pages shouldn't have been in there in the first place (and they definitely were previously). Worrying.
The previous comment ended up being quite long, so to summarise

- Although it definitely was crawled, one of my subdomains disappeared from the index almost entirely (with just 5 pages left in there)
- Yacy doesn't handle 301s correctly. It indexes the (new) destination under the URL of the old one. So if you redirect foo to bar then the content of bar will end up in the index, but under url foo (and bar won't make it in unless linked to from elsewhere). The Admin interface's target analysis will just show you that Yacy received a 200, when that's not the case
- It's somehow managed to get a list of non-SEF URLs for my site, but none of the citations it gives actually includes those (this may well be a Joomla issue rather than a Yacy one in fairness)
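The 301 behaviour is easier to see with a toy model. The sketch below is purely illustrative and has nothing to do with Yacy's actual code: a dict stands in for the site, and `key_by` switches between storing the page under the URL that was requested (what was observed) and the URL that finally answered 200 (what you'd expect).

```python
def crawl(start_url, site, key_by="requested", max_hops=10):
    """Follow 301s through a fake site and return {stored_url: content}.

    site maps URL -> (status, payload), where payload is the Location
    for a 301 and the page body for a 200.
    """
    url = start_url
    for _ in range(max_hops):
        status, payload = site[url]
        if status == 301:
            url = payload  # follow the redirect
            continue
        stored_under = start_url if key_by == "requested" else url
        return {stored_under: payload}
    raise RuntimeError("too many redirects")


site = {"foo": (301, "bar"), "bar": (200, "content of bar")}

print(crawl("foo", site))                  # keyed by the URL that was requested
print(crawl("foo", site, key_by="final"))  # keyed by the redirect destination
```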
Raising an upstream bug for the 301 thing, and decided I want to see whether it occurs across domains (i.e. can I get content indexed under entirely the wrong domain)

Under my edge config for bentasker.uk I've added the following block
        # MISC-38
        # I want to prove whether Yacy will misindex across domains
        location = /misc38_test_file.html {
                return 200 '<html><body><a href="misc38_redirme.html">An old link</a></body></html>';
        }

        location = /misc38_redirme.html {
                return 301 https://projectsstatic.bentasker.co.uk/MISC/MISC-36-Yacy_instance_setup/test_file.html;
        }



Note that we're redirecting to an entirely different registered domain, not just a different subdomain - bentasker.uk to bentasker.co.uk. The file it redirects to is just this simple HTML test file - https://projectsstatic.bentasker.co.uk/MISC/MISC-36-Yacy_instance_setup/test_file.html

So, in Yacy's admin interface, need to submit a crawl of http://bentasker.uk/misc38_test_file.html and treat it as a link list

Ah, it receives the redirect but doesn't follow it (presumably because it's in a different domain). What happens if we use https://bentasker.uk/misc38_redirme.html as the start point? Looks like the same thing.

So, it looks like it's not possible (or at least not trivial) to do cross-domain poisoning.
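If the guess above is right and the redirect is dropped because it leaves the original host, the check involved would look something like the sketch below. This is a guess at the kind of rule in play, not a description of Yacy's implementation; the function name is hypothetical.

```python
from urllib.parse import urljoin, urlsplit


def crosses_host(source_url, location):
    """True if following `location` from `source_url` would change hostname."""
    # Location may be relative, so resolve it against the source first
    target = urljoin(source_url, location)
    return urlsplit(source_url).hostname != urlsplit(target).hostname


# The cross-domain test from this comment
print(crosses_host(
    "https://bentasker.uk/misc38_redirme.html",
    "https://projectsstatic.bentasker.co.uk/MISC/MISC-36-Yacy_instance_setup/test_file.html",
))
# The same-host redirect that did get (mis)indexed earlier
print(crosses_host(
    "https://www.bentasker.co.uk/component/tags/tag/135-republished",
    "/tags/135-republished",
))
```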

Raised upstream bug 320 - https://github.com/yacy/yacy_search_server/issues/320
I had been planning on killing the Yacy instance off today, but I'm going to leave it for now as I want to see whether projects disappears again or not
The records for projects.bentasker.co.uk were still in the index, so really not sure what happened before.

Instance has been killed off and the related DNS records removed
btasker changed status from 'Open' to 'Resolved'
btasker added 'Done' to resolution
btasker changed status from 'Resolved' to 'Closed'
-------------------------
From: git@rimmer.home
To: jira@chaos.home
Date: None
Subject: MISC-36 Remove RRs for yacy.bentasker.co.uk
-------------------------


Repo: chaos_dns
Host:hiyori

commit 0b29acef09f0a98b7c74d5d8feba44dc82238a48
Author: root <root@gerbil.it>
Date: Tue Dec 31 17:30:50 2019 +0000

Commit Message: MISC-36 Remove RRs for yacy.bentasker.co.uk

bentasker.co.uk.zone | 6 +-----
1 file changed, 1 insertion(+), 5 deletions(-)

