Description
In LAN-119 I've been playing with Yacy on a Raspberry Pi as a potential replacement for Sphider.
I have fairly mixed feelings about it at the moment, as I keep finding issues. What isn't clear, though, is how many of those are caused by:
- Being on a Raspberry Pi, and/or
- My having fumbled around during setup
So, what I'd like to do is get a "clean" public Yacy instance up and running to index the properties on the CDN. The idea, in part, is to get them into the search results of anyone using a public Yacy instance, but also to see how it performs in that environment.
Longer term it could potentially be a small value-add - getting properties on the CDN automatically indexed (with fewer eyeballs, but more control than when just firing a sitemap at Google/Bing etc).
Activity
2019-12-23 11:30:51
I have a public instance up and running, and with a bit of hackery it should reindex my stuff daily.
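For reference, the hackery amounts to poking the crawler API on a schedule. A minimal sketch, assuming digest auth on the admin account and that Crawler_p.html accepts a crawlingURL parameter (endpoint and parameter names should be verified against your own instance):

    #!/bin/sh
    # /usr/local/bin/yacy-recrawl.sh (sketch) - ask Yacy to recrawl a site
    # YACY_PASS is assumed to hold the admin password
    curl -s --digest -u "admin:${YACY_PASS}" \
        --data-urlencode 'crawlingMode=url' \
        --data-urlencode 'crawlingURL=https://www.bentasker.co.uk/' \
        'http://127.0.0.1:8090/Crawler_p.html' >/dev/null

    # then schedule it daily from cron:
    # 0 4 * * * /usr/local/bin/yacy-recrawl.sh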
I'm not altogether sure I'm going to keep it though to be honest.
It just seems like constant what-the-fucks and aggravation.
The Debian install instructions are here - https://wiki.yacy.net/index.php/En:DebianInstall - but you can't use them, because the repo has been broken (with an invalid signature) since 2017 - https://github.com/yacy/yacy_search_server/issues/124
So, a manual install is required instead (meaning you get no init/unit files or "best-practice" preconfiguration).
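For the record, the manual route looks something like the below (a sketch - grab the current release tarball from yacy.net; startYACY.sh and stopYACY.sh are the launcher scripts the release tarballs have shipped with):

    # after fetching the release tarball from yacy.net:
    tar xzf yacy_*.tar.gz
    cd yacy
    ./startYACY.sh    # no unit file, so supervision is on you
    # and to stop it again:
    ./stopYACY.sh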
Once you've got it installed, a lot of the concerns/criticisms I found in LAN-119 apply here too. Although I've formed a better understanding of how Yacy works in a P2P setting, I just don't have enough confidence in the results it gives.
The network design will, as far as I can make out, favour the most popular search terms (copies of those indexes will be well distributed), but searching for more niche terms will be extremely hit and miss - when doing a "remote" search, the local peer picks a handful of peers at random to ask. If they don't have it, you'll get no results even if the content is actually indexed within the network.
It also means that relatively new stuff may take a disproportionate amount of time to begin appearing in search results at the network level. Whenever a peer fetches results from a remote, it stores a copy of the index for that page - so in theory at least relevant pages should ultimately end up distributed across most of the network.
But, there are (currently) 392 peers in the network - each will pick just a small handful of peers for each remote query, so propagation is likely to take some time - particularly as the
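To put rough numbers on the niche-term problem (back-of-envelope only - the fan-out of 10 peers per query is my assumption, not a confirmed Yacy figure):

    # chance that a remote search reaches a peer holding a niche index
    # n = peers in the network, k = peers holding the index, m = peers asked
    awk 'BEGIN {
        n=392; k=1; m=10; miss=1
        for (i=0; i<m; i++) miss *= (n-k-i)/(n-i)
        printf "P(at least one hit) = %.1f%%\n", (1-miss)*100
    }'
    # => about 2.6% when only a single peer holds the index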
None of this is to say that
Either way, I'll let it run for a week or so and keep tinkering, but my early impression is much the same as under LAN-119 - it's a nice idea, but it doesn't translate very well into the real world.
2019-12-24 09:39:38
Easily resolved, of course, but I can imagine it sitting there for days not indexing, because the state isn't clearly visible.
Also a little concerning - that's quite a lot of memory used.
The real concern though is that it means indexing just stops. Ideally, if a specific page/resource is going to require more RAM than can be allocated, you'd hope the crawler (or its parent) would trap the exception and skip that page so that the crawl could continue (preferably raising an alert specifying the URL that was skipped).
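In other words, the sort of skip-and-continue loop you'd write if you were driving the crawl yourself - sketched in shell below, with fetch_and_index standing in as a hypothetical helper for whatever does the real work:

    # keep crawling even when a single resource blows up (sketch)
    while read -r url; do
        if ! fetch_and_index "$url"; then
            # trap the failure, log it loudly, and move on rather than
            # letting one oversized page halt the whole crawl
            echo "$(date -u +%FT%TZ) WARN skipped: $url" >> skipped_urls.log
        fi
    done < pending_urls.txt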
This was a major concern with my LAN-119 experiment with using it internally too.
2019-12-24 09:42:13
I'll let the instance run a little longer so I can learn more about it, but realistically I'm not going to keep this instance in the longer term, and I'm certainly not going to even attempt integration into the CDN.
It's a nice idea, but Yacy just doesn't seem to be quite ready for headless unattended operation yet, and I don't really want the maintenance burden of keeping the node running and making sure it's actually indexing stuff. Especially given that the network design appears to mean that indexed pages may still not end up in the results of relevant searches (see comment above).
Admittedly, some of this might be operator error, but the project's wiki regularly times out and returns errors rather than content, so it's difficult to search for and find things within the (fairly limited) documentation. Lack of reliable doc searchability isn't a great look for a search engine project :( If it wasn't for the commit log in GitHub, I'd be inclined to think the project had been abandoned.
I'm also not particularly a fan of the fact that the web/search service is on the same port as the Peer-to-peer service (seems I'm not the only one - https://github.com/yacy/yacy_search_server/issues/315). I'd feel much more comfortable if there was some semblance of service isolation so that access could be restricted to the search service (reducing the attack surface there) whilst keeping open the bit that actually needs to be open to the world.
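For illustration, if the services were split across ports (hypothetical port numbers below), isolation would just be a couple of firewall rules - which is exactly what the shared port prevents:

    # hypothetical: P2P on 8090 stays world-reachable, while search/admin
    # on 8443 is restricted to the local network. Not achievable today,
    # because Yacy serves both from the same port.
    iptables -A INPUT -p tcp --dport 8090 -j ACCEPT
    iptables -A INPUT -p tcp --dport 8443 -s 192.168.0.0/16 -j ACCEPT
    iptables -A INPUT -p tcp --dport 8443 -j DROP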
2019-12-24 12:14:59
- Yacy supports performing its P2P operations over HTTPS. But it doesn't validate certificates.
- As a result, there's no protection against MiTM, but it does prevent casual observers from seeing the payloads.
However, use of HTTPS for P2P is off by default. A vanilla install of
If P2P were on a separate port, then a Yacy install could do HTTPS by default by generating itself a snake-oil cert on first run - it's not like other peers are going to validate the cert anyway (the wisdom of that is somewhat dubious too, particularly as there's no option for an admin to require validation - but I guess it probably relates to the usual
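Generating a snake-oil cert is a one-liner, for what it's worth (a sketch of what Yacy could do internally on first run):

    # self-signed cert, ~10 year lifetime - fine when peers don't
    # validate it anyway
    openssl req -x509 -newkey rsa:2048 -nodes -days 3650 \
        -keyout yacy-p2p.key -out yacy-p2p.crt \
        -subj "/CN=$(hostname -f)"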
2019-12-26 12:16:47
One of the other options it'll give you, though, is an HTML list of all the URLs in the index. That's potentially handy for identifying non-SEF (or otherwise incorrect) paths being exposed by CMSs (like Joomla) - for a URL to get into the index it's got to be linked from somewhere, particularly as this instance only indexes my sites. There'll be a couple of other domains in there because I played around with search, but we can filter those out.
Figured I'd have a play around with it
It's actually pretty sizeable:
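Sizing it up is straightforward once the HTML list is saved locally - urls.html and urls.txt are names I'm assuming, and the href extraction below is a rough sketch rather than proper HTML parsing:

    # pull the URLs out of the exported HTML list, then count them
    grep -o 'href="[^"]*"' urls.html | sed 's/href="//; s/"$//' > urls.txt
    wc -l urls.txt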
OK, let's strip out the stuff that isn't under bentasker.co.uk
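Something like this (sketch, reusing the urls.txt extracted above):

    # keep only URLs under bentasker.co.uk (any subdomain)
    grep -E 'https?://([^/]*\.)?bentasker\.co\.uk/' urls.txt > bentasker_co_uk.txt
    wc -l bentasker_co_uk.txt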
OK.... maybe there are more than a few that came in as a result of my searches then.
Let's list out which domains are in there
Now, how many of each
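Both are one-liners against the extracted list (sketch - the host is the third slash-delimited field of a URL):

    # unique domains in the index
    awk -F/ '{print $3}' urls.txt | sort -u
    # and how many URLs each accounts for
    awk -F/ '{print $3}' urls.txt | sort | uniq -c | sort -rn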
Only 5 under
Actually under
Triggering a manual crawl of it shows all the expected paths, and using
Anyway, back to my URL lists
Which creates this - https://projectsstatic.bentasker.co.uk/MISC/MISC-36-Yacy_instance_setup/www_bentasker_co_uk.html?a
Let's see where the majority of URLs are pointing
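i.e. counting by first path component (sketch - www_urls.txt is assumed to hold just the www.bentasker.co.uk URLs from the list above):

    # which sections of the site dominate the index?
    awk -F/ '{print $4}' www_urls.txt | sort | uniq -c | sort -rn | head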
Makes sense, the highest numbers are for things that get re-used across posts: tags and images.
Those
There are really two jobs then: finding where those URLs are referenced and correcting them, and making sure there's a redirect in place for them. I won't do that under this issue, as it belongs in the site-specific project.
But, continuing in the spirit of this issue, we can find some of this out by using Yacy itself. Using the front-end search to search for
It's curious, though, that that URL has been stored, because although the search results embed a link to
When using
Looking in Yacy's admin interface (http://yacy.bentasker.co.uk:8090/ViewFile.html?url=https://www.bentasker.co.uk/component/tags/tag/135-republished), we can see that the page was indexed under the original URL. So, Yacy (or maybe Solr under it) doesn't correctly handle indexing pages that result in a 301 - the new destination is stored under the original URL. Doing a quick check, the actual URL (i.e. the destination of the
That rather negates what I've been doing so far too - although some sources (like search pages) might refer to a non-SEF URL, the index won't actually represent whether that non-SEF version has been correctly redirected to the SEF version. With 515 possibly affected URLs under
I'm going to take 1 of those URLs just to see if it's affected.
- https://www.bentasker.co.uk/component/content/article?id=471:the-curious-case-of-bitfi-and-secret-persistence
- Redirects to https://www.bentasker.co.uk/component/content/article/471-the-curious-case-of-bitfi-and-secret-persistence
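For reference, the chain is easy to confirm with curl (-I for headers only, -L to follow redirects):

    # print each hop's status line and Location header
    curl -sIL 'https://www.bentasker.co.uk/component/content/article?id=471:the-curious-case-of-bitfi-and-secret-persistence' \
        | grep -iE '^(HTTP|location)'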
Which is actually incorrect, as it should end up at https://www.bentasker.co.uk/blog/security/471-the-curious-case-of-bitfi-and-secret-persistence. Search results claim that the HTML sitemap cites this URL, but it doesn't appear to
Although it returns a bunch of citations for this URL, I can't find a reference to it anywhere. Also, one of the citing URLs listed is
I'm not sure where those original
The recrawl of
That'd suggest there's no reason the pages shouldn't have been in there in the first place (and they definitely were previously). Worrying.
2019-12-26 12:24:25
- Although it definitely was crawled, one of my subdomains disappeared from the index almost entirely (with just 5 pages left in there)
- Yacy doesn't handle 301s correctly. It indexes the (new) destination's content under the URL of the old one. So if you redirect
- It's somehow managed to get a list of non-SEF URLs for my site, but none of the citations it gives actually include those URLs (in fairness, this may well be a Joomla issue rather than a Yacy one)
2019-12-26 12:57:42
Under my edge config for
Note that we're redirecting to an entirely different TLD, much less second level domain -
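The test file's behaviour is easy to confirm from the command line before involving Yacy (sketch):

    # confirm the edge returns a cross-domain redirect
    curl -sI 'http://bentasker.uk/misc38_test_file.html' | grep -iE '^(HTTP|location)'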
So, in Yacy's admin interface, I need to submit a crawl of http://bentasker.uk/misc38_test_file.html and have it treated as a link list.
Ah, it receives the redirect but doesn't follow it (presumably because it's in a different domain). What happens if we use
So, it looks like it's not possible (or at least not trivial) to do cross-domain poisoning.
Raised upstream bug 320 - https://github.com/yacy/yacy_search_server/issues/320
2019-12-31 17:03:42
Instance has been killed off and the related DNS records removed