Analyse websites for Search Engine Optimisation - in a decentralised fashion 8) See the instructions for why you may not see results.
Explore whether it is possible to decentralise a search index à la Google.
This is a Technical Entry
One of our team members had this thought one morning after waking up.
Since the web is designed to be decentralised, why do several services build up their own giant databases instead of sharing one? So this entry explores whether it could be done in an alternative fashion: something everyone would be able to contribute to.
Also, this is not the first decentralised search engine; YaCy already exists. But it eats up disk space fast.
So what about breaking up the concept of indexing the web for search?
But there's more. We need five different pillars (all united in this entry because the rules only allow a single repository).
Since we are heading into an era of the Internet of Things, we can utilise those resources as well. Therefore we divide the work into several areas.
So if you have low-level hardware, it could crawl the web. If you have cheap cloud storage, it can hold the knowledge (Database-as-a-Service or similar). If you have highly specialised software (like Semantic Analysis), it can pick up certain messages.
Most importantly: no single instance has to do everything, yet all win.
So here's where Matrix shines: it's built on top of HTTP and JSON - something (almost) every programming language understands.
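To illustrate (the endpoint follows the Matrix client-server API; the homeserver, room ID, and token below are placeholders), publishing a piece of knowledge is a single authenticated HTTP request with a JSON body:

```javascript
// Sending an event to a Matrix room is just an HTTP PUT with a JSON body.
// Homeserver, room ID, and access token are placeholders.
const HOMESERVER = "https://matrix.org";
const ROOM_ID = "!abc123:matrix.org";
const ACCESS_TOKEN = "YOUR_ACCESS_TOKEN";

async function sendEvent(content) {
  const txnId = Date.now(); // transaction ID so retries aren't duplicated
  const url = `${HOMESERVER}/_matrix/client/r0/rooms/${encodeURIComponent(ROOM_ID)}` +
              `/send/m.room.message/${txnId}`;
  const res = await fetch(url, {
    method: "PUT",
    headers: {
      "Authorization": `Bearer ${ACCESS_TOKEN}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(content),
  });
  return res.json(); // { event_id: "..." } on success
}

sendEvent({ msgtype: "m.text", body: "hello from a crawler" })
  .then(console.log)
  .catch(console.error);
```

Any device that can speak HTTP - from a small crawler box to a cloud function - can take part this way.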
Translating the above steps means:
Since the knowledge will be spread in a distributed fashion, a common message format is needed; a sketch follows below. I haven't looked too deeply into the RDF standard, but it could likely be applied here as well.
This way, a Directed Acyclic Graph can be built up to trace back the path of a piece of knowledge. More importantly, all those messages can be treated as immutable, which gives us leeway to apply different areas of Computer Science to them.
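As a sketch of what such a message could look like (every field name here is an assumption, not a fixed schema), with parent references forming the DAG:

```javascript
// Sketch of a knowledge message; all field names are assumptions, not a fixed schema.
// `parents` holds the IDs of the messages this one was derived from, so the whole
// set of messages forms a Directed Acyclic Graph traceable back to raw observations.
const knowledgeMessage = {
  type: "org.example.seo.page_title",        // hypothetical namespaced message type
  url: "https://example.com/",               // the subject of this piece of knowledge
  value: "Example Domain",                   // the extracted fact itself
  producer: "@crawler-01:matrix.org",        // which instance produced it
  observed_at: "2018-09-30T12:00:00Z",       // when it was observed
  parents: ["$someParentEventId:matrix.org"] // provenance links forming the DAG
};
```

Since messages are never edited, replaying the graph from a result back to its roots shows exactly how that piece of knowledge came to be.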
If we ever have to audit machine learning algorithms, this would give us a way to reproduce their results. That is, the same input should yield the same output, right?
Besides that, scientists could do … interesting things with these small messages as well. Since the structure is well-defined, it should be easy to develop tooling for it.
What a search engine has, in addition to the websites themselves, is knowledge about the behaviour of the people doing the search: jumping back from a page, hovering over entries, and so on.
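Hypothetically, such behaviour signals could travel as the same kind of immutable message (the event type and fields below are invented for illustration):

```javascript
// Hypothetical behaviour signal, reusing the message format sketched above.
const behaviourMessage = {
  type: "org.example.seo.result_bounce", // invented event type
  query: "decentralised search",          // the search that was performed
  url: "https://example.com/",            // the result the user jumped back from
  dwell_time_ms: 1800,                    // user returned to the results after 1.8 s
  observed_at: "2018-09-30T12:01:00Z",
  parents: []                             // a raw observation, no prior knowledge
};
```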
The initial idea was to recreate Screaming Frog SEO Spider. If you look at its presentation, it is a crawler with some tables (exportable as CSV) and some charts on top of it. This can be used for so-called OnPage optimisation.
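A minimal sketch of such an OnPage check might look like this (naive regex extraction just to show the idea; a real crawler would use a proper HTML parser):

```javascript
// Naive on-page check: fetch a page and pull out a few SEO-relevant fields.
// Regex extraction is only for illustration; real crawlers should parse HTML properly.
async function onPageReport(url) {
  const html = await (await fetch(url)).text();
  const pick = (re) => (html.match(re) || [, ""])[1].trim();
  return {
    url,
    title: pick(/<title[^>]*>([^<]*)<\/title>/i),
    description: pick(/<meta\s+name=["']description["']\s+content=["']([^"']*)["']/i),
    h1: pick(/<h1[^>]*>([^<]*)<\/h1>/i),
  };
}

onPageReport("https://example.com/").then(console.log);
```

Each resulting object maps naturally onto one table row, ready for CSV export.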
After a while, backlink analysis and Domain Authority could be looked at as well.
Just look at what is possible to derive.
There are always some challenges along the way. Here's what we can think of:
The plan is to put this work online as Open Source (on GitHub). Everything is licensed under the Apache License.
Alternative implementations are planned for Python (and maybe PHP).
The JS SDK will need some more updates to speed it up; it was really slow. But hey, it's alpha/public beta - considering that, it worked exceptionally well.
It only works for people with a lot of patience, since many different systems are involved.
Note: If you don't see any results, fear not! We are on it. We can't change the run-script without a code change (which would disqualify us), so we have to restart the needed processes manually (and will do so on a regular basis until the winners are announced). Thank you for your understanding.
So here's the flow and systems in play.
What I learned during development is that it's hard to run several processes from Node. Frankly, they should be running in their own Docker containers or across several systems, but the rules (or rather, the welcome mail) forbid me from doing so.
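For reference, keeping several workers alive from one Node process looks roughly like this (the script names are placeholders for the actual crawler/storage/analysis workers):

```javascript
// Keep several worker scripts alive from a single Node process.
// Script names are placeholders for the actual worker processes.
const { fork } = require("child_process");

function keepAlive(script) {
  const child = fork(script);
  child.on("exit", (code) => {
    console.log(`${script} exited with code ${code}, restarting...`);
    setTimeout(() => keepAlive(script), 1000); // naive restart with a small delay
  });
}

["crawler.js", "storage.js", "analysis.js"].forEach(keepAlive);
```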
Plus, required integration packages like …
What I struggled with were the Heroku pipelines. They always showed me a warning but gave no details (or an ID), so I didn't bother support.
Using a .env file is good practice in my opinion.
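For example, with the dotenv package, secrets stay out of the code (the variable names are placeholders):

```javascript
// .env file (never committed):
//   MATRIX_ACCESS_TOKEN=secret-token-here
//   HOMESERVER_URL=https://matrix.org
require("dotenv").config(); // loads .env into process.env

const token = process.env.MATRIX_ACCESS_TOKEN;
const homeserver = process.env.HOMESERVER_URL;
```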
Instead of console.log all over the place, give this one a try. The setup is quite easy, but you can extend it down the road.
… methods, that would have been even easier).
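As an illustration (winston is just my pick for the example, not necessarily the library meant above), a structured logger replaces scattered console.log calls:

```javascript
// Basic structured logger with winston; swap in whichever library is preferred.
const winston = require("winston");

const logger = winston.createLogger({
  level: "info",
  format: winston.format.combine(winston.format.timestamp(), winston.format.json()),
  transports: [new winston.transports.Console()],
});

logger.info("crawler started", { url: "https://example.com/" });
logger.warn("slow response", { ms: 4200 });
```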
All features were developed using git-flow. For each of them, there's an issue on GitHub (with a description, even if it was really short!).
Some things were set up early in the hope of using them during the hackathon. However, other things took priority, so they ended up unused (but the learning from setting them up was worth it).
This is an interesting idea but I couldn't get it to do anything.
It's hard to say whether this is useful or not; I tried a few different URLs and none of them gave any indication of having done anything. If there is something going on in the background, it would be great to have a visual indication of that.
Otherwise, I love the idea of a shared repository or index for all of the internet. That's an ambitious problem, but it will probably happen eventually!
The concept is very good, but I don't think you were able to implement it the way you wanted with the time & constraints that you had. Would love to use a properly completed version of this tool. Couldn't get results when I tried.
Please give an explanation of the sections in the results to make them more user-friendly. Good documentation and breakdown of the problem; it would be nice if users could see that.