Analyse websites for Search Engine Optimisation - in a decentralised fashion 8) See the instructions for why you may not see results.

Team: lobsang (nko2018, entry V800)

Explore whether it is possible to decentralise a search index à la Google.

Description

This is a Technical Entry

What if we could re-do how we index the web for search?

One of our team members had this thought one morning after waking up.

Since the web is designed to be decentralised, why do we have several services building up their own giant databases instead of sharing one? So this entry explores whether indexing could be done in an alternative fashion - something everyone would be able to contribute to.

Inspiration

Mozilla Hacks has recently been writing about different initiatives of the DWeb. Things like WebTorrent, the Dat protocol, IPFS and Matrix are interesting ones. They all cover the topic of federation.

Also, this would not be the first decentralised search engine - there's already YaCy. But it eats up disk space fast.

Breaking up the challenge

So what about breaking up the concept of indexing the web for search?

  1. Getting a URL as a starting point.
  2. Saving that URL.
  3. Triggering several other things to do with that URL.
  4. Saving the results of those investigations.
  5. Looking at whether step 3 could be repeated with new URLs derived from step 4 (a minimal loop sketch follows this list).
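
As a rough sketch (in plain Node.js), the loop described above could look like this. All function names are illustrative placeholders, not anything from the actual entry:

```js
// Hypothetical sketch of the loop above. `saveUrl`, `download`,
// `processResponse` and `extractUrls` are placeholders for illustration only.
async function crawl (startUrl, { saveUrl, download, processResponse, extractUrls }) {
  const queue = [startUrl]                 // 1. a URL as starting point
  const seen = new Set()

  while (queue.length > 0) {
    const url = queue.shift()
    if (seen.has(url)) continue
    seen.add(url)

    await saveUrl(url)                     // 2. save that URL
    const response = await download(url)   // 3. trigger further work on the URL
    const results = await processResponse(url, response)   // 4. save the results
    for (const next of extractUrls(results)) {
      queue.push(next)                     // 5. repeat with derived URLs
    }
  }
}

module.exports = crawl
```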

But there's more. We need five different pillars (all united in this entry because of limitations with the rules, i.e. only a single repository allowed).

  1. Crawling the web
  2. Spreading knowledge to others
  3. Processing knowledge of others and deriving new insights
  4. Storing knowledge
  5. Presenting knowledge

Spreading the work

Since we are heading into the era of the Internet of Things, we can utilise those resources as well. Therefore we divide the work into these areas:

  1. Crawling the web - that means following links and downloading the responses for saving or passing forward
  2. Storing knowledge - that means listening to message buses and saving everything for further processing
  3. Processing knowledge - that means focussing on a single area (say, image recognition) and processing just those messages

So if you have low-powered hardware, it can crawl the web. If you have cheap cloud storage, it can save the knowledge (Database-as-a-Service or similar). If you have highly specialised software (like semantic analysis), it can pick up certain messages.
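
As a small sketch of that last idea: a specialised instance could subscribe to the bus and only react to messages tagged with keywords it knows how to handle (the `keywords` field is part of the message format defined further below; the names here are made up):

```js
// Sketch: a specialised instance only picks the messages it can handle off the bus.
// `interestingKeywords` and `handle` are illustrative placeholders.
function makeAgent (interestingKeywords, handle) {
  return function onMessage (message) {
    const relevant = (message.keywords || []).some(k => interestingKeywords.includes(k))
    if (relevant) return handle(message)
  }
}

// e.g. an instance running image recognition only cares about image messages
const imageAgent = makeAgent(['image'], message => {
  console.log('would run image recognition on', message.content)
})

imageAgent({ keywords: ['image'], content: 'https://example.com/logo.png' })
```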

Most importantly: no single instance has to do everything, but all of them win.

Deciding on a proof-of-concept

So here's where Matrix shines: It's built on top of HTTP and JSON. Something (almost) every programming language understands.

Translating the above steps means:

  1. Writing an Express app to ask the user for a URL. This could also provide an API for programmatic input (a minimal sketch of this step follows the list).
  2. Using the new, shiny Matrix JS SDK (currently somewhere between alpha and public beta) to distribute the URL. If that does not work, falling back to IPFS (requires a daemon) or WebTorrent (should work from within a browser). Luckily Heroku allowed a connection to Matrix.org.
  3. Listening to Matrix using the same SDK. Alternatively, broadcasting the insights using Redis Pub/Sub.
  4. Storing the knowledge in Neo4J (falling back to relational databases like MariaDB). Since we are dealing with many small entries, a relational database can't play to its strengths.
  5. Using express.js to cobble together HTML (or later, JSON) to tell the user about the knowledge gained.
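
A minimal sketch of step 1, assuming the node redis client, a /urls route and an urls channel (these names are made up here, not the actual implementation):

```js
// Sketch of step 1: accept a URL via an HTML form / API and publish it to a
// Redis channel. Route, channel name and port are assumptions.
const express = require('express')
const bodyParser = require('body-parser')
const redis = require('redis')

const app = express()
const publisher = redis.createClient(process.env.REDIS_URL)

app.use(bodyParser.urlencoded({ extended: false }))

app.post('/urls', (req, res) => {
  const url = req.body.url
  publisher.publish('urls', url)   // hand the URL over to the rest of the pipeline
  res.send('Crawling queued - the report for ' + url + ' will take a few minutes')
})

app.listen(process.env.PORT || 3000)
```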

Defining the message structure

Since the knowledge will be spread in a distributed fashion, a common message format is needed. At the very minimum it should contain the following (a sketch of such a message follows the list):

  1. An ID. In order to associate the different moving parts with each other, everything needs a unique ID. In this example a SHA-512 hash over the payload (excluding issuer and id) will be used.
  2. A pointer to another ID. If knowledge was derived from another entry, this field should refer to that entry's ID. If it is new knowledge, set it to null. I haven't looked too deeply into the RDF standard, but it could likely be applied here as well.
  3. A timestamp (in ISO format, since this can be understood by humans and many programming languages. Plus it is more future-proof than UNIX timestamps).
  4. A license. Since there may be limitations on how some entries can be re-used, everything should provide an SPDX license identifier.
  5. A cryptographic key identifying the issuer. Since there will be some need to assess the credibility of a source, the public key of the instance issuing an entry should be attached to the payload. Here, it's a 2048-bit RSA key which signs the payload (including the hashed ID).
  6. Some pre-defined keywords describing the kind of information passed around. This allows picking only „interesting” messages from a bus for processing.
  7. A Content-Type (also known as MIME type). This is meant for further specialising binary payloads.
  8. Most important: the actual content. This could be a string, an array or anything serialisable.
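
A sketch of how such a message could be assembled, using the js-sha512 and node-rsa packages listed under „Built With” (the field names are illustrative, not a fixed schema):

```js
// Sketch of the message structure described above. Field names are illustrative.
const { sha512 } = require('js-sha512')
const NodeRSA = require('node-rsa')

const key = new NodeRSA({ b: 2048 })   // this instance's 2048-bit RSA key pair

function buildMessage (content, { parent = null, keywords = [], contentType = 'text/plain' } = {}) {
  const payload = {
    parent,                                // 2. pointer to the entry this was derived from, or null
    timestamp: new Date().toISOString(),   // 3. ISO timestamp
    license: 'Apache-2.0',                 // 4. SPDX license identifier
    keywords,                              // 6. keywords for picking messages off the bus
    contentType,                           // 7. MIME type of the content
    content                                // 8. the actual content
  }
  const id = sha512(JSON.stringify(payload))          // 1. ID over the payload (without issuer and id)
  const issuer = key.exportKey('pkcs8-public-pem')    // 5. public key of the issuing instance
  const signature = key.sign(JSON.stringify({ ...payload, id }), 'base64')
  return { id, issuer, signature, ...payload }
}

console.log(buildMessage('https://example.com', { keywords: ['url'] }))
```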

This way, a Directed Acyclic Graph can be built up to trace back the path of some knowledge. More importantly, all those messages can be treated as immutable. This gives us leeway to apply different areas of Computer Science to them.
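
A tiny sketch of what tracing back through that graph could look like, assuming the messages are available in a Map keyed by their ID (and that the pointer field is called parent, as in the sketch above):

```js
// Sketch: walk the parent pointers of the immutable messages to reconstruct
// where a piece of knowledge came from. `messagesById` is a Map of id -> message.
function trace (messagesById, id) {
  const path = []
  let current = messagesById.get(id)
  while (current) {
    path.push(current)
    current = current.parent ? messagesById.get(current.parent) : undefined
  }
  return path   // ordered from the derived entry back to the original knowledge
}
```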

If we ever have to audit machine learning algorithms, this would give us a way to reproduce the results. That is, the same input should yield the same output, right?

Besides that, scientists could do … interesting things with these small messages as well. Since the structure is well-defined, it should be easy to develop tooling for it.

What a search engine has in addition to the websites we can observe is knowledge about the behaviour of the people doing the search - like jumping back from a page, hovering over entries, etc.

What does that have to do with Search Engines?

The initial idea was to recreate Screaming Frog SEO Spider. If you look at its presentation, it is a crawler with some tables (exportable as CSV) and some charts on top of it. This can be used for so-called on-page optimisation.

After a while, backlink analysis and Domain Authority could be looked at as well.

Just look at what is possible to derive:

[Image: derivations from crawling]

Challenges

There are always some challenges along the way. Here's what we can think of:

  1. Intellectual Property. There could be some organisations or persons out there who consider certain information to be Intellectual Property - say, a domain name.
  2. GDPR. With the current approach the „right to be forgotten” and „hand me everything you know about me” could be hard to realise. We would defer that to the tooling used to extract things here.
  3. Infrastructure blocking access to the message bus. Here, an alternative message bus (see above: IPFS, Dat etc.) could be used, and certain nodes could bridge between them.
  4. Child processes dying all the time. Well, I've learned that it is difficult to run processes in parallel with Node. Here they didn't die, but they degraded performance.
  5. Losing links. If certain messages aren't forwarded to the bus, but others link to them, we have a broken chain of references.
  6. People messing around with the messages - sending invalid posts, wrong structures, etc. It will happen, so the best approach is to be rigorous about the input to accept and ditch everything else.

Moving forward

The plan is to publish this work as Open Source (on GitHub). Everything is licensed under the Apache License.

Alternative implementations are planned for Python (and maybe PHP).

The JS SDK will need some more updates to speed it up. It was really slow. But hey, it's alpha/public beta. Considering that, it worked exceptionally well.

Instructions

It only works for people with a lot of patience, since many different systems are involved.

Note: If you don't see any results, fear not! We are on it. We can't change the run-script without a code change (which would disqualify us), so we have to restart the needed processes manually (and will do so on a regular basis until the winners are announced). Thank you for your understanding.

So here's the flow and the systems in play (a sketch of the Redis-to-Matrix hand-over follows the list).

  1. Node starts a process running express.js. This one renders an HTML page with a form that accepts a URL. A bit of Vue.js is used to tell you what the result link will be, since the process takes about 5 to 10 minutes (if everything is running).
  2. Express.js takes the submitted URL and publishes it to a Redis channel.
  3. Some other part of the application picks up the message, wraps it with metadata (see above) and passes it to Matrix.
  4. The message appears as a string in Matrix (currently only strings are allowed as messages in the JS SDK). This can take some time, since there are many other events running through the application which have to be sorted out until our sendEvent gets recognised.
  5. Another process has started a Matrix client to listen for messages in that channel and finds the above message. It informs all registered agents about the new message.
  6. Each of those agents processes the URL in its own way and reports its new value via a Redis publication. Step 3 kicks in again.
  7. Meanwhile a third process is listening to all messages on Matrix and writes them into a database (Neo4J, or here as fallback: MariaDB).
  8. When visiting the report URL, Express queries the database for all entries with a relation to the URL (or rather, to the URL entry with that value, using its hash attribute for the relational search … I'm not good with databases, by the way).
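
As a rough sketch of steps 3-4, assuming the node redis client, an urls channel and environment variables for the Matrix credentials and room (all placeholders - the actual flow wraps the URL with the full message structure described above):

```js
// Sketch of steps 3-4: pick a URL up from Redis, wrap it and send it to a
// Matrix room as a string. Channel, room and credentials are placeholders.
const redis = require('redis')
const sdk = require('matrix-js-sdk')

const subscriber = redis.createClient(process.env.REDIS_URL)
const matrix = sdk.createClient({
  baseUrl: 'https://matrix.org',
  accessToken: process.env.MATRIX_TOKEN,
  userId: process.env.MATRIX_USER
})

subscriber.subscribe('urls')
subscriber.on('message', (channel, url) => {
  // In the real flow the URL is wrapped with the metadata described under
  // „Defining the message structure”; here we only forward a minimal object.
  const body = JSON.stringify({ keywords: ['url'], content: url })
  matrix.sendEvent(process.env.MATRIX_ROOM, 'm.room.message', {
    msgtype: 'm.text',
    body                             // only strings are allowed as message content
  }, '')
})
```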

What I learned during development is that it's hard to run several processes from Node. Frankly, they should be running in their own Docker containers or across several systems, but the rules (or rather the welcome mail) forbid me to do so.

Built With

Installed Heroku Add-Ons

  • GrapheneDB Dev - Tried as the Neo4J database, but had to be dropped because a connection could not be established.
  • Hosted Graphite - Provides a Grafana dashboard, but I haven't taken the time to learn how to pipe in my metrics.
  • JawsDB Maria - MariaDB as a fallback to Neo4J.
  • Keen - Ought to provide insight into analytics, but was unusable (I didn't understand the interface).
  • Librato - Automatically gives you some insights into your log files. Use Papertrail or Graphite if you need more.
  • Papertrail - Gives access to the logs generated on the server. It's harder, though, to derive actions from them.
  • RedisGreen - Redis as a Service. Was available almost always. The admin UI does not provide the information I need, though (like values of keys, insights into pub/sub channels, etc.).
  • Sentry - Error logging. It's handy to get informed about crashes by mail. Since the quality was so high, that rarely happened during the hackathon. I love it!
  • Snyk - If you develop in JavaScript and have access to it, use it! Their service is tremendously helpful! Just let them know that you are taking part in a hackathon and that the traffic / deployments could be a bit higher :-)

Plus, required integration packages like

What I struggled with were the Heroku pipelines. They always showed me a warning, but gave no details (or an ID), so I didn't bother support.

Used tool chain

  • commitlint - This way, I can ensure that my commits are written in a homogeneous way (can be used for release notes, too).
  • husky - Such an easy way to have git hooks!
  • standard - Saves you some time with setting up a linter. Yes, it's opinionated.
  • body-parser - To parse HTML form elements
  • browserslist-useragent - That's something I learned from Smashing Magazine the other day. Cool that it works as described! However, my struggles with the keen.io add-on rendered it pretty useless for actions.
  • chota - That's a CSS framework (based on Flexbox) which I discovered just before the hackathon. I'm in love with small packages focussing on one thing.
  • concurrently - My life saver. Otherwise I'd have had to do everything in one process. Especially the database connections tended to screw up.
  • cypher-query-builder - Cypher is the query language of Neo4J. Go check it out! I had written everything up just to learn that GrapheneDB wasn't reachable. No, I won't join Slack to get some support. I expect that to happen where the users are.
  • dotenv - What would the world be without it? Storing sensitive details in a .env is good practice in my opinion.
  • ejs - Although not the most beautiful template language, I like its robustness and flexibility.
  • express - Another framework focussing on minimalism and flexibility. Maybe I should try Koa or Sails one day, but I am confident to build rock-solid apps with Express for now.
  • express-useragent - Saw browserslist-useragent above? This one actually parses the User Agent, so I could enrich keen.io with more information.
  • helmet - If you use Express (or Connect), install helmet. You can't get basic security easier.
  • http-status - If you are like me and don't want to remember the codes but rather their meaning, use this lib to translate between the two.
  • i18next - Whenever I want the possibility to translate my JavaScript application, I reach for i18next. So … basically always.
  • js-sha512 - I know that Node.js has a Crypto API … but I only wanted to quickly hash something. So I discovered this package.
  • loglevel - If you are also in the camp that uses console.log all over the place, give this one a try. The setup is quite easy, and you can extend it down the road.
  • loglevel-plugin-prefix - I normally add this one to have a prefix with timestamp, logger name and level to it.
  • matrix-js-sdk - This one was published shortly before the hackathon. So it's hot! I am amazed how reliable it works (it even reconnects to matrix on its own!) given that it is alpha! Nice job!
  • mime - Since I was looking into how hard it would be to add a MIME type to the content part of the payload, I spent some time researching packages. This one works quite nicely.
  • morgan - A quick way to get your application logging in a standardised format (so you can throw more tooling at it).
  • node-rsa - I am not sure whether „no OpenSSL needed” is actually an advantage, but I couldn't be sure it's present on Heroku.
  • sequelize - It took me a bit to get it running, but I prefer an ORM over hardcoding SQL, and this one is among the best for Node.js if you are dealing with relational databases. I need to spend some time on modelling / putting my files in a meaningful directory structure next time. As I hadn't planned to use this kind of database, I didn't …
  • sitemapper - This feature would have almost made it into the report as well, but the Promise wasn't awaited for some reason. I'll keep it in mind though.
  • vue - Since Sarah Drasner wrote about replacing jQuery with Vue, I lost my shyness and gave it a try. If the documentation had explained where to put lifecycle methods (hint: the root level of the object, not under methods), it would have been even easier.
  • winston - Although I prefer bunyan, I had to use this one, since bunyan's development seems to have stalled. Sadly, Papertrail etc. weren't able to make more sense out of the logs.

All features were developed using git-flow. For each of them, there's an issue on GitHub (with a description, even if it was really short!).

Configured but unused tools

Some things were set up early in the hope of using them during the hackathon. However, other things took priority, so they ended up unused (but the learning from setting them up was worth it).

  • ava - Since most of the application is stateless, this would be an ideal candidate for using ava! Plus, Justin Fuller made me curious with his post over at freecodecamp.
  • ink-docstrap - I discovered this package as a dependency in another game jam I participated in, and consider it a nice way to spice up our JSDoc.
  • jsdoc - Speaking of JSDoc: yeah, it didn't make it in either. But there are interesting tools built on top of it, like flow-jsdoc.
  • lerna - Initially we thought about turning this project into a lerna one and putting it on the npm registry. It turned out that this would need a different approach.
  • nyc - Code coverage made easy.
