Build your own Google

Raymond Meester
5 min read · Nov 15, 2020


Google’s founders intended to write a thesis about search engines at Stanford University. They told their mentor that they wanted to download the whole internet. Even in the second half of the nineties this was a huge task. Their thesis supervisor thought they were crazy. Still, they did it. The rest is history.

We are all aware by now how important search engines are. Google searches most of the internet and presents results to queries ultrafast. It uses millions(!) of servers and specialized software to make this happen. Maurice de Kunder made a website that estimates the number of web pages indexed by Google. The current estimate is around 60 billion pages.

Google does not cover all of the internet. It focuses solely on publicly available sources. The parts that are not public belong to either the dark web or the deep web. The dark web consists of content that is deliberately made accessible only through special software and authorization, like Tor and Freenet. The dark web, however, forms just a small portion of the internet that is not indexed by search engines.

The biggest part is the deep web. It consists of online databases, password-protected content (like corporate and government databases) and content not linked from any other public pages. Other examples are websites that require subscriptions (newspapers, blogs, streaming), social accounts, CRM products, and SaaS products.

But what if this is exactly the data you want to query? All of it sits outside the public internet, whether it is personal or company data. Of course, you still want to search it just like you search the internet. And to build your own Google you need a good search engine.

Search engines

How do search engines work? First, you need to feed them data and let the engine index it. Once a page is in the index, it can be shown as a result for relevant queries. Finally, a ranking is applied which decides which results come out on top.
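
To make these three steps concrete, here is a minimal sketch in Python of a toy inverted index. It is not how Lucene or any other real engine is implemented. The class name TinyIndex, the sample documents and the simple count-times-rarity scoring are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """Toy inverted index (hypothetical): maps each term to the documents containing it."""

    def __init__(self):
        self.postings = defaultdict(Counter)  # term -> {doc_id: term count}
        self.doc_count = 0

    def add(self, doc_id, text):
        """Step 1: feed a document to the engine and index its terms."""
        self.doc_count += 1
        for term in tokenize(text):
            self.postings[term][doc_id] += 1

    def search(self, query):
        """Steps 2 and 3: find matching documents, then rank them."""
        scores = Counter()
        for term in tokenize(query):
            matches = self.postings.get(term, {})
            if not matches:
                continue
            # Rare terms weigh more than terms found in most documents.
            idf = math.log(1 + self.doc_count / len(matches))
            for doc_id, count in matches.items():
                scores[doc_id] += count * idf
        # Highest score first.
        return [doc_id for doc_id, _ in scores.most_common()]


index = TinyIndex()
index.add("a", "Rust is a fast systems language")
index.add("b", "Lucene is a search engine library written in Java")
index.add("c", "A fast search engine written in Rust")
results = index.search("fast rust search")
print(results)
```

Document "c" matches all three query words, so it ends up on top. Real engines add much more on top of this idea, such as stemming, stop words and better ranking formulas.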

The next question is: what engines are available? Apache Lucene is probably the most used engine. Lucene started in 1999, one year after Google was founded. At first it was available only for Java, but it was later ported to other languages like C++ and Python. The long period of active development has made Lucene feature-rich.

Indexing, ranking and searching are all possible and highly configurable. But in the end it’s only the engine. What is an engine without the car? It may be powerful, but it doesn’t get very far. To get a more feature-complete experience, search databases are built on top of the Lucene engine. The best known are Solr, which is offered by the Lucene team, and Elasticsearch by Elastic.

Solr and Elasticsearch bring more comfort. They offer ways to scale search through shards, provide admin tools, and make the functionality available through REST APIs. Elasticsearch especially has become one of the best-known search databases, but there is so much more.
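
The idea behind sharding can be sketched in a few lines of Python. This is only an illustration of the general principle, not the routing that Solr or Elasticsearch actually use. The names shard_for and ShardedIndex, the CRC32 hash and the sample documents are assumptions.

```python
import zlib


def shard_for(doc_id, num_shards):
    """Route a document to a shard using a stable hash of its ID."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


class ShardedIndex:
    """Toy sharded store (hypothetical): each dict stands in for one server."""

    def __init__(self, num_shards):
        self.shards = [{} for _ in range(num_shards)]

    def add(self, doc_id, text):
        # Every document lives on exactly one shard.
        shard = self.shards[shard_for(doc_id, len(self.shards))]
        shard[doc_id] = text.lower().split()

    def search(self, word):
        # Fan the query out to all shards, then merge the partial results.
        hits = []
        for shard in self.shards:
            hits.extend(
                doc_id for doc_id, words in shard.items() if word.lower() in words
            )
        return sorted(hits)


cluster = ShardedIndex(num_shards=3)
for doc_id, text in [
    ("doc-1", "Solr scales with shards"),
    ("doc-2", "Elasticsearch scales too"),
    ("doc-3", "Sonic is lightweight"),
]:
    cluster.add(doc_id, text)
print(cluster.search("scales"))
```

Because each shard holds only part of the data, shards can be spread over several machines and searched in parallel. The price is that every query has to visit all of them and merge the answers.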

Alternatives

So Lucene is the most used engine, Elasticsearch the most used search database, and for log management there are capable providers (like Splunk). All these solutions have one thing in common: they are heavyweight and come with a steep learning curve (and are sometimes pricey as well).

What are the options for more lightweight alternatives? Here is an overview of free, open-source full-text search engines that are lightweight, but still very capable. Let’s take a look at 7 search engines that are popular on GitHub and claim to be fast and lightweight.

What stands out, besides the fact that more than half of the search engines start with the letter T, is that most are written in Rust. This is no wonder, because Rust combines a highly modern syntax with a design focused on safety, control of memory layout, and concurrency.

The details

As MeiliSearch and Sonic are the most popular and have public releases on GitHub, let’s go into a little more detail on both. Sonic was built for the messaging platform Crisp. The main developers are Baptiste Jamin and Valerian Saliou. They found Lucene (Elastic & Co) too heavyweight and tried to build a lighter variant.

The basic features are:

  • The indexed search terms are stored in collections
  • Sonic channel: A protocol to search an index, manage data ingestion and perform administrative actions
  • Sonic doesn’t store any direct textual data in its index, but uses a word index instead
  • Autocompletion and typo correction
  • Unicode compatible, covering 80+ of the most spoken languages in the world
  • Various libraries for multiple programming languages
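
Autocompletion and typo correction can be sketched with Python's standard library alone. This is not Sonic's algorithm. The vocabulary, the function names autocomplete and correct, and the similarity cutoff of 0.75 are assumptions chosen for the example.

```python
import difflib

# Hypothetical list of words that are already in the index.
VOCABULARY = sorted(["engine", "index", "ranking", "rust", "search", "server"])


def autocomplete(prefix, limit=5):
    """Suggest indexed words that start with what the user typed so far."""
    prefix = prefix.lower()
    return [word for word in VOCABULARY if word.startswith(prefix)][:limit]


def correct(word):
    """Replace a misspelled word with the closest indexed word, if any."""
    word = word.lower()
    if word in VOCABULARY:
        return word
    # get_close_matches ranks candidates by string similarity (0.0 to 1.0).
    matches = difflib.get_close_matches(word, VOCABULARY, n=1, cutoff=0.75)
    return matches[0] if matches else word


print(autocomplete("se"))
print(correct("serch"))
```

A word that has no close match is returned unchanged, so a search on it simply finds nothing instead of silently searching for something else.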

More info:

MeiliSearch (like Sonic) started in France. It’s still a very young project. Version 0.1 came out on 18 December 2018. Since then its popularity has grown, so much so that the founder recently turned the project into an organization. In the blog post announcing the company, they stated that they want to be an open organization, following organizations like Mozilla and Red Hat.

The main features are:

  • Full-text search
  • Autocompletion and typo tolerance
  • Synonym support
  • Whole documents are returned
  • Faceted search and filters
  • Easy installation and maintenance
  • Snapshots: Restore from previous states
  • RESTful API
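
To give an impression of what working with such a RESTful API looks like, the sketch below builds (but does not send) two HTTP requests in Python: one to add documents and one to search. It assumes a MeiliSearch instance on localhost port 7700 without an API key and an index named "notes"; both are assumptions, and the exact routes and authentication can differ per version, so check the documentation of the release you install.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:7700"  # assumption: local instance on the default port


def add_documents_request(index_uid, documents):
    """Build a POST request that sends a list of documents as JSON."""
    body = json.dumps(documents).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/indexes/{index_uid}/documents",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search_request(index_uid, query):
    """Build a GET request that passes the search terms in the q parameter."""
    params = urllib.parse.urlencode({"q": query})
    return urllib.request.Request(
        f"{BASE_URL}/indexes/{index_uid}/search?{params}",
        method="GET",
    )


# Hypothetical index name and documents, for illustration only.
docs = [{"id": 1, "title": "Build your own Google"}]
add_req = add_documents_request("notes", docs)
find_req = search_request("notes", "open source")
print(add_req.get_method(), add_req.full_url)
print(find_req.get_method(), find_req.full_url)
```

With a server running, passing one of these requests to urllib.request.urlopen would actually send it. The response is JSON, and because whole documents are returned there is no need for a second lookup in another database.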

More info:

It’s great to see that there is still a lot going on in the development of search engines that are open to use on your own data. The best way to get started is to install Docker, try one for yourself and build your own Google.

Links
