A software engineer from New York got so fed up with the irrelevant results and SEO spam in search engines that he decided to create a better one. Two months later, he has a demo search engine up and running. Here is how he did it, along with four insights into what he sees as the biggest hurdles to creating a high-quality search engine.
One of the motives for creating a new search engine was the perception that mainstream search engines contained an increasing amount of SEO spam. After two months, the software engineer wrote of his creation:
“What’s great is the comparable lack of SEO spam.”
Neural Embeddings
The software engineer, Wilson Lin, decided that the best approach would be neural embeddings. He ran a small-scale test to validate the approach and found that embeddings worked well.
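The article doesn't say which model or library he used, but a minimal sketch of an embeddings-based relevance check, assuming the sentence-transformers library and an off-the-shelf model, looks something like this:

```python
# Minimal sketch of an embeddings-based relevance check.
# Assumptions: sentence-transformers and the all-MiniLM-L6-v2 model,
# neither of which is specified in the article.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "how do neural embeddings improve search relevance"
candidates = [
    "Neural embeddings map text to vectors so that similar meanings land close together.",
    "Buy cheap widgets online, best widget deals, widget discount codes.",
]

# Normalized vectors let cosine similarity act as the relevance score.
query_vec = model.encode(query, normalize_embeddings=True)
doc_vecs = model.encode(candidates, normalize_embeddings=True)
scores = util.cos_sim(query_vec, doc_vecs)[0].tolist()

# The semantically relevant sentence ranks above the spam-like one.
for text, score in sorted(zip(candidates, scores), key=lambda pair: -pair[1]):
    print(f"{score:.3f}  {text}")
```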
Chunking Content
The next question was how to chunk the data: should it be divided into blocks of paragraphs or of sentences? He decided that the sentence level was the most granular level that made sense, because it allowed the most relevant answer to be pinpointed within a sentence while still permitting larger paragraph-level embedding units for context and semantic coherence.
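The article describes only the idea, not the implementation, but a simplified sketch of sentence-level chunking alongside paragraph-level units might look like this (the splitting heuristics are assumptions):

```python
# Simplified sentence- and paragraph-level chunking.
# The splitting heuristics are assumptions; the article only describes the idea
# of sentence-level granularity plus paragraph-level units for coherence.
import re

def chunk_page(text: str) -> list[dict]:
    chunks = []
    for p_idx, paragraph in enumerate(text.split("\n\n")):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # Paragraph-level unit: broader context and semantic coherence.
        chunks.append({"level": "paragraph", "paragraph": p_idx, "text": paragraph})
        # Sentence-level units: the most granular matchable pieces.
        # A production system would use a real sentence tokenizer, not a regex.
        for s_idx, sentence in enumerate(re.split(r"(?<=[.!?])\s+", paragraph)):
            chunks.append({"level": "sentence", "paragraph": p_idx,
                           "sentence": s_idx, "text": sentence})
    return chunks
```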
But he still had problems identifying context when sentences relied on indirect references such as “it” or “the,” so he took an additional step to capture context more reliably:
“I trained a DistilBERT classifier model that would take a sentence and the preceding sentences, and label which one (if any) it depends upon in order to retain meaning. Therefore, when embedding a statement, I would follow the “chain” backwards to ensure all dependents were also provided in context.
This also had the benefit of labelling sentences that should never be matched, because they were not “leaf” sentences by themselves.”
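As an illustration of the “chain” idea, here is a hedged sketch in which a hypothetical depends_on mapping stands in for the DistilBERT classifier's output:

```python
# Sketch of following a sentence's dependency "chain" backwards before embedding it.
# The depends_on mapping is hypothetical: it stands in for the output of the
# DistilBERT classifier described in the quote above.
def build_context(sentences: list[str], depends_on: dict[int, int | None], idx: int) -> str:
    chain = [idx]
    current = idx
    while depends_on.get(current) is not None:
        current = depends_on[current]
        chain.append(current)
    # Emit the chain in document order so references like "it" resolve correctly.
    return " ".join(sentences[i] for i in sorted(chain))

sentences = [
    "RocksDB is an embedded key-value store.",
    "It worked well for this project.",
]
depends_on = {0: None, 1: 0}  # sentence 1 needs sentence 0 to retain meaning
print(build_context(sentences, depends_on, 1))
```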
Identifying The Main Content
A challenge for crawling was developing a way to ignore the non-content parts of a web page in order to index what Google calls the Main Content (MC). What made this challenging was the fact that all websites use different markup to signal the parts of a web page, and although he didn’t mention it, not all websites use semantic HTML, which would make it vastly easier for crawlers to identify where the main content is.
So he essentially relied on HTML tags, such as the paragraph tag <p>, to identify which parts of a web page contained the content and which parts did not.
This is the list of HTML tags he relied on to identify the main content (a rough extraction sketch follows the list):
- blockquote – A quotation
- dl – A description list (a list of descriptions or definitions)
- ol – An ordered list (like a numbered list)
- p – Paragraph element
- pre – Preformatted text
- table – The element for tabular data
- ul – An unordered list (like bullet points)
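As a rough illustration of tag-based extraction, here is a sketch that whitelists the tags above using BeautifulSoup; the parser choice and the handling of boilerplate containers are assumptions, not his implementation:

```python
# Rough sketch of main-content extraction by whitelisting the tags listed above.
# BeautifulSoup and the list of boilerplate containers to drop are assumptions;
# nested matches (e.g. a <p> inside a <blockquote>) are not deduplicated here.
from bs4 import BeautifulSoup

CONTENT_TAGS = ["blockquote", "dl", "ol", "p", "pre", "table", "ul"]

def extract_main_content(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Remove obvious non-content containers before collecting text.
    for tag in soup(["script", "style", "nav", "header", "footer", "aside"]):
        tag.decompose()
    blocks = [el.get_text(" ", strip=True) for el in soup.find_all(CONTENT_TAGS)]
    return "\n\n".join(b for b in blocks if b)
```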
Issues With Crawling
Crawling was another part that came with a multitude of problems to solve. For example, he discovered, to his surprise, that DNS resolution was a fairly frequent point of failure. URL schemes were another issue: he had to block from crawling any URL that did not use the HTTPS protocol.
These were some of the challenges:
“They must have https: protocol, not ftp:, data:, javascript:, etc.
They must have a valid eTLD and hostname, and can’t have ports, usernames, or passwords.
Canonicalization is done to deduplicate. All components are percent-decoded then re-encoded with a minimal consistent charset. Query parameters are dropped or sorted. Origins are lowercased.
Some URLs are extremely long, and you can run into rare limits like HTTP headers and database index page sizes.
Some URLs also have strange characters that you wouldn’t think would be in a URL, but will get rejected downstream by systems like PostgreSQL and SQS.”
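The quoted rules translate roughly into a filter-and-canonicalize step like the following sketch; the exact character sets, eTLD validation, and parameter handling he used are not shown in the article:

```python
# Hedged sketch of URL filtering and canonicalization along the lines of the
# quoted rules: HTTPS only, no credentials or ports, lowercased origin,
# percent-decode/re-encode, and sorted query parameters. eTLD validation and
# the exact "minimal consistent charset" are omitted here.
from urllib.parse import parse_qsl, quote, unquote, urlencode, urlsplit, urlunsplit

def canonicalize(url: str) -> str | None:
    parts = urlsplit(url)
    if parts.scheme != "https":
        return None  # rejects ftp:, data:, javascript:, etc.
    if not parts.hostname or parts.port or parts.username or parts.password:
        return None
    host = parts.hostname.lower()
    # Percent-decode, then re-encode with one consistent set of safe characters.
    path = quote(unquote(parts.path), safe="/-._~")
    # Sorting query parameters lets equivalent URLs deduplicate to one key.
    query = urlencode(sorted(parse_qsl(parts.query)))
    return urlunsplit(("https", host, path or "/", query, ""))

print(canonicalize("HTTPS://Example.COM/a%20b?z=2&a=1"))
# -> https://example.com/a%20b?a=1&z=2
```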
Storage
At first, Wilson chose Oracle Cloud because of the low cost of transferring data out (egress costs).
He explained:
“I initially chose Oracle Cloud for infra needs due to their very low egress costs with 10 TB free per month. As I’d store terabytes of data, this was a good reassurance that if I ever needed to move or export data (e.g. processing, backups), I wouldn’t have a hole in my wallet. Their compute was also far cheaper than other clouds, while still being a reliable major provider.”
But the Oracle Cloud solution ran into scaling issues. So he moved the project over to PostgreSQL, experienced a different set of technical issues, and eventually landed on RocksDB, which worked well.
He explained:
“I opted for a fixed set of 64 RocksDB shards, which simplified operations and client routing, while providing enough distribution capacity for the foreseeable future.
…At its peak, this system could ingest 200K writes per second across thousands of clients (crawlers, parsers, vectorizers). Each web page not only consisted of raw source HTML, but also normalized data, contextualized chunks, hundreds of high dimensional embeddings, and lots of metadata.”
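A fixed shard count makes routing a pure function of the key, so every client can pick a shard without a coordination service. The following sketch illustrates the idea; the key scheme and hash choice are assumptions, not details from the article:

```python
# Sketch of fixed-shard routing: a stable hash of the key picks one of 64 shards.
# The key scheme and hash choice are assumptions, not details from the article.
import hashlib

NUM_SHARDS = 64

def shard_for(key: str) -> int:
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

print(shard_for("https://example.com/some-page"))  # deterministic shard in 0..63
```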
GPU
Wilson used GPU-powered inference to generate semantic vector embeddings from crawled web content using transformer models. He initially used OpenAI embeddings via API, but that became expensive as the project scaled. He then switched to a self-hosted inference solution using GPUs from a company called Runpod.
He explained:
“In search of the most cost effective scalable solution, I discovered Runpod, who offer high performance-per-dollar GPUs like the RTX 4090 at far cheaper per-hour rates than AWS and Lambda. These were operated from tier 3 DCs with stable fast networking and lots of reliable compute capacity.”
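The article doesn't name the model he served, but batched, self-hosted embedding inference on a GPU might look roughly like this sketch (model choice and batch size are assumptions):

```python
# Sketch of self-hosted, batched embedding inference on a GPU.
# The model and batch size are assumptions; the article does not say what he served.
import torch
from sentence_transformers import SentenceTransformer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer("all-MiniLM-L6-v2", device=device)

def embed_batch(chunks: list[str]) -> torch.Tensor:
    # Larger batches amortize per-call GPU overhead; tune batch_size to fit VRAM.
    return model.encode(
        chunks,
        batch_size=256,
        convert_to_tensor=True,
        normalize_embeddings=True,
    )
```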
Lack Of SEO Spam
The software engineer claimed that his search engine had less search spam and used the example of the query “best programming blogs” to illustrate his point. He also pointed out that his search engine could understand complex queries and gave the example of inputting an entire paragraph of content and discovering interesting articles about the topics in the paragraph.
Four Takeaways
Wilson listed many discoveries, but here are four that may be of particular interest to digital marketers and publishers following this journey of creating a search engine:
1. The Size Of The Index Is Important
One of the most important takeaways Wilson learned from two months of building a search engine is that the size of the search index matters because, in his words, “coverage defines quality.”
2. Crawling And Filtering Are The Hardest Problems
Although crawling as much content as possible is important for surfacing useful results, Wilson also learned that filtering low-quality content is difficult, because it requires balancing the need for quantity against the pointlessness of crawling a seemingly endless web of junk. He concluded that a way of filtering out useless content is essential.
This is actually the problem that Sergey Brin and Larry Page solved with PageRank. PageRank modeled user behavior: the choices and votes of humans who validate web pages with links. Although PageRank is nearly 30 years old, the underlying intuition remains so relevant today that the AI search engine Perplexity uses a modified version of it for its own search engine.
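For readers unfamiliar with the mechanics, here is a toy power-iteration sketch of the PageRank idea, with an illustrative damping factor and a three-page graph that are not from the article:

```python
# Toy power-iteration sketch of the PageRank idea: links act as votes, and each
# page shares its score across the pages it links to. The damping factor and the
# three-page graph are illustrative only.
def pagerank(links: dict[str, list[str]], damping: float = 0.85, iters: int = 50) -> dict[str, float]:
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            for target in outlinks:
                new_rank[target] += damping * rank[page] / len(outlinks)
        rank = new_rank
    return rank

print(pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]}))
```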
3. Limitations Of Small-Scale Search Engines
Another takeaway he discovered is that there are limits to how successful a small independent search engine can be. Wilson cited the inability to crawl the entire web as a constraint which creates coverage gaps.
4. Judging Trust And Authenticity At Scale Is Complex
Automatically determining originality, accuracy, and quality across unstructured data is non-trivial.
Wilson writes:
“Determining authenticity, trust, originality, accuracy, and quality automatically is not trivial. …if I started over I would put more emphasis on researching and developing this aspect first.
Infamously, search engines use thousands of signals on ranking and filtering pages, but I believe newer transformer-based approaches towards content evaluation and link analysis should be simpler, cost effective, and more accurate.”
Interested in trying the search engine? You can find it here, and you can read the full technical details of how he did it here.
Featured Image by Shutterstock/Red Vector