Dropbox has rebuilt and released a new search engine, dubbed “Nautilus”, which it says is significantly faster at indexing new and updated content. The company is also working on unlocking search for image, video, and audio files.
The new engine uses machine learning to help power personalised searches for Dropbox’s 500 million users across hundreds of billions of documents. The company’s engineers described this as a unique challenge, because searches need to be highly personalised and to work across rapidly changing sets of documents.
The Dropbox Search Engine Architecture
The system they ultimately built is the first overhaul of the Dropbox search engine since 2015. It uses machine learning to help find files, which required a fundamental rethink of the architecture, and the separation of indexing from serving, to make this possible.
It targets a budget of 500ms for the 95th percentile search (i.e., only 5 percent of searches should ever take longer than 500ms).
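To make the latency target concrete, here is a minimal sketch of checking a batch of search timings against a 95th-percentile budget. The function names, the nearest-rank percentile method, and the sample latencies are all illustrative assumptions, not anything Dropbox has published.

```python
def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it (pct is an integer)."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, giving a 1-based rank.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def within_budget(latencies_ms, budget_ms=500, pct=95):
    """True if the pct-th percentile latency is inside the budget."""
    return percentile(latencies_ms, pct) <= budget_ms


# Hypothetical search latencies in milliseconds (made-up sample data).
latencies_ms = [120, 180, 90, 450, 300, 210, 95, 130, 610, 160,
                140, 170, 220, 250, 110, 105, 330, 480, 150, 200]

print(percentile(latencies_ms, 95))   # 95th percentile of the sample
print(within_budget(latencies_ms))    # compared against the 500ms budget
```

Note that one slow query (610ms here) does not break the budget on its own: the target constrains the 95th percentile, not the maximum.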
The role of the indexing pipeline is to process file and user activity, extract content and metadata out of it, and create a search index. The serving system then uses this search index to return a set of results in response to user queries.
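The split between indexing and serving can be illustrated with a toy inverted index. This is only a sketch of the general idea, assuming whitespace tokenisation and an in-memory dictionary; the class and function names are hypothetical and say nothing about how Nautilus itself stores or queries its index.

```python
from collections import defaultdict


class SearchIndex:
    """Toy inverted index: maps each token to the set of documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)


def index_document(index, doc_id, text):
    """Indexing side: extract tokens from a document and record them."""
    for token in text.lower().split():
        index.postings[token].add(doc_id)


def serve_query(index, query):
    """Serving side: return documents that contain every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.postings.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.postings.get(term, set())
    return result


index = SearchIndex()
index_document(index, "doc1", "Quarterly budget forecast")
index_document(index, "doc2", "Budget meeting notes")

print(serve_query(index, "budget"))
print(serve_query(index, "budget notes"))
```

The serving function only reads the index, which is what lets the two halves be scaled and deployed independently in a design like this.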
Together, these systems span several geographically-distributed Dropbox data centers, running tens of thousands of processes on more than a thousand physical hosts.
Engineering lead Diwaker Gupta writes of the new engine: “For most documents, we rely on Apache Tika to transform the original document into a canonical HTML representation, which then gets parsed in order to extract a list of ‘tokens’ (i.e. words) and their ‘attributes’ (i.e. formatting, position, etc…).
“After we extract the tokens, we can augment the data in various ways using a ‘Doc Understanding’ pipeline, which is well suited for experimenting with extraction of optional metadata and signals. As input it takes the data extracted from the document itself and outputs a set of additional data which we call ‘annotations.’ Pluggable modules called ‘annotators’ are in charge of generating the annotations.”
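The flow Gupta describes, from HTML to tokens with attributes and then through pluggable annotators, might look roughly like the sketch below. It assumes the document has already been converted to HTML (the Tika step is not shown) and uses Python's standard `html.parser`. The token fields, the `in_heading` attribute, and both annotators are hypothetical examples, not Dropbox's actual schema or modules.

```python
from html.parser import HTMLParser


class TokenExtractor(HTMLParser):
    """Collects words from HTML text along with simple attributes."""

    def __init__(self):
        super().__init__()
        self.tokens = []
        self._heading_depth = 0  # > 0 while inside an <h1> element

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._heading_depth += 1

    def handle_endtag(self, tag):
        if tag == "h1" and self._heading_depth > 0:
            self._heading_depth -= 1

    def handle_data(self, data):
        for raw in data.split():
            word = raw.strip(".,;:!?").lower()
            if word:
                self.tokens.append({
                    "text": word,
                    "position": len(self.tokens),
                    "in_heading": self._heading_depth > 0,
                })


def extract_tokens(html):
    parser = TokenExtractor()
    parser.feed(html)
    parser.close()
    return parser.tokens


# Hypothetical annotators: each takes the token list and returns one annotation.
def token_count_annotator(tokens):
    return len(tokens)


def heading_terms_annotator(tokens):
    return [t["text"] for t in tokens if t["in_heading"]]


def run_annotators(tokens, annotators):
    """Apply every registered annotator and collect the results by name."""
    return {name: annotate(tokens) for name, annotate in annotators.items()}


html = "<html><body><h1>Quarterly Report</h1><p>Revenue grew in Q3.</p></body></html>"
tokens = extract_tokens(html)
annotations = run_annotators(tokens, {
    "token_count": token_count_annotator,
    "heading_terms": heading_terms_annotator,
})
print(annotations)
```

Because annotators are just entries in a dictionary here, adding a new one does not require touching the extraction step, which mirrors the "pluggable" property described in the quote.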
The ranking engine, meanwhile, is powered by a machine learning model that outputs a score for each document based on a variety of signals.
Some signals measure the relevance of the document to the query (e.g., BM25), while others measure the relevance of the document to the user at the current moment in time (e.g., who the user has been interacting with, or what types of files the user has been working on). The model is trained on anonymised “click” data from the Dropbox front-end, which excludes any personally identifiable data.
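As a rough illustration of how a query-relevance signal and a user-relevance signal could be combined, the sketch below computes a standard BM25 score and adds a per-document user-affinity value. The fixed weighted sum is a stand-in for the learned model, which the article does not detail. The `user_affinity` input, the weights, and the sample documents are all assumptions made for this example.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each document (a list of tokens) against the query with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each term once per document

    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores[doc_id] = score
    return scores


def rank(query_terms, docs, user_affinity, w_text=1.0, w_user=1.0):
    """Order documents by a weighted sum of BM25 and a user-affinity signal.

    The linear combination is a placeholder for a trained ranking model.
    """
    text = bm25_scores(query_terms, docs)
    combined = {
        doc_id: w_text * text[doc_id] + w_user * user_affinity.get(doc_id, 0.0)
        for doc_id in docs
    }
    return sorted(combined, key=combined.get, reverse=True)


docs = {
    "budget.xlsx": ["q3", "budget", "forecast"],
    "notes.txt": ["meeting", "notes", "budget", "budget"],
    "photo.jpg": ["holiday", "photo"],
}

print(rank(["budget"], docs, user_affinity={}))
print(rank(["budget"], docs, user_affinity={"budget.xlsx": 1.0}))
```

With no user signal, the document that mentions the term more often ranks first on text relevance alone. Once the hypothetical affinity score says the user has recently worked with `budget.xlsx`, that file moves to the top, which is the kind of personalisation the article describes.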