An interactive demo of a document retrieval system, with explanations of how it works
Interact and add documents using the toolbar at the bottom of the page.
Google’s mission statement, besides not being evil, is “to organize the world’s information and make it universally accessible and useful”. This could also describe the field of information retrieval (IR), which Wikipedia says is “the activity of obtaining information resources relevant to an information need from a collection of information resources”. Web search engines have transformed the way that people look for information. Tools like Apache Lucene and Elasticsearch have made it possible for nearly anyone with some technical acumen to organize their own information. In this post, I’ll show you how these systems work from top to bottom (at a very simple level). This will also (hopefully) be interesting to people coming from other natural language processing disciplines.
At a high level, information retrieval systems match a query with the data that query pertains to. For example, let's consider a collection of medical documents. If our system is given the query "flu", we want it to return documents containing information about the flu. However, there are more steps involved before this can happen:
There are a lot of aspects to this problem, so before I dive in, I’ll mention some of the simplifying assumptions that I make in this system. These assumptions are bad in most real-world systems, but many people have come up with clever ways to fix them.
With these simplifying assumptions in mind, let’s look at how we’ll organize the system. Each document is associated with an index. We extract the terms from these documents, and organize them in a dictionary where each term points to the documents that contain that term. The figure below shows a visualization of this:
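To make this concrete, here is a minimal sketch of such an inverted index in Python. The document collection and the whitespace tokenization are illustrative assumptions, not the demo's actual implementation:

```python
from collections import defaultdict

def build_index(documents):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():  # naive tokenization for illustration
            index[term].add(doc_id)
    return index

docs = {
    1: "the flu is a virus",
    2: "influenza season",
    3: "flu shots and flu season",
}
index = build_index(docs)
# index["flu"] → {1, 3}
```

Looking up a term is then a single dictionary access, which is what makes this structure fast for search.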
At a simple level, we could look at all the terms in a given query, find the documents that contain those terms, and return them. A slight optimization is, for each term, to sort the documents in descending order according to how many times they mention that term. That way, instead of returning all the documents, we could just return the most relevant ones; we’re using the term frequency as a metric for relevance.
One of the first flaws in our system is that there are many ways to refer to the same thing. For example, if we have documents containing the words "flu" and "influenza", they are probably about similar topics, but our index doesn't have a good way to represent that. The simplest steps we can take to fix this are called stemming and lemmatization. These are two ways of collapsing common word forms. Stemming typically involves identifying common suffixes, such as "-ic" or "-ization", and removing them. In this example, I'll use an implementation of Porter's algorithm, a heuristic that has been shown to work well in practice. Lemmatization is a more sophisticated approach that uses a word's part of speech to figure out its lemma.
The next step in building out our system is to remove words that probably don't communicate much information about a document's content. For example, knowing that a document contains the word "the" or "of" tells us very little about that document. Having a good list of stopwords will also greatly reduce the size of our index, since we'll be able to exclude some of the most common words when we build it.
With stopwords, we briefly touched on the idea that some words communicate more information about a document’s content than others. With that in mind, here is the intuition for something called term frequency / inverse document frequency (TF-IDF):
If a term appears a lot in one document, and not very much in other documents, then it is probably a good indicator of what that document is about. More specifically, if we present our system with a query, we want to choose the documents that have a lot of the terms from that query, and we want to weight the less common terms more highly.
If you’ve added some documents using the toolbar at the bottom of the page, the table below will contain those documents, and their parsed phrases. Like the diagram above, we organize the documents by their ID; the right-most column shows the terms that we’ve parsed from the document.
We collect all the terms from all the documents in another table, where each term is associated with the documents that contain it. We compute the document frequency as the percentage of documents that contain that term; if our collection of documents gets big enough, then we’ll have some terms that appear only in a few documents. If a query contains one of those terms, then it is a safe bet that those documents are relevant to that query.
The search box below can be used to compute the term frequency and document frequency for a particular term. The term frequency is different for each document: it is the number of times the given term appears in that document. The document frequency is unique to the term: it is the number of distinct documents the term appears in (above we expressed it as a fraction of the collection; the two differ only by a factor of the total document count). There are a number of ways to combine these two quantities into a single TF-IDF score.
| ID | Term Frequency | Document Frequency | Score |
| --- | --- | --- | --- |
We can compute the TF-IDF scores for each term in a query, then add them together for each document, to get a single score. We can then rank the documents by this score to get the relevant documents.
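Putting the pieces together, here is a sketch of that scoring step. It uses the common tf × log(N/df) variant of TF-IDF; the demo's exact formula may differ, and the collection is again a toy example:

```python
import math

def tf_idf_scores(query, documents):
    """Score each document as the sum over query terms of tf * log(N / df),
    then rank documents by that score, descending.
    Uses the common log(N/df) IDF variant; other variants exist."""
    n_docs = len(documents)
    tokenized = {d: text.lower().split() for d, text in documents.items()}
    scores = {d: 0.0 for d in documents}
    for term in query.lower().split():
        df = sum(1 for terms in tokenized.values() if term in terms)
        if df == 0:
            continue  # term appears nowhere; contributes nothing
        idf = math.log(n_docs / df)
        for d, terms in tokenized.items():
            scores[d] += terms.count(term) * idf
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

docs = {
    1: "the flu is a virus",
    2: "influenza season",
    3: "flu shots and flu season",
}
# tf_idf_scores("flu", docs) ranks document 3 first (it mentions "flu" twice)
```

Note that with this variant, a term appearing in every document gets log(N/N) = 0 weight, which is exactly the "common terms tell us little" intuition from above.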