# codekansas

A blog of language, neuroscience, and deep learning

# Overview

Google’s mission statement, besides not being evil, is “to organize the world’s information and make it universally accessible and useful”. This could also describe the field of information retrieval (IR), which Wikipedia says is “the activity of obtaining information resources relevant to an information need from a collection of information resources”. Web search engines have transformed the way that people look for information. Tools like Apache Lucene and Elasticsearch have made it possible for nearly anyone with some technical acumen to organize their own information. In this post, I’ll show you how these systems work from top to bottom (at a very simple level). This will also (hopefully) be interesting to people coming from other natural language processing disciplines.

# User Interaction

At a high level, information retrieval systems match a query with the data that query pertains to. For example, let’s consider a collection of medical documents. If our system is given the query “flu”, we want to return documents containing information about the flu. However, a few steps happen before this can occur:

• The user has a question to which they want an answer
• They write a query that they think will give them that answer
• The system has a collection of documents, and it has to choose the documents that will give the user the best answer to their question

There are a lot of aspects to this problem, so before I dive in, I’ll mention some of the simplifying assumptions that I make in this system. These assumptions don’t hold in most real-world systems, but many people have come up with clever ways to work around them.

• All the documents are good quality. On the web, there is obviously a lot of crap. In our system, we’ll be scoring documents only by their relevance to a particular query. One of Google’s innovations was the PageRank algorithm, which is an attempt to determine how good a document is based on how many other documents on the web point to it.
• A document can be represented purely by the words it contains; we can ignore the relationships between those words. This is called the bag-of-words assumption, and it is an important simplification in many IR systems because it greatly reduces the number of relationships that we have to represent; when looking for a document about cats eating lasagne, we can look for documents containing the words “cats”, “eating” and “lasagne”, rather than trying to represent the relationships between those words. It also leads us to a convenient way of representing documents. Some more complex systems attempt to model concepts at a higher level.
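As a sketch of the bag-of-words idea, here is a minimal Python version using the standard library’s `Counter`; the example sentence is made up for illustration:

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Represent a document purely by its word counts, ignoring word order."""
    return Counter(text.lower().split())

bag = bag_of_words("cats eating lasagne with other cats")
# bag["cats"] == 2; word order and grammar are discarded entirely
```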

## System Overview

With these simplifying assumptions in mind, let’s look at how we’ll organize the system. Each document is associated with an index. We extract the terms from these documents, and organize them in a dictionary where each term points to the documents that contain that term. The figure below shows a visualization of this:

At a simple level, we could look at all the terms in a given query, find the documents that contain those terms, and return them. A slight optimization is, for each term, to sort the documents in descending order according to how many times they mention that term. That way, instead of returning all the documents, we could just return the most relevant ones; we’re using the term frequency as a metric for relevance.
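A minimal sketch of this kind of inverted index in Python, with each posting list sorted by term frequency; the two example documents are invented for illustration:

```python
from collections import Counter, defaultdict

def build_index(docs: dict[int, str]) -> dict[str, list[tuple[int, int]]]:
    """Map each term to (doc_id, term_frequency) pairs, most frequent first."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for term, count in Counter(text.lower().split()).items():
            index[term].append((doc_id, count))
    # Sort each posting list so the most relevant documents come first.
    for term in index:
        index[term].sort(key=lambda pair: pair[1], reverse=True)
    return dict(index)

docs = {1: "the flu is a virus", 2: "flu flu shots prevent the flu"}
index = build_index(docs)
# index["flu"] == [(2, 3), (1, 1)]: document 2 mentions "flu" three times
```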

# Stemming and Lemmatization

One of the first flaws in our system is that there are many ways to refer to the same thing. For example, if we have documents containing the words “flu” and “influenza”, they are probably about similar topics, but our index doesn’t have a good way to represent that. The simplest steps we can take to fix this are called stemming and lemmatization, which are two ways of reducing words to a common form. Stemming typically involves identifying common suffixes, such as “-ic” or “-ization”, and removing them. In this example, I’ll use an implementation of Porter’s algorithm, which has been shown to work well heuristically. Lemmatization is a more sophisticated approach that uses a word’s part of speech to figure out its lemma, or dictionary form.
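To illustrate the idea, here is a toy suffix-stripping stemmer. This is not the real Porter algorithm, which applies its rules in ordered phases with extra conditions on the stem; the suffix list here is an arbitrary illustration:

```python
# A deliberately tiny suffix list, checked longest-first.
SUFFIXES = ["ization", "ation", "ness", "ing", "ed", "s"]

def stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least a 3-letter stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(stem("organization"))  # organ
print(stem("eating"))        # eat
print(stem("cats"))          # cat
```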

# Stopwords

The next step in building out our system is to remove words that probably don’t communicate very much information about what the document is about. For example, knowing that a document contains the word “the” or “of” tells us very little about that document’s content. Having a good list of stopwords will also greatly reduce the size of our index, since we’ll be able to exclude some of the most common words when we build it.
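A sketch of stopword filtering, with a deliberately tiny stopword list (real lists typically contain a few hundred words):

```python
# A toy stopword list; production systems use much longer, curated lists.
STOPWORDS = {"the", "of", "a", "is", "and", "to", "in"}

def remove_stopwords(terms: list[str]) -> list[str]:
    """Drop terms that carry little information about document content."""
    return [t for t in terms if t not in STOPWORDS]

print(remove_stopwords("the flu is a virus".split()))  # ['flu', 'virus']
```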

# Term Frequency / Inverse Document Frequency

With stopwords, we briefly touched on the idea that some words communicate more information about a document’s content than others. With that in mind, here is the intuition for something called term frequency / inverse document frequency (TF-IDF): a term characterizes a document well if it appears many times in that document (a high term frequency) but in few other documents in the collection (a low document frequency).

# Documents

If you’ve added some documents using the toolbar at the bottom of the page, the table below will contain those documents, and their parsed phrases. Like the diagram above, we organize the documents by their ID; the right-most column shows the terms that we’ve parsed from the document.

| ID | Text | Terms |
|----|------|-------|

# Terms

We collect all the terms from all the documents in another table, where each term is associated with the documents that contain it. We compute the document frequency as the percentage of documents that contain the term; once our collection of documents gets big enough, some terms will appear in only a few documents. If a query contains one of those rare terms, it’s a safe bet that the few documents containing it are relevant to that query.
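A sketch of computing document frequencies as fractions of the collection, with made-up example documents:

```python
def document_frequency(docs: dict[int, str]) -> dict[str, float]:
    """Fraction of documents in which each term appears at least once."""
    counts: dict[str, int] = {}
    for text in docs.values():
        for term in set(text.lower().split()):  # count each doc at most once
            counts[term] = counts.get(term, 0) + 1
    return {term: count / len(docs) for term, count in counts.items()}

docs = {1: "flu season", 2: "flu shots", 3: "broken arm"}
df = document_frequency(docs)
# df["flu"] == 2/3: two of the three documents contain "flu"
```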

| Term | Document Frequency | IDs |
|------|--------------------|-----|

# Single-word Queries

The search box below can be used to compute the term frequency and document frequency for a particular term. The term frequency $tf$ is different for each document; it is the number of times the given term appears in the document. The document frequency $df$ is unique to the term; it is the number of unique documents that the term appears in. There are a number of ways to compute the TF-IDF score, but we’ll compute it as

$$\text{score} = tf \cdot \log\left(\frac{N}{df}\right)$$

where $N$ is the total number of documents in the collection.
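Assuming the common $tf \cdot \log(N/df)$ formulation, here is a sketch of scoring a single term against one document; the example collection is made up:

```python
import math

def tfidf(term: str, doc: str, docs: dict[int, str]) -> float:
    """TF-IDF of a term in one document, relative to a small collection."""
    tf = doc.lower().split().count(term)  # term frequency in this document
    df = sum(1 for text in docs.values() if term in text.lower().split())
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(docs) / df)  # rarer terms get a higher weight

docs = {1: "the flu is a virus", 2: "flu shots prevent the flu", 3: "a broken arm"}
score = tfidf("flu", docs[2], docs)  # tf=2, df=2, N=3 -> 2 * log(1.5)
```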

| ID | Term Frequency | Document Frequency | Score |
|----|----------------|--------------------|-------|

# General Queries

We can compute the TF-IDF score for each term in a query, then add those scores together for each document to get a single score per document. We can then rank the documents by this score to retrieve the most relevant ones.
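Putting it all together, here is a sketch that sums per-term TF-IDF scores (using the common $tf \cdot \log(N/df)$ form) and ranks the documents; the example documents and query are invented:

```python
import math

def score_query(query: str, docs: dict[int, str]) -> list[tuple[int, float]]:
    """Sum per-term TF-IDF scores for each document and rank by total."""
    n = len(docs)
    terms = query.lower().split()
    scores = {}
    for doc_id, text in docs.items():
        words = text.lower().split()
        total = 0.0
        for term in terms:
            tf = words.count(term)
            df = sum(1 for t in docs.values() if term in t.lower().split())
            if tf and df:
                total += tf * math.log(n / df)
        scores[doc_id] = total
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

docs = {
    1: "flu symptoms and flu treatment",
    2: "broken arm treatment",
    3: "cold symptoms",
}
ranked = score_query("flu symptoms", docs)
# Document 1 ranks first: it contains both query terms, and "flu" is rare.
```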

| ID | Text | Score |
|----|------|-------|