Similarity Search and Hashing for Text Documents

by Cody Collier

Introduction

This is a high level overview of similarity hashing for text, locality sensitive hashing (LSH) in particular, and connections to application domains like approximate nearest neighbor (ANN) search. This writeup is the result of a literature search and part of a broader project to identify an implementation pattern for similarity search in large scale document collections (100M+).

The Continuum of Similarity

At Catalyst, we encounter ideas which involve the identification and measurement of document similarity. Consider a "more like this" feature, ranking documents by similarity to other documents, clustering documents, tagging near duplicates, email threading, and so on.

In the abstract, these are all issues of similarity, and similarity can be scored on a continuum between 0 and 1. Mildly similar documents might score a 0.5. Exact duplicates score at 1.0.

Is there a general implementation pattern that can be brought to bear?

Can we derive a similarity scoring engine and apply it to multiple current and future features?

How do our answers change as we face larger scale collections of 100M or 500M documents?

These are the drivers behind this research.

Traditional Hashing

In software and systems, hash functions have gained popularity as a means to deterministically map input data to a predefined range of "buckets". Here's a very basic hash function, which hashes positive integers into 10 different buckets:

# What is the leftover amount when x is divided by 10?
h(x) = x mod 10

# Here are some examples:
h(10) =  10 / 10 =  1 with 0 leftover = bucket 0
h(18) =  18 / 10 =  1 with 8 leftover = bucket 8
h(34) =  34 / 10 =  3 with 4 leftover = bucket 4
h(237) = 237 / 10 = 23 with 7 leftover = bucket 7

When faced with storing 10 million files, one might wish to spread them evenly across a small range of 1,000 directories (aka buckets). A sequence of hashing and heuristics could be used to assign each of the files to one of the directories.
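To make this concrete, here is a minimal sketch of hash-based bucketing in Python. The choice of MD5, the 1,000-directory range, and the sample filenames are illustrative assumptions, not a prescription.

import hashlib

NUM_DIRS = 1000

def directory_for(filename):
    # Hash the filename and reduce the digest to one of NUM_DIRS buckets.
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % NUM_DIRS
    return "%04d" % bucket

# Two similar filenames will usually land in unrelated directories.
print(directory_for("report_q3.pdf"))
print(directory_for("report_q4.pdf"))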

At the other extreme, when faced with identifying exact duplicates of data, one might define a much larger range, greater even than the number of items to be hashed. If two items hash to the same bucket, then you can be relatively assured that they are identical. This can be used in various ways to filter duplicate items. As a side note, the extension of this idea into probabilistic data structures is where the popular Bloom filter is situated.

The main intuition behind this common kind of hashing is that it ensures any given set of inputs will generate hashes which are randomly and uniformly distributed across the defined range. Even the slightest change in input will generate a vastly different hash or signature.

In cryptographic hashing and signatures, the intuition is similar. Hashing the binary contents of a file with an MD5 or SHA-class hash generates a fixed-length signature for that file. Even a single character change within a text file would then result in a dramatically different document signature.
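A quick sketch illustrates this avalanche behavior; the choice of SHA-256 and the sample strings are just for demonstration.

import hashlib

a = "The dog is brown"
b = "The dog is brown"   # a single character changed

# The two digests bear no resemblance to each other.
print(hashlib.sha256(a.encode("utf-8")).hexdigest())
print(hashlib.sha256(b.encode("utf-8")).hexdigest())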

Similarity Hashing

Similarity hashing arises from an idea that contrasts with traditional hashing. The intuition for similarity hashing is to generate hashes or signatures which preserve the similarity and relationships between items. Rather than a random and uniform distribution, the aim of similarity hashing is to cluster like items together.

Two documents which contain very similar content should result in very similar signatures when passed through a similarity hashing system. Similar content leads to similar hashes. Locality sensitive hashing (LSH) is a formal name for such a system, and a broad academic topic addressing related concerns.

An Example

Let's walk through an example of one kind of similarity hashing. This example should help explore the concept. It will also begin to show that LSH is not a singular hash function but an assembly of many techniques involving the generation and comparison of document signatures.

First, imagine two sentences:

Sentence A: "The dog is brown"
Sentence B: "The dog is happy"

Then, imagine representing those two sentences in a matrix like so:

words: the | dog | is | brown | happy | sunshine
    A:  1  |  1  | 1  |   1   |   0   |    0
    B:  1  |  1  | 1  |   0   |   1   |    0

For each sentence and each column, a 1 indicates the word was present and a 0 indicates the word was not present. Note that the word "sunshine" is added to the matrix, and since neither sentence contains that word, both have a 0 in that column. It is included to help illustrate how such empty columns drop out of the similarity measurement later.

How similar are the two sentences?

How can the similarity be scored?

This can be formulated as a set problem, and the two ordered sets can be compared:

Sentence A = [1, 1, 1, 1, 0, 0]
Sentence B = [1, 1, 1, 0, 1, 0]

Both sentences have a 1 in the first three columns, and five columns have a 1 in at least one of the two sets. So we'll give these two sentences a similarity score of 3/5, or 0.6. The formal name for this measure is Jaccard similarity.
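As a minimal sketch, the same score can be computed directly from word sets in Python; the helper below is purely illustrative.

def jaccard(a, b):
    # Size of the intersection divided by size of the union.
    return len(a & b) / len(a | b)

sentence_a = set("the dog is brown".split())
sentence_b = set("the dog is happy".split())

print(jaccard(sentence_a, sentence_b))   # 3 shared words / 5 total words = 0.6

Words like "sunshine" that appear in neither sentence never enter the sets, which mirrors how the all-zero columns drop out of the score.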

Walking through these steps, we have hashed two sentences and then compared their signatures to generate a similarity score.

On the Path to MinHash

Let's move closer to a real world scenario, where it's not just two sentences, but a collection of many documents. Envision a matrix of all the terms and documents in the collection.

 words: ace | cat | bag | brown | happy | water | ...
 doc 1:  1  |  1  |  1  |   1   |   0   |   0   | ...
 doc 2:  1  |  0  |  0  |   1   |   0   |   0   | ...
 doc 3:  1  |  0  |  1  |   0   |   0   |   0   | ...
 doc 4:  0  |  0  |  0  |   0   |   0   |   0   | ...
 doc 5:  1  |  1  |  1  |   1   |   0   |   0   | ...
 doc n...

Such a matrix would quickly get very large and it would also tend to be very sparse, meaning it would contain lots of zeros. When dealing with very large scale collections, holding such large matrices in memory and performing comparison calculations becomes a difficult problem.

Here, the various families of LSH techniques come into play. One of the earliest techniques to be used was MinHash [2]. MinHash happens to begin with a characteristic matrix much like the one in our example. Then, using some clever arrangements of hashing and probability, very compact document signatures can be generated. For example, imagine each document in a collection being represented by a signature of only 1,024 bits or numbers. Formally, this is often called dimensionality reduction.
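Here is a minimal MinHash sketch, assuming documents have already been reduced to word sets. The number of hash functions (128), the universal hash family, and the use of Python's built-in hash are illustrative choices only.

import random

P = 2**61 - 1            # a large prime for the hash family
NUM_HASHES = 128

random.seed(1)
HASH_PARAMS = [(random.randrange(1, P), random.randrange(0, P))
               for _ in range(NUM_HASHES)]

def minhash_signature(words):
    # For each hash function, keep the minimum hash value over all words.
    return [min((a * hash(w) + b) % P for w in words) for a, b in HASH_PARAMS]

def estimate_similarity(sig1, sig2):
    # The fraction of matching positions estimates the Jaccard similarity.
    matches = sum(1 for x, y in zip(sig1, sig2) if x == y)
    return matches / len(sig1)

sig_a = minhash_signature(set("the dog is brown".split()))
sig_b = minhash_signature(set("the dog is happy".split()))
print(estimate_similarity(sig_a, sig_b))   # approximately 0.6 in expectation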

Pausing here, imagine now a collection of 1 million document signatures. In the worst case, you might have to compare every doc signature to every other signature. That is roughly 1M^2, or about a trillion, comparisons. If each comparison takes 1 millisecond, we'd be done in about 32 years! This is the comparison problem.

Continuing on with MinHash, another clever round of hashing and algorithms allows for a much smaller pool of "candidate signatures" to be identified for comparison. With a smaller set of signatures, the comparison problem becomes more reasonable.
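That candidate-selection step is commonly done with a "banding" trick: each signature is split into bands, each band is hashed into a bucket, and only documents that collide in at least one band become candidate pairs. The sketch below assumes signatures like those above; the band and row counts are illustrative tuning parameters.

from collections import defaultdict
from itertools import combinations

def candidate_pairs(signatures, bands=32, rows=4):
    # signatures maps doc_id -> a signature list of length bands * rows.
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for band in range(bands):
            chunk = tuple(sig[band * rows:(band + 1) * rows])
            buckets[(band, chunk)].add(doc_id)
    # Any two documents sharing a bucket in any band become a candidate pair.
    pairs = set()
    for docs in buckets.values():
        pairs.update(combinations(sorted(docs), 2))
    return pairs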

Note, however, that with each round of dimensionality reduction and hashing, small bits of risk for false positives and/or false negatives creep into our results. This is the trade-off for using probabilistic techniques. By extension, this is why leveraging LSH for nearest neighbor searches leads to the name Approximate Nearest Neighbors (ANN), and why various other uses result in approximate answers rather than exact answers.

A substantial amount of detail in the MinHash algorithms has been skipped in this description, partly to keep it accessible and partly because better references exist than I could provide here. If you'd like more detail, the early sections of Chapter 3 in the free book Mining of Massive Datasets are highly recommended.

To wrap up and generalize, several stages are starting to be exposed:

  1. How should a collection of documents be represented?
  2. How can that representation be made smaller and more compact?
  3. How can the comparison problem be improved by identifying smaller pools of candidate pairs?
  4. How can the remaining comparisons be performed quickly?

Let's move on and extrapolate from this example to the larger field of LSH.

Broad View of LSH

By some measures, locality sensitive hashing has been around almost 20 years, beginning with Broder's paper in 1997 [1]. Despite that, the field is still pretty young. LSH is found in a diverse set of academic research areas including digital signal processing, information retrieval, statistics, machine learning, and data mining. As such, searches of the literature reveal a fluid definition of the term LSH and related items.

In this section, I describe a common set of stages, found across the families of LSH implementations. Assuming a collection of documents, the rough sections are:

  1. selecting document characteristics
  2. dimensionality reduction and representation
  3. distance measures
  4. identifying candidate pairs
  5. performing comparisons, ranking, clustering, etc.

Now, let's look at each section in more detail. If you're less interested in the details, this is a good place to jump ahead to the later section "Expansion of the Field".

(A) selecting document characteristics

Selecting document characteristics might also be described as "feature selection" in the lexicon of machine learning. There are numerous ways to select and represent document characteristics. This selection can change the definition of what it means to be "similar". It can also add or remove technical constraints on the system.

One common approach to selecting document characteristics can be to generate shingles from the characters or even bytes of a document. As an example, the word "foxtrot" could be 3-shingled as ['fox', 'oxt', 'xtr', 'tro', 'rot']. Another approach might be to tokenize text and select unigrams, bigrams, or other n-grams. A more advanced approach might be to identify the words in a document along with their term frequencies or even their tf-idf scores.
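As a minimal sketch of these two approaches, the helpers below generate character shingles and word n-grams; the function names and parameters are illustrative.

def char_shingles(text, k=3):
    # Overlapping character substrings of length k.
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def word_ngrams(text, n=2):
    # Overlapping word sequences of length n.
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

print(char_shingles("foxtrot"))          # {'fox', 'oxt', 'xtr', 'tro', 'rot'}
print(word_ngrams("the dog is brown"))   # {('the', 'dog'), ('dog', 'is'), ('is', 'brown')}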

When shingling or tokenization is used, the characteristics tend to be more representative of the lexical nature and basic structure of the text. Did these three characters exist in the document? Did these four words appear together, in order, in the document? This lends itself well to a type of similarity that is more literal. This is seen in the historic use of MinHash for de-duplication of web collections, or in its application toward the identification of plagiarism, where common sentences might be copied and re-arranged.

Moving to term frequencies and especially tf-idf begins to create a more semantic representation of text. Later, this can lead to a similarity score which is more about document meaning and concepts. It can also build on the rich research around text statistics in collections.

Feature engineering is a particularly broad and important part of the process. Some of these techniques can also be important in non-LSH domains and even paired with more traditional hashes and set operations [5, 7]. However, let's continue on with our overview of the LSH scheme.

(B) dimensionality reduction and representation

Remember now that these document characteristics can grow in number very quickly, leading to an instance of the commonly described "curse of dimensionality". Enter the tools of lossy compression and dimensionality reduction.

Recall the matrix of words and documents from our earlier examples. Each list of numbers for a given document can be thought of as a vector or signature. These could be as simple as bit vectors like [1, 1, 0, ...], or more complex vectors of real numbers or strings of characters. In dealing with vectors or other representations, measuring distance becomes an important tool. If two document vectors are close in distance, it can mean the documents are similar or have similar content.

There are numerous techniques to reduce the dimensionality of document vectors or, more broadly, reduce the amount of data. They include basic techniques like sampling and quantization. They also extend into more complex techniques like MinHash [2], SimHash [10], and random projection / random indexing. These techniques can be surprisingly good at shrinking the data while preserving the distance or other relationships between documents.
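As a minimal sketch of one such technique, here is a SimHash-style reduction in the spirit of [10], assuming tokenized documents and uniform term weights. The 64-bit width and the use of MD5 as the per-token hash are illustrative assumptions.

import hashlib

BITS = 64

def simhash(tokens):
    counts = [0] * BITS
    for tok in tokens:
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        for i in range(BITS):
            # Add +1 if this bit of the token hash is set, otherwise -1.
            counts[i] += 1 if (h >> i) & 1 else -1
    # Each signature bit is the sign of its accumulated count.
    return sum(1 << i for i in range(BITS) if counts[i] > 0)

sig_a = simhash("the dog is brown".split())
sig_b = simhash("the dog is happy".split())
# Similar token sets tend to agree in most bit positions.
print(bin(sig_a ^ sig_b).count("1"))   # number of differing bits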

(C) distance measures for different representations

As noted in the previous section, distance measures become important in this world of alternate representations of documents. There are Euclidean distance, cosine distance, and so on. On a basic two-dimensional plot, this can be as simple as measuring the distance between two points. It gets more complex when working with vectors in higher dimensions. Alternatively, when working with strings or even just bits, other distance measures can be leveraged. Examples there include Levenshtein distance and Hamming distance.
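For vectors of numbers, a minimal cosine similarity helper looks like the sketch below; it reuses the two sentence vectors from earlier and is purely illustrative.

import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 1, 0]))   # 0.75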

Pausing a moment, let me say that while this section is in the middle, it can be really useful as a starting point. Hamming distance is conceptually simple, and measures the difference between strings of 1s and 0s. Here's an example:

bit vector 1: 001001
bit vector 2: 001011

These two vectors have a Hamming distance of 1, because there is only one position in which the two differ.
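In code, this reduces to an XOR followed by a population count; the sketch below recomputes the example above.

v1 = 0b001001
v2 = 0b001011

# XOR leaves a 1 in every position where the vectors differ; count those 1s.
print(bin(v1 ^ v2).count("1"))   # Hamming distance of 1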

Additionally, representing anything as strings of bits allows for the use of bit-oriented operations. Bits are easy to understand in the abstract, easy to work with, and there exist many very fast operations for their manipulation and comparison. So, it may be motivating to focus on the families of LSH techniques surrounding bit vectors. As an example, see the papers from Chappell et al. [3] on efficient top-k retrieval for similarity signatures of bit vectors.

(D) identifying candidate pairs

Now, no matter what characteristics are used and no matter how small each document signature becomes, there is still the matter of making numerous comparisons between signatures. So, various technical and statistical techniques have arisen for identifying candidate pairs. Finding these candidate pairs, which have a high likelihood of being similar, is very important. It can dramatically reduce the number of comparison operations required and make the whole idea of a similarity search feasible.

(E) performing comparisons, ranking, clustering, etc.

In some situations, you will want to make full comparisons of the candidate pairs, to confirm their similarity. Some candidates may be thrown out, or the candidates may be ranked by measure of similarity. Depending on the service, documents could be included, excluded, ranked, clustered, and so on.

Expansion of the Field

By now, your head may be swimming (mine is). The field expands quickly, with each stage of the LSH sequence being very deep.

LSH begins as a straightforward intuition of similarity hashes and comparisons. However, simple techniques fail at large scales (100M+ docs). Tackling these scale hurdles with approximation and clever algorithms then gives rise to families of LSH techniques around Hamming distance, families around Euclidean distance, and so on.

In practice, the simple intuition of similarity hashing becomes a sequence of many tedious steps. It is not insurmountable though, and these techniques are already well established in industry and even available in open source packages.

Closing

Thank you for taking the time to read this writeup. While this document represents many hours of my own work, I recognize the extant and parallel work of others on this topic at Catalyst. Discussion, questions, and feedback on any details large or small are warmly welcomed.

Thanks to JBull for feedback on drafts of this document.

References

Here, you'll find some of the references which informed this document, and other interesting resources.

[1] A. Broder. On the resemblance and containment of documents. Sequences 1997. Link
This is the original paper on LSH.

[2] A. Broder, M. Charikar, A. Frieze, and M. Mitzenmacher. "Min-wise independent permutations". ACM Symposium on Theory of Computing, pp. 327–336, 1998. Link
This is the original MinHash paper.

[3] T. Chappell, S. Geva, A. Nguyen, G. Zuccon. Efficient top-k retrieval with signatures. ADCS 2013. (r)
T. Chappell, S. Geva, G. Zuccon. Approximate Nearest-Neighbour Search with Inverted Signature Slice Lists. ECIR 2015. (r)
These two papers, essentially the same, describe accessible and interesting techniques for very fast comparisons of bit vectors. They address the comparison problem with a mix of inverted indexes and probability. Email me for copies.

[4] S. Geva, C. De Vries. TopSig: Topology Preserving Document Signatures. CIKM 2011. (r) Link
This paper was published prior to the Chappell papers [3], and addresses the generation of document signatures, extending random projection by introducing tf-idf scores. It's a bit vague, but there's a C code implementation.

[5] Redacted - internal source

[6] J. Leskovec, A. Rajaraman, and J. D. Ullman. Mining of Massive Datasets. Cambridge University Press, New York, NY, USA, 2014. (pr) Link
This is a pretty good book from a Stanford trio, covering a broad array of large scale data mining topics. It is available for free online and also used in a data mining class at Coursera.

[7] C. Manning, P. Raghavan, H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008. Link Link
This is a good, free book on Information Retrieval from Stanford.

[8] M. Slaney, M. Casey. Locality-Sensitive Hashing for Finding Nearest Neighbors. IEEE Signal Processing Magazine, March 2008.
This is a well respected description of one kind of Locality Sensitive Hashing.

[9] J. Wang, H. T. Shen, J. Song, and J. Ji. Hashing for similarity search: A survey. CoRR, abs/1408.2927, 2014 (pr) Link
This is a very lengthy and thorough survey of Locality Sensitive Hashing and similarity search. It holds particular interest because it was completed so recently (2014).

[10] C. Sadowski, G. Levin. SimHash: Hash-based Similarity Detection.
G. Manku, A. Jain, A. Das Sarma. Detecting Near-Duplicates for Web Crawling. WWW 2007. Link
These are two of the papers on SimHash from people with Google ties.