# How to compute the similarity between two text documents?

###### Posted By: Anonymous

I am looking at working on an NLP project, in any programming language (though Python will be my preference).

I want to take two documents and determine how similar they are.

## Solution

The common way of doing this is to transform the documents into TF-IDF vectors and then compute the cosine similarity between them. Any textbook on information retrieval (IR) covers this; see especially *Introduction to Information Retrieval*, which is free and available online.
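To make the idea concrete, here is a minimal from-scratch sketch using only the standard library. The function names, the whitespace tokenizer, and the smoothed idf formula are assumptions chosen for illustration; real libraries tokenize and weight terms in their own ways, so the numbers will not match theirs exactly.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (illustrative tf * idf weighting)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed idf; this exact formula is an assumption, variants exist.
        vectors.append(
            {t: count * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, count in tf.items()}
        )
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = ["the cat sat on the mat", "the cat lay on the rug", "stock prices fell sharply"]
vecs = tfidf_vectors(docs)
sim_close = cosine_similarity(vecs[0], vecs[1])  # documents sharing several terms
sim_far = cosine_similarity(vecs[0], vecs[2])    # documents sharing no terms
```

Documents that share terms get a score between 0 and 1, while documents with no terms in common score exactly 0.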

### Computing Pairwise Similarities

TF-IDF (and similar text transformations) are implemented in the Python packages Gensim and scikit-learn. In the latter package, computing cosine similarities is as easy as:

```
from sklearn.feature_extraction.text import TfidfVectorizer

# text_files is assumed to be a list of paths to plain-text files
documents = [open(f).read() for f in text_files]
tfidf = TfidfVectorizer().fit_transform(documents)
# no need to normalize, since the vectorizer returns L2-normalized tf-idf rows by default
pairwise_similarity = tfidf * tfidf.T
```

or, if the documents are plain strings,

```
>>> corpus = ["I'd like an apple",
... "An apple a day keeps the doctor away",
... "Never compare an apple to an orange",
... "I prefer scikit-learn to Orange",
... "The scikit-learn docs are Orange and Blue"]
>>> vect = TfidfVectorizer(min_df=1, stop_words="english")
>>> tfidf = vect.fit_transform(corpus)
>>> pairwise_similarity = tfidf * tfidf.T
```

though Gensim may have more options for this kind of task.
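The comment in the first snippet says no extra normalization is needed. A quick way to convince yourself of that is to compare the plain dot product against scikit-learn's explicit `cosine_similarity` function. This is only a sanity-check sketch; the variable names `row_norms`, `dot_products`, and `explicit_cosine` are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["I'd like an apple",
          "An apple a day keeps the doctor away",
          "Never compare an apple to an orange",
          "I prefer scikit-learn to Orange",
          "The scikit-learn docs are Orange and Blue"]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(corpus)

# With the default norm="l2", every non-empty row should have unit length.
row_norms = np.sqrt(np.asarray(tfidf.multiply(tfidf).sum(axis=1)).ravel())

# Dot products of unit-length rows are cosine similarities.
dot_products = (tfidf @ tfidf.T).toarray()
explicit_cosine = cosine_similarity(tfidf)
same = np.allclose(dot_products, explicit_cosine)
```

If you pass `norm=None` to the vectorizer, the rows are no longer unit length and the plain product stops being a cosine similarity, so you would need `cosine_similarity` (or your own normalization) instead.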

See also this question.

[Disclaimer: I was involved in the scikit-learn TF-IDF implementation.]

### Interpreting the Results

From above, `pairwise_similarity` is a SciPy sparse matrix that is square in shape, with the number of rows and columns equal to the number of documents in the corpus.

```
>>> pairwise_similarity
<5x5 sparse matrix of type '<class 'numpy.float64'>'
with 17 stored elements in Compressed Sparse Row format>
```

You can convert the sparse matrix to a NumPy array via `.toarray()` or `.A` (a shorthand that recent SciPy releases have deprecated, so `.toarray()` is the safer choice):

```
>>> pairwise_similarity.toarray()
array([[1.        , 0.17668795, 0.27056873, 0.        , 0.        ],
       [0.17668795, 1.        , 0.15439436, 0.        , 0.        ],
       [0.27056873, 0.15439436, 1.        , 0.19635649, 0.16815247],
       [0.        , 0.        , 0.19635649, 1.        , 0.54499756],
       [0.        , 0.        , 0.16815247, 0.54499756, 1.        ]])
```

Let’s say we want to find the document most similar to the final document, “The scikit-learn docs are Orange and Blue”. This document has index 4 in `corpus`. You can find the index of the most similar document by **taking the argmax of that row, but first you’ll need to mask the 1’s, which represent the similarity of each document to itself**. You can do the latter through `np.fill_diagonal()`, and the former through `np.nanargmax()`:

```
>>> import numpy as np
>>> arr = pairwise_similarity.toarray()
>>> np.fill_diagonal(arr, np.nan)
>>> input_doc = "The scikit-learn docs are Orange and Blue"
>>> input_idx = corpus.index(input_doc)
>>> input_idx
4
>>> result_idx = np.nanargmax(arr[input_idx])
>>> corpus[result_idx]
'I prefer scikit-learn to Orange'
```
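If you want more than the single best match, the same masking idea extends to a ranked list. The sketch below is one possible way to do it; `most_similar` is a hypothetical helper, not part of NumPy or scikit-learn, and the matrix is hard-coded from the output shown earlier so the example runs on its own.

```python
import numpy as np


def most_similar(similarity, idx, k=1):
    """Return the indices of the k documents most similar to document idx.

    Hypothetical helper: masks the self-similarity, then sorts the row.
    """
    row = np.array(similarity[idx], dtype=float)  # copy, so the input is untouched
    row[idx] = -np.inf                            # a document never matches itself
    k = min(k, row.size - 1)
    order = np.argsort(row)[::-1]                 # highest similarity first
    return order[:k].tolist()


# Similarity matrix copied from the example output above.
sim = np.array([
    [1.0,        0.17668795, 0.27056873, 0.0,        0.0],
    [0.17668795, 1.0,        0.15439436, 0.0,        0.0],
    [0.27056873, 0.15439436, 1.0,        0.19635649, 0.16815247],
    [0.0,        0.0,        0.19635649, 1.0,        0.54499756],
    [0.0,        0.0,        0.16815247, 0.54499756, 1.0],
])

top_two = most_similar(sim, 4, k=2)  # two nearest neighbours of document 4
```

Using `-np.inf` instead of `np.nan` as the mask means a plain `argsort` works without any NaN-aware functions.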

Note: the purpose of using a sparse matrix is to save a substantial amount of space for a large corpus and vocabulary. Instead of converting to a NumPy array, you could do:

```
>>> n, _ = pairwise_similarity.shape
>>> pairwise_similarity[np.arange(n), np.arange(n)] = -1.0
>>> pairwise_similarity[input_idx].argmax()
3
```
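The snippet above overwrites the diagonal of `pairwise_similarity` in place. If you would rather keep the original matrix intact, one option is to mask a copy and take the row-wise argmax for every document at once. This is a sketch under the assumption that your matrix is a SciPy CSR matrix; here it is rebuilt from the dense values shown earlier so the example is self-contained, and the names `masked` and `nearest` are illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Rebuild the example similarity matrix in sparse form.
dense = np.array([
    [1.0,        0.17668795, 0.27056873, 0.0,        0.0],
    [0.17668795, 1.0,        0.15439436, 0.0,        0.0],
    [0.27056873, 0.15439436, 1.0,        0.19635649, 0.16815247],
    [0.0,        0.0,        0.19635649, 1.0,        0.54499756],
    [0.0,        0.0,        0.16815247, 0.54499756, 1.0],
])
pairwise = csr_matrix(dense)

masked = pairwise.copy()   # leave the original similarities untouched
masked.setdiag(-1.0)       # self-similarity can no longer win the argmax

# Index of the most similar other document, for every row at once.
nearest = np.asarray(masked.argmax(axis=1)).ravel()
```

Here `nearest[4]` gives the same answer as the single-row lookup above, and `pairwise` still holds 1.0 on its diagonal.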

###### Answered By: Anonymous

Disclaimer: This content is shared under the Creative Commons license CC BY-SA 3.0. It is generated from the StackExchange website network.