How to Calculate Cosine Similarity in Python
Cosine similarity measures how similar two vectors are based on the angle between them, regardless of their magnitude. It produces a value between -1 (opposite directions) and 1 (same direction), with 0 indicating no similarity (perpendicular vectors).
This metric is widely used in natural language processing (comparing document vectors), recommendation systems (finding similar items), search engines (ranking relevance), and clustering (grouping similar data points).
The Formula
The cosine similarity between two vectors A and B is:
cosine_similarity = (A · B) / (||A|| × ||B||)
Where:
- A · B is the dot product of vectors A and B
- ||A|| is the Euclidean norm (magnitude) of A
- ||B|| is the Euclidean norm of B
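The formula translates directly into a few lines of plain Python. As a minimal sketch using only the standard library `math` module (no NumPy required):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity from the definition: (A . B) / (||A|| * ||B||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(round(cosine_similarity([2, 1, 2, 3, 2, 9], [3, 4, 2, 4, 5, 5]), 4))
```

In practice you will almost always use NumPy or scikit-learn instead, as shown next, but the pure-Python version makes the formula concrete.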
Calculating with NumPy (Manual)
The most straightforward approach uses NumPy's dot() and norm() functions:
```python
import numpy as np
from numpy.linalg import norm

A = np.array([2, 1, 2, 3, 2, 9])
B = np.array([3, 4, 2, 4, 5, 5])

cosine_sim = np.dot(A, B) / (norm(A) * norm(B))
print(f"Cosine Similarity: {cosine_sim:.4f}")
```

Output:

```
Cosine Similarity: 0.8189
```
A value of 0.82 indicates the vectors are quite similar in direction.
Wrapping It in a Reusable Function
```python
import numpy as np
from numpy.linalg import norm

def cosine_similarity(a, b):
    """Calculate cosine similarity between two vectors."""
    return np.dot(a, b) / (norm(a) * norm(b))

A = np.array([1, 2, 3])
B = np.array([4, 5, 6])
print(f"Similarity: {cosine_similarity(A, B):.4f}")
```

Output:

```
Similarity: 0.9746
```
If either vector has a norm of zero (all elements are 0), division by zero occurs. Add a check:
```python
def cosine_similarity_safe(a, b):
    norm_a = norm(a)
    norm_b = norm(b)
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Undefined; return 0 by convention
    return np.dot(a, b) / (norm_a * norm_b)
```
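As a quick standalone check (repeating the guarded function so the snippet runs on its own), a zero vector now yields 0.0 instead of a runtime warning and `nan`:

```python
import numpy as np
from numpy.linalg import norm

def cosine_similarity_safe(a, b):
    norm_a = norm(a)
    norm_b = norm(b)
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Undefined; return 0 by convention
    return np.dot(a, b) / (norm_a * norm_b)

zero = np.array([0, 0, 0])
v = np.array([1, 2, 3])
print(cosine_similarity_safe(zero, v))  # 0.0
```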
Using sklearn.metrics.pairwise.cosine_similarity
Scikit-learn provides a built-in function that handles arrays and matrices efficiently:
```python
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

A = np.array([[2, 1, 2, 3, 2, 9]])
B = np.array([[3, 4, 2, 4, 5, 5]])

result = cosine_similarity(A, B)
print(f"Cosine Similarity: {result[0][0]:.4f}")
```

Output:

```
Cosine Similarity: 0.8189
```
Scikit-learn's cosine_similarity expects 2D arrays (matrices). Each row is treated as a separate vector. The result is a similarity matrix where result[i][j] is the similarity between the i-th row of A and the j-th row of B.
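If you start from 1D vectors, reshape them into single-row matrices before calling it; `reshape(1, -1)` is the usual idiom:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([2, 1, 2, 3, 2, 9])
b = np.array([3, 4, 2, 4, 5, 5])

# reshape(1, -1) turns a 1D vector into a 1-row matrix
result = cosine_similarity(a.reshape(1, -1), b.reshape(1, -1))
print(result.shape)                    # (1, 1)
print(round(float(result[0, 0]), 4))   # 0.8189
```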
Using scipy.spatial.distance.cosine
SciPy provides a cosine function that returns the cosine distance (1 - similarity):
```python
from scipy.spatial.distance import cosine
import numpy as np

A = np.array([2, 1, 2, 3, 2, 9])
B = np.array([3, 4, 2, 4, 5, 5])

# cosine() returns distance, so subtract from 1 for similarity
cosine_sim = 1 - cosine(A, B)
print(f"Cosine Similarity: {cosine_sim:.4f}")
```

Output:

```
Cosine Similarity: 0.8189
```
SciPy's `cosine()` returns the cosine distance (1 - similarity), not the similarity itself; always subtract the result from 1 to get the similarity value.
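SciPy also handles batches: `scipy.spatial.distance.cdist` with `metric="cosine"` computes pairwise cosine distances between two sets of vectors, which you convert the same way. A minimal sketch:

```python
import numpy as np
from scipy.spatial.distance import cdist

A = np.array([[2, 1, 2, 3, 2, 9]])
B = np.array([[3, 4, 2, 4, 5, 5]])

# cdist returns distances; 1 - distance gives similarity
similarities = 1 - cdist(A, B, metric="cosine")
print(round(float(similarities[0, 0]), 4))  # 0.8189
```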
Comparing a Vector Against Multiple Vectors
A common use case is comparing one vector against a batch of vectors. For example, finding the most similar document in a collection:
```python
import numpy as np
from numpy.linalg import norm

# Three vectors to compare against
A = np.array([
    [2, 1, 2],
    [3, 2, 9],
    [-1, 2, -3]
])

# The query vector
B = np.array([3, 4, 2])

# Compute cosine similarity for each row
similarities = np.dot(A, B) / (norm(A, axis=1) * norm(B))

for i, sim in enumerate(similarities):
    print(f"Vector {i} similarity: {sim:.4f}")

most_similar = np.argmax(similarities)
print(f"\nMost similar: Vector {most_similar}")
```

Output:

```
Vector 0 similarity: 0.8666
Vector 1 similarity: 0.6704
Vector 2 similarity: -0.0496

Most similar: Vector 0
```
- Vector 0 is most similar (0.87). It points in a similar direction.
- Vector 2 has a negative similarity (-0.05). It points in roughly the opposite direction.
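When the same collection is queried repeatedly, you can normalize the rows to unit length once up front; cosine similarity then reduces to a plain matrix-vector product. A sketch of that optimization:

```python
import numpy as np
from numpy.linalg import norm

A = np.array([[2, 1, 2], [3, 2, 9], [-1, 2, -3]], dtype=float)
B = np.array([3, 4, 2], dtype=float)

# Normalize each row of A (and B) to unit length, once
A_unit = A / norm(A, axis=1, keepdims=True)
B_unit = B / norm(B)

# Dot products of unit vectors ARE cosine similarities
similarities = A_unit @ B_unit
print(np.round(similarities, 4))  # [ 0.8666  0.6704 -0.0496]
```

For a large matrix queried many times, normalizing once and reusing `A_unit` avoids recomputing the norms on every query.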
Row-Wise Similarity Between Two Matrices
Compare corresponding rows between two matrices of the same shape:
```python
import numpy as np
from numpy.linalg import norm

A = np.array([[1, 2, 2], [3, 2, 2], [-2, 1, -3]])
B = np.array([[4, 2, 4], [2, -2, 5], [3, 4, -4]])

# Row-wise dot products
dot_products = np.sum(A * B, axis=1)

# Row-wise norms
norms_a = norm(A, axis=1)
norms_b = norm(B, axis=1)

similarities = dot_products / (norms_a * norms_b)

for i, sim in enumerate(similarities):
    print(f"Row {i}: {sim:.4f}")
```

Output:

```
Row 0: 0.8889
Row 1: 0.5066
Row 2: 0.4174
```
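The same row-wise computation can be written in one expression with `np.einsum`, which avoids materializing the intermediate `A * B` array:

```python
import numpy as np
from numpy.linalg import norm

A = np.array([[1, 2, 2], [3, 2, 2], [-2, 1, -3]], dtype=float)
B = np.array([[4, 2, 4], [2, -2, 5], [3, 4, -4]], dtype=float)

# "ij,ij->i" multiplies elementwise and sums each row: row-wise dot products
similarities = np.einsum("ij,ij->i", A, B) / (norm(A, axis=1) * norm(B, axis=1))
print(np.round(similarities, 4))  # [0.8889 0.5066 0.4174]
```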
Pairwise Similarity Matrix
Compute similarity between every pair of vectors. This is useful for clustering or finding duplicates:
```python
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

documents = np.array([
    [1, 0, 1, 1],  # Document 0
    [1, 1, 0, 1],  # Document 1
    [0, 1, 1, 0],  # Document 2
])

sim_matrix = cosine_similarity(documents)
print("Similarity Matrix:")
print(np.round(sim_matrix, 3))
```

Output:

```
Similarity Matrix:
[[1.    0.667 0.408]
 [0.667 1.    0.408]
 [0.408 0.408 1.   ]]
```
Each cell [i][j] shows the similarity between document i and document j. The diagonal is always 1.0 (each document is perfectly similar to itself).
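To find the most similar distinct pair, mask the diagonal first so a document can't match itself. A sketch using `np.fill_diagonal`:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

documents = np.array([
    [1, 0, 1, 1],  # Document 0
    [1, 1, 0, 1],  # Document 1
    [0, 1, 1, 0],  # Document 2
])

sim_matrix = cosine_similarity(documents)
np.fill_diagonal(sim_matrix, -1.0)  # exclude self-similarity

# Index of the largest remaining cell, as a (row, col) pair
i, j = np.unravel_index(np.argmax(sim_matrix), sim_matrix.shape)
print(f"Most similar pair: documents {i} and {j} ({sim_matrix[i, j]:.3f})")
```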
Practical Example: Finding Similar Text
Here's a real-world example using TF-IDF vectors to find the most similar sentence:
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "Python is a great programming language",
    "Java is also a popular programming language",
    "I love eating pizza and pasta",
    "Python programming is fun and easy"
]
query = "I enjoy coding in Python"

# Convert text to TF-IDF vectors
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(sentences + [query])

# Compare query (last row) against all sentences
query_vector = tfidf_matrix[-1]
sentence_vectors = tfidf_matrix[:-1]
similarities = cosine_similarity(query_vector, sentence_vectors)[0]

print("Similarities to query:")
for sentence, sim in zip(sentences, similarities):
    print(f" [{sim:.3f}] {sentence}")

best_match = sentences[similarities.argmax()]
print(f"\nBest match: '{best_match}'")
```

Output:

```
Similarities to query:
 [0.140] Python is a great programming language
 [0.000] Java is also a popular programming language
 [0.000] I love eating pizza and pasta
 [0.121] Python programming is fun and easy

Best match: 'Python is a great programming language'
```
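For larger collections, comparing the query against every vector gets slow. Scikit-learn's `NearestNeighbors` with `metric="cosine"` indexes the collection once and performs the same lookup (note it returns cosine distances, so subtract from 1 for similarity). A sketch reusing the sentences above; it fits only on the collection, unlike the example above which included the query in the fit, so the scores differ slightly:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

sentences = [
    "Python is a great programming language",
    "Java is also a popular programming language",
    "I love eating pizza and pasta",
    "Python programming is fun and easy",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(sentences)

# Index the sentence vectors once, then query
nn = NearestNeighbors(n_neighbors=1, metric="cosine").fit(tfidf_matrix)
distances, indices = nn.kneighbors(vectorizer.transform(["I enjoy coding in Python"]))

best = int(indices[0][0])
print(f"Best match: '{sentences[best]}' (similarity {1 - distances[0][0]:.3f})")
```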
Comparison of Methods
| Method | Input Shape | Returns | Best For |
|---|---|---|---|
| NumPy manual | 1D vectors | Single float | Simple pairwise comparison |
| `sklearn` `cosine_similarity` | 2D matrices | Similarity matrix | Batch comparisons, pairwise matrices |
| `scipy` `cosine` | 1D vectors | Cosine distance | When you need distance, not similarity |
Interpreting Cosine Similarity Values
| Value | Meaning |
|---|---|
| 1.0 | Identical direction (perfectly similar) |
| 0.5 to 0.99 | Similar vectors |
| 0.0 | Perpendicular (no similarity) |
| -0.99 to -0.5 | Dissimilar vectors |
| -1.0 | Opposite direction (perfectly dissimilar) |
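One practical caveat: floating-point rounding can push a computed value marginally outside [-1, 1] (e.g. 1.0000000000000002 for identical vectors), which breaks `np.arccos` if you later convert similarities to angles. Clipping keeps the value in range; a minimal sketch:

```python
import numpy as np
from numpy.linalg import norm

a = np.array([0.1, 0.2, 0.3])

sim = np.dot(a, a) / (norm(a) * norm(a))  # mathematically exactly 1.0
sim = np.clip(sim, -1.0, 1.0)             # guard against rounding past 1.0
angle = np.degrees(np.arccos(sim))        # safe now
print(sim, angle)
```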
Conclusion
Cosine similarity is a fundamental metric for comparing vectors based on direction rather than magnitude.
- For quick calculations on individual vectors, NumPy's `dot()` and `norm()` functions are simple and efficient.
- For batch operations and pairwise comparisons, scikit-learn's `cosine_similarity()` is the most convenient choice.
- Remember that SciPy's `cosine()` returns distance, not similarity; subtract from 1 to convert.
Whether you're building a search engine, recommendation system, or NLP pipeline, cosine similarity is an essential tool in your Python toolkit.