How to Calculate Cosine Similarity in Python

Cosine similarity measures how similar two vectors are based on the angle between them, regardless of their magnitude. It produces a value between -1 (opposite directions) and 1 (same direction), with 0 indicating no similarity (perpendicular vectors).

This metric is widely used in natural language processing (comparing document vectors), recommendation systems (finding similar items), search engines (ranking relevance), and clustering (grouping similar data points).

The Formula

The cosine similarity between two vectors A and B is:

cosine_similarity = (A · B) / (||A|| × ||B||)

Where:

  • A · B is the dot product of vectors A and B
  • ||A|| is the Euclidean norm (magnitude) of A
  • ||B|| is the Euclidean norm of B
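Before reaching for a library, the formula is easy to compute by hand. A minimal pure-Python sketch, using the same example vectors as the NumPy section below:

```python
import math

A = [2, 1, 2, 3, 2, 9]
B = [3, 4, 2, 4, 5, 5]

dot = sum(a * b for a, b in zip(A, B))     # A · B
norm_a = math.sqrt(sum(a * a for a in A))  # ||A||
norm_b = math.sqrt(sum(b * b for b in B))  # ||B||

similarity = dot / (norm_a * norm_b)
print(f"Cosine Similarity: {similarity:.4f}")  # Cosine Similarity: 0.8189
```

In practice you would use NumPy for speed, but this makes the formula concrete.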

Calculating with NumPy (Manual)

The most straightforward approach uses NumPy's dot() and norm() functions:

import numpy as np
from numpy.linalg import norm

A = np.array([2, 1, 2, 3, 2, 9])
B = np.array([3, 4, 2, 4, 5, 5])

cosine_sim = np.dot(A, B) / (norm(A) * norm(B))
print(f"Cosine Similarity: {cosine_sim:.4f}")

Output:

Cosine Similarity: 0.8189

A value of 0.82 indicates the vectors are quite similar in direction.

Wrapping It in a Reusable Function

import numpy as np
from numpy.linalg import norm

def cosine_similarity(a, b):
    """Calculate cosine similarity between two vectors."""
    return np.dot(a, b) / (norm(a) * norm(b))

A = np.array([1, 2, 3])
B = np.array([4, 5, 6])

print(f"Similarity: {cosine_similarity(A, B):.4f}")

Output:

Similarity: 0.9746

Handling Zero Vectors

If either vector has a norm of zero (all elements are 0), division by zero occurs. Add a check:

def cosine_similarity_safe(a, b):
    norm_a = norm(a)
    norm_b = norm(b)
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Undefined; return 0 as convention
    return np.dot(a, b) / (norm_a * norm_b)
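A quick usage check (repeating the function and imports so the snippet runs on its own). Without the guard, NumPy would emit a runtime warning and return nan for the zero vector instead of a usable number:

```python
import numpy as np
from numpy.linalg import norm

def cosine_similarity_safe(a, b):
    norm_a = norm(a)
    norm_b = norm(b)
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Undefined; return 0 as convention
    return np.dot(a, b) / (norm_a * norm_b)

zero = np.array([0, 0, 0])
v = np.array([1, 2, 3])

print(cosine_similarity_safe(zero, v))        # 0.0
print(f"{cosine_similarity_safe(v, v):.4f}")  # 1.0000
```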

Using sklearn.metrics.pairwise.cosine_similarity

Scikit-learn provides a built-in function that handles arrays and matrices efficiently:

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

A = np.array([[2, 1, 2, 3, 2, 9]])
B = np.array([[3, 4, 2, 4, 5, 5]])

result = cosine_similarity(A, B)
print(f"Cosine Similarity: {result[0][0]:.4f}")

Output:

Cosine Similarity: 0.8189
Note:

Scikit-learn's cosine_similarity expects 2D arrays (matrices). Each row is treated as a separate vector. The result is a similarity matrix where result[i][j] is the similarity between the i-th row of A and the j-th row of B.
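If your data starts out as 1D arrays, reshape them into single-row matrices first; passing 1D arrays to scikit-learn's function raises a ValueError. A short sketch:

```python
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

a = np.array([2, 1, 2, 3, 2, 9])
b = np.array([3, 4, 2, 4, 5, 5])

# reshape(1, -1) turns a length-n vector into a 1 x n row matrix
result = cosine_similarity(a.reshape(1, -1), b.reshape(1, -1))
print(result.shape)           # (1, 1)
print(f"{result[0, 0]:.4f}")  # 0.8189
```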

Using scipy.spatial.distance.cosine

SciPy provides a cosine function that returns the cosine distance (1 - similarity):

from scipy.spatial.distance import cosine
import numpy as np

A = np.array([2, 1, 2, 3, 2, 9])
B = np.array([3, 4, 2, 4, 5, 5])

# cosine() returns distance, so subtract from 1 for similarity
cosine_sim = 1 - cosine(A, B)
print(f"Cosine Similarity: {cosine_sim:.4f}")

Output:

Cosine Similarity: 0.8189
Warning:

SciPy's cosine() returns cosine distance (1 - similarity), not similarity directly. Always subtract from 1 to get the similarity value.
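Because distance and similarity for the same pair of vectors always sum to 1, subtracting the distance from 1 recovers the similarity. A quick check with the example vectors:

```python
from scipy.spatial.distance import cosine
import numpy as np

A = np.array([2, 1, 2, 3, 2, 9])
B = np.array([3, 4, 2, 4, 5, 5])

distance = cosine(A, B)    # cosine distance
similarity = 1 - distance  # cosine similarity

print(f"Distance:   {distance:.4f}")    # Distance:   0.1811
print(f"Similarity: {similarity:.4f}")  # Similarity: 0.8189
```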

Comparing a Vector Against Multiple Vectors

A common use case is comparing one vector against a batch of vectors. For example, finding the most similar document in a collection:

import numpy as np
from numpy.linalg import norm

# Three vectors to compare against
A = np.array([
    [2, 1, 2],
    [3, 2, 9],
    [-1, 2, -3]
])

# The query vector
B = np.array([3, 4, 2])

# Compute cosine similarity for each row
similarities = np.dot(A, B) / (norm(A, axis=1) * norm(B))

for i, sim in enumerate(similarities):
    print(f"Vector {i} similarity: {sim:.4f}")

most_similar = np.argmax(similarities)
print(f"\nMost similar: Vector {most_similar}")

Output:

Vector 0 similarity: 0.8666
Vector 1 similarity: 0.6704
Vector 2 similarity: -0.0496

Most similar: Vector 0

  • Vector 0 is most similar (0.87). It points in a similar direction.
  • Vector 2 has a negative similarity (-0.05). It points in roughly the opposite direction.
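To retrieve the top-k matches instead of only the single best one, sort the scores in descending order. A sketch reusing the vectors above:

```python
import numpy as np
from numpy.linalg import norm

A = np.array([[2, 1, 2], [3, 2, 9], [-1, 2, -3]])
B = np.array([3, 4, 2])

similarities = np.dot(A, B) / (norm(A, axis=1) * norm(B))

# argsort sorts ascending, so reverse for most-similar-first
ranking = np.argsort(similarities)[::-1]
print(ranking[:2])  # [0 1]
```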

Row-Wise Similarity Between Two Matrices

Compare corresponding rows between two matrices of the same shape:

import numpy as np
from numpy.linalg import norm

A = np.array([[1, 2, 2], [3, 2, 2], [-2, 1, -3]])
B = np.array([[4, 2, 4], [2, -2, 5], [3, 4, -4]])

# Row-wise dot products
dot_products = np.sum(A * B, axis=1)

# Row-wise norms
norms_a = norm(A, axis=1)
norms_b = norm(B, axis=1)

similarities = dot_products / (norms_a * norms_b)

for i, sim in enumerate(similarities):
    print(f"Row {i}: {sim:.4f}")

Output:

Row 0: 0.8889
Row 1: 0.5066
Row 2: 0.4174
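The same row-wise computation can be written more compactly with np.einsum, which computes all the dot products in a single call:

```python
import numpy as np
from numpy.linalg import norm

A = np.array([[1, 2, 2], [3, 2, 2], [-2, 1, -3]])
B = np.array([[4, 2, 4], [2, -2, 5], [3, 4, -4]])

# 'ij,ij->i' multiplies element-wise and sums each row: row-wise dot products
similarities = np.einsum('ij,ij->i', A, B) / (norm(A, axis=1) * norm(B, axis=1))
print(np.round(similarities, 4))  # [0.8889 0.5066 0.4174]
```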

Pairwise Similarity Matrix

Compute similarity between every pair of vectors. This is useful for clustering or finding duplicates:

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

documents = np.array([
    [1, 0, 1, 1],  # Document 0
    [1, 1, 0, 1],  # Document 1
    [0, 1, 1, 0],  # Document 2
])

sim_matrix = cosine_similarity(documents)

print("Similarity Matrix:")
print(np.round(sim_matrix, 3))

Output:

Similarity Matrix:
[[1.    0.667 0.408]
 [0.667 1.    0.408]
 [0.408 0.408 1.   ]]

Each cell [i][j] shows the similarity between document i and document j. The diagonal is always 1.0 (each document is perfectly similar to itself).
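To find the most similar distinct pair of documents, mask the diagonal before taking the argmax. One way to sketch it, reusing the matrix above:

```python
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

documents = np.array([
    [1, 0, 1, 1],  # Document 0
    [1, 1, 0, 1],  # Document 1
    [0, 1, 1, 0],  # Document 2
])

sim_matrix = cosine_similarity(documents)

# Mask self-similarity so a document cannot match itself
np.fill_diagonal(sim_matrix, -np.inf)
i, j = np.unravel_index(np.argmax(sim_matrix), sim_matrix.shape)
print(f"Most similar pair: documents {i} and {j}")  # documents 0 and 1
```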

Practical Example: Finding Similar Text

Here's a real-world example using TF-IDF vectors to find the most similar sentence:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "Python is a great programming language",
    "Java is also a popular programming language",
    "I love eating pizza and pasta",
    "Python programming is fun and easy"
]

query = "I enjoy coding in Python"

# Convert text to TF-IDF vectors
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(sentences + [query])

# Compare query (last row) against all sentences
query_vector = tfidf_matrix[-1]
sentence_vectors = tfidf_matrix[:-1]

similarities = cosine_similarity(query_vector, sentence_vectors)[0]

print("Similarities to query:")
for sentence, sim in zip(sentences, similarities):
    print(f" [{sim:.3f}] {sentence}")

best_match = sentences[similarities.argmax()]
print(f"\nBest match: '{best_match}'")

Output:

Similarities to query:
 [0.140] Python is a great programming language
 [0.000] Java is also a popular programming language
 [0.000] I love eating pizza and pasta
 [0.121] Python programming is fun and easy

Best match: 'Python is a great programming language'

Comparison of Methods

| Method | Input Shape | Returns | Best For |
|---|---|---|---|
| NumPy manual | 1D vectors | Single float | Simple pairwise comparison |
| sklearn.cosine_similarity | 2D matrices | Similarity matrix | Batch comparisons, pairwise matrices |
| scipy.cosine | 1D vectors | Cosine distance | When you need distance, not similarity |

Interpreting Cosine Similarity Values

| Value | Meaning |
|---|---|
| 1.0 | Identical direction (perfectly similar) |
| 0.5 to 0.99 | Similar vectors |
| 0.0 | Perpendicular (no similarity) |
| -0.99 to -0.5 | Dissimilar vectors |
| -1.0 | Opposite direction (perfectly dissimilar) |

Conclusion

Cosine similarity is a fundamental metric for comparing vectors based on direction rather than magnitude.

  • For quick calculations on individual vectors, NumPy's dot() and norm() functions are simple and efficient.
  • For batch operations and pairwise comparisons, scikit-learn's cosine_similarity() is the most convenient choice.
  • Remember that SciPy's cosine() returns distance, not similarity; subtract it from 1 to convert.

Whether you're building a search engine, recommendation system, or NLP pipeline, cosine similarity is an essential tool in your Python toolkit.