Why are Cosine Similarities of Text embeddings almost always positive?

Vaibhav Garg
5 min read · May 19, 2022


In early February, we were in the process of developing our application, BookMapp. As background, the principal value proposition of the app is to identify similarities across an uploaded set of books. In other words, we create an n-dimensional graph of connected nodes (where n is the number of books you uploaded). If we have, say, 5 books, we create a tensor of rank 5, where each pair of dimensions is, roughly speaking, a matrix of relatedness. Each element of the tensor is connected to every other node by a weight, or relatedness score.

We make heavy use of GPT-3 text embeddings and cosine similarity scores for a number of analyses internally.

Cosine similarity measures the similarity between two vectors of an inner product space. It is measured by the cosine of the angle between two vectors and determines whether two vectors are pointing in roughly the same direction. It is often used to measure document similarity in text analysis.
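Concretely, for two vectors a and b it is the dot product divided by the product of their magnitudes. A minimal numpy sketch (the helper name cosine_similarity is my own, purely for illustration):

import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))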

Going by the definition of cosine similarity, we expect the values to lie in the closed interval [-1, 1]. However, we found the values to be not only always positive but also larger than 0.45 (for standard English text). For a corpus of about 5000 documents, we have 25 million pairings, and the smallest score was 0.4522.

Since this was not the long pole in the app development, we parked the investigation.

The second wind

I was browsing through the OpenAI community forums, and someone was asking about the same issue (Embeddings and Cosine Similarity — General API discussion — OpenAI API Community Forum). This rekindled the investigation itch and got me wondering again.

I dusted off the embeddings dataset and did some exploratory analysis. The histogram of the saved embedding values is below:

Histogram of the 2.5e6 embedding values

However, the histogram of the pairwise cosine similarities has a strong skew towards positive values.

histogram of the pairwise cosine similarities

The million-dollar question is whether this is something special to text embeddings or an artifact of the cosine similarity calculation process.

To disambiguate, let us play with some synthetic data.

Experiments with synthetic data

Let us create a set of synthetic data.

import numpy as np
embs = np.random.normal(0, 1, (100, 768))
# normalize
embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)

The embs array has a shape of (100, 768), simulating 100 embeddings of 768 dimensions each. Each row has been normalized to unit length (its L2 norm), so the vectors live on the unit hypersphere.

The histogram of values is:

Histogram of the embs values

These are centered at 0, as expected.
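Since every row of embs has unit norm, pairwise cosine similarities reduce to a plain matrix product. A minimal sketch, reusing the embs array from above:

# rows are unit vectors, so dot products are already cosine similarities
cos_sims = embs @ embs.T
# keep the strictly upper triangle to drop self-similarities and duplicate pairs
pairwise = cos_sims[np.triu_indices_from(cos_sims, k=1)]
print(pairwise.mean())  # hovers around 0 for this synthetic data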

The histogram of the cosine similarities is as below.

histogram of the cosine similarities

Whoa, the cosine similarities are also centered at 0. What gives?

This establishes that the cosine similarity calculation itself does not cause the skew we observed above. There must be something going on in the text embeddings themselves!

Back to Text Embeddings

I traced the GPT model back to the seminal paper, Attention Is All You Need, and the corresponding implementations. The Sentence-BERT paper (Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, Reimers and Gurevych) describes using pooling layers on top of BERT-like pre-trained models to obtain a fixed-size encoding.

As background, GPT-3 embeddings come in four flavours, distinguished by the dimensionality of the embeddings:

  • Ada (1024 dimensions),
  • Babbage (2048 dimensions),
  • Curie (4096 dimensions),
  • Davinci (12288 dimensions).

Assuming that all of those come from the same underlying model, with some kind of pooling operation reducing the dimensions (I was using the 2048-dimensional embeddings above), I tried implementing a max-pool filter with a window size of 2 and a stride of 2.

embs_pooled = np.zeros((embs.shape[0], embs.shape[1] // 2))
for i in range(embs.shape[0]):
    for j in range(0, embs.shape[1], 2):
        # max pooling over non-overlapping windows of size 2
        embs_pooled[i, j // 2] = np.max(embs[i, j:j + 2])
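As an aside, the same pooling can be written as a vectorized reshape; this sketch assumes an even number of dimensions and is equivalent to the loop above:

# group the dimensions into non-overlapping windows of 2 and take the max of each
embs_pooled = embs.reshape(embs.shape[0], -1, 2).max(axis=2)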

As seen below, there is a slight skew towards positive values in the pooled data compared to the original, plotted on the same scale. However, the skew is not very large. This is understandable: wherever a window paired a negative value with a positive one, the max pool keeps the positive one, and so forth.

Pooled vs. original synthetic embeddings

However, as soon as pairwise cosine similarities are calculated, there is a marked rightward shift in the values.

Shift in the cosine similarities after max pooling

This is a remarkable change.
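The shift is easy to reproduce on the synthetic data; a sketch, assuming the embs and embs_pooled arrays defined earlier (the pooled vectors are re-normalized so dot products are again cosines):

# re-normalize the pooled vectors to unit length
embs_pooled_n = embs_pooled / np.linalg.norm(embs_pooled, axis=1, keepdims=True)

iu = np.triu_indices(embs.shape[0], k=1)
sims_orig = (embs @ embs.T)[iu]
sims_pooled = (embs_pooled_n @ embs_pooled_n.T)[iu]
print(sims_orig.mean(), sims_pooled.mean())  # the pooled mean sits well to the right of 0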

Is it possible for vectors to contain both positive and negative values and still produce only positive cosine similarity values? The answer is yes: it is possible if all the embedding vectors lie within a single nappe of a conical surface with its apex at the origin.

In other words, a dimension contributes positively to the cosine similarity when the corresponding components of the two vectors have the same sign, whether both positive or both negative. Since the max pool nudges many components in the positive direction, in a high-dimensional space this all but guarantees a net positive bias.
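The effect can be demonstrated with a toy example that has nothing to do with GPT-3: give zero-centered random vectors a small positive offset and their pairwise cosine similarities land firmly in positive territory.

rng = np.random.default_rng(0)
v = rng.normal(0, 1, (100, 768)) + 0.5  # nudge every component in the positive direction
v = v / np.linalg.norm(v, axis=1, keepdims=True)
sims = (v @ v.T)[np.triu_indices(100, k=1)]
print(sims.min(), sims.mean())  # every pair comes out positive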

Mystery solved!
