What is Cosine Similarity Evaluation?
by Stephen M. Walker II, CoFounder / CEO
Cosine similarity is a mathematical metric used to measure the similarity between two vectors in a multidimensional space. When evaluating Large Language Model (LLM) generations, cosine similarity is used to assess the semantic similarity between the generated text and a reference text.
The process involves providing the same prompts to different models to generate content on specified topics. The vector embeddings of the outputs are then compared against vector embeddings for reference texts. Mathematically, the cosine similarity score ranges from -1 to 1, although scores for text embeddings often fall between 0 and 1 in practice. A score closer to 1 indicates higher semantic similarity between the generated text and the reference text.
This method provides a numerical benchmark for the contextual relevance of different models' generated text, moving beyond subjective assessments of language models to more reproducible and quantifiable evaluations. It can be straightforwardly adapted to evaluate translation and question answering.
However, there are challenges to productizing this benchmarking method. Expanding the prompts and reference texts to cover diverse use cases is necessary, and testing must happen routinely as new models and versions emerge to keep pace with rapid innovation.
Cosine similarity is not affected by the magnitude of the vectors, only by the angle between them. A cosine similarity of 1 represents vectors pointing in the same direction (angle of 0 degrees), 0 indicates orthogonal vectors (90 degrees), and -1 denotes vectors pointing in opposite directions (180 degrees).
In addition to cosine similarity, other metrics like ROUGE, BLEU, METEOR, and Levenshtein distance are also used to evaluate the performance of LLMs. Those metrics largely rely on surface-level overlap of words or characters, whereas cosine similarity over embeddings can also credit a generation that paraphrases the reference.
How does Cosine Similarity Evaluation work?
The evaluation process involves several steps. First, the same prompts are provided to different models to generate content on specified topics. The generated text and the reference text are then converted into vector embeddings using an embedding model. The cosine similarity between these vector embeddings is calculated, which essentially measures the angle between the two vectors. A smaller angle results in a higher cosine similarity score, indicating that the generated text is semantically similar to the reference text.
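The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: toy_embed is a hypothetical stand-in that uses word counts so the example runs without any external model, and the model names and texts are made up. A real evaluation would replace toy_embed with calls to an actual embedding model.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch when a text is empty
    return dot / (norm_a * norm_b)


def rank_models(reference: str, generations: dict[str, str]) -> list[tuple[str, float]]:
    """Score each model's output against the reference, highest score first."""
    ref_vec = toy_embed(reference)
    scores = {
        name: cosine_similarity(toy_embed(text), ref_vec)
        for name, text in generations.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up outputs from two hypothetical models answering the same prompt.
reference = "the cat sat on the mat"
generations = {
    "model_a": "the cat sat on a mat",
    "model_b": "stocks fell sharply on monday",
}

ranking = rank_models(reference, generations)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Because the ranking depends entirely on the vectors, swapping the embedding function changes the scores but not the surrounding procedure.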
This method provides a numerical benchmark for the contextual relevance of different models' generated text, offering a more reproducible and quantifiable evaluation than subjective assessments. However, it's important to note that cosine similarity only considers the direction of the vectors, not their magnitude. This means it might not fully capture similarity in cases where the magnitude of the vectors carries meaning.
While cosine similarity can be adapted to evaluate translation and question answering, there are challenges to its widespread use. To make the evaluation more comprehensive, the prompts and reference texts need to cover diverse use cases. Additionally, as new models and versions emerge, testing must be conducted routinely to keep pace with rapid innovation.
How is cosine similarity calculated for LLM generations?
Cosine similarity for Large Language Model (LLM) generations is calculated by comparing the orientation of two high-dimensional vectors. These vectors represent the text generated by the LLM and a reference text. In a simple bag-of-words representation, each dimension corresponds to a feature of the text, such as the presence of certain words or phrases. In embeddings produced by a neural model, the dimensions are learned and usually do not map to individual words.
The calculation involves two main steps. First, the dot product of the two vectors is computed. This is done by multiplying the corresponding entries of the vectors and summing the results. Second, the magnitudes (or Euclidean norms) of the vectors are calculated. This is done by squaring each entry of a vector, summing the squares, and taking the square root of the result.
The cosine similarity is then calculated as the ratio of the dot product to the product of the magnitudes. This results in a score ranging from -1 to 1. A score of 1 indicates that the vectors are identical in orientation, signifying maximum similarity. A score of 0 indicates orthogonality, meaning there is no similarity. A score of -1 indicates that the vectors are diametrically opposed, signifying maximum dissimilarity.
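As a worked illustration of the calculation just described, the sketch below computes the dot product, the magnitudes, and their ratio using only the Python standard library. The function names and the example vectors are chosen for illustration; real embeddings would have hundreds or thousands of dimensions.

```python
import math


def dot_product(a, b):
    """Multiply corresponding entries and sum the results."""
    return sum(x * y for x, y in zip(a, b))


def magnitude(v):
    """Euclidean norm: square each entry, sum the squares, take the square root."""
    return math.sqrt(sum(x * x for x in v))


def cosine_similarity(a, b):
    """Ratio of the dot product to the product of the magnitudes.

    Raises ZeroDivisionError if either vector is all zeros, since the
    angle to a zero vector is undefined.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return dot_product(a, b) / (magnitude(a) * magnitude(b))


a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]
print(dot_product(a, b))        # 1*4 + 2*5 + 3*6 = 32.0
print(cosine_similarity(a, b))  # 32 / (sqrt(14) * sqrt(77)), about 0.9746

# The three landmark values of the score.
same_direction = cosine_similarity([1.0, 0.0], [3.0, 0.0])    # 1.0
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 2.0])        # 0.0
opposite = cosine_similarity([1.0, 0.0], [-2.0, 0.0])         # -1.0
print(same_direction, orthogonal, opposite)
```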
In Python, library functions such as sklearn.metrics.pairwise.cosine_similarity or scipy.spatial.distance.cosine can be used to calculate cosine similarity. Note that scipy.spatial.distance.cosine returns the cosine distance, which is 1 minus the cosine similarity. In the context of LLMs, this metric is used to quantitatively assess the semantic similarity of the generated text to a reference text, providing a measure of the model's performance in generating contextually relevant content.
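A short sketch of both library calls follows, assuming NumPy, SciPy, and scikit-learn are installed. The two small vectors are placeholders for real embeddings. The scikit-learn function expects 2-D arrays and returns a matrix of pairwise scores, while the SciPy function takes two 1-D vectors and returns a distance, so it is subtracted from 1 here to recover the similarity.

```python
import numpy as np
from scipy.spatial.distance import cosine as cosine_distance
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder vectors standing in for the generated and reference embeddings.
generated = np.array([1.0, 2.0, 3.0])
reference = np.array([4.0, 5.0, 6.0])

# scikit-learn: inputs are 2-D (one row per sample); output is a similarity matrix.
sk_score = cosine_similarity(
    generated.reshape(1, -1), reference.reshape(1, -1)
)[0, 0]

# SciPy: returns cosine distance, so similarity = 1 - distance.
scipy_score = 1.0 - cosine_distance(generated, reference)

print(f"scikit-learn: {sk_score:.4f}")
print(f"SciPy:        {scipy_score:.4f}")
```

The matrix form of the scikit-learn call is convenient when many generations are scored against many references at once, since each row of the first argument is compared with each row of the second.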
What are the downsides to Cosine Similarity Evaluation?
Cosine similarity evaluation, while useful in many contexts, does have several downsides:

Doesn't Account for Word Frequency: On its own, cosine similarity does not weight words by how rare or common they are, which can be a significant factor in text analysis. Unless the vectors are built with a weighting scheme such as TF-IDF, it may not accurately reflect the importance of rare or common terms.

Ignores Order and Context: The metric itself does not consider the order or context of words. When the vectors are simple word counts, it can miss syntactic or semantic variations; any sensitivity to word order has to come from the embedding model.

Linear Dependence Issue: Two vectors that are positive multiples of each other will have a cosine similarity of 1, even if they represent different magnitudes in the original space.

Inadequate for Different Rating Scales: In recommendation systems, cosine similarity can fail to account for differences in rating scales between different users.

Underestimation for High-Frequency Words: Research has found that cosine similarity can underestimate the similarity of high-frequency words, even after controlling for factors like polysemy.

Same Value for Different Vector Magnitudes: Cosine similarity yields the same value regardless of the magnitude of the vectors being compared, as long as the angle between them is the same. This can be problematic when magnitude carries information.

Limited Semantic Understanding: Cosine similarity compares vectors, not meanings, so it captures semantics only to the extent that the underlying embeddings do. This can limit its effectiveness in tasks that require a more nuanced understanding of similarity.
These limitations suggest that while cosine similarity can be a useful tool for evaluating the performance of large language models, it should be used in conjunction with other evaluation methods to provide a more comprehensive assessment.
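Two of these limitations are easy to reproduce. The sketch below assumes plain word-count vectors over a tiny, made-up vocabulary, which is a deliberately simple representation; contextual embeddings from a neural model may separate the two sentences. It shows that reordering words leaves the score unchanged and that scaling a vector does not change the score either.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def count_vector(text, vocabulary):
    """Word-count vector over a fixed vocabulary (illustrative helper)."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


vocabulary = ["dog", "bites", "man"]

# Order is ignored: opposite meanings, identical count vectors.
v1 = count_vector("dog bites man", vocabulary)
v2 = count_vector("man bites dog", vocabulary)
order_score = cosine_similarity(v1, v2)

# Magnitude is ignored: a vector and ten times that vector score as identical.
scale_score = cosine_similarity([1.0, 2.0], [10.0, 20.0])

print(v1 == v2)                          # True
print(math.isclose(order_score, 1.0))    # True
print(math.isclose(scale_score, 1.0))    # True
```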
What are some future directions for Cosine Similarity Evaluation research?
Future research could focus on improving the interpretability of cosine similarity evaluations, developing new feature selection methods, refining document embedding techniques, tailoring metrics for specific types of documents, applying cosine similarity to new fields, and creating improved similarity measurements.