What is the Jaro-Winkler distance?

by Stephen M. Walker II, Co-Founder / CEO

The Jaro-Winkler distance is a string metric used in computer science and statistics to measure the edit distance, or the difference, between two sequences. It's an extension of the Jaro distance metric, proposed by William E. Winkler in 1990, and is often used in the context of record linkage, data deduplication, and string matching.

The Jaro-Winkler similarity, from which the distance is derived, is a value between 0 and 1, where 0 indicates no similarity and 1 represents identical strings. It considers the number of matching characters, the number of transpositions (matching characters that appear in a different order), and a scaling factor for common prefix matches. The scaling factor gives higher weight to strings that agree at the start, which makes the metric well suited to data where misspellings tend to occur later in the string.

The calculation involves three main steps: computing the Jaro similarity, calculating the common prefix length, and applying the Jaro-Winkler formula. The formula combines the Jaro similarity with the prefix scaling factor (usually 0.1, and no more than 0.25) and raises the similarity score accordingly.

The Jaro-Winkler similarity (sim_w) is defined as:

$$sim_w = sim_j + l \cdot p \cdot (1 - sim_j)$$

where:

  • sim_j: The Jaro similarity between two strings, s1 and s2
  • l: Length of the common prefix at the start of the two strings (up to a maximum of 4 characters)
  • p: Scaling factor for how much the score is adjusted upwards for having common prefixes. Typically p = 0.1; it should not exceed p = 0.25, because a larger value could push the similarity above 1.

The Jaro-Winkler distance is then defined as 1 - sim_w.
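
As a worked example, consider the strings MARTHA and MARHTA. This pair is chosen here for illustration and does not come from the text above. All six characters match and one pair (T and H) is transposed, so the Jaro similarity is 17/18. The strings share the three-character prefix "MAR", and p is assumed to be the typical 0.1:

$$sim_j = \frac{17}{18} \approx 0.944$$

$$sim_w = 0.944 + 3 \cdot 0.1 \cdot (1 - 0.944) \approx 0.961$$

The corresponding Jaro-Winkler distance is roughly 1 - 0.961 = 0.039.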

How is the Jaro-Winkler distance calculated?

The calculation compares the characters of two strings, first to obtain the Jaro similarity and then to adjust it for a shared prefix. It proceeds as follows:

  1. First, find the matching characters in both strings. Two characters are considered matching if they are the same and are no more than floor(max(length1, length2) / 2) - 1 positions away from each other. Each character can be matched at most once.
  2. Next, calculate the number of transpositions: count the matching characters that appear in a different order in one string compared to the other, and divide that count by 2.
  3. Then calculate the Jaro similarity as follows: Jaro = (matches / length1 + matches / length2 + (matches - transpositions) / matches) / 3. If there are no matches, the Jaro similarity is 0.
  4. The Winkler modification is then applied by adding a bonus for a shared prefix. The bonus is calculated as prefix_length * p * (1 - Jaro), where prefix_length is the number of leading characters the strings have in common (capped at 4) and p is the scaling factor (typically 0.1).
  5. The final Jaro-Winkler similarity is calculated as Jaro + Winkler_modification. It ranges from 0 to 1, with a value of 1 indicating that the two strings are identical and a value of 0 indicating that they have no matching characters. The Jaro-Winkler distance is 1 minus this similarity.
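
The steps above can be sketched in Python. This is a minimal illustration rather than a reference implementation: the function names (`jaro_similarity`, `jaro_winkler_similarity`, `jaro_winkler_distance`) are hypothetical, the matching window is assumed to be floor(max(length1, length2) / 2) - 1, and the defaults p = 0.1 and a 4-character prefix cap are assumptions based on the typical values.

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means the strings are identical."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Equal characters only count as a match within this many positions.
    window = max(max(len1, len2) // 2 - 1, 0)

    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            # Each character of s2 may be matched at most once.
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Compare matched characters in order; half of the mismatches
    # is the number of transpositions.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (matches / len1
            + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler_similarity(s1: str, s2: str, p: float = 0.1,
                            max_prefix: int = 4) -> float:
    """Jaro similarity boosted for a shared prefix of up to max_prefix chars."""
    sim_j = jaro_similarity(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return sim_j + prefix * p * (1 - sim_j)


def jaro_winkler_distance(s1: str, s2: str, p: float = 0.1) -> float:
    """Distance is the complement of the similarity."""
    return 1.0 - jaro_winkler_similarity(s1, s2, p)


if __name__ == "__main__":
    print(round(jaro_winkler_similarity("MARTHA", "MARHTA"), 3))
    print(round(jaro_winkler_distance("MARTHA", "MARHTA"), 3))
```

For the illustrative pair MARTHA and MARHTA, this sketch gives a similarity of about 0.961 and a distance of about 0.039. For production use, an established string-matching library is likely a better choice than hand-rolled code.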

