String Distance and Similarity Metrics
This module provides a collection of string distance and similarity metrics, each implemented to strictly adhere to their formal definitions. These functions are robust, handle edge cases (e.g., empty strings, identical strings), and are thoroughly tested for correctness. They are useful for tasks such as spell-checking, fuzzy matching, sequence alignment, and text similarity analysis.
The metrics include edit-based distances (e.g., Levenshtein, Damerau-Levenshtein), similarity measures (e.g., Jaro, Jaro-Winkler, Cosine), and alignment-based methods (e.g., Smith-Waterman). Each function is documented with its purpose, parameters, return values, and usage examples.
This project provides a collection of string distance and similarity metrics, both normalized and non-normalized, implemented in Python with thorough testing and robust edge-case handling. These metrics are useful for tasks like spell-checking, fuzzy matching, text similarity, and sequence alignment.
The following table summarizes the implemented metrics, including both normalized and non-normalized versions.
Usage Illustration
Below are examples demonstrating how to use each metric. These examples use the strings "hello" and "helo" (or equal-length strings for Hamming distance) to illustrate typical outputs.
from string_metrics import (
levenshtein_distance, normalized_levenshtein_distance,
hamming_distance, normalized_hamming_distance,
jaro_similarity, jaro_winkler_distance,
cosine_similarity, cosine_bigram_similarity,
lcs_distance, normalized_lcs_distance,
dice_coefficient,
smith_waterman_distance, normalized_smith_waterman_distance,
damerau_levenshtein_distance, normalized_damerau_levenshtein_distance
)
# Example strings
s1 = "hello"
s2 = "helo"
s3 = "cat"
s4 = "hat" # For Hamming distance (equal length)
# Levenshtein Distance
print(f"Levenshtein Distance: {levenshtein_distance(s1, s2)}") # Output: 1
print(f"Normalized Levenshtein Distance: {normalized_levenshtein_distance(s1, s2):.4f}") # Output: 0.2000
# Hamming Distance (requires equal-length strings)
try:
print(hamming_distance(s1, s2))
except ValueError as e:
print(f"Hamming Distance Error: {e}")
print(f"Hamming Distance: {hamming_distance(s3, s4)}") # Output: 1
print(f"Normalized Hamming Distance: {normalized_hamming_distance(s3, s4):.4f}") # Output: 0.3333
# Jaro Similarity
print(f"Jaro Similarity: {jaro_similarity(s1, s2):.4f}") # Output: 0.9333
print(f"Jaro-Winkler Distance: {jaro_winkler_distance(s1, s2):.4f}") # Output: 0.9533
# Cosine Similarity
print(f"Cosine Similarity: {cosine_similarity(s1, s2):.4f}") # Output: 1.0000
print(f"Cosine Bigram Similarity: {cosine_bigram_similarity(s1, s2):.4f}") # Output: 0.8660
# LCS Distance
print(f"LCS Distance: {lcs_distance(s1, s2)}") # Output: 2
print(f"Normalized LCS Distance: {normalized_lcs_distance(s1, s2):.4f}") # Output: 0.4000
# Dice’s Coefficient
print(f"Dice’s Coefficient: {dice_coefficient(s1, s2):.4f}") # Output: 0.8571
# Smith-Waterman Distance
print(f"Smith-Waterman Distance: {smith_waterman_distance(s1, s2)}") # Output: -8
print(f"Normalized Smith-Waterman Distance: {normalized_smith_waterman_distance(s1, s2):.4f}") # Output: 0.2000
# Damerau-Levenshtein Distance
print(f"Damerau-Levenshtein Distance: {damerau_levenshtein_distance(s1, s2)}") # Output: 1
print(f"Normalized Damerau-Levenshtein Distance: {normalized_damerau_levenshtein_distance(s1, s2):.4f}") # Output: 0.2000
API Reference
Below are the detailed descriptions and usage examples for each function in the string_metrics module.
Levenshtein Distance
- levenshtein_distance(s1: str, s2: str) int
Calculate the Levenshtein distance between two strings.
- Parameters:
s1 (str) – The first input string.
s2 (str) – The second input string.
- Returns:
The minimum number of edit operations (insertions, deletions, or substitutions) required.
- Return type:
int
Example:
from string_metrics import levenshtein_distance print(levenshtein_distance("kitten", "sitting")) # Output: 3 print(levenshtein_distance("cat", "act")) # Output: 2 print(levenshtein_distance("cat", "")) # Output: 3
Normalized Levenshtein Distance
- normalized_levenshtein_distance(s1: str, s2: str) float
Calculate the normalized Levenshtein distance between two strings.
- Parameters:
s1 (str) – The first input string.
s2 (str) – The second input string.
- Returns:
The Levenshtein distance normalized by the maximum string length, ranging from 0 (identical) to 1 (completely different).
- Return type:
float
Example:
from string_metrics import normalized_levenshtein_distance print(normalized_levenshtein_distance("kitten", "sitting")) # Output: 0.4286 print(normalized_levenshtein_distance("cat", "act")) # Output: 0.6667 print(normalized_levenshtein_distance("cat", "")) # Output: 1.0
Hamming Distance
- hamming_distance(s1: str, s2: str) int
Calculate the Hamming distance between two strings of equal length.
- Parameters:
s1 (str) – The first input string.
s2 (str) – The second input string.
- Returns:
The number of positions where the strings differ.
- Return type:
int
- Raises:
ValueError – If the strings have different lengths.
Example:
from string_metrics import hamming_distance print(hamming_distance("karolin", "kathrin")) # Output: 3 print(hamming_distance("10110", "11110")) # Output: 1 # print(hamming_distance("cat", "cats")) # Raises ValueError
Normalized Hamming Distance
- normalized_hamming_distance(s1: str, s2: str) float
Calculate the normalized Hamming distance between two strings of equal length.
- Parameters:
s1 (str) – The first input string.
s2 (str) – The second input string.
- Returns:
The Hamming distance normalized by the string length, ranging from 0 (identical) to 1 (completely different).
- Return type:
float
- Raises:
ValueError – If the strings have different lengths.
Example:
from string_metrics import normalized_hamming_distance print(normalized_hamming_distance("karolin", "kathrin")) # Output: 0.4286 print(normalized_hamming_distance("10110", "11110")) # Output: 0.2 # print(normalized_hamming_distance("cat", "cats")) # Raises ValueError
Jaro Similarity
- jaro_similarity(s1: str, s2: str) float
Calculate the Jaro similarity between two strings.
- Parameters:
s1 (str) – The first input string.
s2 (str) – The second input string.
- Returns:
The Jaro similarity score, ranging from 0 (no similarity) to 1 (identical strings).
- Return type:
float
Example:
from string_metrics import jaro_similarity print(jaro_similarity("martha", "marhta")) # Output: 0.9444 print(jaro_similarity("cat", "")) # Output: 0.0 print(jaro_similarity("same", "same")) # Output: 1.0
Jaro-Winkler Distance
- jaro_winkler_distance(s1: str, s2: str, p: float = 0.15) float
Calculate the Jaro-Winkler similarity between two strings.
- Parameters:
s1 (str) – The first input string.
s2 (str) – The second input string.
p (float, optional) – Prefix scaling factor (default is 0.15 to match textdistance).
- Returns:
The Jaro-Winkler similarity score, ranging from 0 (no similarity) to 1 (identical strings).
- Return type:
float
Example:
from string_metrics import jaro_winkler_distance print(jaro_winkler_distance("martha", "marhta")) # Output: 0.9667 print(jaro_winkler_distance("dixon", "dicksonx")) # Output: 0.8133 print(jaro_winkler_distance("cat", "")) # Output: 0.0
Cosine Similarity (Word-Based)
- cosine_similarity(s1: str, s2: str) float
Calculate the cosine similarity between two strings using word vectors.
- Parameters:
s1 (str) – The first input string.
s2 (str) – The second input string.
- Returns:
The cosine similarity score, ranging from 0 (no similarity) to 1 (identical word sets).
- Return type:
float
Example:
from string_metrics import cosine_similarity print(cosine_similarity("cat hat", "hat cat")) # Output: 1.0 print(cosine_similarity("cat", "dog")) # Output: 0.0 print(cosine_similarity("", "dog")) # Output: 0.0
Cosine Bigram Similarity
- cosine_bigram_similarity(s1: str, s2: str) float
Calculate the cosine similarity between two strings using bigram vectors.
- Parameters:
s1 (str) – The first input string.
s2 (str) – The second input string.
- Returns:
The cosine similarity score, ranging from 0 (no similarity) to 1 (identical bigram sets).
- Return type:
float
Example:
from string_metrics import cosine_bigram_similarity print(cosine_bigram_similarity("cat", "cap")) # Output: 0.5 print(cosine_bigram_similarity("cat", "act")) # Output: 0.0 print(cosine_bigram_similarity("", "dog")) # Output: 0.0
Longest Common Subsequence (LCS) Distance
- lcs_distance(s1: str, s2: str) int
Calculate the LCS-based distance between two strings.
- Parameters:
s1 (str) – The first input string.
s2 (str) – The second input string.
- Returns:
The LCS distance, defined as len(s1) + len(s2) - 2 * len(LCS).
- Return type:
int
Example:
from string_metrics import lcs_distance print(lcs_distance("kitten", "sitting")) # Output: 5 print(lcs_distance("cat", "act")) # Output: 2 print(lcs_distance("cat", "")) # Output: 3
Normalized LCS Distance
- normalized_lcs_distance(s1: str, s2: str) float
Calculate the normalized LCS-based distance between two strings.
- Parameters:
s1 (str) – The first input string.
s2 (str) – The second input string.
- Returns:
The LCS distance normalized by the maximum string length, ranging from 0 (identical) to 1 (completely different).
- Return type:
float
Example:
from string_metrics import normalized_lcs_distance print(normalized_lcs_distance("kitten", "sitting")) # Output: 0.7143 print(normalized_lcs_distance("cat", "act")) # Output: 0.6667 print(normalized_lcs_distance("cat", "")) # Output: 1.0
Dice’s Coefficient
- dice_coefficient(s1: str, s2: str) float
Calculate Dice’s coefficient between two strings using bigrams.
- Parameters:
s1 (str) – The first input string.
s2 (str) – The second input string.
- Returns:
Dice’s coefficient, ranging from 0 (no shared bigrams) to 1 (identical bigram sets).
- Return type:
float
Example:
from string_metrics import dice_coefficient print(dice_coefficient("night", "nacht")) # Output: 0.25 print(dice_coefficient("cat", "cat")) # Output: 1.0 print(dice_coefficient("cat", "")) # Output: 0.0
Smith-Waterman Distance
- smith_waterman_distance(s1: str, s2: str, match_score: int = 2, mismatch_score: int = -1, gap_score: int = -1) int
Calculate the Smith-Waterman distance between two strings.
- Parameters:
s1 (str) – The first input string.
s2 (str) – The second input string.
match_score (int, optional) – Score for matching characters (default is 2).
mismatch_score (int, optional) – Score for mismatching characters (default is -1).
gap_score (int, optional) – Score for gaps (default is -1).
- Returns:
The inverse of the maximum local alignment score (lower scores indicate greater distance).
- Return type:
int
Example:
from string_metrics import smith_waterman_distance print(smith_waterman_distance("kitten", "sitting")) # Output: -10 print(smith_waterman_distance("cat", "act")) # Output: -5 print(smith_waterman_distance("cat", "")) # Output: 0
Normalized Smith-Waterman Distance
- normalized_smith_waterman_distance(s1: str, s2: str, match_score: int = 2, mismatch_score: int = -1, gap_score: int = -1) float
Calculate the normalized Smith-Waterman distance between two strings.
- Parameters:
s1 (str) – The first input string.
s2 (str) – The second input string.
match_score (int, optional) – Score for matching characters (default is 2).
mismatch_score (int, optional) – Score for mismatching characters (default is -1).
gap_score (int, optional) – Score for gaps (default is -1).
- Returns:
The normalized Smith-Waterman distance, ranging from 0 (identical) to 1 (completely different).
- Return type:
float
Example:
from string_metrics import normalized_smith_waterman_distance print(normalized_smith_waterman_distance("kitten", "sitting")) # Output: 0.1667 print(normalized_smith_waterman_distance("cat", "act")) # Output: 0.3333 print(normalized_smith_waterman_distance("cat", "")) # Output: 0.0
Damerau-Levenshtein Distance
- damerau_levenshtein_distance(s1: str, s2: str) int
Calculate the Damerau-Levenshtein distance between two strings.
- Parameters:
s1 (str) – The first input string.
s2 (str) – The second input string.
- Returns:
The minimum number of edit operations (insertions, deletions, substitutions, or transpositions) required.
- Return type:
int
Example:
from string_metrics import damerau_levenshtein_distance print(damerau_levenshtein_distance("cat", "act")) # Output: 1 print(damerau_levenshtein_distance("cat", "hat")) # Output: 1 print(damerau_levenshtein_distance("cat", "cats")) # Output: 1
Normalized Damerau-Levenshtein Distance
- normalized_damerau_levenshtein_distance(s1: str, s2: str) float
Calculate the normalized Damerau-Levenshtein distance between two strings.
- Parameters:
s1 (str) – The first input string.
s2 (str) – The second input string.
- Returns:
The Damerau-Levenshtein distance normalized by the maximum string length, ranging from 0 (identical) to 1 (completely different).
- Return type:
float
Example:
from string_metrics import normalized_damerau_levenshtein_distance print(normalized_damerau_levenshtein_distance("cat", "act")) # Output: 0.3333 print(normalized_damerau_levenshtein_distance("cat", "hat")) # Output: 0.3333 print(normalized_damerau_levenshtein_distance("cat", "cats")) # Output: 0.25
Usage Notes
Input Requirements: All functions accept strings as input. For
hamming_distanceandnormalized_hamming_distance, strings must be of equal length.Output Interpretation: - Non-normalized distance metrics (Levenshtein, Hamming, LCS, Smith-Waterman, Damerau-Levenshtein): Return non-negative integers (except Smith-Waterman, which returns negative integers); higher values indicate greater difference. - Normalized distance metrics (Normalized Levenshtein, Normalized Hamming, Normalized LCS, Normalized Smith-Waterman, Normalized Damerau-Levenshtein): Return floats in [0,1]; 1 indicates completely different strings. - Similarity metrics (Jaro, Jaro-Winkler, Cosine, Cosine Bigram, Dice): Return floats in [0,1]; 1 indicates identical strings.
Edge Cases: All functions handle empty strings, identical strings, and single-character strings appropriately.
Dependencies: Requires Python’s standard library (
collections.Counterandmath).Jaro-Winkler Note: The default prefix scaling factor is
p = 0.15to match thetextdistancelibrary, differing from the standardp = 0.1.
Example Application
Here’s an example of using multiple metrics to compare two strings:
from slearn.symbols import (
levenshtein_distance,
jaro_winkler_distance,
cosine_similarity,
damerau_levenshtein_distance
)
s1 = "kitten"
s2 = "sitting"
print(f"Levenshtein Distance: {levenshtein_distance(s1, s2)}") # Output: 3
print(f"Jaro-Winkler Similarity: {jaro_winkler_distance(s1, s2):.3f}") # Output: ~0.746
print(f"Cosine Similarity: {cosine_similarity(s1, s2):.3f}") # Output: 0.0
print(f"Damerau-Levenshtein Distance: {damerau_levenshtein_distance(s1, s2)}") # Output: 3