String Distance and Similarity Metrics

This module provides a collection of string distance and similarity metrics, each implemented to strictly adhere to their formal definitions. These functions are robust, handle edge cases (e.g., empty strings, identical strings), and are thoroughly tested for correctness. They are useful for tasks such as spell-checking, fuzzy matching, sequence alignment, and text similarity analysis.

The metrics include edit-based distances (e.g., Levenshtein, Damerau-Levenshtein), similarity measures (e.g., Jaro, Jaro-Winkler, Cosine), and alignment-based methods (e.g., Smith-Waterman). Each function is documented with its purpose, parameters, return values, and usage examples.

This project provides a collection of string distance and similarity metrics, both normalized and non-normalized, implemented in Python with thorough testing and robust edge-case handling. These metrics are useful for tasks like spell-checking, fuzzy matching, text similarity, and sequence alignment.

The following table summarizes the implemented metrics, including both normalized and non-normalized versions.

Usage Illustration

Below are examples demonstrating how to use each metric. These examples use the strings "hello" and "helo" (or equal-length strings for Hamming distance) to illustrate typical outputs.

from string_metrics import (
    levenshtein_distance, normalized_levenshtein_distance,
    hamming_distance, normalized_hamming_distance,
    jaro_similarity, jaro_winkler_distance,
    cosine_similarity, cosine_bigram_similarity,
    lcs_distance, normalized_lcs_distance,
    dice_coefficient,
    smith_waterman_distance, normalized_smith_waterman_distance,
    damerau_levenshtein_distance, normalized_damerau_levenshtein_distance
)

# Example strings
s1 = "hello"
s2 = "helo"
s3 = "cat"
s4 = "hat"  # For Hamming distance (equal length)

# Levenshtein Distance
print(f"Levenshtein Distance: {levenshtein_distance(s1, s2)}")  # Output: 1
print(f"Normalized Levenshtein Distance: {normalized_levenshtein_distance(s1, s2):.4f}")  # Output: 0.2000

# Hamming Distance (requires equal-length strings)
try:
    print(hamming_distance(s1, s2))
except ValueError as e:
    print(f"Hamming Distance Error: {e}")
print(f"Hamming Distance: {hamming_distance(s3, s4)}")  # Output: 1
print(f"Normalized Hamming Distance: {normalized_hamming_distance(s3, s4):.4f}")  # Output: 0.3333

# Jaro Similarity
print(f"Jaro Similarity: {jaro_similarity(s1, s2):.4f}")  # Output: 0.9333
print(f"Jaro-Winkler Distance: {jaro_winkler_distance(s1, s2):.4f}")  # Output: 0.9533

# Cosine Similarity
print(f"Cosine Similarity: {cosine_similarity(s1, s2):.4f}")  # Output: 1.0000
print(f"Cosine Bigram Similarity: {cosine_bigram_similarity(s1, s2):.4f}")  # Output: 0.8660

# LCS Distance
print(f"LCS Distance: {lcs_distance(s1, s2)}")  # Output: 2
print(f"Normalized LCS Distance: {normalized_lcs_distance(s1, s2):.4f}")  # Output: 0.4000

# Dice’s Coefficient
print(f"Dice’s Coefficient: {dice_coefficient(s1, s2):.4f}")  # Output: 0.8571

# Smith-Waterman Distance
print(f"Smith-Waterman Distance: {smith_waterman_distance(s1, s2)}")  # Output: -8
print(f"Normalized Smith-Waterman Distance: {normalized_smith_waterman_distance(s1, s2):.4f}")  # Output: 0.2000

# Damerau-Levenshtein Distance
print(f"Damerau-Levenshtein Distance: {damerau_levenshtein_distance(s1, s2)}")  # Output: 1
print(f"Normalized Damerau-Levenshtein Distance: {normalized_damerau_levenshtein_distance(s1, s2):.4f}")  # Output: 0.2000

API Reference

Below are the detailed descriptions and usage examples for each function in the string_metrics module.

Levenshtein Distance

levenshtein_distance(s1: str, s2: str) int

Calculate the Levenshtein distance between two strings.

Parameters:
  • s1 (str) – The first input string.

  • s2 (str) – The second input string.

Returns:

The minimum number of edit operations (insertions, deletions, or substitutions) required.

Return type:

int

Example:

from string_metrics import levenshtein_distance

print(levenshtein_distance("kitten", "sitting"))  # Output: 3
print(levenshtein_distance("cat", "act"))         # Output: 2
print(levenshtein_distance("cat", ""))            # Output: 3

Normalized Levenshtein Distance

normalized_levenshtein_distance(s1: str, s2: str) float

Calculate the normalized Levenshtein distance between two strings.

Parameters:
  • s1 (str) – The first input string.

  • s2 (str) – The second input string.

Returns:

The Levenshtein distance normalized by the maximum string length, ranging from 0 (identical) to 1 (completely different).

Return type:

float

Example:

from string_metrics import normalized_levenshtein_distance

print(normalized_levenshtein_distance("kitten", "sitting"))  # Output: 0.4286
print(normalized_levenshtein_distance("cat", "act"))         # Output: 0.6667
print(normalized_levenshtein_distance("cat", ""))            # Output: 1.0

Hamming Distance

hamming_distance(s1: str, s2: str) int

Calculate the Hamming distance between two strings of equal length.

Parameters:
  • s1 (str) – The first input string.

  • s2 (str) – The second input string.

Returns:

The number of positions where the strings differ.

Return type:

int

Raises:

ValueError – If the strings have different lengths.

Example:

from string_metrics import hamming_distance

print(hamming_distance("karolin", "kathrin"))  # Output: 3
print(hamming_distance("10110", "11110"))      # Output: 1
# print(hamming_distance("cat", "cats"))       # Raises ValueError

Normalized Hamming Distance

normalized_hamming_distance(s1: str, s2: str) float

Calculate the normalized Hamming distance between two strings of equal length.

Parameters:
  • s1 (str) – The first input string.

  • s2 (str) – The second input string.

Returns:

The Hamming distance normalized by the string length, ranging from 0 (identical) to 1 (completely different).

Return type:

float

Raises:

ValueError – If the strings have different lengths.

Example:

from string_metrics import normalized_hamming_distance

print(normalized_hamming_distance("karolin", "kathrin"))  # Output: 0.4286
print(normalized_hamming_distance("10110", "11110"))      # Output: 0.2
# print(normalized_hamming_distance("cat", "cats"))       # Raises ValueError

Jaro Similarity

jaro_similarity(s1: str, s2: str) float

Calculate the Jaro similarity between two strings.

Parameters:
  • s1 (str) – The first input string.

  • s2 (str) – The second input string.

Returns:

The Jaro similarity score, ranging from 0 (no similarity) to 1 (identical strings).

Return type:

float

Example:

from string_metrics import jaro_similarity

print(jaro_similarity("martha", "marhta"))  # Output: 0.9444
print(jaro_similarity("cat", ""))           # Output: 0.0
print(jaro_similarity("same", "same"))     # Output: 1.0

Jaro-Winkler Distance

jaro_winkler_distance(s1: str, s2: str, p: float = 0.15) float

Calculate the Jaro-Winkler similarity between two strings.

Parameters:
  • s1 (str) – The first input string.

  • s2 (str) – The second input string.

  • p (float, optional) – Prefix scaling factor (default is 0.15 to match textdistance).

Returns:

The Jaro-Winkler similarity score, ranging from 0 (no similarity) to 1 (identical strings).

Return type:

float

Example:

from string_metrics import jaro_winkler_distance

print(jaro_winkler_distance("martha", "marhta"))     # Output: 0.9667
print(jaro_winkler_distance("dixon", "dicksonx"))    # Output: 0.8133
print(jaro_winkler_distance("cat", ""))              # Output: 0.0

Cosine Similarity (Word-Based)

cosine_similarity(s1: str, s2: str) float

Calculate the cosine similarity between two strings using word vectors.

Parameters:
  • s1 (str) – The first input string.

  • s2 (str) – The second input string.

Returns:

The cosine similarity score, ranging from 0 (no similarity) to 1 (identical word sets).

Return type:

float

Example:

from string_metrics import cosine_similarity

print(cosine_similarity("cat hat", "hat cat"))  # Output: 1.0
print(cosine_similarity("cat", "dog"))          # Output: 0.0
print(cosine_similarity("", "dog"))             # Output: 0.0

Cosine Bigram Similarity

cosine_bigram_similarity(s1: str, s2: str) float

Calculate the cosine similarity between two strings using bigram vectors.

Parameters:
  • s1 (str) – The first input string.

  • s2 (str) – The second input string.

Returns:

The cosine similarity score, ranging from 0 (no similarity) to 1 (identical bigram sets).

Return type:

float

Example:

from string_metrics import cosine_bigram_similarity

print(cosine_bigram_similarity("cat", "cap"))   # Output: 0.5
print(cosine_bigram_similarity("cat", "act"))   # Output: 0.0
print(cosine_bigram_similarity("", "dog"))      # Output: 0.0

Longest Common Subsequence (LCS) Distance

lcs_distance(s1: str, s2: str) int

Calculate the LCS-based distance between two strings.

Parameters:
  • s1 (str) – The first input string.

  • s2 (str) – The second input string.

Returns:

The LCS distance, defined as len(s1) + len(s2) - 2 * len(LCS).

Return type:

int

Example:

from string_metrics import lcs_distance

print(lcs_distance("kitten", "sitting"))  # Output: 5
print(lcs_distance("cat", "act"))         # Output: 2
print(lcs_distance("cat", ""))            # Output: 3

Normalized LCS Distance

normalized_lcs_distance(s1: str, s2: str) float

Calculate the normalized LCS-based distance between two strings.

Parameters:
  • s1 (str) – The first input string.

  • s2 (str) – The second input string.

Returns:

The LCS distance normalized by the maximum string length, ranging from 0 (identical) to 1 (completely different).

Return type:

float

Example:

from string_metrics import normalized_lcs_distance

print(normalized_lcs_distance("kitten", "sitting"))  # Output: 0.7143
print(normalized_lcs_distance("cat", "act"))         # Output: 0.6667
print(normalized_lcs_distance("cat", ""))            # Output: 1.0

Dice’s Coefficient

dice_coefficient(s1: str, s2: str) float

Calculate Dice’s coefficient between two strings using bigrams.

Parameters:
  • s1 (str) – The first input string.

  • s2 (str) – The second input string.

Returns:

Dice’s coefficient, ranging from 0 (no shared bigrams) to 1 (identical bigram sets).

Return type:

float

Example:

from string_metrics import dice_coefficient

print(dice_coefficient("night", "nacht"))  # Output: 0.25
print(dice_coefficient("cat", "cat"))      # Output: 1.0
print(dice_coefficient("cat", ""))         # Output: 0.0

Smith-Waterman Distance

smith_waterman_distance(s1: str, s2: str, match_score: int = 2, mismatch_score: int = -1, gap_score: int = -1) int

Calculate the Smith-Waterman distance between two strings.

Parameters:
  • s1 (str) – The first input string.

  • s2 (str) – The second input string.

  • match_score (int, optional) – Score for matching characters (default is 2).

  • mismatch_score (int, optional) – Score for mismatching characters (default is -1).

  • gap_score (int, optional) – Score for gaps (default is -1).

Returns:

The inverse of the maximum local alignment score (lower scores indicate greater distance).

Return type:

int

Example:

from string_metrics import smith_waterman_distance

print(smith_waterman_distance("kitten", "sitting"))  # Output: -10
print(smith_waterman_distance("cat", "act"))         # Output: -5
print(smith_waterman_distance("cat", ""))            # Output: 0

Normalized Smith-Waterman Distance

normalized_smith_waterman_distance(s1: str, s2: str, match_score: int = 2, mismatch_score: int = -1, gap_score: int = -1) float

Calculate the normalized Smith-Waterman distance between two strings.

Parameters:
  • s1 (str) – The first input string.

  • s2 (str) – The second input string.

  • match_score (int, optional) – Score for matching characters (default is 2).

  • mismatch_score (int, optional) – Score for mismatching characters (default is -1).

  • gap_score (int, optional) – Score for gaps (default is -1).

Returns:

The normalized Smith-Waterman distance, ranging from 0 (identical) to 1 (completely different).

Return type:

float

Example:

from string_metrics import normalized_smith_waterman_distance

print(normalized_smith_waterman_distance("kitten", "sitting"))  # Output: 0.1667
print(normalized_smith_waterman_distance("cat", "act"))         # Output: 0.3333
print(normalized_smith_waterman_distance("cat", ""))            # Output: 0.0

Damerau-Levenshtein Distance

damerau_levenshtein_distance(s1: str, s2: str) int

Calculate the Damerau-Levenshtein distance between two strings.

Parameters:
  • s1 (str) – The first input string.

  • s2 (str) – The second input string.

Returns:

The minimum number of edit operations (insertions, deletions, substitutions, or transpositions) required.

Return type:

int

Example:

from string_metrics import damerau_levenshtein_distance

print(damerau_levenshtein_distance("cat", "act"))    # Output: 1
print(damerau_levenshtein_distance("cat", "hat"))    # Output: 1
print(damerau_levenshtein_distance("cat", "cats"))   # Output: 1

Normalized Damerau-Levenshtein Distance

normalized_damerau_levenshtein_distance(s1: str, s2: str) float

Calculate the normalized Damerau-Levenshtein distance between two strings.

Parameters:
  • s1 (str) – The first input string.

  • s2 (str) – The second input string.

Returns:

The Damerau-Levenshtein distance normalized by the maximum string length, ranging from 0 (identical) to 1 (completely different).

Return type:

float

Example:

from string_metrics import normalized_damerau_levenshtein_distance

print(normalized_damerau_levenshtein_distance("cat", "act"))    # Output: 0.3333
print(normalized_damerau_levenshtein_distance("cat", "hat"))    # Output: 0.3333
print(normalized_damerau_levenshtein_distance("cat", "cats"))   # Output: 0.25

Usage Notes

  • Input Requirements: All functions accept strings as input. For hamming_distance and normalized_hamming_distance, strings must be of equal length.

  • Output Interpretation: - Non-normalized distance metrics (Levenshtein, Hamming, LCS, Smith-Waterman, Damerau-Levenshtein): Return non-negative integers (except Smith-Waterman, which returns negative integers); higher values indicate greater difference. - Normalized distance metrics (Normalized Levenshtein, Normalized Hamming, Normalized LCS, Normalized Smith-Waterman, Normalized Damerau-Levenshtein): Return floats in [0,1]; 1 indicates completely different strings. - Similarity metrics (Jaro, Jaro-Winkler, Cosine, Cosine Bigram, Dice): Return floats in [0,1]; 1 indicates identical strings.

  • Edge Cases: All functions handle empty strings, identical strings, and single-character strings appropriately.

  • Dependencies: Requires Python’s standard library (collections.Counter and math).

  • Jaro-Winkler Note: The default prefix scaling factor is p = 0.15 to match the textdistance library, differing from the standard p = 0.1.

Example Application

Here’s an example of using multiple metrics to compare two strings:

from slearn.symbols import (
    levenshtein_distance,
    jaro_winkler_distance,
    cosine_similarity,
    damerau_levenshtein_distance
)

s1 = "kitten"
s2 = "sitting"

print(f"Levenshtein Distance: {levenshtein_distance(s1, s2)}")         # Output: 3
print(f"Jaro-Winkler Similarity: {jaro_winkler_distance(s1, s2):.3f}") # Output: ~0.746
print(f"Cosine Similarity: {cosine_similarity(s1, s2):.3f}")           # Output: 0.0
print(f"Damerau-Levenshtein Distance: {damerau_levenshtein_distance(s1, s2)}") # Output: 3