.. _string-metrics: String Distance and Similarity Metrics ====================================== This module provides a collection of string distance and similarity metrics, each implemented to strictly adhere to their formal definitions. These functions are robust, handle edge cases (e.g., empty strings, identical strings), and are thoroughly tested for correctness. They are useful for tasks such as spell-checking, fuzzy matching, sequence alignment, and text similarity analysis. The metrics include edit-based distances (e.g., Levenshtein, Damerau-Levenshtein), similarity measures (e.g., Jaro, Jaro-Winkler, Cosine), and alignment-based methods (e.g., Smith-Waterman). Each function is documented with its purpose, parameters, return values, and usage examples. .. contents:: Table of Contents :local: :depth: 2 .. _metrics: This project provides a collection of string distance and similarity metrics, both normalized and non-normalized, implemented in Python with thorough testing and robust edge-case handling. These metrics are useful for tasks like spell-checking, fuzzy matching, text similarity, and sequence alignment. The following table summarizes the implemented metrics, including both normalized and non-normalized versions. +------------------------------------------+------------+------------+-----------------------------------------------------------------------------------+------------+------------------------------------------+ | Metric Name | Type | Normalized | Key Features | Complexity | Primary Use Cases | +==========================================+============+============+===================================================================================+============+==========================================+ | Levenshtein Distance | Distance | No | Counts insertions, deletions, substitutions; returns raw edit count | O(n*m) | Spell-checking, sequence alignment | +------------------------------------------+------------+------------+-----------------------------------------------------------------------------------+------------+------------------------------------------+ | Normalized Levenshtein Distance | Distance | Yes | Counts insertions, deletions, substitutions; normalized by max length to [0,1] | O(n*m) | Spell-checking, sequence alignment | +------------------------------------------+------------+------------+-----------------------------------------------------------------------------------+------------+------------------------------------------+ | Hamming Distance | Distance | No | Counts differing positions in equal-length strings; returns raw count | O(n) | Error detection, fixed-length sequences | +------------------------------------------+------------+------------+-----------------------------------------------------------------------------------+------------+------------------------------------------+ | Normalized Hamming Distance | Distance | Yes | Counts differing positions in equal-length strings; normalized by length to [0,1] | O(n) | Error detection, fixed-length sequences | +------------------------------------------+------------+------------+-----------------------------------------------------------------------------------+------------+------------------------------------------+ | Jaro Similarity | Similarity | Yes | Measures matching characters and transpositions, inherently normalized to [0,1] | O(n*m) | Record linkage, fuzzy matching | +------------------------------------------+------------+------------+-----------------------------------------------------------------------------------+------------+------------------------------------------+ | Jaro-Winkler Distance | Similarity | Yes | Enhances Jaro with prefix weighting, inherently normalized to [0,1] | O(n*m) | Name matching, deduplication | +------------------------------------------+------------+------------+-----------------------------------------------------------------------------------+------------+------------------------------------------+ | Cosine Similarity | Similarity | Yes | Compares word frequency vectors, inherently normalized to [0,1] | O(n) | Text similarity, document comparison | +------------------------------------------+------------+------------+-----------------------------------------------------------------------------------+------------+------------------------------------------+ | Cosine Bigram Similarity | Similarity | Yes | Compares bigram frequency vectors, inherently normalized to [0,1] | O(n) | Character-level text similarity | +------------------------------------------+------------+------------+-----------------------------------------------------------------------------------+------------+------------------------------------------+ | LCS Distance | Distance | No | Based on longest common subsequence; returns raw difference | O(n*m) | Sequence alignment, diff tools | +------------------------------------------+------------+------------+-----------------------------------------------------------------------------------+------------+------------------------------------------+ | Normalized LCS Distance | Distance | Yes | Based on longest common subsequence; normalized by max length to [0,1] | O(n*m) | Sequence alignment, diff tools | +------------------------------------------+------------+------------+-----------------------------------------------------------------------------------+------------+------------------------------------------+ | Dice’s Coefficient | Similarity | Yes | Measures shared bigram overlap, inherently normalized to [0,1] | O(n) | Short text similarity, fuzzy matching | +------------------------------------------+------------+------------+-----------------------------------------------------------------------------------+------------+------------------------------------------+ | Smith-Waterman Distance | Distance | No | Local alignment score (match=2, mismatch=-1, gap=-1); returns negative score | O(n*m) | Bioinformatics, partial sequence matching | +------------------------------------------+------------+------------+-----------------------------------------------------------------------------------+------------+------------------------------------------+ | Normalized Smith-Waterman Distance | Distance | Yes | Local alignment score (match=2, mismatch=-1, gap=-1); normalized to [0,1] | O(n*m) | Bioinformatics, partial sequence matching | +------------------------------------------+------------+------------+-----------------------------------------------------------------------------------+------------+------------------------------------------+ | Damerau-Levenshtein Distance | Distance | No | Extends Levenshtein with transpositions; returns raw edit count | O(n*m) | Typo correction, spell-checking | +------------------------------------------+------------+------------+-----------------------------------------------------------------------------------+------------+------------------------------------------+ | Normalized Damerau-Levenshtein Distance | Distance | Yes | Extends Levenshtein with transpositions; normalized by max length to [0,1] | O(n*m) | Typo correction, spell-checking | +------------------------------------------+------------+------------+-----------------------------------------------------------------------------------+------------+------------------------------------------+ Usage Illustration ------------------ Below are examples demonstrating how to use each metric. These examples use the strings ``"hello"`` and ``"helo"`` (or equal-length strings for Hamming distance) to illustrate typical outputs. .. code-block:: python from string_metrics import ( levenshtein_distance, normalized_levenshtein_distance, hamming_distance, normalized_hamming_distance, jaro_similarity, jaro_winkler_distance, cosine_similarity, cosine_bigram_similarity, lcs_distance, normalized_lcs_distance, dice_coefficient, smith_waterman_distance, normalized_smith_waterman_distance, damerau_levenshtein_distance, normalized_damerau_levenshtein_distance ) # Example strings s1 = "hello" s2 = "helo" s3 = "cat" s4 = "hat" # For Hamming distance (equal length) # Levenshtein Distance print(f"Levenshtein Distance: {levenshtein_distance(s1, s2)}") # Output: 1 print(f"Normalized Levenshtein Distance: {normalized_levenshtein_distance(s1, s2):.4f}") # Output: 0.2000 # Hamming Distance (requires equal-length strings) try: print(hamming_distance(s1, s2)) except ValueError as e: print(f"Hamming Distance Error: {e}") print(f"Hamming Distance: {hamming_distance(s3, s4)}") # Output: 1 print(f"Normalized Hamming Distance: {normalized_hamming_distance(s3, s4):.4f}") # Output: 0.3333 # Jaro Similarity print(f"Jaro Similarity: {jaro_similarity(s1, s2):.4f}") # Output: 0.9333 print(f"Jaro-Winkler Distance: {jaro_winkler_distance(s1, s2):.4f}") # Output: 0.9533 # Cosine Similarity print(f"Cosine Similarity: {cosine_similarity(s1, s2):.4f}") # Output: 1.0000 print(f"Cosine Bigram Similarity: {cosine_bigram_similarity(s1, s2):.4f}") # Output: 0.8660 # LCS Distance print(f"LCS Distance: {lcs_distance(s1, s2)}") # Output: 2 print(f"Normalized LCS Distance: {normalized_lcs_distance(s1, s2):.4f}") # Output: 0.4000 # Dice’s Coefficient print(f"Dice’s Coefficient: {dice_coefficient(s1, s2):.4f}") # Output: 0.8571 # Smith-Waterman Distance print(f"Smith-Waterman Distance: {smith_waterman_distance(s1, s2)}") # Output: -8 print(f"Normalized Smith-Waterman Distance: {normalized_smith_waterman_distance(s1, s2):.4f}") # Output: 0.2000 # Damerau-Levenshtein Distance print(f"Damerau-Levenshtein Distance: {damerau_levenshtein_distance(s1, s2)}") # Output: 1 print(f"Normalized Damerau-Levenshtein Distance: {normalized_damerau_levenshtein_distance(s1, s2):.4f}") # Output: 0.2000 API Reference ------------- Below are the detailed descriptions and usage examples for each function in the ``string_metrics`` module. Levenshtein Distance ~~~~~~~~~~~~~~~~~~~ .. py:function:: levenshtein_distance(s1: str, s2: str) -> int Calculate the Levenshtein distance between two strings. :param s1: The first input string. :type s1: str :param s2: The second input string. :type s2: str :return: The minimum number of edit operations (insertions, deletions, or substitutions) required. :rtype: int **Example**: .. code-block:: python from string_metrics import levenshtein_distance print(levenshtein_distance("kitten", "sitting")) # Output: 3 print(levenshtein_distance("cat", "act")) # Output: 2 print(levenshtein_distance("cat", "")) # Output: 3 Normalized Levenshtein Distance ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. py:function:: normalized_levenshtein_distance(s1: str, s2: str) -> float Calculate the normalized Levenshtein distance between two strings. :param s1: The first input string. :type s1: str :param s2: The second input string. :type s2: str :return: The Levenshtein distance normalized by the maximum string length, ranging from 0 (identical) to 1 (completely different). :rtype: float **Example**: .. code-block:: python from string_metrics import normalized_levenshtein_distance print(normalized_levenshtein_distance("kitten", "sitting")) # Output: 0.4286 print(normalized_levenshtein_distance("cat", "act")) # Output: 0.6667 print(normalized_levenshtein_distance("cat", "")) # Output: 1.0 Hamming Distance ~~~~~~~~~~~~~~~~ .. py:function:: hamming_distance(s1: str, s2: str) -> int Calculate the Hamming distance between two strings of equal length. :param s1: The first input string. :type s1: str :param s2: The second input string. :type s2: str :return: The number of positions where the strings differ. :rtype: int :raises ValueError: If the strings have different lengths. **Example**: .. code-block:: python from string_metrics import hamming_distance print(hamming_distance("karolin", "kathrin")) # Output: 3 print(hamming_distance("10110", "11110")) # Output: 1 # print(hamming_distance("cat", "cats")) # Raises ValueError Normalized Hamming Distance ~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. py:function:: normalized_hamming_distance(s1: str, s2: str) -> float Calculate the normalized Hamming distance between two strings of equal length. :param s1: The first input string. :type s1: str :param s2: The second input string. :type s2: str :return: The Hamming distance normalized by the string length, ranging from 0 (identical) to 1 (completely different). :rtype: float :raises ValueError: If the strings have different lengths. **Example**: .. code-block:: python from string_metrics import normalized_hamming_distance print(normalized_hamming_distance("karolin", "kathrin")) # Output: 0.4286 print(normalized_hamming_distance("10110", "11110")) # Output: 0.2 # print(normalized_hamming_distance("cat", "cats")) # Raises ValueError Jaro Similarity ~~~~~~~~~~~~~~~ .. py:function:: jaro_similarity(s1: str, s2: str) -> float Calculate the Jaro similarity between two strings. :param s1: The first input string. :type s1: str :param s2: The second input string. :type s2: str :return: The Jaro similarity score, ranging from 0 (no similarity) to 1 (identical strings). :rtype: float **Example**: .. code-block:: python from string_metrics import jaro_similarity print(jaro_similarity("martha", "marhta")) # Output: 0.9444 print(jaro_similarity("cat", "")) # Output: 0.0 print(jaro_similarity("same", "same")) # Output: 1.0 Jaro-Winkler Distance ~~~~~~~~~~~~~~~~~~~~~ .. py:function:: jaro_winkler_distance(s1: str, s2: str, p: float = 0.15) -> float Calculate the Jaro-Winkler similarity between two strings. :param s1: The first input string. :type s1: str :param s2: The second input string. :type s2: str :param p: Prefix scaling factor (default is 0.15 to match textdistance). :type p: float, optional :return: The Jaro-Winkler similarity score, ranging from 0 (no similarity) to 1 (identical strings). :rtype: float **Example**: .. code-block:: python from string_metrics import jaro_winkler_distance print(jaro_winkler_distance("martha", "marhta")) # Output: 0.9667 print(jaro_winkler_distance("dixon", "dicksonx")) # Output: 0.8133 print(jaro_winkler_distance("cat", "")) # Output: 0.0 Cosine Similarity (Word-Based) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. py:function:: cosine_similarity(s1: str, s2: str) -> float Calculate the cosine similarity between two strings using word vectors. :param s1: The first input string. :type s1: str :param s2: The second input string. :type s2: str :return: The cosine similarity score, ranging from 0 (no similarity) to 1 (identical word sets). :rtype: float **Example**: .. code-block:: python from string_metrics import cosine_similarity print(cosine_similarity("cat hat", "hat cat")) # Output: 1.0 print(cosine_similarity("cat", "dog")) # Output: 0.0 print(cosine_similarity("", "dog")) # Output: 0.0 Cosine Bigram Similarity ~~~~~~~~~~~~~~~~~~~~~~~~ .. py:function:: cosine_bigram_similarity(s1: str, s2: str) -> float Calculate the cosine similarity between two strings using bigram vectors. :param s1: The first input string. :type s1: str :param s2: The second input string. :type s2: str :return: The cosine similarity score, ranging from 0 (no similarity) to 1 (identical bigram sets). :rtype: float **Example**: .. code-block:: python from string_metrics import cosine_bigram_similarity print(cosine_bigram_similarity("cat", "cap")) # Output: 0.5 print(cosine_bigram_similarity("cat", "act")) # Output: 0.0 print(cosine_bigram_similarity("", "dog")) # Output: 0.0 Longest Common Subsequence (LCS) Distance ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. py:function:: lcs_distance(s1: str, s2: str) -> int Calculate the LCS-based distance between two strings. :param s1: The first input string. :type s1: str :param s2: The second input string. :type s2: str :return: The LCS distance, defined as len(s1) + len(s2) - 2 * len(LCS). :rtype: int **Example**: .. code-block:: python from string_metrics import lcs_distance print(lcs_distance("kitten", "sitting")) # Output: 5 print(lcs_distance("cat", "act")) # Output: 2 print(lcs_distance("cat", "")) # Output: 3 Normalized LCS Distance ~~~~~~~~~~~~~~~~~~~~~~~ .. py:function:: normalized_lcs_distance(s1: str, s2: str) -> float Calculate the normalized LCS-based distance between two strings. :param s1: The first input string. :type s1: str :param s2: The second input string. :type s2: str :return: The LCS distance normalized by the maximum string length, ranging from 0 (identical) to 1 (completely different). :rtype: float **Example**: .. code-block:: python from string_metrics import normalized_lcs_distance print(normalized_lcs_distance("kitten", "sitting")) # Output: 0.7143 print(normalized_lcs_distance("cat", "act")) # Output: 0.6667 print(normalized_lcs_distance("cat", "")) # Output: 1.0 Dice’s Coefficient ~~~~~~~~~~~~~~~~~~ .. py:function:: dice_coefficient(s1: str, s2: str) -> float Calculate Dice's coefficient between two strings using bigrams. :param s1: The first input string. :type s1: str :param s2: The second input string. :type s2: str :return: Dice's coefficient, ranging from 0 (no shared bigrams) to 1 (identical bigram sets). :rtype: float **Example**: .. code-block:: python from string_metrics import dice_coefficient print(dice_coefficient("night", "nacht")) # Output: 0.25 print(dice_coefficient("cat", "cat")) # Output: 1.0 print(dice_coefficient("cat", "")) # Output: 0.0 Smith-Waterman Distance ~~~~~~~~~~~~~~~~~~~~~~~ .. py:function:: smith_waterman_distance(s1: str, s2: str, match_score: int = 2, mismatch_score: int = -1, gap_score: int = -1) -> int Calculate the Smith-Waterman distance between two strings. :param s1: The first input string. :type s1: str :param s2: The second input string. :type s2: str :param match_score: Score for matching characters (default is 2). :type match_score: int, optional :param mismatch_score: Score for mismatching characters (default is -1). :type mismatch_score: int, optional :param gap_score: Score for gaps (default is -1). :type gap_score: int, optional :return: The inverse of the maximum local alignment score (lower scores indicate greater distance). :rtype: int **Example**: .. code-block:: python from string_metrics import smith_waterman_distance print(smith_waterman_distance("kitten", "sitting")) # Output: -10 print(smith_waterman_distance("cat", "act")) # Output: -5 print(smith_waterman_distance("cat", "")) # Output: 0 Normalized Smith-Waterman Distance ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. py:function:: normalized_smith_waterman_distance(s1: str, s2: str, match_score: int = 2, mismatch_score: int = -1, gap_score: int = -1) -> float Calculate the normalized Smith-Waterman distance between two strings. :param s1: The first input string. :type s1: str :param s2: The second input string. :type s2: str :param match_score: Score for matching characters (default is 2). :type match_score: int, optional :param mismatch_score: Score for mismatching characters (default is -1). :type mismatch_score: int, optional :param gap_score: Score for gaps (default is -1). :type gap_score: int, optional :return: The normalized Smith-Waterman distance, ranging from 0 (identical) to 1 (completely different). :rtype: float **Example**: .. code-block:: python from string_metrics import normalized_smith_waterman_distance print(normalized_smith_waterman_distance("kitten", "sitting")) # Output: 0.1667 print(normalized_smith_waterman_distance("cat", "act")) # Output: 0.3333 print(normalized_smith_waterman_distance("cat", "")) # Output: 0.0 Damerau-Levenshtein Distance ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. py:function:: damerau_levenshtein_distance(s1: str, s2: str) -> int Calculate the Damerau-Levenshtein distance between two strings. :param s1: The first input string. :type s1: str :param s2: The second input string. :type s2: str :return: The minimum number of edit operations (insertions, deletions, substitutions, or transpositions) required. :rtype: int **Example**: .. code-block:: python from string_metrics import damerau_levenshtein_distance print(damerau_levenshtein_distance("cat", "act")) # Output: 1 print(damerau_levenshtein_distance("cat", "hat")) # Output: 1 print(damerau_levenshtein_distance("cat", "cats")) # Output: 1 Normalized Damerau-Levenshtein Distance ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. py:function:: normalized_damerau_levenshtein_distance(s1: str, s2: str) -> float Calculate the normalized Damerau-Levenshtein distance between two strings. :param s1: The first input string. :type s1: str :param s2: The second input string. :type s2: str :return: The Damerau-Levenshtein distance normalized by the maximum string length, ranging from 0 (identical) to 1 (completely different). :rtype: float **Example**: .. code-block:: python from string_metrics import normalized_damerau_levenshtein_distance print(normalized_damerau_levenshtein_distance("cat", "act")) # Output: 0.3333 print(normalized_damerau_levenshtein_distance("cat", "hat")) # Output: 0.3333 print(normalized_damerau_levenshtein_distance("cat", "cats")) # Output: 0.25 Usage Notes ----------- - **Input Requirements**: All functions accept strings as input. For ``hamming_distance`` and ``normalized_hamming_distance``, strings must be of equal length. - **Output Interpretation**: - **Non-normalized distance metrics** (Levenshtein, Hamming, LCS, Smith-Waterman, Damerau-Levenshtein): Return non-negative integers (except Smith-Waterman, which returns negative integers); higher values indicate greater difference. - **Normalized distance metrics** (Normalized Levenshtein, Normalized Hamming, Normalized LCS, Normalized Smith-Waterman, Normalized Damerau-Levenshtein): Return floats in [0,1]; 1 indicates completely different strings. - **Similarity metrics** (Jaro, Jaro-Winkler, Cosine, Cosine Bigram, Dice): Return floats in [0,1]; 1 indicates identical strings. - **Edge Cases**: All functions handle empty strings, identical strings, and single-character strings appropriately. - **Dependencies**: Requires Python's standard library (``collections.Counter`` and ``math``). - **Jaro-Winkler Note**: The default prefix scaling factor is ``p = 0.15`` to match the ``textdistance`` library, differing from the standard ``p = 0.1``. Example Application ------------------- Here’s an example of using multiple metrics to compare two strings: .. code-block:: python from slearn.symbols import ( levenshtein_distance, jaro_winkler_distance, cosine_similarity, damerau_levenshtein_distance ) s1 = "kitten" s2 = "sitting" print(f"Levenshtein Distance: {levenshtein_distance(s1, s2)}") # Output: 3 print(f"Jaro-Winkler Similarity: {jaro_winkler_distance(s1, s2):.3f}") # Output: ~0.746 print(f"Cosine Similarity: {cosine_similarity(s1, s2):.3f}") # Output: 0.0 print(f"Damerau-Levenshtein Distance: {damerau_levenshtein_distance(s1, s2)}") # Output: 3