Skip to content Skip to sidebar Skip to footer

Modify Levenshtein-Distance To Ignore Order

I'm looking to compute the the Levenshtein-distance between sequences containing up to 6 values. The order of these values should not affect the distance. How would I implement thi

Solution 1:

You don't need the Levenshtein machinery for this.

import collections
def distance(s1, s2):
    cnt = collections.Counter()
    for c in s1:
        cnt[c] += 1
    for c in s2:
        cnt[c] -= 1
    return sum(abs(diff) for diff in cnt.values()) // 2 + \
        (abs(sum(cnt.values())) + 1) // 2   # can be omitted if len(s1) == len(s2)

Solution 2:

Why not just count how many letters are in common, and find and answer from this? For each character calculate its frequency, then for each string calculate how many "extra" characters it has based on frequencies, and take maximum of these "extra".

Pseudocode:

for c in s1:
    cnt1[c]++
for c in s2:
    cnt2[c]++
extra1 = 0
extra2 = 0
for c in all_chars:
    if cnt1[c]>cnt2[c]
        extra1 += cnt1[c]-cnt2[c]
    else
        extra2 += cnt2[c]-cnt1[c]
return max(extra1, extra2)

Solution 3:

this might be late but I think it can help someone and I also still looking an improvement. The challenge I had was:

  1. match_function('kigali rwanda','rwanda kigali') probable match percentage should be 100%

  2. match_function('kigali','ligaki') probable match percentage should be +50% ... I wrote a funny function in T-SQL using cross join and Levenstein and it helped at some point but I still need an improvement:

     Create FUNCTION [dbo].[GetPercentageMatch](@left  VARCHAR(100),@right VARCHAR(100))
     RETURNS DECIMAL
     AS
     BEGIN
     DECLARE @returnvalue DECIMAL(5, 2);
     DECLARE @list1 TABLE(value VARCHAR(50));
     declare @count1 int, @count2 int, @matchPerc int;
     INSERT INTO @list1 (value) select value from STRING_SPLIT(@left, ' ');
    
     DECLARE @list2 TABLE(value VARCHAR(50));
     INSERT INTO @list2 (value) select * from STRING_SPLIT(@right, ' ');
    
     select @count1 = count(*) from @list1
     select @count2 = count(*) from @list2
    
     select @matchPerc = (r3.percSum/case when @count1 > @count2 then @count1 else @count2 end) from (
     select count(r2.l1) rCount, sum(r2.perc) percSum from(
     select r.t1, r.t2, r.distance, (100-((r.distance*100)/(case when len(r.t1) > len(r.t2) then len(r.t1) else len(r.t2) end))) perc, len(r.t1) l1,len(r.t2)l2 from
     (select 
     isnull(t1.value,'') t1, 
     isnull(t2.value,'') t2, 
     [dbo].[LEVENSHTEIN](isnull(t1.value,''),isnull(t2.value,'')) distance
     from @list1 t1 cross join @list2 t2 ) as r
     ) r2
     ) r3
    
     return case when @matchPerc > 100 then 100 else @matchPerc end
     END;
    

Post a Comment for "Modify Levenshtein-Distance To Ignore Order"