Modify Levenshtein-Distance To Ignore Order

August 26, 2022 Post a Comment

I'm looking to compute the the Levenshtein-distance between sequences containing up to 6 values. The order of these values should not affect the distance. How would I implement thi

Solution 1:

You don't need the Levenshtein machinery for this.

import collections
def distance(s1, s2):
    cnt = collections.Counter()
    for c in s1:
        cnt[c] += 1
    for c in s2:
        cnt[c] -= 1
    return sum(abs(diff) for diff in cnt.values()) // 2 + \
        (abs(sum(cnt.values())) + 1) // 2   # can be omitted if len(s1) == len(s2)

Solution 2:

Why not just count how many letters are in common, and find and answer from this? For each character calculate its frequency, then for each string calculate how many "extra" characters it has based on frequencies, and take maximum of these "extra".

Pseudocode:

for c in s1:
    cnt1[c]++
for c in s2:
    cnt2[c]++
extra1 = 0
extra2 = 0
for c in all_chars:
    if cnt1[c]>cnt2[c]
        extra1 += cnt1[c]-cnt2[c]
    else
        extra2 += cnt2[c]-cnt1[c]
return max(extra1, extra2)

Solution 3:

this might be late but I think it can help someone and I also still looking an improvement. The challenge I had was:

match_function('kigali rwanda','rwanda kigali') probable match percentage should be 100%

match_function('kigali','ligaki') probable match percentage should be +50% ... I wrote a funny function in T-SQL using cross join and Levenstein and it helped at some point but I still need an improvement:

 Create FUNCTION [dbo].[GetPercentageMatch](@left  VARCHAR(100),@right VARCHAR(100))
 RETURNS DECIMAL
 AS
 BEGIN
 DECLARE @returnvalue DECIMAL(5, 2);
 DECLARE @list1 TABLE(value VARCHAR(50));
 declare @count1 int, @count2 int, @matchPerc int;
 INSERT INTO @list1 (value) select value from STRING_SPLIT(@left, ' ');

 DECLARE @list2 TABLE(value VARCHAR(50));
 INSERT INTO @list2 (value) select * from STRING_SPLIT(@right, ' ');

 select @count1 = count(*) from @list1
 select @count2 = count(*) from @list2

 select @matchPerc = (r3.percSum/case when @count1 > @count2 then @count1 else @count2 end) from (
 select count(r2.l1) rCount, sum(r2.perc) percSum from(
 select r.t1, r.t2, r.distance, (100-((r.distance*100)/(case when len(r.t1) > len(r.t2) then len(r.t1) else len(r.t2) end))) perc, len(r.t1) l1,len(r.t2)l2 from
 (select 
 isnull(t1.value,'') t1, 
 isnull(t2.value,'') t2, 
 [dbo].[LEVENSHTEIN](isnull(t1.value,''),isnull(t2.value,'')) distance
 from @list1 t1 cross join @list2 t2 ) as r
 ) r2
 ) r3

 return case when @matchPerc > 100 then 100 else @matchPerc end
 END;

Python Guru

Modify Levenshtein-Distance To Ignore Order

Solution 1:

Solution 2:

Solution 3:

Post a Comment for "Modify Levenshtein-Distance To Ignore Order"