Modify Levenshtein-Distance To Ignore Order
I'm looking to compute the the Levenshtein-distance between sequences containing up to 6 values. The order of these values should not affect the distance. How would I implement thi
Solution 1:
You don't need the Levenshtein machinery for this.
import collections
def distance(s1, s2):
cnt = collections.Counter()
for c in s1:
cnt[c] += 1
for c in s2:
cnt[c] -= 1
return sum(abs(diff) for diff in cnt.values()) // 2 + \
(abs(sum(cnt.values())) + 1) // 2 # can be omitted if len(s1) == len(s2)
Solution 2:
Why not just count how many letters are in common, and find and answer from this? For each character calculate its frequency, then for each string calculate how many "extra" characters it has based on frequencies, and take maximum of these "extra".
Pseudocode:
for c in s1:
cnt1[c]++
for c in s2:
cnt2[c]++
extra1 = 0
extra2 = 0
for c in all_chars:
if cnt1[c]>cnt2[c]
extra1 += cnt1[c]-cnt2[c]
else
extra2 += cnt2[c]-cnt1[c]
return max(extra1, extra2)
Solution 3:
this might be late but I think it can help someone and I also still looking an improvement. The challenge I had was:
match_function('kigali rwanda','rwanda kigali') probable match percentage should be 100%
match_function('kigali','ligaki') probable match percentage should be +50% ... I wrote a funny function in T-SQL using cross join and Levenstein and it helped at some point but I still need an improvement:
Create FUNCTION [dbo].[GetPercentageMatch](@left VARCHAR(100),@right VARCHAR(100)) RETURNS DECIMAL AS BEGIN DECLARE @returnvalue DECIMAL(5, 2); DECLARE @list1 TABLE(value VARCHAR(50)); declare @count1 int, @count2 int, @matchPerc int; INSERT INTO @list1 (value) select value from STRING_SPLIT(@left, ' '); DECLARE @list2 TABLE(value VARCHAR(50)); INSERT INTO @list2 (value) select * from STRING_SPLIT(@right, ' '); select @count1 = count(*) from @list1 select @count2 = count(*) from @list2 select @matchPerc = (r3.percSum/case when @count1 > @count2 then @count1 else @count2 end) from ( select count(r2.l1) rCount, sum(r2.perc) percSum from( select r.t1, r.t2, r.distance, (100-((r.distance*100)/(case when len(r.t1) > len(r.t2) then len(r.t1) else len(r.t2) end))) perc, len(r.t1) l1,len(r.t2)l2 from (select isnull(t1.value,'') t1, isnull(t2.value,'') t2, [dbo].[LEVENSHTEIN](isnull(t1.value,''),isnull(t2.value,'')) distance from @list1 t1 cross join @list2 t2 ) as r ) r2 ) r3 return case when @matchPerc > 100 then 100 else @matchPerc end END;
Post a Comment for "Modify Levenshtein-Distance To Ignore Order"