
How To Eliminate Duplicate List Entries In Python While Preserving Case-sensitivity?

I'm looking for a way to remove duplicate entries from a Python list, but with a twist: the final list has to be case-sensitive, with a preference for uppercase words. For example, b

Solution 1:

This does not preserve the order of the words (a set is unordered, so the output order shown below is arbitrary), but it does produce a list of "unique" words with a preference for capitalized ones.

In [34]: words = ['Hello', 'hello', 'world', 'world', 'poland', 'Poland', ]

In [35]: wordset = set(words)

In [36]: [item for item in wordset if item.istitle() or item.title() not in wordset]
Out[36]: ['world', 'Poland', 'Hello']

If you wish to preserve the order as they appear in words, then you could use a collections.OrderedDict:

In [43]: import collections

In [44]: wordset = collections.OrderedDict.fromkeys(words)

In [46]: [item for item in wordset if item.istitle() or item.title() not in wordset]
Out[46]: ['Hello', 'world', 'Poland']
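
On Python 3.7 and later a plain dict also keeps insertion order, so the same filter can be sketched without collections. The helper name prefer_titlecase is hypothetical and not part of the original answer. Note that the check relies on str.title(), so a lowercase word is only dropped when its title-cased form is present; an all-caps variant such as 'HELLO' is not treated as a duplicate of 'hello'.

```python
def prefer_titlecase(words):
    # dict.fromkeys drops exact duplicates and keeps first-seen order (Python 3.7+).
    unique = dict.fromkeys(words)
    # Keep a word if it is title-cased, or if no title-cased variant of it exists.
    return [w for w in unique if w.istitle() or w.title() not in unique]

print(prefer_titlecase(['Hello', 'hello', 'world', 'world', 'poland', 'Poland']))
# ['Hello', 'world', 'Poland']
```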

Solution 2:

Using set to track seen words:

def uniq(words):
    seen = set()
    for word in words:
        l = word.lower()  # Use `word.casefold()` if possible. (3.3+)
        if l in seen:
            continue
        seen.add(l)
        yield word

Usage:

>>> list(uniq(['Hello', 'hello', 'world', 'world', 'Poland', 'poland']))
['Hello', 'world', 'Poland']

UPDATE

The previous version does not honor the preference for uppercase over lowercase. In the updated version I use min, as @TheSoundDefense did.

import collections

def uniq(words):
    seen = collections.OrderedDict()  # Use {} if the order is not important.
    for word in words:
        l = word.lower()  # Use `word.casefold()` if possible (3.3+)
        seen[l] = min(word, seen.get(l, word))
    return list(seen.values())  # list() so Python 3 returns a list rather than a view
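
A quick check of the min-based idea, restated so the snippet runs on its own. This is only a sketch: it assumes Python 3.7+ (plain dict ordering, casefold()), and uniq_prefer_upper is a hypothetical name chosen for illustration.

```python
def uniq_prefer_upper(words):
    seen = {}  # casefolded key -> preferred spelling, kept in first-seen order
    for word in words:
        key = word.casefold()
        # min() keeps the spelling that sorts first; uppercase ASCII sorts before lowercase.
        seen[key] = min(word, seen.get(key, word))
    return list(seen.values())

print(uniq_prefer_upper(['hello', 'Hello', 'world', 'world', 'poland', 'Poland']))
# ['Hello', 'world', 'Poland']
```

Unlike the first version, the capitalized spelling wins even when the lowercase one appears first, while the position of the first occurrence is kept.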

Solution 3:

Since an uppercase letter compares as "smaller" than a lowercase letter (strings are compared by code point), I think you can do this:

orig_list = ["Hello", "hello", "world", "world", "Poland", "poland"]
unique_list = []
for word in orig_list:
  for i in range(len(unique_list)):
    if unique_list[i].lower() == word.lower():
      unique_list[i] = min(word, unique_list[i])
      break
  else:
    unique_list.append(word)

min compares the two strings character by character, so it prefers the word that has an uppercase letter at the earliest position where they differ.
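
To make the comparison rule concrete, here is a small illustration. The sample words are arbitrary, and the behavior shown applies to ASCII letters, where uppercase code points are lower than lowercase ones.

```python
# Strings compare by Unicode code point, one character at a time.
print(ord('H'), ord('h'))      # 72 104
print('Hello' < 'hello')       # True
print(min('hello', 'Hello'))   # Hello

# The first differing character decides the result.
print(min('Hello', 'HeLLO'))   # HeLLO ('L' sorts before 'l' at index 2)
```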


Solution 4:

There are some better answers here, but hopefully this is something simple, different, and useful. This code satisfies the conditions of your test (sequential pairs of matching words) but would fail on anything more complicated, such as non-sequential pairs, non-pairs, or non-strings. For anything more complicated I'd take a different approach.

p1 = ['Hello', 'hello', 'world', 'world', 'Poland', 'poland']
p2 = ['hello', 'Hello', 'world', 'world', 'Poland', 'Poland']

def pref_upper(p):
    q = []
    a = 0
    b = 1

    # Walk the list in adjacent pairs (a, b); `//` keeps the count an integer in Python 3.
    for x in range(len(p) // 2):
        if p[a][0].isupper() and p[b][0].isupper():
            q.append(p[a])
        elif p[a][0].isupper() and p[b][0].islower():
            q.append(p[a])
        elif p[a][0].islower() and p[b][0].isupper():
            q.append(p[b])
        elif p[a][0].islower() and p[b][0].islower():
            q.append(p[b])
        a += 2
        b += 2
    return q

print(pref_upper(p1))
print(pref_upper(p2))
