
Removing Duplicated Lines From A Txt File

I am processing large text files (~20MB) containing data delimited by line. Most data entries are duplicated, and I want to remove these duplications, keeping only one copy. Also, to complicate matters, some entries are repeated with extra bits of information appended; in those cases I need to keep the copy that contains the extra information and drop the shorter duplicates.

Solution 1:

How about the following (in Python):

prev = None
for line in sorted(open('file')):
    line = line.strip()
    # After sorting, a line with extra bits immediately follows its shorter
    # duplicate, so only print prev when the current line does not extend it.
    if prev is not None and not line.startswith(prev):
        print(prev)
    prev = line
if prev is not None:
    print(prev)

If you find memory usage an issue, you can do the sort as a pre-processing step using Unix sort (which is disk-based) and change the script so that it doesn't read the entire file into memory.
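For example, after pre-sorting on disk (sort file > file.sorted; file.sorted is just an example name), the script can stream line by line instead of calling sorted() on the whole file. A minimal sketch:

# Pre-sort first with:  sort file > file.sorted
prev = None
with open('file.sorted') as f:
    for line in f:
        line = line.strip()
        if prev is not None and not line.startswith(prev):
            print(prev)
        prev = line
if prev is not None:
    print(prev)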

Solution 2:

awk '{x[$1 " " $2 " " $3] = $0} END {for (y in x) print x[y]}'
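This keys each line on its first three fields and keeps the last line seen for each key. A rough Python equivalent, as a sketch (whitespace field-splitting and the filename 'file' are assumptions):

x = {}
for line in open('file'):
    parts = line.split()
    # Later duplicates overwrite earlier ones, matching the awk behavior
    x[tuple(parts[:3])] = line.rstrip('\n')
for kept in x.values():
    print(kept)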

If you need to specify the number of columns for different files:

awk -v ncols=3 '
  {
    # Build the key from the first ncols fields
    key = ""
    for (i = 1; i <= ncols; i++) { key = key FS $i }
    # Keep the longest line seen for each key
    if (length($0) > length(x[key])) { x[key] = $0 }
  }
  # Print each key followed by the longest matching line
  END { for (y in x) print y "\t" x[y] }
'

Solution 3:

This variation on glenn jackman's answer should work regardless of the position of lines with extra bits:

awk '{idx = $1 " " $2 " " $3; if (length($0) > length(x[idx])) x[idx] = $0} END {for (idx in x) print x[idx]}' inputfile

Or

awk -v ncols=3 '
  {
    # Build the key from the first ncols fields
    key = ""
    for (i = 1; i <= ncols; i++) { key = key FS $i }
    # Keep the longest line seen for each key
    if (length($0) > length(x[key])) x[key] = $0
  }
  END { for (y in x) print x[y] }
' inputfile

Solution 4:

This or a slight variant should do:

from pprint import pprint

finalData = {}
for line in input:                      # input: any iterable of lines, e.g. open('file')
    parts = line.split()
    key, extra = tuple(parts[0:3]), parts[3:]
    # Store new keys, and overwrite whenever this copy carries extra bits
    if key not in finalData or extra:
        finalData[key] = extra

pprint(finalData)

outputs:

{('BOB', '123', '1DB'): ['EXTRA', 'BITS'],
 ('DAVE', '789', '1DB'): [],
 ('JIM', '456', '3DB'): ['AX']}
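If you need the result back on disk rather than pretty-printed, a minimal sketch (the filename output.txt and space-separated fields are assumptions):

with open('output.txt', 'w') as out:
    for key, extra in finalData.items():
        out.write(' '.join(key + tuple(extra)) + '\n')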

Solution 5:

You'll have to define a function to split your line into important bits and extra bits, then you can do:

def split_extra(s):
    """Return a pair: the important bits and the extra bits."""
    # One possible implementation, assuming the first three fields are the key:
    parts = s.split()
    return ' '.join(parts[:3]), ' '.join(parts[3:])

data = {}
for line in open('file'):
    impt, extra = split_extra(line)
    # setdefault stores extra for new keys and returns the stored value otherwise
    existing = data.setdefault(impt, extra)
    if len(extra) > len(existing):
        data[impt] = extra

with open('newfile', 'w') as out:
    for impt, extra in data.items():
        out.write(impt + ' ' + extra + '\n')
