Removing Duplicated Lines From A Txt File
I am processing large text files (~20MB) containing data delimited by line. Most data entries are duplicated, and I want to remove these duplications to keep only one copy. Also, to make it slightly more complicated, some entries are repeated with an extra bit of information appended; in this case I need to keep the entry containing the extra information and delete the shorter duplicates.
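For illustration (reconstructing a plausible input from the output shown in Solution 4 below), the data looks something like this, where the first three whitespace-delimited fields identify an entry:

BOB 123 1DB
JIM 456 3DB AX
DAVE 789 1DB
BOB 123 1DB
JIM 456 3DB AX
DAVE 789 1DB
BOB 123 1DB EXTRA BITS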
Solution 1:
How about the following (in Python):
prev = None
for line in sorted(open('file')):
    line = line.strip()
    if prev is not None and not line.startswith(prev):
        print(prev)
    prev = line
if prev is not None:
    print(prev)
If you find memory usage an issue, you can do the sort as a pre-processing step using Unix sort
(which is disk-based) and change the script so that it doesn't read the entire file into memory.
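A minimal sketch of that streaming variant, assuming the input has already been sorted by Unix sort (the script and file names here are illustrative):

# dedupe.py -- run as:  sort file | python dedupe.py > deduped
import sys

prev = None
for line in sys.stdin:
    line = line.strip()
    # After sorting, a shorter duplicate sorts immediately before its
    # longer version, so suppress prev whenever the next line extends it.
    if prev is not None and not line.startswith(prev):
        print(prev)
    prev = line
if prev is not None:
    print(prev)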
Solution 2:
awk '{x[$1 " " $2 " " $3] = $0} END {for (y in x) print x[y]}'
Note that this keeps the last line seen for each key. If you need to specify the number of key columns for different files, and to keep the longest line for each key:
awk -v ncols=3 '
{
  key = ""
  for (i = 1; i <= ncols; i++) {key = key FS $i}
  if (length($0) > length(x[key])) {x[key] = $0}
}
END {for (y in x) print y "\t" x[y]}
'
Solution 3:
This variation on glenn jackman's answer compares line lengths, so it should work regardless of the position of the lines with extra bits:
awk '{idx = $1 " " $2 " " $3; if (length($0) > length(x[idx])) x[idx] = $0} END {for (idx in x) print x[idx]}' inputfile
Or
awk -v ncols=3 '
{
  key = ""
  for (i = 1; i <= ncols; i++) {key = key FS $i}
  if (length($0) > length(x[key])) x[key] = $0
}
END {for (y in x) print x[y]}
' inputfile
Solution 4:
This or a slight variant should do:
from pprint import pprint

finalData = {}
for line in input:  # 'input' is assumed to be an iterable of lines, e.g. open('file')
    parts = line.split()
    key, extra = tuple(parts[0:3]), parts[3:]
    # Store the entry if the key is new, or overwrite it when this line has extra bits.
    if key not in finalData or extra:
        finalData[key] = extra
pprint(finalData)
outputs:
{('BOB', '123', '1DB'): ['EXTRA', 'BITS'],
('DAVE', '789', '1DB'): [],
('JIM', '456', '3DB'): ['AX']}
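To write the deduplicated data back out as lines, something along these lines would work (the output file name is illustrative):

with open('deduped.txt', 'w') as out:
    for key, extra in finalData.items():
        # key is a tuple of the three identifying fields, extra a list of strings.
        out.write(' '.join(key + tuple(extra)) + '\n')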
Solution 5:
You'll have to define a function to split your line into important bits and extra bits, then you can do:
def split_extra(s):
    """Return a pair: the important bits and the extra bits."""
    ...  # split s however is appropriate for your data

data = {}
for line in open('file'):
    impt, extra = split_extra(line)
    # setdefault stores extra for a new key and returns the stored value either way.
    existing = data.setdefault(impt, extra)
    if len(extra) > len(existing):
        data[impt] = extra

with open('newfile', 'w') as out:
    for impt, extra in data.items():
        out.write(impt + extra)
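For example, a concrete split_extra, assuming (as in the other answers) that the first three whitespace-delimited fields identify an entry and the remainder of the line is the extra bits:

def split_extra(s):
    # Split off at most three key fields; parts[3], if present, is the rest of the line.
    parts = s.rstrip('\n').split(None, 3)
    impt = ' '.join(parts[:3])
    extra = (' ' + parts[3] if len(parts) > 3 else '') + '\n'
    return impt, extra

With this definition, impt + extra reconstructs the full line, and the length comparison above keeps the version carrying the extra bits.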