
Python Removing Duplicates And Saving The Result

I am trying to remove duplicates from a 3-column tab-delimited txt file. As long as the first two columns of two rows are duplicates, one of the rows should be removed, even if the two rows have different 3rd columns.

Solution 1:

  • I'm assuming, since you assign input to the first command line argument with input = sys.argv[1] and output to the second, that you intend those to be your input and output file names. But you never open any file for the input data, so you're calling .splitlines() on a file name, not on file contents.

  • Next, splitlines() is the wrong approach here anyway. To iterate over a file line by line, simply use for line in f, where f is an open file. Those lines will include the newline at the end, so it needs to be stripped if it's not supposed to be part of the third column's data.

  • Then you're opening and closing the file inside your loop, which means you'll try to write the entire contents of data to the file every iteration, effectively overwriting any data written to the file before. Therefore I moved that block out of the loop.

  • It's good practice to use the with statement for opening files. with open(out_fn, "w") as outfile will open the file named out_fn and assign the open file to outfile, and close it for you as soon as you exit that indented block.

  • input is a builtin function in Python. I therefore renamed your variables so no builtin names get shadowed.

  • You're trying to write data directly to the output file. This won't work since data is a list of lines. You need to join those lines first, to turn them back into a single string, before writing it to a file.
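To make that last point concrete, here is a minimal sketch (the variable names are illustrative):

```python
data = ["a\tb\t1", "a\tc\t2"]   # a list of lines, as built in the loop

text = "\n".join(data)          # one string, newline-separated
print(repr(text))               # → 'a\tb\t1\na\tc\t2'

# outfile.write(data) would raise TypeError: write() argument must be str;
# joining first produces the single string that write() expects.
```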

So here's your code with all those issues addressed:

from operator import itemgetter
import sys


in_fn = sys.argv[1]
out_fn = sys.argv[2]

getkey = itemgetter(0, 1)
seen = set()
data = []

with open(in_fn, 'r') as infile:
    for line in infile:
        line = line.strip()
        if not line:
            continue  # skip blank lines; getkey would fail on an empty split
        key = getkey(line.split())
        if key not in seen:
            data.append(line)
            seen.add(key)

with open(out_fn, "w") as outfile:
    outfile.write('\n'.join(data))
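Here's a self-contained end-to-end check of the same logic; the file names and sample data are placeholders created in a temporary directory:

```python
import os
import tempfile
from operator import itemgetter

# Sample 3-column tab-delimited input; rows 1 and 2 share columns 0-1,
# so only the first of them should survive.
sample = "a\tb\t1\na\tb\t2\na\tc\t3\n"

with tempfile.TemporaryDirectory() as tmp:
    in_fn = os.path.join(tmp, "input.txt")
    out_fn = os.path.join(tmp, "output.txt")
    with open(in_fn, "w") as f:
        f.write(sample)

    getkey = itemgetter(0, 1)
    seen = set()
    data = []
    with open(in_fn) as infile:
        for line in infile:
            line = line.strip()
            key = getkey(line.split())
            if key not in seen:
                data.append(line)
                seen.add(key)

    with open(out_fn, "w") as outfile:
        outfile.write("\n".join(data))

    with open(out_fn) as f:
        print(f.read())   # the "a b 2" row is gone
```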

Solution 2:

Why is the above code causing an error?
Because you haven't opened the file, you are trying to work with the string input.txt rather than with the file. Then, when you try to access your item, you get an IndexError (list index out of range) because line.split() returns ['input.txt']. How to fix that: open the file, then work with it, not with its name. For example, you can do (I tried to stay as close to your code as possible)
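You can see the failure in isolation with a short sketch (not your exact script):

```python
from operator import itemgetter

ig = itemgetter(0, 1)
input = "input.txt"                # sys.argv[1] is just the file *name*

for line in input.splitlines():    # iterates over the string, not the file
    print(line.split())            # → ['input.txt'] — a one-element list
    try:
        ig(line.split())           # needs indices 0 AND 1
    except IndexError as e:
        print(e)                   # → list index out of range
```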

input = sys.argv[1]
infile = open(input, 'r')
(...)
lines = infile.readlines()
infile.close()
for line in lines:
    (...)

Why is this not saving result?
Because you are opening/closing the file inside the loop. What you need to do is write the data once, after you're out of the loop. Also, you cannot write a list directly to a file. Hence, you need to do something like this (outside of your loop):

outfile = open(output, "w")
for item in data:
    outfile.write(item)
outfile.close()

All together
There are other ways of reading/writing files, and they are pretty well documented on the internet, but I tried to stay close to your code so that you would better understand what was wrong with it.

from operator import itemgetter
import sys

input = sys.argv[1]
infile = open(input, 'r')
output = sys.argv[2]

#Pass any column number you want, note that indexing starts at 0
ig = itemgetter(0,1)
seen = set()
data = []
lines = infile.readlines()
infile.close()
for line in lines:
    print(line)
    key = ig(line.split())
    if key not in seen:
        data.append(line)
        seen.add(key)

print(data)
outfile = open(output, "w")
for item in data:
    outfile.write(item)
outfile.close()

PS: it seems to produce the result that you needed there: Python to remove duplicates using only some, not all, columns
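To generalize to "only some, not all, columns", you can pass whatever column indices you like to itemgetter. A sketch, with the indices 0 and 2 chosen purely as an example (the helper name is mine):

```python
from operator import itemgetter

def dedup_by_columns(lines, *cols):
    """Keep the first line seen for each combination of the given columns."""
    key_of = itemgetter(*cols)
    seen = set()
    out = []
    for line in lines:
        key = key_of(line.split("\t"))
        if key not in seen:
            seen.add(key)
            out.append(line)
    return out

rows = ["a\tb\t1", "a\tc\t1", "a\td\t2"]
# Columns 0 and 2 of the second row duplicate the first, so it is dropped.
print(dedup_by_columns(rows, 0, 2))  # → ['a\tb\t1', 'a\td\t2']
```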

