Python Removing Duplicates And Saving The Result
Solution 1:
Since you assign input to the first command line argument with input = sys.argv[1], and output to the second, I'm assuming you intend those to be your input and output file names. But you never open any file for the input data, so you're calling .splitlines() on a file name, not on the file's contents.
Next, splitlines() is the wrong approach here anyway. To iterate over a file line by line, simply use for line in f, where f is an open file. Each of those lines includes the newline at its end, so it needs to be stripped if it's not supposed to be part of the third column's data.
Then you're opening and closing the output file inside your loop, which means you try to write the entire contents of data to the file on every iteration, overwriting anything written to the file before. Therefore I moved that block out of the loop.
It's good practice to use the with statement for opening files. with open(out_fn, "w") as outfile opens the file named out_fn, assigns the open file to outfile, and closes it for you as soon as you exit the indented block.
input is a builtin function in Python, so I renamed your variables so that no builtin names get shadowed.
You're trying to write data directly to the output file. This won't work, since data is a list of lines. You need to join those lines first to turn them into a single string again before writing it to the file.
So here's your code with all those issues addressed:
from operator import itemgetter
import sys

in_fn = sys.argv[1]
out_fn = sys.argv[2]

getkey = itemgetter(0, 1)
seen = set()
data = []

with open(in_fn, 'r') as infile:
    for line in infile:
        line = line.strip()
        key = getkey(line.split())
        if key not in seen:
            data.append(line)
            seen.add(key)

with open(out_fn, "w") as outfile:
    outfile.write('\n'.join(data))
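If you want to try the key-based filtering without touching any files, here is a minimal sketch of the same idea as a function over any iterable of lines. The names dedupe_lines and columns are made up for illustration, and skipping blank lines is my own assumption, not something the code above does.

```python
from operator import itemgetter


def dedupe_lines(lines, columns=(0, 1)):
    """Keep only the first line seen for each combination of key columns.

    dedupe_lines and columns are hypothetical names used for illustration.
    """
    getkey = itemgetter(*columns)
    seen = set()
    unique = []
    for line in lines:
        line = line.strip()
        if not line:
            # Assumption: blank lines carry no data, so they are skipped.
            continue
        key = getkey(line.split())
        if key not in seen:
            seen.add(key)
            unique.append(line)
    return unique


# The second line repeats the first two columns of the first one.
sample = ["a b 1\n", "a b 2\n", "\n", "c d 3\n"]
result = dedupe_lines(sample)
print(result)
```

Because an open file is itself an iterable of lines, the same function should also accept the infile object from the code above.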
Solution 2:
Why is the above code causing an error?
Because you haven't opened the file, you are working with the string input.txt rather than with the file's contents. When you then try to access your items, you get a "list index out of range" error, because line.split() returns ['input.txt'], which has only one element.
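To see this for yourself, here is a small sketch that reproduces the failure without any file. The name file_name is just a stand-in for whatever you pass as sys.argv[1].

```python
from operator import itemgetter

getkey = itemgetter(0, 1)
file_name = "input.txt"  # stand-in for sys.argv[1]; no file is opened

# splitlines() on the name yields a single "line": the name itself.
for line in file_name.splitlines():
    fields = line.split()
    try:
        getkey(fields)  # asks for fields[0] and fields[1]
        error = None
    except IndexError as exc:
        error = exc
    print(fields, "->", repr(error))
```

The list has a single element, so asking for index 1 is what raises the exception.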
How to fix that: open the file and then work with it, not with its name.
For example, you can do (I tried to stay as close to your code as possible):
input = sys.argv[1]
infile = open(input, 'r')
(...)
lines = infile.readlines()
infile.close()
for line in lines:
    (...)
Why is this not saving the result?
Because you are opening and closing the file inside the loop. What you need to do is write the data once you're out of the loop. Also, you cannot write a list directly to a file. Hence, you need to do something like this (outside of your loop):
outfile = open(output, "w")
for item in data:
    outfile.write(item)
outfile.close()
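As a variation, the same write can be sketched with a with block and writelines(), which writes every string in a list without adding newlines of its own. That works here because lines coming from readlines() still carry their newline. The temporary out_path and the sample data below are made up so the snippet runs on its own.

```python
import os
import tempfile

# Sample lines shaped like the output of readlines(): newlines are kept.
data = ["a b 1\n", "c d 3\n"]

# Hypothetical output location, used only so the sketch is self-contained.
out_path = os.path.join(tempfile.mkdtemp(), "output.txt")

# The file is opened once, after the loop that built data, and is closed
# automatically when the with block ends.
with open(out_path, "w") as outfile:
    outfile.writelines(data)

# Read the file back to check what was written.
with open(out_path) as check:
    written = check.read()
print(repr(written))
```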
All together
There are other ways of reading and writing files, and they are well documented online, but I tried to stay close to your code so that you can see more easily what was wrong with it:
from operator import itemgetter
import sys

input = sys.argv[1]
infile = open(input, 'r')
output = sys.argv[2]

# Pass any column numbers you want; note that indexing starts at 0
ig = itemgetter(0, 1)
seen = set()
data = []

lines = infile.readlines()
infile.close()

for line in lines:
    print(line)
    key = ig(line.split())
    if key not in seen:
        data.append(line)
        seen.add(key)

print(data)

outfile = open(output, "w")
for item in data:
    outfile.write(item)
outfile.close()
PS: it seems to produce the result that you needed in your other question, "Python to remove duplicates using only some, not all, columns".