How To Filter Overlap Of Two Rows In A Big File In Python

October 02, 2022 Post a Comment

I am trying to filter overlap rows in a big file in python. The overlap degrees is set to 25% of two rows and the other two rows. In other words, the overlap degrees is a*b/(c+d-

Solution 1:

I've been thinking about how to solve this problem in a better way, without all the reversing and indexing and stuff, and I've come up with a solution that's longer and more involved, but easier to read, prettier, more maintainable and extendable, IMHO.

First we need a special kind of list that we can iterate over "correctly", even if an item in it is deleted. Here is a blog post going into more detail about how lists and iterators work, and reading it will help you understand what's going on here:

class SmartList(list):
    def __init__(self, *args, **kwargs):
        super(SmartList, self).__init__(*args, **kwargs)
        self.iterators = []

    def __iter__(self):
        return SmartListIter(self)

    def __delitem__(self, index):
        super(SmartList, self).__delitem__(index)
        for iterator in self.iterators:
            iterator.item_deleted(index)

We extend the built-in list and make it return a custom iterator instead of the default. Whenever an item in the list is deleted, we call the item_deleted method of every item in the self.iterators list. Here's the code for SmartListIter:

class SmartListIter(object):
    def __init__(self, smartlist, index=0):
        self.smartlist = smartlist
        smartlist.iterators.append(self)
        self.index = index

    def __iter__(self):
        return self

    def next(self):
        try:
            item = self.smartlist[self.index]
        except IndexError:
            self.smartlist.iterators.remove(self)
            raise StopIteration
        index = self.index
        self.index += 1
        return (index, item)

    def item_deleted(self, index):
        if index >= self.index:
            return
        self.index -= 1

So the iterator adds itself to the list of iterators, and removes itself from the same list when it is done. If an item with an index less than the current index is deleted, we decrement the current index by one so that we won't skip an item like a normal list iterator would do.

The next method returns a tuple (index, item) instead of just the item, because that makes things easier when it's time to use these classes -- we won't have to mess around with enumerate.

So that should take care of having to go backwards, but we're still having to use a lot of indexes to juggle between four different lines in every loop. Since two and two lines go together, let's make a class for that:

class LinePair(object):
    def __init__(self, pair):
        self.pair = pair
        self.sets = [set(line.split()) for line in pair]
        self.c = len(self.sets[0]) * len(self.sets[1])

    def overlap(self, other):
        ab = float(len(self.sets[0] & other.sets[0]) * \
            len(self.sets[1] & other.sets[1]))
        overlap = ab / (self.c + other.c - ab)
        return overlap

    def __str__(self):
        return "".join(self.pair)

The pair attribute is a tuple of two lines read directly from the input file, complete with newlines. We use it later to write that pair back to a file. We also convert both lines to a set and calculate the c attribute, which is a property of every pair of lines. Finally we make a method that will compute the overlap between one LinePair and another. Notice that d is gone, since that is just the c attribute of the other pair.

Now for the grand finale:

from itertools import izip

with open("iuputfile.txt") as fileobj:
    pairs = SmartList([LinePair(pair) for pair in izip(fileobj, fileobj)])

for first_index, first_pair in pairs:
    for second_index, second_pair in SmartListIter(pairs, first_index + 1):
        if first_pair.overlap(second_pair) > 0.25:
            del pairs[second_index]

with open("outputfile.txt", "w") as fileobj:
    for index, pair in pairs:
        fileobj.write(str(pair))

Notice how easy it is to read the central loop here, and how short it is. If you need to change this algorithm in the future, it's likely much more easily done with this code than with my other code. The izip is used to group two and two lines of the input file, as explained here.

Solution 2:

Stack Overflow is not here to program for you or to solve general debugging tasks. It's for specific problems that you've tried to solve yourself but can't. You're asking questions you should, as a programmer, be able to figure out yourself. Start your program like this:

python -m pdb my_script.py

Now you can step through your script line by line using the n command. If you want to see what's inside a variable, simply type the name of that variable. By using this method you will find out why things don't work yourself. There are lots of other smart things you can do using pdb (the python debugger) but for this case the n command is sufficient.

Please put in some more effort towards solving your problem yourself before asking another question here.

That being said, here is what was wrong with your modified script:

with open("iuputfile.txt") as fileobj:
    sets = [set(line.split()) for line in fileobj]
    for first_index in range(len(sets) - 4, -2, -2):
        c = len(sets[first_index]) * len(sets[first_index + 1])
        for second_index in range(len(sets) - 2, first_index, -2):
            d = len(sets[second_index]) * len(sets[second_index + 1])
            # Error 1:
            ab = len(sets[first_index] & sets[second_index]) * \
                len(sets[first_index + 1] & sets[second_index + 1])
            # Error 2:
            overlap = (float(ab) / (c + d - ab))
            if overlap > 0.25:
                # Error 3:
                del sets[second_index + 1]
                del sets[second_index]
    with open("outputfile.txt", "w") as fileobj:
        for set_ in sets:
            # You've removed the sorting, I assume it's because the order
            # is unimportant
            output = " ".join(set_)
            fileobj.write("{0}\n".format(output))

The mistakes were:

Error 1: Intersection is &. Union is |.
Error 2: Since all of the variables are integers, the result will be an integer too, unless you're using python 3. If you are, this is not an error. If you're not, you need to make sure one of the variables are a float to force the result to be a float as well. Hence the float(ab).
Error 3: Remember to always work from the back and forwards. When you delete sets[second_index], what used to be at sets[second_index + 1] takes it place, so deleting sets[second_index + 1] afterwards will delete what used to be at sets[second_index + 2], which is not what you want. So we delete the largest index first.

Python Guru

How To Filter Overlap Of Two Rows In A Big File In Python

Solution 1:

Solution 2:

Post a Comment for "How To Filter Overlap Of Two Rows In A Big File In Python"