
Stripping Out Unwanted Characters That Are Breaking Readline()

I'm writing a small script to run through large folders of copyright notice emails and finding relevant information (IP and timestamp). I've already found ways around a few little

Solution 1:

It seems likely that at least some of the emails that you are processing have been encoded as quoted-printable.

This encoding is used to make 8-bit character data transportable over 7-bit (ASCII-only) systems, but it also enforces a maximum line length of 76 characters. Longer lines are wrapped by inserting a soft line break: an "=" followed by the end-of-line marker.

Python provides the quopri module to handle encoding and decoding from quoted-printable. Decoding your data from quoted-printable will remove these soft line breaks.

As an example, let's use the first paragraph of your question.

>>> import quopri
>>> s = """I'm writing a small script to run through large folders of copyright notice emails and finding relevant information (IP and timestamp). I've already found ways around a few little formatting hurdles (sometimes IP and TS are on different lines, sometimes on same, sometimes in different places, timestamps come in 4 different formats, etc.)."""
>>> # Encode to latin-1 as quopri deals with bytes, not strings.
>>> bs = s.encode('latin-1')
>>> # Encode
>>> encoded = quopri.encodestring(bs)
>>> # Observe the "=\n" inserted into the text.
>>> encoded
b"I'm writing a small script to run through large folders of copyright notice=\n emails and finding relevant information (IP and timestamp). I've already f=\nound ways around a few little formatting hurdles (sometimes IP and TS are o=\nn different lines, sometimes on same, sometimes in different places, timest=\namps come in 4 different formats, etc.)."

>>> # Printing without decoding from quoted-printable shows the "=".
>>> print(encoded.decode('latin-1'))
I'm writing a small script to run through large folders of copyright notice=
 emails and finding relevant information (IP and timestamp). I've already f=
ound ways around a few little formatting hurdles (sometimes IP and TS are o=
n different lines, sometimes on same, sometimes in different places, timest=
amps come in 4 different formats, etc.).

>>> # Decode from quoted-printable to remove soft line breaks.
>>> print(quopri.decodestring(encoded).decode('latin-1'))
I'm writing a small script to run through large folders of copyright notice emails and finding relevant information (IP and timestamp). I've already found ways around a few little formatting hurdles (sometimes IP and TS are on different lines, sometimes on same, sometimes in different places, timestamps come in 4 different formats, etc.).
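
To tie this back to the IP search, here is a minimal sketch of a soft line break landing in the middle of an address. The sample body and the address 192.0.2.44 are made up for illustration; the regular expression is the one used in the question's code.

```python
import quopri
import re

# Pattern for a dotted-quad IP address, as used in the question's code.
IP_RE = re.compile(r"[0-9]+(?:\.[0-9]+){3}")

# Hypothetical quoted-printable body: the "=\n" soft break splits the IP.
raw = b"Torrent Hash Value: abc123\nIP Address: 192.0=\n.2.44\n"

# Searching the undecoded text finds nothing, because the address is split.
print(IP_RE.findall(raw.decode("latin-1")))   # []

# Decoding removes the soft break and the address becomes visible again.
decoded = quopri.decodestring(raw).decode("latin-1")
print(IP_RE.findall(decoded))                 # ['192.0.2.44']
```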

To decode correctly, the entire message body needs to be processed, which conflicts with your approach using readline. One way around this is to load the decoded string into a buffer:

import io
import quopri

def getIP(em):
    with open(em, 'rb') as f:
        bs = f.read()
    decoded = quopri.decodestring(bs).decode('latin-1')

    ce = io.StringIO(decoded)
    iplabel = ""
    while "Torrent Hash Value: " not in iplabel:
        iplabel = ce.readline()
        ...
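
Once the whole body is decoded, the line-by-line loop may not be needed at all. The following is only a sketch under assumptions: the helper names find_ip_in_body and get_ip are hypothetical, and it assumes the IP is the first dotted-quad that appears after the "Torrent Hash Value: " label. It also returns None when the label is missing, instead of looping past the end of the file.

```python
import quopri
import re

IP_RE = re.compile(r"[0-9]+(?:\.[0-9]+){3}")
LABEL = "Torrent Hash Value: "


def find_ip_in_body(raw, label=LABEL):
    """Return the first IP-like string after `label`, or None if absent."""
    # Decode the whole body first so soft line breaks cannot split the IP.
    text = quopri.decodestring(raw).decode("latin-1")
    start = text.find(label)
    if start == -1:
        return None  # label never appears in this message
    match = IP_RE.search(text, start)
    return match.group(0) if match else None


def get_ip(path):
    """Read a file as bytes and search its decoded contents."""
    with open(path, "rb") as f:
        return find_ip_in_body(f.read())
```

Note that running quopri over a body that is not actually quoted-printable can alter any "=" followed by two hex digits, so this is safest when the encoding is known.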

If your files contain complete emails - including headers - then using the tools in the email module will handle this decoding automatically.

import email
from email import policy

with open('message.eml') as f:
    s = f.read()
msg = email.message_from_string(s, policy=policy.default)
body = msg.get_content()
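
Copyright notices are often multipart messages, in which case the text has to be taken from one of the parts rather than from the top-level message. A hedged sketch, where get_text_body is a hypothetical helper name: it parses from bytes, so no file encoding has to be guessed, and uses get_body() to pick the plain-text part (falling back to HTML).

```python
from email import message_from_bytes, policy


def get_text_body(raw):
    """Return the decoded text of the main body part, or "" if none."""
    msg = message_from_bytes(raw, policy=policy.default)
    # Prefer text/plain, fall back to text/html; works for both
    # multipart and single-part messages.
    part = msg.get_body(preferencelist=("plain", "html"))
    if part is None:
        return ""
    # get_content() undoes the transfer encoding (such as quoted-printable)
    # and decodes the bytes using the charset declared in the headers.
    return part.get_content()
```

A file would be read with open(path, 'rb') and its contents passed to this function; the returned string can then be searched with the same regular expression as before.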

Solution 2:

Solved. If anyone else has a similar problem: save each line as a string, join them together, and use re.sub() to strip out the soft line breaks, keeping in mind the \r and \n characters. My solution is a bit spaghetti, but it avoids running unneeded regexes on every file:

import codecs
import re

IP_PATTERN = r'[0-9]+(?:\.[0-9]+){3}'

def getIP(em):
    # "with" closes the file on every return path; the original close()
    # calls came after the return statements and never ran.
    with codecs.open(em, encoding='latin1') as ce:
        iplabel = ""
        while "Torrent Hash Value: " not in iplabel:
            iplabel = ce.readline()

        ipraw = ce.readline()
        if "File Size" in ipraw:
            ipraw = ce.readline()

        ip = re.findall(IP_PATTERN, ipraw)
        if ip:
            return ip[0]

        ipraw2 = ce.readline()                          # made this a new var
        ip = re.findall(IP_PATTERN, ipraw2)
        if ip:
            return ip[0]

        # Added this section: join both lines and strip the soft line break.
        ipraw = ipraw + ipraw2
        ipraw = re.sub(r'(=\r*\n)', '', ipraw)
        ip = re.findall(IP_PATTERN, ipraw)
        if ip:
            return ip[0]
        return "No IP found in: " + ipraw
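
If the readline-style loop is worth keeping, the joining step can be pulled out into a small generator. This is a sketch of a variation on the approach above, not the code from this answer: logical_lines is a hypothetical helper that glues any line ending in "=" to the line that follows it. It does not decode "=XX" escapes, and it would wrongly join a plain-text line that happens to end in "=".

```python
def logical_lines(lines):
    """Yield lines with quoted-printable soft line breaks joined."""
    pending = ""
    for line in lines:
        stripped = line.rstrip("\r\n")
        if stripped.endswith("="):
            # Soft break: drop the "=" and the line ending, keep reading.
            pending += stripped[:-1]
            continue
        yield pending + line
        pending = ""
    if pending:
        # The input ended on a soft break; emit what was collected.
        yield pending
```

Because an open file is itself an iterable of lines, logical_lines(ce) can be looped over in place of repeated ce.readline() calls.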
