Speed Up Reading In A Compressed Bz2 File ('rb' Mode)
Solution 1:
You can use bz2.BZ2Decompressor to deal with huge files. It decompresses blocks of data incrementally, right out of the box:
import bz2, time

t0 = time.time()
time.sleep(0.000001)  # avoid division by zero in the very first rate computation

with open('temp.bz2', 'rb') as fi:
    decomp = bz2.BZ2Decompressor()
    residue = b''
    total_lines = 0
    for data in iter(lambda: fi.read(100 * 1024), b''):
        # concatenate the residue of the previous block to the beginning
        # of the current decompressed data block
        raw = residue + decomp.decompress(data)
        residue = b''
        # process_data(current_block) => do the processing of the current data block
        current_block = raw.split(b'\n')
        if raw[-1:] != b'\n':
            residue = current_block.pop()  # last line could be incomplete
        total_lines += len(current_block)
        print('%i lines/sec' % (total_lines / (time.time() - t0)))
    # process_data(residue) => now finish processing the last line
    total_lines += 1
    print('Final: %i lines/sec' % (total_lines / (time.time() - t0)))
Here I read a chunk of the binary file, feed it into the decompressor and receive a chunk of decompressed data. Be aware that the decompressed data chunks have to be concatenated to restore the original data; this is why the last entry needs special treatment.
In my experiments it runs a little faster than your solution with io.BytesIO(). bz2 is known to be slow, so if it bothers you, consider migrating to snappy or zstandard.
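For example, here is a minimal sketch of streaming line counting with zstandard. It assumes the third-party python-zstandard package (pip install zstandard) and a hypothetical temp.zst file, i.e. the same data recompressed with zstd; it is not part of the original answer:

import io
import zstandard as zstd  # third-party: pip install zstandard

with open('temp.zst', 'rb') as fh:              # hypothetical zstd-compressed file
    dctx = zstd.ZstdDecompressor()
    with dctx.stream_reader(fh) as reader:      # incremental decompression, like BZ2Decompressor
        text = io.TextIOWrapper(reader, encoding='utf-8')
        total_lines = sum(1 for _ in text)      # iterate line by line without loading everything
print('%i lines' % total_lines)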
Regarding the time it takes to process bz2 in Python: it might be fastest to decompress the file into a temporary one using the Linux bzip2 utility and then process a normal text file. Otherwise you will be dependent on Python's implementation of bz2.
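A rough sketch of that approach, streaming the output of the bzcat utility through a pipe instead of writing a temporary file (this assumes bzcat is installed and on the PATH):

import subprocess, time

t0 = time.time()
# let the system bzip2 implementation do the decompression and
# read its stdout as a normal stream of lines
proc = subprocess.Popen(['bzcat', 'temp.bz2'], stdout=subprocess.PIPE)
total_lines = 0
for line in proc.stdout:
    total_lines += 1
proc.wait()
print('%i lines in %.1f s' % (total_lines, time.time() - t0))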
Solution 2:
This method already gives a 2x improvement over native bz2.open.
import bz2, time, io
def chunked_readlines(f):
    s = io.BytesIO()
    while True:
        buf = f.read(1024 * 1024)
        if not buf:
            # flush the final (possibly unterminated) line kept from the last chunk
            last = s.getvalue()
            if last:
                yield last
            return
        s.write(buf)
        s.seek(0)
        L = s.readlines()
        yield from L[:-1]
        s = io.BytesIO()
        # very important: the last line read in the 1 MB chunk might be
        # incomplete, so we keep it to be processed in the next iteration
        # TODO: check if this is ok if f.read() stopped in the middle of a \r\n?
        s.write(L[-1])

t0 = time.time()
i = 0
with bz2.open(r"D:\test.bz2", 'rb') as f:
    for l in chunked_readlines(f):  # 500k lines per second
    # for l in f:                   # 250k lines per second
        i += 1
        if i % 100000 == 0:
            print('%i lines/sec' % (i / (time.time() - t0)))
It is probably possible to do even better.
We could have a 4x improvement if we could use s as a simple bytes object instead of an io.BytesIO. Unfortunately, in this case splitlines() does not behave as expected: splitlines() and iterating over an opened file give different results.
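To illustrate the mismatch (my own example, not from the answer): bytes.splitlines() also breaks on \r and \r\n, whereas iterating over a binary stream splits on \n only, so the two approaches can yield different line counts:

import io

data = b'one\rtwo\nthree\n'
print(data.splitlines())       # [b'one', b'two', b'three']   -- \r also counts as a line break
print(list(io.BytesIO(data)))  # [b'one\rtwo\n', b'three\n']  -- iteration splits on \n only

A possible workaround (an untested assumption on my part) would be to split the bytes buffer explicitly with split(b'\n') instead of splitlines().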