Optimize Python Program To Parse Two Large Files At The Same Time
Solution 1:
This is just a hypothesis, but your process could be wasting its allotted CPU slot every time it triggers an I/O operation to fetch a pair of lines. You could try reading groups of lines at a time and processing them in chunks, so that you make the most of each CPU time slot you get on the shared cluster.
from collections import deque

chunkSize = 1000000  # number of characters in each chunk (you will need to adjust this)
chunk1 = deque([""])  # buffered lines from 1st file
chunk2 = deque([""])  # buffered lines from 2nd file

with open(file1, "r") as f1, open(file2, "r") as f2:  # file1/file2 are the input paths
    while chunk1 and chunk2:
        line_f1 = chunk1.popleft()
        if not chunk1:  # last buffered line may be incomplete; read the next chunk
            line_f1, *more = (line_f1 + f1.read(chunkSize)).split("\n")
            chunk1.extend(more)
        line_f2 = chunk2.popleft()
        if not chunk2:
            line_f2, *more = (line_f2 + f2.read(chunkSize)).split("\n")
            chunk2.extend(more)
        # process line_f1, line_f2
        ...
This works by reading a chunk of characters (which must be larger than your longest line) and breaking it down into lines. The lines are placed in a queue for processing.
Because the chunk size is expressed as a number of characters, the last line in the queue may be incomplete.
To ensure that lines are complete before being processed, another chunk is read when we get to the last line in the queue. The additional characters are appended to the end of the incomplete line, and the line splitting is performed on the combined string. Because the last (incomplete) line was concatenated first, the .split("\n") always operates on text that begins at a line boundary.
The process continues with the (now complete) last line, and the rest of the new lines are added to the queue. When the end of a file is reached, read() returns an empty string, so no new lines are added and the loop stops once that file's queue is exhausted.
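To see the split-and-carry step in isolation, here is a minimal sketch on toy strings (the variable names and sample text are illustrative, not part of the program above):

from collections import deque

chunk = deque(["beginning of a"])  # last buffered entry is an incomplete line
incoming = " long line\nsecond line\ntrailing frag"  # pretend this is the next read(chunkSize)

# Complete the pending line, then split the combined text on newlines.
line, *more = (chunk.popleft() + incoming).split("\n")
chunk.extend(more)

print(line)   # beginning of a long line   (now complete)
print(chunk)  # deque(['second line', 'trailing frag'])

The trailing fragment simply stays in the queue until the next read completes it.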