Splitting A List Of File Names In A Predefined Ratio
Solution 1:
If I understand the situation correctly, your trying to partition the same proportion of each filename prefix's files. Your current method selects the correct proportion from the whole set of files, but it doesn't consider the different filename prefixes, so it may not get them in the correct proportion (though it will probably be somewhat close, most of the time).
Your second approach avoids that issue by first separating the filenames by prefix, then partitioning each sublist. But if you want a combined list with all the prefixes together, this approach may end up wasting time copying data around, since you have to separate out and then recombine the separate lists by prefix.
I think you can do what you want with a single loop over the filenames. You'll need to keep track of two data points for each filename prefix: The number of files with that prefix you've selected for the first sample and the total number of files with that prefix that you've seen.
ratio = 0.7
prefix_dict = {} # values are lists: [number_selected_for_first_list, total_number_seen]
first_sample = [] # gets a proportion of the files equal to ratio (for each prefix)
second_sample = [] # gets the rest of the files
for filename in list_of_files:
prefix = filename.split("_", 1)[0]
selected_seen = prefix_dict.setdefault(prefix, [0, 0])
selected_seen[1] += 1
if selected_seen[0] < round(ratio * selected_seen[1]):
first_sample.append(filename)
selected_seen[0] += 1
else:
second_sample.append(filename)
The only tricky part to this code is the use of dict.setdefault
to fetch the selected_seen
list. It if the requested prefix
didn't yet exist in the dictionary, a new value ([0, 0]
) will be added to the dictionary under that key (and returned). The later code modifies the list in place.
Depending on how exactly you want to handle inexact proportions, you can change the if
condition a bit. I put in a round
call (which I think will partition most accurately), but the code would work OK without it (biasing the selection towards the second sample) or with selected_seen[0] <= int(ratio * selected_seen[1])
(biasing towards the first sample).
Note that whichever way you choose to round when partitioning each prefix, there's the possibility that the separate prefixes will all end up unbalanced in the same direction, making the overall samples unbalanced by more than you'd normally expect. For instance, if you had ten prefixes with ten files (for 100 files total), a ratio of 7.5 would result in final sample lists of 80 and 20 files rather than 75 and 25. That happens since each of the prefixes gets partitioned 8 and 2 (7.5 rounds up). If every file had a unique prefix, you'd end up with everything in the first sample! If it's very important that the overall samples be the right sizes, you might need to fudge the sampling of the items a bit, based on the overall sample sizes.
Solution 2:
I figured out a good solution to this problem.
all_file_names = {}
# ObjList is a list of objects but we only need # file_name from that object for our solution
for x in ObjList:
if x.file_name not in all_file_names:
all_file_names[x.file_name] = 1
else:
all_file_names[x.file_name] += 1
trainingData = []
testData = []
temp_dict = {}
for x in ObjList:
ratio = int(0.7*all_file_names[x.file_name])+1
if x.file_name not in temp_dict:
temp_dict[x.file_name] = 1
trainingData.append(x)
else:
temp_dict[x.file_name] += 1
if(temp_dict[x.file_name] < ratio):
trainingData.append(x)
else:
testData.append(x)
Post a Comment for "Splitting A List Of File Names In A Predefined Ratio"