Skip to content Skip to sidebar Skip to footer

Parse List Of Strings For Speed

Background I have a function called get_player_path that takes in a list of strings player_file_list and a int value total_players. For the sake of example i have reduced the list

Solution 1:

Let's analyze your function first:

  • your loop should take linear time (O(n)) in the length of the input list, assuming the path lengths are bounded by a relatively "small" number;
  • the sorting takes O(n log(n)) comparisons.

Thus the sorting has the dominant cost when the list becomes big. You can micro-optimize your loop as much as you want, but as long as you keep that sorting at the end, your effort won't make much of a difference with big lists.

Your approach is fine if you're just writing a Python script. If you really needed perfomances with huge lists, you would probably be using some other language. Nonetheless, if you really care about performances (or just to learn new stuff), you could try one of the following approaches:

  • replace the generic sorting algorithm with something specific for strings; see here for example
  • use a trie, removing the need for sorting; this could be theoretically better but probably worse in practice.

Just for completeness, as a micro-optimization, assuming the date has a fixed length of 10 characters:

defget_player_path(player_file_list, total_players):
    player_files_to_process = set()
    for player_file in player_file_list:
        end = player_file.find('/', 12)       # <--- len(date) + len('/') + 1
        file_path = player_file[:end]         # <---
        player_files_to_process.add(file_path)
        iflen(player_files_to_process) == total_players:
            breakreturnsorted(player_files_to_process)

If the IDs have fixed length too, as in your example list, then you don't need any split or find, just:

LENGTH = DATE_LENGTH + ID_LENGTH + 1# 1 is for the slash between date and id
...
for player_file in player_file_list:
    file_path = player_file[:LENGTH]
...

EDIT: fixed the LENGTH initialization, I had forgotten to add 1

Solution 2:

I'll leave this solution here which can be further improved, hope it helps.

player_file_list = (
    "2020-10-27/31001804320549/31001804320549.json",
    "2020-10-27/31001804320549/IDATs/204825150047/foo_bar_Red.idat",
    "2020-10-28/31001804320548/31001804320549.json",
    "2020-10-28/31001804320548/IDATs/204825150123/foo_bar_Red.idat",
    "2020-10-29/31001804320547/31001804320549.json",
    "2020-10-29/31001804320547/IDATs/204825150227/foo_bar_Red.idat",
    "2020-10-30/31001804320546/31001804320549.json",
    "2020-10-30/31001804320546/IDATs/123455150047/foo_bar_Red.idat",
    "2020-10-31/31001804320545/31001804320549.json",
    "2020-10-31/31001804320545/IDATs/597625150047/foo_bar_Red.idat",
)

defget_player_path(l, n):
  pfl = set()
  for i in l:
    i = "/".join(i.split("/")[0:2])
    if i notin pfl:
      pfl.add(i)
    iflen(pfl) == n:
      return pfl

  
  if n > len(pfl):
    print("not enough matches")
    returnprint(get_player_path(player_file_list, 2))
# {'2020-10-27/31001804320549', '2020-10-28/31001804320548'}

Python Demo

Solution 3:

Use dict so that you don't have to sort it since your list is already sorted. If you still need to sort you can always use sorted in the return statement. Add import re and replace your function as follows:

def get_player_path(player_file_list, total_players):
    dct = {re.search('^\w+-\w+-\w+/\w+',pf).group(): 1 for pf in player_file_list}
    return [k for i,k in enumerate(dct.keys()) if i < total_players]

Post a Comment for "Parse List Of Strings For Speed"