
Parse Data From Column Using If's

I have a dataframe column that contains multiple different text qualifiers and I want to be able to set a new column that for each row checks if text is in each row and if so do th

Solution 1:

No need to use numpy; pandas has a few different options for this sort of operation.

import pandas as pd

def parse_row_col1(row):
    # Default to an empty string when no qualifier is found.
    result = ""
    if 'TL~' in row.COL1:
        result = row.COL1.split('TL~')[1].split('_SP~')[0]
    elif 'TB~' in row.COL1:
        result = row.COL1.split('TB~')[1].split('_SP~')[0]
    elif 'PE~' in row.COL1:
        result = row.COL1.split('PE~')[1].split('_BA~')[0]
    return result


parse_res = pd.Series((parse_row_col1(curr) for curr in df.itertuples(index=False)))

Iterating over row tuples isn't as fast as using numpy's select, but it should be far less complex to write and maintain when dealing with a large number of conditions. Also, as @rpanai points out in his answer, select is safest when the conditions are mutually exclusive: if more than one condition matches a row, select silently uses the first one in the list. The if/elif chain above works regardless, with its priority order explicit in the code.
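As one of the other pandas-only options, here is a minimal sketch using Series.str.extract with a regular expression. The sample rows are copied from the question's data; the pattern and the COL2 column name are assumptions made for illustration. Note that it is not an exact equivalent of the function above: a row only matches if the closing _SP~ or _BA~ marker is present, and when several qualifiers appear in one row the leftmost one wins rather than the TL~, TB~, PE~ priority order.

```python
import pandas as pd

# Sample rows copied from the question's data.
df = pd.DataFrame({"COL1": [
    "PB~Cucumber_IT~_TL~Vegatables_SP~",
    "PB~Potato_IT~_TB~Starch_SP~",
    "PB~Onion_IT~_PE~Vegatables_BA~",
]})

# Each alternative mirrors one branch of parse_row_col1:
# text after TL~ or TB~ up to _SP~, or text after PE~ up to _BA~.
pattern = r"(?:TL~|TB~)(.*?)_SP~|PE~(.*?)_BA~"

# With two capture groups, str.extract returns a DataFrame with columns 0 and 1.
extracted = df["COL1"].str.extract(pattern)

# Only one group is filled per row; take whichever is set, else an empty string.
df["COL2"] = extracted[0].fillna(extracted[1]).fillna("")
```

This stays vectorised, so it avoids the Python-level loop, at the cost of a pattern that gets harder to read as the number of qualifiers grows.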

Solution 2:

IIUC, this is a case where you can apply np.select (see the NumPy docs).

import numpy as np
import pandas as pd
from io import StringIO

txt = """COL1
0 PB~Cucumber_IT~_TL~Vegatables_SP~
1 PB~Potato_IT~_TB~Starch_SP~
2 PB~Onion_IT~_PE~Vegatables_BA~"""

df = pd.read_csv(StringIO(txt),
                 sep=r"\s+")  # delim_whitespace=True is deprecated in recent pandas

condList = [df["COL1"].str.contains("TL~"),
            df["COL1"].str.contains("TB~"),
            df["COL1"].str.contains("PE~")]

choiceList = [df["COL1"].str.split('TL~').str[1].str[:-4],
              df["COL1"].str.split('TB~').str[1].str[:-4],
              df["COL1"].str.split('PE~').str[1].str[:-4]]

df["COL2"] = np.select(condList, choiceList)

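Make sure the conditions are mutually exclusive: if several of them match the same row, np.select uses the first matching condition in the list. Rows that match no condition get the default value, which is 0 unless you pass default=.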
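To see how np.select treats overlapping and unmatched rows, here is a small sketch. The three rows and the "first"/"second"/"none" labels are hypothetical, chosen only to make the behaviour visible; they are not from the question's data.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: the second contains both TL~ and TB~,
# the third contains neither qualifier.
s = pd.Series(["A_TL~x_SP~", "A_TL~y_SP~_TB~z_SP~", "A_ZZ~w_SP~"])

cond_list = [s.str.contains("TL~"), s.str.contains("TB~")]
choice_list = ["first", "second"]

# Where several conditions hold, np.select uses the first one in the list;
# where none hold, it falls back to `default`.
labels = np.select(cond_list, choice_list, default="none")

print(labels.tolist())  # ['first', 'first', 'none']
```

If the conditions can overlap, ordering cond_list by priority gives the same first-match behaviour as an if/elif chain.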
