
Credit Card Transaction Classification In Python

I'm curious to see if anyone has any thoughts on how to accomplish this in Python with Pandas. I have a dataframe (df1) with the credit card transaction details that contains the P

Solution 1:

I came up with a solution, but it could take a long time for large DataFrames:

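import pandas as pd

def func(x):
    global df_lookup
    # Return the category of the first lookup name contained in the description.
    for name, category in zip(df_lookup['Name'], df_lookup['Category']):
        if name in x:
            return category
    # No match: record the description in df_lookup so it can be categorized later.
    # DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.
    new_row = pd.DataFrame({'Name': [x], 'Category': ['Needs Category']})
    df_lookup = pd.concat([df_lookup, new_row], ignore_index=True)
    return 'Needs Category'

df1['Category'] = df1['Description'].apply(func)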

If df1 contains a description with no matching name in df_lookup, e.g. GOOGLE 5555555, you would get the following outputs.

output for df1:

                     Description  Amount        Category
0  AMAZON.COM*ajlja09ja AMZN.COM      10          Amazon
1          AMZN Mktp US *ajlkadf      15          Amazon
2           AMZN Prime *an9adjah      20          Amazon
3           Shell Oil 4106541031      20             Gas
4           Shell Oil 4163046510      25             Gas
5                 GOOGLE 5555555      10  Needs Category

output for df_lookup:

             Name        Category
0          AMAZON          Amazon
1            AMZN          Amazon
2       Shell Oil             Gas
3  GOOGLE 5555555  Needs Category

This code iterates over df_lookup for every row in df1, so it may not be the most efficient method when df_lookup holds many categories.
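One way to avoid the per-row Python loop is to let pandas do the matching in a single pass. The following is a minimal sketch, not part of the original answer: it assumes the names in df_lookup are unique, joins them into one regular expression, and uses Series.str.extract to pull out the first name found in each description. The sample frames are rebuilt from the outputs shown above. Note that when several names occur in one description, this picks the leftmost match in the text rather than the first row of df_lookup, and it does not append unmatched descriptions to df_lookup.

```python
import re
import pandas as pd

# Sample data rebuilt from the outputs shown above.
df1 = pd.DataFrame({
    "Description": [
        "AMAZON.COM*ajlja09ja AMZN.COM",
        "AMZN Mktp US *ajlkadf",
        "AMZN Prime *an9adjah",
        "Shell Oil 4106541031",
        "Shell Oil 4163046510",
        "GOOGLE 5555555",
    ],
    "Amount": [10, 15, 20, 20, 25, 10],
})
df_lookup = pd.DataFrame({
    "Name": ["AMAZON", "AMZN", "Shell Oil"],
    "Category": ["Amazon", "Amazon", "Gas"],
})

# Build one alternation pattern; re.escape keeps every name literal.
pattern = "(" + "|".join(re.escape(n) for n in df_lookup["Name"]) + ")"

# Extract the first lookup name found in each description (NaN if none).
matched = df1["Description"].str.extract(pattern, expand=False)

# Translate the matched name into its category (assumes unique names).
name_to_category = df_lookup.set_index("Name")["Category"]
df1["Category"] = matched.map(name_to_category).fillna("Needs Category")

print(df1)
```

Whether this is actually faster depends on the sizes of df1 and df_lookup, so it is worth timing on your own data.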

Solution 2:

You can try the following. It makes a Series that contains sets with all the matching categories (empty if none are matching, or with multiple values if there are multiple matches). There is an explicit loop, but it is on the lookup table (presumably much smaller than df1, the DataFrame to categorize):

import pandas as pd

result = pd.Series([set()] * len(df1), index=df1.index, name='Categories')
dstr = df1['Description'].str
for k, name in df_lookup.set_index('Category')['Name'].items():
    # regex=False treats each name as a literal substring, not a regex.
    idx = dstr.contains(name, regex=False)
    result.loc[idx] = result.loc[idx].apply(lambda s: s | {k})

You could assign this to a new column of df1, or use it in any way you like.

On your example:

>>> df1.assign(categories=result)
                     Description  Amount categories
0  AMAZON.COM*ajlja09ja AMZN.COM      10   {Amazon}
1          AMZN Mktp US *ajlkadf      15   {Amazon}
2           AMZN Prime *an9adjah      20   {Amazon}
3           Shell Oil 4106541031      20      {Gas}
4           Shell Oil 4163046510      25      {Gas}
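
If you want a single label per transaction rather than a set, the sets can be collapsed afterwards. This is a rough sketch under a few assumptions: the small `result` Series below is a hypothetical stand-in for the one built above, `to_label` is an illustrative helper, and the "Ambiguous: ..." label for multiple matches is just one possible convention.

```python
import pandas as pd

# Hypothetical stand-in for the `result` Series of category sets.
result = pd.Series(
    [{"Amazon"}, {"Gas"}, set(), {"Amazon", "Gas"}],
    name="Categories",
)

def to_label(categories):
    """Collapse a set of matching categories into a single label."""
    if not categories:
        # No lookup name matched this description.
        return "Needs Category"
    if len(categories) == 1:
        return next(iter(categories))
    # Several categories matched; flag the row for manual review.
    return "Ambiguous: " + ", ".join(sorted(categories))

labels = result.apply(to_label)
print(labels)
```

Empty sets end up as "Needs Category", which mirrors the placeholder used in Solution 1.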
