How Do You Import A Numerically Encoded Column In Pandas?
Solution 1:
You can use the category dtype for this:
import pandas as pd
df = pd.DataFrame({"Sex": [1, 2, 1, 1, 2, 1, 2]})
Change the dtype:
df["Sex"] = df["Sex"].astype("category")
print(df["Sex"])
Out[33]:
0 1
1 2
2 1
3 1
4 2
5 1
6 2
Name: Sex, dtype: category
Categories (2, int64): [1, 2]
Rename the categories (the new names are assigned in the order of the existing categories, here [1, 2]):
df["Sex"] = df["Sex"].cat.rename_categories(["Male", "Female"])
print(df)
Out[36]:
Sex
0 Male
1 Female
2 Male
3 Male
4 Female
5 Male
6 Female
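As a variation on the rename step above, rename_categories also accepts a dict-like mapping in current pandas versions, which ties each code to its label explicitly instead of relying on category order. This is a minimal sketch reusing the example column; the variable name labels is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"Sex": [1, 2, 1, 1, 2, 1, 2]})
df["Sex"] = df["Sex"].astype("category")

# A dict maps each old category to its new name, so the result does not
# depend on the order in which pandas stores the categories.
labels = {1: "Male", 2: "Female"}
df["Sex"] = df["Sex"].cat.rename_categories(labels)

print(df["Sex"].tolist())
```

Categories that are missing from the mapping are left unchanged, so a typo in a key fails silently rather than raising an error.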
I tried it on a ~75k-row dataset (the 30 most-reviewed beers from a beer reviews dataset):
# Build a dictionary that assigns each beer name a number from 0 to 29.
rep_dict = dict(zip(df.beer_name.unique(), range(len(df.beer_name.unique()))))
replace is quite slow:
%timeit df["beer_name"].replace(rep_dict)
10 loops, best of 3: 139 ms per loop
map is faster, as expected (because it only looks for exact matches):
%timeit df["beer_name"].map(rep_dict)
100 loops, best of 3: 2.78 ms per loop
Converting the column to the category dtype takes about as long as map:
%timeit df["beer_name"].astype("category")
100 loops, best of 3: 2.57 ms per loop
However, once the column has been converted, renaming the categories is much faster:
df["beer_name"] = df["beer_name"].astype("category")
%timeit df["beer_name"].cat.rename_categories(range(30))
10000 loops, best of 3: 149 µs per loop
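The comparison above can be reproduced roughly with the standard timeit module. This is only a sketch: the beer reviews data is not included here, so it uses synthetic, hypothetical beer names (beer_00 to beer_29) with about the same row count, and the timings will differ from the figures quoted above. It passes rep_dict to rename_categories instead of range(30) so that both approaches produce the same numbering, since astype("category") sorts the categories while unique() keeps order of appearance.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for the beer reviews data (hypothetical names).
names = [f"beer_{i:02d}" for i in range(30)]
rng = np.random.default_rng(0)
s = pd.Series(rng.choice(names, size=75_000), name="beer_name")

# Same construction as in the answer: name -> number in order of appearance.
rep_dict = dict(zip(s.unique(), range(len(s.unique()))))

# Approach 1: map every row through the dictionary.
mapped = s.map(rep_dict)

# Approach 2: convert once, then rename only the 30 categories.
cat = s.astype("category")
renamed = cat.cat.rename_categories(rep_dict)

t_map = timeit.timeit(lambda: s.map(rep_dict), number=20)
t_astype = timeit.timeit(lambda: s.astype("category"), number=20)
t_rename = timeit.timeit(lambda: cat.cat.rename_categories(rep_dict), number=20)

print(f"map:               {t_map / 20 * 1e3:.3f} ms per call")
print(f"astype(category):  {t_astype / 20 * 1e3:.3f} ms per call")
print(f"rename_categories: {t_rename / 20 * 1e3:.3f} ms per call")
```

The rename only touches the category labels, not the 75,000 rows, which is why it is expected to be the cheapest of the three calls.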
So a second map would take as long as the first one, but once the column is categorical, rename_categories is much faster. In pandas versions before 0.19.0, the category dtype cannot be assigned while reading the file; you need to change the types afterwards.
Update: as of pandas 0.19.0, you can pass dtype='category' to read_csv, or pass a dictionary to specify which columns should be parsed as categories.
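A minimal sketch of reading the column as a category directly is shown below. The CSV content and the Age column are hypothetical stand-ins for a real file. The pandas documentation notes that categories read this way are parsed as strings, so the rename converts each category with int() before looking up its label, which also works if the categories happen to be integers.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = "Sex,Age\n1,34\n2,28\n1,45\n"

# Parse only the "Sex" column as a category; other columns are inferred.
df = pd.read_csv(io.StringIO(csv_text), dtype={"Sex": "category"})

labels = {1: "Male", 2: "Female"}
# int(code) handles categories that were parsed as strings such as "1".
df["Sex"] = df["Sex"].cat.rename_categories(lambda code: labels[int(code)])

print(df)
```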