Scikit-Learn One-hot-encode Before Or After Train/test Split
Solution 1:
While the previous comments correctly suggest it is best to map over your entire feature space first, in your case both the Train and Test contain all of the feature values in all of the columns.
If you compare the vectorizer.vocabulary_ between the two versions, they are exactly the same, so there is no difference in mapping. Hence, the mapping cannot be causing the problem.
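One way to run that comparison is sketched below. The toy rows and the names all_rows, train_rows, vec_all and vec_train are made up for illustration and are not from the question's code. The sketch assumes the encoder is scikit-learn's DictVectorizer, which the vocabulary_ attribute and dat_dict suggest.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical toy rows; the question's real data is not shown here.
all_rows = [
    {"color": "red", "size": 1.0},
    {"color": "blue", "size": 2.0},
    {"color": "red", "size": 3.0},
    {"color": "blue", "size": 4.0},
]
# A training subset that still contains every level of "color".
train_rows = all_rows[:2]

vec_all = DictVectorizer().fit(all_rows)      # map over the full feature space
vec_train = DictVectorizer().fit(train_rows)  # map over the training rows only

# Identical vocabularies mean both fits assign the same column to each feature.
same_mapping = vec_all.vocabulary_ == vec_train.vocabulary_
print(same_mapping)
```

If the two vocabularies differ, the mapping is a plausible culprit. If they match, as in the question, the cause has to be elsewhere.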
The reason Method 2 fails is that your dat_dict gets re-sorted by the original index when you execute this command:
dat_dict=train_X.T.to_dict().values()
In other words, train_X has a shuffled index going into this line of code. When you turn it into a dict, the dict order re-sorts into the numerical order of the original index. This causes your Train and Test data to become completely de-correlated with y.
Method 1 doesn't suffer from this problem, because you shuffle the data after the mapping.
You can fix the issue by adding .reset_index(drop=True) both times you assign dat_dict in Method 2, e.g.:
dat_dict=train_X.reset_index(drop=True).T.to_dict().values()
This ensures the data order is preserved when converting to a dict.
When I add that bit of code, I get the following results:
- Method 1: Validation Sample Score: 0.3454355044 (normalized gini)
- Method 2: Validation Sample Score: 0.3438430991 (normalized gini)
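As an alternative to the reset_index fix, pandas can emit one dict per row directly in row order, so nothing is keyed by the index and the result does not depend on how a given Python version orders dict keys. The sketch below uses DataFrame.to_dict(orient="records"). The toy frame and its column names are hypothetical, and this is a swapped-in technique, not what either method in the question does.

```python
import pandas as pd

# Hypothetical toy data standing in for the question's features and target.
df = pd.DataFrame({"cat": ["a", "b", "c", "d"], "num": [10, 20, 30, 40]})
y = pd.Series([0, 1, 0, 1])

# Shuffle the rows; the index is now out of numerical order, as in Method 2.
train_X = df.sample(frac=1.0, random_state=0)
train_y = y.loc[train_X.index]

# One dict per row, in the frame's current row order (the index is not a key).
dat_dict = train_X.to_dict(orient="records")

# Sanity check: the i-th dict still describes the i-th row of train_X,
# so it lines up with the i-th entry of train_y.
aligned = [row["num"] for row in dat_dict] == train_X["num"].tolist()
print(aligned)
```

The resulting list of dicts can be passed to the vectorizer in place of the .values() view used in the question.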
Solution 2:
I can't get your code to run, but my guess is that one of two things is happening in the test dataset:
- You're not seeing all the levels of some of the categorical variables, so if you calculate your dummy variables on this data alone, you'll actually end up with different columns.
- Or maybe you have the same columns, but they're in a different order?
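The first possibility can be checked directly. The sketch below uses made-up rows and assumes a scikit-learn DictVectorizer. It shows a test set that lacks one level producing fewer columns when encoded on its own, followed by the usual remedy of fitting the encoder on the training rows and only transforming the test rows.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical rows: the test set never sees the level "green".
train_rows = [{"color": "red"}, {"color": "blue"}, {"color": "green"}]
test_rows = [{"color": "red"}, {"color": "blue"}]

# Encoding each set on its own yields different column sets.
train_cols = DictVectorizer().fit(train_rows).feature_names_
test_cols = DictVectorizer().fit(test_rows).feature_names_
print(train_cols)
print(test_cols)

# Fit once on the training rows, then reuse the same mapping for the test rows.
vec = DictVectorizer(sparse=False)
train_encoded = vec.fit_transform(train_rows)
test_encoded = vec.transform(test_rows)
print(train_encoded.shape[1] == test_encoded.shape[1])
```

Because a single fitted vectorizer defines the columns, both the number and the order of columns match between the two encoded arrays.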