
Scikit-Learn One-hot-encode Before Or After Train/test Split

I am looking at two scenarios for building a model using scikit-learn, and I cannot figure out why one of them is returning a result that is so fundamentally different from the other.

Solution 1:

While the previous comments correctly suggest it is best to map over your entire feature space first, in your case both the train and test sets contain all of the feature values in all of the columns.

If you compare the vectorizer.vocabulary_ between the two versions, they are exactly the same, so there is no difference in mapping. Hence, it cannot be causing the problem.

Method 2 fails because your dat_dict gets re-sorted by the original index when you execute this command:

dat_dict=train_X.T.to_dict().values()

In other words, train_X has a shuffled index going into this line of code. When you turn it into a dict, the dict order re-sorts into the numerical order of the original index. This causes your train and test data to become completely de-correlated with y.
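
Here is a minimal sketch of that failure mode on a made-up one-column frame (the column name and data are hypothetical). It shows that the records produced by .T.to_dict() are keyed by the original index labels, so anything that walks those keys in sorted order, as unordered dicts on older Python/pandas versions effectively did, undoes the shuffle for X while y keeps the shuffled order:

import pandas as pd

# Toy stand-in for train_X / train_y; the column name "cat" is made up.
X = pd.DataFrame({"cat": ["a", "b", "c", "d"]})
y = pd.Series([0, 1, 0, 1])

X_shuf = X.sample(frac=1.0, random_state=0)   # shuffled rows, index e.g. [2, 0, 1, 3]
y_shuf = y.loc[X_shuf.index]                  # y shuffled the same way

# .T.to_dict() keys each record by its *original* index label, so any
# consumer that walks the keys in sorted order (as unordered dicts on
# older Python/pandas effectively did) restores the pre-shuffle order
# for X while y_shuf stays shuffled -- the rows no longer line up.
records = X_shuf.T.to_dict()
print(list(X_shuf.index))                              # shuffled order
print([records[k]["cat"] for k in sorted(records)])    # back in 0..3 order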

Method 1 doesn't suffer from this problem, because you shuffle the data after the mapping.

You can fix the issue by adding a .reset_index(drop=True) both times you assign dat_dict in Method 2, e.g.,

dat_dict=train_X.reset_index(drop=True).T.to_dict().values()

This ensures the data order is preserved when converting to a dict.
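
For context, here is a minimal sketch of the corrected Method 2 flow on a made-up two-column frame (the data and column names are invented; the reset_index(drop=True) plus DictVectorizer pattern is the point):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer

# Made-up toy data; only the reset_index / DictVectorizer pattern matters.
df = pd.DataFrame({"color": ["red", "blue", "green", "red", "blue", "green"],
                   "size": ["S", "M", "L", "L", "M", "S"]})
y = pd.Series([0, 1, 0, 1, 0, 1])

train_X, test_X, train_y, test_y = train_test_split(df, y, test_size=0.33, random_state=42)

# reset_index(drop=True) re-numbers the shuffled rows 0..n-1, so the
# dict records keep the same order as train_y / test_y.
dat_dict = train_X.reset_index(drop=True).T.to_dict().values()
test_dict = test_X.reset_index(drop=True).T.to_dict().values()

vectorizer = DictVectorizer(sparse=False)
X_train = vectorizer.fit_transform(dat_dict)   # learn the feature mapping
X_test = vectorizer.transform(test_dict)       # reuse the same mapping on test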

When I add that bit of code, I get the following results:
- Method 1: Validation Sample Score: 0.3454355044 (normalized gini)
- Method 2: Validation Sample Score: 0.3438430991 (normalized gini)


Solution 2:

I can't get your code to run, but my guess is that in the test dataset either

  • you're not seeing all the levels of some of the categorical variables, so if you build your dummy variables from that data alone you end up with different columns (see the sketch below), or
  • you have the same columns, but in a different order.
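
As a rough illustration of the first point, with made-up data: if a level never appears in the test set, pd.get_dummies produces a narrower frame there, and one common fix is to reindex the test columns onto the training columns:

import pandas as pd

# Hypothetical data: "green" never appears in the test set.
train = pd.DataFrame({"color": ["red", "blue", "green"]})
test = pd.DataFrame({"color": ["red", "blue"]})

train_dummies = pd.get_dummies(train)
test_dummies = pd.get_dummies(test)
print(train_dummies.columns.tolist())   # ['color_blue', 'color_green', 'color_red']
print(test_dummies.columns.tolist())    # ['color_blue', 'color_red']  <- different columns

# One way to fix the mismatch: force the test frame onto the training columns.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)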
