Sklearn Stratified Sampling Based On A Column
I have a fairly large CSV file containing Amazon review data, which I read into a pandas DataFrame. I want to split the data 80-20 (train-test), but while doing so I want to ensure t
Solution 1:
>>> import pandas as pd
>>> Meta = pd.read_csv('C:\\Users\\*****\\Downloads\\so\\Book1.csv')
>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> y = Meta.pop('Categories')
>>> Meta
   ReviewerID      ReviewText  ProductId
0        1212    good product   14444425
1        1233  will buy again     324532
2        5432  not recomended  789654123
>>> y
0 Mobile
1 drugs
2 dvd
Name: Categories, dtype: object
>>> X = Meta
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, stratify=y)
>>> X_test
   ReviewerID    ReviewText  ProductId
0        1212  good product   14444425
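As far as I know, scikit-learn's stratified split needs at least two rows in every class, so the three-row frame above is too small to reproduce as is. Here is a minimal sketch with a made-up frame (the values and the lowercase name `meta` are assumptions, not the asker's data), using the 80-20 split the question asks for:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the review data: 10 rows per category.
meta = pd.DataFrame({
    "ReviewerID": range(30),
    "ReviewText": ["some review text"] * 30,
    "ProductId": range(1000, 1030),
    "Categories": ["Mobile", "drugs", "dvd"] * 10,
})

# Separate the label column from the features, as in the session above.
y = meta.pop("Categories")
X = meta

# stratify=y keeps the category proportions the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(y_train.value_counts().to_dict())
print(y_test.value_counts().to_dict())
```

With 30 rows split evenly across three categories, the test set should get 6 rows, 2 from each category.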
Solution 2:
From the documentation for sklearn.model_selection.train_test_split:
stratify : array-like or None (default is None)
If not None, data is split in a stratified fashion, using this as the class labels.
Going by the API docs, I think you should try X_train, X_test, y_train, y_test = train_test_split(Meta_X, Meta_Y, test_size=0.2, stratify=Meta_Y). You need to assign Meta_X and Meta_Y yourself (I think Meta_Y should be Meta.categories, based on your code).
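If you would rather keep the DataFrame in one piece, train_test_split also accepts a single frame and a separate stratify argument. A rough sketch on made-up, imbalanced data (the frame `df`, its values, and the names `train_df` and `test_df` are assumptions for illustration) that also checks the class proportions afterwards:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 60% Mobile, 30% dvd, 10% drugs.
df = pd.DataFrame({
    "ReviewText": [f"review {i}" for i in range(100)],
    "Categories": ["Mobile"] * 60 + ["dvd"] * 30 + ["drugs"] * 10,
})

# Split the whole frame 80-20, stratifying on the Categories column.
train_df, test_df = train_test_split(
    df, test_size=0.2, random_state=0, stratify=df["Categories"]
)

# Compare the category proportions before and after the split.
print(df["Categories"].value_counts(normalize=True).to_dict())
print(train_df["Categories"].value_counts(normalize=True).to_dict())
print(test_df["Categories"].value_counts(normalize=True).to_dict())
```

All three printed distributions should show the same 60/30/10 proportions, which is the point of stratifying.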
Solution 3:
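I am not sure why StratifiedShuffleSplit hasn't been mentioned by anyone:
from sklearn.model_selection import StratifiedShuffleSplit
# df is the DataFrame holding the reviews, with a 'Categories' column
# n_splits=1 because the loop below keeps only the last split it sees
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(df, df['Categories']):
    # split() yields positional indices, so use iloc rather than loc
    strat_train_set = df.iloc[train_index]
    strat_test_set = df.iloc[test_index]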
For documentation, refer to StratifiedShuffleSplit.
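One detail worth checking: split() yields integer positions, not index labels, so positional indexing is the safer choice when the frame's index is not 0..n-1. A small sketch on made-up data (the frame and its index values are assumptions, chosen so that positions and labels differ):

```python
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical frame whose index deliberately does not start at 0.
df = pd.DataFrame(
    {
        "ReviewText": [f"review {i}" for i in range(20)],
        "Categories": ["Mobile"] * 10 + ["dvd"] * 10,
    },
    index=range(100, 120),
)

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

# Take the single split; the arrays hold row positions, not labels.
train_idx, test_idx = next(splitter.split(df, df["Categories"]))

# iloc selects by position, so it works regardless of the index labels.
strat_train_set = df.iloc[train_idx]
strat_test_set = df.iloc[test_idx]

print(strat_test_set["Categories"].value_counts().to_dict())
```

With this frame, df.loc[train_idx] would raise a KeyError, because the labels 0..19 do not exist in the index.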