Sklearn Stratified Sampling Based On A Column

I have a fairly large CSV file containing Amazon review data, which I read into a pandas DataFrame. I want to split the data 80-20 (train-test), but while doing so I want to ensure t

Solution 1:

    >>> import pandas as pd
    >>> Meta = pd.read_csv('C:\\Users\\*****\\Downloads\\so\\Book1.csv')
    >>> import numpy as np
    >>> from sklearn.model_selection import train_test_split
    >>> y = Meta.pop('Categories')
    >>> Meta
       ReviewerID      ReviewText  ProductId
    0        1212    good product   14444425
    1        1233  will buy again     324532
    2        5432  not recomended  789654123
    >>> y
    0    Mobile
    1     drugs
    2       dvd
    Name: Categories, dtype: object
    >>> X = Meta
    >>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, stratify=y)
    >>> X_test
       ReviewerID    ReviewText  ProductId
    0        1212  good product   14444425
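
The three-row sample above has only one row per category, and `stratify` needs at least two rows in every class, so the same approach is easier to verify on a slightly larger frame. The following is a minimal sketch under that assumption: the 30-row DataFrame and its values are made up for illustration, and `test_size=0.2` is used to match the 80-20 split asked for in the question.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the review data: 3 categories, 10 rows each.
meta = pd.DataFrame({
    "ReviewerID": range(30),
    "ReviewText": ["text"] * 30,
    "Categories": ["Mobile", "drugs", "dvd"] * 10,
})

y = meta.pop("Categories")  # labels to stratify on
X = meta                    # remaining feature columns

# 80-20 split that keeps the category proportions in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Per-category row counts in each part.
train_counts = y_train.value_counts().to_dict()
test_counts = y_test.value_counts().to_dict()
print(train_counts)
print(test_counts)
```

Comparing `train_counts` and `test_counts` is a quick way to confirm that every category appears in both parts in the same ratio.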

Solution 2:


stratify : array-like or None (default is None)

If not None, data is split in a stratified fashion, using this as the class labels.

According to the API docs, I think you should try something like:

    X_train, X_test, y_train, y_test = train_test_split(Meta_X, Meta_Y, test_size=0.2, stratify=Meta_Y)

You need to assign Meta_X and Meta_Y yourself (based on your code, I think Meta_Y should be Meta['Categories']).
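
If the features and labels do not need to be separated first, `train_test_split` can also split the whole DataFrame while stratifying on one of its columns. The sketch below assumes a made-up, imbalanced frame named `df`; only `train_test_split` and its `stratify` parameter come from the docs quoted above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data with imbalanced categories (20 / 10 / 10 rows).
df = pd.DataFrame({
    "ReviewText": [f"review {i}" for i in range(40)],
    "Categories": ["Mobile"] * 20 + ["drugs"] * 10 + ["dvd"] * 10,
})

# Split the frame itself; the column is used only to define the strata.
train_df, test_df = train_test_split(
    df, test_size=0.2, random_state=0, stratify=df["Categories"]
)

# Share of each category in the two parts.
train_share = train_df["Categories"].value_counts(normalize=True).to_dict()
test_share = test_df["Categories"].value_counts(normalize=True).to_dict()
print(train_share)
print(test_share)
```

Both parts keep all columns, including `Categories`, which is convenient when the label column is still needed later.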

Solution 3:

I am not sure why nobody has mentioned StratifiedShuffleSplit:

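    from sklearn.model_selection import StratifiedShuffleSplit

    # One 80-20 split; with n_splits > 1 each pass would overwrite the sets below
    split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
    for train_index, test_index in split.split(df, df['Categories']):
        # split() yields positional indices, so select rows with .iloc
        strat_train_set = df.iloc[train_index]
        strat_test_set = df.iloc[test_index]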

For details, refer to the StratifiedShuffleSplit documentation.
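
As a self-contained variant of the loop above, the following sketch runs a single stratified split on a made-up frame whose index does not start at 0. The frame and the names `splitter`, `train_pos` and `test_pos` are assumptions for illustration. It uses `.iloc` because `split()` returns row positions, not index labels.

```python
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical frame with a non-default index (labels 100..119).
df = pd.DataFrame(
    {"Categories": ["Mobile", "drugs"] * 10},
    index=range(100, 120),
)

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

# Take the single split; the arrays hold positions in 0..len(df)-1.
train_pos, test_pos = next(splitter.split(df, df["Categories"]))

# Positional selection works regardless of the index labels.
strat_train_set = df.iloc[train_pos]
strat_test_set = df.iloc[test_pos]

print(strat_train_set["Categories"].value_counts().to_dict())
print(strat_test_set["Categories"].value_counts().to_dict())
```

With a default `RangeIndex`, labels and positions coincide, so label-based selection would happen to give the same rows; on a filtered or re-indexed frame it would not.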

