
Shape Gets Changed When Preprocessing With Column Transformer And Predicting The Testing Data

The data structure is like below. df_train.head() ID y X0 X1 X2 X3 X4 X5 X6 X8 ... X375 X376 X377 X378 X379 X380 X382 X383 X384 X385 0 0

Solution 1:

I tried to create a minimal reproducible example of your problem, and I do not run into any errors myself. Can you run it on your side and check whether there are any important differences between the dataframe created here and yours?

Note that:

  • When transforming your test data, call only transform on the already fitted ColumnTransformer; do not fit it again
  • The OneHotEncoder is initialized with handle_unknown = 'ignore', so categories that appear only in the test data do not raise an error
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Parameters to tweak
n_categories = 10  # Number of categorical columns
# Number of groups which a category will have, to be chosen
# randomly between these two numbers
groups_by_cat = [3, 10]
n_rows = 20
n_binary_cols = 10  # Number of binary columns
list_alpha = list('abcdefghijklmnopqrstuvwxyz')
np.random.seed(42)
groups = []

# names of the columns of the dataframe
col_names = ['X'+str(i) for i in range(n_categories + n_binary_cols)]

# first we randomly generate the set of groups that each category can have
for i in range(n_categories):
    np.random.randn()
    temp_groups = []
    temp_n_groups = np.random.randint(*groups_by_cat)
    for k in range(temp_n_groups):
        group = "".join(np.random.choice(list_alpha,2, replace = True))
        temp_groups.append(group)
    groups.append(temp_groups)

# then we generate n_rows taking samples from the groups generated previously
array_categories = np.random.choice(groups[0],(n_rows,1), replace = True)
for i in range(1,n_categories):
    temp_column = np.random.choice(groups[i],(n_rows,1), replace = True)
    array_categories = np.hstack((array_categories, temp_column))
    

# we generate an array containing the binary columns
array_binaries = np.random.randint(0, 2, (n_rows, n_binary_cols))


# we create the dataframe concatenating together the two arrays
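df = pd.DataFrame(np.hstack((array_categories, array_binaries)), columns = col_names)
# np.hstack casts the whole array to strings, so convert the binary columns
# back to integers; otherwise they would also be treated as categorical
df = df.astype({col: int for col in col_names[n_categories:]})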

y = np.random.random_sample((n_rows,1))

# split
X_train, X_test, y_train, y_test = train_test_split(df, y)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

# create column transformer
cat_cols = df.select_dtypes(include="object").columns
ct = make_column_transformer((OneHotEncoder(handle_unknown='ignore'),cat_cols),
                             remainder='passthrough')

# fit transform the ColumnTransformer
X_train_transformed = ct.fit_transform(X_train)

# fit linearRegression and predict
linereg = LinearRegression()
linereg.fit(X_train_transformed,y_train)
X_test_transformed = ct.transform(X_test)

print("\nSizes of transformed arrays")
print(X_train_transformed.shape)
print(X_test_transformed.shape)

linereg.predict(X_test_transformed)
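
To illustrate what handle_unknown='ignore' does in the example above, here is a minimal sketch. The tiny arrays and the category "zz" are made up for illustration and are not taken from the question's data. A category that was never seen during fit is encoded as a row of zeros instead of raising an error:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: "zz" appears only in the test set
train = np.array([["aa"], ["bb"], ["cc"]], dtype=object)
test = np.array([["bb"], ["zz"]], dtype=object)

# With handle_unknown="ignore", unseen categories become all-zero rows
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)
encoded_test = enc.transform(test).toarray()
print(encoded_test.shape)  # (2, 3): same number of columns as for the training data
print(encoded_test)        # second row is all zeros because "zz" is unknown

# With the default handle_unknown="error", the same transform raises ValueError
strict_enc = OneHotEncoder().fit(train)
try:
    strict_enc.transform(test)
    raised = False
except ValueError:
    raised = True
print(raised)  # True
```

Either way the number of output columns is fixed when the encoder is fitted, which is why the fitted transformer has to be reused for the test data.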

Note that the test data is only transformed with the ColumnTransformer:

X_test_transformed = ct.transform(X_test)

Otherwise the OneHotEncoder() will recompute the necessary columns from your test data, and these might not be exactly the same columns as for your training data (for example, if the test data does not contain some of the groups that were found in your training data). In short, fit learns the categories from the data, transform applies what was learned, and fit_transform does both in one step, so fit and fit_transform should be used on the training data only.
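
The shape change described in the question's title can be reproduced with a small sketch. The two-column dataframes and the helper build_ct below are hypothetical and only serve as an illustration; they are not the asker's data:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: the test set contains only one of the three groups
X_train = pd.DataFrame({"X0": ["aa", "bb", "cc", "aa"], "X1": [0, 1, 1, 0]})
X_test = pd.DataFrame({"X0": ["aa", "aa"], "X1": [1, 0]})


def build_ct():
    # Illustrative helper: one-hot encode X0, pass X1 through unchanged
    return make_column_transformer(
        (OneHotEncoder(handle_unknown="ignore"), ["X0"]),
        remainder="passthrough",
    )


# Correct: fit on the training data, then only transform the test data
ct = build_ct()
train_shape = ct.fit_transform(X_train).shape
correct_test_shape = ct.transform(X_test).shape

# Incorrect: fitting again on the test data learns a different set of columns
wrong_ct = build_ct()
wrong_test_shape = wrong_ct.fit_transform(X_test).shape

print(train_shape)         # (4, 4): 3 one-hot columns + 1 passthrough column
print(correct_test_shape)  # (2, 4): same width as the training data
print(wrong_test_shape)    # (2, 2): only 1 one-hot column + 1 passthrough column
```

A model fitted on the 4-column training array cannot predict on the 2-column array, which is the kind of shape mismatch the question describes.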
