How To Perform StandardScaler On Pandas Dataframe With A Column/columns Containing Numpy.ndarrays?
Solution 1:
StandardScaler expects every column to hold numeric values, but col2 and col4 hold sequences, hence the error.
I think it is best to scale the sequence columns separately and then combine them with the rest of the data.
For now, I will assume that within a given column every row's sequence has the same number of elements, e.g. every row of col2 holds a 3-value array.
StandardScaler calculates the mean and std of each column individually, so there are two approaches for sequence columns:
Approach 1: Elements at all positions of the sequence come from the same distribution.
In this case, compute a single mean and std over all values. Fit StandardScaler on the flattened array (reshaped to a single column), then reshape the transformed result back to the original shape.
Approach 2: Elements at different positions of the sequence come from different distributions.
In this scenario, convert the single column to a 2D numpy array with one column per sequence position. Fit StandardScaler on that 2D array (the mean and std of each position are calculated separately) and convert the result back to a single column after the transformation.
Below is code for both approaches:
import numpy as np
from sklearn.preprocessing import StandardScaler

# numeric columns should work as expected
X_train_1 = X_train[['col1', 'col3']]
X_test_1 = X_test[['col1', 'col3']]
sc = StandardScaler()
X_train_1 = sc.fit_transform(X_train_1)
X_test_1 = sc.transform(X_test_1)
# first convert seq column to a 2d array
X_train_col2 = np.vstack(X_train['col2'].values).astype(float)
X_test_col2 = np.vstack(X_test['col2'].values).astype(float)
# for sequence columns, there are two approaches (use one or the other):
# Approach 1: a single mean/std over all values of the sequence column
sc_col2 = StandardScaler()
X_train_2 = sc_col2.fit_transform(X_train_col2.flatten().reshape(-1, 1))
X_train_2 = X_train_2.reshape(X_train_col2.shape)
X_test_2 = sc_col2.transform(X_test_col2.flatten().reshape(-1, 1))
X_test_2 = X_test_2.reshape(X_test_col2.shape)
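# Approach 2: a separate mean/std per sequence position
# (this overwrites sc_col2, X_train_2 and X_test_2 from Approach 1)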
sc_col2 = StandardScaler()
X_train_2 = sc_col2.fit_transform(X_train_col2)
X_test_2 = sc_col2.transform(X_test_col2)
# To assign the scaled values back to the dataframe as a column of lists:
X_test["col2_scaled"] = X_test_2.tolist()
# To stack with other numpy arrays
X_train_scaled = np.hstack((X_train_1, X_train_2))
In approach 2, it is also possible to stack all columns first and then apply StandardScaler to all of them in one shot.
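A minimal sketch of that one-shot variant, assuming a small made-up dataframe (the values and the name df are hypothetical, not from the question) in which col2 holds length-3 arrays:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: col1/col3 are scalars, col2 holds length-3 arrays.
df = pd.DataFrame({
    "col1": [1.0, 2.0, 3.0, 4.0],
    "col2": [np.array([1, 10, 100]), np.array([2, 20, 200]),
             np.array([3, 30, 300]), np.array([4, 40, 400])],
    "col3": [0.5, 0.1, 0.9, 0.3],
})

# Expand the sequence column into one numeric column per position,
# then stack it next to the scalar columns.
seq = np.vstack(df["col2"].values).astype(float)           # shape (4, 3)
scalars = df[["col1", "col3"]].to_numpy(dtype=float)       # shape (4, 2)
stacked = np.hstack((scalars, seq))                        # shape (4, 5)

# A single scaler standardizes every stacked column independently,
# i.e. each scalar column and each sequence position gets its own mean/std.
sc = StandardScaler()
scaled = sc.fit_transform(stacked)

# Put the scaled sequence positions back into a single column of lists.
df["col2_scaled"] = scaled[:, 2:].tolist()
```

With separate train and test sets, you would call fit_transform on the stacked training array and only transform on the stacked test array, as in the code above.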
Solution 2:
Try converting the array into a DataFrame. My limited understanding is that StandardScaler needs to work with a 2-D array instead of a 1-D array.
import pandas as pd
import numpy as np
X = pd.DataFrame(np.array([34, 56, 234]))
y = pd.DataFrame(np.array([11, 598, 1]))
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
X_train
Out[38]:
array([[ 1.],
[-1.]])
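If the only goal is to give StandardScaler a 2-D input, wrapping the values in a DataFrame is not strictly necessary. A sketch, assuming the same three example values and reshaping the 1-D array into a single-feature column instead:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

values = np.array([34, 56, 234], dtype=float)   # 1-D array, shape (3,)

sc = StandardScaler()
# reshape(-1, 1) turns the 1-D array into a 2-D array of shape (3, 1):
# three samples of a single feature, which is the layout the scaler expects.
scaled = sc.fit_transform(values.reshape(-1, 1))
```

Here scaled has shape (3, 1); call scaled.ravel() if a flat 1-D array is needed afterwards.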