
How To Perform StandardScaler On Pandas Dataframe With A Column/columns Containing Numpy.ndarrays?

I have a pandas dataframe that has some columns containing numpy.ndarrays:

   col1                  col2  col3               col4
0     4  array([34, 56, 234])     7  array([765, 654])
1     3  ...
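For a reproducible setup, a dataframe like this can be built as follows (the second row's values are made up for illustration, since the question truncates them):

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe matching the question's shape; second-row values are invented
df = pd.DataFrame({
    "col1": [4, 3],
    "col2": [np.array([34, 56, 234]), np.array([11, 598, 1])],
    "col3": [7, 2],
    "col4": [np.array([765, 654]), np.array([13, 24])],
})
print(df.dtypes)
```

Note that `col2` and `col4` end up with dtype `object`, which is exactly why StandardScaler rejects them.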

Solution 1:

StandardScaler expects each column to hold numeric values, but col2 and col4 hold sequences, hence the error.

I think it would be best to treat the columns with sequences separately and then combine them back with the rest of the data.

For now, I will assume that, for all rows, the number of elements in the sequence for a given column is the same, e.g. every row of col2 has a 3-value array.
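That assumption is easy to verify before relying on it. A minimal sketch, using hypothetical col2 data (the values here are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical sequence column; values are illustrative
col2 = pd.Series([np.array([34, 56, 234]), np.array([11, 598, 1])])

# np.vstack (used below) only works if every row has the same sequence length
lengths = col2.map(len).unique()
assert len(lengths) == 1, f"col2 has mixed sequence lengths: {lengths}"
print(lengths)
```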

Since StandardScaler calculates the mean and std for each column individually, there are two approaches for sequence columns:

Approach 1: Elements at all positions of the sequence come from the same distribution.

In this case, you should compute the mean and std over all values together. After fitting StandardScaler on the flattened array, reshape the result back to the original shape.

Approach 2: Elements at different positions of the sequence come from different distributions.

In this scenario, a single column can be converted to a 2D numpy array. You can fit StandardScaler on that 2D array (each column's mean and std will be calculated separately) and bring it back to a single column after the transformation.

Below is code for both approaches:

import numpy as np
from sklearn.preprocessing import StandardScaler

# numeric columns should work as expected
X_train_1 = X_train[['col1', 'col3']]
X_test_1 = X_test[['col1', 'col3']]

sc = StandardScaler()
X_train_1 = sc.fit_transform(X_train_1)
X_test_1 = sc.transform(X_test_1)

# first convert seq column to a 2d array
X_train_col2 = np.vstack(X_train['col2'].values).astype(float)
X_test_col2 = np.vstack(X_test['col2'].values).astype(float)

# for sequence columns, there are two approaches:
# Approach 1
sc_col2 = StandardScaler()
X_train_2 = sc_col2.fit_transform(X_train_col2.flatten().reshape(-1, 1))
X_train_2 = X_train_2.reshape(X_train_col2.shape)

X_test_2 = sc_col2.transform(X_test_col2.flatten().reshape(-1, 1))
X_test_2 = X_test_2.reshape(X_test_col2.shape)


# Approach 2
sc_col2 = StandardScaler()
X_train_2 = sc_col2.fit_transform(X_train_col2)

X_test_2 = sc_col2.transform(X_test_col2)

# To assign back to dataframe, you can do following:
X_test["col2_scaled"] = X_test_2.tolist()

# To stack with other numpy arrays
X_train_scaled = np.hstack((X_train_1, X_train_2))


In approach 2, it is possible to stack all the columns first and then run StandardScaler on all of them in one shot.
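A minimal sketch of that one-shot variant, using hypothetical stand-in arrays for the numeric and sequence columns (the values are made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two numeric columns and one 3-element sequence column,
# already converted to float arrays as in the snippets above
X_train_num = np.array([[4.0, 7.0], [3.0, 2.0]])
X_train_seq = np.array([[34.0, 56.0, 234.0], [11.0, 598.0, 1.0]])

# Stack horizontally first, then scale once: each of the 5 resulting
# columns gets its own mean/std, matching approach 2's per-position scaling
X_all = np.hstack((X_train_num, X_train_seq))
sc = StandardScaler()
X_all_scaled = sc.fit_transform(X_all)
print(X_all_scaled.shape)  # (2, 5)
```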


Solution 2:

Try converting the array into a dataframe. My limited understanding is that StandardScaler needs to work with 2-D arrays rather than a 1-D array.

import pandas as pd
import numpy as np    

X = pd.DataFrame(np.array([34, 56, 234]))
y = pd.DataFrame(np.array([11, 598, 1]))

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)


from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

X_train
Out[38]: 
array([[ 1.],
       [-1.]])
