
Pipelining Transformer Stages In IBEX, Column Access Problems In Scikit-Learn And Pandas

I'm trying to build a scikit-learn-based pipeline that operates on a pandas DataFrame. At each stage, only a subset of the features should be touched; the rest should pass through.

Solution 1:

ibex (which I co-wrote) makes extensive use of Pandas multilevel indexes.

Suppose we start with

>>> import pandas as pd
>>> df = pd.DataFrame({'source': [2, 44], 'class': [0, 1], 'x': [0, 5], 'y': [0, 6], 'z': [0, 8], 'w': 10})
>>> df
   class  source   w  x  y  z
0      0       2  10  0  0  0
1      1      44  10  5  6  8

Then the beginning of your pipeline gives

>>> (trans(LabelEncoder(), in_cols=['class']) + trans(None, in_cols=['source', 'x', 'y', 'z'])).fit_transform(df)
    functiontransformer_0   functiontransformer_1
    0                       source  x   y   z
0   0                       2       0   0   0
1   1                       44      5   6   8

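The two-level column index is by design: the top level names the transformer that produced each column, and the second level holds that transformer's own output columns.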
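To see what a multilevel column index means for column access, here is a minimal pandas-only sketch. The frame is built by hand to mimic the output shown above (no ibex call is involved), so treat the labels as a stand-in rather than as something ibex guarantees.

```python
import pandas as pd

# Hand-built stand-in for the two-level output shown above.
# The labels are copied from that output; nothing here calls ibex.
columns = pd.MultiIndex.from_tuples([
    ("functiontransformer_0", 0),
    ("functiontransformer_1", "source"),
    ("functiontransformer_1", "x"),
    ("functiontransformer_1", "y"),
    ("functiontransformer_1", "z"),
])
out = pd.DataFrame([[0, 2, 0, 0, 0], [1, 44, 5, 6, 8]], columns=columns)

# A bare second-level name is not a valid key at the top level.
has_plain_x = "x" in out.columns

# Selecting a top-level label returns a sub-frame with single-level columns.
passthrough = out["functiontransformer_1"]

# A full (top, second) tuple addresses exactly one column.
x_values = out[("functiontransformer_1", "x")]
```

In this sketch `has_plain_x` is `False`, which is why a later stage that asks for `x` by its plain name cannot find it in a frame of this shape.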

You can achieve what you want by combining all the transformers with +, so that each one selects its columns from the original df:

>>> p = (
...     trans(LabelEncoder(), in_cols="class")
...     + trans(StandardScaler(), in_cols=["x", "y", "z"])
...     + trans(None, in_cols="source")
... )
>>> p.fit_transform(df)
    functiontransformer_0   functiontransformer_1           functiontransformer_2
    0                       x       y       z       source
0   0                       -1.0    -1.0    -1.0    2
1   1                       1.0     1.0     1.0     44
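If downstream code expects plain, single-level column names, one option is to drop the top level with ordinary pandas calls. The sketch below again uses a hand-built stand-in for the result above rather than running ibex, and renaming the `0` label back to `class` is my own assumption about what you would want, not something the pipeline does for you.

```python
import pandas as pd

# Hand-built stand-in for the transformed frame shown above (no ibex needed).
columns = pd.MultiIndex.from_tuples([
    ("functiontransformer_0", 0),
    ("functiontransformer_1", "x"),
    ("functiontransformer_1", "y"),
    ("functiontransformer_1", "z"),
    ("functiontransformer_2", "source"),
])
result = pd.DataFrame(
    [[0, -1.0, -1.0, -1.0, 2], [1, 1.0, 1.0, 1.0, 44]],
    columns=columns,
)

# Keep only the innermost level of the column index.
flat = result.copy()
flat.columns = result.columns.get_level_values(-1)

# Assumption: the encoded column should be called "class" again.
flat = flat.rename(columns={0: "class"})
```

After this, `flat["x"]` and `flat["class"]` work as ordinary single-level lookups. Dropping the top level is only safe when the second-level names are unique across transformers.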
