How Do I Sort A Dataframe By An Array Not In The Dataframe
Solution 1:
TL;DRiloc + argsort
We can approach this using iloc where we can take an array of ordinal positions and return the dataframe reordered by these positions.
With the power of iloc, we can sort with any array that specifies the order.
Now, all we need to do is identify a method for getting this ordering. Turns out there is a method called argsort which does exactly this. By passing the results of argsort to iloc, we can get our dataframe sorted out.
Example 1
Using the specified problem above
df.iloc[df.prod(1).argsort()]Same results as above
AB001112221441332552That was for simplicity. We could take this further if performance is an issue and focus on numpy
v = df.values
a = v.prod(1).argsort()
pd.DataFrame(v[a], df.index[a], df.columns)
How fast are these solutions?
We can see that pd_ext_sort is the most concise but does not scale as well as the others.
np_ext_sort gives the best performance at the expense of transparency. Though, I'd argue that it's still very clear what is going on.
backtest setup
def add_drop():
    return df.assign(P=df.prod(1)).sort_values('P').drop('P', 1)
def pd_ext_sort():
    return df.iloc[df.prod(1).argsort()]
def np_ext_sort():
    v = df.values
    a = v.prod(1).argsort()
    return pd.DataFrame(v[a], df.index[a], df.columns)
results = pd.DataFrame(
    index=pd.Index([10, 100, 1000, 10000], name='Size'),
    columns=pd.Index(['add_drop', 'pd_ext_sort', 'np_ext_sort'], name='method')
)
for i in results.index:
    df = pd.DataFrame(np.random.rand(i, 2), columns=['A', 'B'])
    for j in results.columns:
        stmt = '{}()'.format(j)
        setup = 'from __main__ import df, {}'.format(j)
        results.set_value(i, j, timeit(stmt, setup, number=100))
results.plot()
Example 2
Suppose I have a column of negative and positive values. I want to sort by increasing magnitude... however, I want the negatives to come first.
Suppose I have dataframe df
df = pd.DataFrame(dict(A=range(-2, 3)))
print(df)
   A
0 -2
1 -1
2  0
3  1
4  2
I'll set up 3 versions again. This time I'll use np.lexsort which returns the same type of array as argsort. Meaning, I can use it to reorder the dataframe.
Caveat:np.lexsort sorts by the last array in its list first. \shurg
def add_drop():
    return df.assign(P=df.A >= 0, M=df.A.abs()).sort_values(['P', 'M']).drop(['P', 'M'], 1)
def pd_ext_sort():
    v = df.A.values
    return df.iloc[np.lexsort([np.abs(v), v >= 0])]
def np_ext_sort():
    v = df.A.values
    a = np.lexsort([np.abs(v), v >= 0])
    return pd.DataFrame(v[a, None], df.index[a], df.columns)
All of which return
A1 -10 -2203142How fast this time?
In this example, both pd_ext_sort and np_ext_sort outperformed add_drop.
backtest setup
results = pd.DataFrame(
    index=pd.Index([10, 100, 1000, 10000], name='Size'),
    columns=pd.Index(['add_drop', 'pd_ext_sort', 'np_ext_sort'], name='method')
)
for i in results.index:
    df = pd.DataFrame(np.random.randn(i, 1), columns=['A'])
    for j in results.columns:
        stmt = '{}()'.format(j)
        setup = 'from __main__ import df, {}'.format(j)
        results.set_value(i, j, timeit(stmt, setup, number=100))
results.plot(figsize=(15, 6))


Post a Comment for "How Do I Sort A Dataframe By An Array Not In The Dataframe"