Function Should Clean Data To Half The Size, Instead It Enlarges It By An Order Of Magnitude

This has been driving me nuts all weekend. I am trying to merge data for different assets around a common timestamp. Each asset's data is a value in a dictionary. The data of inter

Solution 1:

When you merge your dataframes, you are joining on values that are not unique. Every duplicated timestamp on the left matches every duplicated row for that timestamp on the right, so each join multiplies those rows. As you add more and more currencies, you get something closer to a Cartesian product than a join. In the snippet below, I added code to sort the values and then remove duplicates.

from pandas import Series, DataFrame
import pandas as pd

coins='''
Bitcoin
Ripple
Ethereum
Litecoin
Dogecoin
Dash
Peercoin
MaidSafeCoin
Stellar
Factom
Nxt
BitShares
'''
coins = coins.split()  # split on whitespace so no empty names are left over
API = 'https://api.coinmarketcap.com/v1/datapoints/'
data = {}

for coin in coins:
    print(coin)
    try:
        data[coin] = pd.read_json(API + coin)
    except Exception as exc:
        # Skip coins the API cannot serve, but report why.
        print('skipping', coin, exc)
data2 = {}
for coin in data:
    TS = data[coin].market_cap_by_available_supply.map(lambda r: r[0])
    TS = pd.to_datetime(TS,unit='ms').dt.date
    cap = data[coin].market_cap_by_available_supply.map(lambda r: r[1])
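    # Build the per-coin frame directly from the two series.
    df = DataFrame({'timestamp': TS, coin + '_cap': cap})
    # sort_values returns a new frame, so the result must be assigned.
    df = df.sort_values(by=['timestamp', coin + '_cap'])
    # Keep one row per date so the merge key is unique.
    df = df.drop_duplicates(subset='timestamp', keep='last')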
    data2[coin] = df

df = data2['Bitcoin']
# dict.keys() is a view in Python 3 and has no remove(), so build a list.
keys = [coin for coin in data2 if coin != 'Bitcoin']
for coin in keys:
    df = pd.merge(left=df, right=data2[coin], on='timestamp', how='left')
    print(len(df), len(df.columns))
df.to_csv('caps.csv')
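
One plausible source of the duplicate keys is the conversion to calendar dates in the loop above: if the API returns more than one data point per day (an assumption, since the raw data is not shown here), those points collapse to the same `timestamp` value. A minimal sketch with made-up millisecond timestamps:

```python
import pandas as pd

# Hypothetical sample values, not taken from the API:
# 2016-01-01 00:00, 2016-01-01 12:00 and 2016-01-02 00:00 (UTC), in milliseconds.
ms = pd.Series([1451606400000, 1451649600000, 1451692800000])

# Same conversion as in the answer's loop: keep only the calendar date.
dates = pd.to_datetime(ms, unit='ms').dt.date

# Two of the three points now share one date, so one row is a duplicate key.
dup_count = int(dates.duplicated().sum())
print(dup_count)
```

Running the same `duplicated().sum()` check on each coin's `timestamp` column before merging shows whether the join keys are unique.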

EDIT: I have added a table below showing how the size of the table grows as you do your join operations.

This table shows the number of rows after joining 5, 10, 15, 20, 25, and 30 currencies.

Rows        Columns
1015        5
1255        10
5095        15
132071      20
4195303     25
16778215    30
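
The growth pattern can be reproduced on a small scale. The sketch below uses hypothetical toy frames (the `make_frame` helper and its values are invented for illustration) in which one date appears twice in every frame. Each left join doubles the rows for that date.

```python
import pandas as pd

def make_frame(name):
    # Hypothetical toy data: three dates, with the last one appearing twice.
    dates = pd.to_datetime(['2016-01-01', '2016-01-02',
                            '2016-01-03', '2016-01-03'])
    return pd.DataFrame({'timestamp': dates,
                         name + '_cap': [1.0, 2.0, 3.0, 4.0]})

names = ['a', 'b', 'c', 'd']
merged = make_frame(names[0])
row_counts = [len(merged)]
for name in names[1:]:
    # The duplicated date on the left matches both duplicated rows on the right.
    merged = pd.merge(merged, make_frame(name), on='timestamp', how='left')
    row_counts.append(len(merged))

print(row_counts)  # [4, 6, 10, 18]
```

The two unique dates contribute one row each, while the duplicated date contributes 2, 4, 8, then 16 rows, which is the same kind of exponential blow-up seen in the table.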

This table shows that after removing duplicates, each join matches only a single row per timestamp, so the row count stays constant.

Rows        Columns
1000        5
1000        10
1000        15
1000        20
1000        25
1000        30
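
As an optional safeguard, `pd.merge` accepts a `validate` argument that raises `pandas.errors.MergeError` when the keys are not unique, so this kind of blow-up fails loudly instead of silently. A minimal sketch on hypothetical toy frames (the `make_frame` helper and its values are invented for illustration):

```python
import pandas as pd

def make_frame(name):
    # Hypothetical toy data with one duplicated date, deduplicated as in the answer.
    dates = pd.to_datetime(['2016-01-01', '2016-01-02',
                            '2016-01-03', '2016-01-03'])
    df = pd.DataFrame({'timestamp': dates,
                       name + '_cap': [1.0, 2.0, 3.0, 4.0]})
    df = df.sort_values(by=['timestamp', name + '_cap'])
    return df.drop_duplicates(subset='timestamp', keep='last')

merged = make_frame('a')
for name in ['b', 'c', 'd']:
    # validate='one_to_one' asserts that the key is unique on both sides.
    merged = pd.merge(merged, make_frame(name), on='timestamp',
                      how='left', validate='one_to_one')
final_shape = merged.shape  # three dates, one timestamp column plus four caps

# A frame that still has duplicate dates is rejected instead of multiplying rows.
raw = pd.DataFrame({'timestamp': pd.to_datetime(['2016-01-03', '2016-01-03']),
                    'x_cap': [3.0, 4.0]})
try:
    pd.merge(merged, raw, on='timestamp', how='left', validate='one_to_one')
    caught = False
except pd.errors.MergeError:
    caught = True

print(final_shape, caught)
```

With deduplicated inputs the row count stays at the number of unique dates no matter how many frames are joined.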
