Pairwise Euclidean Distance With Pandas Ignoring NaNs

June 23, 2022 Post a Comment

I start with a dictionary, which is the way my data was already formatted: import pandas as pd dict2 = {'A': {'a':1.0, 'b':2.0, 'd':4.0}, 'B':{'a':2.0, 'c':2.0, 'd':5.0}, 'C':{'b'

Solution 1:

You can use numpy broadcasting to compute vectorised Euclidean distance (L2-norm), ignoring NaNs using np.nansum.

i = df.values.T
j = np.nansum((i - i[:, None]) ** 2, axis=2) ** .5

If you want a DataFrame representing a distance matrix, here's what that would look like:

df = (lambda v, c: pd.DataFrame(v, c, c))(j, df.columns)
df
          A         B    C
A  0.000000  1.414214  1.0
B  1.414214  0.000000  1.0
C  1.000000  1.000000  0.0

df[i, j] represents the distance between the i and j column in the original DataFrame.

Solution 2:

The code below iterates through columns to calculate the difference.

# Import libraries
import pandas as pd
import numpy as np

# Create dataframe
df = pd.DataFrame({'A': {'a':1.0, 'b':2.0, 'd':4.0}, 'B':{'a':2.0, 'c':2.0, 'd':5.0},'C':{'b':1.0,'c':2.0, 'd':4.0}})
df2 = pd.DataFrame()

# Calculate difference
clist = df.columns
for i in range (0,len(clist)-1):
    for j in range (1,len(clist)):
        if (clist[i] != clist[j]):
            var = clist[i] + '-' + clist[j]
            df[var] = abs(df[clist[i]] - df[clist[j]]) # optional
            df2[var] = abs(df[clist[i]] - df[clist[j]]) # optional