by gc5
Last Updated September 17, 2018 18:26 PM

I want to apply spearman correlation to two pandas dataframes with the same number of columns (correlation of each pair of rows).

My objective is to compute the distribution of spearman correlations between each pair of rows (r, s) where r is a row from the first dataframe and s is a row from the second dataframe.

I am aware that similar questions have been answered before (see this). However, this question differs because I want to compare each row of first dataframe with ALL the rows in the second. Additionally, this is computationally intensive and it takes hours due to the size of my data. I want to parallelize it and possibly to rewrite it in order to speed it up.

I tried with numba but unfortunately it fails (similar issue to this) because it seems to not recognize scipy spearmanr. My code is the following:

```
def corr(a, b):
dist = []
for i in range(a.shape[0]):
for j in range(b.shape[0]):
dist += [spearmanr(a.iloc[i, :], b.iloc[j, :])[0]]
return dist
```

Pandas has a `corr`

function with the support of `spearman`

. It works on columns, so we can just transpose the dataFrame.

We will append df1 to df2 and calculate the correlation by iterating over each row

```
len_df1 = df1.shape[0]
df2_index = df2.index.values.tolist()
df = df2.append(df1).reset_index(drop=True).T
values = {i: [df.iloc[:,df2_index+[i]].corr(method='spearman').values] for i in range(len_df1)}
```

Doing a Spearman Rank correlation is simply doing a correlation of the ranks.

We can leverage `argsort`

to get ranks. Though the `argsort`

of the `argsort`

does get us the ranks, we can limit ourselves to one sort by slice assigning.

```
def rank(a):
i, j = np.meshgrid(*map(np.arange, a.shape), indexing='ij')
s = a.argsort(1)
out = np.empty_like(s)
out[i, s] = j
return out
```

In the case of correlating ranks, the means and standard deviations are all predetermined by the size of the second dimension of the array.

You can accomplish this same thing without numba, but I'm assuming you want it.

```
from numba import njit
@njit
def c(a, b):
n, k = a.shape
m, k = b.shape
mu = (k - 1) / 2
sig = ((k - 1) * (k + 1) / 12) ** .5
out = np.empty((n, m))
a = a - mu
b = b - mu
for i in range(n):
for j in range(m):
out[i, j] = a[i] @ b[j] / k / sig ** 2
return out
```

For posterity, we could avoid the internal loop altogether but this might have memory issues.

```
@njit
def c1(a, b):
n, k = a.shape
m, k = b.shape
mu = (k - 1) / 2
sig = ((k - 1) * (k + 1) / 12) ** .5
a = a - mu
b = b - mu
return a @ b.T / k / sig ** 2
```

```
np.random.seed([3, 1415])
a = np.random.randn(2, 10)
b = np.random.randn(2, 10)
rank_a = rank(a)
rank_b = rank(b)
c(rank_a, rank_b)
array([[0.32121212, 0.01818182],
[0.13939394, 0.55151515]])
```

If you were working with `DataFrame`

```
da = pd.DataFrame(a)
db = pd.DataFrame(b)
pd.DataFrame(c(rank(da.values), rank(db.values)), da.index, db.index)
0 1
0 0.321212 0.018182
1 0.139394 0.551515
```

We can do a quick validation using `pandas.DataFrame.corr`

```
pd.DataFrame(a.T).corr('spearman') == c(rank_a, rank_a)
0 1
0 True True
1 True True
```

- ServerfaultXchanger
- SuperuserXchanger
- UbuntuXchanger
- WebappsXchanger
- WebmastersXchanger
- ProgrammersXchanger
- DbaXchanger
- DrupalXchanger
- WordpressXchanger
- MagentoXchanger
- JoomlaXchanger
- AndroidXchanger
- AppleXchanger
- GameXchanger
- GamingXchanger
- BlenderXchanger
- UxXchanger
- CookingXchanger
- PhotoXchanger
- StatsXchanger
- MathXchanger
- DiyXchanger
- GisXchanger
- TexXchanger
- MetaXchanger
- ElectronicsXchanger
- StackoverflowXchanger
- BitcoinXchanger
- EthereumXcanger