Quickest way to find correlation in a pandas dataframe
Well, pandas provides a method to do that. If you have a dataframe, simply calling df.corr() gives you a correlation matrix, which you can then inspect and plot. Here’s an example using the UCI Heart Disease dataset:
import pandas as pd
df = pd.read_csv('../input/heart.csv')
# Pairwise Pearson correlation between the columns
corr = df.corr(method='pearson')
The method argument accepts 'pearson' (the default), 'kendall' or 'spearman', or a custom callable that takes two 1-D arrays and returns a float.
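To see how the choice of method changes the result, here is a minimal sketch on a small made-up dataframe (the `toy` frame, its values, and the `abs_pearson` helper are assumptions for illustration, not part of the heart dataset or the original notebook):

```python
import numpy as np
import pandas as pd

# Small synthetic frame (made-up values, not the heart dataset)
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [1.0, 4.0, 9.0, 16.0, 25.0],  # y = x**2: monotonic in x, but not linear
    "z": [5.0, 3.0, 4.0, 1.0, 2.0],
})

pearson = toy.corr(method="pearson")    # measures linear association
spearman = toy.corr(method="spearman")  # measures rank (monotonic) association


def abs_pearson(a, b):
    """Hypothetical custom metric: absolute value of Pearson's r."""
    return abs(np.corrcoef(a, b)[0, 1])


# A callable receives two 1-D arrays and must return a single float
custom = toy.corr(method=abs_pearson)

print(pearson.round(3))
print(spearman.round(3))
print(custom.round(3))
```

Because x and y have identical rank order, the Spearman value for that pair is 1 while the Pearson value is slightly lower, which is the kind of difference worth checking before settling on a method.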
The next step is to visualize these correlations:
import seaborn as sns
sns.heatmap(corr)
This gives a default heat-map which looks something like this:
A correlation matrix is symmetric, so the heatmap is mirrored across the diagonal and you can hide one of the two triangles without losing any information.
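If you want to convince yourself of the symmetry before masking anything, a quick check along these lines should do. It is a sketch on random synthetic data (the `toy` frame and the variable names are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic data: 50 rows, 4 random numeric columns
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(50, 4)), columns=list("abcd"))
corr_toy = toy.corr()

# The matrix equals its own transpose, and every column correlates 1.0 with itself
is_symmetric = np.allclose(corr_toy.values, corr_toy.values.T)
unit_diagonal = np.allclose(np.diag(corr_toy.values), 1.0)
print(is_symmetric, unit_diagonal)
```

Since both triangles carry the same numbers and the diagonal is always 1, masking the upper triangle together with the diagonal loses nothing.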
import numpy as np
# np.zeros_like gives a zero numpy array with the same shape as its first argument
# np.triu_indices_from gives the upper triangle indices (read: triangle-upper-indices)
mask = np.zeros_like(corr, dtype=bool)  # built-in bool; the np.bool alias is missing in some NumPy versions
mask[np.triu_indices_from(mask)] = True
# Colors
cmap = sns.diverging_palette(240, 10, as_cmap=True)
# Plotting the heatmap
sns.heatmap(corr, mask=mask, linewidths=.5, cmap=cmap, center=0)
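If you don't have heart.csv at hand, the same steps can be tried end to end on synthetic data. The following is a rough sketch, not the original notebook: the `demo` frame, the output file name, and the extra `annot`, `fmt`, `vmin` and `vmax` arguments are my own additions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs outside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data: five random columns, with "e" deliberately tied to "a"
rng = np.random.default_rng(42)
demo = pd.DataFrame(rng.normal(size=(100, 5)), columns=["a", "b", "c", "d", "e"])
demo["e"] = demo["a"] * 0.8 + rng.normal(scale=0.3, size=100)

demo_corr = demo.corr()

# True marks the cells to hide: the upper triangle, including the diagonal
demo_mask = np.triu(np.ones_like(demo_corr, dtype=bool))

fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(
    demo_corr,
    mask=demo_mask,
    cmap=sns.diverging_palette(240, 10, as_cmap=True),
    center=0,
    vmin=-1, vmax=1,        # pin the color scale to the full correlation range
    annot=True, fmt=".2f",  # write the coefficient inside each visible cell
    linewidths=0.5,
    ax=ax,
)
fig.savefig("demo_corr_heatmap.png")  # hypothetical output file name
```

Pinning vmin and vmax to -1 and 1 keeps the colors comparable between plots, which the default data-driven scale does not guarantee.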
That’s it! This will get you started with correlation.