import sklearn, leidenalg, itertools, random
import pandas as pd
import numpy as np
import igraph as ig
import sklearn.neighbors
import scipy as sp
import scipy.stats #ensures sp.stats is available below
from statsmodels.stats.multitest import multipletests
import gseapy as gp
# Plotting
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Objective
In this notebook you will learn how to build and analyse a network derived from a gene-metabolite association analysis. Other mixed networks can be analysed in a similar way, differing only in whether and how you can apply the final functional analysis.
Data
Download the data folder. As a test case we will be using the file data/met_genes.tsv, which contains abundances for 125 metabolites and 1992 genes across 24 samples.
Software
This notebook relies on python's igraph for most of the analyses. Most, if not all, functions used here can also be applied with R's igraph. Other packages exist for network analysis including networkx and graph-tool. Snap.py is also a good alternative for large networks.
An introduction to network analysis in igraph in R can be found here.
Instructions
Further instructions and sources to use to generate networks can be found here.
You will use a gene expression dataset (RNAseq, expressed as TPM) from a disease group with 24 samples, to keep memory and time requirements reasonable for this lesson. We assume that batch effects and other technical artifacts are absent, and that the data is ready for analysis. However, there are several important considerations in preprocessing your data:
How you preprocess will depend on the type of data you have and what you want to do with it, and it will strongly affect downstream results. It is thus important to think carefully about this step.
data=pd.read_csv('data/met_genes.tsv', sep="\t", index_col=0)
data.head()
No duplicated features are present.
any(data.index.duplicated())
data.groupby('Type').agg('count')[['p10']] #1992 genes, 125 metabolites
data.shape #2117 features, 25 columns (24 samples + the Type column)
A very quick view shows that several gene clusters are found, including two major groups. However, the analysis below does not perform any statistical filtering.
Our initial network will be built from an association analysis using Spearman correlations. Because such an analysis has a high chance of producing false positives, we will apply a Bonferroni correction to control the family-wise error rate, as well as an FDR correction.
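As a minimal illustration of how the two corrections behave (on a handful of hypothetical P values, not our data):
#toy example: Bonferroni multiplies each P by the number of tests, Benjamini-Hochberg rescales by rank
toy_p=[0.001, 0.01, 0.02, 0.04, 0.2]
print(multipletests(toy_p, method='bonferroni', alpha=0.01)[1]) #adjusted P values, most conservative
print(multipletests(toy_p, method='fdr_bh', alpha=0.01)[1]) #less conservative FDR adjustment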
values=data.loc[:,data.columns!='Type']
meta=data[['Type']]
values.isna().any().any() #we have no rows with NA
We will perform a gene-gene, gene-metabolite, and metabolite-metabolite association analysis by computing pairwise Spearman correlations. Choosing between non-parametric (Kendall or Spearman) and parametric (Pearson) methods depends on your data. Here, because we have a small sample size and want to save computational time, we choose Spearman.
The following takes a few minutes to run on a normal laptop, so you may instead load the resulting correlation table.
#Correlation and P val matrices
Rmatrix, Pmatrix= sp.stats.spearmanr(values.T)
Rmatrix=pd.DataFrame(Rmatrix, index=values.index.copy(), columns=values.index.copy())
sns.clustermap(Rmatrix, cmap="RdBu_r", center=0); #resulting R matrix, similar to that seen in the section above
If we look at the matrix of P values, we can already see that many of the correlations in the top right columns are not significant even before multiple hypothesis correction:
Pmatrix=pd.DataFrame(Pmatrix, index=values.index.copy(), columns=values.index.copy())
changed_Pmatrix=Pmatrix.copy()
changed_Pmatrix[changed_Pmatrix>0.01]=1
sns.clustermap(changed_Pmatrix); #resulting P matrix, similar to that seen in the section above
((2117**2)-2117)/2
We will now adjust the P values based on the number of comparisons made. The heatmaps above show a total of $2117^2 \approx 4.5m$ correlations. However, this counts each correlation twice (gene A vs gene B, and gene B vs gene A) and includes the correlation of each feature with itself; we are only interested in the unique pairs.
This leaves $\binom{2117}{2} = \frac{2117!}{2!(2117-2)!} \approx 2.2m$ correlations. At an error rate of 0.05, the probability of finding at least one false positive is essentially 100%: $1-0.95^{2.2 \times 10^{6}} \approx 1$. We thus need to correct the P values. We will apply a Bonferroni correction as well as an FDR correction.
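As a quick check of these numbers (a side calculation, not needed for the analysis itself):
#number of unique feature pairs, and the probability of at least one false positive at alpha=0.05
from scipy.special import comb
n_pairs=comb(2117, 2, exact=True)
print(n_pairs) #2239786 unique pairs
print(1-0.95**n_pairs) #effectively 1.0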
###Prepares R and P matrices
Pmatrix=Pmatrix.where(np.triu(np.ones(Pmatrix.shape),1).astype(bool)) #keep each pair only once (upper triangle)
Pmatrix.columns.name='Feat2'
Pmatrix=Pmatrix.stack()
Pmatrix.index.names=['v1','v2'] #Avoid stacked names colliding
Pmatrix=Pmatrix.reset_index()
Pmatrix.columns=['feat1','feat2','P']
Rmatrix=Rmatrix.where(np.triu(np.ones(Rmatrix.shape),1).astype(bool))
Rmatrix.columns.name='Feat2'
Rmatrix=Rmatrix.stack()
Rmatrix.index.names=['v1','v2'] #Avoid stacked names colliding
Rmatrix=Rmatrix.reset_index()
Rmatrix.columns=['feat1','feat2','R']
PRmatrix=pd.merge(Rmatrix.copy(), Pmatrix.copy(), on=['feat1','feat2']) #Correlation matrix with both R and P
PRmatrix=PRmatrix.loc[PRmatrix.feat1!=PRmatrix.feat2].dropna()
#Multiple hypothesis correction computed on the P column
adjP=pd.DataFrame(multipletests(PRmatrix['P'], method='bonferroni', alpha=0.01)[1], columns=['Padj'])
FDR=pd.DataFrame(multipletests(PRmatrix['P'], method='fdr_bh', alpha=0.01)[1], columns=['FDR'])
PRmatrix=pd.concat([ PRmatrix, adjP], axis=1)
PRmatrix=pd.concat([ PRmatrix, FDR], axis=1)
PRmatrix.head()
PRmatrix.shape[0]
#total number of correlations w/o repetition: 2.2m
Considering the Bonferroni correction we find 16305 correlations that are statistically significant at adjusted P < 0.01. If we instead use FDR as the correction method, we find 402368, which at an FDR of 0.01 implies roughly $0.01 \times 402368 \approx 4024$ expected false positives.
sum(PRmatrix.Padj<0.01)
sum(PRmatrix.FDR<0.01)
Let's add two additional columns, where we assign R=0 for those correlations that are not statistically significant (adjusted P or FDR > 0.01).
PRmatrix.loc[:,'R (padj)']=PRmatrix['R'].copy()
PRmatrix.loc[:,'R (fdr)']=PRmatrix['R'].copy()
PRmatrix.loc[PRmatrix['Padj']>0.01,'R (padj)']=0
PRmatrix.loc[PRmatrix['FDR']>0.01,'R (fdr)']=0
all_mets=meta.loc[meta.Type=='met'].index
PRmatrix['feat1_type']=['met' if x in all_mets else 'gene' for x in PRmatrix.feat1 ]
PRmatrix['feat2_type']=['met' if x in all_mets else 'gene' for x in PRmatrix.feat2 ]
PRmatrix['int_type']=PRmatrix.feat1_type+'_'+PRmatrix.feat2_type
Now we can see how the initial heatmap of correlations looks.
#Transforming to a squared matrix again
PRQ=pd.concat([
PRmatrix.copy(),
PRmatrix.copy().rename(columns={'feat1':'feat2','feat2':'feat1'}).loc[:,PRmatrix.columns]
]).drop_duplicates()
Rmatrix_fdr=PRQ.copy().pivot(index='feat1',columns='feat2',values='R (fdr)')
Rmatrix_fdr=Rmatrix_fdr.loc[Rmatrix_fdr.sum()!=0]
Rmatrix_padj=PRQ.copy().pivot(index='feat1',columns='feat2',values='R (padj)')
Rmatrix_fdr=Rmatrix_fdr.loc[Rmatrix_fdr.index,Rmatrix_fdr.index].fillna(0)
Rmatrix_padj=Rmatrix_padj.loc[Rmatrix_fdr.index,Rmatrix_fdr.index].fillna(0)
#Showing only the top correlated features (those associated with >60% of the other features)
top_mask=(Rmatrix_fdr!=0).sum()>0.6*Rmatrix_fdr.shape[0]
top_features=top_mask.index[top_mask]
Rmatrix_fdr=Rmatrix_fdr.loc[top_features,top_features]
Rmatrix_padj=Rmatrix_padj.loc[top_features,top_features]
g=sns.clustermap(Rmatrix_padj, cmap="RdBu_r", center=0);
g.fig.suptitle('R (filtered with Bonferroni)');
plt.show()
g=sns.clustermap(Rmatrix_fdr, cmap="RdBu_r", center=0);
g.fig.suptitle('R (filtered with FDR)');
plt.show()
The plots above show that the Bonferroni correction is only selecting very high (absolute) correlations. This may remove false positives, but it may also remove slightly weaker correlations that are biologically relevant. The Bonferroni correction also removes most of the negatively-associated features. Notice this from the distribution of correlation coefficients:
shortPR=PRmatrix.copy().loc[:,['feat1','feat2','R (padj)','R (fdr)']]
shortPR=shortPR.loc[shortPR.feat1!=shortPR.feat2]
fig=plt.figure(figsize=(8,4))
p=sns.distplot(shortPR['R (padj)'][shortPR['R (padj)']!=0], color='black');
p.set(ylabel='PDF (Bonferroni)')
ax2=p.twinx()
g=sns.distplot(shortPR['R (fdr)'][shortPR['R (fdr)']!=0], ax=ax2, color='r');
g.set(ylabel='PDF (FDR)')
fig.legend(labels=['bonferroni (<0.01)','fdr (<0.01)']);
plt.xlabel('R')
plt.title('Distribution of correlation coefficients')
plt.show()
We can also observe that the Bonferroni correction yields a more homogeneous number of associated features for each feature (i.e. first neighbors), compared to the FDR filtering. Notice how for a subset of features many correlations are identified (>200).
feat_associations_padj=pd.concat([
shortPR.copy().loc[shortPR['R (padj)']!=0,][['feat1','feat2']],
shortPR.copy().loc[shortPR['R (padj)']!=0,][['feat2','feat1']].rename(columns={'feat1':'feat2','feat2':'feat1'})]).drop_duplicates().groupby('feat1').agg('count')
feat_associations_fdr=pd.concat([
shortPR.copy().loc[shortPR['R (fdr)']!=0,][['feat1','feat2']],
shortPR.copy().loc[shortPR['R (fdr)']!=0,][['feat2','feat1']].rename(columns={'feat1':'feat2','feat2':'feat1'})]).drop_duplicates().groupby('feat1').agg('count')
feat_associations=pd.concat([feat_associations_padj, feat_associations_fdr], axis=1, sort=True)
feat_associations.fillna(0,inplace=True)
feat_associations.columns=['bonferroni','fdr']
fig=plt.figure(figsize=(8,4))
sns.boxplot( data=feat_associations, notch=True);
plt.title('Number of associations per feature')
plt.show()
Most of the statistically significant correlations are only identified within each omic.
bonferroni_significant=PRmatrix.copy().loc[PRmatrix.Padj<0.01].loc[:,['feat1','feat2','R','int_type']]
bonferroni_significant['sig_test']='bonferroni'
fdr_significant=PRmatrix.copy().loc[PRmatrix.FDR<0.01].loc[:,['feat1','feat2','R','int_type']]
fdr_significant['sig_test']='FDR'
pd.concat([bonferroni_significant,fdr_significant]).groupby(['sig_test','int_type'])['R'].agg('count').reset_index()
In building a graph from an association analysis:
#saves the dataframe
PRmatrix.to_csv('data/association_matrix.tsv', sep="\t", index=False)
We will now build 2 different networks to analyse further:
# Prepares table for being read by igraph
PRmatrix=pd.read_csv('data/association_matrix.tsv', sep="\t")
PRmatrix.loc[PRmatrix['FDR']>0.01,'R (fdr)']=0
PRmatrix=PRmatrix.loc[PRmatrix['R (fdr)']!=0,['feat1','feat2','R (fdr)']]
PRmatrix=PRmatrix.loc[PRmatrix.feat1!=PRmatrix.feat2] #drops self correlations
fdr_pos_mat=PRmatrix.loc[PRmatrix['R (fdr)']>0]
fdr_neg_mat=PRmatrix.loc[PRmatrix['R (fdr)']<0]
PRmatrix=PRmatrix.loc[PRmatrix.isin(pd.unique(fdr_pos_mat[['feat1','feat2']].values.flatten())).sum(1)==2,] #selects only nodes also found in the positive network so that we can compare networks of the same sizes
We will now build the k-NNG, using distances as input to determine the nearest neighbours. Because this data contains both gene expression and metabolite quantifications, we need to normalize the features beforehand (we did not need to do this above because Spearman correlation compares ranks).
We'll normalize all features so that they range between 0 and 1.
#Imports and normalizes met and gene data so that we can compute similarities between them
from sklearn.preprocessing import MinMaxScaler
data=pd.read_csv('data/met_genes.tsv', sep="\t", index_col=0)
scaled_data=pd.DataFrame(MinMaxScaler().fit_transform(data.loc[:,data.columns!='Type'].T).T, columns=data.columns[data.columns!='Type'], index=data.index)
scaled_data_values=scaled_data.copy()
scaled_data['Type']=data.Type
#Plots the data distribution
fig,ax=plt.subplots(figsize=(12,8))
sns.kdeplot(
scaled_data.loc[scaled_data.Type=='met',scaled_data.columns!='Type'].values.flatten(),
color='r', label='Mets', ax=ax, legend=False)
ax2=ax.twinx()
sns.kdeplot(
scaled_data.loc[scaled_data.Type=='genes',scaled_data.columns!='Type'].values.flatten(),
color='b', label='Genes', ax=ax2, legend=False)
fig.suptitle('Distribution for met and genes');
fig.legend();
We will now generate the graphs from the dataframes above
### Generating the kNN graph
#Computes a kNN adjacency matrix from the input dataset
#and prepares table for being read by igraph
input_ds=scaled_data_values.loc[scaled_data_values.index.isin(pd.unique(fdr_pos_mat[['feat1','feat2']].values.flatten()))]
knnG=sklearn.neighbors.kneighbors_graph(input_ds.values, 50, metric='euclidean')
knnG=pd.DataFrame(knnG.toarray(), columns=input_ds.index.copy(), index=input_ds.index.copy()) #adjacency matrix
knnG.index.name='gene1'
knnG.columns.name='gene2'
knnG=knnG.stack().reset_index().rename(columns={0:'Connectivity'})
knnG=knnG.loc[knnG['Connectivity']!=0]
### Generates each of the graphs
#positive associations, weighted
pos_w=ig.Graph.TupleList([tuple(x) for x in fdr_pos_mat.values], directed=False, edge_attrs=['w'])
#full network, unweighted
all_u=ig.Graph.TupleList([tuple(x) for x in PRmatrix.loc[PRmatrix.isin(pd.unique(fdr_pos_mat[['feat1','feat2']].values.flatten())).sum(1)==2,['feat1','feat2']].values], directed=False)
#knnG, unweighted
knn=ig.Graph.TupleList([tuple(x) for x in knnG.values], directed=False)
#random network, unweighted, with node and edge numbers matched to the positive association network
randomG=ig.Graph.Erdos_Renyi(
n=pos_w.vcount(),
m=pos_w.ecount(), directed=False, loops=False)
For representation purposes we will look at a small subset of the kNN graph. Be careful when drawing the other graphs: they have many more edges, so plotting becomes computationally heavy.
#random subset of knn graph for plotting
short_knn=ig.Graph.TupleList([tuple(x) for x in knnG.values[random.sample(list(np.arange(len(knnG.values))), 10000)]], directed=False)
#This plots each graph, using degree to present node size:
short_knn.vs['degree']=short_knn.degree()
short_knn.vs['degree_size']=[(x*15)/(max(short_knn.vs['degree'])) for x in short_knn.vs['degree']] #degree is rescaled so that the largest node has size 15 and all nodes remain visible
layout = short_knn.layout_mds()
ig.plot(short_knn, layout=layout, vertex_color='white', edge_color='silver', vertex_size=short_knn.vs['degree_size'])
In the next table, we can see that while all networks have the same number of nodes, the number of edges varies greatly. We also see that each network is fully connected, which is not always the case. If a network is not connected, we can select the k largest connected components and proceed with the analysis on those. The largest connected component is called the giant component.
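As a minimal sketch (not needed here, since our graphs are connected), this is how the giant component could be extracted with igraph:
#keep only the largest connected component of a graph
components=knn.clusters() #connected components of the kNN graph
giant=components.giant() #subgraph induced by the largest component
print(len(components), giant.vcount(), knn.vcount()) #number of components, giant size, total nodes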
#function to get graph properties, takes a few minutes to run
def graph_prop(input_graph):
ncount=input_graph.vcount()
ecount=input_graph.ecount()
diameter=input_graph.diameter()
av_path=input_graph.average_path_length()
dens=input_graph.density()
clustering=input_graph.transitivity_undirected() #this is the global clustering coefficient
conn=input_graph.is_connected()
min_cut=input_graph.mincut_value()
out=pd.DataFrame([ncount, ecount, diameter, av_path, dens, clustering, conn, min_cut],
index=['node_count','edge_count','diameter','av_path_length','density','clustering_coef','connected?','minimum_cut']).T
return(out)
#summarizing graph properties
network_stats=pd.DataFrame()
for nn in [pos_w, all_u, knn, randomG]:
network_stats=pd.concat([network_stats,graph_prop(nn)])
network_stats.index=['pos_w','all_u','knn','random']
network_stats
We'll look into different centrality measures:
Degree, Betweenness and Closeness are additionally computed for the positive association network by taking each edge's weight into account. For instance, the weighted degree (strength) of a node is the sum of the weights of its edges (see the small example below).
Because the number of shortest paths in a network scales with the network size, Betweenness is rescaled by the network size (and Closeness is normalized) so that these measures can be compared between the four networks above.
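As a toy illustration of the difference between degree and weighted degree (a hypothetical two-edge graph, not part of our data):
#degree counts edges, strength sums their weights
toy=ig.Graph.TupleList([('a','b',0.9),('b','c',0.2)], directed=False, edge_attrs=['w'])
print(toy.degree()) #[1, 2, 1]
print(toy.strength(weights='w')) #[0.9, 1.1, 0.2]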
Degree distribution
Let's start by comparing the degrees of the random network against the three other networks. From the figures below it seems that there is no relationship between the degree of the random network, and any of the three others.
fig, axes=plt.subplots(nrows=1, ncols=3, figsize=(15,5), sharey='row')
for i, ax in zip(range(3), axes.flat):
sns.regplot(x=[pos_w,all_u,knn][i].degree(), y=randomG.degree(), ax=ax);
ax.set_title(['pos','all','knn'][i])
ax.set(xlabel='k ('+['pos','all','knn'][i]+')', ylabel='k (random)')
def transform_degree(graph):
alldegs=graph.degree()
alldegs=pd.DataFrame([[key,len(list(group))] for key,group in itertools.groupby(sorted(alldegs))], columns=['k','count']) #sort first so that each degree value is counted once
alldegs['P(k)']=[x/alldegs['count'].sum() for x in alldegs['count']]
alldegs=alldegs.loc[:,['k','P(k)']]
alldegs.drop_duplicates(inplace=True)
alldegs.reset_index(drop=True, inplace=True)
return(alldegs)
fig, ax = plt.subplots(figsize=(7, 7))
# ax.set(yscale='log', xscale='log')
p=sp.stats.probplot(pos_w.degree(), plot=ax)
a=sp.stats.probplot(all_u.degree(), plot=ax)
k=sp.stats.probplot(knn.degree(), plot=ax)
r=sp.stats.probplot(randomG.degree(), plot=ax)
col=['blue','','lightsalmon','','green','','red']
for x in np.arange(0,7,2):
ax.get_lines()[x].set_markerfacecolor(col[x])
ax.get_lines()[x].set_markeredgewidth(0)
ax.get_lines()[x+1].set_color(col[x])
fig.legend(labels=['pos','pos','all','all','knn','knn','random','random']);
ax.set(xlabel='Data quantiles', ylabel='observed degree (k)')
plt.show()
Centrality
We will now compare different centrality metrics between the graphs.
### This function computes different centrality measures
def combine_raw_centralities():
def centrality_raw(input_graph, graph_name):
deg=input_graph.degree()
ecc=input_graph.eccentricity()
node_n=input_graph.vcount()
btw=[(2*x / ((node_n-1)*(node_n-2))) for x in input_graph.betweenness(directed=False)] #scaled to account for network size
cls=input_graph.closeness(normalized=True)
#if edge weight is defined, compute weighted centralities
#otherwise, return a vector of 0
if('w' in input_graph.es.attribute_names()):
deg_w=input_graph.strength(weights='w') #this is a weighted degree
btw_w=[(2*x / ((node_n-1)*(node_n-2))) for x in input_graph.betweenness(directed=False, weights='w')]
cls_w=input_graph.closeness(weights='w', normalized=True)
else:
btw_w=np.repeat(0,input_graph.vcount())
deg_w=np.repeat(0,input_graph.vcount())
cls_w=np.repeat(0,input_graph.vcount())
out=pd.DataFrame(
[deg, ecc, btw, cls, deg_w, btw_w, cls_w],
index=['degree','eccentricity','betweeness', 'closeness',
'deg_w','btw_w','cls_w']).T
out['graph']=graph_name
out=out.loc[:,np.append(['graph'],out.columns[out.columns!='graph'])]
##Adds centralities for each node in the network
input_graph.vs['degree']=out.degree
input_graph.vs['eccentricity']=out.eccentricity
input_graph.vs['betweeness']=out.betweeness
input_graph.vs['closeness']=out.closeness
input_graph.vs['deg_w']=out.deg_w
input_graph.vs['btw_w']=out.btw_w
input_graph.vs['cls_w']=out.cls_w
return(out)
#Computes centralities for all networks
network_centralities_raw=pd.DataFrame()
network_centralities_raw=pd.concat([network_centralities_raw,centrality_raw(pos_w,'pos_w')])
network_centralities_raw=pd.concat([network_centralities_raw,centrality_raw(all_u,'all_u')])
network_centralities_raw=pd.concat([network_centralities_raw,centrality_raw(knn,'knn')])
network_centralities_raw=pd.concat([network_centralities_raw,centrality_raw(randomG,'random')])
return(network_centralities_raw)
network_centralities_raw=combine_raw_centralities()
fig, axes=plt.subplots(nrows=4, ncols=2, figsize=(16,20), sharey='row')
cen=['degree','deg_w','betweeness', 'btw_w', 'closeness','cls_w','eccentricity']
for i, ax in zip(range(10), axes.flat):
if(i==7):
fig.delaxes(ax)
break
sns.boxplot(data=network_centralities_raw, x='graph', y=cen[i], #outliers are hidden from plotting
notch=True, ax=ax, showfliers=False)
ax.set_title(cen[i])
ax.set(xlabel='')
Overall, we see:
Degree: Full network > Positive assoc. network > kNN-G.
Betweenness: kNN-G > Full network ~ Positive assoc. network.
Closeness: kNN-G > Full network > Positive assoc. network.
Eccentricity: similar for the Positive assoc. network and the Full network, but nodes in the latter tend to display higher eccentricity. In turn, most nodes in the kNN-G tend to display an eccentricity of 5-6.
We then examine genes and metabolites for each graph:
full_centralities=pd.DataFrame()
for net in [0,1,2]:
net_in=[pos_w, all_u, knn, randomG][net]
net_nm=['pos_w', 'all_u', 'knn'][net]
temp=pd.DataFrame([net_in.vs[att] for att in ['name','degree','betweeness', 'closeness','eccentricity']], index=['name','degree','betweeness', 'closeness','eccentricity']).T
temp.columns=[x+'|'+net_nm for x in temp.columns]
temp.rename(columns={'name|'+net_nm:'name'}, inplace=True)
temp.loc[:,temp.columns!='name']=temp.loc[:,temp.columns!='name'].rank(pct=True)
temp['median_centrality|'+net_nm]=temp.loc[:,temp.columns!='name'].median(1)
if(net==0):
full_centralities=temp
else:
full_centralities=pd.merge(full_centralities, temp, on='name')
full_centralities.set_index('name', inplace=True)
full_centralities=pd.merge(full_centralities, data[['Type']], left_index=True, right_index=True, how='left')
full_centralities['median|ALL']=full_centralities.loc[:,full_centralities.columns.str.contains('median')].median(1)
full_centralities.sort_values('median_centrality|pos_w',ascending=False,inplace=True)
fig,ax=plt.subplots(figsize=(10,8))
sns.boxplot(data=full_centralities, x='Type', y='median_centrality|pos_w')
ax.set(xlabel='Feature type', ylabel='Median centrality rank (%)');
fig.suptitle('Overall centrality rank in pos_w (higher = more central)');
This shows that genes, in general, tend to be more central in all networks. We can get the 10 most central metabolites and genes:
full_centralities.loc[full_centralities.Type=='met',full_centralities.columns.str.contains('pos_w|ALL')].head(10)
full_centralities.loc[full_centralities.Type=='genes',full_centralities.columns.str.contains('pos_w|ALL')].head(10)
To identify communities, we will use the Leiden algorithm. This method is similar to the Louvain method but fixes some of its intrinsic problems, such as producing badly connected communities.
partition = leidenalg.find_partition(short_knn, leidenalg.ModularityVertexPartition)
layout = short_knn.layout_mds()
ig.plot(partition, layout=layout, edge_color='silver', vertex_size=short_knn.vs['degree_size'])
We will perform the community analysis on the 4 networks, plus one additional analysis that takes the edge weights of the positive association network into account. Importantly, modularity-based methods search for the largest possible communities in the network, which may not always be what we want. Alternative models such as the Constant Potts Model allow smaller communities to be identified through a resolution parameter. If we know that our data contains particular feature classes, we can check whether the communities recover those classes by examining them individually, and increase the resolution if needed.
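For reference, a minimal sketch of how the Constant Potts Model could be used here instead (the resolution value is an arbitrary illustrative choice; larger values yield smaller, denser communities):
#optional: CPM partition of the positive association network
cpm_comm = leidenalg.find_partition(pos_w, leidenalg.CPMVertexPartition, weights='w', resolution_parameter=0.05)
print(len(cpm_comm)) #number of communities found with CPM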
pos_comm = leidenalg.find_partition(pos_w, leidenalg.ModularityVertexPartition)
pos_w_comm = leidenalg.find_partition(pos_w, leidenalg.ModularityVertexPartition, weights='w')
all_comm = leidenalg.find_partition(all_u, leidenalg.ModularityVertexPartition)
knn_comm = leidenalg.find_partition(knn, leidenalg.ModularityVertexPartition)
random_comm = leidenalg.find_partition(randomG, leidenalg.ModularityVertexPartition)
Predictably, we can see that the modularity score of any of the networks is substantially larger than that of the random network.
np.round(pos_comm.modularity,3)
np.round(pos_w_comm.modularity,3)
np.round(all_comm.modularity,3)
np.round(knn_comm.modularity,3)
np.round(random_comm.modularity,3)
Comparing the different communities by size:
#Compiles feat lists per community
def get_community_table():
comm_counts=pd.DataFrame()
feat_lists=pd.DataFrame()
for i in [0,1,2,3]:
graph=[pos_w,pos_w,all_u,knn][i]
comm=[pos_comm,pos_w_comm,all_comm,knn_comm][i]
name=['pos','pos_w','all','knn'][i]
temp=pd.DataFrame(list(zip(graph.vs['name'],[x+1 for x in comm.membership]))).rename(columns={0:'feat',1:'community'})
counts=pd.DataFrame(temp.groupby('community')['feat'].agg(len))
counts.columns=[name]
comm_counts=pd.concat([comm_counts, counts], axis=1)
gl=pd.DataFrame(temp.groupby('community')['feat'].apply(list)).reset_index()
gl['community']=['c'+str(i) for i in gl['community']]
gl['network']=name
gl=gl.loc[:,['network','community','feat']]
feat_lists=pd.concat([feat_lists, gl])
comm_counts.index=['c'+str(i) for i in comm_counts.index]
return([comm_counts,feat_lists])
#Plotting community sizes
fig, ax = plt.subplots(figsize=(7, 4))
get_community_table()[0].fillna(0).T.plot(kind='bar', stacked=True, ax=ax);
ax.legend(get_community_table()[0].index, loc='right', bbox_to_anchor=(1.15, 1));
ax.set_title('Community size')
plt.xticks(rotation=0)
plt.show()
Can you explain why the KNN network has so many more communities than the others?
Functional analysis
In order to perform functional enrichment, we will extract each of the communities and perform a hypergeometric test on the genes to understand whether they are particularly enriched in specific biological functions.
We will use Enrichr (through the gseapy package) to perform the gene set enrichment analysis. As background we will use the full list of genes that were quantified.
We will look at 3 gene set libraries. Should you have other kinds of data, enrichr allows you to define your own feature sets and perform a similar analysis. The challenge is in identifying comprehensive and well curated gene sets.
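Behind each enrichment call is essentially a hypergeometric (one-sided Fisher's exact) test; a small sketch with hypothetical numbers:
#probability of drawing k or more genes from a given gene set purely by chance
from scipy.stats import hypergeom
M, n, N, k = 2000, 100, 150, 20 #background size, genes in the set, community size, overlap
print(hypergeom.sf(k-1, M, n, N)) #P(X >= 20)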
#All the available human libraries
gp.get_library_name(database='Human')[:15]
#we will search 3 libraries for significantly enriched gene sets
gene_sets=['GO_Biological_Process_2018','KEGG_2019_Human','OMIM_Disease']
background=[x for x in all_u.copy().vs['name'] if x in data.loc[data.Type=='genes'].index]
all_genes=data.loc[data.Type=='genes'].index
def perform_enrich(network):
temp=get_community_table()[1].copy()
temp=temp.loc[temp['network']==network]
output_enrichr=pd.DataFrame()
for comm in temp['community'].values:
gl=list(temp.loc[temp['community']==comm, 'feat'])[0]
gl=list([x for x in gl if x in all_genes])
if(len(gl)<30):
continue
print('Found '+str(len(gl))+' genes in community.')
for bp in gene_sets:
print('Analyzing '+network+' network | Comm: '+comm+'/'+str(len(temp.index))+' | BP: '+bp)
enr=gp.enrichr(
gene_list=gl,
gene_sets=bp,
background=background,
outdir='data/Enrichr',
format='png'
)
results=enr.results.sort_values('Adjusted P-value', ascending=True)
results=results.loc[results['Adjusted P-value']<0.05,]
results['BP']=bp
results['Comm']=comm
results['Graph']=network
output_enrichr=pd.concat([output_enrichr, results])
return(output_enrichr)
all_enriched=pd.DataFrame()
for net in ['pos', 'pos_w', 'all', 'knn']:
all_enriched=pd.concat([all_enriched,perform_enrich(net)])
Running the command above not only gives you the results after significance testing (Q<0.05), but it also outputs some preliminary barplots with the statistically significant results (found under data/Enrichr/). For instance:
enriched_terms=all_enriched.loc[:,['Graph','Comm','Term','Adjusted P-value']].copy()
enriched_terms['Adjusted P-value']=-1*np.log10(enriched_terms['Adjusted P-value'])
enriched_terms.head()
fig, ax = plt.subplots(figsize=(7, 4))
data_bars=pd.DataFrame(enriched_terms.groupby(['Graph','Comm'])['Term'].agg('count')).stack().reset_index().rename(columns={0:'Count'})
sns.barplot(x='Graph', y='Count', data=data_bars, hue='Comm')
ax.set_title('Number of significant Terms (Q < 0.05) per community')
ax.legend(loc='right', bbox_to_anchor=(1.15, 1));
plt.xticks(rotation=0)
plt.show()
Note that some of these communities are very large, which explains the large number of enriched biological processes found above.
###Number of genes/community
# We skipped communities with <30 genes
get_community_table()[0].fillna(0).T
pd.DataFrame(enriched_terms.groupby(['Graph','Comm'])['Term'].agg('count'))
We can now check whether the Full network, the Positive association network, and the weighted Positive association network share any common terms among their biggest communities. We do not compare with the kNN-G, as its communities are very homogeneous in size and quite different from those of the other networks.
#Finding consensus
temp=enriched_terms.copy()
temp['comm_term']=temp.Comm+'_'+temp.Term
temp=temp.loc[:,['Graph','comm_term']]
consensus=pd.DataFrame()
consensus=pd.concat([consensus, temp.loc[temp['Graph']=='pos']])
consensus=pd.merge(consensus,
temp.loc[temp['Graph']=='pos_w'], on="comm_term", how='outer', suffixes=['pos','pos_w'])
consensus=pd.merge(consensus,
temp.loc[temp['Graph']=='all'], on="comm_term", how='outer').rename(columns={'Graph':'all'})
consensus=consensus.loc[consensus.isna().sum(1)==0].loc[:,['comm_term','Graphpos','Graphpos_w','all']]
Among the biggest communities we find several biological processes (55) that are simultaneously identified in the same community in the three graphs (Full, Pos assoc, and Pos assoc weighted).
consensus['comm']=[x[0] for x in consensus.comm_term.str.split('_')]
consensus.groupby('comm')['comm_term'].agg('count')
Questions
If you want to extract the communities to use them in other sections, you can run the following.
#Requires running the community detection from the previous section
patlas=pd.read_csv('data/proteinatlas.tsv', sep="\t").loc[:,['Ensembl','Gene']]
#Compiles gene lists per community. We need Ensembl ids for further analyses
def get_ensembl():
comm_counts=pd.DataFrame()
gene_lists=pd.DataFrame()
for i in [0,1,2,3]:
graph=[pos_w,pos_w,all_u,knn][i]
comm=[pos_comm,pos_w_comm,all_comm,knn_comm][i]
name=['pos','pos_w','all','knn'][i]
temp=pd.DataFrame(list(zip(graph.vs['name'],[x+1 for x in comm.membership]))).rename(columns={0:'gene',1:'community'})
gl=pd.DataFrame(temp.groupby('community')['gene'].apply(list)).reset_index()
gl['community']=['c'+str(i) for i in gl['community']]
gl['network']=name
gl=gl.loc[:,['network','community','gene']]
gene_lists=pd.concat([gene_lists, gl])
gene_communities=gene_lists
gene_mat=pd.DataFrame()
for net in gene_communities['network'].unique():
temp=gene_communities.copy().loc[gene_communities['network']==net,]
for comm in temp['community'].unique():
gl=list(temp.copy().loc[temp['community']==comm,'gene'])[0]
el=[patlas.loc[patlas['Gene']==x,'Ensembl'].iloc[0] for x in gl if x in patlas['Gene'].values]
df=pd.DataFrame([net,comm,gl,el]).T
df.columns=['network','community','Gene','Ensembl']
gene_mat=pd.concat([gene_mat, df])
return(gene_mat)
get_ensembl().to_csv('data/gene_communities.tsv', sep="\t", index=False)
Here we've performed some network analyses based on a met-gene association network. We've explored different centrality measures to characterize the networks, and identified the communities of genes in these networks. We have also used gene set enrichment analysis to characterize these communities based on the genes, but it remains to show whether similar results would be attained if we considered the metabolites in each community.