import sklearn, leidenalg, itertools, random
import pandas as pd
import numpy as np
import igraph as ig
import sklearn.neighbors
import scipy as sp
import scipy.stats #ensures sp.stats is available below
from statsmodels.stats.multitest import multipletests
import gseapy as gp
# Plotting
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Objective
In this notebook you will learn how to build and analyse a network derived from a gene-metabolite association analysis. Other mixed networks can be analysed in a similar way, differing only in whether and how you can apply the final functional analysis.
Data
Download the data folder. As a test case we will be using the file data/met_genes.tsv, which contains abundances for 125 metabolites and 1992 genes across 24 samples.
Software
This notebook relies on python's igraph for most of the analyses. Most, if not all, functions used here can also be applied with R's igraph. Other packages exist for network analysis including networkx and graph-tool. Snap.py is also a good alternative for large networks.
An introduction to network analysis in igraph in R can be found here.
Instructions
Further instructions and sources to use to generate networks can be found here.
You will use a gene expression dataset (RNAseq, expressed as TPM) from a disease group with 24 samples, to keep memory and time requirements reasonable for this lesson. We assume that batch effects and other technical artifacts are absent, and that the data is ready for analysis. However, there are several important considerations in preprocessing your data:
How you preprocess will depend on the type of data you have and what you want to do with it, and it will strongly affect downstream results. It is thus important to think carefully about this step.
data=pd.read_csv('data/met_genes.tsv', sep="\t", index_col=0)
data.head()
No duplicated features are present.
any(data.index.duplicated())
data.groupby('Type').agg('count')[['p10']] #1992 genes, 125 metabolites
data.shape #2117 features, 25 columns (24 samples + the Type column)
A very quick view shows that several gene clusters are found, including two major groups. However, the analysis below does not perform any statistical filtering.
Our initial network will be built from an association analysis using Spearman correlations. Because such an analysis has a high chance of producing false positives, we will apply a Bonferroni correction to control the family-wise error rate, as well as an FDR correction.
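As a minimal illustration of how the two corrections behave (on a handful of hypothetical P values, not our data):
#toy example: Bonferroni multiplies each P by the number of tests, Benjamini-Hochberg rescales by rank
toy_p=[0.001, 0.01, 0.02, 0.04, 0.2]
print(multipletests(toy_p, method='bonferroni', alpha=0.01)[1]) #adjusted P values, most conservative
print(multipletests(toy_p, method='fdr_bh', alpha=0.01)[1]) #less conservative FDR adjustment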
values=data.loc[:,data.columns!='Type']
meta=data[['Type']]
values.isna().any().any() #we have no rows with NA
We will perform a gene-gene, gene-metabolite, and metabolite-metabolite association analysis by computing pairwise Spearman correlations. Choosing between non-parametric (Kendall or Spearman) and parametric (Pearson) methods depends on your data. Here, because we have a small sample size and want to save computational time, we choose Spearman.
The following takes a few minutes to run on a normal laptop, so you may instead load the resulting correlation table.
#Correlation and P val matrices
Rmatrix, Pmatrix= sp.stats.spearmanr(values.T)
Rmatrix=pd.DataFrame(Rmatrix, index=values.index.copy(), columns=values.index.copy())
sns.clustermap(Rmatrix, cmap="RdBu_r", center=0); #resulting R matrix, similar to that seen in the section above
If we look at the matrix of P values, we can already see that many of the correlations in the top right columns are not significant even before multiple hypothesis correction:
Pmatrix=pd.DataFrame(Pmatrix, index=values.index.copy(), columns=values.index.copy())
changed_Pmatrix=Pmatrix.copy()
changed_Pmatrix[changed_Pmatrix>0.01]=1
sns.clustermap(changed_Pmatrix); #resulting P matrix, similar to that seen in the section above
((2117**2)-2117)/2
We will now adjust the P values based on the number of comparisons made. The heatmaps above show a total of $2117^2 \approx 4.5m$ correlations. However, this counts each correlation twice (gene A vs gene B, and gene B vs gene A) and includes the correlation of each feature with itself; we are only interested in the unique pairs.
This leaves $\binom{2117}{2} = \frac{2117!}{2!(2117-2)!} \approx 2.2m$ correlations. At an error rate of 0.05, the probability of finding at least one false positive is essentially 100%: $1-0.95^{2.2 \times 10^{6}} \approx 1$. We thus need to correct the P values. We will apply a Bonferroni correction as well as an FDR correction.
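As a quick check of these numbers (a side calculation, not needed for the analysis itself):
#number of unique feature pairs, and the probability of at least one false positive at alpha=0.05
from scipy.special import comb
n_pairs=comb(2117, 2, exact=True)
print(n_pairs) #2239786 unique pairs
print(1-0.95**n_pairs) #effectively 1.0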
###Prepares R and P matrices
Pmatrix=Pmatrix.where(np.triu(np.ones(Pmatrix.shape),1).astype(bool)) #keep each pair only once (upper triangle)
Pmatrix.columns.name='Feat2'
Pmatrix=Pmatrix.stack()
Pmatrix.index.names=['v1','v2'] #Avoid stacked names colliding
Pmatrix=Pmatrix.reset_index()
Pmatrix.columns=['feat1','feat2','P']
Rmatrix=Rmatrix.where(np.triu(np.ones(Rmatrix.shape),1).astype(bool))
Rmatrix.columns.name='Feat2'
Rmatrix=Rmatrix.stack()
Rmatrix.index.names=['v1','v2'] #Avoid stacked names colliding
Rmatrix=Rmatrix.reset_index()
Rmatrix.columns=['feat1','feat2','R']
PRmatrix=pd.merge(Rmatrix.copy(), Pmatrix.copy(), on=['feat1','feat2']) #Correlation matrix with both R and P
PRmatrix=PRmatrix.loc[PRmatrix.feat1!=PRmatrix.feat2].dropna()
#Multiple hypothesis correction computed on the P column
adjP=pd.DataFrame(multipletests(PRmatrix['P'], method='bonferroni', alpha=0.01)[1], columns=['Padj'])
FDR=pd.DataFrame(multipletests(PRmatrix['P'], method='fdr_bh', alpha=0.01)[1], columns=['FDR'])
PRmatrix=pd.concat([ PRmatrix, adjP], axis=1)
PRmatrix=pd.concat([ PRmatrix, FDR], axis=1)
PRmatrix.head()
PRmatrix.shape[0]
#total number of correlations w/o repetition: 2.2m
Considering the Bonferroni correction we find 16305 correlations that are statistically significant at adjusted P < 0.01. If we instead use FDR as the correction method, we find 402368, which at an FDR of 0.01 implies roughly $0.01 \times 402368 \approx 4024$ expected false positives.
sum(PRmatrix.Padj<0.01)
sum(PRmatrix.FDR<0.01)
Let's add two additional columns, where we assign R=0 for those correlations that are not statistically significant (adjusted P or FDR > 0.01).
PRmatrix.loc[:,'R (padj)']=PRmatrix['R'].copy()
PRmatrix.loc[:,'R (fdr)']=PRmatrix['R'].copy()
PRmatrix.loc[PRmatrix['Padj']>0.01,'R (padj)']=0
PRmatrix.loc[PRmatrix['FDR']>0.01,'R (fdr)']=0
all_mets=meta.loc[meta.Type=='met'].index
PRmatrix['feat1_type']=['met' if x in all_mets else 'gene' for x in PRmatrix.feat1 ]
PRmatrix['feat2_type']=['met' if x in all_mets else 'gene' for x in PRmatrix.feat2 ]
PRmatrix['int_type']=PRmatrix.feat1_type+'_'+PRmatrix.feat2_type
Now we can see how the initial heatmap of correlations looks.
#Transforming to a squared matrix again
PRQ=pd.concat([
PRmatrix.copy(),
PRmatrix.copy().rename(columns={'feat1':'feat2','feat2':'feat1'}).loc[:,PRmatrix.columns]
]).drop_duplicates()
Rmatrix_fdr=PRQ.copy().pivot(index='feat1',columns='feat2',values='R (fdr)')
Rmatrix_fdr=Rmatrix_fdr.loc[Rmatrix_fdr.sum()!=0]
Rmatrix_padj=PRQ.copy().pivot(index='feat1',columns='feat2',values='R (padj)')
Rmatrix_fdr=Rmatrix_fdr.loc[Rmatrix_fdr.index,Rmatrix_fdr.index].fillna(0)
Rmatrix_padj=Rmatrix_padj.loc[Rmatrix_fdr.index,Rmatrix_fdr.index].fillna(0)
#Showing only the top correlated features (those associated with >60% of the other features)
top_mask=(Rmatrix_fdr!=0).sum()>0.6*Rmatrix_fdr.shape[0]
top_features=top_mask.index[top_mask]
Rmatrix_fdr=Rmatrix_fdr.loc[top_features,top_features]
Rmatrix_padj=Rmatrix_padj.loc[top_features,top_features]
g=sns.clustermap(Rmatrix_padj, cmap="RdBu_r", center=0);
g.fig.suptitle('R (filtered with Bonferroni)');
plt.show()
g=sns.clustermap(Rmatrix_fdr, cmap="RdBu_r", center=0);
g.fig.suptitle('R (filtered with FDR)');
plt.show()
The plots above show that the Bonferroni correction is only selecting very high (absolute) correlations. This may remove false positives, but it may also remove slightly weaker correlations that are biologically relevant. The Bonferroni correction also removes most of the negatively-associated features. Notice this from the distribution of correlation coefficients:
shortPR=PRmatrix.copy().loc[:,['feat1','feat2','R (padj)','R (fdr)']]
shortPR=shortPR.loc[shortPR.feat1!=shortPR.feat2]
fig=plt.figure(figsize=(8,4))
p=sns.distplot(shortPR['R (padj)'][shortPR['R (padj)']!=0], color='black');
p.set(ylabel='PDF (Bonferroni)')
ax2=p.twinx()
g=sns.distplot(shortPR['R (fdr)'][shortPR['R (fdr)']!=0], ax=ax2, color='r');
g.set(ylabel='PDF (FDR)')
fig.legend(labels=['bonferroni (<0.01)','fdr (<0.01)']);
plt.xlabel('R')
plt.title('Distribution of correlation coefficients')
plt.show()
We can also observe that the Bonferroni correction yields a more homogeneous number of associated features for each feature (i.e. first neighbors), compared to the FDR filtering. Notice how for a subset of features many correlations are identified (>200).
feat_associations_padj=pd.concat([
shortPR.copy().loc[shortPR['R (padj)']!=0,][['feat1','feat2']],
shortPR.copy().loc[shortPR['R (padj)']!=0,][['feat2','feat1']].rename(columns={'feat1':'feat2','feat2':'feat1'})]).drop_duplicates().groupby('feat1').agg('count')
feat_associations_fdr=pd.concat([
shortPR.copy().loc[shortPR['R (fdr)']!=0,][['feat1','feat2']],
shortPR.copy().loc[shortPR['R (fdr)']!=0,][['feat2','feat1']].rename(columns={'feat1':'feat2','feat2':'feat1'})]).drop_duplicates().groupby('feat1').agg('count')
feat_associations=pd.concat([feat_associations_padj, feat_associations_fdr], axis=1, sort=True)
feat_associations.fillna(0,inplace=True)
feat_associations.columns=['bonferroni','fdr']
fig=plt.figure(figsize=(8,4))
sns.boxplot( data=feat_associations, notch=True);
plt.title('Number of associations per feature')
plt.show()
Most of the statistically significant correlations are only identified within each omic.
bonferroni_significant=PRmatrix.copy().loc[PRmatrix.Padj<0.01].loc[:,['feat1','feat2','R','int_type']]
bonferroni_significant['sig_test']='bonferroni'
fdr_significant=PRmatrix.copy().loc[PRmatrix.FDR<0.01].loc[:,['feat1','feat2','R','int_type']]
fdr_significant['sig_test']='FDR'
pd.concat([bonferroni_significant,fdr_significant]).groupby(['sig_test','int_type'])['R'].agg('count').reset_index()
In building a graph from an association analysis:
#saves the dataframe
PRmatrix.to_csv('data/association_matrix.tsv', sep="\t", index=False)
We will now build 2 different networks to analyse further:
# Prepares table for being read by igraph
PRmatrix=pd.read_csv('data/association_matrix.tsv', sep="\t")
PRmatrix.loc[PRmatrix['FDR']>0.01,'R (fdr)']=0
PRmatrix=PRmatrix.loc[PRmatrix['R (fdr)']!=0,['feat1','feat2','R (fdr)']]
PRmatrix=PRmatrix.loc[PRmatrix.feat1!=PRmatrix.feat2] #drops self correlations
fdr_pos_mat=PRmatrix.loc[PRmatrix['R (fdr)']>0]
fdr_neg_mat=PRmatrix.loc[PRmatrix['R (fdr)']<0]
PRmatrix=PRmatrix.loc[PRmatrix.isin(pd.unique(fdr_pos_mat[['feat1','feat2']].values.flatten())).sum(1)==2,] #selects only nodes also found in the positive network so that we can compare networks of the same sizes
We will now build the k-NNG, using distances as input to determine the nearest neighbours. Because this data contains both gene expression and metabolite quantifications, we need to normalize the features beforehand (we did not need to do this above because Spearman correlation compares ranks).
We'll normalize all features so that they range between 0 and 1.
#Imports and normalizes met and gene data so that we can compute similarities between them
from sklearn.preprocessing import MinMaxScaler
data=pd.read_csv('data/met_genes.tsv', sep="\t", index_col=0)
scaled_data=pd.DataFrame(MinMaxScaler().fit_transform(data.loc[:,data.columns!='Type'].T).T, columns=data.columns[data.columns!='Type'], index=data.index)
scaled_data_values=scaled_data.copy()
scaled_data['Type']=data.Type
#Plots the data distribution
fig,ax=plt.subplots(figsize=(12,8))
sns.kdeplot(
scaled_data.loc[scaled_data.Type=='met',scaled_data.columns!='Type'].values.flatten(),
color='r', label='Mets', ax=ax, legend=False)
ax2=ax.twinx()
sns.kdeplot(
scaled_data.loc[scaled_data.Type=='genes',scaled_data.columns!='Type'].values.flatten(),
color='b', label='Genes', ax=ax2, legend=False)
fig.suptitle('Distribution for met and genes');
fig.legend();
We will now generate the graphs from the dataframes above
### Generating the kNN graph
#Computes a kNN adjacency matrix from the input dataset
#and prepares table for being read by igraph
input_ds=scaled_data_values.loc[scaled_data_values.index.isin(pd.unique(fdr_pos_mat[['feat1','feat2']].values.flatten()))]
knnG=sklearn.neighbors.kneighbors_graph(input_ds.values, 50, metric='euclidean')
knnG=pd.DataFrame(knnG.toarray(), columns=input_ds.index.copy(), index=input_ds.index.copy()) #adjacency matrix
knnG.index.name='gene1'
knnG.columns.name='gene2'
knnG=knnG.stack().reset_index().rename(columns={0:'Connectivity'})
knnG=knnG.loc[knnG['Connectivity']!=0]
### Generates each of the graphs
#positive associations, weighted
pos_w=ig.Graph.TupleList([tuple(x) for x in fdr_pos_mat.values], directed=False, edge_attrs=['w'])
#full network, unweighted
all_u=ig.Graph.TupleList([tuple(x) for x in PRmatrix.loc[PRmatrix.isin(pd.unique(fdr_pos_mat[['feat1','feat2']].values.flatten())).sum(1)==2,['feat1','feat2']].values], directed=False)
#knnG, unweighted
knn=ig.Graph.TupleList([tuple(x) for x in knnG.values], directed=False)
#random network, unweighted, with node and edge numbers matched to the positive association network
randomG=ig.Graph.Erdos_Renyi(
n=pos_w.vcount(),
m=pos_w.ecount(), directed=False, loops=False)
For representation purposes we will look at a small subset of the kNN graph. Be careful when drawing the other graphs: they have many more edges, so plotting becomes computationally heavy.
#random subset of knn graph for plotting
short_knn=ig.Graph.TupleList([tuple(x) for x in knnG.values[random.sample(list(np.arange(len(knnG.values))), 10000)]], directed=False)
#This plots each graph, using degree to present node size:
short_knn.vs['degree']=short_knn.degree()
short_knn.vs['degree_size']=[(x*15)/(max(short_knn.vs['degree'])) for x in short_knn.vs['degree']] #degree is rescaled so that the largest node has size 15 and all nodes remain visible
layout = short_knn.layout_mds()
ig.plot(short_knn, layout=layout, vertex_color='white', edge_color='silver', vertex_size=short_knn.vs['degree_size'])
In the next table, we can see that while all networks have the same number of nodes, the number of edges varies greatly. We also see that each network is fully connected, which is not always the case. If a network is not connected, we can select the k largest connected components and proceed with the analysis on those. The largest connected component is called the giant component.
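As a minimal sketch (not needed here, since our graphs are connected), this is how the giant component could be extracted with igraph:
#keep only the largest connected component of a graph
components=knn.clusters() #connected components of the kNN graph
giant=components.giant() #subgraph induced by the largest component
print(len(components), giant.vcount(), knn.vcount()) #number of components, giant size, total nodes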
#function to get graph properties, takes a few minutes to run
def graph_prop(input_graph):
ncount=input_graph.vcount()
ecount=input_graph.ecount()
diameter=input_graph.diameter()
av_path=input_graph.average_path_length()
dens=input_graph.density()
clustering=input_graph.transitivity_undirected() #this is the global clustering coefficient
conn=input_graph.is_connected()
min_cut=input_graph.mincut_value()
out=pd.DataFrame([ncount, ecount, diameter, av_path, dens, clustering, conn, min_cut],
index=['node_count','edge_count','diameter','av_path_length','density','clustering_coef','connected?','minimum_cut']).T
return(out)
#summarizing graph properties
network_stats=pd.DataFrame()
for nn in [pos_w, all_u, knn, randomG]:
network_stats=pd.concat([network_stats,graph_prop(nn)])
network_stats.index=['pos_w','all_u','knn','random']
network_stats
We'll look into different centrality measures:
Degree, Betweenness and Closeness are additionally computed for the positive association network by taking each edge's weight into account. For instance, the weighted degree (strength) of a node is the sum of the weights of its edges (see the small example below).
Because the number of shortest paths in a network scales with the network size, Betweenness is rescaled by the network size (and Closeness is normalized) so that these measures can be compared between the four networks above.
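As a toy illustration of the difference between degree and weighted degree (a hypothetical two-edge graph, not part of our data):
#degree counts edges, strength sums their weights
toy=ig.Graph.TupleList([('a','b',0.9),('b','c',0.2)], directed=False, edge_attrs=['w'])
print(toy.degree()) #[1, 2, 1]
print(toy.strength(weights='w')) #[0.9, 1.1, 0.2]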
Degree distribution
Let's start by comparing the degrees of the random network against the three other networks. From the figures below it seems that there is no relationship between the degree of the random network, and any of the three others.
fig, axes=plt.subplots(nrows=1, ncols=3, figsize=(15,5), sharey='row')
for i, ax in zip(range(3), axes.flat):
sns.regplot(x=[pos_w,all_u,knn][i].degree(), y=randomG.degree(), ax=ax);
ax.set_title(['pos','all','knn'][i])
ax.set(xlabel='k ('+['pos','all','knn'][i]+')', ylabel='k (random)')
def transform_degree(graph):
alldegs=graph.degree()
alldegs=pd.DataFrame([[key,len(list(group))] for key,group in itertools.groupby(sorted(alldegs))], columns=['k','count']) #sort first so that each degree value is counted once
alldegs['P(k)']=[x/alldegs['count'].sum() for x in alldegs['count']]
alldegs=alldegs.loc[:,['k','P(k)']]
alldegs.drop_duplicates(inplace=True)
alldegs.reset_index(drop=True, inplace=True)
return(alldegs)
fig, ax = plt.subplots(figsize=(7, 7))
# ax.set(yscale='log', xscale='log')
p=sp.stats.probplot(pos_w.degree(), plot=ax)
a=sp.stats.probplot(all_u.degree(), plot=ax)
k=sp.stats.probplot(knn.degree(), plot=ax)
r=sp.stats.probplot(randomG.degree(), plot=ax)
col=['blue','','lightsalmon','','green','','red']
for x in np.arange(0,7,2):
ax.get_lines()[x].set_markerfacecolor(col[x])
ax.get_lines()[x].set_markeredgewidth(0)
ax.get_lines()[x+1].set_color(col[x])
fig.legend(labels=['pos','pos','all','all','knn','knn','random','random']);
ax.set(xlabel='Data quantiles', ylabel='observed degree (k)')
plt.show()
Centrality
We will now compare different centrality metrics between the graphs.
### This function computes different centrality measures
def combine_raw_centralities():
def centrality_raw(input_graph, graph_name):
deg=input_graph.degree()
ecc=input_graph.eccentricity()
node_n=input_graph.vcount()
btw=[(2*x / ((node_n-1)*(node_n-2))) for x in input_graph.betweenness(directed=False)] #scaled to account for network size
cls=input_graph.closeness(normalized=True)
#if edge weight is defined, compute weighted centralities
#otherwise, return a vector of 0
if('w' in input_graph.es.attribute_names()):
deg_w=input_graph.strength(weights='w') #this is a weighted degree
btw_w=[(2*x / ((node_n-1)*(node_n-2))) for x in input_graph.betweenness(directed=False, weights='w')]
cls_w=input_graph.closeness(weights='w', normalized=True)
else:
btw_w=np.repeat(0,input_graph.vcount())
deg_w=np.repeat(0,input_graph.vcount())
cls_w=np.repeat(0,input_graph.vcount())
out=pd.DataFrame(
[deg, ecc, btw, cls, deg_w, btw_w, cls_w],
index=['degree','eccentricity','betweeness', 'closeness',
'deg_w','btw_w','cls_w']).T
out['graph']=graph_name
out=out.loc[:,np.append(['graph'],out.columns[out.columns!='graph'])]
##Adds centralities for each node in the network
input_graph.vs['degree']=out.degree
input_graph.vs['eccentricity']=out.eccentricity
input_graph.vs['betweeness']=out.betweeness
input_graph.vs['closeness']=out.closeness
input_graph.vs['deg_w']=out.deg_w
input_graph.vs['btw_w']=out.btw_w
input_graph.vs['cls_w']=out.cls_w
return(out)
#Computes centralities for all networks
network_centralities_raw=pd.DataFrame()
network_centralities_raw=pd.concat([network_centralities_raw,centrality_raw(pos_w,'pos_w')])
network_centralities_raw=pd.concat([network_centralities_raw,centrality_raw(all_u,'all_u')])
network_centralities_raw=pd.concat([network_centralities_raw,centrality_raw(knn,'knn')])
network_centralities_raw=pd.concat([network_centralities_raw,centrality_raw(randomG,'random')])
return(network_centralities_raw)
network_centralities_raw=combine_raw_centralities()
fig, axes=plt.subplots(nrows=4, ncols=2, figsize=(16,20), sharey='row')
cen=['degree','deg_w','betweeness', 'btw_w', 'closeness','cls_w','eccentricity']
for i, ax in zip(range(10), axes.flat):
if(i==7):
fig.delaxes(ax)
break
sns.boxplot(data=network_centralities_raw, x='graph', y=cen[i], #outliers are hidden from plotting
notch=True, ax=ax, showfliers=False)
ax.set_title(cen[i])
ax.set(xlabel='')
Overall, we see:
Degree: Full network > Positive assoc. network > kNN-G.
Betweenness: kNN-G > Full network ~ Positive assoc. network.
Closeness: kNN-G > Full network > Positive assoc. network.
Eccentricity: similar for the Positive assoc. network and the Full network, but nodes in the latter tend to display higher eccentricity. In turn, most nodes in the kNN-G tend to display an eccentricity of 5-6.
We then examine genes and metabolites for each graph:
full_centralities=pd.DataFrame()
for net in [0,1,2]:
net_in=[pos_w, all_u, knn, randomG][net]
net_nm=['pos_w', 'all_u', 'knn'][net]
temp=pd.DataFrame([net_in.vs[att] for att in ['name','degree','betweeness', 'closeness','eccentricity']], index=['name','degree','betweeness', 'closeness','eccentricity']).T
temp.columns=[x+'|'+net_nm for x in temp.columns]
temp.rename(columns={'name|'+net_nm:'name'}, inplace=True)
temp.loc[:,temp.columns!='name']=temp.loc[:,temp.columns!='name'].rank(pct=True)
temp['median_centrality|'+net_nm]=temp.loc[:,temp.columns!='name'].median(1)
if(net==0):
full_centralities=temp
else:
full_centralities=pd.merge(full_centralities, temp, on='name')
full_centralities.set_index('name', inplace=True)
full_centralities=pd.merge(full_centralities, data[['Type']], left_index=True, right_index=True, how='left')
full_centralities['median|ALL']=full_centralities.loc[:,full_centralities.columns.str.contains('median')].median(1)
full_centralities.sort_values('median_centrality|pos_w',ascending=False,inplace=True)
fig,ax=plt.subplots(figsize=(10,8))
sns.boxplot(data=full_centralities, x='Type', y='median_centrality|pos_w')
ax.set(xlabel='Feature type', ylabel='Median centrality rank (%)');
fig.suptitle('Overall centrality rank in pos_w (higher = more central)');
This shows that genes, in general, tend to be more central in all networks. We can get the 10 most central metabolites and genes:
full_centralities.loc[full_centralities.Type=='met',full_centralities.columns.str.contains('pos_w|ALL')].head(10)
full_centralities.loc[full_centralities.Type=='genes',full_centralities.columns.str.contains('pos_w|ALL')].head(10)
To identify communities, we will use the Leiden algorithm. This method is similar to the Louvain method but fixes some of its intrinsic problems, such as producing badly connected communities.
partition = leidenalg.find_partition(short_knn, leidenalg.ModularityVertexPartition)
layout = short_knn.layout_mds()
ig.plot(partition, layout=layout, edge_color='silver', vertex_size=short_knn.vs['degree_size'])
We will perform the community analysis on the 4 networks, plus one additional analysis that takes the edge weights of the positive association network into account. Importantly, modularity-based methods search for the largest possible communities in the network, which may not always be what we want. Alternative models such as the Constant Potts Model allow smaller communities to be identified through a resolution parameter. If we know that our data contains particular feature classes, we can check whether the communities recover those classes by examining them individually, and increase the resolution if needed.
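For reference, a minimal sketch of how the Constant Potts Model could be used here instead (the resolution value is an arbitrary illustrative choice; larger values yield smaller, denser communities):
#optional: CPM partition of the positive association network
cpm_comm = leidenalg.find_partition(pos_w, leidenalg.CPMVertexPartition, weights='w', resolution_parameter=0.05)
print(len(cpm_comm)) #number of communities found with CPM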
pos_comm = leidenalg.find_partition(pos_w, leidenalg.ModularityVertexPartition)
pos_w_comm = leidenalg.find_partition(pos_w, leidenalg.ModularityVertexPartition, weights='w')
all_comm = leidenalg.find_partition(all_u, leidenalg.ModularityVertexPartition)
knn_comm = leidenalg.find_partition(knn, leidenalg.ModularityVertexPartition)
random_comm = leidenalg.find_partition(randomG, leidenalg.ModularityVertexPartition)
Predictably, we can see that the modularity score of any of the networks is substantially larger than that of the random network.
np.round(pos_comm.modularity,3)
np.round(pos_w_comm.modularity,3)
np.round(all_comm.modularity,3)
np.round(knn_comm.modularity,3)
np.round(random_comm.modularity,3)
Comparing the different communities by size:
#Compiles feat lists per community
def get_community_table():
comm_counts=pd.DataFrame()
feat_lists=pd.DataFrame()
for i in [0,1,2,3]:
graph=[pos_w,pos_w,all_u,knn][i]
comm=[pos_comm,pos_w_comm,all_comm,knn_comm][i]
name=['pos','pos_w','all','knn'][i]
temp=pd.DataFrame(list(zip(graph.vs['name'],[x+1 for x in comm.membership]))).rename(columns={0:'feat',1:'community'})
counts=pd.DataFrame(temp.groupby('community')['feat'].agg(len))
counts.columns=[name]
comm_counts=pd.concat([comm_counts, counts], axis=1)
gl=pd.DataFrame(temp.groupby('community')['feat'].apply(list)).reset_index()
gl['community']=['c'+str(i) for i in gl['community']]
gl['network']=name
gl=gl.loc[:,['network','community','feat']]
feat_lists=pd.concat([feat_lists, gl])
comm_counts.index=['c'+str(i) for i in comm_counts.index]
return([comm_counts,feat_lists])
#Plotting community sizes
fig, ax = plt.subplots(figsize=(7, 4))
get_community_table()[0].fillna(0).T.plot(kind='bar', stacked=True, ax=ax);
ax.legend(get_community_table()[0].index, loc='right', bbox_to_anchor=(1.15, 1));
ax.set_title('Community size')
plt.xticks(rotation=0)
plt.show()
Can you explain why the KNN network has so many more communities than the others?
Functional analysis
In order to perform functional enrichment, we will extract each of the communities and perform a hypergeometric test on the genes to understand whether they are particularly enriched in specific biological functions.
We will use Enrichr (through the gseapy package) to perform the gene set enrichment analysis. As background we will use the full list of genes that were quantified.
We will look at 3 gene set libraries. Should you have other kinds of data, enrichr allows you to define your own feature sets and perform a similar analysis. The challenge is in identifying comprehensive and well curated gene sets.
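Behind each enrichment call is essentially a hypergeometric (one-sided Fisher's exact) test; a small sketch with hypothetical numbers:
#probability of drawing k or more genes from a given gene set purely by chance
from scipy.stats import hypergeom
M, n, N, k = 2000, 100, 150, 20 #background size, genes in the set, community size, overlap
print(hypergeom.sf(k-1, M, n, N)) #P(X >= 20)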
#All the available human libraries
gp.get_library_name(database='Human')[:15]
#we will search 3 libraries for significantly enriched gene sets
gene_sets=['GO_Biological_Process_2018','KEGG_2019_Human','OMIM_Disease']
background=[x for x in all_u.copy().vs['name'] if x in data.loc[data.Type=='genes'].index]
all_genes=data.loc[data.Type=='genes'].index
def perform_enrich(network):
temp=get_community_table()[1].copy()
temp=temp.loc[temp['network']==network]
output_enrichr=pd.DataFrame()
for comm in temp['community'].values:
gl=list(temp.loc[temp['community']==comm, 'feat'])[0]
gl=list([x for x in gl if x in all_genes])
if(len(gl)<30):
continue
print('Found '+str(len(gl))+' genes in community.')
for bp in gene_sets:
print('Analyzing '+network+' network | Comm: '+comm+'/'+str(len(temp.index))+' | BP: '+bp)
enr=gp.enrichr(
gene_list=gl,
gene_sets=bp,
background=background,
outdir='data/Enrichr',
format='png'
)
results=enr.results.sort_values('Adjusted P-value', ascending=True)
results=results.loc[results['Adjusted P-value']<0.05,]
results['BP']=bp
results['Comm']=comm
results['Graph']=network
output_enrichr=pd.concat([output_enrichr, results])
return(output_enrichr)
all_enriched=pd.DataFrame()
for net in ['pos', 'pos_w', 'all', 'knn']:
all_enriched=pd.concat([all_enriched,perform_enrich(net)])
Running the command above not only gives you the results after significance testing (Q<0.05), but it also outputs some preliminary barplots with the statistically significant results (found under data/Enrichr/). For instance:
enriched_terms=all_enriched.loc[:,['Graph','Comm','Term','Adjusted P-value']].copy()
enriched_terms['Adjusted P-value']=-1*np.log10(enriched_terms['Adjusted P-value'])
enriched_terms.head()
fig, ax = plt.subplots(figsize=(7, 4))
data_bars=pd.DataFrame(enriched_terms.groupby(['Graph','Comm'])['Term'].agg('count')).stack().reset_index().rename(columns={0:'Count'})
sns.barplot(x='Graph', y='Count', data=data_bars, hue='Comm')
ax.set_title('Number of significant Terms (Q < 0.05) per community')
ax.legend(loc='right', bbox_to_anchor=(1.15, 1));
plt.xticks(rotation=0)
plt.show()
Note that some of these communities are very large, which explains the large number of enriched biological processes found above.
###Number of genes/community
# We skipped communities with <30 genes
get_community_table()[0].fillna(0).T
pd.DataFrame(enriched_terms.groupby(['Graph','Comm'])['Term'].agg('count'))
We can now check whether the Full network, the Positive association network, and the weighted Positive association network share any common terms among their biggest communities. We do not compare with the kNN-G, as its communities are very homogeneous in size and quite different from those of the other networks.
#Finding consensus
temp=enriched_terms.copy()
temp['comm_term']=temp.Comm+'_'+temp.Term
temp=temp.loc[:,['Graph','comm_term']]
consensus=pd.DataFrame()
consensus=pd.concat([consensus, temp.loc[temp['Graph']=='pos']])
consensus=pd.merge(consensus,
temp.loc[temp['Graph']=='pos_w'], on="comm_term", how='outer', suffixes=['pos','pos_w'])
consensus=pd.merge(consensus,
temp.loc[temp['Graph']=='all'], on="comm_term", how='outer').rename(columns={'Graph':'all'})
consensus=consensus.loc[consensus.isna().sum(1)==0].loc[:,['comm_term','Graphpos','Graphpos_w','all']]
Among the biggest communities we find several biological processes (55) that are simultaneously identified in the same community in the three graphs (Full, Pos assoc, and Pos assoc weighted).
consensus['comm']=[x[0] for x in consensus.comm_term.str.split('_')]
consensus.groupby('comm')['comm_term'].agg('count')
Questions
If you want to extract the communities to use them in other sections, you can run the following.
#Requires running the community detection from the previous section
patlas=pd.read_csv('data/proteinatlas.tsv', sep="\t").loc[:,['Ensembl','Gene']]
#Compiles gene lists per community. We need Ensembl ids for further analyses
def get_ensembl():
comm_counts=pd.DataFrame()
gene_lists=pd.DataFrame()
for i in [0,1,2,3]:
graph=[pos_w,pos_w,all_u,knn][i]
comm=[pos_comm,pos_w_comm,all_comm,knn_comm][i]
name=['pos','pos_w','all','knn'][i]
temp=pd.DataFrame(list(zip(graph.vs['name'],[x+1 for x in comm.membership]))).rename(columns={0:'gene',1:'community'})
gl=pd.DataFrame(temp.groupby('community')['gene'].apply(list)).reset_index()
gl['community']=['c'+str(i) for i in gl['community']]
gl['network']=name
gl=gl.loc[:,['network','community','gene']]
gene_lists=pd.concat([gene_lists, gl])
gene_communities=gene_lists
gene_mat=pd.DataFrame()
for net in gene_communities['network'].unique():
temp=gene_communities.copy().loc[gene_communities['network']==net,]
for comm in temp['community'].unique():
gl=list(temp.copy().loc[temp['community']==comm,'gene'])[0]
el=[patlas.loc[patlas['Gene']==x,'Ensembl'].iloc[0] for x in gl if x in patlas['Gene'].values]
df=pd.DataFrame([net,comm,gl,el]).T
df.columns=['network','community','Gene','Ensembl']
gene_mat=pd.concat([gene_mat, df])
return(gene_mat)
get_ensembl().to_csv('data/gene_communities.tsv', sep="\t", index=False)
Here we've performed some network analyses based on a met-gene association network. We've explored different centrality measures to characterize the networks, and identified the communities of genes in these networks. We have also used gene set enrichment analysis to characterize these communities based on the genes, but it remains to show whether similar results would be attained if we considered the metabolites in each community.