genesgenes A Bipartite Stochastic Block Model

Publication

Abstract

Bipartite networks are a common type of network data in which there are two types of vertices, and only vertices of different types can be connected. While bipartite networks exhibit community structure like their unipartite counterparts, existing approaches to bipartite community detection have drawbacks, including implicit parameter choices, loss of information through one-mode projections, and lack of interpretability. Here we solve the community detection problem for bipartite networks by formulating a bipartite stochastic block model, which explicitly includes vertex type information and may be trivially extended to k-partite networks. This bipartite stochastic block model yields a projection-free and statistically principled method for community detection that makes clear assumptions and parameter choices and yields interpretable results. We demonstrate this model's ability to efficiently and accurately find community structure in synthetic bipartite networks with known structure and in real-world bipartite networks with unknown structure, and we characterize its performance in practical contexts.

Code

In the paper, we describe an algorithm, and use it to produce the results shown. Here we have reimplemented a more convenient version of the algorithm for MATLAB users, but its functionality is the same. The code is freely distributable under a Creative Commons Non-Commercial Share Alike license, meaning you are free to use and distribute it for non-commercial purposes, provided that it continues to be shared for non-commercial purposes.

Downloads

  • Version 1.0 - 2014-04-14 - First release of working MATLAB/mex code, tested on MATLAB R2014a and OSX 10.9.2. Inludes README, C++, MATLAB driver, and test script. There are known issues with compiling mex files with particular combinations of OSX and MATLAB versions. If you find solutions to particular issues, please drop me a note so I can post those here for other users!
  • Version 1.1 - 2014-05-20 - Increased the maximum number of edges, vertices, and communities. Code also now exits if Log(x) is called for x<0, instead of only warning.
  • Version 1.2 - 2014-07-29 - NEW wrapper for R, thanks to Luca Marotta. NEW pure C++ version. Updated wrapper for MATLAB. All core C++ elements consolidated to header files. Two test examples included.
  • [Download biSBM_v1.2 for C++, MATLAB or R]
Creative Commons License
This bipartite stochastic blockmodel (biSBM) implementation by Daniel Larremore is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

In the paper, we compared the biSBM to the degree-corrected stochastic block model described by Brian Karrer and Mark Newman. That code is also freely available here.

Data sets

In the paper, we analyzed two synthetic data sets and two empirical data sets. The synthetic data sets are stochastic draws from the biSBM, so we provide MATLAB scripts below to generate draws using the exact parameters used in the paper. The empirical data sets are linked below as well.

  • "Easy case" synthetic test network
  • [makeEasyCaseNetworks.m]
  • "Difficult case" synthetic test network
  • [makeDifficultCaseNetworks.m]
  • Southern Women Data
  • [southernWomenData.zip]
  • •Cite data set as: A. Davis, B.B. Gardner, and M.R. Gardner, Deep South (Univ. of Chicago Press, Chicago, 1941).
  • Malaria var Gene Data
  • [malariaData.zip]
  • •Cite network data set as: Larremore, D. B., Clauset, A., and Buckee, C. O. (2013). A Network Approach to Analyzing Highly Recombinant Malaria Parasite Genes. PLoS Computational Biology, 9(10), e1003268.
  • •Cite original (larger, non-network) genetic data set as: Rask, T. S. et al. (2010). Plasmodium falciparum Erythrocyte Membrane Protein 1 Diversity in Seven Genomes – Divide and Conquer. PLoS Computational Biology, 6(9), e1000933.

Thanks