Parallel Algorithm for Overlapping Community Detection in Complex Networks
-
摘要: 随着网络规模的快速增长,传统社区发现算法难以处理大规模网络数据和满足复杂网络的可扩展分析需求. 本文提出一种适用于大规模复杂网络的重叠社区发现算法PHLink. 该算法根据复杂网络的无标度特性将节点建立连边的原因进行分析和归类,用以识别网络中具有重叠性的社区结构,并采用MapReduce计算框架对网络进行分割和冗余存储,减弱了图计算的耦合性,解决了社区发现算法的分布式计算问题. 通过真实网络测试,PHLink算法可以大幅度降低边计算的复杂度,对于无标度特性明显的复杂网络提取0.1%的枢纽节点即可节省94%以上的计算量,较传统算法具有较高的稳定性和准确性,并且在Hadoop平台有良好的加速性和伸缩性,可以处理千万级连边规模的大规模复杂网络.Abstract: With the increasing size of complex networks, traditional detection methods are difficult to scale to larger networks. This paper proposes a parallel hierarchical link(PHLink) algorithm to discover overlapping communities. By studying the power-law degree distribution of complex networks, we distinguish the motivations of why two nodes intend to establish a connection so that the complexity for community detection can be reduced. PHLink is implemented using distributed computing for large-scale networks via graph segmentation and redundant storage. Experiments validate that PHLink can accurately discover network communities and the overlaps between them. For scale-free networks, removing 0.1% of the hub nodes reduces the computation time by 94%. Meanwhile, PHLink demonstrates good speed-up and scale-up on Hadoop, which can scale to very large complex networks with millions of edges.
-
Key words:
- complex network /
- overlapping community /
- community detection /
- scale-free /
- parallel algorithm /
- Hadoop
-
表 1 基准测试数据集
Table 1. Benchmark datasets
网络 节点数 边数 描述 Football 115 616 足球联盟 Jazz 198 2 742 爵士合作 Facebook 5 000 8 194 社交网络 DBLP* 317 080 1 049 866 引文网络 Amazon* 334 863 925 872 商品网络 YouTube* 1 134 890 2 987 624 视频网络 Skitter 3 997 962 11 095 298 网络拓扑 表 2 比例阈值对计算量的影响
Table 2. Effect of proportional threshold on computation
网络 1.0% 0.5% 0.1% Facebook 73 48 22 DBLP 56 37 21 Amazon 46 35 20 YouTube 99 98 94 Skitter 99 98 96 表 3 运行时间比较
Table 3. Comparison of computation time
网络 PHLink Link CPM Bigclam Football 0.21 0.24 0.73 0.69 (30) Jazz 16.9 25.5 — 56.3 (30) Facebook 3.62 3.92 — 135 (30) DBLP 291 395 — 5 358 (13.5k) Amazon 75.5 85.8 — 4 442 (75k) YouTube 346 — — — Skitter 6 242 — — — 表 4 PHLink算法加速性能
Table 4. Speedup for PHLink
网络 4节点 8节点 12节点 YouTube 603 444 346 Skitter 17 984 9 483 6 242 -
PALLA G, DERÉNYI I, FARKAS I, et al. Uncovering the overlapping community structure of complex networks in nature and society[J]. Nature, 2005, 435(7043): 814-818 ADAMCSEK B, PALLA G, FARKAS I J, et al. CFinder:locating cliques and overlapping modules in biological networks[J]. Bioinformatics, 2006, 22(8): 1021-1023 doi: 10.1093/bioinformatics/btl039 LANCICHINETTI A, FORTUNATO S, KERTESZ J. Detecting the overlapping and hierarchical community structure in complex networks[J]. New Journal of Physicals, 2008, 11(3): 19-44 李首庆,徐洋. 基于自适应聚概率矩阵的JPDA算法研究[J]. 西南交通大学学报,2017,52(2): 340-347 doi: 10.3969/j.issn.0258-2724.2017.02.018LI Shouqing, XU Yang. Joint probabilistic data association algorithm based on adaptive cluster probability matrix[J]. Journal of Southwest Jiaotong University, 2017, 52(2): 340-347 doi: 10.3969/j.issn.0258-2724.2017.02.018 刘世超,朱福喜,甘琳. 基于标签传播概率的重叠社区发现算法[J]. 计算机学报,2016,39(4): 717-729LIU Shichao, ZHU Fuxi, GAN Lin. A label-propagation probability based algorithm for overlappling community detection[J]. Chinese Journal of Computers, 2016, 39(4): 717-729 YANG J, LESKOVEC J. Overlapping communities explain core-periphery organization of networks[J]. Proceedings of the IEEE, 2014, 102(12): 1892-1902 doi: 10.1109/JPROC.2014.2364018 WANG X, LIU G, LI J, et al. Locating structural centers:a density-based clustering method for community detection[J]. Plos One, 2017, 12(1): e0169355 doi: 10.1371/journal.pone.0169355 AHN Y Y, BAGROW J P, LEHMANN S. Link communities reveal multiscale complexity in networks[J]. Nature, 2010, 466(7307): 761-764 doi: 10.1038/nature09182 SUN H, LIU J, HUANG J, et al. LinkLPA:a link-based label propagation algorithm for overlapping community detection in networks[J]. Computational Intelligence, 2017, 33(2): 308-331 doi: 10.1111/coin.12087 乔少杰,郭俊,韩楠,等. 大规模复杂网络社区并行发现算法[J]. 计算机学报,2017,40(3): 687-700QIAO Shaojie, GUO Jun, HAN Nan, et al. Parallel algorithm for discovering communities in large-scale complex networks[J]. Chinese Journal of Computers, 2017, 40(3): 687-700 BU Z, WU Z, CAO J, et al. Local community mining on distributed and dynamic networks from a multiagent perspective[J]. IEEE Transactions on Cybernetics, 2017, 46(4): 986-999 CLAUSET A, SHALIZI C R, NEWMAN M E J. Power-law distributions in empirical data[J]. Siam Review, 2012, 51(4): 661-703 YANG J, LESKOVEC J. Defining and evaluating network communities based on ground-truth[J]. Knowledge & Information Systems, 2012, 42(1): 745-754 CLAUSET A, NEWMAN M E J, MOORE C. Finding community structure in very large networks[J]. Physical Review E, 2004, 70(2): 066111-1-066111-6