Document Semantic Compression Algorithm Based on Phrase Topic Model
Abstract: To extract representative semantic terms from documents, a semantic compression algorithm based on phrase topic modeling, SCPTM (semantic compression based on phrase topic modeling), is proposed. First, SCPTM formulates representative semantic term extraction as a maximization problem and solves it approximately with a greedy search strategy. Then, the phrase-mining model LDACOL is employed to perform phrase-level topic modeling, which supplies the input parameters of SCPTM; at the same time, the instability of topic assignment for phrases in LDACOL is remedied, so that the extracted representative semantic terms better match human cognition of semantics. Finally, the topic-mining performance of the improved LDACOL model is compared experimentally with that of the LDA, LDACOL and TNG models, and SCPTM is applied to semantic compression on different corpora, with its effectiveness evaluated by clustering results. The experimental results show that, in most cases, the improved LDACOL model outperforms the other three models in topic extraction; the representative semantic terms extracted by SCPTM reach an accuracy of 70%-100%, and yield better clustering results than traditional dimensionality reduction algorithms such as PCA, MDS and ISOMAP.
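As a rough illustration of the greedy search step described in the abstract, the sketch below selects a fixed number of representative phrases by greedily maximizing a simple topic-coverage objective over a phrase-topic matrix. The objective, function names and parameters are illustrative assumptions only; they are not the paper's exact maximization model or the actual output format of the (improved) LDACOL model.

```python
# Minimal sketch: greedy selection of representative phrases that approximately
# maximizes coverage of the discovered topics. The max-coverage objective used
# here is an assumption for illustration, not the paper's formulation.
import numpy as np

def greedy_semantic_compression(phrase_topic, k):
    """phrase_topic: (num_phrases, num_topics) matrix, e.g. p(topic | phrase)
    estimated by a phrase topic model; k: number of representative phrases
    to keep. Returns the indices of the selected phrases."""
    num_phrases, num_topics = phrase_topic.shape
    selected = []
    covered = np.zeros(num_topics)          # current per-topic coverage
    for _ in range(min(k, num_phrases)):
        best, best_gain = None, -np.inf
        for i in range(num_phrases):
            if i in selected:
                continue
            # marginal gain of adding phrase i under the coverage objective
            gain = np.maximum(covered, phrase_topic[i]).sum() - covered.sum()
            if gain > best_gain:
                best, best_gain = i, gain
        selected.append(best)
        covered = np.maximum(covered, phrase_topic[best])
    return selected

# toy usage with a random phrase-topic matrix (20 phrases, 5 topics)
rng = np.random.default_rng(0)
pt = rng.dirichlet(np.ones(5), size=20)
print(greedy_semantic_compression(pt, 3))
```

Because the coverage gain never increases as more phrases are added, this kind of greedy procedure is the standard way to approximate such maximization problems when exact search is infeasible.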
Key words:
- topic model
- representative semantic terms
- text mining
- semantic compression
- SCPTM