• ISSN 0258-2724
  • CN 51-1277/U
  • EI Compendex
  • Scopus
  • Indexed by Core Journals of China, Chinese S&T Journal Citation Reports
  • Chinese Science Citation Database
Volume 28 Issue 4
Jul.  2015
WANG Lidong, ZHANG Yin, LÜ Mingqi. Document Semantic Compression Algorithm Based on Phrase Topic Model[J]. Journal of Southwest Jiaotong University, 2015, 28(4): 755-763. doi: 10.3969/j.issn.0258-2724.2015.04.027

Document Semantic Compression Algorithm Based on Phrase Topic Model

doi: 10.3969/j.issn.0258-2724.2015.04.027
  • Received Date: 16 Jun 2014
  • Publish Date: 25 Aug 2015
  • To extract representative semantic terms from documents, a semantic compression algorithm based on phrase topic modeling (SCPTM) was proposed. First, SCPTM casts semantic term extraction as a maximization problem and uses a greedy search algorithm to generate an approximate solution. Then, to compute the input parameters of SCPTM, the phrase discovery model LDACOL was employed to extract important topics in phrase form. The instability of topic allocation in the LDACOL model was also reduced, so that the extracted semantic terms better match human cognition. Finally, to evaluate topic discovery performance, the improved LDACOL was compared with LDA, LDACOL, and TNG; SCPTM was then applied to semantic compression on different corpora, and its effectiveness was evaluated through clustering results. Experimental results show that the improved LDACOL outperforms the other three models at topic discovery in most cases. The proposed algorithm extracts representative semantic terms with an accuracy of 70%-100% and achieves better document clustering results than other dimension-reduction algorithms such as PCA, MDS, and ISOMAP.
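
The greedy step described in the abstract can be illustrated with a small sketch. The abstract does not state the objective function, so the code below assumes a coverage-style objective, F(S) = sum over topics z of w(z) * (1 - product over t in S of (1 - p(t|z))), where w(z) is a topic weight and p(t|z) is how well phrase t represents topic z. The function name `greedy_select`, the input dictionaries, and the example phrases are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def greedy_select(topic_weight: Dict[str, float],
                  term_given_topic: Dict[str, Dict[str, float]],
                  k: int) -> List[str]:
    """Greedily pick up to k terms under an ASSUMED coverage objective.

    F(S) = sum_z topic_weight[z] * (1 - prod_{t in S} (1 - term_given_topic[t][z]))
    This objective is an illustration only; the paper's own model may differ.
    """
    # uncovered[z] = topic_weight[z] * prod_{t in selected} (1 - p(t|z))
    uncovered = dict(topic_weight)
    selected: List[str] = []
    candidates = set(term_given_topic)

    def gain(term: str) -> float:
        # Marginal increase of F when `term` is added to the current selection.
        return sum(uncovered[z] * p
                   for z, p in term_given_topic[term].items()
                   if z in uncovered)

    while candidates and len(selected) < k:
        # Sort first so that ties are broken deterministically.
        best = max(sorted(candidates), key=gain)
        if gain(best) <= 0.0:
            break  # no remaining term improves the objective
        selected.append(best)
        candidates.remove(best)
        # Discount the topics that the chosen term already covers.
        for z, p in term_given_topic[best].items():
            if z in uncovered:
                uncovered[z] *= 1.0 - p
    return selected


# Hypothetical toy input: two topics and three candidate phrases.
weights = {"sports": 0.6, "finance": 0.4}
phrases = {
    "world cup": {"sports": 0.9},
    "football match": {"sports": 0.8},
    "stock market": {"finance": 0.7},
}
print(greedy_select(weights, phrases, k=2))
```

On this toy input the second pick is "stock market" rather than "football match": once "world cup" covers most of the sports topic, a further sports phrase adds little. An objective of this form is monotone and submodular, which is the setting in which greedy selection carries the well-known (1 - 1/e) approximation guarantee; whether the paper's objective has this property cannot be determined from the abstract.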

     

