• ISSN 0258-2724
  • CN 51-1277/U
  • EI Compendex
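  • Indexed in Scopus
  • Chinese Core Journal
  • Source journal for Chinese Scientific and Technical Papers and Citations
  • Source journal of the Chinese Science Citation Database (CSCD)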

Document Semantic Compression Algorithm Based on Phrase Topic Model

WANG Lidong, ZHANG Yin, LÜ Mingqi

Citation: WANG Lidong, ZHANG Yin, LÜ Mingqi. Document Semantic Compression Algorithm Based on Phrase Topic Model[J]. Journal of Southwest Jiaotong University, 2015, 28(4): 755-763. doi: 10.3969/j.issn.0258-2724.2015.04.027


doi: 10.3969/j.issn.0258-2724.2015.04.027
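Funding:

Zhejiang Provincial Natural Science Foundation of China (Q14F020032, LY15F020025)

National Natural Science Foundation of China (61202282)

China Academic Digital Associative Library (CADAL) Project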

Details
    Author biography:

    WANG Lidong (b. 1982), female, associate professor, Ph.D.; research interests: data mining and information retrieval. E-mail: violet_wld@163.com

Document Semantic Compression Algorithm Based on Phrase Topic Model

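  • Abstract: To extract representative semantic terms from text, this paper proposes SCPTM (semantic compression based on phrase topic modeling), a document semantic compression algorithm built on phrase topic modeling. The algorithm first casts the extraction of representative semantic terms as a maximization problem and solves it approximately with a greedy search strategy. Next, the phrase-mining model LDACOL is used for phrase topic modeling to obtain the input parameters of SCPTM. The model is also improved to address the unstable topic assignment of phrases, so that the extracted representative terms better match how people perceive semantics. Finally, the improved LDACOL model is compared experimentally with the LDA, LDACOL, and TNG models on topic-mining performance, and SCPTM is applied to several corpora for semantic compression, with its effectiveness evaluated through clustering results. The experiments show that, in most cases, the improved LDACOL model extracts topics better than the other three models. Representative semantic terms extracted by SCPTM reach a precision of 70% to 100%, and SCPTM yields better clustering results than traditional dimensionality reduction algorithms such as PCA, MDS, and ISOMAP.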
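
The greedy approximation mentioned in the abstract can be illustrated with a small sketch. The abstract does not state the objective function, so the weighted topic-coverage score used here is an assumption chosen only for illustration, as are the names `greedy_select`, `term_topic`, and `topic_weight` and the toy numbers. In the paper these inputs would come from the phrase topic model; here they are hand-written.

```python
def greedy_select(term_topic, topic_weight, k):
    """Greedily pick up to k terms under an ASSUMED topic-coverage objective.

    term_topic:   dict term -> dict topic -> strength in [0, 1] (hypothetical input)
    topic_weight: dict topic -> importance of the topic (hypothetical input)
    """
    def score(selected):
        # Assumed objective: a topic counts as covered with probability
        # 1 - prod(1 - strength) over the selected terms, weighted by importance.
        total = 0.0
        for topic, weight in topic_weight.items():
            miss = 1.0
            for term in selected:
                miss *= 1.0 - term_topic[term].get(topic, 0.0)
            total += weight * (1.0 - miss)
        return total

    selected = []
    candidates = set(term_topic)
    while candidates and len(selected) < k:
        base = score(selected)
        # Pick the candidate with the largest marginal gain; sorting makes ties deterministic.
        best = max(sorted(candidates), key=lambda t: score(selected + [t]) - base)
        if score(selected + [best]) - base <= 0.0:
            break  # no remaining term improves the objective
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy data, invented for this example only.
term_topic = {
    "topic model": {"ml": 0.9},
    "latent dirichlet allocation": {"ml": 0.8},
    "gene expression": {"bio": 0.7},
}
topic_weight = {"ml": 0.6, "bio": 0.4}

print(greedy_select(term_topic, topic_weight, 2))
```

With these toy inputs the second pick is "gene expression" rather than "latent dirichlet allocation": once "topic model" is selected, another term on the same topic adds little, so the marginal gain favors a term that covers a different topic.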

     

Publication history
  • Received: 2014-06-16
  • Published: 2015-08-25
