Topic Link Detection Method Based on Semantic Similarity
Abstract: To judge effectively whether any two stories are topically similar, a topic link detection method based on semantic similarity is proposed. First, the relative entropy between the feature words of the two stories is calculated and used as their semantic similarity. Next, the relevance between each feature word and the other story is obtained by averaging the semantic similarity. Finally, the relevance degree between the two stories is calculated by combining the TF-IDF (term frequency-inverse document frequency) weights of the feature words in the corpus with the semantic similarity, which completes link detection for the story pair. The proposed method was compared with the vector space model (VSM) method and with a method that relies only on average pointwise mutual information. Experiments on the TDT4 Chinese corpus show that the minimum DET (detection error tradeoff) cost of the proposed method is reduced by about 3%, which indicates that the method makes better use of the contextual information in the text and improves the performance of the topic link detection system.
Key words:
- link detection
- semantic similarity
- relative entropy
- relevance degree
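
The three steps named in the abstract (relative entropy between feature words, average similarity between a word and a story, TF-IDF-weighted relevance between two stories) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the abstract does not give the exact formulas, so the symmetrised divergence, the `1 / (1 + d)` mapping from divergence to similarity, the normalisation by total TF-IDF weight, and all function and parameter names (`dists`, `tfidf`, `story_relevance`, and so on) are assumptions. Each feature word is assumed to come with a strictly positive (smoothed) probability distribution over a shared context vocabulary.

```python
import math


def relative_entropy(p, q):
    """Relative entropy (KL divergence) D(p || q).

    Both inputs are strictly positive distributions over the same support.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def word_similarity(p, q):
    """Similarity in (0, 1] derived from the symmetrised relative entropy.

    The symmetrisation and the 1 / (1 + d) mapping are assumptions made
    for this sketch; the abstract does not specify them.
    """
    d = 0.5 * (relative_entropy(p, q) + relative_entropy(q, p))
    return 1.0 / (1.0 + d)


def word_story_relevance(word, story_words, dists):
    """Average semantic similarity between `word` and a story's feature words."""
    sims = [word_similarity(dists[word], dists[w]) for w in story_words]
    return sum(sims) / len(sims)


def story_relevance(words_a, words_b, tfidf, dists):
    """TF-IDF-weighted average relevance of story A's feature words to story B.

    Normalising by the total weight is an assumption of this sketch.
    """
    total_weight = sum(tfidf[w] for w in words_a)
    weighted = sum(
        tfidf[w] * word_story_relevance(w, words_b, dists) for w in words_a
    )
    return weighted / total_weight


# Toy data: hypothetical context distributions and TF-IDF weights.
dists = {
    "quake": [0.5, 0.3, 0.2],
    "tremor": [0.5, 0.3, 0.2],
    "election": [0.2, 0.3, 0.5],
}
tfidf = {"quake": 2.0, "tremor": 1.0, "election": 1.0}

same_topic = story_relevance(["quake", "tremor"], ["quake", "tremor"], tfidf, dists)
other_topic = story_relevance(["quake", "tremor"], ["election"], tfidf, dists)
print(same_topic, other_topic)
```

In this toy setup the first pair scores higher than the second, because the feature words of the first pair share identical context distributions. A link decision would then compare the score against a threshold, which in practice would be tuned on held-out story pairs; the sketch also assumes every story has at least one feature word.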