Outlier Detection Model for Data Streams Based on Attribute Associations and Match Difference Degree
doi: 10.3969/j.issn.0258-2724.2013.01.017
Abstract: To address outlier detection in categorical data streams, an outlier detection model for transaction data streams based on attribute associations and match difference degree, called AAMDD (attribute associations and match difference degree), is proposed. The model builds an association rule library off-line and updates it incrementally. Meanwhile, it maintains the stream data with time-sensitive sliding windows (TimeSW). After each time span, AAMDD matches the itemsets contained in every record of the current window against the association rule library and calculates the match difference degree (MDD); outliers are then identified on-line according to their MDD values. A corresponding algorithm, AAMDD-algorithm, is also given. Experimental results show that, compared with the FODFP-Stream algorithm, AAMDD-algorithm improves efficiency and detection precision by 37.43% and 5.51% on average, respectively, and keeps its recall above 77%, so it can be used to detect outliers in transaction data streams.
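The windowing and matching steps summarized in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the formula for the match difference degree, so the definition used here (the share of applicable rules whose consequent is missing from a record) is an assumption made purely for illustration, as are the names `Rule`, `TimeSlidingWindow`, `match_difference_degree`, the eviction policy, and the threshold. It is not the authors' AAMDD-algorithm.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """An association rule antecedent -> consequent (hypothetical container)."""
    antecedent: frozenset
    consequent: frozenset


def match_difference_degree(items, rules):
    """Assumed MDD: fraction of applicable rules that the record violates.

    A rule is applicable when its antecedent is contained in the record;
    it is violated when its consequent is not. Records with no applicable
    rule get 0.0, which is a design choice of this sketch.
    """
    applicable = [r for r in rules if r.antecedent <= items]
    if not applicable:
        return 0.0
    violated = sum(1 for r in applicable if not r.consequent <= items)
    return violated / len(applicable)


class TimeSlidingWindow:
    """Keeps only records whose timestamp lies within the last `span` time units."""

    def __init__(self, span):
        self.span = span
        self.buffer = deque()

    def add(self, timestamp, items):
        self.buffer.append((timestamp, frozenset(items)))
        # Evict records that have fallen out of the time span.
        while self.buffer and self.buffer[0][0] <= timestamp - self.span:
            self.buffer.popleft()

    def detect(self, rules, threshold):
        """Return the records in the window whose assumed MDD exceeds `threshold`."""
        return [(t, items) for t, items in self.buffer
                if match_difference_degree(items, rules) > threshold]


# Toy rule library and stream (made-up data, for illustration only).
rules = [
    Rule(frozenset({"bread"}), frozenset({"butter"})),
    Rule(frozenset({"milk"}), frozenset({"cereal"})),
]
window = TimeSlidingWindow(span=10)
window.add(1, {"bread", "butter"})
window.add(5, {"bread", "milk"})
window.add(12, {"milk", "cereal"})
flagged = window.detect(rules, threshold=0.5)
```

In this toy run the record at time 1 has already left the window when detection is performed, the record at time 5 violates both applicable rules and is flagged, and the record at time 12 satisfies its only applicable rule. The incremental update of the rule library that the abstract mentions is not modeled here.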
Key words:
- data stream
- association rule
- difference degree
- incremental outlier detection
- concept drifting
(Chinese editor: TANG Qing; English editor: FU Guobin)