Data Mining: Anomaly Detection (数据挖掘之异常检测.ppt)

  • Uploaded by: a199****6536
  • Document ID: 1635293
  • Upload date: 2024-05-07
  • Format: PPT
  • Pages: 44
  • Size: 5.22 MB

Keywords: data mining, anomaly detection

Resource description:
Anomaly Detection: An Introduction

Source of slides:
  • Tutorial at the American Statistical Association (ASA 2008)
  • Jiawei Han, Data Mining: Concepts and Techniques
  • Tutorial at the European Conference on Principles and Practice of Knowledge Discovery in Databases

Speaker: Wentao Li

Outline
  • Definition
  • Application
  • Methods

Time is limited, so this talk only sketches the overall picture of anomaly detection. For more detail, please refer to the papers.

What are Anomalies?
  • An anomaly is a pattern in the data that does not conform to the expected behavior
  • An anomaly is a data object that deviates significantly from the normal objects, as if it were generated by a different mechanism
  • Also referred to as outliers, exceptions, peculiarities, surprises, etc.
  • Anomalies translate to significant (often critical) real-life entities
      • Cyber intrusions
      • Credit card fraud
      • Faults in mechanical systems

Related problems
  • Outliers are different from noise
      • Noise is random error or variance in a measured variable
      • Noise should be removed before outlier detection
  • Outliers are interesting: they violate the mechanism that generates the normal data
  • Outlier detection vs. novelty detection: a novel pattern is an outlier at an early stage, but is later merged into the model

Key Challenges
  • Defining a representative normal region is challenging
  • The boundary between normal and outlying behavior is often not precise
  • Availability of labeled data for training/validation
  • The exact notion of an outlier differs between application domains
  • Data might contain noise
  • Normal behavior keeps evolving
  • Appropriate selection of relevant features

Map
  [Diagram: related areas (theory), application (practice), problem formulation, detection effect]

Aspects of Anomaly Detection Problem
  • Nature of input data: what are the characteristics of the input data?
  • Availability of supervision: how many labels are available?
  • Type of anomaly: point, contextual, structural
  • Output of anomaly detection: score vs. label
  • Evaluation of anomaly detection techniques: what kind of detection is good?

Input Data
  • The most common form of data handled by anomaly detection techniques is record data
      • Univariate
      • Multivariate

Input Data: Nature of Attributes
  • Binary
  • Categorical
  • Continuous
  • Hybrid
  [Figure: example record table with columns labeled categorical, continuous, continuous, categorical, binary]

Input Data: Complex Data Types
  • Relationship among data instances
      • Sequential, temporal
      • Spatial
      • Spatio-temporal
      • Graph

Data Labels
  • Supervised anomaly detection: labels are available for both normal data and anomalies
  • Semi-supervised anomaly detection: labels are available only for normal data
  • Unsupervised anomaly detection: no labels are assumed; based on the assumption that anomalies are very rare compared to normal data
  • Note: some materials describe these settings differently. We adopt the definitions given here, although they differ somewhat from the traditional ones.

Types of Anomalies*
  • Point anomalies
  • Contextual anomalies
  • Collective anomalies

Point Anomalies
  • An individual data instance is anomalous w.r.t. the data
  [Figure: X-Y scatter plot with normal regions N1 and N2 and anomalies o1, o2, O3]

Contextual Anomalies
  • An individual data instance is anomalous within a context
  • Requires a notion of context
  • Also referred to as conditional anomalies*
  • Examples: dangerous behavior + theft condition = theft; money spending differs between the poor and the rich
  [Figure labels: Normal, Anomaly]
  * Xiuyao Song, Mingxi Wu, Christopher Jermaine, Sanjay Ranka, Conditional Anomaly Detection, IEEE Transactions on Knowledge and Data Engineering, 2006.

Collective Anomalies
  • A collection of related data instances is anomalous
  • Requires a relationship among data instances
      • Sequential data
      • Spatial data
      • Graph data
  • The individual instances within a collective anomaly are not anomalous by themselves
  [Figure: anomalous subsequence]

Output of Anomaly Detection
  • Label
      • Each test instance is given a normal or anomaly label
      • This is especially true of classification-based approaches
  • Score
      • Each test instance is assigned an anomaly score
      • Allows the output to be ranked
      • Requires an additional threshold parameter

Evaluation of Anomaly Detection: F-value
  • Accuracy is not a sufficient metric for evaluation
  • Example: a network traffic data set with 99.9% normal data and 0.1% intrusions. A trivial classifier that labels everything as normal achieves 99.9% accuracy.
  • Focus on both recall and precision (C = anomaly class, NC = normal class)
      • Recall: $R = TP / (TP + FN)$, correctly predicted anomalies over all anomalies
      • Precision: $P = TP / (TP + FP)$, correctly predicted anomalies over all predicted anomalies
      • F-measure: $F = 2RP / (R + P)$

Evaluation of Outlier Detection: ROC & AUC
  • Standard measures for evaluating anomaly detection problems:
      • Recall (detection rate): ratio between the number of correctly detected anomalies and the total number of anomalies
      • False alarm (false positive) rate: ratio between the number of records from the normal class that are misclassified as anomalies and the total number of records from the normal class
  • The ROC curve shows the trade-off between detection rate and false alarm rate
  • The area under the ROC curve (AUC) is computed using the trapezoid rule
  • Best case: the curve hugs the top-left corner (shape |‾); worst case: it hugs the bottom-right corner (shape _|)
  [Figure: ideal ROC curve and AUC; C = anomaly class, NC = normal class]

Applications of Anomaly Detection
  • Network intrusion detection
  • Insurance / credit card fraud detection
  • Healthcare informatics / medical diagnostics
  • Industrial damage detection
  • Image processing / video surveillance
  • Novel topic detection in text mining

Fraud Detection
  • Fraud detection refers to the detection of criminal activities occurring in commercial organizations
  • Malicious users might be actual customers of the organization or might be posing as a customer (also known as identity theft)
  • Types of fraud
      • Credit card fraud
      • Insurance claim fraud
      • Mobile / cell phone fraud
      • Insider trading
  • Challenges
      • Fast and accurate real-time detection
      • Misclassification cost is very high

Healthcare Informatics
  • Detect anomalous patient records
      • Indicate disease outbreaks, instrumentation errors, etc.
  • Key challenges
      • Only normal labels are available
      • Misclassification cost is very high
      • Data can be complex: spatio-temporal

Image Processing
  • Detecting outliers in an image or video monitored over time
  • Detecting anomalous regions within an image
  • Used in
      • Mammography image analysis
      • Video surveillance
      • Satellite image analysis
  • Key challenges
      • Detecting collective anomalies
      • Data sets are very large

Taxonomy*
  • Anomaly detection
      • Point anomaly detection
          • Classification based: rule based, neural networks based, SVM based
          • Nearest neighbor based: density based, distance based
          • Statistical: parametric, non-parametric
          • Clustering based
          • Others: information theory based, spectral decomposition based, visualization based
      • Contextual anomaly detection
      • Collective anomaly detection
      • Online anomaly detection
      • Distributed anomaly detection

Statistical Approaches
  • Statistical approaches assume that the objects in a data set are generated by a stochastic process (a generative model)
  • Idea: learn a generative model fitting the given data set, then identify the objects in low-probability regions of the model as outliers
  • Methods are divided into two categories: parametric vs. non-parametric
  • Parametric method
      • Assumes that the normal data is generated by a parametric distribution with parameter $\theta$
      • The probability density function $f(x, \theta)$ of the parametric distribution gives the probability that object $x$ is generated by the distribution
      • The smaller this value, the more likely $x$ is an outlier
  • Non-parametric method
      • Does not assume an a priori statistical model; the model is determined from the input data
      • Not completely parameter-free, but the number and nature of the parameters are flexible and not fixed in advance
      • Examples: histogram and kernel density estimation

Parametric Methods I: Detecting Univariate Outliers Based on the Normal Distribution
  • Univariate data: a data set involving only one attribute or variable
  • Often assume that the data are generated from a normal distribution, learn the parameters from the input data, and identify the points with low probability as outliers
  • Example: average temperatures 24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4
  • Use the maximum likelihood method to estimate $\mu$ and $\sigma$
  • Taking derivatives with respect to $\mu$ and $\sigma^2$, we derive the following maximum likelihood estimates:
    $\hat{\mu} = \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$, $\quad \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$
  • For the above data with $n = 10$, we have $\hat{\mu} = 28.61$ and $\hat{\sigma} = 1.51$ (recomputing from the ten listed values gives $\hat{\sigma} \approx 1.54$)
  • Then $(24 - 28.61) / 1.51 = -3.04$ [...]

Efficient computation: Nested loop algorithm
  • For any object $o_i$, calculate its distance from the other objects and count the number of other objects in its $r$-neighborhood
  • If $\pi \cdot n$ other objects are within distance $r$, terminate the inner loop
  • Otherwise, $o_i$ is a $DB(r, \pi)$ outlier
  • Efficiency: in practice the CPU time is often not $O(n^2)$ but close to linear in the data set size, since for most non-outlier objects the inner loop terminates early

Density-Based Outlier Detection
  • Local outliers: outliers relative to their local neighborhoods, instead of the global data distribution
  • In the figure, o1 and o2 are local outliers to C1, o3 is a global outlier, but o4 is not an outlier. However, proximity-based clustering cannot find that o1 and o2 are outliers (e.g., compared with o4).
  • Intuition (density-based outlier detection): the density around an outlier object is significantly different from the density around its neighbors
  • Method: use the relative density of an object against its neighbors as the indicator of the degree to which the object is an outlier
  • k-distance of an object $o$, $dist_k(o)$: the distance between $o$ and its k-th nearest neighbor
  • k-distance neighborhood of $o$: $N_k(o) = \{o' \mid o' \in D,\ dist(o, o') \le dist_k(o)\}$
  • $|N_k(o)|$ could be bigger than $k$, since multiple objects may have identical distance to $o$

Local Outlier Factor: LOF
  • Reachability distance from $o'$ to $o$: $reachdist_k(o \leftarrow o') = \max\{dist_k(o),\ dist(o, o')\}$, where $k$ is a user-specified parameter
  • Local reachability density of $o$: $lrd_k(o) = \frac{|N_k(o)|}{\sum_{o' \in N_k(o)} reachdist_k(o' \leftarrow o)}$
  • The LOF (local outlier factor) of an object $o$ is the average ratio of the local reachability densities of $o$'s k-nearest neighbors to that of $o$: $LOF_k(o) = \frac{1}{|N_k(o)|} \sum_{o' \in N_k(o)} \frac{lrd_k(o')}{lrd_k(o)}$
  • The lower the local reachability density of $o$, and the higher the local reachability densities of the kNN of $o$, the higher the LOF
  • This captures a local outlier whose local density is relatively low compared to the local densities of its kNN

Clustering-Based Outlier Detection (1 & 2): Not belonging to any cluster, or far from the closest one
  • An object is an outlier if (1) it does not belong to any cluster, (2) there is a large distance between the object and its closest cluster, or (3) it belongs to a small or sparse cluster
  • Case 1: Does not belong to any cluster
      • Identify animals that are not part of a flock, using a density-based clustering method such as DBSCAN
  • Case 2: Far from its closest cluster
      • Using k-means, partition the data points into clusters
      • For each object $o$, assign an outlier score based on its distance from its closest center $c_o$
      • If $dist(o, c_o) / avg\_dist(c_o)$ is large, $o$ is likely an outlier
  • Example, intrusion detection: consider the similarity between data points and the clusters in a training data set
      • Use a training set to find patterns of "normal" data, e.g., frequent itemsets in each segment, and cluster similar connections into groups
      • Compare new data points with the clusters mined. Outliers are possible attacks.

Clustering-Based Outlier Detection (3): Detecting Outliers in Small Clusters
  • FindCBLOF: detect outliers in small clusters
      • Find clusters, and sort them in decreasing size
      • To each data point, assign a cluster-based local outlier factor (CBLOF):
      • If object p belongs to a large cluster, CBLOF = cluster size × similarity between p and its cluster
      • If p belongs to a small cluster, CBLOF = cluster size × similarity between p and the closest large cluster
  • Example: in the figure, o is an outlier since its closest large cluster is C1, but the similarity between o and C1 is small. For any point in C3, its closest large cluster is C2 but its similarity to C2 is low, and |C3| = 3 is small.

Clustering-Based Method: Strength and Weakness
  • Strength
      • Detects outliers without requiring any labeled data
      • Works for many types of data
      • Clusters can be regarded as summaries of the data
      • Once the clusters are obtained, one only needs to compare an object against the clusters to determine whether it is an outlier (fast)
  • Weakness
      • Effectiveness depends highly on the clustering method used, which may not be optimized for outlier detection
      • High computational cost: the clusters must be found first
  • A method to reduce the cost: fixed-width clustering
      • A point is assigned to a cluster if the center of the cluster is within a pre-defined distance threshold from the point
      • If a point cannot be assigned to any existing cluster, a new cluster is created; the distance threshold may be learned from the training data under certain conditions

Classification-Based Method I: One-Class Model
  • Idea: train a classification model that can distinguish "normal" data from outliers
  • A brute-force approach: consider a training set that contains samples labeled as "normal" and others labeled as "outlier"
      • But the training set is typically heavily biased: the number of "normal" samples likely far exceeds the number of outlier samples
      • Cannot detect unseen anomalies
  • One-class model: a classifier is built to describe only the normal class
      • Learn the decision boundary of the normal class using classification methods such as SVM
      • Any samples that do not belong to the normal class (not within the decision boundary) are declared outliers
      • Advantage: can detect new outliers that may not appear close to any outlier objects in the training set
      • Extension: normal objects may belong to multiple classes

Classification-Based Method II: Semi-Supervised Learning
  • Semi-supervised learning: combining classification-based and clustering-based methods
  • Method
      • Using a clustering-based approach, find a large cluster, C, and a small cluster, C1
      • Since some objects in C carry the label "normal", treat all objects in C as normal
      • Use the one-class model of this cluster to identify normal objects in outlier detection
      • Since some objects in cluster C1 carry the label "outlier", declare all objects in C1 as outliers
      • Any object that does not fall into the model for C (such as a) is considered an outlier as well
  • Comments on classification-based outlier detection methods
      • Strength: outlier detection is fast
      • Bottleneck: quality depends heavily on the availability and quality of the training set, and representative, high-quality training data is often difficult to obtain
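
As a rough illustration of the F-value slide, the sketch below computes precision, recall, and the F-measure on a toy version of the 99.9% / 0.1% example. The function names and the toy label vectors are assumptions made for illustration; they do not come from the slides.

```python
def precision_recall_f(y_true, y_pred, positive=1):
    """Return (precision, recall, F) for the class labelled `positive` (the anomaly class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f = 2 * recall * precision / total if total else 0.0
    return precision, recall, f


def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data: 1 anomaly (label 1) among 1000 records, i.e. 0.1% anomalies.
y_true = [1] + [0] * 999
always_normal = [0] * 1000            # trivial classifier: everything is normal
one_false_alarm = [1, 1] + [0] * 998  # finds the anomaly, raises one false alarm

trivial_accuracy = accuracy(y_true, always_normal)
trivial_prf = precision_recall_f(y_true, always_normal)
detector_prf = precision_recall_f(y_true, one_false_alarm)
```

The trivial classifier scores high on accuracy while its recall is zero, which is the point the slide makes about accuracy being insufficient.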
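
The ROC & AUC slide says the area is computed with the trapezoid rule. A minimal sketch of that computation follows, assuming labels are 1 for anomalies and 0 for normal records, that a higher score means "more anomalous", and that both classes are present. The names `roc_points` and `auc_trapezoid` are hypothetical.

```python
def roc_points(scores, labels):
    """Return ROC points (false alarm rate, detection rate), one per distinct score threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        # Consume every instance that shares the current score (ties move together).
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / negatives, tp / positives))
        i = j
    return points


def auc_trapezoid(points):
    """Area under a piecewise-linear curve given as (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


perfect = auc_trapezoid(roc_points([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))
inverted = auc_trapezoid(roc_points([0.9, 0.8, 0.2, 0.1], [0, 0, 1, 1]))
uninformative = auc_trapezoid(roc_points([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0]))
```

A detector that ranks every anomaly above every normal record gets area 1, the reversed ranking gets 0, and constant scores give the diagonal with area 0.5.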
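
For the univariate normal-distribution slide, the following sketch recomputes the maximum likelihood estimates from the ten temperatures listed there and standardizes each value. The helper names are assumptions. Note that recomputing from the listed values gives a standard deviation of about 1.54 rather than the 1.51 shown on the slide, so the standardized value for 24.0 comes out near -2.99; whether it crosses a 3-sigma cutoff therefore depends on rounding.

```python
import math

temps = [24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4]


def mle_normal(xs):
    """Maximum likelihood estimates (mean, standard deviation) of a normal distribution."""
    n = len(xs)
    mu = sum(xs) / n
    variance = sum((x - mu) ** 2 for x in xs) / n  # MLE divides by n, not n - 1
    return mu, math.sqrt(variance)


def standardized(xs):
    """Return (x - mu) / sigma for every x, using the MLE parameters of xs."""
    mu, sigma = mle_normal(xs)
    return [(x - mu) / sigma for x in xs]


mu_hat, sigma_hat = mle_normal(temps)
z = standardized(temps)
# Index of the value that deviates most from the mean, in units of sigma.
most_extreme = max(range(len(temps)), key=lambda i: abs(z[i]))
```

Here `temps[most_extreme]` is 24.0, the value the slide singles out.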
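
The nested loop algorithm for DB(r, π) outliers can be sketched as below. This is one possible reading of the slide: the threshold is taken as the ceiling of π·n, distances are Euclidean, and the function name and toy points are hypothetical.

```python
import math


def db_outliers(points, r, pi):
    """Return indices of DB(r, pi) outliers found with the nested loop algorithm.

    A point is kept as an outlier when fewer than ceil(pi * n) other points
    lie within distance r of it (assumed interpretation of the slide).
    """
    n = len(points)
    needed = math.ceil(pi * n)
    outliers = []
    for i, p in enumerate(points):
        count = 0
        for j, q in enumerate(points):
            if count >= needed:
                break  # enough neighbors found: stop the inner loop early
            if i != j and math.dist(p, q) <= r:
                count += 1
        if count < needed:
            outliers.append(i)
    return outliers


# Five points close to the origin and one far away.
sample = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05), (5.0, 5.0)]
found = db_outliers(sample, r=0.5, pi=0.5)
```

The early `break` is what the slide refers to when it says the inner loop terminates early for most non-outlier objects.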
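
The LOF slide defines k-distance, the k-distance neighborhood, reachability distance, and local reachability density. A brute-force sketch of those definitions is shown below, assuming Euclidean distance, more than k points, and no group of more than k exact duplicates (which would make a density infinite). It is an illustration of the definitions, not an efficient implementation, and `lof_scores` is a hypothetical name.

```python
import math


def lof_scores(points, k):
    """Return the local outlier factor of every point (brute force, O(n^2) distances)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    k_dist = []     # distance from each point to its k-th nearest neighbor
    neighbors = []  # k-distance neighborhood; may hold more than k points on ties
    for i in range(n):
        others = sorted((dist[i][j], j) for j in range(n) if j != i)
        kd = others[k - 1][0]
        k_dist.append(kd)
        neighbors.append([j for d, j in others if d <= kd])

    def reach_dist(i, j):
        # Reachability distance from point i to point j.
        return max(k_dist[j], dist[i][j])

    # Local reachability density: neighborhood size over summed reachability distances.
    lrd = [len(neighbors[i]) / sum(reach_dist(i, j) for j in neighbors[i])
           for i in range(n)]

    # LOF: average ratio of the neighbors' densities to the point's own density.
    return [sum(lrd[j] for j in neighbors[i]) / (lrd[i] * len(neighbors[i]))
            for i in range(n)]


square_plus_far = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
scores = lof_scores(square_plus_far, k=2)
```

Points inside the square get a score of about 1, while the isolated point gets a much larger score, matching the slide's remark that a low own density combined with high neighbor densities yields a high LOF.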
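
Fixed-width clustering, mentioned as a way to reduce clustering cost, can be sketched in a single pass over the data. The version below makes several assumptions that the slides do not specify: the first point of each cluster stays its center, a point joins the first center within the threshold, and clusters below a chosen size are flagged. All names are hypothetical.

```python
import math


def fixed_width_clusters(points, width):
    """Single-pass fixed-width clustering; returns (centers, members)."""
    centers = []
    members = []
    for p in points:
        for idx, c in enumerate(centers):
            if math.dist(p, c) <= width:
                members[idx].append(p)  # join the first cluster whose center is close enough
                break
        else:
            # No existing center is within the threshold: start a new cluster at p.
            centers.append(p)
            members.append([p])
    return centers, members


def points_in_small_clusters(members, min_size):
    """Return points whose cluster has fewer than `min_size` members."""
    return [p for group in members if len(group) < min_size for p in group]


data = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.4), (10.0, 10.0), (0.2, 0.2)]
centers, members = fixed_width_clusters(data, width=1.0)
suspects = points_in_small_clusters(members, min_size=2)
```

Because each point is compared only against the current centers, the cost grows with the number of clusters rather than with the number of point pairs.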
About this document
Title: 数据挖掘之异常检测.ppt (Data Mining: Anomaly Detection)
Link: https://www.zixin.com.cn/doc/1635293.html