Outliers

An outlier is a data object that deviates significantly from the rest of the objects, suspected of being generated by a different mechanism.

Basic Concepts

Outlier Detection and Novelty Detection

  • when monitoring a social media site where new content keeps arriving, novelty detection can identify new topics and trends in a timely manner; novel topics may initially appear as outliers
  • outlier detection and novelty detection share some similarity in modeling and detection methods
  • main difference: in novelty detection, once new topics are confirmed, they are usually incorporated into the model of normal behavior so that follow-up instances are not treated as outliers anymore

Detecting Global Outliers

  • find an appropriate measurement of deviation with respect to the application in question
  • important in many applications
    • intrusion detection in computer networks
    • trading transaction auditing systems

Contextual vs. Behavioral Attributes

  • in contextual outlier detection, the attributes of the data objects in question are divided into two groups
    • contextual attributes
    • behavioral attributes
  • the contextual attributes of a data object define the object’s context
    • example: for temperature data, the contextual attributes may be date and location
  • the behavioral attributes define the object’s characteristics, and are used to evaluate whether the object is an outlier in the context to which it belongs
    • example: for the same temperature data, the behavioral attributes may be temperature, humidity, and pressure
  • whether a data object is a contextual outlier depends not only on the behavioral attributes but also on the contextual attributes

Contextual, Global, and Local Outliers

  • global outlier detection can be regarded as a special case of contextual outlier detection, where the set of contextual attributes is empty - global outlier detection uses the whole dataset as the context.
  • contextual outlier analysis provides flexibility in that users can examine outliers in different contexts, which is highly desirable in many applications
  • the quality of contextual outlier detection depends on the meaningfulness of the contextual attributes, in addition to the measurement of how far an object deviates from the majority in the space of behavioral attributes
  • applications of collective outliers:
    • intrusion detection: if several computers send denial-of-service packets to each other, this could indicate an attack
    • a large set of transactions of the same stock between a small number of parties in a short period could indicate market manipulation

Comparison Between Multiple Types of Outliers

  • a dataset can have many types of outliers
  • an object may belong to multiple types of outliers
  • different applications or purposes could require detection of different types of outliers
  • global outlier detection is the simplest
  • contextual outlier detection requires background information
  • collective outlier detection requires background information to model the relationship among objects to find groups of outliers

Challenges of Outlier Detection

  • modeling normal objects and outliers effectively
    • the border between data normality and abnormality (outliers) is often not clear-cut
  • application-specific outlier detection
    • the relationship among objects highly depends on applications
  • handling noise in outlier detection
    • noise often unavoidably exists in data collected in many applications
    • low data quality and the presence of noise pose a huge challenge to outlier detection
  • interpretability
    • users want not only to detect outliers but also to understand why the detected objects are outliers

Detection Methods

  • supervised methods
  • semi-supervised methods
  • unsupervised methods
  • statistical methods
  • proximity-based methods
  • reconstruction-based methods

Supervised, Semi-Supervised, and Unsupervised Methods

  • supervised methods model data normality and abnormality using labeled examples
    • challenge: the classes are imbalanced, since outliers are rare
    • challenge: catch as many outliers as possible while avoiding too many false positives
  • unsupervised methods make an implicit assumption
    • normal objects are somewhat clustered
  • semi-supervised methods
    • although obtaining some labeled examples is feasible, the number of such labeled examples is often small

Statistical Methods, Proximity-based Methods, and Reconstruction-based Methods

  • Statistical methods (also known as model-based methods) make assumptions of data normality
    • Example: detecting outliers using a statistical (Gaussian) model
  • Proximity-based methods assume that an object is an outlier if the nearest neighbors of the object are far away in feature space, that is, the proximity of the object to its neighbors significantly deviates from the proximity of most of the other objects to their neighbors in the same data set
  • Reconstruction-based methods: matrix-factorization-based methods and pattern-based compression methods
    • normal data samples often share certain similarities and can therefore be represented more succinctly than in their original representation
    • with the succinct representation, we can reconstruct the original representation of the normal samples well
    • samples that cannot be reconstructed well from the succinct representation are flagged as outliers (see the sketch below)
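To make the reconstruction idea concrete, here is a minimal sketch using PCA as the matrix-factorization step; the synthetic data, the number of components, and the notion of "reconstructs well" are all illustrative assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic training data: 200 samples lying near a 2-D subspace of 5-D space.
rng = np.random.default_rng(42)
basis = rng.normal(size=(2, 5))
X_train = rng.normal(size=(200, 2)) @ basis + 0.05 * rng.normal(size=(200, 5))

# Learn the succinct representation (2 principal components) from normal data.
pca = PCA(n_components=2).fit(X_train)

def reconstruction_error(X):
    """Per-sample distance between X and its reconstruction from 2 components."""
    X_hat = pca.inverse_transform(pca.transform(X))
    return np.linalg.norm(X - X_hat, axis=1)

# A sample with the same structure reconstructs well; a random point does not.
normal_like = rng.normal(size=(1, 2)) @ basis
random_point = 10 * rng.normal(size=(1, 5))
print(reconstruction_error(normal_like))   # small error
print(reconstruction_error(random_point))  # large error -> flag as an outlier
```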

The overall idea behind statistical methods for outlier detection is to learn a generative model fitting the given dataset, and then identify those objects in low-probability regions of the model as outliers.

A parametric method assumes that the normal data objects are generated by a parametric distribution with a finite number of parameters \(\Theta\).

  • the probability density function of the parametric distribution \(f(x, \Theta)\) gives the probability that object \(x\) is generated by the distribution
  • the smaller this value, the more likely \(x\) is an outlier
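For example, for a univariate normal (Gaussian) distribution the parameters are \(\Theta = (\mu, \sigma)\), and the density is \(f(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)\); an \(x\) far from \(\mu\) gets a small density value and is therefore a candidate outlier.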

A nonparametric method does not assume a parametric model a priori; instead, it tries to determine the model from the input data.

Detection of Univariate Outliers Based on Normal Distribution

  • assumption: data are generated from a normal distribution
  • learn the parameters of the normal (Gaussian) distribution from the input data, and identify the points with low probability as outliers
  • example: suppose we have a city’s average temperature values for a single month in the last 10 years:
    • 24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, and 29.4
  • a normal distribution is determined by two parameters
    • the mean \(\mu\)
    • the standard deviation \(\sigma\)
  • use the maximum likelihood method to estimate the parameters \(\mu\) and \(\sigma\):
    • \(\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i\), \(\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat{\mu})^2\)
    • for the data above, \(\hat{\mu} = 28.61\) and \(\hat{\sigma} \approx 1.54\); the value 24.0 lies roughly \(3\hat{\sigma}\) below the mean, while every other value is within \(0.6\hat{\sigma}\) of it, so 24.0 is flagged as an outlier
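A minimal sketch of this computation (np.std defaults to the divide-by-n estimate, which is exactly the MLE):

```python
import numpy as np

# The 10 monthly average temperatures from the example above.
temps = np.array([24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4])

# Maximum likelihood estimates for a normal distribution:
# the sample mean and the divide-by-n standard deviation.
mu = temps.mean()      # 28.61
sigma = temps.std()    # ~1.54

# Score each value by its distance from the mean in standard deviations.
z = np.abs(temps - mu) / sigma
for t, score in zip(temps, z):
    print(f"{t}: {score:.2f} sigma")   # 24.0 is ~2.99 sigma out; the rest < 0.6
```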

Parametric Detection - Interquartile Range (IQR) - Boxplot (5-number Summary)

Using the IQR, outliers can be identified as follows:

  • Given the five-number summary:
    • min
    • lower quartile (Q1)
    • median (Q2)
    • upper quartile (Q3)
    • max
  • \(IQR = Q3 - Q1\)
  • Outliers:
    • \(x\) is an outlier if \(x > Q3 + 1.5 \times IQR\)
    • \(x\) is an outlier if \(x < Q1 - 1.5 \times IQR\)
    • concept: for normally distributed data, \([Q1 - 1.5 \times IQR,\ Q3 + 1.5 \times IQR]\) contains about 99.3% of the objects

Nonparametric Detection - Histogram

  • construct a histogram using the input data (training data)
  • to test an object, check it against the histogram: if the object falls in one of the histogram's bins, it is regarded as normal; otherwise, it is considered an outlier
  • use the histogram to assign an outlier score to an object, such as the reciprocal of the volume of the bin in which the object falls
  • drawback: it is hard to choose an appropriate bin size (bins that are too small may flag normal objects that fall in empty or rare bins; bins that are too large may absorb outliers into well-populated bins)
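A minimal sketch of histogram-based scoring, assuming synthetic training data and an arbitrary bin count; the score here uses the bin's relative frequency as a stand-in for the bin-volume score mentioned above:

```python
import numpy as np

# Synthetic training data standing in for "normal" observations.
rng = np.random.default_rng(0)
train = rng.normal(loc=29.0, scale=0.5, size=1000)

counts, edges = np.histogram(train, bins=10)

def outlier_score(x):
    """Reciprocal relative frequency of x's bin; infinity if x falls
    outside the histogram or in an empty bin."""
    i = np.searchsorted(edges, x, side='right') - 1
    if i < 0 or i >= len(counts) or counts[i] == 0:
        return float('inf')            # not covered by any populated bin
    return len(train) / counts[i]      # rarer bin -> higher score

print(outlier_score(29.0))  # dense bin -> low score
print(outlier_score(24.0))  # outside the histogram -> inf, an outlier
```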

Pros and Cons of Statistical Methods

  • Advantage: outlier detection may be statistically justifiable
  • Challenge: statistical methods are hard to apply to outlier detection on high-dimensional data
  • The computational cost of statistical methods depends on the models

Implementation

IQR Method

Set Up

  • q1 = df['column'].quantile(0.25)
  • q3 = df['column'].quantile(0.75)
  • iqr = q3 - q1
  • upper_limit = q3 + (1.5 * iqr)
  • lower_limit = q1 - (1.5 * iqr)

Outliers and Trimming Data

  • outliers: df.loc[(df['column'] > upper_limit) | (df['column'] < lower_limit)]
  • trimmed: df.loc[(df['column'] < upper_limit) & (df['column'] > lower_limit)]
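A runnable sketch putting the steps above together (the column name 'column' and the synthetic data are placeholders):

```python
import numpy as np
import pandas as pd

# Synthetic data: mostly well-behaved values plus a few planted extremes.
rng = np.random.default_rng(7)
values = np.concatenate([rng.normal(50, 5, size=500), [5.0, 120.0, 140.0]])
df = pd.DataFrame({'column': values})

q1 = df['column'].quantile(0.25)
q3 = df['column'].quantile(0.75)
iqr = q3 - q1
upper_limit = q3 + (1.5 * iqr)
lower_limit = q1 - (1.5 * iqr)

outliers = df.loc[(df['column'] > upper_limit) | (df['column'] < lower_limit)]
trimmed = df.loc[(df['column'] < upper_limit) & (df['column'] > lower_limit)]
print(outliers['column'].values)  # includes the planted 5.0, 120.0, 140.0
print(len(trimmed))
```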

Percentile Method

Remove data below the 1% quantile and above the 99% quantile.

Set Up

  • upper_limit = df['column'].quantile(0.99)
  • lower_limit = df['column'].quantile(0.01)

Outliers and Trimming Data

  • outliers: df.loc[(df['column'] > upper_limit) | (df['column'] < lower_limit)]
  • trimmed: df.loc[(df['column'] < upper_limit) & (df['column'] > lower_limit)]
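A common alternative to trimming is capping (winsorizing), where values beyond the limits are clipped to the limits rather than dropped; pandas provides clip for this:

  • capped: df['column'].clip(lower=lower_limit, upper=upper_limit)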

Use distribution plots (boxplots, violin plots, etc.) to review the data spread before and after trimming.
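For example, a minimal before-and-after comparison with seaborn (assuming the df and trimmed frames from the snippets above):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Side-by-side boxplots of the column before and after trimming.
fig, axes = plt.subplots(1, 2, sharey=True)
sns.boxplot(y=df['column'], ax=axes[0])
axes[0].set_title('before trimming')
sns.boxplot(y=trimmed['column'], ax=axes[1])
axes[1].set_title('after trimming')
plt.show()
```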