Classification and Regression

Decision Tree Induction

Decision tree induction is mostly used for classification (i.e., decision tree classification) and is known as a greedy algorithm: it makes the locally optimal attribute choice at each node.

A decision tree is a flowchart-like structure (a small sketch that prints one follows this list), where:

  • Root Node: the topmost node in a tree
  • Internal Node: denotes a test on an attribute
  • Branch: represents an outcome of the test
  • Leaf Node (terminal node): holds a class label
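
A minimal sketch that prints such a structure (the tiny toy dataset and feature names are only an illustration; sklearn is covered properly in the implementation section below):

Code
from sklearn.tree import DecisionTreeClassifier, export_text

# tiny toy dataset: [age, income] -> buys (illustrative assumption)
X = [[25, 30000], [45, 60000], [35, 40000], [22, 20000], [50, 80000], [30, 52000]]
y = ['no', 'yes', 'no', 'no', 'yes', 'yes']

clf = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)
print(export_text(clf, feature_names=['age', 'income']))
# the first test printed is the root node, nested tests are internal nodes,
# each indented line is a branch (test outcome), and 'class: ...' lines are leaf nodes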

When splitting on an attribute \(A\), there are three scenarios:

  • Discrete-valued: split directly, with one branch for each known value of \(A\)
  • Continuous-valued: a binary split at a split_point, i.e. \(A \le split\_point\) vs. \(A > split\_point\)
  • Discrete-valued with a binary tree: the test has the form \(A \in splitting\_subset\)

Attributes are selected at each node based on a heuristic or statistical measure. An attribute selection measure (also called splitting rules) is a heuristic for selecting the splitting criterion that “best” separates a given data partition of class-labeled training tuples into individual classes.

Some common selection measures:

  • information gain - Iterative Dichotomiser 3 (ID3): biased towards multi-valued attributes
  • gain ratio (C4.5): tends to prefer unbalanced splits
  • Gini impurity (CART): also biased towards multi-valued attributes; favors equal-sized & pure partitions, and is not great when the number of classes is large

Information Gain

The purer the dataset:

  • the less information we need to remember
  • the easier we can make predictions

Entropy measures the impurity of a dataset.

We call \(p_i = \frac{|C_{i,D}|}{|D|}\) the probability that a tuple in \(D\) belongs to class \(C_i\), \(m\) the number of classes, and \(v\) the number of partitions (subsets) produced by a split.

\(Info(D) = - \sum\limits_{i=1}^{m} p_i \log_2(p_i)\) (entropy)

\(Info_A(D) = \sum\limits_{j=1}^{v} \frac{|D_j|}{|D|} Info(D_j)\)

\(Gain(A) = Info(D) - Info_A(D)\)

Here’s what this process looks like by hand:
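
As a small illustrative computation (the toy dataset here, 14 tuples with 9 labeled “yes” and 5 labeled “no”, is an assumption for the example):

\(Info(D) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} \approx 0.940\) bits

Splitting on a candidate attribute \(A\) partitions \(D\); \(Info_A(D)\) is the weighted average of the entropies of those partitions, and the attribute with the largest \(Gain(A) = Info(D) - Info_A(D)\) is chosen as the splitting attribute.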

Gain Ratio

Information gain is biased towards tests with many outcomes. To overcome this, gain ratio applies a kind of normalization to information gain.

\(SplitInfo_A(D) = - \sum\limits_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\left(\frac{|D_j|}{|D|}\right)\)

\(GainRatio(A) = \frac{Gain(A)}{SplitInfo_A(D)}\)

Gini Impurity

\(Gini(D) = 1 - \sum\limits_{i=1}^m p_i^2\)

and given \(v\) partitions (\(v=2\) would indicate binary),

\(Gini_A(D) = \sum\limits_{j=1}^v \frac{|D_j|}{|D|} Gini(D_j)\)

\(\Delta Gini(A) = Gini(D) - Gini_A(D)\)
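
These impurity measures are easy to compute directly; here is a minimal Python sketch (the toy label array is only an illustration):

Code
import numpy as np

def entropy(labels):
    # Info(D): entropy of a 1-D array of class labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # Gini(D): Gini impurity of a 1-D array of class labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

labels = np.array(['yes'] * 9 + ['no'] * 5)  # illustrative toy labels
print(entropy(labels))  # ~0.940
print(gini(labels))     # ~0.459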

Naive Bayesian Classifier (Naive Bayes)

A classifier based on Bayes’ Theorem which uses an incremental approach.

The Theory

Bayes’ Theorem

\(P(H|X) = \frac{P(H \cap X)}{P(X)} = \frac{P(X|H)P(H)}{P(X)}\)

The Classifier

Given

\(P(C_i|X) = \frac{P(X|C_i)P(C_i)}{P(X)}\)

\(P(X)\) is constant for all classes, thus we only need to maximize:

\(P(X|C_i)P(C_i)\)

\(\rightarrow P(X|C_i) = \prod_{k=1}^n P(x_k|C_i)\)

Naive Assumption: class conditional independence (no dependence between attributes).

The Zero-Probability Problem (Laplacian Correction)

The prediction requires each conditional probability to be non-zero; if a single \(P(x_k|C_i) = 0\), then \(P(X|C_i) = \prod_{k=1}^n P(x_k|C_i) = 0\).

We account for this by applying the Laplacian Correction (Laplacian Estimator): add 1 (or another small constant) to each count.
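
For example (the counts below are an illustrative assumption): suppose a class contains 1,000 training tuples, with income = low for 0 tuples, income = medium for 990, and income = high for 10. Adding 1 to each count (and 3 to the total) gives

\(\frac{1}{1003} \approx 0.001, \quad \frac{991}{1003} \approx 0.988, \quad \frac{11}{1003} \approx 0.011\)

so the corrected probabilities stay close to the originals while eliminating the zero value.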

Strengths and Weaknesses

Strengths

  • Performance: A naive Bayesian classifier has comparable performance with decision tree and selected neural network classifiers.
  • Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct - prior knowledge can be combined with observed data.

Weaknesses

  • Assumption: it assumes class-conditional independence between attributes, which can cause a loss of accuracy when dependencies exist.

We can deal with dependencies through Bayesian Belief Networks.

k-Nearest Neighbor

  • All instances correspond to points in the n-D space
  • The nearest neighbor is defined in terms of Euclidean distance
  • Target function could be discrete or real valued
  • For discrete valued, k-NN returns the most common value among the k training examples nearest \(x_q\)
  • Voronoi diagram: decision surface induced by a 1-NN for a typical set of training examples

  • k-NN for real-valued prediction for a given unknown tuple
    • returns the mean value of the k nearest neighbors
  • Distance-weighted nearest neighbor algorithm
    • weight the contribution of each of the k neighbors according to their distance to the query \(x_q\)
    • give greater weight to closer neighbors \(w = \frac{1}{d(x_q, x_i)^2}\)
  • Robust to noisy data by averaging k-nearest neighbors
  • Curse of dimensionality: distance between neighbors could be dominated by irrelevant attributes
    • To overcome it, stretch the axes or eliminate the least relevant attributes

Selecting k (the number of neighbors)

  • Small k: potential for overfitting (high variance, low bias)
  • Big k: potential for bringing too many irrelevant points (low variance, high bias)

Important Information

  • lazy method: doesn’t have a learning phase
  • non-parametric: no assumptions about the underlying data distribution
  • outputs labels
  • an object is classified by a plurality vote of its neighbors and assigned to the class most common among its k nearest neighbors (k is a positive integer, normally small)

k-NN Parameters

Since k-NN relies on the votes of its neighbors, the parameters are:

  • k: how many neighbors
  • distance: which distance measure to use (a small sketch of these follows the list)
    • Minkowski distance (\(L_p\) norm or p-Norm)
    • Manhattan distance (\(L_1\))
    • Euclidean distance (\(L_2\))
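
A minimal numpy sketch of these distance measures (the two example points are arbitrary):

Code
import numpy as np

x_q = np.array([1.0, 2.0, 3.0])  # query point (arbitrary example)
x_i = np.array([4.0, 0.0, 3.0])  # stored training point (arbitrary example)

def minkowski(a, b, p):
    # Minkowski distance: the L_p norm of the difference vector
    return np.sum(np.abs(a - b) ** p) ** (1 / p)

print(minkowski(x_q, x_i, p=1))  # Manhattan (L_1): 5.0
print(minkowski(x_q, x_i, p=2))  # Euclidean (L_2): ~3.61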

Bias Prevalent within k-NN

  • Inductive bias: is the assumption of “nearby means alike” true?
  • Distance bias: feature scaling matters
  • High-Dimension bias: if there are too many dimensions, every point is far away from all the others and the meaning of “neighbor” is no longer valid

Support Vector Machine (SVM)

In a 2-dimensional example of linearly separable data, the two classes can be separated by a straight line. In n dimensions, we attempt to separate data with hyperplanes. Specifically, we want to find the maximum marginal hyperplane (MMH) and its margin of separation; the sides of the margin are the hyperplanes H1 and H2.

Linearly Separable

SVM turns classification into an optimization problem: find the separating hyperplane with the largest margin possible.

Hard-Margin SVM: an assumption that there is a separating hyperplane.

Soft-Margin SVM: rather than assume the hyperplane separates ALL data points, we assume it separates the MOST data points.

Slack variables (slack margin): allow some points to violate the margin so that a separating hyperplane is still possible.

Linearly Inseparable

When the data is linearly inseparable, we can transform it into a higher dimension and then search for the optimal linear separating hyperplane in the new space. Computing dot products on the transformed data is mathematically equivalent to applying a kernel function to the original data (the kernel trick).
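
A minimal sklearn sketch of this idea (the toy data from make_circles is an assumption for illustration, not the dataset used later on this page):

Code
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# toy 2-D data that is not linearly separable (one class encircles the other)
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=42)

# a linear kernel struggles; an RBF kernel implicitly maps to a higher dimension
linear_svc = SVC(kernel='linear').fit(X, y)
rbf_svc = SVC(kernel='rbf').fit(X, y)

print(linear_svc.score(X, y))  # typically near 0.5 on this data
print(rbf_svc.score(X, y))     # typically near 1.0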

Overview

  • Classification for both linear and nonlinear data
  • Transforms data to higher dimensions using nonlinear mapping
  • Searches for optimal linear separation hyperplanes in a new dimension
  • SVM finds this hyperplane using support vectors (“essential” training tuples) and margins (defined by the support vectors)

Rule-Based Classification

Essentially IF-THEN rules.

Assessment of Rules:

  • Coverage
    • quantifies the rule’s applicability across the dataset. It’s the fraction of the dataset where the rule is relevant.
    • \(coverage(R) = \frac{n_{covers}}{|D|}\)
  • Accuracy
    • measures the correctness of the rule when it’s applied
    • \(accuracy(R) = \frac{n_{correct}}{n_{covers}}\)

Example

Note that a rule may or may not be triggered by a given tuple. A decision tree can be converted into rules (one IF-THEN rule per root-to-leaf path), so it is one source of rule-based classification.
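
As a small sketch of how coverage and accuracy could be computed (the DataFrame, its columns, and the rule IF age <= 30 AND student = 'yes' THEN buys = 'yes' are hypothetical):

Code
import pandas as pd

# hypothetical class-labeled data D
D = pd.DataFrame({
    'age':     [25, 35, 28, 45, 22, 30],
    'student': ['yes', 'no', 'yes', 'no', 'yes', 'no'],
    'buys':    ['yes', 'no', 'no', 'no', 'yes', 'yes'],
})

# rule R: IF age <= 30 AND student = 'yes' THEN buys = 'yes'
covered = (D['age'] <= 30) & (D['student'] == 'yes')   # tuples the rule covers
correct = covered & (D['buys'] == 'yes')               # covered tuples it classifies correctly

print(covered.sum() / len(D))         # coverage(R) = n_covers / |D| = 0.5
print(correct.sum() / covered.sum())  # accuracy(R) = n_correct / n_covers ≈ 0.667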

Neural Network as a Classifier

Weaknesses

  • long training time
  • parameters determined empirically
  • poor interpretability

Strengths

  • high tolerance to noisy data
  • can classify untrained patterns
  • well-suited for continuous valued inputs & outputs
  • success on a wide array of real-world data
  • inherently parallel
  • rule extraction

Linear Regression

Mapping from independent attributes to continuous values.

Data: n independent objects

  • observed value: \(y_i\), \(i = 1, 2, \dots, n\)
  • p-dimensional attributes: \(x_i = (x_{i1}, x_{i2}, \dots, x_{ip})^T\), \(i = 1, 2, \dots, n\)

Model

  • weight vector: \(w = (w_1, w_2, \dots, w_p)\)
  • \(y_i = w^Tx_i + b\)

The weight vector, \(w\), and bias, \(b\), are the model parameters learned from the data.

Solution

  • Least Square Method
    • Cost/Loss Function: \(L(w, b) = \sum\limits_{i=1}^n (y_i - w^Tx_i - b)^2\)
    • Optimization Goal: \((w^*, b^*) = \underset{(w, b)}{argmin} \sum\limits_{i=1}^n (y_i - w^Tx_i - b)^2\)
    • Closed-Form Solution (for a single attribute, \(p = 1\); a numpy sketch follows this list)
      • \(w = \frac{\sum\limits_{i=1}^n x_i(y_i - \bar{y})}{\sum\limits_{i=1}^n x^2_i - \frac{1}{n}\left(\sum\limits_{i=1}^n x_i\right)^2}\)
      • \(b = \frac{1}{n} \sum\limits_{i=1}^n (y_i - wx_i)\)
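
A minimal numpy sketch of the closed-form solution for a single attribute (the toy data is generated only for illustration):

Code
import numpy as np

# toy 1-D data: y ≈ 2x + 1 plus noise (illustrative assumption)
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=50)
y = 2 * x + 1 + rng.normal(0, 0.5, size=50)

n = len(x)
# closed-form least squares for p = 1
w = np.sum(x * (y - y.mean())) / (np.sum(x ** 2) - np.sum(x) ** 2 / n)
b = np.mean(y - w * x)

print(w, b)  # should come out close to 2 and 1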

Logistic Regression

  • How we can solve classification problems by regression
  • Key Idea: We need to transform a real value Y into a probability value \(\in [0, 1]\)
  • Sigmoid Function (a differentiable activation function; see the sketch after this list)
    • \(\sigma(z) = \frac{1}{1 + e^{-z}} = \frac{e^z}{e^z + 1}\)
  • The projected value changes sharply around the zero point
  • \(\ln(\frac{y}{1-y}) = w^Tx + b\) (the log-odds, or logit)
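
A quick numeric look at the sigmoid (the values below are just evaluations of the formula above):

Code
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^{-z})
    return 1 / (1 + np.exp(-z))

print(sigmoid(np.array([-4, -1, 0, 1, 4])))
# ~[0.018, 0.269, 0.5, 0.731, 0.982] -- note the sharp change around z = 0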

Implementation (sklearn)

Import Libraries and Data

Code
# regular libraries
import numpy as np
import pandas as pd

# import general sklearn libraries
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, precision_score, recall_score

# import specific sklearn libraries

# decision tree classifier
from sklearn.tree import DecisionTreeClassifier
# sklearn logistic regression
from sklearn.linear_model import LogisticRegression
# sklearn knn
from sklearn.neighbors import KNeighborsClassifier 
# sklearn svm (could import svm and use svm.SVC, but we'll directly import SVC)
from sklearn.svm import SVC
# sklearn naive bayes
from sklearn.naive_bayes import GaussianNB
Code
# load data
df = pd.read_csv('data/diabetes.csv')

# separate predictor and response
X = df.drop('Outcome', axis = 1)
y = df['Outcome']

# separate train and test set
X_train, X_test, y_train, y_test =  train_test_split(X, y, test_size = 0.30, random_state = 42)

Decision Tree

Code
# train data
clf = DecisionTreeClassifier().fit(X_train, y_train)

# validate model through testing
y_pred = clf.predict(X_test)

# calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)  # use a new name so the imported confusion_matrix function isn't shadowed

print(f'Test Accuracy: {accuracy}')
print(f'Test Precision: {precision}')
print(f'Test Recall: {recall}')
print(f'Test F1: {f1}')
print(f'Test Confusion Matrix: \n{conf_matrix}')
Test Accuracy: 0.7142857142857143
Test Precision: 0.5760869565217391
Test Recall: 0.6625
Test F1: 0.6162790697674418
Test Confusion Matrix: 
[[112  39]
 [ 27  53]]

Logistic Regression

Note: logistic regression is run first on the raw (non-scaled) data, converted to NumPy arrays, and then on scaled data; the solver struggles to converge on the non-scaled data (see the warning below) and ultimately performs better with scaled data.

Non-Scaled Data

Convert the data to NumPy arrays:

Code
X_train_array = np.array(X_train)
X_test_array = np.array(X_test)
y_train_array = np.array(y_train)
y_test_array = np.array(y_test)
Code
# train data
clf = LogisticRegression().fit(X_train_array, y_train_array)

# validate model through testing
y_pred = clf.predict(X_test_array)

# calculate metrics
accuracy = accuracy_score(y_test_array, y_pred)
precision = precision_score(y_test_array, y_pred)
recall = recall_score(y_test_array, y_pred)
f1 = f1_score(y_test_array, y_pred)

print(f'Test Accuracy: {accuracy}')
print(f'Test Precision: {precision}')
print(f'Test Recall: {recall}')
print(f'Test F1: {f1}')
Test Accuracy: 0.7402597402597403
Test Precision: 0.625
Test Recall: 0.625
Test F1: 0.625
C:\Users\carlj\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:460: ConvergenceWarning:

lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

Scaled Data

Scale Data

Code
scaler = StandardScaler()
X_train_normal = scaler.fit_transform(X_train)
X_test_normal = scaler.transform(X_test)  # transform (not fit_transform): scale the test set with the training statistics
Code
# train data
clf = LogisticRegression().fit(X_train_normal, y_train)

# validate model through testing
y_pred = clf.predict(X_test_normal)

# calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f'Test Accuracy: {accuracy}')
print(f'Test Precision: {precision}')
print(f'Test Recall: {recall}')
print(f'Test F1: {f1}')
Test Accuracy: 0.7445887445887446
Test Precision: 0.64
Test Recall: 0.6
Test F1: 0.6193548387096774

K-Nearest Neighbors (KNN)

Code
# train data
clf = KNeighborsClassifier().fit(X_train, y_train)

# validate model through testing
y_pred = clf.predict(X_test)

# calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f'Test Accuracy: {accuracy}')
print(f'Test Precision: {precision}')
print(f'Test Recall: {recall}')
print(f'Test F1: {f1}')
Test Accuracy: 0.6883116883116883
Test Precision: 0.5487804878048781
Test Recall: 0.5625
Test F1: 0.5555555555555556

Support Vector Machines (SVM)

Code
# train data
clf = SVC().fit(X_train, y_train)

# validate model through testing
y_pred = clf.predict(X_test)

# calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f'Test Accuracy: {accuracy}')
print(f'Test Precision: {precision}')
print(f'Test Recall: {recall}')
print(f'Test F1: {f1}')
Test Accuracy: 0.7359307359307359
Test Precision: 0.6610169491525424
Test Recall: 0.4875
Test F1: 0.5611510791366906
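
Naive Bayes (GaussianNB)

GaussianNB was imported above but not demonstrated; a minimal sketch following the same pattern as the other classifiers (its metrics are not reproduced here):

Code
# train data
clf = GaussianNB().fit(X_train, y_train)

# validate model through testing
y_pred = clf.predict(X_test)

# calculate metrics
print(f'Test Accuracy: {accuracy_score(y_test, y_pred)}')
print(f'Test Precision: {precision_score(y_test, y_pred)}')
print(f'Test Recall: {recall_score(y_test, y_pred)}')
print(f'Test F1: {f1_score(y_test, y_pred)}')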

The confusion matrix wasn’t computing properly in a few of the methods above (in the original run the name confusion_matrix was reassigned, shadowing the imported function), and was thus removed for the purposes of this website.

GridSearchCV

GridSearchCV can be used to test multiple parameters at once.

Code
# decision tree parameters to run through
tree_parameters = {'criterion': ('gini', 'entropy', 'log_loss'),
              'splitter': ('best', 'random'),
              'max_depth': (None, 2, 4, 6, 8, 10),
              'max_features': (None, 'sqrt', 'log2'),
              'class_weight': (None, 'balanced')}
              
# create gridsearchcv object
tree_clf_hyper = GridSearchCV(DecisionTreeClassifier(), tree_parameters)

# train the model
tree_clf_hyper.fit(X_train, y_train)

# results
tree_clf_hyper_results = pd.DataFrame(tree_clf_hyper.cv_results_)
tree_clf_hyper_results
mean_fit_time std_fit_time mean_score_time std_score_time param_class_weight param_criterion param_max_depth param_max_features param_splitter params split0_test_score split1_test_score split2_test_score split3_test_score split4_test_score mean_test_score std_test_score rank_test_score
0 0.005719 0.000966 0.001904 0.000525 None gini None None best {'class_weight': None, 'criterion': 'gini', 'm... 0.768519 0.703704 0.757009 0.719626 0.635514 0.716874 0.047071 65
1 0.002301 0.000604 0.001882 0.000231 None gini None None random {'class_weight': None, 'criterion': 'gini', 'm... 0.740741 0.638889 0.700935 0.663551 0.672897 0.683403 0.034874 160
2 0.002790 0.000282 0.001506 0.000444 None gini None sqrt best {'class_weight': None, 'criterion': 'gini', 'm... 0.750000 0.703704 0.710280 0.644860 0.691589 0.700087 0.033854 126
3 0.001910 0.000184 0.001261 0.000386 None gini None sqrt random {'class_weight': None, 'criterion': 'gini', 'm... 0.768519 0.750000 0.672897 0.719626 0.663551 0.714919 0.041304 76
4 0.002949 0.000457 0.001405 0.000481 None gini None log2 best {'class_weight': None, 'criterion': 'gini', 'm... 0.768519 0.675926 0.738318 0.691589 0.700935 0.715057 0.033706 72
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
211 0.002653 0.000374 0.001172 0.000203 balanced log_loss 10 None random {'class_weight': 'balanced', 'criterion': 'log... 0.731481 0.694444 0.794393 0.738318 0.719626 0.735652 0.032954 17
212 0.003248 0.000428 0.001005 0.000328 balanced log_loss 10 sqrt best {'class_weight': 'balanced', 'criterion': 'log... 0.712963 0.740741 0.710280 0.738318 0.672897 0.715040 0.024517 73
213 0.001813 0.000410 0.001435 0.000451 balanced log_loss 10 sqrt random {'class_weight': 'balanced', 'criterion': 'log... 0.601852 0.666667 0.579439 0.710280 0.588785 0.629405 0.050666 212
214 0.003050 0.000345 0.001443 0.000641 balanced log_loss 10 log2 best {'class_weight': 'balanced', 'criterion': 'log... 0.620370 0.666667 0.700935 0.738318 0.682243 0.681706 0.038893 168
215 0.002424 0.000581 0.001115 0.000475 balanced log_loss 10 log2 random {'class_weight': 'balanced', 'criterion': 'log... 0.759259 0.685185 0.700935 0.757009 0.728972 0.726272 0.029565 40

216 rows × 18 columns

We can directly call the best parameters:

Code
# decision tree with best parameters
clf = DecisionTreeClassifier(**tree_clf_hyper.best_params_).fit(X_train, y_train)

# prediction
y_pred = clf.predict(X_test)

# results (for a multiclass target, set average='weighted' in precision/recall/F1; the default works for this binary outcome)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f'Test Accuracy: {accuracy}')
print(f'Test Precision: {precision}')
print(f'Test Recall: {recall}')
print(f'Test F1: {f1}')
Test Accuracy: 0.7619047619047619
Test Precision: 0.6811594202898551
Test Recall: 0.5875
Test F1: 0.6308724832214765