Decision tree induction is mostly used in classification (i.e. decision tree classification) and is known as a greedy algorithm.
A decision tree is a flowchart-like tree structure, where:
Root Node: the topmost node in a tree
Internal Node: denotes a test on an attribute
Branch: represents an outcome of the test
Leaf Node (terminal node): holds a class label
In splitting attributes:
Discrete Values: branch directly on each known value of the attribute
Continuous Values: two branches based on a split_point (\(A \le\) split_point and \(A >\) split_point)
Discrete Values (when a binary tree must be produced): two branches based on a splitting_subset (is \(A\) in the splitting_subset or not)
Attributes are selected at each node based on a heuristic or statistical measure. An attribute selection measure (splitting rules) is a heuristic (enabling someone to discover or learn something for themselves) for selecting the splitting criterion that “best” separates a given data partition of class-labeled training tuples into individual classes.
Some common selection measures:
information gain - Iterative Dichotomiser 3 (ID3): biased toward multi-valued attributes
gain ratio (C4.5): tends to prefer unbalanced splits
Gini impurity: biased toward multi-valued attributes; favors equal-sized & pure partitions; not great when the number of classes is large
Information Gain
The purer the dataset:
the less information we need to remember
the easier it is to make predictions
Entropy measures the impurity of a dataset.
We call \(p_i = \frac{|C_{i,D}|}{|D|}\) the probability that a tuple belongs to class \(C_i\), \(m\) the number of classes, and \(v\) the number of partitions or subsets.
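With these symbols, the standard ID3 quantities are the entropy (expected information) of \(D\), the expected information after splitting \(D\) on attribute \(A\) into \(v\) partitions, and the resulting information gain:
\[ Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i) \]
\[ Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j) \]
\[ Gain(A) = Info(D) - Info_A(D) \]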
Naive Bayes
Naive Assumption: class conditional independence (no dependence between attributes).
The Zero-Probability Problem (Laplacian Correction)
The prediction requires each conditional probability to be non-zero: if a single \(P(x_k|C_i) = 0\), then \(P(X|C_i) = \prod_{k=1}^n P(x_k|C_i) = 0\).
We account for this by applying the Laplacian correction (Laplacian estimator): add 1 (or a small integer) to each case’s count.
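As a small worked example with hypothetical counts: suppose class \(C_i\) contains 1000 training tuples and an attribute takes three values occurring 0, 990, and 10 times within \(C_i\). Adding 1 to each count (and therefore 3 to the total) removes the zero while barely changing the other estimates:
\[ P_1 = \frac{0+1}{1000+3} = \frac{1}{1003}, \quad P_2 = \frac{990+1}{1000+3} = \frac{991}{1003}, \quad P_3 = \frac{10+1}{1000+3} = \frac{11}{1003} \]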
Strengths and Weaknesses
Strengths
Performance: A naive Bayesian classifier has comparable performance with decision tree and selected neural network classifiers.
Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct - prior knowledge can be combined with observed data.
Weaknesses
Assumption: class conditional independence between attributes, so accuracy is lost when attributes are in fact dependent.
We can deal with dependencies through Bayesian Belief Networks.
k-Nearest Neighbor
All instances correspond to points in the n-D space
The nearest neighbor is defined in terms of Euclidean distance
Target function could be discrete or real valued
For discrete valued, k-NN returns the most common value among the k training examples nearest \(x_q\)
Voronoi diagram: decision surface induced by a 1-NN for a typical set of training examples
k-NN for real-valued prediction for a given unknown tuple
returns the mean value of the k nearest neighbors
Distance-weighted nearest neighbor algorithm
weight the contribution of each of the k neighbors according to their distance to the query \(x_q\)
give greater weight to closer neighbors \(w = \frac{1}{d(x_q, x_i)^2}\)
Robust to noisy data by averaging k-nearest neighbors
Curse of dimensionality: distance between neighbors could be dominated by irrelevant attributes
To overcome it, stretch the axes or eliminate the least relevant attributes
Selecting k (the number of neighbors)
Small k: potential for overfitting (high variance, low bias)
Big k: potential for bringing too many irrelevant points (low variance, high bias)
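A common way to choose k in practice is cross-validation. Below is a minimal sketch using sklearn's GridSearchCV (imported in the implementation section), assuming X_train and y_train come from a train/test split like the one shown later:

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# X_train, y_train assumed from a prior train/test split
# search k = 1..30 with 5-fold cross-validation on the training set
param_grid = {'n_neighbors': list(range(1, 31))}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

print(grid.best_params_)   # the k with the best mean cross-validated accuracy
print(grid.best_score_)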
Important Information
lazy method: doesn’t have a learning phase
non-parametric: no assumptions about the data
outputs labels
object classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, normally small)
k-NN Parameters
Since k-NN relies on the votes of neighbors, the parameters are (see the sketch after this list):
k: how many neighbors
distance: which distance measure to use
Minkowski distance (\(L_p\) norm or p-Norm)
Manhattan distance (\(L_1\))
Euclidean distance (\(L_2\))
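A minimal sketch of these parameters with sklearn's KNeighborsClassifier, assuming X_train, y_train, and X_test from a prior split: p selects the Minkowski norm (p=1 is Manhattan, p=2 is Euclidean), and weights='distance' is sklearn's built-in distance weighting (inverse of distance rather than the inverse-squared form above):

from sklearn.neighbors import KNeighborsClassifier

# X_train, y_train, X_test assumed from a prior train/test split
# k = 5 neighbors, Euclidean distance (Minkowski with p = 2),
# neighbors weighted by the inverse of their distance to the query point
knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2, weights='distance')
knn.fit(X_train, y_train)        # lazy learner: fit mainly stores the training data
y_pred = knn.predict(X_test)     # weighted plurality vote of the 5 nearest neighbors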
Bias Prevalent within k-NN
Inductive bias: is the assumption of “nearby means alike” true?
Distance bias: feature scaling matters
High-Dimension bias: if there are too many dimensions, every data point is far away from all the others and the meaning of “neighbor” is no longer valid
Support Vector Machine (SVM)
Above is a 2-dimensional example of linearly separable data. In n-dimensions, we attempt to separate data with hyperplanes. Specifically, we want to find the maximum marginal hyperplane (MMH), and then set a margin of separation. The sides of the margin are called hyperplanes H1 and H2.
Linearly Separable
SVM turns classification into an optimization problem: find the separating hyperplane with the largest margin possible.
Hard-Margin SVM: an assumption that there is a separating hyperplane.
Soft-Margin SVM: rather than assume the hyperplane separates ALL data points, we assume it separates the MOST data points.
Slack variables: allow some points to violate the margin (or be misclassified) so that a separating hyperplane is still possible.
Linearly Inseparable
When the data is linearly inseparable, we can transform our data into a higher dimension, then search for the optimal linear separating hyperplane in the new space. Computing the dot product on the transformed data is mathematically equivalent to applying a kernel function to the original data.
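A minimal sketch with sklearn's SVC, assuming X_train and y_train from a prior split: C is the soft-margin penalty on slack, and the kernel choice covers the linearly inseparable case without explicitly computing the higher-dimensional mapping:

from sklearn.svm import SVC

# X_train, y_train assumed from a prior train/test split
# soft-margin linear SVM: larger C penalizes margin violations (slack) more heavily
linear_svm = SVC(kernel='linear', C=1.0)

# nonlinear SVM: the RBF kernel implicitly works in a higher-dimensional space
rbf_svm = SVC(kernel='rbf', C=1.0, gamma='scale')
rbf_svm.fit(X_train, y_train)

# the support vectors are the "essential" training tuples that define the margin
print(rbf_svm.support_vectors_.shape)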
Overview
Classification for both linear and nonlinear data
Transforms data to higher dimensions using nonlinear mapping
Searches for optimal linear separation hyperplanes in a new dimension
SVM finds this hyperplane using support vectors (“essential” training tuples) and margins (defined by the support vectors)
Rule-Based Classification
Essentially IF-THEN rules.
Assessment of Rules:
Coverage
quantifies the rule’s applicability across the dataset. It’s the fraction of the dataset where the rule is relevant.
\(coverage(R) = \frac{n_{covers}}{|D|}\)
Accuracy
measures the correctness of the rule when it’s applied
\(accuracy(R) = \frac{n_{correct}}{n_{covers}}\)
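As a sketch of these two measures in code, using the diabetes data loaded in the implementation section and a hypothetical rule R: IF Glucose > 150 THEN Outcome = 1 (the Glucose column is an assumption about that file):

import pandas as pd

df = pd.read_csv('data/diabetes.csv')       # same file as in the implementation section

# hypothetical rule R: IF Glucose > 150 THEN Outcome = 1
covered = df['Glucose'] > 150               # tuples whose condition the rule satisfies
correct = covered & (df['Outcome'] == 1)    # covered tuples the rule labels correctly

coverage = covered.sum() / len(df)          # n_covers / |D|
accuracy = correct.sum() / covered.sum()    # n_correct / n_covers
print(f"coverage(R) = {coverage:.3f}, accuracy(R) = {accuracy:.3f}")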
Example
Note that, for a given tuple, a rule may or may not be triggered. A decision tree can be expressed as rule-based classification by extracting one IF-THEN rule per root-to-leaf path.
Neural Network as a Classifier
Weaknesses
long training time
parameters determined empirically
poor interpretability
Strengths:
high tolerance to noisy data
can classify untrained patterns
well-suited for continuous valued inputs & outputs
success on a wide array of real-world data
inherently parallel
rule extraction
Linear Regression
Linear regression maps independent attributes to a continuous value, \(y = w^Tx + b\).
For binary classification, the ideal output changes sharply around the zero point (a unit step); logistic regression instead models the log-odds with a smooth function:
\(ln(\frac{y}{1-y}) = w^Tx + b\)
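Solving the log-odds form for \(y\) gives the equivalent logistic (sigmoid) form, whose output lies between 0 and 1:
\[ y = \frac{1}{1 + e^{-(w^Tx + b)}} \]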
Implementation (sklearn)
Import Libraries and Data
Code
# regular libraries
import numpy as np
import pandas as pd

# import general sklearn libraries
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, precision_score, recall_score

# import specific sklearn libraries
# decision tree classifier
from sklearn.tree import DecisionTreeClassifier
# sklearn logistic regression
from sklearn.linear_model import LogisticRegression
# sklearn knn
from sklearn.neighbors import KNeighborsClassifier
# sklearn svm (could import svm and use svm.SVC, but we'll directly import SVC)
from sklearn.svm import SVC
# sklearn naive bayes
from sklearn.naive_bayes import GaussianNB
Code
# load data
df = pd.read_csv('data/diabetes.csv')

# separate predictor and response
X = df.drop('Outcome', axis=1)
y = df['Outcome']

# separate train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
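A minimal sketch of fitting and evaluating a first classifier, assuming the DecisionTreeClassifier and metric functions imported above; the numbers reported below come from the notebook's own run, not necessarily from this exact sketch:

# fit a decision tree (criterion='entropy' uses information gain; 'gini' is the default)
dt = DecisionTreeClassifier(criterion='entropy', random_state=42)
dt.fit(X_train, y_train)
y_pred = dt.predict(X_test)

print('Test Accuracy:', accuracy_score(y_test, y_pred))
print('Test Precision:', precision_score(y_test, y_pred))
print('Test Recall:', recall_score(y_test, y_pred))
print('Test F1:', f1_score(y_test, y_pred))
print('Test Confusion Matrix:')
print(confusion_matrix(y_test, y_pred))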
Test Accuracy: 0.7142857142857143
Test Precision: 0.5760869565217391
Test Recall: 0.6625
Test F1: 0.6162790697674418
Test Confusion Matrix:
[[112 39]
[ 27 53]]
Logistic Regression
Note that with non-scaled data, logistic regression works better when given plain array-type data, but it ultimately performs better with scaled data (see the convergence warning below).
Test Accuracy: 0.7402597402597403
Test Precision: 0.625
Test Recall: 0.625
Test F1: 0.625
C:\Users\carlj\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:460: ConvergenceWarning:
lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
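Following the warning's own advice, a minimal sketch of the two suggested fixes, using the StandardScaler and LogisticRegression imported earlier (results will differ from the numbers above):

# option 1: scale the features before fitting (generally the better fix)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
logreg_scaled = LogisticRegression()
logreg_scaled.fit(X_train_scaled, y_train)

# option 2: give the lbfgs solver more iterations on the unscaled data
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)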