https://github.com/zhampel/intro-ml-iihe
https://zhampel.github.io/intro-ml-iihe
E-mail: zhampel@wipac.wisc.edu
def spam_filter(email):
"""Function that labels an email as 'spam' or 'not spam'
"""
if 'Act now!' in email.contents:
label = 'spam'
elif 'hotmail.com' in email.sender:
label = 'spam'
elif email.contents.count('$') > 20:
label = 'spam'
else:
label = 'not spam'
return label
"Field of study that gives computers the ability to learn without being explicitly programmed" — Arthur Samuel (1959)
"A machine-learning system is trained rather than explicitly programmed. It’s presented with many examples relevant to a task, and it finds statistical structure in these examples that eventually allows the system to come up with rules for automating the task." — Francois Chollet, Deep Learning with Python
Supervised Learning
Unsupervised Learning
Reinforcement Learning
Requires labelled data set (e.g. MC truth)
Direct feedback on model performance while training
Application to two kinds of problems
No labels on data set (e.g. no MC truth)
No direct feedback while training model
Identify underlying structure in data
Application to two main subfields
Used for training decision making process (e.g. playing chess)
Learn a series of actions by interacting with environment
Requires a reward system to optimize, improve performance
From labelled data, learn a mapping from input data to desired output
The goal is to generalize well to future, unseen data
Application to two kinds of problems
Requires labelled data set (e.g. MC truth)
Direct feedback on model performance while training
Application to two kinds of problems
Image source: Artificial Neurons and the McCulloch-Pitts Model by Sebastian Raschka
Desired Components
Desired Characteristics
Simple mathematical functions
Building block
Components
First take input information $x_i$ and combine with weights $w_i$
Combine information into single variable, $z$
$$ \Sigma \rightarrow z = \sum_{i=0}^{n} w_i x_i $$
so that $\phi$ can activate based on its value.
As we'll see, it's really nice if $\phi(z)$ is continuous & differentiable, and acts linear near origin.
Common forms of $\phi(z)$ used to squish input into some range (monotonically).
Common forms of $\phi(z)$ used to squish input into some range (monotonically):
Based on activation, make a decision $\mathcal{D}$ to provide an output $y$.
Form of $\mathcal{D}$ is just a threshold function of $\phi(z)$ and thus of $z$, so:
$$ \mathcal{D} = \begin{cases} 1, & \text{if } z \ge b\\ \{-1,0\}, & \text{if } z < b \end{cases} $$
But since $b$ is some threshold value, we can absorb in our definition of $w$:
$$ z = \sum_{i=0}^{n} w_i x_i - b = \sum_{i=0}^{n+1} w_i x_i $$
where $w_0 = -b$ and $x_0 = 1$.
So now we can write $\mathcal{D}$ as:
$$ \mathcal{D} = \begin{cases} 1, & \text{if } z \ge 0\\ \{-1,0\}, & \text{if } z < 0 \end{cases} $$
In this example, we are doing binary classification, so we can choose either $-1$ or $0$ as the latent state depending on the problem setup.
For regression for example, we could $\mathcal{D} = \phi(z) = \frac{1}{1+e^{-z}}$ (Logistic), providing class-membership probability.
Forms of $\mathcal{D}$ are just threshold functions.
Does $z$ and thus $\phi(z)$ reach a value sufficient to fire, i.e. $\mathcal{D}=+1$?
A quick aside for the Logistic function.
Consider two possible outcomes $y\in \{a,b\}$, with probabilities $\{p, 1-p\}$.
The ratio $\frac{p}{1-p} \in (0,\infty)$ provides the odds for event $a$.
Take the logarithm: $\log \frac{p}{1-p} \in \mathbb{R}$.
Recall that $z = \sum_i w_i x_i \in \mathbb{R}$.
Consider two possible outcomes $y\in \{a,b\}$, with probabilities $\{p, 1-p\}$.
The ratio $\frac{p}{1-p} \in (0,\infty)$ provides the odds for event $a$.
Take the logarithm: $\log \frac{p}{1-p} \in \mathbb{R}$.
Recall that $z = \sum_i w_i x_i \in \mathbb{R}$.
So let's squish our z with this function: $$ \log \frac{p}{1-p} = z $$
$$ \frac{1-p}{p} = e^{-z} $$
$$ 1-p = p \ e^{-z} $$
$$ p(y=a|\mathbf{x}) = \phi(z) = \frac{1}{1+e^{-z}} $$
$\phi_{\text{L}}(z) = \frac{1}{1+e^{-z}}$
Now we can define $\mathcal{D}$ to provide either probability $p$ or request a binary decision ${0, 1}$.
So now we have the mechanics of the perceptron itself.
Given inputs $X$, weights $W$, activation and decision functions, we can get an output $y$
So now, how do we train it to learn something?
We need two main things to learn:
Quantify how good our perceptron is doing
Ability to tune our perceptron based on this performance
In supervised learning → labelled data sets, $y_{\text{true}}$.
Consider evaluating a cost function $J$:
$$ J(Y_{\text{true}}, Y) \propto \sum_{\mu=0}^{M} (y_{\text{true}}^\mu - y^\mu)^2 $$
over a data set with $M$ samples, where $y^\mu$ is the perceptron output for sample $\mu$.
Here just the mean squared error (MSE).
$J$ is a metric we want to minimize!
In general referred to as an objective function.
MSE: $$ \sum_{i=0}^{M} (y_{\text{true}}^{\mu} - y^{\mu})^2 $$
Binary cross-entropy. Model predicts $p$ while true value is $t$. $$ -t \log (p) - (1-t) \log(1-p) $$
Categorical cross-entropy. Generalization for multiclass logarithmic loss. Target $t_{ij}$, prediction $p_{ij}$. $$ -\sum_{j} t_{ij} \log( p_{ij}) $$
Only free paramerters are the weights $w_i$. Thus,
$$ \frac{\partial J}{\partial w_i} \bigg\rvert _{w_{\text{min}}} = 0 $$
$$ \begin{align} \frac{\partial J}{\partial w_i} &= \frac{\partial}{\partial w_i} \sum_{\mu=0}^{M} (y_{\text{true}}^\mu - y^\mu)^2 \\ &= \frac{\partial}{\partial w_i} \sum_{\mu=0}^{M} (y_{\text{true}}^\mu - y^{\mu})^2 \\ &= \sum_{\mu=0}^{M} \frac{\partial}{\partial w_i} (y_{\text{true}}^\mu - y^{\mu})^2 \end{align} $$
$$ y \rightarrow \phi(z) $$
$$ \begin{align} \frac{\partial J}{\partial w_i} &= \sum_{\mu=0}^{M} \frac{\partial}{\partial w_i} \left(y_{\text{true}}^\mu - \phi^{\mu}(z)\right) ^2 \\ &= \sum_{\mu=0}^{M} (-2) \left( y_{\text{true}}^\mu - \phi^{\mu}(z) \right) \frac{\partial \phi^{\mu}(z)}{\partial w_i} \\ &= \sum_{\mu=0}^{M} (-2) \left( y_{\text{true}}^\mu - \phi^{\mu}(z) \right) \frac{\partial \phi^{\mu}(z)}{\partial z} \frac{\partial z}{\partial w_i} \\ &= \sum_{\mu=0}^{M} (-2) \left( y_{\text{true}}^\mu - \phi^{\mu}(z) \right) \frac{\partial \phi^{\mu}(z)}{\partial z} x_i \end{align} $$
$$ \frac{\partial J}{\partial w_i} \bigg\rvert _{w_{\text{min}}} = \sum_{\mu=0}^{M} \left( y_{\text{true}}^\mu - \phi^{\mu}(z) \right) \frac{\partial \phi^{\mu}(z)}{\partial z}\bigg\rvert _{w_{\text{min}}} x_i = 0 $$
So we have these pieces: $$ x_i \, \, , \, \, \, \phi'= \frac{\partial \phi}{\partial z} $$
and we can evaluate this: $$ \text{Error} = \left( y_{\text{true}}^\mu - y^\mu \right) = \left( y_{\text{true}}^\mu - \phi^{\mu}(z) \right) $$
Woopdie-doo...
Looks a little nasty to evaluate.
So, what are we going to do with these to find $w_{\text{min}}$?
We're going to inform our weights via the Error to search for the $w_{\text{min}}$.
Consider small displacement of a function
For small $\delta w > 0$, $$ J(w+\delta w) \approx J(w) + J'(x) \ \delta w $$ so $$ J\left(w-\delta w \ \mathrm{sgn} \ J'(w) \right) \leq J(w) \, . $$ |
So let's update the weights $w$ via something like
$$
w \leftarrow w + \delta w
$$
where
$$
\delta w = - \eta \ J'(w)
$$
i.e., follow negative gradient to find minimum of $J(w)$, at a learning rate $\eta$.
Algorithm:
What we've seen so far is called Batch Gradient Descent.
Three main methods:
This first hyperparameter $\eta$ determines how far $\delta w$ will jump searching for $J_{\text{min}}$
Of course, we are not guaranteed a global minimum via $\frac{\partial J}{\partial w_i} \bigg\rvert _{w_{\text{min}}} = 0$.
Consider deep learning networks with $O(>10^6)$ weights! Choose $\eta$ wisely.
One possible solution is to use an adaptive rate.
Example: Annealing $$ \eta = \frac{c_1}{k+c_2} $$
where $c_i$ are constants and $k$ is the current iteration.
Set of 150 samples (individual flowers) that have 4 features: sepal length, sepal width, petal length, and petal width (all in cm). Collected in 1936 by R. Fisher.
Each sample is labeled by its species: Iris Setosa, Iris Versicolour, Iris Virginica
Task is to develop a model that predicts iris species
Dataset freely available from the UCI Machine Learning Repository
Consider just
By eye, we can separate these.
Can our perceptron learn a decision boundary for classification?
Perceptron:
How do we know when the perceptron has learned?
Look at evolution of Error and $J(w)$ with training iteration.
Perceptron:
Decision Boundary with Unit Step
Linearly separable.
What happens if we 'mis-classify' our training set, i.e. no longer linearly separable?
No convergence for non-linearly separable (mislabelled) data set.
Perceptron:
Decision Boundary with AdaLine
Perceptron:
Perceptron:
Much faster convergence just by preprocessing features!
Notebook with some animations.
Image source: Underfitting vs. Overfitting scikit-learn.
Image source: Underfitting vs. Overfitting scikit-learn.
Splitting up labelled data set
Further test performance on entirely unseen data
Image source: Python Machine Learning by Sebastian Raschka
$$ \sum_i w_i x^{\text{upper}}_i = +1 \\ \sum_i w_i x^{\text{lower}}_i = -1 \\ \sum_i w_i (x^{\text{upper}}_i - x^{\text{lower}}_i) = 2 \\ $$ Normalize: $$ \frac{\sum_i w_i (x^{\text{upper}}_i - x^{\text{lower}}_i)}{||\mathbf{w}||} = \frac{2}{||\mathbf{w}||} $$
So maximize $\frac{2}{||\mathbf{w}||}$ subject to classification conditions: $$ y_k \sum_i w_i x^k_i \ge 1 \text{for } k = {1...N} $$
Actually easier to minimize $||\mathbf{w}||$ -> quadratic optimization problem
Introduce soft margins for non-linearly separable data
Image source: Python Machine Learning by Sebastian Raschka
where
IG is the different between the impurity of the parent node and the sum of the child node impurities.
Most libraries implement binary splitting: $$ \text{IG}(D_p,f) = I(D_p) - \frac{N_\text{left}}{N_p}I(D_\text{left}) - \frac{N_\text{right}}{N_p}I(D_\text{right}) $$
Common impurity measures $I(t)$ at node $t$:
where $p(i|t)$ is the fraction of samples belonging to class $i$ at node $t$.
Can overtrain easily so need pruning to limit max. depth of tree.
Adding more layers means adding more indices.
Optimization algorithm (backpropagation) remains the same.
Just start from the outputs, and move backwards to fine tune the weights!
Image Analysis
Event Analysis
Image source: [Understanding CNN for NLP](http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/) Denny Britz
Layers of convolutions, and samplings...
Image source: [Introduction to CNN for Vision Tasks](https://pythonmachinelearning.pro/introduction-to-convolutional-neural-networks-for-vision-tasks/)
Following the first few sections of Deep Learning with Keras
A few things prior to running scripts.
Install somethings with pip
pip install numpy scipy scikit-learn pillow h5py
pip install Theano
pip install --upgrade tensorflow
pip install keras
Now test your installation
import theano
import theano.tensor as T
x = T.dmatrix('x')
s = 1 / (1 + T.exp(-x))
logistic = theatno.function([x], s)
logistic([[0, 1], [-1. -1]])
Let's go train a single later NN on the MNIST data set: 70k handwritten (labelled) digits.
We're going to define:
from __future__ import print_function
import numpy as np
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras.optimizers import SGD
from keras.utils import np_utils
np.random.seed(1671) # for reproducibility
# Network & training
N_EPOCH = 5 #200
BATCH_SIZE = 128
VERBOSE = 1
N_CLASSES = 10 # No. outputs = No. digits
OPTIMIZER = SGD() # SGD optimizer
N_HIDDEN = 128 # No. hidden nodes in layer
VALIDATION_SPLIT = 0.2 # fraction of training set used for validation
LOSS = 'categorical_crossentropy' #categorical_crossentropy, binary_crossentropy, mse
METRIC = 'accuracy' #accuracy, precision, recall
# Data -> shuffled and split between training and test sets
(X_train, y_train), (X_test, y_test) = mnist.load_data()
# X_train is 60,000 rows of 28x28 values -> to be reshaped to 60,000 x 784
RESHAPED = 784
X_train = X_train.reshape(60000,RESHAPED)
X_test = X_test.reshape(10000,RESHAPED)
# Need to make float32 for GPU use
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
# Normalize grey-scale values
X_train /= 255
X_test /= 255
print(X_train.shape[0], ' training samples')
print(X_test.shape[0], ' testing samples')
# Convert class vectors to binary class matrices
Y_train = np_utils.to_categorical(y_train, N_CLASSES)
Y_test = np_utils.to_categorical(y_test, N_CLASSES)
# N_CLASSES outputs, final stage is normalized via softmax
model = Sequential()
model.add(Dense(N_CLASSES, input_shape=(RESHAPED,)))
model.add(Activation('softmax'))
model.summary()
# Compile the model
model.compile(loss=LOSS,optimizer=OPTIMIZER, metrics=[METRIC])
# Train the model
history = model.fit(X_train, Y_train, \
batch_size=BATCH_SIZE,\
epochs=N_EPOCH,\
verbose=VERBOSE,\
validation_split=VALIDATION_SPLIT)
# Validation of the model with test set
score = model.evaluate(X_test, Y_test, verbose=VERBOSE)
print("Test score: ", score[0])
print("Test accuracy: ", score[1])
To add more hidden layers, add the following after the first input layer:
# 2nd layer
model.add(Dense(N_HIDDEN))
model.add(Activation('relu')) # Can use other activation functions here
# 3rd layer
model.add(Dense(N_HIDDEN))
model.add(Activation('relu')) # Can use other activation functions here
We also don't need so many iterations, so can try N_EPOCH = 20