Introduction to Machine Learning
Workshop

Zigfried Hampel-Arias¶

IIHE -- Brussels, BE¶

20 April, 2018

Program for Today¶

Intro to Machine Learning¶

Types of Learning
Machine Learning Basics
Building More Complex Models

Keras Workshop¶

Simple Neural Nets
Hyperparameters

Classical Programming¶

Set of rules to accomplish a task
'if' this then 'do that'

def spam_filter(email):
    """Function that labels an email as 'spam' or 'not spam'
    """
    if 'Act now!' in email.contents:
        label = 'spam'
    elif 'hotmail.com' in email.sender:
        label = 'spam'
    elif email.contents.count('$') > 20:
        label = 'spam'
    else:
        label = 'not spam'

    return label

Machine Learning¶

"Field of study that gives computers the ability to learn without being explicitly programmed" — Arthur Samuel (1959)
"A machine-learning system is trained rather than explicitly programmed. It’s presented with many examples relevant to a task, and it finds statistical structure in these examples that eventually allows the system to come up with rules for automating the task." — Francois Chollet, Deep Learning with Python

Three Types of Learning¶

Supervised Learning
Unsupervised Learning
Reinforcement Learning

Supervised Learning¶

Requires labelled data set (e.g. MC truth)
Direct feedback on model performance while training
Application to two kinds of problems
- Classification -> fixed output types
- Regression -> continuous output

png

Unsupervised Learning¶

No labels on data set (e.g. no MC truth)
No direct feedback while training model
Identify underlying structure in data
Application to two main subfields
- Clustering -> organize data stack into meaningful subgroups
- Dimensionality reduction -> preprocessing of large data sets

png

Reinforcement Learning¶

Used for training decision making process (e.g. playing chess)
Learn a series of actions by interacting with environment
Requires a reward system to optimize, improve performance

png

General Scheme for Building ML Systems¶

simple_perceptron

Supervised Learning¶

From labelled data, learn a mapping from input data to desired output
The goal is to generalize well to future, unseen data
Application to two kinds of problems
- Classification -> fixed output types
- Regression -> continuous output

png

Supervised Learning¶

Choose ML algorithm appropriate for problem

Machine Learning Basics¶

Inspired by nature
The Perceptron
Learning from the data

Neuronal Inspiration¶

overview

Image source: Artificial Neurons and the McCulloch-Pitts Model by Sebastian Raschka

Towards an Artificial Neuron¶

Desired Components

Set of input accepting 'dendrites'
Inner body that 'activates' based on input signals
Inner function that 'decides' whether to fire
Output signal

Towards an Artificial Neuron¶

Desired Characteristics

Simple mathematical functions
- Easy to evaluate
- Differentiable
Building block
- Single unit
- Can be stacked

The Perceptron¶

Components

Set of input features: $X$
Set of real valued weights: $W$
Activation function: $\phi$
Decision function
Output value (binary, $\mathbb{R}$): $y$

simple_perceptron

Inputs and Weights¶

First take input information $x_i$ and combine with weights $w_i$

simple_perceptron_input

The Activation Function¶

Combine information into single variable, $z$

$$ \Sigma \rightarrow z = \sum_{i=1}^{n} w_i x_i $$

so that $\phi$ can activate based on its value.

simple_perceptron_act

Activation Functions¶

As we'll see, it's really nice if $\phi(z)$ is continuous and differentiable, and is approximately linear near the origin.

Common forms of $\phi(z)$ used to squish input into some range (monotonically).

Adaptive Linear Neuron (AdaLine): $z$
Logistic: $\frac{1}{1+e^{-z}}$
Arctangent: $\tan^{-1}(z)$
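
The three activation functions listed above can be sketched in a few lines of plain Python. The function names are illustrative choices, not part of the workshop code, and the logistic form shown is only safe for moderate $|z|$.

```python
import math


def adaline(z):
    """AdaLine activation: the identity, phi(z) = z."""
    return z


def logistic(z):
    """Logistic activation, squashes z into (0, 1).

    Note: math.exp(-z) overflows for very negative z (roughly z < -709).
    """
    return 1.0 / (1.0 + math.exp(-z))


def arctangent(z):
    """Arctangent activation, squashes z into (-pi/2, pi/2)."""
    return math.atan(z)


# Evaluate each activation at a few sample points
for z in (-2.0, 0.0, 2.0):
    print(z, adaline(z), logistic(z), arctangent(z))
```

All three are monotonic in $z$; only the logistic and arctangent forms bound the output.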

act_func

The Decision Function¶

Based on activation, make a decision $\mathcal{D}$ to provide an output $y$.

simple_perceptron_dec

Decision Functions¶

Form of $\mathcal{D}$ is just a threshold function of $\phi(z)$ and thus of $z$, so:

$$ \mathcal{D} = \begin{cases} 1, & \text{if } z \ge b\\ \{-1,0\}, & \text{if } z < b \end{cases} $$

But since $b$ is some threshold value, we can absorb in our definition of $w$:

$$ z = \sum_{i=1}^{n} w_i x_i - b = \sum_{i=0}^{n} w_i x_i $$

where $w_0 = -b$ and $x_0 = 1$.
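
As a quick check of this bookkeeping, the sketch below computes $z$ both ways: with an explicit threshold $b$, and with the threshold absorbed as $w_0 = -b$, $x_0 = 1$. The function names and numbers are made up for illustration.

```python
def net_input_with_threshold(weights, inputs, b):
    """z = sum_i w_i x_i - b, with the threshold kept separate."""
    return sum(w * x for w, x in zip(weights, inputs)) - b


def net_input_augmented(weights, inputs, b):
    """Same z, with the threshold absorbed as w_0 = -b and x_0 = 1."""
    aug_weights = [-b] + list(weights)
    aug_inputs = [1.0] + list(inputs)
    return sum(w * x for w, x in zip(aug_weights, aug_inputs))


# Illustrative values only
w = [0.5, -1.0]
x = [2.0, 3.0]
b = 0.25

z_explicit = net_input_with_threshold(w, x, b)
z_absorbed = net_input_augmented(w, x, b)
print(z_explicit, z_absorbed)
```

Both forms give the same $z$, so the decision threshold can always be taken as zero.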

Decision Functions¶

So now we can write $\mathcal{D}$ as:

$$ \mathcal{D} = \begin{cases} 1, & \text{if } z \ge 0\\ \{-1,0\}, & \text{if } z < 0 \end{cases} $$

In this example, we are doing binary classification, so we can choose either $-1$ or $0$ as the latent state depending on the problem setup.

To get a continuous output instead, for example, we could use $\mathcal{D} = \phi(z) = \frac{1}{1+e^{-z}}$ (Logistic), which provides a class-membership probability.

Decision Functions (Binary)¶

Forms of $\mathcal{D}$ are just threshold functions.

Does $z$ and thus $\phi(z)$ reach a value sufficient to fire, i.e. $\mathcal{D}=+1$?

dec_func

Logistic Function¶

A quick aside for the Logistic function.

Consider two possible outcomes $y\in \{a,b\}$, with probabilities $\{p, 1-p\}$.

The ratio $\frac{p}{1-p} \in (0,\infty)$ provides the odds for event $a$.

Take the logarithm: $\log \frac{p}{1-p} \in \mathbb{R}$.

Recall that $z = \sum_i w_i x_i \in \mathbb{R}$.

So let's squish our z with this function: $$ \log \frac{p}{1-p} = z $$

$$ \frac{1-p}{p} = e^{-z} $$

$$ 1-p = p \ e^{-z} $$

$$ p(y=a|\mathbf{x}) = \phi(z) = \frac{1}{1+e^{-z}} $$
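
The derivation says the logistic function is the inverse of the log-odds. A minimal numerical check, with illustrative function names:

```python
import math


def logit(p):
    """Log-odds of probability p, maps (0, 1) onto the real line."""
    return math.log(p / (1.0 - p))


def logistic(z):
    """Inverse of the log-odds, maps the real line back onto (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Round trip: p -> z = logit(p) -> logistic(z) should recover p
for p in (0.1, 0.5, 0.9):
    z = logit(p)
    print(p, z, logistic(z))
```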

Logistic Function¶

$\phi_{\text{L}}(z) = \frac{1}{1+e^{-z}}$

log_func

Now we can define $\mathcal{D}$ to provide either probability $p$ or request a binary decision $\{0, 1\}$.

Perceptron Learning¶

So now we have the mechanics of the perceptron itself.

Given inputs $X$, weights $W$, activation and decision functions, we can get an output $y$

So now, how do we train it to learn something?

simple_perceptron

Perceptron Learning¶

We need two main things to learn:

Quantify how well our perceptron is doing
Ability to tune our perceptron based on this performance

Cost Function¶

Some way of quantifying how well our perceptron is doing

In supervised learning → labelled data sets, $y_{\text{true}}$.

Consider evaluating a cost function $J$:

$$ J(Y_{\text{true}}, Y) \propto \sum_{\mu=0}^{M} (y_{\text{true}}^\mu - y^\mu)^2 $$

over a data set with $M$ samples, where $y^\mu$ is the perceptron output for sample $\mu$.

Here just the mean squared error (MSE).

$J$ is a metric we want to minimize!

In general referred to as an objective function.

List of Common Objective Functions¶

MSE: $$ \sum_{\mu=0}^{M} (y_{\text{true}}^{\mu} - y^{\mu})^2 $$

Binary cross-entropy. Model predicts $p$ while true value is $t$. $$ -t \log (p) - (1-t) \log(1-p) $$

Categorical cross-entropy. Generalization for multiclass logarithmic loss. Target $t_{ij}$, prediction $p_{ij}$. $$ -\sum_{j} t_{ij} \log( p_{ij}) $$
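
A rough sketch of the three objective functions in plain Python. The function names, the $1/M$ normalization in `mse`, and the `eps` clipping that guards against $\log(0)$ are choices made for this example, not something the slides prescribe.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error over a list of samples (assumed 1/M normalization)."""
    return sum((t - y) ** 2 for t, y in zip(y_true, y_pred)) / len(y_true)


def binary_cross_entropy(t, p, eps=1e-12):
    """Loss for one sample with true label t in {0, 1} and predicted probability p."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -t * math.log(p) - (1 - t) * math.log(1.0 - p)


def categorical_cross_entropy(t_row, p_row, eps=1e-12):
    """Loss for one sample: t_row is a one-hot target, p_row the class probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(t_row, p_row))


print(mse([1.0, 0.0], [0.8, 0.3]))
print(binary_cross_entropy(1, 0.9))
print(categorical_cross_entropy([0, 1, 0], [0.2, 0.5, 0.3]))
```

In each case a confident wrong prediction is penalized much more heavily than a confident right one is rewarded.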

Cost Function Minimization¶

Time to minimize!

Take a derivative w.r.t. ... ?

Only free parameters are the weights $w_i$. Thus,

$$ \frac{\partial J}{\partial w_i} \bigg\rvert _{w_{\text{min}}} = 0 $$

$$ \begin{align} \frac{\partial J}{\partial w_i} &= \frac{\partial}{\partial w_i} \sum_{\mu=0}^{M} (y_{\text{true}}^\mu - y^\mu)^2 \\ &= \sum_{\mu=0}^{M} \frac{\partial}{\partial w_i} (y_{\text{true}}^\mu - y^{\mu})^2 \end{align} $$

Cost Function Minimization¶

$$ y \rightarrow \phi(z) $$

$$ \begin{align} \frac{\partial J}{\partial w_i} &= \sum_{\mu=0}^{M} \frac{\partial}{\partial w_i} \left(y_{\text{true}}^\mu - \phi^{\mu}(z)\right) ^2 \\ &= \sum_{\mu=0}^{M} (-2) \left( y_{\text{true}}^\mu - \phi^{\mu}(z) \right) \frac{\partial \phi^{\mu}(z)}{\partial w_i} \\ &= \sum_{\mu=0}^{M} (-2) \left( y_{\text{true}}^\mu - \phi^{\mu}(z) \right) \frac{\partial \phi^{\mu}(z)}{\partial z} \frac{\partial z}{\partial w_i} \\ &= \sum_{\mu=0}^{M} (-2) \left( y_{\text{true}}^\mu - \phi^{\mu}(z) \right) \frac{\partial \phi^{\mu}(z)}{\partial z} x_i \end{align} $$
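
One way to gain confidence in this chain-rule result is to compare it against a finite-difference estimate. The sketch below assumes a logistic $\phi$, for which $\partial\phi/\partial z = \phi(1-\phi)$, and uses made-up data; the helper names are illustrative.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def cost(w, X, y_true):
    """J = sum over samples of (y_true - phi(z))^2."""
    total = 0.0
    for x, t in zip(X, y_true):
        z = sum(wi * xi for wi, xi in zip(w, x))
        total += (t - logistic(z)) ** 2
    return total


def gradient(w, X, y_true):
    """Analytic dJ/dw_i = sum over samples of -2 (y_true - phi) phi' x_i."""
    grad = [0.0] * len(w)
    for x, t in zip(X, y_true):
        z = sum(wi * xi for wi, xi in zip(w, x))
        phi = logistic(z)
        dphi_dz = phi * (1.0 - phi)  # derivative of the logistic function
        for i, xi in enumerate(x):
            grad[i] += -2.0 * (t - phi) * dphi_dz * xi
    return grad


def numerical_gradient(w, X, y_true, h=1e-6):
    """Central finite-difference estimate of dJ/dw_i."""
    grad = []
    for i in range(len(w)):
        w_plus, w_minus = list(w), list(w)
        w_plus[i] += h
        w_minus[i] -= h
        grad.append((cost(w_plus, X, y_true) - cost(w_minus, X, y_true)) / (2 * h))
    return grad


# Made-up data; the first column is x_0 = 1 for the absorbed threshold
X = [[1.0, 0.5, -1.2], [1.0, -0.3, 0.8], [1.0, 1.5, 0.1]]
y_true = [1.0, 0.0, 1.0]
w = [0.1, -0.2, 0.3]

analytic = gradient(w, X, y_true)
numeric = numerical_gradient(w, X, y_true)
print(analytic)
print(numeric)
```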

Cost Function Minimization¶

$$ \frac{\partial J}{\partial w_i} \bigg\rvert _{w_{\text{min}}} = \sum_{\mu=0}^{M} \left( y_{\text{true}}^\mu - \phi^{\mu}(z) \right) \frac{\partial \phi^{\mu}(z)}{\partial z}\bigg\rvert _{w_{\text{min}}} x_i = 0 $$

So we have these pieces: $$ x_i \, \, , \, \, \, \phi'= \frac{\partial \phi}{\partial z} $$

and we can evaluate this: $$ \text{Error} = \left( y_{\text{true}}^\mu - y^\mu \right) = \left( y_{\text{true}}^\mu - \phi^{\mu}(z) \right) $$

Woopdie-doo...

Looks a little nasty to evaluate.

So, what are we going to do with these to find $w_{\text{min}}$?

Cost Function Minimization¶

Ability to tune our perceptron based on this performance

We're going to inform our weights via the Error to search for the $w_{\text{min}}$.

perceptron_err

Perceptron Learning Rule¶

Consider small displacement of a function

For small $\delta w > 0$,

$$ J(w+\delta w) \approx J(w) + J'(w) \ \delta w $$

so

$$ J\left(w-\delta w \ \mathrm{sgn} \ J'(w) \right) \leq J(w) \, . $$

Perceptron Learning Rule¶

So let's update the weights $w$ via something like

$$ w \leftarrow w + \delta w $$

where

$$ \delta w = - \eta \ J'(w) $$

i.e., follow negative gradient to find minimum of $J(w)$, at a learning rate $\eta$.
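
To see the update rule in action, here is a minimal one-dimensional sketch. The quadratic $J(w) = (w-3)^2$, the starting point and $\eta = 0.1$ are arbitrary choices for illustration.

```python
def J(w):
    """Toy cost with its minimum at w = 3."""
    return (w - 3.0) ** 2


def J_prime(w):
    """Derivative of the toy cost."""
    return 2.0 * (w - 3.0)


eta = 0.1   # learning rate
w = 0.0     # arbitrary starting weight
costs = [J(w)]

for _ in range(50):
    delta_w = -eta * J_prime(w)  # step along the negative gradient
    w = w + delta_w
    costs.append(J(w))

print(w, costs[-1])
```

For this cost each step shrinks the distance to the minimum by a factor $1 - 2\eta$, so the cost decreases monotonically as long as $\eta$ is small enough.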

Gradient Descent¶

Algorithm:

1. Initialize $w_i$
2. Evaluate perceptron output, $y$
3. Calculate $\frac{\partial J(w)}{\partial w_i}$, $\delta w_i$
4. Update weights: $w_i \leftarrow w_i + \delta w_i$
5. Return to Step 2 if stopping criteria not met.
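
The five steps can be sketched for an AdaLine unit ($\phi(z) = z$) as below. This is a toy version under stated assumptions: the cost is the mean squared error, the stopping criterion is a small gradient, and the data are made up to follow $y = 1 + 2x$.

```python
def predict(w, x):
    """AdaLine output: phi(z) = z = sum_i w_i x_i."""
    return sum(wi * xi for wi, xi in zip(w, x))


def batch_gradient_descent(X, y_true, eta=0.1, max_iter=1000, tol=1e-10):
    n_samples = len(X)
    w = [0.0] * len(X[0])                          # Step 1: initialize weights
    n_iter = 0
    for n_iter in range(1, max_iter + 1):
        outputs = [predict(w, x) for x in X]       # Step 2: evaluate outputs
        grad = [0.0] * len(w)                      # Step 3: gradient of the MSE
        for x, t, y in zip(X, y_true, outputs):
            for i, xi in enumerate(x):
                grad[i] += -2.0 * (t - y) * xi / n_samples
        w = [wi - eta * gi for wi, gi in zip(w, grad)]  # Step 4: update
        if max(abs(g) for g in grad) < tol:        # Step 5: stopping criterion
            break
    return w, n_iter


# Made-up data following y = 1 + 2 x; first column is x_0 = 1
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
X = [[1.0, x] for x in xs]
y_true = [1.0 + 2.0 * x for x in xs]

w_fit, n_iter = batch_gradient_descent(X, y_true)
print(w_fit, n_iter)
```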

Gradient Descent¶

What we've seen so far is called Batch Gradient Descent.

Three main methods:

Batch (BGD)
- $\rightarrow \sum_{\mu=0}^{M}$
- Uses all examples!
- Slow, memory requirements...

Stochastic (SGD)
- $k \in M \rightarrow J_{k}'$
- One example! Used for online learning (data continuously arriving).
- Potentially noisy path to $J_{\text{min}}$

Mini-Batch (MBGD)
- $ S \subset M \rightarrow \sum_{\mu \in S}$
- Subset of examples.
- Compromise, most stable. Typically $|S| = 128, 256, ...$
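
The three variants differ only in how many samples feed each gradient evaluation. A small sketch of a shuffled index generator makes this concrete; the name `minibatches` and the sizes used are illustrative.

```python
import random


def minibatches(n_samples, batch_size, rng):
    """Yield lists of sample indices covering the data set once, in random order."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    for start in range(0, n_samples, batch_size):
        yield indices[start:start + batch_size]


rng = random.Random(0)  # seeded for reproducibility

# Mini-batch: subsets of 4 samples (the last batch may be smaller)
batches = list(minibatches(10, 4, rng))
print(batches)

# batch_size = 1 gives stochastic GD; batch_size = n_samples gives batch GD
sgd_batches = list(minibatches(5, 1, rng))
bgd_batches = list(minibatches(5, 5, rng))
```

The gradient for each update is then summed over the indices in one batch only.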

Learning Rate¶

This first hyperparameter $\eta$ determines how far $\delta w$ will jump searching for $J_{\text{min}}$

grad_eta

Learning Rate¶

Of course, we are not guaranteed a global minimum via $\frac{\partial J}{\partial w_i} \bigg\rvert _{w_{\text{min}}} = 0$.

Consider deep learning networks with $O(>10^6)$ weights! Choose $\eta$ wisely.

grad_local

Learning Rate - Adaptive¶

One possible solution is to use an adaptive rate.

Example: Annealing $$ \eta = \frac{c_1}{k+c_2} $$

where $c_i$ are constants and $k$ is the current iteration.
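
A direct transcription of this schedule, with arbitrary constants $c_1 = 1$ and $c_2 = 10$ chosen only for illustration:

```python
def annealed_learning_rate(k, c1=1.0, c2=10.0):
    """Learning rate eta = c1 / (k + c2) at iteration k."""
    return c1 / (k + c2)


# The rate starts at c1 / c2 and decays roughly like 1 / k
rates = [annealed_learning_rate(k) for k in (0, 10, 90, 990)]
print(rates)
```

Large early steps move quickly towards a minimum; smaller later steps reduce the risk of jumping back out of it.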

Iris Dataset¶

Set of 150 samples (individual flowers) that have 4 features: sepal length, sepal width, petal length, and petal width (all in cm). Published in 1936 by R. Fisher.
Each sample is labeled by its species: Iris Setosa, Iris Versicolour, Iris Virginica
Task is to develop a model that predicts iris species
Dataset freely available from the UCI Machine Learning Repository

Iris dataset

Iris Learning Example¶

Consider just

Two species: Setosa, Versicolour
Two feature variables: sepal length, petal length

Iris data

Iris Learning Example¶

By eye, we can separate these.

Can our perceptron learn a decision boundary for classification?

Iris data

Iris Learning Example¶

Perceptron:

Two feature inputs (sepal & petal lengths)
Unit Step act. function
Learning rate: $\eta = 0.01$

How do we know when the perceptron has learned?

Look at evolution of Error and $J(w)$ with training iteration.
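
A minimal sketch of such a training loop, assuming the classic perceptron update $w_i \leftarrow w_i + \eta\,(y_{\text{true}} - y)\,x_i$ with a unit-step output. The six centered two-feature points below are made up and linearly separable; they are not the Iris measurements.

```python
def unit_step(z):
    """Binary decision: 1 if z >= 0, else 0."""
    return 1 if z >= 0.0 else 0


def predict(w, x):
    return unit_step(sum(wi * xi for wi, xi in zip(w, x)))


def train_perceptron(X, labels, eta=0.01, max_epochs=50):
    """Return the learned weights and the number of misclassifications per epoch."""
    w = [0.0] * len(X[0])
    errors_per_epoch = []
    for _ in range(max_epochs):
        errors = 0
        for x, t in zip(X, labels):
            error = t - predict(w, x)
            if error != 0:
                # Update only on mistakes, scaled by the learning rate
                w = [wi + eta * error * xi for wi, xi in zip(w, x)]
                errors += 1
        errors_per_epoch.append(errors)
        if errors == 0:  # a full pass without mistakes: stop
            break
    return w, errors_per_epoch


# Made-up, centered data; first column is x_0 = 1
X = [[1.0, -1.0, -1.0], [1.0, -2.0, -1.0], [1.0, -1.0, -2.0],
     [1.0, 1.0, 1.0], [1.0, 2.0, 1.0], [1.0, 1.0, 2.0]]
labels = [0, 0, 0, 1, 1, 1]

w, errors_per_epoch = train_perceptron(X, labels)
print(w, errors_per_epoch)
```

The per-epoch error count plays the role of the learning curve: once it reaches zero on separable data, the decision boundary no longer moves.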

Iris Learning Example¶

Perceptron:

Two feature inputs (sepal & petal lengths)
Unit Step act. function (binary classification -> quantized values)
Learning rate: $\eta = 0.01$

Unit_01

Iris Learning Example¶

Decision Boundary with Unit Step

Unit_dec

Mislabelled Example¶

Linearly separable.

What happens if we 'mis-classify' our training set, i.e. no longer linearly separable?

Iris_mis

No convergence for non-linearly separable (mislabelled) data set.

Unit_dec

Continuous Activation¶

Perceptron:

Two feature inputs (sepal & petal lengths)
AdaLine act. function
Learning rate: $\eta = 0.01$

Ada_01

Iris Boundary¶

Decision Boundary with AdaLine

Unit_dec

Mislabelled AdaLine¶

Perceptron:

AdaLine act. function
Learning rate: $\eta = 0.01$
Converges! Of course with higher errors.

mis_Ada

Comparing Learning Rates¶

Perceptron:

AdaLine act. function, $\eta = \{0.1, 0.001\}$
Can have major differences in minimizing $J(w)$.

mis_Ada

Feature Scaling¶

Weights typically initialized by $N(0,\epsilon)$, $\epsilon$ small
Features may cover large range
Can scale $\mathbf{x}_i \rightarrow \frac{\mathbf{x}_i - \mu_i}{\sigma_i}$
- $\mu_i$ is mean of feature $i$ over data set
- $\sigma_i$ is standard deviation of feature $i$ over data set
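
A small sketch of this standardization for a single feature column, dividing by the standard deviation (the square root of the variance). The sample values are illustrative, and a constant column (zero spread) would need special handling.

```python
import math


def standardize(column):
    """Return the column shifted to zero mean and scaled to unit standard deviation."""
    n = len(column)
    mean = sum(column) / n
    variance = sum((v - mean) ** 2 for v in column) / n
    std = math.sqrt(variance)  # assumes std > 0, i.e. the feature is not constant
    return [(v - mean) / std for v in column]


sepal_length = [5.1, 4.9, 6.3, 5.8, 7.1]  # illustrative values in cm
scaled = standardize(sepal_length)
print(scaled)
```

After scaling, every feature has mean 0 and spread 1, so small initial weights act on all inputs with comparable strength.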

iris_norm

Feature Scaling¶

Much faster convergence just by preprocessing features!

norm_Ada

Visualizing Learning¶

Notebook with some animations.

Visualize learning of boundary
- Different $\phi(z)$
- Non-scaled and scaled data set

Regression¶

Fit a single Perceptron to one flower type (Setosa)
Now one explanatory variable (Sepal length)

setosa

Regression¶

Smooth convergence of cost minimization
Visual animation of fit

setosa

Further Complications¶

Our examples are rather simple
Test a more complicated regression $y_{\text{true}}(x) = \cos\left(\frac{3\pi}{2}x\right)$
Using polynomials of different degree

under_over_fitting

Image source: Underfitting vs. Overfitting scikit-learn.

Classification¶

Bias: mean deviation of predictions from true values (systematic error)
Variance: variability of the model's prediction for a given sample when the model is retrained on different training sets

under_over_fitting

Image source: Underfitting vs. Overfitting scikit-learn.

Regularization¶

Can use regularization to smooth learning
For example, we don't want weight amplitudes to grow uncontrolled

L2 Regularization
- $\lambda||\mathbf{w}||^2 = \lambda \sum_i w_i^2$
- New hyperparameter $\lambda$
- Include reg. term in minimization procedure
- Feature scaling is important when regularizing, so that all weights are penalized on the same footing
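
As a rough illustration, the L2 term and its gradient can be written as below and simply added to the cost and to $\partial J / \partial w_i$. The function names, weights and $\lambda$ are illustrative; here every weight is penalized, although exempting the threshold weight $w_0$ is a common variation.

```python
def l2_penalty(w, lam):
    """Regularization term: lambda * sum_i w_i^2."""
    return lam * sum(wi ** 2 for wi in w)


def l2_penalty_gradient(w, lam):
    """Derivative of the penalty with respect to each weight: 2 * lambda * w_i."""
    return [2.0 * lam * wi for wi in w]


w = [0.5, -2.0, 1.0]
lam = 0.1
eta = 0.1

penalty = l2_penalty(w, lam)
penalty_grad = l2_penalty_gradient(w, lam)

# One gradient step on the penalty alone pulls every weight towards zero
w_shrunk = [wi - eta * gi for wi, gi in zip(w, penalty_grad)]
print(penalty, penalty_grad, w_shrunk)
```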

Quantifying Performance¶

Under/over fitting extended to ML
- Hyperparameter effects on performance
- How to quantify quality of trained model
- How do we know we are under/over fitting?

Splitting up labelled data set
- Training set ($\sim70\%$)
- Validation set ($\sim30\%$)
Further test performance on entirely unseen data
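
A sketch of such a split using only the standard library. The function name, the fixed seed and the 150-sample stand-in data set are assumptions made for this example.

```python
import random


def train_validation_split(samples, validation_fraction=0.3, seed=0):
    """Shuffle the samples and split them into (training, validation) lists."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_validation = int(round(validation_fraction * len(samples)))
    validation_idx = indices[:n_validation]
    training_idx = indices[n_validation:]
    training = [samples[i] for i in training_idx]
    validation = [samples[i] for i in validation_idx]
    return training, validation


data = list(range(150))  # stand-in for 150 labelled samples
train, val = train_validation_split(data)
print(len(train), len(val))
```

Shuffling before splitting matters when the data set is ordered, for example by class label.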

Quantifying Performance¶

Validation score: a measure of quality
- Classification: fraction of correct identifications
- Regression: e.g. coefficient of determination $R^2$

Underfitting: model has not learned enough. Overfitting: model fits the training set almost perfectly but can't generalize to new data
Model complexity: polynomial degree, hyperparameters ($\eta$, $N_{\text{iter}}$, regularization param)

validation_curve

Back to General Scheme for Building ML Systems¶

simple_perceptron

ML Algorithms Overview¶

Support Vector Machines (SVM)
Decision Trees
Artificial Neural Networks (ANN)
- Multilayer Networks
- Convolutional NN

Support Vector Machines (SVM)¶

Maximize the margin: distance between support vectors
Support vectors defined by nearest training samples to hyperplane

SVM

Image source: Python Machine Learning by Sebastian Raschka

$$ \begin{align} \sum_i w_i x^{\text{upper}}_i &= +1 \\ \sum_i w_i x^{\text{lower}}_i &= -1 \\ \sum_i w_i (x^{\text{upper}}_i - x^{\text{lower}}_i) &= 2 \end{align} $$

Normalize:

$$ \frac{\sum_i w_i (x^{\text{upper}}_i - x^{\text{lower}}_i)}{||\mathbf{w}||} = \frac{2}{||\mathbf{w}||} $$

Support Vector Machines (SVM)¶

So maximize $\frac{2}{||\mathbf{w}||}$ subject to classification conditions: $$ y_k \sum_i w_i x^k_i \ge 1 \quad \text{for } k = 1, \dots, N $$
Actually easier to minimize $\frac{1}{2}||\mathbf{w}||^2$ -> quadratic optimization problem
Introduce soft margins for non-linearly separable data
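
For a given weight vector, both the margin and the classification conditions are easy to evaluate. The sketch below uses a hand-picked $\mathbf{w}$ and four made-up points; it checks a candidate solution and is not an SVM solver.

```python
import math


def margin_width(w):
    """Margin 2 / ||w|| for a hyperplane with weight vector w."""
    return 2.0 / math.sqrt(sum(wi ** 2 for wi in w))


def satisfies_constraints(w, X, y):
    """Check y_k * sum_i w_i x_i^k >= 1 for every training sample."""
    return all(
        yk * sum(wi * xi for wi, xi in zip(w, xk)) >= 1.0
        for xk, yk in zip(X, y)
    )


# Made-up, linearly separable data with labels +1 / -1
X = [[1.0, 1.0], [2.0, 3.0], [-1.0, -1.0], [-3.0, -1.0]]
y = [1, 1, -1, -1]

w_good = [0.5, 0.5]   # (1, 1) and (-1, -1) sit exactly on the margin
w_small = [0.1, 0.1]  # wider nominal margin, but violates the conditions

print(margin_width(w_good), satisfies_constraints(w_good, X, y))
print(margin_width(w_small), satisfies_constraints(w_small, X, y))
```

Shrinking $||\mathbf{w}||$ widens the margin only until a training sample falls inside it, which is why the conditions enter as constraints.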

Decision Trees¶

Break down data via series of questions
Split at each node via optimization of Information Gain

simple_tree

Image source: Python Machine Learning by Sebastian Raschka

Decision Trees¶

Information gain (IG) as the objective function $$ \text{IG}(D_p,f) = I(D_p) - \sum_{j=1}^m \frac{N_j}{N_p}I(D_j) $$

where

$f$ is the feature for the split
$D_p, \, D_j$ are the data set of the parent and $j$th child node
$I$ is the impurity measure
$N_p$ is the number of samples at the parent node
$N_j$ is the number of samples in the $j$th child node

IG is the difference between the impurity of the parent node and the weighted sum of the child node impurities.

Decision Trees¶

Most libraries implement binary splitting: $$ \text{IG}(D_p,f) = I(D_p) - \frac{N_\text{left}}{N_p}I(D_\text{left}) - \frac{N_\text{right}}{N_p}I(D_\text{right}) $$

Common impurity measures $I(t)$ at node $t$:

Entropy: $I(t) = - \sum_{i=1}^c p(i|t) \log_2 p(i|t)$
Gini: $I(t) = \sum_{i=1}^c p(i|t) \left( 1- p(i|t) \right) = 1 - \sum_{i=1}^c p(i|t)^2$
Classification error: $I(t) = 1- \max_i p(i|t)$

where $p(i|t)$ is the fraction of samples belonging to class $i$ at node $t$.

Can overtrain easily so need pruning to limit max. depth of tree.
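
The impurity measures and the information gain for a candidate split can be sketched as follows. The helper names and the eight-sample toy node are made up for illustration.

```python
import math


def class_fractions(labels):
    """Fraction p(i|t) of samples in each class present at the node."""
    n = len(labels)
    return [labels.count(c) / n for c in set(labels)]


def entropy(labels):
    return sum(-p * math.log2(p) for p in class_fractions(labels))


def gini(labels):
    return 1.0 - sum(p ** 2 for p in class_fractions(labels))


def classification_error(labels):
    return 1.0 - max(class_fractions(labels))


def information_gain(parent, children, impurity):
    """Parent impurity minus the sample-weighted impurity of the child nodes."""
    n_parent = len(parent)
    weighted = sum(len(child) / n_parent * impurity(child) for child in children)
    return impurity(parent) - weighted


# Toy node: 4 samples of class 0 and 4 of class 1, split into two children
parent = [0, 0, 0, 0, 1, 1, 1, 1]
left, right = [0, 0, 0, 1], [0, 1, 1, 1]

for impurity in (entropy, gini, classification_error):
    print(impurity.__name__, information_gain(parent, [left, right], impurity))
```

A tree builder would evaluate this quantity for every candidate feature and threshold, and keep the split with the largest gain.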

Artificial Neural Networks¶

ann_scheme

Artificial Neural Networks¶

Built up of many connected artificial neurons
Building block is our little perceptron!

Applications:
- Computer vision
- Speech recognition
- Social network filtering
- Game play
- Physics!

Artificial Neural Networks¶

Single layer network

single

Artificial Neural Networks¶

Single layer network
Keeping track of indices
- $i$ input features
- $j$ outputs
- $i\times j$ weights
Optimization with $J\sim \sum_{\mu,k} J\left( y_{k, \text{true}}^{\mu} - y_{k}^{\mu} \right)$ ($k$th output and $\mu$th sample)

Artificial Neural Networks¶

Adding more layers means adding more indices.

Optimization algorithm (backpropagation) remains the same.

Just start from the outputs, and move backwards to fine tune the weights!

double

ANN¶

Image Analysis
- B&W: each pixel is a feature (grayscale $[0,255] \rightarrow [0,1]$ scaled)
- RGB: each pixel provides 3 features (R, G, B $[0,255] \rightarrow [0,1]$ scaled)
- Each PMT provides charge, Q
Event Analysis
- Sophisticated reco variables: lateral dist, theta, timing

Convolutional NN¶

Primarily used in image analysis
- Convolutions account for neighboring pixels
- Great for identifying sub-features in data
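
To make "accounting for neighboring pixels" concrete, here is a small pure-Python sketch of a 2D convolution followed by max pooling. As in most deep learning libraries, the kernel is applied without flipping (strictly a cross-correlation). The image and the edge-detecting kernel are made up.

```python
def convolve2d_valid(image, kernel):
    """Slide the kernel over the image (no padding) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            total = 0.0
            for i in range(kh):
                for j in range(kw):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


def max_pool(feature_map, size=2):
    """Keep the maximum of each non-overlapping size x size window."""
    pooled = []
    for r in range(0, len(feature_map) - size + 1, size):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, size):
            window = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            row.append(max(window))
        pooled.append(row)
    return pooled


# 5x5 image with a vertical edge between columns 1 and 2
image = [[0, 0, 1, 1, 1] for _ in range(5)]
kernel = [[-1, 1], [-1, 1]]  # responds to left-to-right brightness increases

features = convolve2d_valid(image, kernel)
pooled = max_pool(features)
print(features)
print(pooled)
```

The feature map lights up only where the edge is, and pooling keeps that response while halving the resolution.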

MaxPool Convolution¶

simple_conv

Image source: [Understanding CNN for NLP](http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/) Denny Britz

Deep Convolutional Network¶

Layers of convolutions, and samplings...

simple_conv

Image source: [Introduction to CNN for Vision Tasks](https://pythonmachinelearning.pro/introduction-to-convolutional-neural-networks-for-vision-tasks/)

Keras Tutorial¶

Following the first few sections of Deep Learning with Keras

Single Layer
M-Hidden Layers
Dropout in M Layers

Keras Tutorial¶

Python API for running TensorFlow
TensorFlow: Google's symbolic library for tensor math
Online playground here
Keras provides more intuitive functionality to develop Deep Learning models
- Building blocks: layers, objective and activation functions, optimizers
- Distributed training of networks via GPUs and GPU clusters

Installation¶

A few things prior to running scripts.

Install a few things with pip

pip install numpy scipy scikit-learn pillow h5py
pip install Theano
pip install --upgrade tensorflow
pip install keras

Now test your installation

import theano
import theano.tensor as T

# Build a symbolic logistic function and evaluate it on a small matrix
x = T.dmatrix('x')
s = 1 / (1 + T.exp(-x))
logistic = theano.function([x], s)
print(logistic([[0, 1], [-1, -1]]))

First Keras Script¶

Let's go train a single layer NN on the MNIST data set: 70k handwritten (labelled) digits.

mnist

First Keras Script¶

We're going to define:

10 outputs (10 digits)
Split training and validation set: 80 / 20
Mini-batch sets of 128 samples
Objective function: categorical cross-entropy

from __future__ import print_function
import numpy as np
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras.optimizers import SGD
from keras.utils import np_utils

np.random.seed(1671)  # for reproducibility

# Network & training
N_EPOCH = 5  # 200
BATCH_SIZE = 128
VERBOSE = 1
N_CLASSES = 10  # No. outputs = No. digits
OPTIMIZER = SGD()  # SGD optimizer
N_HIDDEN = 128  # No. hidden nodes in layer
VALIDATION_SPLIT = 0.2  # fraction of training set used for validation
LOSS = 'categorical_crossentropy'  # categorical_crossentropy, binary_crossentropy, mse
METRIC = 'accuracy'  # accuracy, precision, recall

# Data -> shuffled and split between training and test sets
(X_train, y_train), (X_test, y_test) = mnist.load_data()

# X_train is 60,000 rows of 28x28 values -> to be reshaped to 60,000 x 784
RESHAPED = 784
X_train = X_train.reshape(60000, RESHAPED)
X_test = X_test.reshape(10000, RESHAPED)

# Need to make float32 for GPU use
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')

# Normalize grey-scale values
X_train /= 255
X_test /= 255
print(X_train.shape[0], ' training samples')
print(X_test.shape[0], ' testing samples')

# Convert class vectors to binary class matrices
Y_train = np_utils.to_categorical(y_train, N_CLASSES)
Y_test = np_utils.to_categorical(y_test, N_CLASSES)

# N_CLASSES outputs, final stage is normalized via softmax
model = Sequential()
model.add(Dense(N_CLASSES, input_shape=(RESHAPED,)))
model.add(Activation('softmax'))
model.summary()

# Compile the model
model.compile(loss=LOSS, optimizer=OPTIMIZER, metrics=[METRIC])

# Train the model
history = model.fit(X_train, Y_train,
                    batch_size=BATCH_SIZE,
                    epochs=N_EPOCH,
                    verbose=VERBOSE,
                    validation_split=VALIDATION_SPLIT)

# Validation of the model with test set
score = model.evaluate(X_test, Y_test, verbose=VERBOSE)
print("Test score: ", score[0])
print("Test accuracy: ", score[1])
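
Two steps in the script are worth unpacking: converting digit labels to binary class vectors, and normalizing the outputs with softmax. The functions below are plain-Python stand-ins written for illustration; they are not the Keras implementations.

```python
import math


def to_one_hot(label, n_classes):
    """Binary class vector: 1.0 at the label's index, 0.0 elsewhere."""
    row = [0.0] * n_classes
    row[label] = 1.0
    return row


def softmax(scores):
    """Turn raw output scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


target = to_one_hot(3, 10)          # digit "3" out of 10 classes
probs = softmax([1.0, 2.0, 0.5])    # three made-up output scores
print(target)
print(probs)
```

Categorical cross-entropy then compares the softmax probabilities with the one-hot target, so only the probability assigned to the true class contributes to the loss.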

Multiple Hidden Layers¶

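To add more hidden layers, make the first layer a hidden layer, Dense(N_HIDDEN, input_shape=(RESHAPED,)) followed by Activation('relu'), then add the following between it and the final Dense(N_CLASSES) softmax output layer: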

# 2nd layer
model.add(Dense(N_HIDDEN))
model.add(Activation('relu'))  # Can use other activation functions here

# 3rd layer
model.add(Dense(N_HIDDEN))
model.add(Activation('relu'))  # Can use other activation functions here

We also don't need so many iterations, so can try N_EPOCH = 20

Additional Resources¶

Python Machine Learning by Sebastian Raschka [GitHub][Amazon]
Data Science Handbook by Jake VanderPlas [GitHub][Amazon]
The Elements of Statistical Learning by Hastie, Tibshirani and Friedman [Free book!]
Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville [Amazon]

Thank you for your attention!¶

final
