Introduction to Machine Learning
Workshop

Zigfried Hampel-Arias

IIHE -- Brussels, BE

20 April, 2018

Program for Today

Intro to Machine Learning

  • Types of Learning
  • Machine Learning Basics
  • Building More Complex Models

Keras Workshop

  • Simple Neural Nets
  • Hyperparameters

Classical Programming

  • Set of rules to accomplish a task
  • 'if' this then 'do that'
def spam_filter(email):
    """Function that labels an email as 'spam' or 'not spam'
    """
    if 'Act now!' in email.contents:
        label = 'spam'
    elif 'hotmail.com' in email.sender:
        label = 'spam'
    elif email.contents.count('$') > 20:
        label = 'spam'
    else:
        label = 'not spam'

    return label

Machine Learning

  • "Field of study that gives computers the ability to learn without being explicitly programmed" — Arthur Samuel (1959)

  • "A machine-learning system is trained rather than explicitly programmed. It’s presented with many examples relevant to a task, and it finds statistical structure in these examples that eventually allows the system to come up with rules for automating the task." — Francois Chollet, Deep Learning with Python

Three Types of Learning

  • Supervised Learning

  • Unsupervised Learning

  • Reinforcement Learning

Supervised Learning

  • Requires labelled data set (e.g. MC truth)

  • Direct feedback on model performance while training

  • Application to two kinds of problems

    • Classification -> fixed output types
    • Regression -> continuous output


Unsupervised Learning

  • No labels on data set (e.g. no MC truth)

  • No direct feedback while training model

  • Identify underlying structure in data

  • Application to two main subfields

    • Clustering -> organize the data into meaningful subgroups
    • Dimensionality reduction -> preprocessing of large data sets


Reinforcement Learning

  • Used for training decision making process (e.g. playing chess)

  • Learn a series of actions by interacting with environment

  • Requires a reward system to optimize, improve performance


General Scheme for Building ML Systems

simple_perceptron

Supervised Learning

  • From labelled data, learn a mapping from input data to desired output

  • The goal is to generalize well to future, unseen data

  • Application to two kinds of problems

    • Classification -> fixed output types
    • Regression -> continuous output


Supervised Learning


Supervised Learning

  • Requires labelled data set (e.g. MC truth)

  • Direct feedback on model performance while training


Supervised Learning

  • Choose ML algorithm appropriate for problem


Supervised Learning

  • Application to two kinds of problems

    • Classification -> fixed output types
    • Regression -> continuous output


Machine Learning Basics

  • Inspired by nature
  • The Perceptron
  • Learning from the data

Neuronal Inspiration

overview

Image source: Artificial Neurons and the McCulloch-Pitts Model by Sebastian Raschka

Towards an Artificial Neuron

Desired Components

  • Set of input-accepting 'dendrites'
  • Inner body that 'activates' based on input signals
  • Inner function that 'decides' whether to fire
  • Output signal

Towards an Artificial Neuron

Desired Characteristics

  • Simple mathematical functions

    • Easy to evaluate
    • Differentiable
  • Building block

    • Single unit
    • Can be stacked

The Perceptron

Components

  • Set of input features: $X$
  • Set of real valued weights: $W$
  • Activation function: $\phi$
  • Decision function
  • Output value (binary, $\mathbb{R}$): $y$

simple_perceptron
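
To make these pieces concrete, here is a minimal NumPy sketch of a perceptron forward pass; the names net_input and predict (and the example numbers) are illustrative, not from the workshop code:

import numpy as np

def net_input(x, w):
    """z = sum_i w_i x_i, with w[0] acting as the bias term (x_0 = 1)."""
    return w[0] + np.dot(x, w[1:])

def predict(x, w):
    """Unit-step decision: output +1 if z >= 0, else -1."""
    return np.where(net_input(x, w) >= 0.0, 1, -1)

# Example: one sample with two features, weights drawn from N(0, 0.01)
rng = np.random.RandomState(1)
w = rng.normal(loc=0.0, scale=0.01, size=3)   # w[0] is the bias
x = np.array([5.1, 1.4])
print(predict(x, w))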

Inputs and Weights

First take input information $x_i$ and combine with weights $w_i$

simple_perceptron_input

The Activation Function

Combine information into single variable, $z$

$$ \Sigma \rightarrow z = \sum_{i=0}^{n} w_i x_i $$

so that $\phi$ can activate based on its value.

simple_perceptron_act

Activation Functions

As we'll see, it's really nice if $\phi(z)$ is continuous & differentiable, and acts linearly near the origin.

Common forms of $\phi(z)$ used to squish input into some range (monotonically).

  • Adaptive Linear Neuron (AdaLine): $z$
  • Logistic: $\frac{1}{1+e^{-z}}$
  • Arctangent: $\tan^{-1}(z)$

Activation Functions

Common forms of $\phi(z)$ used to squish input into some range (monotonically):

  • Adaptive Linear Neuron (AdaLine): $z$
  • Logistic: $\frac{1}{1+e^{-z}}$
  • Arctangent: $\tan^{-1}(z)$

act_func
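
As a side note (not on the original slides), these three activation functions take only a few lines of NumPy:

import numpy as np

def adaline(z):
    return z                            # identity: phi(z) = z

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))     # squashes z into (0, 1)

def arctangent(z):
    return np.arctan(z)                 # squashes z into (-pi/2, pi/2)

z = np.linspace(-5, 5, 11)
print(logistic(z))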

The Decision Function

Based on activation, make a decision $\mathcal{D}$ to provide an output $y$.

simple_perceptron_dec

Decision Functions

Form of $\mathcal{D}$ is just a threshold function of $\phi(z)$ and thus of $z$, so:

$$ \mathcal{D} = \begin{cases} 1, & \text{if } z \ge b\\ \{-1,0\}, & \text{if } z < b \end{cases} $$

But since $b$ is just a threshold value, we can absorb it into our definition of $w$:

$$ z = \sum_{i=1}^{n} w_i x_i - b = \sum_{i=0}^{n} w_i x_i $$

where $w_0 = -b$ and $x_0 = 1$.

Decision Functions

So now we can write $\mathcal{D}$ as:

$$ \mathcal{D} = \begin{cases} 1, & \text{if } z \ge 0\\ \{-1,0\}, & \text{if } z < 0 \end{cases} $$

In this example, we are doing binary classification, so we can choose either $-1$ or $0$ as the label for the negative class, depending on the problem setup.

For logistic regression, for example, we could take $\mathcal{D} = \phi(z) = \frac{1}{1+e^{-z}}$, providing a class-membership probability.

Decision Functions (Binary)

Forms of $\mathcal{D}$ are just threshold functions.

Does $z$ and thus $\phi(z)$ reach a value sufficient to fire, i.e. $\mathcal{D}=+1$?

dec_func

Logistic Function

A quick aside for the Logistic function.

Consider two possible outcomes $y\in \{a,b\}$, with probabilities $\{p, 1-p\}$.

The ratio $\frac{p}{1-p} \in (0,\infty)$ provides the odds for event $a$.

Take the logarithm: $\log \frac{p}{1-p} \in \mathbb{R}$.

Recall that $z = \sum_i w_i x_i \in \mathbb{R}$.

Logistic Function

Consider two possible outcomes $y\in \{a,b\}$, with probabilities $\{p, 1-p\}$.

The ratio $\frac{p}{1-p} \in (0,\infty)$ provides the odds for event $a$.

Take the logarithm: $\log \frac{p}{1-p} \in \mathbb{R}$.

Recall that $z = \sum_i w_i x_i \in \mathbb{R}$.

So let's squish our z with this function: $$ \log \frac{p}{1-p} = z $$

$$ \frac{1-p}{p} = e^{-z} $$

$$ 1-p = p \ e^{-z} $$

$$ p(y=a|\mathbf{x}) = \phi(z) = \frac{1}{1+e^{-z}} $$

Logistic Function

$\phi_{\text{L}}(z) = \frac{1}{1+e^{-z}}$

log_func

Now we can define $\mathcal{D}$ to provide either the probability $p$ or a binary decision $\{0, 1\}$.

Perceptron Learning

So now we have the mechanics of the perceptron itself.

Given inputs $X$, weights $W$, and activation and decision functions, we can get an output $y$.

So now, how do we train it to learn something?

simple_perceptron

Perceptron Learning

We need two main things to learn:

  • Quantify how good our perceptron is doing

  • Ability to tune our perceptron based on this performance

Cost Function

  • Some way of quantifying how good our perceptron is doing

In supervised learning → labelled data sets, $y_{\text{true}}$.

Consider evaluating a cost function $J$:

$$ J(Y_{\text{true}}, Y) \propto \sum_{\mu=0}^{M} (y_{\text{true}}^\mu - y^\mu)^2 $$

over a data set with $M$ samples, where $y^\mu$ is the perceptron output for sample $\mu$.

Here just the mean squared error (MSE).

$J$ is a metric we want to minimize!

In general referred to as an objective function.

List of Common Objective Functions

MSE: $$ \sum_{\mu=0}^{M} (y_{\text{true}}^{\mu} - y^{\mu})^2 $$

Binary cross-entropy. Model predicts $p$ while true value is $t$. $$ -t \log (p) - (1-t) \log(1-p) $$

Categorical cross-entropy. Generalization for multiclass logarithmic loss. Target $t_{ij}$, prediction $p_{ij}$. $$ -\sum_{j} t_{ij} \log( p_{ij}) $$
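
For reference, a minimal NumPy sketch of these three objective functions, averaged over $M$ samples (the function names are my own):

import numpy as np

def mse(y_true, y_pred):
    """Mean squared error."""
    return np.mean((y_true - y_pred) ** 2)

def binary_crossentropy(t, p, eps=1e-12):
    """t in {0, 1}, p = predicted probability of the positive class."""
    p = np.clip(p, eps, 1 - eps)          # avoid log(0)
    return np.mean(-t * np.log(p) - (1 - t) * np.log(1 - p))

def categorical_crossentropy(t, p, eps=1e-12):
    """t, p have shape (M, n_classes); t is one-hot encoded."""
    p = np.clip(p, eps, 1 - eps)
    return np.mean(-np.sum(t * np.log(p), axis=1))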

Cost Function Minimization

Time to minimize!

Take a derivative with respect to... what?

The only free parameters are the weights $w_i$. Thus,

$$ \frac{\partial J}{\partial w_i} \bigg\rvert _{w_{\text{min}}} = 0 $$

$$ \begin{align} \frac{\partial J}{\partial w_i} &= \frac{\partial}{\partial w_i} \sum_{\mu=0}^{M} (y_{\text{true}}^\mu - y^\mu)^2 \\ &= \sum_{\mu=0}^{M} \frac{\partial}{\partial w_i} (y_{\text{true}}^\mu - y^{\mu})^2 \end{align} $$

Cost Function Minimization

$$ y \rightarrow \phi(z) $$

$$ \begin{align} \frac{\partial J}{\partial w_i} &= \sum_{\mu=0}^{M} \frac{\partial}{\partial w_i} \left(y_{\text{true}}^\mu - \phi^{\mu}(z)\right) ^2 \\ &= \sum_{\mu=0}^{M} (-2) \left( y_{\text{true}}^\mu - \phi^{\mu}(z) \right) \frac{\partial \phi^{\mu}(z)}{\partial w_i} \\ &= \sum_{\mu=0}^{M} (-2) \left( y_{\text{true}}^\mu - \phi^{\mu}(z) \right) \frac{\partial \phi^{\mu}(z)}{\partial z} \frac{\partial z}{\partial w_i} \\ &= \sum_{\mu=0}^{M} (-2) \left( y_{\text{true}}^\mu - \phi^{\mu}(z) \right) \frac{\partial \phi^{\mu}(z)}{\partial z} x_i \end{align} $$

Cost Function Minimization

$$ \frac{\partial J}{\partial w_i} \bigg\rvert _{w_{\text{min}}} = \sum_{\mu=0}^{M} \left( y_{\text{true}}^\mu - \phi^{\mu}(z) \right) \frac{\partial \phi^{\mu}(z)}{\partial z}\bigg\rvert _{w_{\text{min}}} x_i = 0 $$

So we have these pieces: $$ x_i \, \, , \, \, \, \phi'= \frac{\partial \phi}{\partial z} $$

and we can evaluate this: $$ \text{Error} = \left( y_{\text{true}}^\mu - y^\mu \right) = \left( y_{\text{true}}^\mu - \phi^{\mu}(z) \right) $$

Woopdie-doo...

Looks a little nasty to evaluate.

So, what are we going to do with these to find $w_{\text{min}}$?

Cost Function Minimization

  • Ability to tune our perceptron based on this performance

We're going to inform our weights via the Error to search for the $w_{\text{min}}$.

perceptron_err

Perceptron Learning Rule

Consider small displacement of a function

For small $\delta w > 0$,

$$ J(w+\delta w) \approx J(w) + J'(w) \ \delta w $$

so

$$ J\left(w-\delta w \ \mathrm{sgn} \ J'(w) \right) \leq J(w) \, . $$


Perceptron Learning Rule

So let's update the weights $w$ via something like

$$ w \leftarrow w + \delta w $$

where

$$ \delta w = - \eta \ J'(w) $$

i.e., follow negative gradient to find minimum of $J(w)$, at a learning rate $\eta$.

Gradient Descent

Algorithm:

  1. Initialize $w_i$
  2. Evaluate perceptron output, $y$
  3. Calculate $\frac{\partial J(w)}{\partial w_i}$, $\delta w_i$
  4. Update weights: $w_i \leftarrow w_i + \delta w_i$
  5. Return to Step 2 if stopping criteria not met.
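
Putting the learning rule together, here is a minimal sketch of batch gradient descent for an AdaLine-style perceptron, where $\phi(z) = z$ so $\partial \phi / \partial z = 1$; the function and variable names are illustrative, not from the workshop code:

import numpy as np

def fit_adaline(X, y_true, eta=0.01, n_iter=50, seed=1):
    """Batch gradient descent on J = 0.5 * sum of squared errors, phi(z) = z."""
    rng = np.random.RandomState(seed)
    w = rng.normal(loc=0.0, scale=0.01, size=X.shape[1] + 1)   # w[0] = bias
    costs = []
    for _ in range(n_iter):
        z = w[0] + X.dot(w[1:])            # net input for all M samples
        errors = y_true - z                # (y_true - phi(z)) with phi(z) = z
        w[1:] += eta * X.T.dot(errors)     # delta w_i = eta * sum_mu error^mu * x_i^mu
        w[0]  += eta * errors.sum()        # bias update (x_0 = 1)
        costs.append(0.5 * (errors ** 2).sum())
    return w, costs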

Gradient Descent

What we've seen so far is called Batch Gradient Descent.

Three main methods:

  • Batch (BGD)
    • $\rightarrow \sum_{\mu=0}^{M}$
    • Uses all examples!
    • Slow, memory requirements...
  • Stochastic (SGD)
    • $k \in M \rightarrow J_{k}'$
    • One example! Used for online learning (data continuously arriving).
    • Potentially noisy path to $J_{\text{min}}$
  • Mini-Batch (MBGD)
    • $ S \subset M \rightarrow \sum_{\mu \in S}$
    • Subset of examples.
    • Compromise, most stable. Typically $|S| = 128, 256, ...$
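
The three variants differ only in how many samples feed each weight update; here is a rough sketch of one mini-batch epoch, reusing the illustrative fit_adaline names from the sketch above:

import numpy as np

def minibatch_epoch(X, y_true, w, eta=0.01, batch_size=128, seed=1):
    """One epoch of mini-batch updates; batch_size=1 gives SGD,
    batch_size=len(X) gives batch gradient descent."""
    rng = np.random.RandomState(seed)
    idx = rng.permutation(len(X))                    # shuffle every epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        z = w[0] + X[batch].dot(w[1:])
        errors = y_true[batch] - z
        w[1:] += eta * X[batch].T.dot(errors)
        w[0]  += eta * errors.sum()
    return w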

Learning Rate

Our first hyperparameter, $\eta$, determines how far $\delta w$ jumps when searching for $J_{\text{min}}$.

grad_eta

Learning Rate

Of course, we are not guaranteed a global minimum via $\frac{\partial J}{\partial w_i} \bigg\rvert _{w_{\text{min}}} = 0$.

Consider deep learning networks with $O(>10^6)$ weights! Choose $\eta$ wisely.

grad_local

Learning Rate - Adaptive

One possible solution is to use an adaptive rate.

Example: Annealing $$ \eta = \frac{c_1}{k+c_2} $$

where $c_i$ are constants and $k$ is the current iteration.
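
In code this is just a schedule evaluated at every iteration; the constants below are arbitrary illustrations:

c1, c2 = 1.0, 10.0
for k in range(5):
    eta = c1 / (k + c2)    # learning rate shrinks as training proceeds
    print(k, eta)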

Iris Dataset

  • Set of 150 samples (individual flowers) that have 4 features: sepal length, sepal width, petal length, and petal width (all in cm). Collected in 1936 by R. Fisher.

  • Each sample is labeled by its species: Iris Setosa, Iris Versicolour, Iris Virginica

  • Task is to develop a model that predicts iris species

  • Dataset freely available from the UCI Machine Learning Repository

Iris dataset
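
The data set also ships with scikit-learn; a hedged sketch of loading it and keeping just the two species and two features used below:

import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
# Keep Setosa (class 0) and Versicolour (class 1),
# with sepal length (column 0) and petal length (column 2)
mask = iris.target < 2
X = iris.data[mask][:, [0, 2]]
y = np.where(iris.target[mask] == 0, -1, 1)   # relabel as {-1, +1} for the perceptron
print(X.shape, np.unique(y))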

Iris Learning Example

Consider just

  • Two species: Setosa, Versicolour
  • Two feature variables: sepal length, petal length

Iris data

Iris Learning Example

By eye, we can separate these.

Can our perceptron learn a decision boundary for classification?

Iris data

Iris Learning Example

Perceptron:

  • Two feature inputs (sepal & petal lengths)
  • Unit Step act. function
  • Learning rate: $\eta = 0.01$

How do we know when the perceptron has learned?

Look at evolution of Error and $J(w)$ with training iteration.

Iris Learning Example

Perceptron:

  • Two feature inputs (sepal & petal lengths)
  • Unit Step act. function (binary classification -> quantized values)
  • Learning rate: $\eta = 0.01$

Unit_01

Iris Learning Example

Decision Boundary with Unit Step

Unit_dec

Mislabelled Example

These two classes are linearly separable.

What happens if we mislabel part of our training set, so that it is no longer linearly separable?

Iris_mis

No convergence for non-linearly separable (mislabelled) data set.

Unit_dec

Continuous Activation

Perceptron:

  • Two feature inputs (sepal & petal lengths)
  • AdaLine act. function
  • Learning rate: $\eta = 0.01$

Ada_01

Iris Boundary

Decision Boundary with AdaLine

Unit_dec

Mislabelled AdaLine

Perceptron:

  • AdaLine act. function
  • Learning rate: $\eta = 0.01$
  • Converges! Of course with higher errors.

mis_Ada

Comparing Learning Rates

Perceptron:

  • AdaLine act. function, $\eta = \{0.1, 0.001\}$
  • Can have major differences in minimizing $J(w)$.

mis_Ada

Feature Scaling

  • Weights typically initialized by $N(0,\epsilon)$, $\epsilon$ small
  • Features may cover large range
  • Can scale $\mathbf{x}_i \rightarrow \frac{\mathbf{x}_i - \mu_i}{\sigma_i}$
    • $\mu_i$ is mean of feature $i$ over data set
    • $\sigma_i$ is the standard deviation of feature $i$ over data set

iris_norm
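
A minimal standardization sketch in NumPy (scikit-learn's StandardScaler does the same job):

import numpy as np

def standardize(X):
    """Scale each feature (column) to zero mean and unit standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Equivalently with scikit-learn:
# from sklearn.preprocessing import StandardScaler
# X_std = StandardScaler().fit_transform(X)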

Feature Scaling

Much faster convergence just by preprocessing features!

norm_Ada

Visualizing Learning

Notebook with some animations.

  • Visualize learning of boundary
    • Different $\phi(z)$
    • Non-scaled and scaled data set

Regression

  • Fit a single Perceptron to one flower type (Setosa)
  • Now one explanatory variable (Sepal length)

setosa

Regression

  • Smooth convergence of cost minimization
  • Visual animation of fit

setosa

Further Complications

  • Our examples are rather simple
  • Test a more complicated regression target: $y_{\text{true}}(x) = \cos\left(\frac{3\pi}{2}x\right)$
  • Using polynomials of different degree

under_over_fitting

Image source: Underfitting vs. Overfitting scikit-learn.
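
The cited scikit-learn example boils down to fitting polynomials of increasing degree to the noisy cosine; a condensed sketch with illustrative degrees:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
x = np.sort(rng.rand(30))
y = np.cos(1.5 * np.pi * x) + rng.randn(30) * 0.1    # y_true plus noise

for degree in (1, 4, 15):                            # under-fit, good fit, over-fit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x[:, np.newaxis], y)
    print(degree, model.score(x[:, np.newaxis], y))  # R^2 on the training set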

Classification

  • Bias: mean deviation of predictions from the true values (systematic error)
  • Variance: variability of the model's prediction for a given sample (sensitivity to the particular training set)

under_over_fitting

Image source: Underfitting vs. Overfitting scikit-learn.

Regularization

  • Can use regularization to smooth learning
  • For example, we don't want weight amplitudes to grow uncontrolled
  • L2 Regularization
    • $\lambda||\mathbf{w}||^2 = \lambda \sum_i w_i^2$
    • New hyperparameter $\lambda$
    • Include reg. term in minimization procedure
    • Feature scaling is important when using regularization (puts all weights on the same footing)
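
In the gradient-descent sketch earlier, L2 regularization only changes the weight update; a hedged fragment with illustrative names (the bias $w_0$ is usually left unregularized):

import numpy as np

def l2_update(w, X, errors, eta=0.01, lam=0.1):
    """One step on J = 0.5 * sum(err^2) + lam * ||w||^2 (bias not regularized)."""
    w = w.copy()
    w[1:] += eta * (X.T.dot(errors) - 2.0 * lam * w[1:])
    w[0]  += eta * errors.sum()
    return w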

Quantifying Performance

  • Under/over fitting extended to ML
    • Hyperparameter effects on performance
    • How to quantify quality of trained model
    • How do we know we are under/over fitting?
  • Splitting up labelled data set

    • Training set ($\sim70\%$)
    • Validation set ($\sim30\%$)
  • Further test performance on entirely unseen data
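
A hedged sketch of the split with scikit-learn, using the 70/30 fractions quoted above:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)   # ~70% train / ~30% validation
print(len(X_train), 'training samples,', len(X_val), 'validation samples')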

Quantifying Performance

  • Validation score: a measure of quality
    • Classification: fraction of correct identifications
    • Regression: e.g. coefficient of determination, $R^2$
  • Underfitting: the model has not learned enough; overfitting: it fits the training set but can't generalize to new data
  • Model complexity: polynomial degree, hyperparameters ($\eta$, $N_{\text{iter}}$, regularization parameter)

validation_curve

Back to General Scheme for Building ML Systems

simple_perceptron

ML Algorithms Overview

  • Support Vector Machines (SVM)
  • Decision Trees
  • Artificial Neural Networks (ANN)
    • Multilayer Networks
    • Convolutional NN

Support Vector Machines (SVM)

  • Maximize the margin: distance between support vectors
  • Support vectors defined by nearest training samples to hyperplane

SVM

Image source: Python Machine Learning by Sebastian Raschka

Support Vector Machines (SVM)

  • Maximize the margin: distance between support vectors
  • Support vectors defined by nearest training samples to hyperplane

$$ \sum_i w_i x^{\text{upper}}_i = +1 \\ \sum_i w_i x^{\text{lower}}_i = -1 \\ \sum_i w_i (x^{\text{upper}}_i - x^{\text{lower}}_i) = 2 \\ $$ Normalize: $$ \frac{\sum_i w_i (x^{\text{upper}}_i - x^{\text{lower}}_i)}{||\mathbf{w}||} = \frac{2}{||\mathbf{w}||} $$

Support Vector Machines (SVM)

  • So maximize $\frac{2}{||\mathbf{w}||}$ subject to classification conditions: $$ y_k \sum_i w_i x^k_i \ge 1 \quad \text{for } k = 1, \dots, N $$

  • Actually easier to minimize $\frac{1}{2}||\mathbf{w}||^2$ -> quadratic optimization problem

  • Introduce soft margins for non-linearly separable data
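
In practice the quadratic optimization is handled by a library; a minimal sketch with scikit-learn's SVC on the Iris data (parameter values are illustrative, and C controls the softness of the margin):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

svm = SVC(kernel='linear', C=1.0)     # linear kernel; smaller C -> softer margin
svm.fit(X_train, y_train)
print(svm.support_vectors_.shape)     # the training samples that define the margin
print(svm.score(X_val, y_val))        # fraction of correct classifications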

Decision Trees

  • Break down data via series of questions
  • Split at each node via optimization of Information Gain

simple_tree

Image source: Python Machine Learning by Sebastian Raschka

Decision Trees

  • Information gain (IG) as the objective function $$ \text{IG}(D_p,f) = I(D_p) - \sum_{j=1}^m \frac{N_j}{N_p}I(D_j) $$

where

  • $f$ is the feature for the split
  • $D_p, \, D_j$ are the data set of the parent and $j$th child node
  • $I$ is the impurity measure
  • $N_p$ is the number of samples at the parent node
  • $N_j$ is the number of samples in the $j$th child node

IG is the difference between the impurity of the parent node and the sum of the child node impurities.

Decision Trees

Most libraries implement binary splitting: $$ \text{IG}(D_p,f) = I(D_p) - \frac{N_\text{left}}{N_p}I(D_\text{left}) - \frac{N_\text{right}}{N_p}I(D_\text{right}) $$

Common impurity measures $I(t)$ at node $t$:

  • Entropy: $I(t) = - \sum_{i=1}^c p(i|t) \log_2 p(i|t)$
  • Gini: $I(t) = \sum_{i=1}^c p(i|t) \left( 1- p(i|t) \right) = 1 - \sum_{i=1}^c p(i|t)^2$
  • Classification error: $I(t) = 1- \max_i p(i|t)$

where $p(i|t)$ is the fraction of samples belonging to class $i$ at node $t$.

Decision trees can overtrain easily, so pruning is needed to limit the maximum depth of the tree (see the sketch below).
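
A minimal sketch with scikit-learn's DecisionTreeClassifier, where max_depth plays the role of pruning (values illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(criterion='gini',   # impurity measure, or 'entropy'
                              max_depth=3,        # limit depth to avoid overtraining
                              random_state=0)
tree.fit(X_train, y_train)
print(tree.score(X_val, y_val))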

Artificial Neural Networks

ann_scheme

Artificial Neural Networks

  • Built up of many connected artificial neurons
  • Building block is our little perceptron!
  • Applications:
    • Computer vision
    • Speech recognition
    • Social network filtering
    • Game play
    • Physics!

Artificial Neural Networks

  • Single layer network

single

Artificial Neural Networks

  • Single layer network
  • Keeping track of indices
    • $i$ input features
    • $j$ outputs
    • $i\times j$ weights
  • Optimization with $J \sim \sum_{\mu,k} \left( y_{k, \text{true}}^{\mu} - y_{k}^{\mu} \right)^2$ ($k$th output and $\mu$th sample)

Artificial Neural Networks

Adding more layers means adding more indices.

Optimization algorithm (backpropagation) remains the same.

Just start from the outputs, and move backwards to fine tune the weights!

double

ANN

  • Image Analysis

    • B&W: each pixel is a feature (grayscale $[0,255] \rightarrow [0,1]$ scaled)
    • RGB: each pixel provides 3 features (R, G, B $[0,255] \rightarrow [0,1]$ scaled)
    • Each PMT provides charge, Q
  • Event Analysis

    • Sophisticated reco variables: lateral dist, theta, timing

Convolutional NN

  • Primarily used in image analysis
    • Convolutions account for neighboring pixels
    • Great for identifying sub-features in data

MaxPool Convolution

simple_conv

Image source: [Understanding CNN for NLP](http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/) Denny Britz

Deep Convolutional Network

Layers of convolutions and subsampling (pooling)...

simple_conv

Image source: [Introduction to CNN for Vision Tasks](https://pythonmachinelearning.pro/introduction-to-convolutional-neural-networks-for-vision-tasks/)
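
To give a flavour before the Keras tutorial, a minimal (untrained) convolutional stack for 28x28 grayscale images might look like the following; the layer sizes are illustrative and not taken from the workshop scripts:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))  # convolution
model.add(MaxPooling2D(pool_size=(2, 2)))                                  # max pooling
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())                         # flatten feature maps for the dense layer
model.add(Dense(10, activation='softmax'))   # 10 output classes
model.summary()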

Keras Tutorial

Following the first few sections of Deep Learning with Keras

  • Single Layer
  • M-Hidden Layers
  • Dropout in M Layers

Keras Tutorial

  • Python API for running TensorFlow
  • TensorFlow: Google's symbolic library for tensor math
  • Online playground here
  • Keras provides more intuitive functionality for developing Deep Learning models
    • Building blocks: layers, objective and activation functions, optimizers
    • Distributed training of networks via GPUs and GPU clusters

Installation

A few things prior to running scripts.

Install some packages with pip

  • pip install numpy scipy scikit-learn pillow h5py
  • pip install Theano
  • pip install --upgrade tensorflow
  • pip install keras

Now test your installation

import theano
import theano.tensor as T

x = T.dmatrix('x')
s = 1 / (1 + T.exp(-x))
logistic = theano.function([x], s)
logistic([[0, 1], [-1, -1]])

First Keras Script

Let's go train a single layer NN on the MNIST data set: 70k handwritten (labelled) digits.

mnist

First Keras Script

We're going to define:

  • 10 outputs (10 digits)
  • Split training and validation set: 80 / 20
  • Mini-batch sets of 128 samples
  • Objective function: categorical cross-entropy

from __future__ import print_function
import numpy as np
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras.optimizers import SGD
from keras.utils import np_utils

np.random.seed(1671)  # for reproducibility

# Network & training
N_EPOCH = 5  # 200
BATCH_SIZE = 128
VERBOSE = 1
N_CLASSES = 10            # No. outputs = No. digits
OPTIMIZER = SGD()         # SGD optimizer
N_HIDDEN = 128            # No. hidden nodes in layer
VALIDATION_SPLIT = 0.2    # fraction of training set used for validation
LOSS = 'categorical_crossentropy'  # categorical_crossentropy, binary_crossentropy, mse
METRIC = 'accuracy'                # accuracy, precision, recall

# Data -> shuffled and split between training and test sets
(X_train, y_train), (X_test, y_test) = mnist.load_data()

# X_train is 60,000 rows of 28x28 values -> to be reshaped to 60,000 x 784
RESHAPED = 784
X_train = X_train.reshape(60000, RESHAPED)
X_test = X_test.reshape(10000, RESHAPED)

# Need to make float32 for GPU use
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')

# Normalize grey-scale values
X_train /= 255
X_test /= 255
print(X_train.shape[0], ' training samples')
print(X_test.shape[0], ' testing samples')

# Convert class vectors to binary class matrices
Y_train = np_utils.to_categorical(y_train, N_CLASSES)
Y_test = np_utils.to_categorical(y_test, N_CLASSES)

# N_CLASSES outputs, final stage is normalized via softmax
model = Sequential()
model.add(Dense(N_CLASSES, input_shape=(RESHAPED,)))
model.add(Activation('softmax'))
model.summary()

# Compile the model
model.compile(loss=LOSS, optimizer=OPTIMIZER, metrics=[METRIC])

# Train the model
history = model.fit(X_train, Y_train,
                    batch_size=BATCH_SIZE,
                    epochs=N_EPOCH,
                    verbose=VERBOSE,
                    validation_split=VALIDATION_SPLIT)

# Validation of the model with test set
score = model.evaluate(X_test, Y_test, verbose=VERBOSE)
print("Test score: ", score[0])
print("Test accuracy: ", score[1])

Multiple Hidden Layers

To add more hidden layers, insert blocks like the following between the input and the final Dense(N_CLASSES) / softmax output layer (the input_shape argument then goes on the first hidden layer):

# 2nd layer
model.add(Dense(N_HIDDEN))
model.add(Activation('relu'))   # Can use other activation functions here

# 3rd layer
model.add(Dense(N_HIDDEN))
model.add(Activation('relu'))   # Can use other activation functions here

We also don't need so many iterations, so we can try N_EPOCH = 20.
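
For the last item on the tutorial list, dropout, each hidden layer gets a Dropout layer after its activation; a hedged sketch extending the model above, with a dropout probability of 0.3 (the value is illustrative):

from keras.layers.core import Dense, Activation, Dropout

# 2nd layer with dropout
model.add(Dense(N_HIDDEN))
model.add(Activation('relu'))
model.add(Dropout(0.3))    # randomly zero 30% of this layer's outputs during training

# 3rd layer with dropout
model.add(Dense(N_HIDDEN))
model.add(Activation('relu'))
model.add(Dropout(0.3))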

Additional Resources

  • Python Machine Learning by Sebastian Raschka [GitHub][Amazon]

  • Data Science Handbook by Jake VanderPlas [GitHub][Amazon]

  • The Elements of Statistical Learning by Hastie, Tibshirani and Friedman [Free book!]

  • Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville [Amazon]

Thank you for your attention!

final