Overview of Semi-Supervised Classification
with GCNs by Kipf & Welling

Zigfried Hampel-Arias

ML6 -- Ghent, BE

23 April, 2018

Overview

  • Short Graph Terminology
  • Semi-Supervised Classification
  • Work by Kipf & Welling (2017)

Graph Structured Data (GSD)

Data sets in which data samples (nodes) exhibit inter-connectivity or relationships (edges).

Describe with graphs.

Graphs defined by

  • Set of $n$ nodes or vertices, $n = |\mathcal{V}|$
  • Set of $C$ features for each node, $C = |\mathcal{C}|$
  • Set of $E$ edges, $E = |\mathcal{E}|$
  • May include directionality and weights; here we consider only simple graphs.

Simple Example

A simple, undirected graph with $n=4$ vertices (nodes), each with $C = 1$ feature, connected by $E = 4$ edges.



simple_graph

Typical GSD Datasets

  • Social networks
  • Knowledge graphs
  • Citation databases

Need a way to describe connectivity...

Some Definitions

Indices run over graph vertices: $i \in [1,n]$.


The adjacency matrix $A$ defines the connectivity of the graph:

$$ A_{ij} = \begin{cases} 1 & \text{ if edge from vertex } i \text{ to } j \\ 0 & \text{ otherwise} \end{cases} $$


The degree matrix $D$ is diagonal, giving the number of edges at each vertex:

$$ D_{ii} = \sum_j A_{ij} $$

Simple Example

simple_graph

In [3]:
import numpy as np

# Adjacency matrix A and degree matrix D for the 4-node example graph
A = np.array([[0., 1., 0., 0.], [1., 0., 1., 1.],
              [0., 1., 0., 1.], [0., 1., 1., 0.]])
D = np.diag(A.sum(axis=1))

print(' A \t\t D')
for j in range(A.shape[1]):
    print('{}   {}'.format(A[:, j], D[:, j]))
print(' Symmetric \t Diagonal')
 A 		 D
[0. 1. 0. 0.]   [1. 0. 0. 0.]
[1. 0. 1. 1.]   [0. 3. 0. 0.]
[0. 1. 0. 1.]   [0. 0. 2. 0.]
[0. 1. 1. 0.]   [0. 0. 0. 2.]
 Symmetric 	 Diagonal

Problem:

Can we build a predictive model on a GSD set in which only some of the nodes are labelled?

  • Some but not all samples / nodes labelled: Semi-Supervised Learning
  • Can take advantage of node connectivity via $A$
  • How to include $A$ in algorithm?
    • Preprocessing
    • Model evaluation
    • Training optimization (via loss)

Semi-Supervised Learning

One solution incorporates $A$ via regularization term in objective function:

$$ \mathcal{L} = \underbrace{\mathcal{L}_0}_{\text{Supervised}} + \underbrace{\lambda \, \mathcal{L}_{\text{reg}}}_{\text{Unsupervised}} $$

Supervised Component:

  • $\mathcal{L}_0 \sim$ categorical cross-entropy
  • Acts only on labelled subset of graph data

Unsupervised Component:

  • $\mathcal{L}_{\text{reg}} = \sum_{i,j} A_{ij} \ || \ f(X_i) \ - \ f(X_j) \ ||^2$
    • Incorporates graph Laplacian $L = D \ - \ A$
    • Describes smoothness
  • $f(X) \sim$ neural network
  • $\lambda$: regularization factor
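
As a quick illustration (my own addition, not from the original slides), the sketch below reuses the 4-node example graph and a random placeholder embedding $f$ to check numerically that the pairwise penalty equals a quadratic form in the graph Laplacian $L = D - A$; the factor of 2 appears because the sum runs over ordered pairs $(i,j)$.

import numpy as np

# 4-node example graph from the earlier slide (reused here for illustration)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))
L = D - A                                 # graph Laplacian

rng = np.random.default_rng(0)
f = rng.normal(size=(4, 2))               # placeholder outputs f(X_i), 2 dims per node

# Pairwise smoothness penalty: sum_{i,j} A_ij ||f_i - f_j||^2
pairwise = sum(A[i, j] * np.sum((f[i] - f[j]) ** 2)
               for i in range(4) for j in range(4))

# Equivalent quadratic form: 2 * trace(f^T L f), since the sum runs over ordered pairs
quadratic = 2 * np.trace(f.T @ L @ f)

print(np.isclose(pairwise, quadratic))    # True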

Semi-Supervised Classification

Incorporating $A$ via regularization:

$$ \begin{align} \mathcal{L} &= \mathcal{L}_0 + \lambda \, \mathcal{L}_{\text{reg}} \\ \\ &= \mathcal{L}_0 + \lambda \ \sum_{i,j} A_{ij} \ || \ f(X_i) \ - \ f(X_j) \ ||^2 \end{align} $$

  • Connectivity encoded in regularization
    • Assumption $\rightarrow$ connected nodes are likely to share the same label
    • Hyperparameter dependency $\rightarrow$ tuning of $\lambda$
  • Not fully general model
    • Connectivity $\not \to$ node similarity
    • Problem dependent

Work by Kipf & Welling (2017)


Two main contributions:

  • Encode graph structure within model architecture
  • Scalable approach to semi-supervised learning via approximation

Incorporating Connectivity

Embed $A$ inside neural network structure:

$$ f(X) \rightarrow f(X,A) $$

  • Encode in activation value
  • Training of model weights explicitly includes node connectivity
  • Only need supervised component of objective function

Convolution on Graph

Consider a convolution parametrized by the characteristic spectrum of the graph structure:

$$ \text{Conv} \left( X \right)= U \ g_{\theta} (\Lambda) \ U^{T} X $$

  • $U$ eigenvectors of $L_{\text{norm}} = U \ \Lambda \ U^{T} = I - D^{-1/2} \ A \ D^{-1/2}$
  • $\Lambda$ eigenvalues of $L_{\text{norm}}$
  • $g_{\theta}(\Lambda)$ spectral convolution parametrized by $\theta \in \mathbb{R}^n$
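
As a hedged sketch (not part of the original slides), the numpy snippet below applies this spectral convolution to the 4-node toy graph; the filter parameters $\theta$ are arbitrary placeholders, and $g_{\theta}(\Lambda)$ is taken as a free diagonal filter in the eigenbasis.

import numpy as np

# Toy 4-node graph (assumption: same example graph as before)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
d = A.sum(axis=1)
D_inv_sqrt = np.diag(d ** -0.5)
L_norm = np.eye(4) - D_inv_sqrt @ A @ D_inv_sqrt    # I - D^{-1/2} A D^{-1/2}

# Eigendecomposition L_norm = U diag(Lambda) U^T (symmetric, so eigh applies)
Lam, U = np.linalg.eigh(L_norm)

theta = np.array([0.5, 1.0, -0.3, 2.0])             # placeholder parameters, theta in R^n
g_theta = np.diag(theta)                             # free spectral filter, one value per frequency

X = np.array([[1.0], [2.0], [0.5], [3.0]])           # C = 1 feature per node
conv = U @ g_theta @ U.T @ X                         # Conv(X) = U g_theta(Lambda) U^T X
print(conv)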

Potential issues:

  • Eigendecomposition $(U)$ may be prohibitively expensive!
  • Calculation $\sim \mathcal{O}(n^2)$ also expensive!

Approximation via Expansion

Approximate $g_{\theta'}\approx g_{\theta}$

  • Truncated Chebyshev polynomial expansion:

$$ \text{Conv} \left( X \right) \approx U \ \left( \theta'_0 + \theta'_1 \ \Lambda \ + ... + \ \theta'_k \ \Lambda^k \right) \ U^{T} \ X $$

  • Furthermore, absorbing $U, \ U^{T}$ into expansion:

$$ g_{\theta'}(\Lambda) \rightarrow g_{\theta'}(L_{\text{norm}}) = \theta'_0 + \theta'_1 \ L_{\text{norm}} \ + ... + \ \theta'_k \ L_{\text{norm}}^k $$

  • Filter now explicitly encodes graph connectivity $L_{\text{norm}} = L_{\text{norm}}(A)$

The complexity of evaluating $g_{\theta'}$ scales linearly with the number of edges: $\mathcal{O}(k \ E) < \mathcal{O}(n^2)$
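
A minimal sketch (my own illustration, following the slide's simplified monomial form rather than the full Chebyshev recursion) of evaluating $g_{\theta'}(L_{\text{norm}}) \, X$ with repeated sparse matrix-vector products, which is where the $\mathcal{O}(kE)$ cost comes from; the coefficients are placeholders.

import numpy as np
import scipy.sparse as sp

def poly_filter_conv(A, X, thetas):
    """Evaluate (theta_0 I + theta_1 L_norm + ... + theta_k L_norm^k) X using only
    sparse mat-vec products, so cost scales with the number of edges."""
    A = sp.csr_matrix(A)
    d = np.asarray(A.sum(axis=1)).ravel()
    D_inv_sqrt = sp.diags(d ** -0.5)
    L_norm = sp.eye(A.shape[0]) - D_inv_sqrt @ A @ D_inv_sqrt

    out = thetas[0] * X
    T = X
    for theta_k in thetas[1:]:
        T = L_norm @ T           # one more power of L_norm, O(E) per feature column
        out = out + theta_k * T
    return out

A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
X = np.array([[1.0], [2.0], [0.5], [3.0]])
print(poly_filter_conv(A, X, thetas=[0.5, 1.0, -0.25]))   # k = 2, placeholder coefficients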

A Linear Model

Let's limit to $k=1$

$$ \text{Conv} \left( X \right) \approx w \ \left( \ I + D^{-1/2} \ A \ D^{-1/2} \ \right) \ X $$

  • Filter is linear in $A$
  • Also limited to single filter parameter, $\theta \rightarrow w$

The renormalization $\tilde{A} = A+I$ (with $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$), introduced to avoid numerical instabilities from repeated convolutions, gives:

$$ Z = \tilde{D}^{-1/2} \ \tilde{A} \ \tilde{D}^{-1/2} \ X \ W $$

Full propagation rule at layer $l$, with activation function $\sigma(\cdot)$:

$$ H^{(l+1)} = \sigma \left( \tilde{D}^{-1/2} \ \tilde{A} \ \tilde{D}^{-1/2} \ H^{(l)} \ W^{(l)} \right) $$

  • Linear in $A$ → stack $k$ convolutions to probe neighborhood $k$ nodes away
  • Complexity is $\mathcal{O}( \, E \ C \ k \, )$
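
To make the layer concrete, here is a minimal numpy sketch (my own, not the authors' reference code) of the renormalization trick and a single propagation step; the feature matrix, weights, and tanh activation are placeholder choices.

import numpy as np

def renormalized_adjacency(A):
    """Renormalization trick: A_tilde = A + I, then D_tilde^{-1/2} A_tilde D_tilde^{-1/2}."""
    A_tilde = A + np.eye(A.shape[0])
    d_tilde = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(d_tilde ** -0.5)
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

def gcn_layer(A_hat, H, W, activation=np.tanh):
    """One propagation step: H^{(l+1)} = sigma(A_hat H^{(l)} W^{(l)})."""
    return activation(A_hat @ H @ W)

# Toy example: 4-node graph, C = 2 input features, 3 hidden units (placeholder values)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 2))        # node features, H^{(0)} = X
W0 = rng.normal(size=(2, 3))       # layer weights (learned during training in practice)

A_hat = renormalized_adjacency(A)  # can be precomputed once as a preprocessing step
H1 = gcn_layer(A_hat, X, W0)
print(H1.shape)                    # (4, 3): new 3-dim representation per node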

Summary of Propagation Rule

prop_rule

Image source: Node Classification by Graph Convolutional Network by Graham Ganssle

Demonstration

Semi-supervised classification on karate club network

  • 34 nodes (featureless)
  • 154 edges (unweighted, undirected)
  • Four classes (clustering found by Brandes et al., 2008)
  • For demo, only one instance of each class is labelled!

Model:

  • Three-layer GCN model
  • 300 training iterations

Demonstration: Karate Club Graph

karate_club

Demonstration

Video source: Graph Convolutional Networks by Thomas Kipf

Performance

The authors tested the method on GSD sets with different

  • Numbers of vertices, edges, features, & classes
  • Labelled fraction used for supervised training
  • Variants of layer-wise propagation rule

Performance: Method Comparison on Datasets

baseline_perf

Performance: Propagation Rule


prop_rule_perf

Performance: CPU/GPU

Mean training time over 100 epochs. Factor $\sim 2$ performance gain.

cpu_gpu_compare

Summary

The authors developed a model that

  • encodes graph structure explicitly
    • Embeds $A$ directly: $f(X) \rightarrow f(X,A)$
    • Learns weights to predict node classification via labelled component of dataset
  • linearly scales with edges

    • Can stack $k$ layers to probe $k^{\text{th}}$-scale neighborhood
    • Computationally accessible: $\tilde{D}^{-1/2} \ \tilde{A} \ \tilde{D}^{-1/2}$ can be precomputed as a preprocessing step
  • outperforms other models

Thank you for your attention!


final

Any Questions?

Backups

Performance: Number of Layers

Two to three layers are sufficient

layer_perf

Limitations of Method

  • Memory

    • Full-batch training $\rightarrow$ linear memory requirements
    • Mini-batch training must account for neighborhood depth, potentially requiring further approximations
  • Non-simple graphs

    • Current architecture doesn't include directed, weighted edges
    • Directionality can potentially be represented by converting to an undirected graph with additional nodes
  • Locality

    • Equal weighting given to self-connections & edges to neighbors
    • Potential to incorporate trade-off parameter that can be learned: $\tilde{A} = A+\lambda \, I$

Mathematics of Spectral Graph Theory

For simple graph, the Laplacian is given by $L = D - A$,

$$ L_{ij} = \begin{cases} \text{deg} \, v_i & \text{ if } i=j \\ -1 & \text{ if } i \neq j \text{ and } v_i \text{ adjacent to } v_j \\ 0 & \text{ otherwise} \end{cases} $$


The symmetric, normalized Laplacian $L_{\text{norm}}$ is given by

$$ L_{\text{norm}} = D^{-1/2} \, L \, D^{-1/2} = I - D^{-1/2} \, A \, D^{-1/2} \, , $$

$$ L^{\text{norm}}_{ij} = \begin{cases} 1 &\text{ if } i=j \text{ and deg} \, v_i \neq 0 \\ -\left( \text{deg } v_i \, \text{deg } v_j \right)^{-1/2} & \text{ if } i \neq j \text{ and } v_i \text{ adjacent to } v_j \\ 0 & \text{ otherwise} \end{cases} $$


The resulting spectrum of eigenvalues consists of non-negative real values: $\lambda^{\text{norm}}_i \in \mathbb{R}_{\geq 0} \ \ , \ \forall i$
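
A quick numeric check on the toy graph (my own addition) that the eigenvalues of $L_{\text{norm}}$ are indeed real and non-negative:

import numpy as np

A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
d = A.sum(axis=1)
D_inv_sqrt = np.diag(d ** -0.5)
L_norm = np.eye(4) - D_inv_sqrt @ A @ D_inv_sqrt

eigvals = np.linalg.eigvalsh(L_norm)     # symmetric matrix -> real eigenvalues
print(eigvals)                           # all >= 0
print(np.all(eigvals >= -1e-12))         # True, up to floating-point tolerance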

Mathematics of Expansion

Relation to Weisfeiler-Lehman Algorithm

  • Distills node connectivity information into a minimal representation
  • Replace hash function with $\sigma(\cdot)$
  • Define $c_{ij} = \sqrt{\, d_i \ d_j}$

$$ h_i^{(l+1)} = \sigma \left( \sum_{j \in \mathcal{N}_i \cup \{i\}} \ \frac{1}{c_{ij}} \ h_{j}^{(l)} \ W^{(l)}\right) $$

Interpretation: the propagation rule is a parametrized, differentiable generalization of the 1-dimensional Weisfeiler-Lehman algorithm.