Overview of Semi-Supervised Classification
with GCNs by Kipf & Welling

Zigfried Hampel-Arias

ML6 -- Ghent, BE

23 April, 2018

Overview

  • Short Graph Terminology
  • Semi-Supervised Classification
  • Work by Kipf & Welling (2017)

Graph Structured Data (GSD)

Data sets in which data samples (nodes) exhibit inter-connectivity or relationships (edges).

Describe with graphs.

Graphs defined by

  • Set of $n$ nodes or vertices, $n = |\mathcal{V}|$
  • Set of $C$ features for each node, $C = |\mathcal{C}|$
  • Set of $E$ edges, $E = |\mathcal{E}|$
  • May include directionality and weights; here we consider only simple graphs.

Simple Example

A simple, undirected graph with $n=4$ vertices (nodes), each with $C = 1$ feature, connected by $E = 4$ edges.



simple_graph

Typical GSD Datasets

  • Social networks
  • Knowledge graphs
  • Citation databases

Need a way to describe connectivity...

Some Definitions

Indices run over graph vertices: $i \in [1,n]$.


The adjacency matrix $A$ defines the connectivity of the graph:

$$ A_{ij} = \begin{cases} 1 & \text{ if edge from vertex } i \text{ to } j \\ 0 & \text{ otherwise} \end{cases} $$


The degree matrix $D$ is diagonal, giving the number of edges at each vertex:

$$ D_{ii} = \sum_j A_{ij} $$

Simple Example

simple_graph

In [3]:
import numpy as np

# Adjacency matrix A and degree matrix D for the 4-node example graph
A = np.array([[0., 1., 0., 0.], [1., 0., 1., 1.],
              [0., 1., 0., 1.], [0., 1., 1., 0.]])
D = np.diag(A.sum(axis=1))

print(' A \t\t D')
for j in range(A.shape[1]):
    print('{}   {}'.format(A[:, j], D[:, j]))
print(' Symmetric \t Diagonal')
 A 		 D
[0. 1. 0. 0.]   [1. 0. 0. 0.]
[1. 0. 1. 1.]   [0. 3. 0. 0.]
[0. 1. 0. 1.]   [0. 0. 2. 0.]
[0. 1. 1. 0.]   [0. 0. 0. 2.]
 Symmetric 	 Diagonal

Problem:

Can we build a predictive model on a GSD set in which only some of the nodes are labelled?

  • Some but not all samples / nodes labelled: Semi-Supervised Learning
  • Can take advantage of node connectivity via $A$
  • How to include $A$ in algorithm?
    • Preprocessing
    • Model evaluation
    • Training optimization (via loss)

Semi-Supervised Learning

One solution incorporates $A$ via regularization term in objective function:

$$ \mathcal{L} = \underbrace{\mathcal{L}_0}_{\text{Supervised}} + \underbrace{\lambda \, \mathcal{L}_{\text{reg}}}_{\text{Unsupervised}} $$

Supervised Component:

  • $\mathcal{L}_0 \sim$ categorical cross-entropy
  • Acts only on labelled subset of graph data

Unsupervised Component:

  • $\mathcal{L}_{\text{reg}} = \sum_{i,j} A_{ij} \ || \ f(X_i) \ - \ f(X_j) \ ||^2$
    • Incorporates graph Laplacian $L = D \ - \ A$
    • Describes smoothness
  • $f(X) \sim$ neural network
  • $\lambda$: regularization factor
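
As a quick illustration (my own addition, not from the original slides), the sketch below reuses the 4-node example graph and a random placeholder embedding $f$ to check numerically that the pairwise penalty equals a quadratic form in the graph Laplacian $L = D - A$; the factor of 2 appears because the sum runs over ordered pairs $(i,j)$.

import numpy as np

# 4-node example graph from the earlier slide (reused here for illustration)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))
L = D - A                                 # graph Laplacian

rng = np.random.default_rng(0)
f = rng.normal(size=(4, 2))               # placeholder outputs f(X_i), 2 dims per node

# Pairwise smoothness penalty: sum_{i,j} A_ij ||f_i - f_j||^2
pairwise = sum(A[i, j] * np.sum((f[i] - f[j]) ** 2)
               for i in range(4) for j in range(4))

# Equivalent quadratic form: 2 * trace(f^T L f), since the sum runs over ordered pairs
quadratic = 2 * np.trace(f.T @ L @ f)

print(np.isclose(pairwise, quadratic))    # True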

Semi-Supervised Classification

Incorporating $A$ via regularization:

$$ \begin{align} \mathcal{L} &= \mathcal{L}_0 + \lambda \, \mathcal{L}_{\text{reg}} \\ \\ &= \mathcal{L}_0 + \lambda \ \sum_{i,j} A_{ij} \ || \ f(X_i) \ - \ f(X_j) \ ||^2 \end{align} $$

  • Connectivity encoded in regularization
    • Assumption $\rightarrow$ connected nodes are likely to share the same label
    • Hyperparameter dependency $\rightarrow$ tuning of $\lambda$
  • Not fully general model
    • Connectivity $\not \to$ node similarity
    • Problem dependent

Work by Kipf & Welling (2017)


Two main contributions:

  • Encode graph structure within model architecture
  • Scalable approach to semi-supervised learning via approximation

Incorporating Connectivity

Embed $A$ inside neural network structure:

$$ f(X) \rightarrow f(X,A) $$

  • Encode in activation value
  • Training of model weights explicitly includes node connectivity
  • Only need supervised component of objective function

Convolution on Graph

Consider a convolution parametrized by the characteristic spectrum of the graph structure:

$$ \text{Conv} \left( X \right)= U \ g_{\theta} (\Lambda) \ U^{T} X $$

  • $U$ eigenvectors of $L_{\text{norm}} = U \ \Lambda \ U^{T} = I - D^{-1/2} \ A \ D^{-1/2}$
  • $\Lambda$ eigenvalues of $L_{\text{norm}}$
  • $g_{\theta}(\Lambda)$ spectral convolution parametrized by $\theta \in \mathbb{R}^n$
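
As a hedged sketch (not part of the original slides), the numpy snippet below applies this spectral convolution to the 4-node toy graph; the filter parameters $\theta$ are arbitrary placeholders, and $g_{\theta}(\Lambda)$ is taken as a free diagonal filter in the eigenbasis.

import numpy as np

# Toy 4-node graph (assumption: same example graph as before)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
d = A.sum(axis=1)
D_inv_sqrt = np.diag(d ** -0.5)
L_norm = np.eye(4) - D_inv_sqrt @ A @ D_inv_sqrt    # I - D^{-1/2} A D^{-1/2}

# Eigendecomposition L_norm = U diag(Lambda) U^T (symmetric, so eigh applies)
Lam, U = np.linalg.eigh(L_norm)

theta = np.array([0.5, 1.0, -0.3, 2.0])             # placeholder parameters, theta in R^n
g_theta = np.diag(theta)                             # free spectral filter, one value per frequency

X = np.array([[1.0], [2.0], [0.5], [3.0]])           # C = 1 feature per node
conv = U @ g_theta @ U.T @ X                         # Conv(X) = U g_theta(Lambda) U^T X
print(conv)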

Potential issues:

  • Eigendecomposition $(U)$ may be prohibitively expensive!
  • Calculation $\sim \mathcal{O}(n^2)$ also expensive!

Approximation via Expansion

Approximate $g_{\theta'}\approx g_{\theta}$

  • Truncated Chebyshev polynomial expansion:

$$ \text{Conv} \left( X \right) \approx U \ \left( \theta'_0 + \theta'_1 \ \Lambda \ + ... + \ \theta'_k \ \Lambda^k \right) \ U^{T} \ X $$

  • Furthermore, absorbing $U, \ U^{T}$ into expansion:

$$ g_{\theta'}(\Lambda) \rightarrow g_{\theta'}(L_{\text{norm}}) = \theta'_0 + \theta'_1 \ L_{\text{norm}} \ + ... + \ \theta'_k \ L_{\text{norm}}^k $$

  • Filter now explicitly encodes graph connectivity $L_{\text{norm}} = L_{\text{norm}}(A)$

The complexity of evaluating $g_{\theta'}$ scales linearly with the number of edges: $\mathcal{O}(k \ E) < \mathcal{O}(n^2)$
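
A minimal sketch (my own illustration, following the slide's simplified monomial form rather than the full Chebyshev recursion) of evaluating $g_{\theta'}(L_{\text{norm}}) \, X$ with repeated sparse matrix-vector products, which is where the $\mathcal{O}(kE)$ cost comes from; the coefficients are placeholders.

import numpy as np
import scipy.sparse as sp

def poly_filter_conv(A, X, thetas):
    """Evaluate (theta_0 I + theta_1 L_norm + ... + theta_k L_norm^k) X using only
    sparse mat-vec products, so cost scales with the number of edges."""
    A = sp.csr_matrix(A)
    d = np.asarray(A.sum(axis=1)).ravel()
    D_inv_sqrt = sp.diags(d ** -0.5)
    L_norm = sp.eye(A.shape[0]) - D_inv_sqrt @ A @ D_inv_sqrt

    out = thetas[0] * X
    T = X
    for theta_k in thetas[1:]:
        T = L_norm @ T           # one more power of L_norm, O(E) per feature column
        out = out + theta_k * T
    return out

A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
X = np.array([[1.0], [2.0], [0.5], [3.0]])
print(poly_filter_conv(A, X, thetas=[0.5, 1.0, -0.25]))   # k = 2, placeholder coefficients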

A Linear Model

Let's limit to $k=1$

$$ \text{Conv} \left( X \right) \approx w \ \left( \ I + D^{-1/2} \ A \ D^{-1/2} \ \right) \ X $$

  • Filter is linear in $A$
  • Also limited to single filter parameter, $\theta \rightarrow w$

The renormalization $\tilde{A} = A+I$ (with $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$), introduced to avoid numerical instabilities from repeated convolutions, gives:

$$ Z = \tilde{D}^{-1/2} \ \tilde{A} \ \tilde{D}^{-1/2} \ X \ W $$

Full propagation rule at layer $l$, with activation function $\sigma(\cdot)$:

$$ H^{(l+1)} = \sigma \left( \tilde{D}^{-1/2} \ \tilde{A} \ \tilde{D}^{-1/2} \ H^{(l)} \ W^{(l)} \right) $$

  • Linear in $A$ → stack $k$ convolutions to probe neighborhood $k$ nodes away
  • Complexity is $\mathcal{O}( \, E \ C \ k \, )$
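
To make the layer concrete, here is a minimal numpy sketch (my own, not the authors' reference code) of the renormalization trick and a single propagation step; the feature matrix, weights, and tanh activation are placeholder choices.

import numpy as np

def renormalized_adjacency(A):
    """Renormalization trick: A_tilde = A + I, then D_tilde^{-1/2} A_tilde D_tilde^{-1/2}."""
    A_tilde = A + np.eye(A.shape[0])
    d_tilde = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(d_tilde ** -0.5)
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

def gcn_layer(A_hat, H, W, activation=np.tanh):
    """One propagation step: H^{(l+1)} = sigma(A_hat H^{(l)} W^{(l)})."""
    return activation(A_hat @ H @ W)

# Toy example: 4-node graph, C = 2 input features, 3 hidden units (placeholder values)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 2))        # node features, H^{(0)} = X
W0 = rng.normal(size=(2, 3))       # layer weights (learned during training in practice)

A_hat = renormalized_adjacency(A)  # can be precomputed once as a preprocessing step
H1 = gcn_layer(A_hat, X, W0)
print(H1.shape)                    # (4, 3): new 3-dim representation per node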

Summary of Propagation Rule

prop_rule

Image source: Node Classification by Graph Convolutional Network by Graham Ganssle

Demonstration

Semi-supervised classification on karate club network

  • 34 nodes (featureless)
  • 154 edges (unweighted, undirected)
  • Four classes (clustering found by Brandes et al., 2008)
  • For demo, only one instance of each class is labelled!

Model:

  • Three-layer GCN model
  • 300 training iterations

Demonstration: Karate Club Graph

karate_club

Demonstration

Video source: Graph Convolutional Networks by Thomas Kipf

Performance

The authors tested the method on GSD sets with different

  • Numbers of vertices, edges, features, & classes
  • Labelled fraction used for supervised training
  • Variants of layer-wise propagation rule

Performance: Method Comparison on Datasets

baseline_perf

Performance: Propagation Rule


prop_rule_perf

Performance: CPU/GPU

Mean training time over 100 epochs. Factor $\sim 2$ performance gain.

cpu_gpu_compare

Summary

The authors developed a model that

  • encodes graph structure explicitly
    • Embeds $A$ directly: $f(X) \rightarrow f(X,A)$
    • Learns weights to predict node classification via labelled component of dataset
  • linearly scales with edges

    • Can stack $k$ layers to probe $k^{\text{th}}$-scale neighborhood
    • Computationally accessible: $\tilde{D}^{-1/2} \ \tilde{A} \ \tilde{D}^{-1/2}$ can be precomputed as a preprocessing step
  • outperforms other models

Thank you for your attention!


final

Any Questions?

Backups

Performance: Number of Layers

Two to three layers are sufficient

layer_perf

Limitations of Method

  • Memory

    • Full-batch training $\rightarrow$ linear memory requirements
    • Mini-batch training must account for neighborhood depth, potentially requiring further approximations
  • Non-simple graphs

    • Current architecture doesn't include directed, weighted edges
    • Directionality can potentially be represented by converting to an undirected graph with additional nodes
  • Locality

    • Equal weighting given to self-connections & edges to neighbors
    • Potential to incorporate trade-off parameter that can be learned: $\tilde{A} = A+\lambda \, I$

Mathematics of Spectral Graph Theory

For simple graph, the Laplacian is given by $L = D - A$,

$$ L_{ij} = \begin{cases} \text{deg} \, v_i & \text{ if } i=j \\ -1 & \text{ if } i \neq j \text{ and } v_i \text{ adjacent to } v_j \\ 0 & \text{ otherwise} \end{cases} $$


The symmetric, normalized Laplacian $L_{\text{norm}}$ is given by

$$ L_{\text{norm}} = D^{-1/2} \, L \, D^{-1/2} = I - D^{-1/2} \, A \, D^{-1/2} \, , $$

$$ L^{\text{norm}}_{ij} = \begin{cases} 1 &\text{ if } i=j \text{ and deg} \, v_i \neq 0 \\ -\left( \text{deg } v_i \, \text{deg } v_j \right)^{-1/2} & \text{ if } i \neq j \text{ and } v_i \text{ adjacent to } v_j \\ 0 & \text{ otherwise} \end{cases} $$


The resulting spectrum of eigenvalues consists of non-negative real values: $\lambda^{\text{norm}}_i \in \mathbb{R}_{\geq 0} \ \ , \ \forall i$
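
A quick numeric check on the toy graph (my own addition) that the eigenvalues of $L_{\text{norm}}$ are indeed real and non-negative:

import numpy as np

A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
d = A.sum(axis=1)
D_inv_sqrt = np.diag(d ** -0.5)
L_norm = np.eye(4) - D_inv_sqrt @ A @ D_inv_sqrt

eigvals = np.linalg.eigvalsh(L_norm)     # symmetric matrix -> real eigenvalues
print(eigvals)                           # all >= 0
print(np.all(eigvals >= -1e-12))         # True, up to floating-point tolerance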

Mathematics of Expansion

Relation to Weisfeiler-Lehman Algorithm

  • Distills node connectivity information into a minimal representation
  • Replace hash function with $\sigma(\cdot)$
  • Define $c_{ij} = \sqrt{\, d_i \ d_j}$

$$ h_i^{(l+1)} = \sigma \left( \sum_{j \in \mathcal{N}_i \cup \{i\}} \ \frac{1}{c_{ij}} \ h_{j}^{(l)} \ W^{(l)}\right) $$

Interpretation: the propagation rule is a parametrized, differentiable generalization of the 1-dimensional Weisfeiler-Lehman algorithm.