Geometric Deep Learning HT26, Graphs


Flashcards

Basic definitions

4d3d61f89a444dfc8ef08ac6dbc5ffc2

@Define the difference between transductive and inductive tasks on graphs.


  • In a transductive task, training and inference happen on the same fixed graph; only the nodes under consideration (e.g. which nodes are labelled) differ between tasks
  • In an inductive task, the graph can be completely different for each task, so the model must generalise to unseen graphs

de0936941b464c52a5d9d10d784fd444

@State the desired property of functions acting on graphs.


They should be invariant to permutations of the node ordering.

bbb2f80ffe3440b2afd9bcb5352d1ec4

Suppose we have some function $f(\mathbf X, \mathbf A)$ operating on graph nodes and features. @Define what it means for $f$ to be permutation invariant and what it means to be permutation equivariant.


\[f(\mathbf {PX}, \mathbf{PAP}^\top) = f(\mathbf X,\mathbf A)\]

and

\[\mathbf F(\mathbf{PX}, \mathbf{PAP}^\top) = \mathbf{PF}(\mathbf X, \mathbf A)\]

for any permutation matrix $\mathbf P$.
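
A quick numerical check of both properties (a minimal numpy sketch; `f_inv` and `F_eq` are illustrative choices, not from the course):

```python
import numpy as np

def f_inv(X, A):
    # Degree-weighted sum over all nodes: permutation invariant.
    return (A.sum(axis=1, keepdims=True) * X).sum(axis=0)

def F_eq(X, A):
    # One round of neighbour summation: permutation equivariant.
    return A @ X

rng = np.random.default_rng(0)
n, d = 5, 3
X = rng.normal(size=(n, d))
A = rng.integers(0, 2, size=(n, n))
A = np.triu(A, 1) + np.triu(A, 1).T          # random symmetric adjacency
P = np.eye(n)[rng.permutation(n)]            # random permutation matrix

assert np.allclose(f_inv(P @ X, P @ A @ P.T), f_inv(X, A))
assert np.allclose(F_eq(P @ X, P @ A @ P.T), P @ F_eq(X, A))
```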

5fbae686c642471181a7dabfea0ad54e

@State the general blueprint for constructing permutation equivariant functions on graphs, and visualise this procedure.


Define a local function $\phi$ that operates over a node and its neighbourhood, $\phi(\mathbf x _ u, \mathbf X _ {\mathcal N _ u})$. Then a permutation equivariant function $\mathbf F$ can be constructed by applying $\phi$ to every node’s neighbourhood in isolation, i.e.

\[\mathbf F(\mathbf X, \mathbf A) = \begin{bmatrix} \cdots \quad \phi(\mathbf{x} _ 1, \mathbf{X} _ {\mathcal{N} _ 1}) \quad \cdots \\ \cdots \quad\phi(\mathbf{x} _ 2, \mathbf{X} _ {\mathcal{N} _ 2}) \quad \cdots \\ \vdots \\ \cdots \quad\phi(\mathbf{x} _ n, \mathbf{X} _ {\mathcal{N} _ n}) \quad \cdots \end{bmatrix}\]
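
A minimal numpy sketch of this blueprint (the names `equivariant_layer` and `phi` are illustrative; any $\phi$ that treats its second argument as an unordered multiset works):

```python
import numpy as np

def equivariant_layer(X, A, phi):
    # Apply the local function phi(x_u, X_Nu) to every node in isolation;
    # stacking the results gives a permutation equivariant F(X, A).
    rows = []
    for u in range(X.shape[0]):
        neighbours = np.flatnonzero(A[u])        # indices of N_u
        rows.append(phi(X[u], X[neighbours]))
    return np.stack(rows)

# Example phi: concatenate x_u with the mean of its neighbours' features
# (assumes every node has at least one neighbour).
phi = lambda x_u, X_Nu: np.concatenate([x_u, X_Nu.mean(axis=0)])
```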

Graph neural networks

ffccc2c7df13486eb753caca31e4008c

@Define and @Visualise the three “flavours” of aggregating information from neighbouring nodes in a graph neural network.


Convolutional:

\[\mathbf h _ u = \phi\left( \mathbf x _ u, \bigoplus _ {v \in \mathcal N _ u} c _ {uv} \psi(\mathbf x _ v) \right)\]

where $c _ {uv}$ is a constant that depends only on the structure of the graph (i.e. the adjacency matrix), $\psi$ is a (potentially learnable) function that transforms the features of each neighbour $v \in \mathcal N _ u$, $\bigoplus$ is some permutation-invariant aggregation function, and $\phi$ is some (learnable) function that updates these features.

Attentional:

\[\mathbf h _ u = \phi\left( \mathbf x _ u, \bigoplus _ {v \in \mathcal N _ u} a(\mathbf x _ u, \mathbf x _ v) \psi(\mathbf x _ v)\right)\]

where $a$ is a learnable self-attention mechanism that implicitly computes the importance coefficients $\alpha _ {uv} = a(\mathbf x _ u, \mathbf x _ v)$, which are often softmax-normalised across each neighbourhood.

Message-passing:

\[\mathbf h _ u = \phi\left(\mathbf x _ u, \bigoplus _ {v \in \mathcal N _ u} \psi(\mathbf x _ u, \mathbf x _ v) \right)\]

where $\psi$ is a learnable message function, computing the message vector that $v$ sends to $u$.
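
A minimal numpy sketch of all three flavours, using sum or mean for $\bigoplus$, linear maps for $\psi$, and omitting the outer update $\phi$; all names here are illustrative:

```python
import numpy as np

def conv_attn_mp(X, A, W_psi, att, W_msg):
    # X: (n, d) features; A: (n, n) adjacency; W_psi: (d, k); W_msg: (k, 2d).
    # att(x_u, x_v) -> scalar attention logit. Assumes no isolated nodes.
    H_conv, H_attn, H_mp = [], [], []
    for u in range(X.shape[0]):
        nbrs = np.flatnonzero(A[u])
        # Convolutional: fixed structural coefficients c_uv = 1 / |N_u|.
        H_conv.append((X[nbrs] @ W_psi).mean(axis=0))
        # Attentional: learned coefficients a(x_u, x_v), softmax-normalised.
        logits = np.array([att(X[u], X[v]) for v in nbrs])
        alpha = np.exp(logits - logits.max())
        alpha /= alpha.sum()
        H_attn.append((alpha[:, None] * (X[nbrs] @ W_psi)).sum(axis=0))
        # Message-passing: each message depends on both endpoint features.
        H_mp.append(sum(W_msg @ np.concatenate([X[u], X[v]]) for v in nbrs))
    return [np.stack(H) for H in (H_conv, H_attn, H_mp)]
```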

7a12fb4d86d34fe3ad36d477d497f553

@State the containments between the relative expressive power of GNNs implemented via:

  • convolution
  • attention
  • message-passing

\[\text{convolution} \subseteq \text{attention} \subseteq \text{message-passing}\]

d151a8c220444de7bd16e159e4d6a42b

@State how a graph convolution neighbourhood averaging operation could be performed over a graph.


Multiply the feature matrix by the row-normalised adjacency matrix:

\[\mathbf H = \mathbf D^{-1} \mathbf A \mathbf X, \qquad \text{i.e.} \quad \mathbf h _ u = \frac{1}{|\mathcal N _ u|} \sum _ {v \in \mathcal N _ u} \mathbf x _ v\]

where $\mathbf D$ is the diagonal degree matrix. (A symmetric alternative is $\mathbf D^{-1/2} \mathbf A \mathbf D^{-1/2}$.)

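A minimal numpy sketch of the same averaging (assuming no isolated nodes):

```python
import numpy as np

def neighbourhood_average(X, A):
    # Row-normalise the adjacency so each row sums to one, then
    # each node's output is the mean of its neighbours' features.
    D_inv = np.diag(1.0 / A.sum(axis=1))
    return D_inv @ A @ X
```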
4ee5c9379945456fb637d6aefa74cd4c

@Visualise one layer of a graph convolution network with softmax activation.
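
A formula-level sketch, assuming the standard Kipf–Welling formulation with normalised adjacency (self-loops added):

\[\mathbf Y = \operatorname{softmax}\left( \hat{\mathbf A} \mathbf X \mathbf W \right), \qquad \hat{\mathbf A} = \tilde{\mathbf D}^{-1/2} (\mathbf A + \mathbf I) \tilde{\mathbf D}^{-1/2}\]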


219b4c7d9e414f9bb525a2c0426d6a93

@Visualise a two-layer graph convolution network with activation $\sigma$ and final output $\text{softmax}$.
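
A formula-level sketch in the same convention, with $\hat{\mathbf A}$ the normalised adjacency with self-loops:

\[\mathbf Y = \operatorname{softmax}\left( \hat{\mathbf A}\, \sigma\!\left( \hat{\mathbf A} \mathbf X \mathbf W^{(1)} \right) \mathbf W^{(2)} \right)\]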


e2207dd66ae7443db54da61c37d5fc17

@Visualise the scalable inception-like graph network architecture.
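
Assuming this refers to SIGN (Scalable Inception Graph Networks), a formula-level sketch: transform precomputed powers of the adjacency applied to the features in parallel inception-style branches, concatenate, and classify:

\[\mathbf Y = \operatorname{softmax}\left( \sigma\left( \left[ \mathbf X \boldsymbol\Theta _ 0 \,\Vert\, \mathbf{AX} \boldsymbol\Theta _ 1 \,\Vert\, \cdots \,\Vert\, \mathbf A^r \mathbf X \boldsymbol\Theta _ r \right] \right) \boldsymbol\Omega \right)\]

Because the products $\mathbf A^k \mathbf X$ contain no learnable parameters, they can be precomputed once offline, which is what makes the architecture scalable.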


b1200c2eb62f468ebd5b00b5e1483f43

@Visualise the GraphSAGE architecture.
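
A formula-level sketch of one GraphSAGE (sample-and-aggregate) layer: aggregate a sampled neighbourhood, concatenate with the node's own features, then transform:

\[\mathbf h _ u = \sigma\left( \mathbf W \left[ \mathbf x _ u \,\Big\Vert\, \operatorname{AGG}\left( \{ \mathbf x _ v : v \in \mathcal N _ u \} \right) \right] \right)\]

where $\operatorname{AGG}$ is e.g. a mean, max-pooling or LSTM aggregator, and $\mathcal N _ u$ is typically a fixed-size random sample of the full neighbourhood for scalability.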


Expressivity

1b7d4207d1e7444b860fd4c61cf85e03

@Define what it means for two node-attributed graphs $G = (V, E, \mathbf X)$ and $G' = (V', E', \mathbf X')$ to be isomorphic.


There exists a bijection $\phi : V \to V'$ such that you have

  • Structure preservation: $u \sim v$ in $G$ iff $\phi(u) \sim \phi(v)$ in $G'$
  • Feature preservation: $\mathbf x _ u = \mathbf x' _ {\phi(u)}$

(i.e. they are isomorphic as graphs and features are preserved).

515e0b0f94334c9d8399ceabc7a0af73

@State a theorem giving a condition for a class of functions to be able to universally approximate any permutation invariant function on graphs.


A class of functions can universally approximate any permutation-invariant function on graphs with finite node features iff it can discriminate between non-isomorphic graphs.

4dedfb35d72f4d5485d62598624c6066

@Visualise the containment between the expressive power of functions that can be computed by message-passing neural networks and all permutation-invariant functions on graphs.


Weisfeiler-Lehman test

7b08daf8a1b9473a9d07d02f65ba79f1

What is the Weisfeiler-Lehman test at a high level?


A necessary but insufficient condition for graph isomorphism, based on iterative colour refinement.
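
A minimal sketch of one-dimensional WL colour refinement in Python (`adj` as an adjacency-list dict is an illustrative representation):

```python
def wl_colours(adj, iters=None):
    """1-WL colour refinement on a graph given as {node: [neighbours]}."""
    colours = {u: 0 for u in adj}        # uniform start (or node features)
    for _ in range(iters or len(adj)):
        # New colour = own colour plus the multiset of neighbour colours.
        sigs = {
            u: (colours[u], tuple(sorted(colours[v] for v in adj[u])))
            for u in adj
        }
        palette = {s: i for i, s in enumerate(sorted(set(sigs.values())))}
        new = {u: palette[sigs[u]] for u in adj}
        if new == colours:               # stable colouring reached
            break
        colours = new
    return colours

# Two graphs can be isomorphic only if their colour histograms match;
# matching histograms do NOT prove isomorphism.
```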

488766392fb44809ad458b14d64bebc6

@Visualise the relative expressive powers of functions that can be computed by message-passing graph neural networks, those that can be computed by WL-test style functions, those that depend on the counts of certain substructures, and all permutation invariant functions.


c3675b54999c426c85c10ae4e1819bb5

@State a theorem about the expressive power of a graph isomorphism network.


Suppose:

  • Graph node features come from a discrete countable set (often not true in practice)
  • Both the aggregator $\bigoplus$ and the update function $\phi$ are injective
  • The graph-level readout function is injective

Then this MPNN is as powerful as the WL test.
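
For reference, the graph isomorphism network (GIN) layer that realises these conditions uses a sum aggregator and an MLP update:

\[\mathbf h _ u = \operatorname{MLP}\left( (1 + \epsilon)\, \mathbf x _ u + \sum _ {v \in \mathcal N _ u} \mathbf x _ v \right)\]

where $\epsilon$ is a fixed or learnable scalar.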

c5ffb6962e7e4c2aa55e9ae7c6f26d43

@State a theorem about the number of aggregators needed to distinguish between real multisets of size $n$.


At least $n$ aggregators are needed.

81b24b214d08437e8abdc3ee612df08f

@Define principled neighbourhood aggregation.


An aggregation scheme combining several aggregators (mean, standard deviation, max and min of the neighbour features) with degree-based scalers:

\[\bigoplus = \begin{bmatrix} I \\ S(D, \alpha = 1) \\ S(D, \alpha = -1) \end{bmatrix} \otimes \begin{bmatrix} \mu \\ \sigma \\ \max \\ \min \end{bmatrix}\]

where we define the scalers $S(d, \alpha)$ as

\[S(d, \alpha) = \left( \frac{\log(d+1)}{\delta} \right)^\alpha\]

with $\delta$ the average of $\log(d+1)$ over the degrees seen in the training set.

dbb2b7a46f064a059b316ea3d6a4f4dc

What are $k$-WL tests at a high level?


Colour refinement over $k$-tuples of nodes, where two $k$-tuples are adjacent if they differ in exactly one node.

9750d04a0a194bd2961498f0bd567a84

What is the idea behind random node features and random message-passing neural networks (rMPNNs)?


Attach a random feature to every node of the graph, then apply an MPNN. This breaks symmetries that would otherwise confuse the WL test.

0ff0cdc681ec4bc4bfcfaa6b725e0c17

What is the idea behind graph substructure networks (GSNs)?


Choose a bank of substructures containing graphs $H$ of size $k$, and count the occurrences of each $H$ at every node or edge of the input graph; these counts are then used as additional structural features.
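
For example, per-node triangle counts (one common substructure choice) can be read off powers of the adjacency matrix; a minimal numpy sketch:

```python
import numpy as np

def triangle_counts(A):
    # For a simple undirected 0/1 adjacency, (A^3)[u, u] counts closed
    # walks of length 3 from u; each triangle through u is counted twice
    # (once per direction), so halve the diagonal.
    return np.diag(np.linalg.matrix_power(A, 3)) // 2
```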

9b547048a9144eed9418788d351343e8

@Visualise the class of graphs that GSNs can distinguish compared to $k$-WL tests.


ca8c6c1bd7f24a39b8a9b70d7faaaec1

What is the idea behind subgraph GNNs?


Create a collection of subgraphs of the original graph by deleting edges. Then run a permutation-invariant network over the features generated by GNNs applied to each subgraph. This can distinguish graphs the WL-test cannot.

Transformers

976290ba2cb84d68b03ff66f730b3565

How can you interpret transformers as a type of GNN?


Their architecture is equivalent to an attentional GNN operating on a complete graph, with positional encodings providing the structural information.

Over-squashing and bottlenecks

  • Sometimes properties of the graph mean that long-range interactions are not modelled well
  • Capacity bounds

Exotic message passing

  • Some more unusual ways of message passing


