Neural Network Cheatsheet

A comprehensive guide to the fundamental concepts, architectures, and theoretical foundations of Neural Networks and Deep Learning.


1. Activation Functions

Activation functions introduce non-linearity into the network, enabling it to learn complex patterns. Without them, a multi-layer neural network would collapse to a single affine transformation, no matter how many layers it has.
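As a concrete illustration, the most common activations can be sketched in a few lines of NumPy (the GELU below uses the standard tanh approximation; all names are illustrative):

```python
import numpy as np

def relu(x):
    # max(0, x), applied element-wise
    return np.maximum(0.0, x)

def sigmoid(x):
    # squashes inputs into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # squashes inputs into (-1, 1)
    return np.tanh(x)

def gelu(x):
    # tanh approximation of GELU (Hendrycks & Gimpel, 2016)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x = np.array([-2.0, 0.0, 2.0])
print(relu(x))  # [0. 0. 2.]
```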

Common Functions


2. MLP (Multi-Layer Perceptron)

Architecture

An MLP consists of an input layer, one or more hidden layers, and an output layer. \(h^{(l)} = \phi(W^{(l)}h^{(l-1)} + b^{(l)})\)
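A minimal sketch of the forward pass $h^{(l)} = \phi(W^{(l)}h^{(l-1)} + b^{(l)})$, assuming a hypothetical 3-4-2 network with tanh hidden activations and a linear output layer:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_forward(x, weights, biases, phi=np.tanh):
    """Forward pass h_l = phi(W_l h_{l-1} + b_l); identity on the output layer."""
    h = x
    for l, (W, b) in enumerate(zip(weights, biases)):
        z = W @ h + b
        h = phi(z) if l < len(weights) - 1 else z  # no nonlinearity on output
    return h

# hypothetical 3-4-2 network
sizes = [3, 4, 2]
Ws = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]
y = mlp_forward(rng.standard_normal(3), Ws, bs)
```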

Theoretical Background: Universal Approximation

Regularization: Dropout (Srivastava et al., 2014)

A technique to prevent overfitting by randomly dropping units during training.
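A minimal sketch of inverted dropout (the variant used in most modern frameworks), where surviving units are rescaled by $1/(1-p)$ so that inference needs no correction:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p=0.5, training=True):
    """Inverted dropout: zero units with probability p, rescale survivors
    by 1/(1-p) so the expected activation is unchanged; identity at inference."""
    if not training or p == 0.0:
        return h
    mask = (rng.random(h.shape) >= p).astype(h.dtype)
    return h * mask / (1.0 - p)

h = np.ones(10_000)
out = dropout(h, p=0.5)
# E[out] == 1, so the mean of out stays near 1 during training
```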



3. Normalization Layers

Batch Normalization (BatchNorm)

Standard in CNNs. Normalizes activations across the mini-batch dimension. \(\hat{x}_{i} = \frac{x_{i} - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}}\) \(y_{i} = \gamma \hat{x}_{i} + \beta\)
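The two equations above can be transcribed directly in NumPy (training-mode statistics only; the running averages used at inference are omitted):

```python
import numpy as np

def batchnorm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch axis, then scale/shift.
    x: (batch, features)."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 4)) * 3.0 + 5.0   # shifted, scaled features
y = batchnorm(x, gamma=np.ones(4), beta=np.zeros(4))
# each column of y now has mean ~0 and variance ~1
```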

Layer Normalization (LayerNorm)

Standard in NLP/Transformers. Normalizes across the feature dimension for a single sample. \(\mu_L = \frac{1}{H} \sum_{j=1}^H x_{ij}, \quad \sigma_L^2 = \frac{1}{H} \sum_{j=1}^H (x_{ij} - \mu_L)^2\) \(\hat{x}_{ij} = \frac{x_{ij} - \mu_L}{\sqrt{\sigma_L^2 + \epsilon}}\)

RMS Normalization (RMSNorm)

Simplified LayerNorm used in modern LLMs (e.g., Llama, Gopher). \(\bar{x}_i = \frac{x_i}{\sqrt{\frac{1}{H} \sum_{j=1}^H x_j^2}} \cdot \gamma_i\)
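The two normalizers differ only in whether they center the mean; a side-by-side sketch (gamma and beta broadcast over the feature axis; names are illustrative):

```python
import numpy as np

def layernorm(x, gamma, beta, eps=1e-5):
    # normalize across the feature axis of a single sample
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def rmsnorm(x, gamma, eps=1e-6):
    # LayerNorm without mean-centering or bias: divide by the RMS only
    rms = np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)
    return gamma * x / rms

x = np.array([1.0, 2.0, 3.0, 4.0])
y = rmsnorm(x, gamma=np.ones(4))
# RMS(y) == 1 (up to eps), but y is not zero-mean — unlike LayerNorm output
```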


4. CNN (Convolutional Neural Networks)

Architecture

Theoretical Background: Equivariance & Inductive Bias


5. ResNet (Residual Networks)

Architecture

Introduced to allow training of extremely deep networks (hundreds/thousands of layers) by learning residual functions.

Mathematical Theory: The Gradient Highway

ResNet addresses the vanishing-gradient and degradation problems structurally, not merely through careful initialization.

  1. Forward Propagation: Unrolling the recursion $x_{l+1} = x_l + \mathcal{F}(x_l, \mathcal{W}_l)$ expresses any deeper feature $x_L$ as the shallower feature $x_l$ plus a sum of residuals: \(x_L = x_l + \sum_{i=l}^{L-1} \mathcal{F}(x_i, \mathcal{W}_i)\) This contrasts with plain networks, where (ignoring nonlinearities) $x_L = \left(\prod_{i=l}^{L-1} W_i\right) x_l$ and the multiplicative dynamics cause signals to vanish or explode.

  2. Backward Propagation (Gradient Flow): Consider the gradient of the loss $\mathcal{L}$ with respect to the input of the $l$-th layer $x_l$. Using the chain rule: \(\frac{\partial \mathcal{L}}{\partial x_l} = \frac{\partial \mathcal{L}}{\partial x_L} \cdot \frac{\partial x_L}{\partial x_l}\) From the additive formulation above, the term $\frac{\partial x_L}{\partial x_l}$ becomes: \(\frac{\partial x_L}{\partial x_l} = \frac{\partial}{\partial x_l} \left( x_l + \sum_{i=l}^{L-1} \mathcal{F}(x_i, \mathcal{W}_i) \right) = I + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} \mathcal{F}(x_i, \mathcal{W}_i)\) Combining these: \(\frac{\partial \mathcal{L}}{\partial x_l} = \underbrace{\frac{\partial \mathcal{L}}{\partial x_L}}_{\text{Direct Term}} + \underbrace{\frac{\partial \mathcal{L}}{\partial x_L} \left( \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} \mathcal{F}(x_i, \mathcal{W}_i) \right)}_{\text{Residual Term}}\)

  3. Implication:

    • The “Direct Term” ($\frac{\partial \mathcal{L}}{\partial x_L}$) propagates information directly from the loss to any layer $l$ without passing through the weights of the intermediate layers. The gradient flow is theoretically unimpeded (it does not vanish even if weights are small).
    • This structure acts as a Gradient Highway, allowing gradients to flow effortlessly to earlier layers.
    • Even if the weights in $\mathcal{F}$ are initialized to be small, the total gradient is dominated by the identity term $I$, ensuring learning can start effectively.
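The contrast between additive and multiplicative dynamics can be checked numerically. In this sketch (linear residual branches, illustrative sizes), 50 layers of tiny weights annihilate the signal in a plain stack but leave it intact when skip connections are present:

```python
import numpy as np

def residual_stack(x0, weights):
    """x_{l+1} = x_l + W_l x_l: a linear residual branch, for illustration."""
    x = x0
    for W in weights:
        x = x + W @ x
    return x

d, L = 4, 50
# Tiny weights: skip connections keep the signal (and the identity term of
# the gradient) alive, while a plain product of the same matrices collapses.
weights = [np.full((d, d), 1e-3) for _ in range(L)]
x0 = np.ones(d)
x_res = residual_stack(x0, weights)

x_plain = x0.copy()
for W in weights:
    x_plain = W @ x_plain          # no skip: purely multiplicative dynamics
# norm(x_plain) has underflowed toward 0; norm(x_res) remains O(1)
```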

6. Neural ODEs

Concept

Viewing each residual block $x_{l+1} = x_l + \mathcal{F}(x_l)$ as one Euler step and taking the depth limit $L \to \infty$ (step size $\Delta t \to 0$) yields a continuous-depth model defined by an ODE. \(\frac{dz(t)}{dt} = f(z(t), t, \theta)\) The output is the solution to the IVP (Initial Value Problem) at time $T$: \(z(T) = z(0) + \int_0^T f(z(t), t, \theta) dt\)
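The ResNet correspondence can be made concrete with a fixed-step Euler solver, whose update is exactly a residual block; the sketch below integrates $dz/dt = -z$, whose exact solution at $t=1$ is $e^{-1}$:

```python
import math

def euler_solve(f, z0, t0, t1, steps):
    """Fixed-step Euler integration: z_{k+1} = z_k + dt * f(z_k, t_k).
    Each step has exactly the ResNet form x_{l+1} = x_l + F(x_l)."""
    z, t = z0, t0
    dt = (t1 - t0) / steps
    for _ in range(steps):
        z = z + dt * f(z, t)
        t += dt
    return z

# dz/dt = -z with z(0) = 1 has the exact solution z(1) = e^{-1} ≈ 0.3679
z1 = euler_solve(lambda z, t: -z, 1.0, 0.0, 1.0, steps=1000)
```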

Theoretical Background: Adjoint Sensitivity Method

Standard backpropagation would require storing all intermediate activations, leading to high memory cost. Neural ODEs use the Adjoint Sensitivity Method to compute gradients with constant memory cost $O(1)$.


7. RNN, LSTM, GRU & Seq2Seq

RNN (Recurrent Neural Networks)

Processes sequence data $x_1, \dots, x_T$ by maintaining a hidden state $h_t$ that acts as memory.
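A minimal sketch of the vanilla (Elman) RNN update $h_t = \tanh(W_{xh}x_t + W_{hh}h_{t-1} + b)$, run over a short random sequence (all sizes are illustrative):

```python
import numpy as np

def rnn_step(x_t, h_prev, Wxh, Whh, bh):
    """Vanilla RNN update: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b)."""
    return np.tanh(Wxh @ x_t + Whh @ h_prev + bh)

rng = np.random.default_rng(0)
d_in, d_h = 3, 5
Wxh = rng.standard_normal((d_h, d_in)) * 0.1
Whh = rng.standard_normal((d_h, d_h)) * 0.1
bh = np.zeros(d_h)

h = np.zeros(d_h)                             # initial hidden state
for x_t in rng.standard_normal((7, d_in)):    # a length-7 input sequence
    h = rnn_step(x_t, h, Wxh, Whh, bh)        # h carries memory forward
```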

LSTM (Long Short-Term Memory)

Designed to mitigate the vanishing gradient problem by introducing a Cell State $C_t$ that is updated additively through gates, letting gradients flow largely unattenuated across time steps.

GRU (Gated Recurrent Unit)

A simplified variant of LSTM that merges cell and hidden states.

Seq2Seq (Sequence-to-Sequence) / Encoder-Decoder

Framework for mapping sequence $X$ to sequence $Y$ (e.g., Translation).


8. Attention & Transformer

Scaled Dot-Product Attention

The core mechanism that maps a query and a set of key-value pairs to an output. \(\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\)
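A direct NumPy transcription of the formula (single head, no masking; shapes are illustrative):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    A = softmax(scores, axis=-1)   # each query's weights sum to 1
    return A @ V, A

rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 8))    # 2 queries,  d_k = 8
K = rng.standard_normal((5, 8))    # 5 keys,     d_k = 8
V = rng.standard_normal((5, 4))    # 5 values,   d_v = 4
out, A = attention(Q, K, V)        # out: (2, 4), A: (2, 5)
```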

Multi-Head Attention (MHA)

Instead of a single attention function, we project queries, keys, and values $h$ times with different, learned linear projections to $d_k, d_k, d_v$ dimensions. \(\text{Head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)\) \(\text{MultiHead}(Q, K, V) = \text{Concat}(\text{Head}_1, \dots, \text{Head}_h)W^O\)

Transformer Block Architecture

A Transformer is composed of a stack of identical layers. Each layer has two sub-layers:

  1. Multi-Head Self-Attention: \(y_1 = \text{LayerNorm}(x + \text{MultiHead}(x, x, x))\)
  2. Position-wise Feed-Forward Network (FFN): Applied to each position separately and identically. Usually two linear transformations with a ReLU/GELU activation in between. \(\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2\) \(y_2 = \text{LayerNorm}(y_1 + \text{FFN}(y_1))\)
    • Residual Connections & LayerNorm: Crucial for training deep transformers. $\text{LayerNorm}(x + \text{Sublayer}(x))$.
    • Positional Encoding: Injected at the input (via addition) to provide order information, since Attention is permutation invariant.
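The FFN sub-layer with its residual connection and post-LN wrapping can be sketched as follows (attention omitted for brevity; gamma/beta dropped from LayerNorm; `d_ff` is typically $4 \times d_{model}$):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # per-position normalization across features (gamma=1, beta=0 for brevity)
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def ffn(x, W1, b1, W2, b2):
    """Position-wise FFN: max(0, x W1 + b1) W2 + b2, same weights at every position."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def ffn_sublayer(x, W1, b1, W2, b2):
    # post-LN residual sub-layer: y = LayerNorm(x + FFN(x))
    return layer_norm(x + ffn(x, W1, b1, W2, b2))

rng = np.random.default_rng(0)
seq, d_model, d_ff = 6, 16, 64     # hypothetical sizes
x = rng.standard_normal((seq, d_model))
W1 = rng.standard_normal((d_model, d_ff)) * 0.1
W2 = rng.standard_normal((d_ff, d_model)) * 0.1
y = ffn_sublayer(x, W1, np.zeros(d_ff), W2, np.zeros(d_model))
```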

FlashAttention (IO-Aware Exact Attention)

Standard attention implementations materialize the $O(N^2)$ attention matrix $A = QK^T$ in GPU high-bandwidth memory (HBM), so memory traffic dominates runtime for long sequences. FlashAttention (Dao et al., 2022) tiles $Q$, $K$, $V$ into blocks processed in fast on-chip SRAM and uses an online softmax, computing exact attention without ever storing the full $N \times N$ matrix.


9. GNN (Graph Neural Networks)

Definition & Framework (MPNN)

Operates on data represented as graphs $G=(V, E)$. The core mechanism is Message Passing. \(h_v^{(k)} = \text{UPDATE}^{(k)} \left( h_v^{(k-1)}, \text{AGGREGATE}^{(k)}\left( \{ h_u^{(k-1)} \mid u \in \mathcal{N}(v) \} \right) \right)\)
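A minimal message-passing step on a toy graph, assuming sum AGGREGATE and a single-linear-layer UPDATE (both choices are illustrative, not canonical):

```python
import numpy as np

def message_passing_step(h, edges, update_W):
    """One MPNN step with sum aggregation and a linear UPDATE:
    h_v' = relu(W [h_v ; sum_{u in N(v)} h_u])."""
    agg = np.zeros_like(h)
    for u, v in edges:             # undirected: messages flow both ways
        agg[v] += h[u]
        agg[u] += h[v]
    combined = np.concatenate([h, agg], axis=1)   # (n, 2d)
    return np.maximum(0.0, combined @ update_W)

# triangle graph with 2-dim node features
h = np.eye(3, 2)
edges = [(0, 1), (1, 2), (0, 2)]
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 2)) * 0.5
h_next = message_passing_step(h, edges, W)
```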

Common Architectures

  1. GCN (Graph Convolutional Network) (Kipf & Welling, 2017)
    • Mechanism: Approximates spectral graph convolutions (Chebyshev polynomials) to a simple first-order neighborhood average.
    • Update Rule: \(H^{(l+1)} = \sigma(\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)})\) where $\tilde{A} = A + I$ (self-loops) and $\tilde{D}$ is the degree matrix of $\tilde{A}$.
    • Insight: Effectively performs a weighted average of neighbor features (smoothing), acting as a low-pass filter on graph signals.
  2. GAT (Graph Attention Network) (Veličković et al., 2018)
    • Mechanism: Assigns different importance weights $\alpha_{ij}$ to different neighbors using Self-Attention.
    • Update Rule: \(h_v' = \sigma \left( \sum_{u \in \mathcal{N}(v)} \alpha_{vu} W h_u \right), \quad \alpha_{vu} = \text{softmax}_{u \in \mathcal{N}(v)}\left(\text{LeakyReLU}(a^T [Wh_v \| Wh_u])\right)\) where the softmax normalizes over the neighborhood of $v$.
    • Insight: Allows the model to focus on relevant neighbors, improving expressivity over fixed GCN weights. Supports heterogeneous graphs implicitly.
  3. GIN (Graph Isomorphism Network) (Xu et al., 2019)
    • Mechanism: Uses Sum aggregation instead of Mean/Max to preserve structural information.
    • Update Rule: \(h_v^{(k)} = \text{MLP}^{(k)} \left( (1 + \epsilon^{(k)}) h_v^{(k-1)} + \sum_{u \in \mathcal{N}(v)} h_u^{(k-1)} \right)\)
    • Theory: Proven to be as powerful as the Weisfeiler-Lehman (1-WL) test for graph isomorphism, whereas GCN/GraphSAGE are strictly less powerful.
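The GCN update rule above can be transcribed directly (dense matrices for clarity; real implementations use sparse operations):

```python
import numpy as np

def gcn_layer(A, H, W):
    """H' = sigma(D~^{-1/2} A~ D~^{-1/2} H W) with A~ = A + I (self-loops)."""
    A_tilde = A + np.eye(A.shape[0])
    d = A_tilde.sum(axis=1)                      # degrees of A~
    D_inv_sqrt = np.diag(d ** -0.5)
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt    # symmetric normalization
    return np.maximum(0.0, A_hat @ H @ W)        # ReLU as sigma

# 4-node path graph, 3-dim features, 2-dim output
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
H = rng.standard_normal((4, 3))
W = rng.standard_normal((3, 2))
H1 = gcn_layer(A, H, W)
```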

Challenges

Inference Examples

  1. Node Classification: Classifying documents in a citation network or fraud detection in transaction graphs.
    • Input: Features of all nodes + Partial labels. Output: Labels for unlabeled nodes.
  2. Graph Classification: Predicting molecular properties (toxicity, solubility).
    • Mechanism: Use a Readout Function (Global Pooling) like Sum or Mean over all node embeddings to get a graph vector $h_G$.
  3. Link Prediction: Recommender systems (User-Item Bipartite Graph).
    • Mechanism: Predict if an edge exists between user $u$ and item $i$ by computing score $s = h_u^T h_i$.
  4. Heterogeneous/Multi-Graph Inference: Knowledge Graphs with multiple edge types.
    • Mechanism: Use Relational GCN (R-GCN) where weights $W_r$ are specific to edge type $r$.

10. Optimal Transport (OT)

Concept

Provides a geometrically meaningful distance between probability distributions, defined via the minimum “cost” of transporting the mass of one distribution onto the other.

Theoretical Definitions


11. KAN (Kolmogorov-Arnold Networks)

Theoretical Foundation: Kolmogorov-Arnold Representation Theorem

The theorem (1957) states that any multivariate continuous function $f(x_1, \dots, x_n)$ on a bounded domain can be represented as: \(f(x_1, \dots, x_n) = \sum_{q=0}^{2n} \Phi_q \left( \sum_{p=1}^n \psi_{p,q}(x_p) \right)\) where $\Phi_q$ and $\psi_{p,q}$ are continuous single-variable functions.

KAN Architecture (Liu et al., 2024)

Inspired by this theorem, KANs differ fundamentally from MLPs.

Key Differences & Mechanism

  1. Edges as Functions: In KANs, the “activation functions” are placed on the edges (arrows) of the graph, not the nodes. \(x^{(l+1)}_j = \sum_{i=1}^{N_l} \phi_{l, j, i}(x^{(l)}_i)\) where $\phi$ is a learnable 1D function parametrized by B-splines.
  2. No Linear Weights: There are no traditional weight matrices $W$. The “weights” are the coefficients of the spline basis functions.
  3. Interpretability: KANs can often be symbolically pruned to reveal the exact mathematical formula governing the data (e.g., discovering $f(x,y) = \sin(x) + y^2$).
  4. Scaling Laws: KANs have shown faster scaling properties (Accuracy vs Parameters) than MLPs for certain scientific / PDE solving tasks.
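A toy KAN layer can illustrate the edges-as-functions idea. Here, piecewise-linear interpolation over a fixed knot grid stands in for the learnable B-splines of the original paper (a deliberate simplification; all names and sizes are illustrative):

```python
import numpy as np

class KANLayer:
    """One KAN layer with a learnable 1D function on every edge.
    Simplification: piecewise-linear interpolation over fixed knots
    replaces the B-splines of Liu et al. (2024)."""
    def __init__(self, n_in, n_out, n_knots=8, rng=None):
        if rng is None:
            rng = np.random.default_rng(0)
        self.knots = np.linspace(-2.0, 2.0, n_knots)           # shared grid
        # one vector of knot values per edge (j, i): the "weights" of a KAN
        self.values = rng.standard_normal((n_out, n_in, n_knots)) * 0.1

    def __call__(self, x):
        # x_j^{(l+1)} = sum_i phi_{j,i}(x_i^{(l)})
        out = np.zeros(self.values.shape[0])
        for j in range(self.values.shape[0]):
            for i, xi in enumerate(x):
                out[j] += np.interp(xi, self.knots, self.values[j, i])
        return out

layer = KANLayer(n_in=3, n_out=2)
y = layer(np.array([0.5, -1.0, 1.5]))
```

Training would adjust `self.values` (the spline coefficients) by gradient descent; no separate weight matrix exists.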