These are notes on the S4 paper titled: Efficiently Modeling Long Sequences with Structured State Spaces by Albert Gu, Karan Goel, and Christopher Ré. S4 is the direct predecessor of the famous Mamba state space model. It is a good starting point for learning about state space models for deep learning in general, and complements my post: "Paper Notes - State Space Models for Deep Learning".
I also conducted an experimental study comparing Transformer-based and State Space Model-based (specifically, Mamba) LLM architectures as the input sequence length grows with the number of chunks retrieved via Retrieval-Augmented Generation (RAG).
Efficiently Modeling Long Sequences with Structured State Spaces: S4
- The central problem in sequence modeling is efficiently handling data that contains long-range dependencies (LRDs).
Background: State Spaces
2.1 SSMs: A Continuous-time Latent State Model
- The SSM is defined by the simple equation (1): $x'(t) = A x(t) + B u(t)$, $\; y(t) = C x(t) + D u(t)$. It maps a 1-D input signal $u(t)$ to an $N$-D latent state $x(t)$ before projecting to a 1-D output signal $y(t)$.
- It is related to latent state models such as Hidden Markov Models.
- The goal is to use the SSM as a black-box representation in a deep sequence model, where $A$, $B$, $C$, $D$ are parameters learned by gradient descent. (The paper omits $D$, since the term $D u(t)$ amounts to a skip connection.)
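To make the continuous-time view concrete, here is a minimal NumPy sketch (mine, not code from the paper) that simulates equation (1) with a crude forward-Euler step; the function name and the random parameter choices are illustrative only.

```python
import numpy as np

def simulate_continuous_ssm(A, B, C, D, u, dt):
    """Crude forward-Euler simulation of x'(t) = A x(t) + B u(t), y(t) = C x(t) + D u(t).
    A: (N, N), B: (N,), C: (N,), D: scalar, u: 1-D input samples spaced dt apart."""
    x = np.zeros(A.shape[0])
    ys = []
    for u_t in u:
        x = x + dt * (A @ x + B * u_t)   # Euler update of the N-D latent state
        ys.append(C @ x + D * u_t)       # project back down to a 1-D output
    return np.array(ys)

# In a deep SSM layer, A, B, C, D would be learned by gradient descent.
rng = np.random.default_rng(0)
N = 4
A = rng.normal(size=(N, N)) - 2.0 * np.eye(N)   # shift eigenvalues left to (roughly) stabilize
B, C, D = rng.normal(size=N), rng.normal(size=N), 0.0
u = np.sin(np.linspace(0.0, 2.0 * np.pi, 200))  # a toy 1-D input signal
y = simulate_continuous_ssm(A, B, C, D, u, dt=0.01)
```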
Addressing Long-Range Dependencies with HiPPO
- HiPPO theory of continuous-time memorization.
- HiPPO specifies a class of certain matrices $A \in \mathbb{R}^{N \times N}$ that, when incorporated into (1), allow the state $x(t)$ to memorize the history of the input $u(t)$.
- The most important matrix in this class is defined by equation (2), which we call the HiPPO matrix:

$$A_{nk} = -\begin{cases} \sqrt{(2n+1)(2k+1)} & \text{if } n > k \\ n+1 & \text{if } n = k \\ 0 & \text{if } n < k \end{cases}$$
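A small NumPy sketch (my own, purely illustrative) that builds this matrix directly from equation (2):

```python
import numpy as np

def hippo_legs_matrix(N):
    """HiPPO matrix from equation (2):
    A[n, k] = -sqrt((2n+1)(2k+1)) if n > k, -(n+1) if n == k, 0 if n < k."""
    A = np.zeros((N, N))
    for n in range(N):
        for k in range(N):
            if n > k:
                A[n, k] = -np.sqrt((2 * n + 1) * (2 * k + 1))
            elif n == k:
                A[n, k] = -(n + 1.0)
    return A

print(hippo_legs_matrix(4))   # small example; note the lower-triangular structure
```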
Discrete-time SSM: The Recurrent Representation
- To apply the model on a discrete input sequence $(u_0, u_1, \dots)$ instead of a continuous function $u(t)$, (1) must be discretized by a step size $\Delta$ that represents the resolution of the input. Conceptually, the inputs $u_k$ can be viewed as sampling an implicit underlying continuous signal $u(t)$, where $u_k = u(k\Delta)$.
- To discretize the continuous-time SSM, we follow prior work using the bilinear method, which converts the state matrix $A$ into an approximation $\bar{A}$:

$$\bar{A} = (I - \tfrac{\Delta}{2} A)^{-1}(I + \tfrac{\Delta}{2} A), \qquad \bar{B} = (I - \tfrac{\Delta}{2} A)^{-1} \Delta B, \qquad \bar{C} = C$$

The discrete SSM is then:

$$x_k = \bar{A} x_{k-1} + \bar{B} u_k, \qquad y_k = \bar{C} x_k \qquad (3)$$
Equation (3) is now a sequence-to-sequence map $u_k \mapsto y_k$ instead of a function-to-function map. Moreover, the state equation is now a recurrence in $x_k$, allowing the discrete SSM to be computed like an RNN. Concretely, $x_k \in \mathbb{R}^N$ can be viewed as a hidden state with transition matrix $\bar{A}$.
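A minimal NumPy sketch of this discretization and the resulting RNN-style recurrence (my own illustration; it assumes $B$ and $C$ are stored as length-$N$ vectors and omits $D$):

```python
import numpy as np

def discretize_bilinear(A, B, C, step):
    """Bilinear (Tustin) discretization of the continuous SSM (A, B, C) with step size Δ."""
    I = np.eye(A.shape[0])
    BL = np.linalg.inv(I - (step / 2.0) * A)   # (I - Δ/2 A)^{-1}
    Ab = BL @ (I + (step / 2.0) * A)           # A-bar
    Bb = step * (BL @ B)                       # B-bar
    return Ab, Bb, C                           # C-bar = C

def run_recurrence(Ab, Bb, Cb, u):
    """Discrete SSM (3): x_k = A-bar x_{k-1} + B-bar u_k,  y_k = C-bar x_k."""
    x = np.zeros(Ab.shape[0])
    ys = []
    for u_k in u:
        x = Ab @ x + Bb * u_k
        ys.append(Cb @ x)
    return np.array(ys)
```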
2.4 Training SSMs: Convolutional Representation
- The recurrent SSM (3) is not practical for training on modern hardware due to its sequential nature. Instead, there is a well-known connection between linear time-invariant (LTI) SSMs such as (1) and continuous convolutions. Correspondingly, (3) can be written as a discrete convolution.
- Let the initial state be $x_{-1} = 0$. Then unrolling (3) yields:

$$x_0 = \bar{B} u_0, \quad x_1 = \bar{A}\bar{B} u_0 + \bar{B} u_1, \quad x_2 = \bar{A}^2\bar{B} u_0 + \bar{A}\bar{B} u_1 + \bar{B} u_2, \quad \dots$$

$$y_0 = \bar{C}\bar{B} u_0, \quad y_1 = \bar{C}\bar{A}\bar{B} u_0 + \bar{C}\bar{B} u_1, \quad y_2 = \bar{C}\bar{A}^2\bar{B} u_0 + \bar{C}\bar{A}\bar{B} u_1 + \bar{C}\bar{B} u_2, \quad \dots$$

This can be vectorized into a convolution (4) with an explicit formula for the convolution kernel:

$$y = \bar{K} * u, \qquad \bar{K} \in \mathbb{R}^L := \left(\bar{C}\bar{B},\; \bar{C}\bar{A}\bar{B},\; \dots,\; \bar{C}\bar{A}^{L-1}\bar{B}\right)$$
In other words, equation (4) is a single non-circular convolution and can be computed very efficiently with FFTs, provided that $\bar{K}$ is known (i.e., pre-computed beforehand).
- However, computing $\bar{K}$ is non-trivial and is the focus of the technical contributions of S4. $\bar{K}$ is referred to as the SSM convolution kernel or filter.
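For intuition, here is a NumPy sketch (my own) of the naive kernel computation and the FFT-based causal convolution; the function names are illustrative, and the naive loop is exactly the cost that S4's algorithm later avoids.

```python
import numpy as np

def ssm_kernel_naive(Ab, Bb, Cb, L):
    """Naive computation of K-bar = (C̄B̄, C̄ĀB̄, ..., C̄Ā^{L-1}B̄) by unrolling,
    costing O(N^2 L) time. Avoiding this loop is the point of Section 3."""
    K, x = [], Bb
    for _ in range(L):
        K.append(Cb @ x)
        x = Ab @ x
    return np.array(K)

def causal_conv_fft(K, u):
    """Non-circular (causal) convolution y = K * u via FFT with zero padding."""
    L = len(u)
    Kf = np.fft.rfft(K, n=2 * L)
    uf = np.fft.rfft(u, n=2 * L)
    return np.fft.irfft(Kf * uf, n=2 * L)[:L]
```

Using the `Ab, Bb, Cb` returned by `discretize_bilinear` in the sketch above, `causal_conv_fft(ssm_kernel_naive(Ab, Bb, Cb, len(u)), u)` should match `run_recurrence(Ab, Bb, Cb, u)` up to floating-point error, which is the recurrent/convolutional equivalence being described here.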
Method: Structured State Spaces (S4)
- The technical focus of S4 is on developing the S4 parameterization and showing how to efficiently compute all views of the SSM: the continuous representation, the recurrent representation, and the convolutional representation.
- Section 3.1 is based on conjugation and diagonalization, and discusses why the naïve application does not work. Section 3.2 gives an overview of key technical components of their approach and formally defines the S4 parameterization. Section 3.3 sketches the main results.
3.1 Motivation: Diagonalization
- The bottleneck in computing the discrete-time SSM (3) is that it involves repeated matrix multiplication by $\bar{A}$.
- To overcome this, they use a structural result that allows simplification of the SSMs.
Lemma 3.1 Conjugation is an equivalence relation on SSMs: $(A, B, C) \sim (V^{-1} A V, V^{-1} B, C V)$.
Proof: Write out the two SSMs with states denoted by $x$ and $\tilde{x}$ respectively:

$$x' = A x + B u, \quad y = C x \qquad\qquad \tilde{x}' = V^{-1} A V \tilde{x} + V^{-1} B u, \quad y = C V \tilde{x}$$

After multiplying the right-hand SSM by $V$, the two SSMs become identical with $x = V \tilde{x}$. Therefore they compute the exact same operator $u \mapsto y$, but with a change of basis by $V$ in the state $x$.
Lemma 3.1 motivates putting $A$ into a canonical form by conjugation, which is ideally more structured and allows faster computation. A small numerical check of the lemma is sketched below.
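This is a self-contained sketch (my own, not from the paper) that conjugating $(A, B, C)$ by an invertible $V$ leaves the input-output map unchanged; the two SSMs are compared through their kernel terms $C A^i B$, which is the same algebraic mechanism as in the proof above. All values are random and illustrative.

```python
import numpy as np

# Numerical check of Lemma 3.1: (A, B, C) and (V^{-1} A V, V^{-1} B, C V)
# produce the same kernel terms C A^i B, i.e. the same operator up to a
# change of basis in the state.
rng = np.random.default_rng(0)
N, L = 6, 16
A = 0.5 * rng.normal(size=(N, N))
B = rng.normal(size=N)
C = rng.normal(size=N)
V = rng.normal(size=(N, N))            # any invertible change of basis
Vinv = np.linalg.inv(V)

def kernel_terms(A, B, C, L):
    """Return (C B, C A B, ..., C A^{L-1} B) by naive unrolling."""
    out, x = [], B
    for _ in range(L):
        out.append(C @ x)
        x = A @ x
    return np.array(out)

K1 = kernel_terms(A, B, C, L)
K2 = kernel_terms(Vinv @ A @ V, Vinv @ B, C @ V, L)
print(np.allclose(K1, K2))             # True: only the basis of the state changed
```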
Unfortunately, the naïve application of diagonalization does not work due to numerical issues. They derive the explicit diagonalization for the HiPPO matrix (2) and show that it has entries exponentially large in the state size $N$, rendering the diagonalization numerically infeasible.
Lemma 3.2 The HiPPO matrix $A$ in equation (2) is diagonalized by the matrix $V_{ij} = \binom{i+j}{i-j}$. In particular, $V_{3i,i} = \binom{4i}{2i} \approx 2^{4i}$. Therefore $V$ has entries of magnitude up to $2^{4N/3}$.
3.2 The S4 Parameterization: Normal Plus Low-Rank
- The core innovation of S4 is its parameterization of the state matrix $A$ as the sum of a normal matrix and a low-rank matrix (Normal Plus Low-Rank, NPLR). This is motivated by the observation that, although the HiPPO matrix is highly structured, its diagonalization produces matrices with exponentially large entries, making it numerically infeasible to work with directly.
- The S4 parameterization provides a practical alternative by splitting $A$ into a normal part and a low-rank part:

$$A = V \Lambda V^* - P Q^\top = V\left(\Lambda - (V^* P)(V^* Q)^*\right) V^*$$

for a unitary $V \in \mathbb{C}^{N \times N}$, a diagonal $\Lambda$, and low-rank factors $P, Q \in \mathbb{R}^{N \times r}$.
- The normal part $V \Lambda V^*$ is unitarily diagonalizable and therefore numerically well-behaved, while the low-rank term $P Q^\top$ is a small correction (rank 1 for the HiPPO matrix) that accounts for the difference between $A$ and its normal part.
- The key to this approach is that the low-rank correction can be handled cheaply with the Woodbury identity, while the normal part reduces (after conjugation) to a diagonal matrix whose resolvents are easy to compute. Together, this removes the need to materialize powers of the full matrix $\bar{A}$, enabling efficient computation of the convolution kernel $\bar{K}$.
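As a sanity check on the NPLR claim, here is a short self-contained sketch (my own; the rank-1 factor $P_n = \sqrt{n + 1/2}$ is one standard choice for the HiPPO matrix above) verifying that adding a rank-1 term back to the HiPPO matrix yields a normal matrix:

```python
import numpy as np

# Build the HiPPO matrix from equation (2) and check that A + P P^T is normal,
# i.e. that A itself is normal-plus-low-rank with only a rank-1 correction.
N = 8
n = np.arange(N)
A = -(np.sqrt(np.outer(2 * n + 1, 2 * n + 1)) * (n[:, None] > n[None, :])
      + np.diag(n + 1.0))

P = np.sqrt(n + 0.5)                       # rank-1 factor, P_n = sqrt(n + 1/2)
S = A + np.outer(P, P)                     # candidate normal part

print(np.allclose(S + S.T, -np.eye(N)))    # True: S is -1/2 I plus a skew-symmetric matrix
print(np.allclose(S @ S.T, S.T @ S))       # True: S is a normal matrix
```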
3.3 Efficient Computation of $\bar{K}$
- The convolution kernel $\bar{K}$ in equation (4) is computed by exploiting the structure of $A$. The decomposition of $A$ into normal and low-rank parts yields an algorithm that is far cheaper than the naive approach of explicitly forming the powers $\bar{A}^0, \bar{A}, \dots, \bar{A}^{L-1}$.
- The key steps in the computation are:
  - Instead of computing $\bar{K}$ directly, evaluate its truncated generating function $\sum_j \bar{K}_j z^j$ at the roots of unity; $\bar{K}$ is then recovered with an inverse FFT.
  - The generating function replaces matrix powers with a single matrix inverse (a resolvent); the low-rank term of the NPLR form is handled with the Woodbury identity, reducing the problem to the purely diagonal case.
  - The diagonal case is equivalent to a Cauchy kernel evaluation, for which stable, near-linear-time algorithms exist.
- By combining these techniques, the authors show that $\bar{K}$ can be computed in $\tilde{O}(N + L)$ operations and $O(N + L)$ space, compared with $O(N^2 L)$ operations for naively unrolling the recurrence, where $N$ is the size of the latent state and $L$ is the sequence length.
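The Woodbury step is the easiest part to illustrate in isolation. Below is a minimal, self-contained sketch (my own, with illustrative names) showing that a (diagonal + rank-1) system can be solved with only elementwise work plus a scalar correction, which is the same trick S4 uses to strip the low-rank term off the resolvent:

```python
import numpy as np

# Solve (diag(d) + p q^T) x = b two ways: a dense O(N^3) solve, and the
# Woodbury (Sherman-Morrison) formula, which only needs elementwise division
# by d plus a scalar correction.
rng = np.random.default_rng(0)
N = 8
d = rng.uniform(1.0, 2.0, size=N)          # diagonal part, stored as a vector
p = rng.uniform(0.1, 1.0, size=N)          # rank-1 factors (kept positive so the
q = rng.uniform(0.1, 1.0, size=N)          # scalar correction is well-conditioned)
b = rng.normal(size=N)

x_dense = np.linalg.solve(np.diag(d) + np.outer(p, q), b)

# (D + p q^T)^{-1} b = D^{-1} b - D^{-1} p (q^T D^{-1} b) / (1 + q^T D^{-1} p)
Dinv_b, Dinv_p = b / d, p / d
x_woodbury = Dinv_b - Dinv_p * (q @ Dinv_b) / (1.0 + q @ Dinv_p)

print(np.allclose(x_dense, x_woodbury))    # True
```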
Experiments
- The authors evaluate the performance of S4 on several benchmark tasks, demonstrating that it outperforms previous state-of-the-art methods for sequence modeling, particularly in handling long-range dependencies.
- In particular, S4 performs strongly across the Long Range Arena benchmark (it is the first model reported to solve the difficult Path-X task), on sequential image classification such as sequential CIFAR-10, on raw speech classification, and on WikiText-103 language modeling, where it substantially closes the gap to Transformer baselines.
- S4 also has a notable computational-efficiency advantage: it trains with the parallel convolutional view and generates autoregressively with the recurrent view, giving fast training on very long sequences and generation reported to be roughly 60x faster than a comparable Transformer.
Conclusion
- The S4 method presents a novel approach to sequence modeling by combining structured state spaces with a normal-plus-low-rank parameterization of the state matrix.
- The key contribution of S4 is the efficient computation of the convolution kernel $\bar{K}$, which allows long-range dependencies in sequences to be modeled much more efficiently than with prior methods.
- S4 offers a promising solution to a fundamental challenge in machine learning and time-series analysis, achieving state-of-the-art performance while maintaining computational efficiency.
Summary of the Paper
The paper introduces S4 (Structured State Spaces), a novel approach to efficiently modeling long-range dependencies in sequence data. The central idea of S4 is based on state-space models (SSMs), which map input sequences to latent states and then to output sequences. The paper highlights the challenge of efficiently computing with these models, particularly when handling long-range dependencies in large sequences.
S4 addresses this challenge by using HiPPO (High-order Polynomial Projection Operators) theory to structure the state space so that the latent state memorizes the history of the input, letting the model retain long-term dependencies. It then introduces a practical parameterization of the state transition matrix as a normal matrix plus a low-rank correction. This decomposition significantly reduces the computational cost of the matrix operations and convolution kernel computations.
The authors demonstrate that this approach enables efficient modeling of long-range dependencies with reduced memory and time requirements, making S4 suitable for tasks like time-series prediction and natural language processing. In experimental evaluations, S4 outperforms previous methods such as RNNs, LSTMs, and Transformers on long-range sequence modeling tasks, particularly in terms of computational efficiency and long-range dependency handling.
In conclusion, S4 provides a promising new direction for sequence modeling, efficiently addressing the problem of long-range dependencies without compromising performance. It offers a solid foundation for further research and practical applications in machine learning and data analysis.