Blog - ViR

Brian Pulfer

Research

Projects

Blog

Published: 06.12.2023

Vision Retention Networks

Introduction

Retention is a mechanism recently proposed in Retentive Network: A Successor to Transformer for Large Language Models by Sun et. al. which core idea is to carry out similar computation as attention while being much more computationally efficient. It has now become a recurrent pattern that researchers from other fields of machine learning take inspiration from the progress done in NLP and try to adapt NLP-solutions to a different problem. It was thus only a matter of time before we would have heard of Retention in the field of Computer Vision.

ViR: Vision Retention Networks by Ali Hatamizadeh, Michael Ranzinger, Jan Kautz first applied Retention in a CV model. I recently had a great time re-implementing the paper and digging into Retention, so I thought I would share what I have learned. You can find my re-implementation at brianpulfer/vision-Retention-networks or at brianpulfer/papersreimplementations.

Attention

Before we dig into ViR, we need to learn what Retention is. But before we learn what Retention is, a little recap on Attention.

Attention is a mechanism that allows a model to learn relationships between elements of the input. The meaning of a word can be completely altered based on the surrounding words. A red pixel in an image might come from a tomato, a Ferrari car or a baloon. Only the combination with neighbouring pixels give it a meaning. It is thus important for models to have the ability to learn the interplay of elements in the input sequence. That is where Attention comes in, and this is how it works:

Given an input sequence $X \in \mathbb{R} ^ {N \times D}$ , attention computes Queries, Keys and Values for each element of the sequence as follows:

Q = X W_q

K = X W_k

V = X W_v

Where $W_q, W_k, W_v \in \mathbb{R} ^ {D \times D}$ are learnable parameters. The output for each element of the sequence is going to be a weighted sum of the values, where the weights are computed as the dot product between the query and the keys:

\text{Attention}(Q, K, V) = \text{softmax} \left( \frac{Q K^T}{\sqrt{D}} \right) V

and $\text{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$ is applied row-wise.

This mechanism, ever since the Attention Is All You Need paper, has been empirically proven to be very powerful for learning relationships between elements of a sequence. It has been used in virtually all contexts (NLP, CV, TTS, ...), and it has become a de-facto standard for many tasks.

Then why getting rid of it?

There is only one issue that has researchers a bit troubled: the complexity of attention is $O(N^2)$ (easily seen when computing $QK^T$ ), meaning that for an input sequence twice as long, computing attention takes four times as much time.

Quite some effort went into trying to solve this issue, with various variations like Linear Attention and Efficient Attention trying to replicate the mechanism while being computationally more convenient.

Retention

Retention works recurrently just like recurrent neural networks. At each step, it reads the input to update an inner state matrix, use the inner state to compute an output and pass the inner state onward. Here is the RECURRENT formulation of Retention

\mathbf{s}_n = \alpha \mathbf{s}_{n-1} + \mathbf{k}_n^t \mathbf{v}_n

\text{Retention}(\mathbf{x})=\mathbf{o}_n = \mathbf{q}_n \mathbf{s}_n

where $\mathbf{s}_n$ is the inner state at step $n$ , $\mathbf{k}_n$ is the key and $\mathbf{v}_n$ is the value of the current (n-th) element in the sequence (row vectors, so $\mathbf{s}_n, \mathbf{k_n^Tv_n} \in \mathbb{R}^{D \times D}$ ). Needless to say, $\mathbf{q}_n, \mathbf{k}_n, \mathbf{v}_n$ are linear projections of the n-th sequence element $\mathbf{x}_n$ . Finally, $0 \le \alpha \le 1$ is a constant that exponentially decays older key-values products.

Translating into text these equations, the idea is the following: $\mathbf{s}_n$ will contain the state in the form of all key-value products. The ouput is obtained by fetching the desired value (mixture of values) by using the current query $\mathbf{q}_n$ .

This is literally all there is to Retention! What is so special about it is that it can also be computed using a PARALLEL formulation just like we do for Attention. The formula to compute all outputs at once is the following:

M_{i,j} = \begin{cases} 0, & i < j \\ \alpha^{i-j}, & i \ge j \end{cases}

\text{Retention}(X) = (\frac{QK^T}{\sqrt{D}} \odot M)V

Looks familiar, right? In fact, we do everything exactly as for Attention, except that we do not apply a row-wise softmax function and always apply $M$ , a lower-triangular matrix that simultaneously deals with causal masking (take no contribution from future elements in the sequence) and applies the exponential decay given by $\alpha$ .

The key takeaway here is that if we get rid of the softmax operator we unlock the recurrent formulation, where we can just carry on what we had computed before to compute the next output.

However, processing sequences recurrently sucks! That is exactly the reason why we generally prefer Transformers over RNNs: the Transformer time complexity might be quadratic in sequence length, but at least we can process everything in parallel. With a recurrent formulation, we need to sequentially compute the n-th output before we can compute the n-th + 1 while our GPUs sit quiet.

Then why caring about a recurrent formulation?

The real ✨magic✨ happens when we decide to use a hybrid between parallel and recurrent formulations. In fact, it turns out that we can break the input sequence into multiple chunks, run each chunk in parallel using the parallel formulation, and then aggregate all of the results with a cross-chunk recurrent computation. This means that as soon as the sequence becomes prohibitively long for the parallel formulation (quadratic in $N$ ), we can just split it into chunks of size $C$ and run those parallelly (quadratic in chunk-size $C$ only!) and finally combine the cross-chunk information recurrently (linear in $\frac{N}{C}$ ). The real gain is thus obtained when we have very long sequences.

Here we have the CHUNKWISE RECURRENT formulation of Retention:

Q_{[i]} = Q_{Ci:C(i+1)}, \quad K_{[i]} = K_{Ci:C(i+1)}, \quad V_{[i]} = V_{Ci:C(i+1)}

R_i = K^T_{[i]}(V_{[i]} \odot \Alpha) + \alpha^CR_{i-1}, \quad \Alpha_{ij} = \alpha^{C-i-1}

\text{Retention}(X_{i}) = \underbrace{(Q_{[i]} K^T_{[i]} \odot M)V_{[i]}}_{\text{Inner-Chunk}} + \underbrace{(Q_{[i]}R_{i-1}) \odot \xi}_{\text{Intra-Chunk}}, \quad \xi_{ij} = \alpha^{i+1}

The math looks scary, but really we are just applying the parallel computation for all chunks and, once we have the Inner-Chunk parts, we can merge them using the recurrent formulation.

Comparison of Attention and Retention

Time complexity

Retention can, given the previous state, compute the next token in $O(1)$ time complexity, whereas Attention does not have a previous state and it needs to use all $O(N)$ past keys and queries to predict the next token.

Recurrent formulation

Attention does not need to be formulated recurrently, whereas Retention does. This is perfectly fine for causal decoder transformers, where we don not want current tokens to attend to future tokens anyways. However, in computer vision we mostly use the encoder type of transformer, so it is not completely clear what impact forcing the causal relationship might have in a task where seemingly there is no causal relationship.

Personal observation: Because Retention accumulates all keys and queries, I believe that it is probably not as powerful of a mechanism as Attention. Perhaps this loss of expressivity is not a big deal for text and/or images, especially compared to the gains made in time complexity, but this is still something to keep in mind. It might very well be that Retention fails to become a de-facto standard like other alternatives to Attention before it due to worse performances. What is sure is that Retention enables faster inference and, for very long sequences, even faster training while being quite similar to Attention.

Vision Retention Networks

Vision Retention Networks are a minor yet important variantion from Vision Transformer. I have previously written about how Vision Transformers (ViT) work, but in short, a ViT breaks an image into many distinct non-overlapping patches (typically, 16x16 patches of size 14x14 for images of size 224x224) which are then flattened and treated as a sequence. An encoder transformer is then used to process the sequence without any causal masking and the output is used for down-stream tasks.

The ViT is thus just a stack of encoder blocks, where each block sequentially applies an Attention block and an MLP block. In ViR, we get rid of the Attention block and swap a Retention block in instead.

Personal observation: It must be noted that because Retention works in a recurrent matter by definition, this is a big shift from ViT! While a ViT sees the whole image in one go, a ViR virtually reads the image from left to right from top to bottom. This is potentially a drawback of ViR over ViT, since it might not make sense to introduce causality in images.

Because retention reads the image in sequence, if we want our model to be a classifier, we need to use an output that comes after all tokens have been seen. To do so, we append a learnable [CLS] at the end of the sequence and use the generated output to do classification. Notice than in regular ViT, the CLS token was typically placed at the beginning of the sequence (although for a regular ViT this does not really make a difference).

Implementation

Here is my full re-implementation of a ViR:

1import torch
2import torch.nn as nn
3
4
5class ViRModes:
6    PARALLEL = "parallel"
7    RECURRENT = "recurrent"
8    CHUNKWISE = "chunkwise"
9
10
11class Retention(nn.Module):
12    def __init__(
13        self,
14        embed_dim,
15        max_len,
16        alpha,
17        mode=ViRModes.PARALLEL,
18        chunk_size=20,
19    ):
20        super(Retention, self).__init__()
21        self.dim = embed_dim
22        self.max_len = max_len
23        self.chunk_size = chunk_size
24        self.alpha = alpha
25        self.mode = mode
26
27        # Useful buffers
28        self.register_buffer("dim_sqrt", torch.tensor(embed_dim**0.5))
29
30        indices = torch.arange(max_len).reshape(1, -1)
31        self.register_buffer(
32            "decay_mask",
33            (alpha ** (indices.t() - indices)).tril(),
34        )
35
36        self.register_buffer("causal_mask", torch.ones(max_len, max_len).tril())
37        self.qkv = nn.Linear(embed_dim, embed_dim * 3)
38
39    def forward_parallel(self, x):
40        # Getting queries, keys, values
41        bs, sl, d = x.shape
42        qkv = self.qkv(x)
43        q, k, v = torch.chunk(qkv, 3, dim=-1)
44
45        # Causal and decay masking
46        M = (self.causal_mask[:sl, :sl] * self.decay_mask[:sl, :sl]).repeat(bs, 1, 1)
47
48        # Retention
49        out = (q @ k.transpose(-1, -2) / self.dim_sqrt * M) @ v
50
51        return out
52
53    def forward_recurrent(self, x, state):
54        batch_size, length, dim = x.shape
55
56        all_outputs = []
57        state = torch.zeros(batch_size, dim, dim, device=x.device)
58        for i in range(length):
59            xi = x[:, i]
60            q, k, v = self.qkv(xi).chunk(3, dim=-1)
61
62            state = self.alpha * state + k.unsqueeze(-1) @ v.unsqueeze(1)
63            out = q.unsqueeze(1) @ state / self.dim_sqrt
64            all_outputs.append(out.squeeze())
65
66        x = torch.stack(all_outputs, dim=1)
67        return x
68
69    def forward_chunkwise(self, x, chunk_size=None):
70        # Getting queries, keys, values
71        if chunk_size is None:
72            chunk_size = self.chunk_size
73
74        bs, sl, d = x.shape
75
76        # Adding dummy tokens to make the sequence length divisible by chunk_size
77        if sl % chunk_size != 0:
78            x = torch.cat(
79                [x, torch.zeros(bs, chunk_size - sl % chunk_size, d, device=x.device)],
80                dim=1,
81            )
82        n_chunks = x.shape[1] // chunk_size
83
84        # Running all chunks in parallel
85        x = x.reshape(bs, n_chunks, chunk_size, d)
86        q, k, v = self.qkv(x).chunk(3, dim=-1)
87
88        M = (
89            self.causal_mask[:chunk_size, :chunk_size]
90            * self.decay_mask[:chunk_size, :chunk_size]
91        ).repeat(bs, n_chunks, 1, 1)
92
93        inner_chunk = (q @ k.transpose(-1, -2) / self.dim_sqrt * M) @ v
94
95        # Updating outputs with chunk-wise recurrent
96        retention_mask = (
97            torch.tensor(
98                [self.alpha ** (chunk_size - i - 1) for i in range(chunk_size)],
99                device=x.device,
100            )
101            .repeat(bs, d, 1)
102            .transpose(-1, -2)
103        )
104
105        cross_mask = (
106            torch.tensor(
107                [self.alpha ** (i + 1) for i in range(chunk_size)], device=x.device
108            )
109            .repeat(bs, n_chunks, d, 1)
110            .transpose(-1, -2)
111        )
112
113        states = torch.zeros(bs, n_chunks, d, d, device=x.device)
114        for i in range(1, n_chunks):
115            chunk_state = k[:, i - 1].transpose(-1, -2) @ (v[:, i - 1] * retention_mask)
116            states[:, i] = chunk_state + states[:, i - 1] * self.alpha**chunk_size
117
118        cross_chunk = (q @ states) / self.dim_sqrt * cross_mask
119
120        # Combining inner and cross chunk
121        out = inner_chunk + cross_chunk
122
123        # Removing dummy tokens
124        out = out.flatten(1, 2)[:, :sl]
125        return out
126
127    def forward(self, x, state=None, mode=ViRModes.PARALLEL, chunk_size=None):
128        if mode is None:
129            mode = self.mode
130
131        if mode == ViRModes.PARALLEL:
132            return self.forward_parallel(x)
133        elif mode == ViRModes.RECURRENT:
134            return self.forward_recurrent(x, state)
135        elif mode == ViRModes.CHUNKWISE:
136            return self.forward_chunkwise(x, chunk_size)
137        else:
138            raise ValueError(f"Unknown mode {mode}")
139
140
141class MultiHeadRetention(nn.Module):
142    def __init__(
143        self,
144        heads,
145        embed_dim,
146        max_len,
147        alphas=None,
148        mode=ViRModes.PARALLEL,
149        chunk_size=20,
150    ):
151        super(MultiHeadRetention, self).__init__()
152        self.n_heads = heads
153        self.embed_dim = embed_dim
154        self.max_len = max_len
155        self.alphas = alphas
156        self.head_dim = embed_dim // heads
157        self.mode = mode
158        self.chunk_size = chunk_size
159
160        if alphas is None:
161            alphas = [1 - 2 ** (-5 - i) for i in range(heads)]
162
163        assert len(alphas) == heads, "Number of alphas must match number of heads"
164
165        assert (
166            embed_dim % heads == 0
167        ), "Embedding dimension must be divisible by the number of heads"
168
169        self.heads = nn.ModuleList(
170            [
171                Retention(embed_dim // heads, max_len, alpha, mode, chunk_size)
172                for alpha in alphas
173            ]
174        )
175        self.ln = nn.LayerNorm(embed_dim)
176        self.gelu = nn.GELU()
177        self.linear = nn.Linear(embed_dim, embed_dim)
178
179    def forward(self, x, mode=None, chunk_size=None):
180        if mode is None:
181            mode = self.mode
182
183        if chunk_size is None:
184            chunk_size = self.chunk_size
185
186        out = torch.cat(
187            [
188                head(
189                    x[:, :, i * self.head_dim : (i + 1) * self.head_dim],
190                    mode=mode,
191                    chunk_size=chunk_size,
192                )
193                for i, head in enumerate(self.heads)
194            ],
195            dim=-1,
196        )
197        return self.linear(self.gelu(self.ln(out)))
198
199
200class MLP(nn.Module):
201    def __init__(self, embed_dim, hidden_dim=None):
202        super(MLP, self).__init__()
203
204        if hidden_dim is None:
205            hidden_dim = 4 * embed_dim
206
207        self.linear1 = nn.Linear(embed_dim, hidden_dim)
208        self.linear2 = nn.Linear(hidden_dim, embed_dim)
209        self.gelu = nn.GELU()
210
211    def forward(self, x):
212        return self.linear2(self.gelu(self.linear1(x)))
213
214
215class ViRBlock(nn.Module):
216    def __init__(
217        self,
218        heads,
219        embed_dim,
220        max_len,
221        alphas=None,
222        mode=ViRModes.PARALLEL,
223        chunk_size=20,
224        dropout=0.1,
225    ):
226        super(ViRBlock, self).__init__()
227        self.mode = mode
228        self.chunk_size = chunk_size
229
230        self.ln1 = nn.LayerNorm(embed_dim)
231        self.retention = MultiHeadRetention(
232            heads, embed_dim, max_len, alphas, mode, chunk_size
233        )
234        self.ln2 = nn.LayerNorm(embed_dim)
235        self.mlp = MLP(embed_dim)
236        self.dropout1 = nn.Dropout(dropout)
237        self.dropout2 = nn.Dropout(dropout)
238
239    def forward(self, x, mode=None, chunk_size=None):
240        if mode is None:
241            mode = self.mode
242
243        if chunk_size is None:
244            chunk_size = self.chunk_size
245
246        x = (
247            self.dropout1(self.retention(self.ln1(x), mode=mode, chunk_size=chunk_size))
248            + x
249        )
250        x = self.dropout2(self.mlp(self.ln2(x))) + x
251        return x
252
253
254class ViR(nn.Module):
255    def __init__(
256        self,
257        patch_size=14,
258        depth=12,
259        heads=12,
260        embed_dim=768,
261        max_len=256,
262        alphas=None,
263        mode=ViRModes.CHUNKWISE,
264        chunk_size=256,
265        dropout=0.1,
266    ):
267        super(ViR, self).__init__()
268
269        # Local parameters
270        self.out_dim = 10
271        self.patch_size = patch_size
272        self.depth = depth
273        self.heads = heads
274        self.embed_dim = embed_dim
275        self.max_len = max_len
276        self.alphas = alphas
277        self.mode = mode
278        self.chunk_size = chunk_size
279
280        # Embeddings
281        self.patch_embed = nn.Conv2d(
282            3, embed_dim, (patch_size, patch_size), stride=(patch_size, patch_size)
283        )
284        self.pos_embed = nn.Parameter(torch.randn(1, max_len, embed_dim))
285
286        # ViR blocks
287        self.blocks = nn.ModuleList(
288            [
289                ViRBlock(heads, embed_dim, max_len, alphas, mode, chunk_size, dropout)
290                for _ in range(depth)
291            ]
292        )
293
294        # Head
295        self.ln = nn.LayerNorm(embed_dim)
296
297    def forward(self, x, mode=None, chunk_size=None, reshape=False):
298        if mode is None:
299            mode = self.mode
300
301        if chunk_size is None:
302            chunk_size = self.chunk_size
303
304        # Patch embedding, positional embedding
305        x = self.patch_embed(x).permute(0, 2, 3, 1).flatten(1, 2)
306        bs, sl = x.shape[:2]
307        x = x + self.pos_embed.repeat(bs, 1, 1)[:, :sl]
308
309        # Blocks
310        for block in self.blocks:
311            x = block(x, mode=mode, chunk_size=chunk_size)
312
313        # Layer Norm
314        x = self.ln(x)
315
316        # Reshape
317        if reshape:
318            ps = int(x.shape[1] ** 0.5)
319            x = x.reshape(bs, ps, ps, self.embed_dim).permute(0, 3, 1, 2)
320
321        return x
322
323
324if __name__ == "__main__":
325    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
326    x = torch.randn(16, 3, 224, 224).to(device)
327    model = ViR(depth=12, heads=3, embed_dim=192).eval().to(device)
328
329    with torch.no_grad():
330        y1 = model(x, mode=ViRModes.PARALLEL)
331        y2 = model(x, mode=ViRModes.RECURRENT)
332        y3 = model(x, mode=ViRModes.CHUNKWISE, chunk_size=20)
333
334        assert torch.allclose(
335            y1, y2, atol=1e-5
336        ), "Parallel and recurrent modes should give the same output"
337
338        assert torch.allclose(
339            y1, y3, atol=1e-5
340        ), "Parallel and chunkwise modes should give the same output"
341

It feels like I should comment there 300+ lines, but really there is nothing that is not already covered in the formulas. The only thing that I should mention is that the chunk size $C$ might not entirely devide the sequence length $N$ , so what one can do is adding some dummy tokens at the end of the sequence such that the sequence is entirely divisible by the chunk size (a sort of padding).

Also, I found it key for performances to actually perform computations for all chunks in parallel, so it is not enough to re-use the forward_parallel function sequentially for each chunk.

Also notice that we use different alphas for each head: some heads with a higher alpha will look further back into the past, other heads with a lower alpha will mostly focus on most recent tokens.

Thank you for reading until here! If you found this helpful / interesting, or have suggestions on how to improve, please do not hesitate to contact me at me@brianpulfer.ch