From Fixed to Flexible: Harnessing Deformable Attention for Enhanced Learning

Cenk Bircanoglu
Feb 3, 2024

This blog post will dig into the details of attention mechanisms, the foundation for understanding how machines pick and choose what information to process. But that’s not all; we’ll also explore the innovative idea of deformable attention, a dynamic approach that brings adaptability to the forefront.

Attention Mechanism

Consider reading a long sentence where your attention isn’t distributed evenly across every word. Rather, you concentrate more on keywords crucial for comprehension. Similarly, an attention layer in a neural network operates by assigning weights to various segments of an input sequence, prioritizing them based on their significance to a particular task.

Figure: scaled dot-product attention (credit: https://arxiv.org/pdf/1706.03762.pdf)

The components of a standard attention layer are:

  • Query: the model asking, “What am I looking for?” It is a set of vectors representing what the model is currently interested in. These vectors carry the context or features the model needs in order to focus on the important parts of the input.
  • Key: a second set of vectors describing what is in the input. The attention mechanism compares the query against the keys to measure how well they match, which tells the model which parts of the input are most relevant to the question it is asking.
  • Value: a third set of vectors holding the actual content of each input element, i.e. the features associated with every piece of the input the model is looking at.
  • Attention Scores: grades for how much the model should attend to each element. Technically, an attention score measures the similarity or relevance between a query vector and a key vector for a given pair of elements. They help the model decide where to concentrate its effort.
  • Attention Weights: the attention scores passed through a softmax function so that they sum to 1. Think of these weights as guides, indicating how much importance each element should receive when the values are combined.
  • Output: the sum of the values, each multiplied by its assigned attention weight. This final result holds the information from the sequence that matters most for the task at hand.

When the Query, Key, and Value are generated from the same sequence, we call it self-attention:

Q = XW_Q,  K = XW_K,  V = XW_V

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

where each W is a learnable weight matrix and d_k is the dimension of the key vectors.
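As a concrete illustration, here is a minimal sketch of scaled dot-product self-attention in PyTorch (the function and variable names are illustrative, not from any library):

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence x of shape (n, d)."""
    q = x @ w_q                     # queries: what each position is looking for
    k = x @ w_k                     # keys:    what each position offers
    v = x @ w_v                     # values:  the actual content at each position
    d_k = k.shape[-1]
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # attention scores
    weights = F.softmax(scores, dim=-1)            # attention weights, rows sum to 1
    return weights @ v                             # weighted sum of values

# toy usage: a sequence of 5 tokens with 8-dimensional features
x = torch.randn(5, 8)
w_q, w_k, w_v = (torch.randn(8, 8) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)  # shape (5, 8)
```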

Deformable Attention Mechanism

Deformable Attention is an attention technique that enhances standard self-attention by adding the adaptability to capture spatial relationships in a sequence or image input. Originally designed for computer vision tasks, it introduces the flexibility needed to handle intricate spatial structure effectively.

In a regular self-attention setup, each position in a sequence, or each spatial location in an image, interacts with the others in a fixed, predefined way. Deformable Attention instead says, “Let’s learn where to focus dynamically.” It introduces trainable offsets that let each position adjust where it attends, enabling the model to capture complex, non-uniform relationships within the data and to recognize intricate patterns in images or sequences.

Figure: deformable attention (credit: https://arxiv.org/pdf/2201.00520.pdf)

Key components of Deformable Attention are:

  • Query, Key, Value: play the same roles as in the standard attention mechanism; the difference is that the Keys and Values are computed from features sampled at the deformed points.
  • Sampling Points: the starting positions, i.e. the regular grid of locations attention would use in the absence of any deformable adjustment.
  • Sampling Offsets: learned vectors that dynamically adjust the sampling points. Each position gets additional learnable parameters that control how much its attention region can “move” or deform.
  • Deformed Sampling Points: the final locations where attention focuses, obtained by adding the offsets to the original positions.
  • Attention Scores: Measure the relevance of each deformed sampling point to the query.
  • Attention Weights: Normalized scores, indicating the importance of each point.
  • Output: Weighted sum of values based on attention weights, capturing relevant information.

The sampling offsets are predicted by a small neural network. For each query, this network examines the surrounding context and predicts a vector representing the offset, effectively adjusting the initial sampling point.

Deformed Sampling Points are derived by adding the predicted offsets to the initial grid positions. The resulting sampling points adapt dynamically to the content, in contrast to the fixed points used in standard attention mechanisms.
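To make the offset prediction and deformed sampling concrete, here is a simplified PyTorch sketch. It assumes a 2D feature map input and uses bilinear sampling via F.grid_sample; the offset network architecture and the single sampling point per location are illustrative simplifications, not the exact design of Deformable DETR or the deformable attention paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformedSampler(nn.Module):
    """Predicts per-location offsets and samples features at the deformed points."""
    def __init__(self, channels):
        super().__init__()
        # small network that predicts a 2D offset (dx, dy) for every location
        self.offset_net = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.GELU(),
            nn.Conv2d(channels, 2, 1),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        # reference grid: the fixed sampling points, normalized to [-1, 1]
        ys = torch.linspace(-1, 1, h, device=x.device)
        xs = torch.linspace(-1, 1, w, device=x.device)
        grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
        grid = torch.stack((grid_x, grid_y), dim=-1)        # (h, w, 2), (x, y) order
        grid = grid.unsqueeze(0).expand(b, -1, -1, -1)      # (b, h, w, 2)
        # predicted offsets; tanh keeps them bounded
        offsets = self.offset_net(x).permute(0, 2, 3, 1).tanh()  # (b, h, w, 2)
        deformed = grid + offsets                           # deformed sampling points
        # bilinear sampling of the feature map at the deformed points
        return F.grid_sample(x, deformed, align_corners=True)   # (b, c, h, w)
```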

With the deformed keys and values, the formulation of Deformable Attention becomes:

Δp = θ_offset(q),  x̃ = sample(x; p + Δp)

k̃ = x̃W_K,  ṽ = x̃W_V

DeformAttn(q, k̃, ṽ) = softmax(q k̃ᵀ / √d_k + ϕ) ṽ

where x̃ are the features sampled at the deformed points p + Δp, k̃ and ṽ are the deformed keys and values, and ϕ is the positional embedding.
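Putting it together, here is a sketch of the attention step over the deformed keys and values, reusing the DeformedSampler from the previous snippet (the positional embedding ϕ is omitted for brevity):

```python
import torch

def deformable_attention(x, sampler, w_q, w_k, w_v):
    """Attention where keys and values come from features sampled at deformed points."""
    x_tilde = sampler(x)                          # features at the deformed points
    q = x.flatten(2).transpose(1, 2) @ w_q        # queries from the original features
    k = x_tilde.flatten(2).transpose(1, 2) @ w_k  # deformed keys
    v = x_tilde.flatten(2).transpose(1, 2) @ w_v  # deformed values
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
    weights = scores.softmax(dim=-1)              # attention weights
    return weights @ v                            # (b, h*w, d)

# toy usage on a 16x16 feature map with 32 channels
x = torch.randn(2, 32, 16, 16)
sampler = DeformedSampler(32)
w_q, w_k, w_v = (torch.randn(32, 32) for _ in range(3))
out = deformable_attention(x, sampler, w_q, w_k, w_v)  # (2, 256, 32)
```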

Comparison

Deformable attention gives the model a smarter way to pay attention. Instead of sticking to fixed points, it can adjust where it looks, which helps it perform well in tasks like object detection, image captioning, and language translation. It is a more versatile tool that works well with different types of information. Although somewhat more complex, deformable attention can deliver better results when implemented carefully.


References:

[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. Attention Is All You Need. 2017. https://arxiv.org/abs/1706.03762

[2] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, Jifeng Dai. Deformable DETR: Deformable Transformers for End-to-End Object Detection. ICLR 2021. https://arxiv.org/abs/2010.04159


Cenk Bircanoglu

Computer Vision Engineer, focused on Self-Supervised Learning approaches