TP2: Building a Transformer Encoder and Exploring BERT Downstream Tasks

By Mariem ZAOUALI

NB: This lab was developed based on materials provided by NVIDIA.

The Transformer architecture was introduced in the paper “Attention Is All You Need” (Vaswani et al., 2017). We’ll examine the main components of an encoder-style Transformer layer step by step:

  • Token Embeddings
  • Positional Encoding
  • Multi-Head Self-Attention
  • Feed-Forward Network
  • Residual Connections & Layer Normalization
  • Putting them together into a Transformer Block
  • Finally, building a multi-layer Transformer Encoder

Part 1 : The Transformer Architecture

import torch
import torch.nn as nn
import torch.optim as optim
import math

# If you're in a fresh environment and need transformers:
# !pip install transformers

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.manual_seed(42)  # Ensures reproducibility: random operations (weight initialization, sampling, etc.) produce the same results on every run.

2. KEY COMPONENTS OF A TRANSFORMER

2.1 Token Embeddings

We recall that tokenization is the step of splitting text into tokens (words or subwords) and assigning each token a unique ID. In our example, we will arbitrarily generate these IDs.

A simple embedding layer converts token IDs (integers) into vectors of dimension d_model, the dimensionality of the model’s representation space. We will use 8 for our example; in a model like BERT, the embedding dimension is 768.

Demo: We’ll:

  1. Create a random batch of token IDs (batch_size=2, seq_length=5).
    This means we want to create two sequences, each containing five IDs.
    Example: [23, 54, 2, 88, 19] and [12, 99, 5, 45, 7].

  2. Pass them through the embedding layer.

  3. Print the shape of the output.

class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size, d_model):
        super(TokenEmbedding, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)

    def forward(self, x):
        # x shape: (batch_size, seq_length)
        # output shape: (batch_size, seq_length, d_model)
        return self.embedding(x)

print("\n-- Demo: TokenEmbedding --")
vocab_size = 100
d_model = 8
sample_batch = torch.randint(0, vocab_size, (2, 5))  # (batch_size=2, seq_length=5) This line creates random integers between 0 and vocab_size - 1.
#These integers are meant to simulate token IDs, not actual words.

embedding_layer = TokenEmbedding(vocab_size, d_model)
embedded_output = embedding_layer(sample_batch)
print("Input shape:", sample_batch.shape)
print("Output shape (embedded):", embedded_output.shape)
print("-- End of Demo --")

To better understand this code, let’s break it down into parts:

class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size, d_model):
        super(TokenEmbedding, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)

    def forward(self, x):
        # x shape: (batch_size, seq_length)
        # output shape: (batch_size, seq_length, d_model)
        return self.embedding(x)

The class is a subclass of torch.nn.Module. This means it inherits all the internal functionality of PyTorch for neural network models.

__init__() is the constructor; it defines the layers (here nn.Embedding).

forward() is the computation function; it describes what the module does when an input is passed to it. When you call:

embedding_layer = TokenEmbedding(vocab_size, d_model)

You create an object (an instance) of the TokenEmbedding class. This calls the constructor __init__() once, to initialize the internal layer:

self.embedding = nn.Embedding(vocab_size, d_model)

At this point, the embedding matrix (e.g., 100 × 8) is created and randomly initialized. However, no computation has been performed yet.

Then, when you do:

embedded_output = embedding_layer(sample_batch)

When you call a module instance like a function, i.e., embedding_layer(...), PyTorch automatically redirects this call to the forward() method of your class.
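As a quick illustration of the lookup performed by forward(), we can index the embedding weight matrix directly with the token IDs and compare it to the demo output above (this sketch reuses embedding_layer, sample_batch, and embedded_output from the demo):

# nn.Embedding is a lookup table: indexing its (vocab_size x d_model) weight
# matrix with the token IDs returns exactly the vectors produced by forward().
manual_lookup = embedding_layer.embedding.weight[sample_batch]  # (2, 5, 8)
print("Same result as forward()?", torch.allclose(manual_lookup, embedded_output))  # True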

2.2 Positional Encoding

The Transformer doesn’t inherently understand the ordering of tokens. We add a positional encoding to each token embedding so that the model knows the relative/absolute positions.

We’ll implement the sinusoidal approach from Vaswani et al.:

\[
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{\,2i/d_{\text{model}}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{\,2i/d_{\text{model}}}}\right)
\]

Demo: We’ll:

  1. Use the output of TokenEmbedding from the previous step.
  2. Add positional encodings to it.
  3. Print shape and show the effect on a small slice.
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super(PositionalEncoding, self).__init__()

        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)  # shape: (max_len, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))

        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)

        # Register pe as a buffer so it won't be trained
        self.register_buffer('pe', pe.unsqueeze(0))  # shape: (1, max_len, d_model)

    def forward(self, x):
        # x: (batch_size, seq_length, d_model)
        seq_length = x.size(1)
        # Add positional encoding up to seq_length
        x = x + self.pe[:, :seq_length, :].to(x.device)
        return x


print("\n-- Demo: PositionalEncoding --")
pos_enc_layer = PositionalEncoding(d_model, max_len=10)  # just a small max_len for demonstration
with_pe = pos_enc_layer(embedded_output)  # from previous embedding demo
print("Input shape (embedded_output):", embedded_output.shape)
print("Output shape (pos-encoded):", with_pe.shape)
print("First example, first token embedding (before PE):\n", embedded_output[0,0,:])
print("First example, first token embedding (after PE):\n", with_pe[0,0,:])
print("-- End of Demo --")

2.3 Multi-Head Self-Attention

Self-attention helps the model understand which words should focus on which others, enabling deep contextual understanding across the entire sequence. It operates using three key components derived from the same input sequence:

  • Q (Query) → What am I looking for?
  • K (Key) → What information do I have?
  • V (Value) → The actual information to be passed along.

The image shows Head 1 of 12, meaning this is just one attention head. Transformers use multiple heads to capture different types of relationships (e.g., syntax, meaning, dependency).

Steps of the Self-Attention Mechanism (Transformer)

Self-attention allows a Transformer model to determine which words in a sentence are most relevant to each other, producing context-aware representations.

1️⃣ Input: Token Embeddings
  • The input sentence (e.g., “Data visualization empowers users to”) is tokenized into words or subwords.
  • Each token is converted into a vector embedding.

2️⃣ Linear Projections → Q, K, and V

For each token embedding x, the model computes three vectors:

  • Q (Query) — what this word is looking for
  • K (Key) — what this word offers
  • V (Value) — the actual information to be passed on

These are obtained via learned weight matrices:

\[
Q = X W_Q, \qquad K = X W_K, \qquad V = X W_V
\]


3️⃣ Dot Product (Attention Scores)
  • Each Query vector is compared with all Key vectors using a dot product.
  • This produces a matrix of attention scores that measure how much each token attends to every other token.

\[
\text{scores} = Q K^\top
\]


4️⃣ Scaling and Masking
  • The scores are divided by \(\sqrt{d_k}\) (where \(d_k\) is the key dimension) to keep the dot products from growing too large, which would push the softmax into regions with very small gradients.
  • In some cases (e.g., decoder), a mask is applied to ignore certain positions (like future words).

\[
\text{scores} = \frac{Q K^\top}{\sqrt{d_k}}
\]


5️⃣ Softmax Normalization
  • Each row of the score matrix is passed through a Softmax function.
  • This converts raw scores into probabilities (attention weights) that sum to 1.

\[
A = \text{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right)
\]


6️⃣ Weighted Sum with Values
  • Each Value (V) vector is multiplied by its corresponding attention weight.
  • The results are summed to produce the final context vector for each token.

\[
\text{Attention}(Q, K, V) = A\,V
\]

This gives each token a new embedding that captures contextual information from other tokens.


7️⃣ Multi-Head Attention
  • The process above is repeated several times in parallel (e.g., 12 heads).
  • Each head learns different relationships (syntax, meaning, dependencies).
  • The outputs are concatenated and linearly projected to form the final attention output.

All heads’ outputs are then concatenated and passed through a final linear projection (out_proj in the implementation below).
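Before the full multi-head implementation, here is a minimal single-head sketch of steps 2–6 on a toy tensor. The weight matrices are random stand-ins for the learned projections, and with a single head \(d_k\) equals the model dimension:

# Single-head scaled dot-product attention on a toy input (steps 2-6 above).
B, S, D = 1, 4, 8                                        # batch size, sequence length, model dim
x_toy = torch.randn(B, S, D)                             # stand-in token embeddings

W_q, W_k, W_v = torch.randn(D, D), torch.randn(D, D), torch.randn(D, D)  # random stand-in weights
Q, K, V = x_toy @ W_q, x_toy @ W_k, x_toy @ W_v          # step 2: linear projections

scores = Q @ K.transpose(-1, -2) / math.sqrt(D)          # steps 3-4: dot products + scaling
weights = torch.softmax(scores, dim=-1)                  # step 5: each row sums to 1
context = weights @ V                                    # step 6: weighted sum of values

print("Attention weights:", weights.shape, "- rows sum to", weights.sum(dim=-1))  # (1, 4, 4)
print("Context vectors:  ", context.shape)                                        # (1, 4, 8)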

Demo:

  1. We’ll create a MultiHeadSelfAttention object with num_heads=2.
  2. Pass a small batch of embeddings (pos-encoded) through it.
  3. Print the output shape.
class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadSelfAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        self.d_k = d_model // num_heads

        self.query = nn.Linear(d_model, d_model)
        self.key = nn.Linear(d_model, d_model)
        self.value = nn.Linear(d_model, d_model)

        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        # x: (batch_size, seq_length, d_model)
        batch_size, seq_length, _ = x.size()

        # Linear projections
        Q = self.query(x)  # (batch_size, seq_length, d_model)
        K = self.key(x)
        V = self.value(x)

        # Reshape to (batch_size, num_heads, seq_length, d_k)
        Q = Q.view(batch_size, seq_length, self.num_heads, self.d_k).transpose(1, 2)
        K = K.view(batch_size, seq_length, self.num_heads, self.d_k).transpose(1, 2)
        V = V.view(batch_size, seq_length, self.num_heads, self.d_k).transpose(1, 2)

        # Compute attention scores: (batch_size, num_heads, seq_length, seq_length)
        scores = torch.matmul(Q, K.transpose(-1, -2)) / math.sqrt(self.d_k)

        if mask is not None:
            # mask must broadcast to (batch_size, num_heads, seq_length, seq_length),
            # e.g. (batch_size, 1, 1, seq_length) for a padding mask
            scores = scores.masked_fill(mask == 0, float('-inf'))

        attn_weights = torch.softmax(scores, dim=-1)

        # Weighted sum of values: (batch_size, num_heads, seq_length, d_k)
        attn_output = torch.matmul(attn_weights, V)

        # Transpose back and combine heads
        attn_output = attn_output.transpose(1, 2).contiguous().view(batch_size, seq_length, self.d_model)

        # Final linear
        out = self.out_proj(attn_output)
        return out


print("\n-- Demo: MultiHeadSelfAttention --")
attention_layer = MultiHeadSelfAttention(d_model=d_model, num_heads=2)
attn_output = attention_layer(with_pe)  # with_pe from the PositionalEncoding demo
print("Input shape (pos-encoded embeddings):", with_pe.shape)
print("Output shape (after self-attention):", attn_output.shape)
print("-- End of Demo --")

2.4 Feed-Forward Network (Position-wise)

A 2-layer MLP applied to each position independently:

\[
\text{FFN}(x) = \max(0,\; x W_1 + b_1)\, W_2 + b_2
\]

Demo:

  1. Construct the feed-forward network with d_ff=16.
  2. Pass the attention output through it.
  3. Print the shape.
class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super(PositionwiseFeedForward, self).__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        x = self.linear1(x)
        x = torch.relu(x)
        x = self.linear2(x)
        return x

print("\n-- Demo: PositionwiseFeedForward --")
ffn_layer = PositionwiseFeedForward(d_model=d_model, d_ff=16)
ffn_output = ffn_layer(attn_output)
print("Input shape (after self-attn):", attn_output.shape)
print("Output shape (after feed-forward):", ffn_output.shape)
print("-- End of Demo --")

2.5 Residual & Layer Normalization

Each sub-layer (attention or feed-forward) is wrapped with:

  1. Residual connection
  2. Layer normalization

We’ll define a small sublayer wrapper:

Demo:

  1. Use the sublayer wrapper to apply self-attention to the input with a residual connection.
  2. Print shape.
class SublayerConnection(nn.Module):
    def __init__(self, d_model, dropout=0.1):
        super(SublayerConnection, self).__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # x: (batch_size, seq_length, d_model)
        # sublayer is a function (could be self-attn or feed-forward)
        normed_x = self.norm(x)
        out = sublayer(normed_x)
        return x + self.dropout(out)  # residual connection


print("\n-- Demo: SublayerConnection (with multi-head attention) --")
sublayer = SublayerConnection(d_model=d_model, dropout=0.1)
# We'll define a "temporary" function that calls our attention layer
def attn_sublayer(x_input):
    return attention_layer(x_input)  # reusing attention_layer from above

# We pass the attn_sublayer function in
sublayer_output = sublayer(with_pe, attn_sublayer)
print("Input shape:", with_pe.shape)
print("Output shape (with residual + layernorm):", sublayer_output.shape)
print("-- End of Demo --")

3. PUTTING IT ALL TOGETHER: TRANSFORMER BLOCK

A single Transformer Encoder Block typically has:

  1. Sublayer: Multi-Head Self-Attention
  2. Sublayer: Feed-Forward

Each with residual + layer normalization.

Demo:

  1. Build a TransformerEncoderLayer.
  2. Pass the positional-encoded embeddings through it.
class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super(TransformerEncoderLayer, self).__init__()
        self.self_attn = MultiHeadSelfAttention(d_model, num_heads)
        self.feed_forward = PositionwiseFeedForward(d_model, d_ff)

        self.sublayer1 = SublayerConnection(d_model, dropout)
        self.sublayer2 = SublayerConnection(d_model, dropout)

    def forward(self, x, mask=None):
        # 1. Self-Attention sublayer
        x = self.sublayer1(x, lambda _x: self.self_attn(_x, mask=mask))
        # 2. Feed-forward sublayer
        x = self.sublayer2(x, self.feed_forward)
        return x

# Demo for TransformerEncoderLayer
print("\n-- Demo: TransformerEncoderLayer --")
encoder_layer = TransformerEncoderLayer(d_model=d_model, num_heads=2, d_ff=16, dropout=0.1)
layer_output = encoder_layer(with_pe)  # with_pe is from earlier demo
print("Input shape (pos-encoded embeddings):", with_pe.shape)
print("Output shape (after 1 Transformer block):", layer_output.shape)
print("-- End of Demo --")

4. BUILDING A MULTI-LAYER TRANSFORMER ENCODER

We can stack multiple TransformerEncoderLayer objects to form a full encoder. We also include the TokenEmbedding and PositionalEncoding at the start.

Demo:

  1. Build a TransformerEncoder with 2 layers.
  2. Pass some random token IDs to see the final output.
class TransformerEncoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_heads, d_ff, num_layers, max_seq_len=100, dropout=0.1):
        super(TransformerEncoder, self).__init__()
        self.token_embedding = TokenEmbedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model, max_len=max_seq_len)

        self.layers = nn.ModuleList([
            TransformerEncoderLayer(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, mask=None):
        # x: (batch_size, seq_length)
        x = self.token_embedding(x)
        x = self.pos_encoding(x)

        for layer in self.layers:
            x = layer(x, mask=mask)

        x = self.norm(x)
        return x
# -----------------------
# Demo: TransformerEncoder
# -----------------------
print("\n-- Demo: TransformerEncoder (2 layers) --")
vocab_size = 1000
d_model = 16
num_heads = 2
d_ff = 32
num_layers = 2
max_seq_len = 10

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

encoder = TransformerEncoder(vocab_size, d_model, num_heads, d_ff, num_layers, max_seq_len).to(device)

dummy_input = torch.randint(0, vocab_size, (2, 10)).to(device)
encoder_output = encoder(dummy_input)

print("Encoder output shape:", encoder_output.shape)  # should be (2, 10, d_model)
print("-- End of Demo --")

Part 2 : BERT downstream tasks

References

  • Introduction to Transformer-Based Natural Language Processing, NVIDIA
  • Generative AI Teaching Kit, NVIDIA
  • The Transformer Explainer: https://poloclub.github.io/transformer-explainer/
  • A Complete Guide to BERT with Code: https://towardsdatascience.com/a-complete-guide-to-bert-with-code-9f87602e4a11/