TP2 Building Transformers Encoder and Exploring BERT for Sentiment Analysis


By Mariem ZAOUALI


NB: This Lab was developed based on the materials delivered by NVIDIA.

The Transformer architecture was introduced in the paper “Attention Is All You Need” (Vaswani et al., 2017). We’ll examine the main components of an (encoder-style) Transformer layer step by step:

  • Token Embeddings
  • Positional Encoding
  • Multi-Head Self-Attention
  • Feed-Forward Network
  • Residual Connections & Layer Normalization
  • Putting them together into a Transformer Block
  • Finally, building a multi-layer Transformer Encoder

Part 1 : The Transformer Architecture

import torch
import torch.nn as nn
import torch.optim as optim
import math

# If you're in a fresh environment and need transformers:
# !pip install transformers

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.manual_seed(42)  # Ensures reproducibility: random operations (weight initialization, sampling, etc.) produce the same results on every run.

2. KEY COMPONENTS OF A TRANSFORMER

2.1 Token Embeddings

We recall that tokenization is the step of splitting sentences into a set of words, and then assigning each word a unique ID. In our example, we will arbitrarily generate these IDs.

A simple embedding layer converts token IDs (integers) into vectors of dimension d_model. This dimension corresponds to the representation space of the embeddings in our model. We will choose 8 for our example, but in a model like BERT, for instance, the embedding representation space is set to 768.

Demo: We’ll:

  1. Create a random batch of token IDs (batch_size=2, seq_length=5).
    This means we want to create two sequences, each containing five IDs.
    Example: [23, 54, 2, 88, 19] and [12, 99, 5, 45, 7].

  2. Pass them through the embedding layer.

  3. Print the shape of the output.

class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size, d_model):
        super(TokenEmbedding, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)

    def forward(self, x):
        # x shape: (batch_size, seq_length)
        # output shape: (batch_size, seq_length, d_model)
        return self.embedding(x)

print("\n-- Demo: TokenEmbedding --")
vocab_size = 100
d_model = 8
sample_batch = torch.randint(0, vocab_size, (2, 5))  # (batch_size=2, seq_length=5) This line creates random integers between 0 and vocab_size - 1.
#These integers are meant to simulate token IDs, not actual words.

embedding_layer = TokenEmbedding(vocab_size, d_model)
embedded_output = embedding_layer(sample_batch)
print("Input shape:", sample_batch.shape)
print("Output shape (embedded):", embedded_output.shape)
print("-- End of Demo --")

To better understand this code, let’s break it down into parts:

class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size, d_model):
        super(TokenEmbedding, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)

    def forward(self, x):
        # x shape: (batch_size, seq_length)
        # output shape: (batch_size, seq_length, d_model)
        return self.embedding(x)

The class is a subclass of torch.nn.Module. This means it inherits all the internal functionality of PyTorch for neural network models.

__init__() is the constructor; it defines the layers (here nn.Embedding).

forward() is the computation function; it describes what the module does when an input is passed to it. When you call:

embedding_layer = TokenEmbedding(vocab_size, d_model)

You create an object (an instance) of the TokenEmbedding class. This calls the constructor __init__() once, to initialize the internal layer:

self.embedding = nn.Embedding(vocab_size, d_model)

At this point, the embedding matrix (e.g., 100 × 8) is created (the word embedding operation is properly set up) and randomly initialized. However, no computation has been performed yet.

Then, when you do:

embedded_output = embedding_layer(sample_batch)

When you call a module instance like a function, i.e., embedding_layer(...), PyTorch automatically redirects this call to the forward() method of your class.
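
To make this concrete, here is a small optional check (a sketch using the embedding_layer and sample_batch objects created above) that inspects the embedding matrix and confirms that calling the module instance is equivalent to calling forward() directly:

# The embedding matrix created in __init__ has shape (vocab_size, d_model) = (100, 8)
print(embedding_layer.embedding.weight.shape)  # torch.Size([100, 8])

# Calling the module instance and calling forward() explicitly give the same result
out_call = embedding_layer(sample_batch)
out_forward = embedding_layer.forward(sample_batch)
print(torch.equal(out_call, out_forward))  # True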

2.2 Positional Encoding

The Transformer doesn’t inherently understand the ordering of tokens. We add a positional encoding to each token embedding so that the model knows the relative/absolute positions.

We’ll implement the sinusoidal approach from Vaswani et al.:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

Demo: We’ll:

  1. Use the output of TokenEmbedding from the previous step.
  2. Add positional encodings to it.
  3. Print shape and show the effect on a small slice.
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super(PositionalEncoding, self).__init__()

        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)  # shape: (max_len, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))

        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)

        # Register pe as a buffer so it won't be trained
        self.register_buffer('pe', pe.unsqueeze(0))  # shape: (1, max_len, d_model)

    def forward(self, x):
        # x: (batch_size, seq_length, d_model)
        seq_length = x.size(1)
        # Add positional encoding up to seq_length
        x = x + self.pe[:, :seq_length, :].to(x.device)
        return x


print("\n-- Demo: PositionalEncoding --")
pos_enc_layer = PositionalEncoding(d_model, max_len=10)  # just a small max_len for demonstration
with_pe = pos_enc_layer(embedded_output)  # from previous embedding demo
print("Input shape (embedded_output):", embedded_output.shape)
print("Output shape (pos-encoded):", with_pe.shape)
print("First example, first token embedding (before PE):\n", embedded_output[0,0,:])
print("First example, first token embedding (after PE):\n", with_pe[0,0,:])
print("-- End of Demo --")

2.3 Multi-Head Self-Attention

Self-attention helps the model understand which words should focus on which others, enabling deep contextual understanding across the entire sequence. It operates using three key components derived from the same input sequence:

  • Q (Query) → What am I looking for?
  • K (Key) → What information do I have?
  • V (Value) → The actual information to be passed along.

A Transformer layer contains several attention heads (12 per layer in BERT Base, for example); each head is computed independently, and the multiple heads capture different types of relationships (e.g., syntax, meaning, dependency).

Steps of the Self-Attention Mechanism (Transformer)

Self-attention allows a Transformer model to determine which words in a sentence are most relevant to each other, producing context-aware representations.

1️⃣ Input: Token Embeddings
  • The input sentence (e.g., “Data visualization empowers users to”) is tokenized into words or subwords.
  • Each token is converted into a vector embedding.

2️⃣ Linear Projections → Q, K, and V

For each token embedding x, the model computes three vectors:

  • Q (Query) — what this word is looking for
  • K (Key) — what this word offers
  • V (Value) — the actual information to be passed on

These are obtained via learned weight matrices:

$$Q = XW_Q, \qquad K = XW_K, \qquad V = XW_V$$


3️⃣ Dot Product (Attention Scores)
  • Each Query vector is compared with all Key vectors using a dot product.
  • This produces a matrix of attention scores that measure how much each token attends to every other token.

$$\text{scores} = QK^\top$$


4️⃣ Scaling and Masking
  • The scores are divided by $\sqrt{d_k}$ (where $d_k$ is the key dimension) to prevent large values.
  • In some cases (e.g., decoder), a mask is applied to ignore certain positions (like future words).

$$\text{scores} = \frac{QK^\top}{\sqrt{d_k}}$$


5️⃣ Softmax Normalization
  • Each row of the score matrix is passed through a Softmax function.
  • This converts raw scores into probabilities (attention weights) that sum to 1.

$$\text{weights} = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)$$


6️⃣ Weighted Sum with Values
  • Each Value (V) vector is multiplied by its corresponding attention weight.
  • The results are summed to produce the final context vector for each token.

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

This gives each token a new embedding that captures contextual information from other tokens.


7️⃣ Multi-Head Attention
  • The process above is repeated several times in parallel (e.g., 12 heads).
  • Each head learns different relationships (syntax, meaning, dependencies).
  • The outputs are concatenated and linearly projected to form the final attention output.

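Before moving to the full implementation, here is a minimal single-head sketch of steps 2 to 6 on tiny tensors (the dimensions and random seed are arbitrary choices for illustration); the demo below uses the complete multi-head module:

import torch
import math

torch.manual_seed(0)
d_k = 4
x = torch.randn(1, 3, d_k)                    # (batch_size=1, seq_length=3, d_k)
W_q, W_k, W_v = (torch.nn.Linear(d_k, d_k) for _ in range(3))

Q, K, V = W_q(x), W_k(x), W_v(x)              # step 2: linear projections
scores = Q @ K.transpose(-1, -2)              # step 3: dot products -> (1, 3, 3)
scores = scores / math.sqrt(d_k)              # step 4: scaling
weights = torch.softmax(scores, dim=-1)       # step 5: softmax, each row sums to 1
context = weights @ V                         # step 6: weighted sum of values

print(weights)                                # attention weights, shape (1, 3, 3)
print(context.shape)                          # context vectors, shape (1, 3, 4)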

Demo:

  1. We’ll create a MultiHeadSelfAttention object with num_heads=2.
  2. Pass a small batch of embeddings (pos-encoded) through it.
  3. Print the output shape.
class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadSelfAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        self.d_k = d_model // num_heads

        self.query = nn.Linear(d_model, d_model)
        self.key = nn.Linear(d_model, d_model)
        self.value = nn.Linear(d_model, d_model)

        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        # x: (batch_size, seq_length, d_model)
        batch_size, seq_length, _ = x.size()

        # Linear projections
        Q = self.query(x)  # (batch_size, seq_length, d_model)
        K = self.key(x)
        V = self.value(x)

        # Reshape to (batch_size, num_heads, seq_length, d_k)
        Q = Q.view(batch_size, seq_length, self.num_heads, self.d_k).transpose(1, 2)
        K = K.view(batch_size, seq_length, self.num_heads, self.d_k).transpose(1, 2)
        V = V.view(batch_size, seq_length, self.num_heads, self.d_k).transpose(1, 2)

        # Compute attention scores: (batch_size, num_heads, seq_length, seq_length)
        scores = torch.matmul(Q, K.transpose(-1, -2)) / math.sqrt(self.d_k)

        if mask is not None:
            # mask shape is typically (batch_size, 1, seq_length, seq_length) or (batch_size, seq_length)
            scores = scores.masked_fill(mask == 0, float('-inf'))

        attn_weights = torch.softmax(scores, dim=-1)

        # Weighted sum of values: (batch_size, num_heads, seq_length, d_k)
        attn_output = torch.matmul(attn_weights, V)

        # Transpose back and combine heads
        attn_output = attn_output.transpose(1, 2).contiguous().view(batch_size, seq_length, self.d_model)

        # Final linear
        out = self.out_proj(attn_output)
        return out


print("\n-- Demo: MultiHeadSelfAttention --")
attention_layer = MultiHeadSelfAttention(d_model=d_model, num_heads=2)
attn_output = attention_layer(with_pe)  # with_pe from the PositionalEncoding demo
print("Input shape (pos-encoded embeddings):", with_pe.shape)
print("Output shape (after self-attention):", attn_output.shape)
print("-- End of Demo --")

2.4 Feed-Forward Network (Position-wise)

A 2-layer MLP applied to each position independently:

$$\text{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2$$

Demo:

  1. Construct the feed-forward network with d_ff=16.
  2. Pass the attention output through it.
  3. Print the shape.
class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super(PositionwiseFeedForward, self).__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        x = self.linear1(x)
        x = torch.relu(x)
        x = self.linear2(x)
        return x

print("\n-- Demo: PositionwiseFeedForward --")
ffn_layer = PositionwiseFeedForward(d_model=d_model, d_ff=16)
ffn_output = ffn_layer(attn_output)
print("Input shape (after self-attn):", attn_output.shape)
print("Output shape (after feed-forward):", ffn_output.shape)
print("-- End of Demo --")

2.5 Residual & Layer Normalization

Each sub-layer (attention or feed-forward) is wrapped with:

  1. Residual connection
  2. Layer normalization

We’ll define a small sublayer wrapper:

Demo:

  1. Use the sublayer wrapper to apply self-attention to the input with a residual connection.
  2. Print shape.
class SublayerConnection(nn.Module):
    def __init__(self, d_model, dropout=0.1):
        super(SublayerConnection, self).__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # x: (batch_size, seq_length, d_model)
        # sublayer is a function (could be self-attn or feed-forward)
        normed_x = self.norm(x)
        out = sublayer(normed_x)
        return x + self.dropout(out)  # residual connection


print("\n-- Demo: SublayerConnection (with multi-head attention) --")
sublayer = SublayerConnection(d_model=d_model, dropout=0.1)
# We'll define a "temporary" function that calls our attention layer
def attn_sublayer(x_input):
    return attention_layer(x_input)  # reusing attention_layer from above

# We pass the attn_sublayer function in
sublayer_output = sublayer(with_pe, attn_sublayer)
print("Input shape:", with_pe.shape)
print("Output shape (with residual + layernorm):", sublayer_output.shape)
print("-- End of Demo --")

3. PUTTING IT ALL TOGETHER: TRANSFORMER BLOCK

A single Transformer Encoder Block typically has:

  1. Sublayer: Multi-Head Self-Attention
  2. Sublayer: Feed-Forward

Each with residual + layer normalization.

Demo:

  1. Build a TransformerEncoderLayer.
  2. Pass the positional-encoded embeddings through it.
class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super(TransformerEncoderLayer, self).__init__()
        self.self_attn = MultiHeadSelfAttention(d_model, num_heads)
        self.feed_forward = PositionwiseFeedForward(d_model, d_ff)

        self.sublayer1 = SublayerConnection(d_model, dropout)
        self.sublayer2 = SublayerConnection(d_model, dropout)

    def forward(self, x, mask=None):
        # 1. Self-Attention sublayer
        x = self.sublayer1(x, lambda _x: self.self_attn(_x, mask=mask))
        # 2. Feed-forward sublayer
        x = self.sublayer2(x, self.feed_forward)
        return x

# Demo for TransformerEncoderLayer
if __name__ == "__main__":
    print("\n-- Demo: TransformerEncoderLayer --")
    encoder_layer = TransformerEncoderLayer(d_model=d_model, num_heads=2, d_ff=16, dropout=0.1)
    layer_output = encoder_layer(with_pe)  # with_pe is from earlier demo
    print("Input shape (pos-encoded embeddings):", with_pe.shape)
    print("Output shape (after 1 Transformer block):", layer_output.shape)
    print("-- End of Demo --")

4. BUILDING A MULTI-LAYER TRANSFORMER ENCODER

We can stack multiple TransformerEncoderLayer objects to form a full encoder. We also include the TokenEmbedding and PositionalEncoding at the start.

Demo:

  1. Build a TransformerEncoder with 2 layers.
  2. Pass some random token IDs to see the final output.
class TransformerEncoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_heads, d_ff, num_layers, max_seq_len=100, dropout=0.1):
        super(TransformerEncoder, self).__init__()
        self.token_embedding = TokenEmbedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model, max_len=max_seq_len)

        self.layers = nn.ModuleList([
            TransformerEncoderLayer(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, mask=None):
        # x: (batch_size, seq_length)
        x = self.token_embedding(x)
        x = self.pos_encoding(x)

        for layer in self.layers:
            x = layer(x, mask=mask)

        x = self.norm(x)
        return x
# -----------------------
# Demo: TransformerEncoder
# -----------------------
print("\n-- Demo: TransformerEncoder (2 layers) --")
vocab_size = 1000
d_model = 16
num_heads = 2
d_ff = 32
num_layers = 2
max_seq_len = 10

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

encoder = TransformerEncoder(vocab_size, d_model, num_heads, d_ff, num_layers, max_seq_len).to(device)

dummy_input = torch.randint(0, vocab_size, (2, 10)).to(device)
encoder_output = encoder(dummy_input)

print("Encoder output shape:", encoder_output.shape)  # should be (2, 10, d_model)
print("-- End of Demo --")

Part 2 : BERT for Sentiment analysis

In this part, you will learn to fine-tune a pre-trained model. Specifically, we will use a model for sentiment analysis.

Sentiment Analysis is the task of detecting the sentiment in text. We model this problem as a simple form of text classification. For example, “Gollum's performance is incredible!” has a positive sentiment, while “It's neither as romantic nor as thrilling as it should be.” has a negative sentiment. In such an analysis, we look at whole sentences and have only two classes, “positive” and “negative”; each sentence in the training set must be labeled as one or the other. Sentiment analysis is widely used by businesses to identify customer sentiment toward products, brands, or services in online conversations and feedback. For this task, we will use Bidirectional Encoder Representations from Transformers (BERT).


BERT is a Large Language Model (LLM) developed by Google AI Language which has made significant advancements in the field of Natural Language Processing (NLP). It builds on the Transformer architecture proposed the year prior. While GPT focused on Natural Language Generation (NLG), BERT prioritised Natural Language Understanding (NLU). These two developments reshaped the landscape of NLP.

2.1 Load and Preprocess a Dataset

We will use a sentiment analysis dataset originally released by Stanford University. It contains 50,000 online movie reviews from the Internet Movie Database (IMDb), with each review labelled as either positive or negative. We will download the copy hosted on Kaggle: IMDB Dataset of 50K Movie Reviews.

Then, in your notebook, import the file:

import pandas as pd

# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Install and import kagglehub
!pip install kagglehub
import kagglehub

#  Download the dataset from Kaggle using kagglehub
path = kagglehub.dataset_download("lakshmi25npathi/imdb-dataset-of-50k-movie-reviews")
print("Dataset downloaded to:", path)

#  Copy the downloaded dataset to your Google Drive folder
import shutil

# Example: copy it into a folder in your Drive
drive_path = '/content/drive/MyDrive/KaggleDatasets/IMDB_50K/'
shutil.copytree(path, drive_path, dirs_exist_ok=True)

print("Dataset copied to Google Drive at:", drive_path)


df = pd.read_csv('/content/drive/MyDrive/KaggleDatasets/IMDB_50K/IMDB Dataset.csv')
df.head()

The IMDb dataset is fairly clean; however, there are some artifacts left over from the scraping process, such as HTML break tags (<br />) and unnecessary whitespace, which should be removed.

# Remove the break tags (<br />)
df['review_cleaned'] = df['review'].apply(lambda x: x.replace('<br />', ''))

# Remove unnecessary whitespace
df['review_cleaned'] = df['review_cleaned'].replace(r'\s+', ' ', regex=True)

# Compare 72 characters of the second review before and after cleaning
print('Before cleaning:')
print(df.iloc[1]['review'][0:72])

print('\nAfter cleaning:')
print(df.iloc[1]['review_cleaned'][0:72])

2.2 Encode the Sentiment

The final step of the preprocessing is to encode the sentiment of each review as either 0 for negative or 1 for positive. These labels will be used to train the classification head later in the fine-tuning process. Fine-tuning in BERT means taking a pre-trained BERT model (which already learned language patterns from a huge corpus like Wikipedia and BooksCorpus) and then training it further on a specific downstream task (like sentiment analysis, text classification, question answering, etc.).

df['sentiment_encoded'] = df['sentiment'].apply(lambda x: 0 if x == 'negative' else 1)
df.head()

2.3 Tokenize the data

Once preprocessed, the fine-tuning data can undergo tokenization. This process splits the review text into individual tokens, adds the [CLS] and [SEP] special tokens, and handles padding. It’s important to select the appropriate tokenizer for the model, as different language models require different tokenization steps (e.g. GPT does not expect [CLS] and [SEP] tokens). We will use the BertTokenizer class from the Hugging Face transformers library, which is designed to be used with BERT-based models.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
print(tokenizer)

There are four main options when working with BERT, each of which uses the vocabulary from Google’s pre-trained tokenizers. These are:

  • bert-base-uncased – the vocabulary for the smaller version of BERT, which is NOT case sensitive (e.g. the tokens Cat and cat will be treated the same)
  • bert-base-cased – the vocabulary for the smaller version of BERT, which IS case sensitive (e.g. the tokens Cat and cat will not be treated the same)
  • bert-large-uncased – the vocabulary for the larger version of BERT, which is NOT case sensitive (e.g. the tokens Cat and cat will be treated the same)
  • bert-large-cased – the vocabulary for the larger version of BERT, which IS case sensitive (e.g. the tokens Cat and cat will not be treated the same)
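
As a quick illustration (a small optional sketch; it downloads the two vocabularies from the Hugging Face hub), the snippet below shows the difference between the uncased and cased tokenizers on the same word:

from transformers import BertTokenizer

uncased_tok = BertTokenizer.from_pretrained('bert-base-uncased')
cased_tok = BertTokenizer.from_pretrained('bert-base-cased')

# The uncased tokenizer lower-cases the text first, so "Cat" and "cat" give the same tokens
print(uncased_tok.tokenize("Cat"), uncased_tok.tokenize("cat"))

# The cased tokenizer preserves capitalization, so the two spellings are tokenized differently
print(cased_tok.tokenize("Cat"), cased_tok.tokenize("cat"))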

2.4 Encoding Process: Converting Text to Tokens to Token IDs

Next, the tokenizer can be used to encode the cleaned data. This process will convert each review into a tensor of token IDs. For example, the review “I liked this movie” will be encoded by the following steps:

  1. Convert the review to lower case (since we are using bert-base-uncased)
  2. Break the review down into individual tokens according to the bert-base-uncased vocabulary: ['i', 'liked', 'this', 'movie']
  3. Add the special tokens expected by BERT: ['[CLS]', 'i', 'liked', 'this', 'movie', '[SEP]']
  4. Convert the tokens to their token IDs, also according to the bert-base-uncased vocabulary (e.g. [CLS] -> 101, i -> 1045, etc)
# Encode a sample input sentence
sample_sentence = 'I liked this movie'
token_ids = tokenizer.encode(sample_sentence, return_tensors='np')[0]
print(f'Token IDs: {token_ids}')

# Convert the token IDs back to tokens to reveal the special tokens added
tokens = tokenizer.convert_ids_to_tokens(token_ids)
print(f'Tokens   : {tokens}')

2.5 Truncation and Padding

Both BERT Base and BERT Large are designed to handle input sequences of at most 512 tokens, and all sequences in a batch must have the same length. But what do you do when your input sequences don’t fit? The answer lies in two key techniques:

Truncation

  • Cuts off excess tokens beyond 512
  • Set truncation=True and max_length=512
  • Used when text exceeds the limit
  • Preserves the most important beginning/middle content

Padding

  • Adds special tokens to reach 512
  • Set padding='max_length'
  • Used when text is shorter than 512 tokens
  • Ensures consistent input size

However, we are going to set max_length to 128 to improve performance (shorter sequences are much faster to process).

review = df['review_cleaned'].iloc[0]

token_ids = tokenizer.encode(
    review,
    max_length = 128,
    padding = 'max_length',
    truncation = True,
    return_tensors = 'pt')

print(token_ids)
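
As a quick sanity check (a small sketch using the tokenizer's pad_token_id attribute; the exact count depends on the length of the review), we can count how many of the 128 positions are padding tokens. The next section shows how to tell BERT to ignore them.

# Count how many of the 128 positions are [PAD] tokens for this review
num_padding = (token_ids == tokenizer.pad_token_id).sum().item()
print(f"Padding tokens: {num_padding} out of {token_ids.shape[1]}")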

2.6 Using the Attention Mask with encode_plus

The example above shows the encoding for the first review in the dataset, which contains 119 padding tokens. If used in its current state for fine-tuning, BERT could attend to the padding tokens, potentially leading to a drop in performance. To address this, we can apply an attention mask that will instruct BERT to ignore certain tokens in the input (in this case the padding tokens).

review = df['review_cleaned'].iloc[0]

batch_encoder = tokenizer.encode_plus(
    review,
    max_length = 128,
    padding = 'max_length',
    truncation = True,
    return_tensors = 'pt')

print('Batch encoder keys:')
print(batch_encoder.keys())

print('\nAttention mask:')
print(batch_encoder['attention_mask'])

2.7 Encode All Reviews:

The last step for the tokenization stage is to encode all the reviews in the dataset and store the token IDs and corresponding attention masks as tensors.

import torch
from tqdm import tqdm  # Progress bar

token_ids = []
attention_masks = []

# Encode each review with a progress bar
for review in tqdm(df['review_cleaned'], desc="Encoding reviews", unit="review"):
    batch_encoder = tokenizer.encode_plus(
        review,
        max_length=128,
        padding='max_length',
        truncation=True,
        return_tensors='pt'
    )

    token_ids.append(batch_encoder['input_ids'])
    attention_masks.append(batch_encoder['attention_mask'])

# Convert lists of token IDs and attention masks into PyTorch tensors
token_ids = torch.cat(token_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)

print("Encoding complete!")
print("Token IDs shape:", token_ids.shape)
print("Attention masks shape:", attention_masks.shape)

2.8 Create the Train and Validation DataLoaders

To partition the data, we can use the train_test_split function from SciKit-Learn’s model_selection package. This function requires the dataset we intend to split, the percentage of items to be allocated to the test set (or validation set in our case), and an optional argument for whether the data should be randomly shuffled.


2.9 Instantiate a BERT Model

The next step is to load in a pre-trained BERT model for us to fine-tune. We can import a model from the Hugging Face model repository similarly to how we did with the tokenizer. Hugging Face has many versions of BERT with classification heads already attached, which makes this process very convenient. Some examples of models with pre-configured classification heads include:

  • BertForMaskedLM
  • BertForNextSentencePrediction
  • BertForSequenceClassification
  • BertForMultipleChoice
  • BertForTokenClassification
  • BertForQuestionAnswering

Of course, it is possible to import a headless BERT model and create your own classification head from scratch in PyTorch or TensorFlow. However, in our case, we can simply import the BertForSequenceClassification model, since it already contains the linear layer we need. This linear layer is initialised with random weights and biases, which will be trained during the fine-tuning process. Since BERT Base uses 768 embedding dimensions, the classification head takes a 768-dimensional input coming from the final encoder block of the model. The number of output neurons is determined by the num_labels argument and corresponds to the number of unique sentiment labels. The IMDb dataset features only two labels, positive and negative, so num_labels is set to 2.

from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2)
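
To see the randomly initialised classification head described above (a quick optional check on the model object just created), we can print it; it is a single linear layer mapping BERT's 768-dimensional pooled output to the 2 sentiment classes:

# The classification head added on top of BERT Base
print(model.classifier)  # Linear(in_features=768, out_features=2, bias=True)

# Number of trainable parameters in the head: 768 * 2 weights + 2 biases = 1538
num_head_params = sum(p.numel() for p in model.classifier.parameters())
print("Classification head parameters:", num_head_params)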

2.10 Instantiate an Optimizer, Loss Function, and Scheduler

An optimizer is required to calculate the changes needed to each weight and bias. It works by determining how changes to the weights and biases in the classification head affect the loss, which is measured by a scoring function called the loss function.

A parameter called the learning rate determines the size of the changes made to the weights and biases in the classification head. In early batches and epochs, large changes may prove advantageous since the randomly initialised parameters will likely need substantial adjustments. However, as training progresses, the weights and biases tend to improve, potentially making large changes counterproductive. Schedulers are designed to gradually decrease the learning rate as training continues, reducing the size of the changes made to each weight and bias at each optimizer step.

from torch.optim import AdamW
import torch.nn as nn
from transformers import get_linear_schedule_with_warmup

EPOCHS = 2

# Optimizer
optimizer = AdamW(model.parameters())

# Loss function
loss_function = nn.CrossEntropyLoss()

# Scheduler
num_training_steps = EPOCHS * len(train_dataloader)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps)
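
To visualise what the scheduler does (a standalone sketch with a dummy parameter and optimizer so the real training objects above are left untouched; the learning rate of 2e-5 and the 10 steps are arbitrary), we can step it a few times and print the learning rate, which decays linearly towards zero:

import torch
from transformers import get_linear_schedule_with_warmup

dummy_param = torch.nn.Parameter(torch.zeros(1))
dummy_optimizer = AdamW([dummy_param], lr=2e-5)
dummy_scheduler = get_linear_schedule_with_warmup(
    dummy_optimizer, num_warmup_steps=0, num_training_steps=10)

for step in range(10):
    dummy_optimizer.step()      # optimizer step first, then scheduler step
    dummy_scheduler.step()
    print(f"Step {step + 1}: learning rate = {dummy_scheduler.get_last_lr()[0]:.2e}")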

2.11 Fine-Tuning Loop

So here we come to the final steps of configuring BERT, with our final milestones: training and validation. We first define an accuracy function to measure performance.

def calculate_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)
from datetime import datetime
import numpy as np
from sklearn.model_selection import train_test_split
import torch.nn.functional as F
from tqdm.auto import tqdm  

#  Device setup
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Reproducibility
seed = 42
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)

# Move the model to the device
model.to(device)

loss_dict = {
    'epoch': [i + 1 for i in range(EPOCHS)],
    'average training loss': [],
    'average validation loss': []
}

t0_train = datetime.now()

for epoch in range(EPOCHS):
    print(f'\n{"-"*20} Epoch {epoch + 1} {"-"*20}')

    # ---- Training ----
    model.train()
    training_loss = 0
    t0_epoch = datetime.now()

    #  tqdm progress bar for training loop
    for batch in tqdm(train_dataloader, desc=f"Training Epoch {epoch+1}", leave=False):
        batch_token_ids = batch[0].to(device)
        batch_attention_mask = batch[1].to(device)
        batch_labels = batch[2].to(device)

        model.zero_grad()
        loss, logits = model(
            batch_token_ids,
            attention_mask=batch_attention_mask,
            labels=batch_labels,
            return_dict=False
        )

        training_loss += loss.item()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()

    average_train_loss = training_loss / len(train_dataloader)
    time_epoch = datetime.now() - t0_epoch
    print(f"Training Loss: {average_train_loss:.4f} | Time: {time_epoch}")

    # ---- Validation ----
    model.eval()
    val_loss = 0
    val_accuracy = 0

    # tqdm progress bar for validation loop
    for batch in tqdm(val_dataloader, desc=f"Validating Epoch {epoch+1}", leave=False):
        batch_token_ids = batch[0].to(device)
        batch_attention_mask = batch[1].to(device)
        batch_labels = batch[2].to(device)

        with torch.no_grad():
            loss, logits = model(
                batch_token_ids,
                attention_mask=batch_attention_mask,
                labels=batch_labels,
                return_dict=False
            )

        logits = logits.detach().cpu().numpy()
        label_ids = batch_labels.cpu().numpy()
        val_loss += loss.item()
        val_accuracy += calculate_accuracy(logits, label_ids)

    avg_val_loss = val_loss / len(val_dataloader)
    avg_val_acc = val_accuracy / len(val_dataloader)
    print(f"Validation Loss: {avg_val_loss:.4f} | Accuracy: {avg_val_acc:.4f}")

    loss_dict['average training loss'].append(average_train_loss)
    loss_dict['average validation loss'].append(avg_val_loss)

print(f"\nTotal training time: {datetime.now() - t0_train}")

2.12 Model Inference

In this section, we will test the fine-tuned BERT on the validation dataset.

def predict(dataloader, model, device):
    model.eval()
    all_logits = []

    with torch.no_grad():
        for batch in dataloader:
            batch_token_ids, batch_attention_mask = tuple(t.to(device) for t in batch)[:2]
            outputs = model(batch_token_ids, attention_mask=batch_attention_mask)
            all_logits.append(outputs.logits)

    all_logits = torch.cat(all_logits, dim=0)
    probs = F.softmax(all_logits, dim=1).cpu().numpy()
    return probs
probs = predict(val_dataloader, model, device)
# Convert probabilities to predicted class indices
preds = np.argmax(probs, axis=1)

# If your labels are 0 = negative, 1 = positive:
sentiment_labels = ['negative', 'positive']

# Convert numeric predictions to text labels
predicted_sentiments = [sentiment_labels[p] for p in preds]

for i, sentiment in enumerate(predicted_sentiments[:10]):  # show first 10
    print(f"Sample {i+1}: {sentiment} (probabilities: {probs[i]})")

Let’s test our fine-tuned BERT on new inputs that are not part of the dataset.

from torch.utils.data import TensorDataset, DataLoader
import torch
import numpy as np

# Example user inputs
user_texts = [
    "I love this movie! It was fantastic.",
    "The product broke after one use, terrible experience.",
    "Not bad, but could be better."
]

# Tokenize all user inputs at once (using the same BERT tokenizer)
encoded = tokenizer(
    user_texts,
    max_length=128,           # use 128 for faster inference
    padding='max_length',
    truncation=True,
    return_tensors='pt'       # return PyTorch tensors directly
)

# Move inputs to the correct device
input_ids = encoded['input_ids'].to(device)
attention_masks = encoded['attention_mask'].to(device)

# Create a DataLoader (optional, for batch inference)
user_data = TensorDataset(input_ids, attention_masks)
user_dataloader = DataLoader(user_data, batch_size=16)

# Inference
model.eval()
all_probs = []

with torch.no_grad():
    for batch in user_dataloader:
        batch_token_ids = batch[0].to(device)
        batch_attention_mask = batch[1].to(device)

        # forward pass
        outputs = model(
            batch_token_ids,
            attention_mask=batch_attention_mask,
            return_dict=True
        )

        # Apply softmax to get probabilities
        probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
        all_probs.append(probs.cpu().numpy())

# Concatenate all probabilities
user_probs = np.concatenate(all_probs, axis=0)
user_preds = np.argmax(user_probs, axis=1)

# Display results
for text, pred, prob in zip(user_texts, user_preds, user_probs):
    print(f"Text: {text}")
    print(f"Predicted class: {pred} | Probabilities: {prob}\n")

Homework

Based on the model developed in the second part of this lab, you are required to provide a Jupyter Notebook introducing RoBERTa. Explain its key concepts and provide an overview table comparing the performance of BERT and RoBERTa on the same dataset used in this lab.

The comparison should be based on the following four evaluation metrics:
Accuracy, Precision, Recall, and F1-score.

You should:

  1. Define the equations of these metrics.
  2. Explain their use cases (i.e., when and why each metric is important).
  3. Provide your analytical insights and interpretation of the obtained results.

Example of the Overview Table

Model                     Accuracy   Precision   Recall   F1
BERT (base-uncased)           91.8        91.8     91.8   91.8
RoBERTa (base-uncased)        93.4        93.5     93.4   93.3

Example of Insights

  • From the table, it can be seen that the RoBERTa model outperforms BERT on all four metrics.
  • This improvement may be attributed to factors such as:
    • Its larger and more diverse vocabulary,
    • Dynamic masking during pre-training,
    • And a longer and more extensive pre-training phase compared to BERT.

References

  • Introduction to Transformer-Based Natural Language Processing, NVIDIA
  • Generative AI Teaching Kit, NVIDIA
  • The Transformer Explainer: https://poloclub.github.io/transformer-explainer/
  • A Complete Guide to BERT with Code: https://towardsdatascience.com/a-complete-guide-to-bert-with-code-9f87602e4a11/