AI 101 - Building Supervised ML Models

Welcome to this hands-on machine learning lab!

In this lab you will

  • Work with real-world email spam data
  • Build your own Transformer-based NLP model
  • Train a classifier using deep learning
  • Evaluate the performance of your model
  • Understand each step of the ML pipeline
  • Run live inference on user-provided text inputs

What This Lab Covers—and What It Does Not

  • This lab focuses on supervised machine learning.
  • The model produces deterministic predictions.
  • The objective is classification accuracy, not text generation.
  • Generative models and LLMs are covered in a separate advanced lab.

Learning Goals

By the end of this lab you will be able to:

  • Explain how text datasets are prepared for machine learning
  • Vectorize and tokenize text inputs
  • Implement a Transformer encoder block
  • Train a binary classification model
  • Understand accuracy, AUC, confusion matrices
  • Save and reload ML models
  • Use a model for real-world predictions
CloudCSE Version: v25.4.d
Last updated: Fri, Jan 30, 2026 11:44:57 UTC
Copyright© 2026 Fortinet, Inc. All rights reserved. Fortinet®, FortiGate®, FortiCare® and FortiGuard®, and certain other marks are registered trademarks of Fortinet, Inc., and other Fortinet names herein may also be registered and/or common law trademarks of Fortinet. All other product or company names may be trademarks of their respective owners. Performance and other metrics contained herein were attained in internal lab tests under ideal conditions, and actual performance and other results may vary. Network variables, different network environments and other conditions may affect performance results. Nothing herein represents any binding commitment by Fortinet, and Fortinet disclaims all warranties, whether express or implied, except to the extent Fortinet enters a binding written contract, signed by Fortinet’s General Counsel, with a purchaser that expressly warrants that the identified product will perform according to certain expressly-identified performance metrics and, in such event, only the specific performance metrics expressly identified in such binding written contract shall be binding on Fortinet. For absolute clarity, any such warranty will be limited to performance in the same ideal conditions as in Fortinet’s internal lab tests. Fortinet disclaims in full any covenants, representations, and guarantees pursuant hereto, whether express or implied. Fortinet reserves the right to change, modify, transfer, or otherwise revise this publication without notice, and the most current version of the publication shall be applicable.


Introduction

In this chapter you will get an introduction to the theory behind machine learning models and how to develop them. You will also prepare the development environment and set up the required libraries.


Introduction to AI, Machine Learning & Deep Learning

What Is Artificial Intelligence (AI)?

Artificial Intelligence (AI) is the broad discipline focused on building systems that can perform tasks requiring human-like intelligence. These include:

  • Understanding and generating language
  • Recognizing objects and patterns
  • Solving problems and making decisions
  • Learning from data and experience

AI includes many subfields:

  • Natural Language Processing (NLP)

Enables machines to understand, analyze, and generate human language. Examples: chatbots, translation, sentiment analysis.

  • Computer Vision

Allows computers to “see” and interpret images and video. Examples: face recognition, autonomous driving, surveillance.

  • Robotics

Combines perception, control, and decision-making to perform tasks autonomously.

  • Expert Systems

Systems that mimic the decision-making abilities of human specialists.

AI’s Role

AI is not designed to replace humans, but to augment our capabilities:

  • Faster decisions
  • Reduced errors
  • Ability to handle massive amounts of data
  • Automation of repetitive tasks

AI is used widely today:

| Domain         | AI Use Case                                       |
| -------------- | ------------------------------------------------- |
| Healthcare     | Diagnosis, personalized treatment, drug discovery |
| Finance        | Fraud detection, credit scoring, trading          |
| Cybersecurity  | Threat detection, anomaly detection               |
| Transportation | Route optimization, autonomous driving            |

What Is Machine Learning? (Foundations & Intuition)

Machine Learning (ML) is a subset of AI focused on building systems that learn patterns from data rather than being explicitly programmed.

ML algorithms analyze data, detect patterns, and make predictions or decisions based on new input.

Types of ML

  • Supervised Learning (Spam Detection, Image Classification)

    • Model learns from labeled data (e.g., emails marked spam or ham).
  • Unsupervised Learning (Clustering, Anomaly Detection)

    • Model finds structure in unlabeled data (e.g., customer segmentation).
  • Reinforcement Learning (Robotics, Game Playing)

    • Model learns by interacting with an environment and receiving rewards or penalties.

Practical Applications of ML

  • Medical diagnosis
  • Fraud detection
  • Recommendation systems
  • Malware detection
  • Predictive maintenance
  • Autonomous vehicles

ML is the engine that enables AI systems to adapt and improve over time.

What Is Deep Learning (DL)?

Deep Learning (DL) is a subfield of ML that uses neural networks with many layers (deep neural networks) to learn complex patterns.

DL excels with unstructured data:

  • Text
  • Images
  • Audio
  • Video

DL models automatically discover meaningful features — no manual engineering required.

Key characteristics:

  • Hierarchical Feature Learning

Early layers learn simple features (edges, words); deeper layers learn complex concepts (faces, sentiment, intent).

  • End-to-End Learning

Model receives raw data and produces outputs directly.

  • Scalability

DL models improve dramatically as data and compute increase.

Common Deep Learning Architectures

  • Convolutional Neural Networks: A neural network architecture that learns hierarchical spatial features by applying shared convolutional filters over localized regions of structured input data.
  • Recurrent Neural Networks: A neural network designed to model sequential data by maintaining and updating an internal state that captures temporal dependencies across ordered inputs.
  • Transformers: A neural network architecture that models relationships within a sequence using self-attention mechanisms to capture global contextual dependencies without relying on recurrence or convolution.
| Model Type   | Best For                      | Why                                         |
| ------------ | ----------------------------- | ------------------------------------------- |
| CNNs         | Images, spatial data          | Detect local patterns & spatial hierarchies |
| RNNs         | Sequences                     | Handle ordered, time-dependent data         |
| LSTMs / GRUs | Long text sequences           | Better long-term memory                     |
| Transformers | NLP, vision, multimodal tasks | Self-attention enables global reasoning     |

Transformers are now the dominant architecture across NLP and many other ML tasks.

Relationship Between AI, ML, and DL

Here is the hierarchy:

Artificial Intelligence (AI)
    └── Machine Learning (ML)
          └── Deep Learning (DL)

Examples of the hierarchy at work

  • In Computer Vision, Deep CNNs dominate image classification.
  • In NLP, transformers like BERT and GPT outperform all prior ML models.
  • In Autonomous Driving, ML + DL work together:
    • ML for prediction models
    • DL for perception (lanes, objects)

Bringing It Back to This Workshop

You will work primarily with:

  • Machine Learning (ML) for classifying emails
  • Deep Learning (DL) using neural networks
  • Transformers as the modern backbone of NLP

In this lab we will use spam detection as our example because it requires:

  • Understanding language
  • Detecting subtle patterns
  • Learning from real-world data
  • Generalizing to new emails
  • Adapting to evolving spam tactics

By the end of this workshop, you’ll understand how to build an AI system that performs a real-world NLP task using modern deep learning methods.

Why Machine Learning Exists

Traditional programming requires humans to create explicit rules.
This fails in real-world tasks with high variability, such as spam detection.

IF email contains "FREE" → spam

This breaks immediately when:

  • Spammers change vocabulary
  • New fraud techniques appear
  • Grammar, language, and tone vary

Machine learning solves this by learning rules from examples, not from humans.

graph TD
    A[Email Text] --> B[ML Model Learns Patterns]
    B --> C[Spam Probability]

ML models learn patterns automatically instead of relying on rigid rules, and they discover statistical regularities humans never coded.

How Models Learn

All ML models follow this loop:

  1. Input → text
  2. Model predicts → probability of spam
  3. Loss function → evaluates how wrong the prediction was
  4. Backpropagation → computes how each weight influenced error
  5. Gradient descent → updates weights to reduce future errors

Repeating this process thousands of times makes the model learn.

Mathematics (Simple Intuition)

A model is a function:

$$y=f(x)$$

Training aims to minimize:

$$ \mathrm{Loss} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 $$

The optimizer adjusts parameters:

$$ \theta_{\text{new}} = \theta_{\text{old}} - \alpha \frac{\partial \mathrm{Loss}}{\partial \theta} $$
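To make these formulas concrete, here is a toy gradient-descent run in plain Python (illustrative only, not part of the lab code; the data points and learning rate are made up). It fits a single weight w to points generated by y = 2x by repeatedly applying the update rule above:

```python
# Toy example: Loss = (1/n) * sum((y_i - w*x_i)^2),
# updated with w_new = w_old - alpha * dLoss/dw.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # generated by y = 2x, so the optimum is w = 2

w = 0.0          # start from an arbitrary parameter value
alpha = 0.05     # learning rate

for step in range(200):
    n = len(xs)
    # dLoss/dw = (1/n) * sum(-2 * x_i * (y_i - w*x_i))
    grad = sum(-2 * x * (y - w * x) for x, y in zip(xs, ys)) / n
    w = w - alpha * grad

print(round(w, 3))  # converges toward 2.0
```

Each iteration nudges w in the direction that reduces the loss, which is exactly what the optimizer does for millions of parameters in a real network.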

Generalization, Underfitting, Overfitting

Underfitting

  • Model too simple
  • Fails to learn patterns

Overfitting

  • Memorizes training examples
  • Performs poorly on unseen data

Generalization

  • Learns underlying structure
  • Performs well on new data

Concept drift

  • Real-world patterns change (spam evolves)
  • Model must be retrained periodically

Types of Machine Learning Models

In this chapter we will go over the following topics:

  • Understand classical ML vs deep learning
  • Explain manual vs learned features
  • Understand the limitations of traditional NLP methods

Classical Machine Learning

Classical ML models include:

  • Logistic Regression
  • Naive Bayes
  • Support Vector Machines
  • Decision Trees / Random Forests

They rely heavily on manual feature engineering (TF-IDF, Bag-of-Words).

graph LR
    A[Text] --> B[Manual Feature Engineering]
    B --> C[Classical Model]
    C --> D[Spam/Ham]

Strengths

  • Simple
  • Fast
  • Works well on structured data

Weaknesses

  • Cannot understand meaning
  • Word order ignored
  • Cannot capture context
  • Struggles with long sequences

Why Classical ML Fails for Real NLP

Example:

  • “You won a free iPhone!”
  • “Claim your reward now!”

A TF-IDF model sees:

  • Completely different words
  • No connection between concepts (“won”, “reward”)

ML fails because meaning ≠ word counts. Deep learning models solved this gap.
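To see this concretely, the sketch below (illustrative, not part of the lab code) builds bag-of-words count vectors for the two example messages and computes their cosine similarity. It comes out as exactly zero because the messages share no tokens, even though their intent is the same:

```python
from collections import Counter
import math

def bow_cosine(a, b):
    """Cosine similarity between bag-of-words count vectors."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb)

sim = bow_cosine("You won a free iPhone!", "Claim your reward now!")
print(sim)  # 0.0 -- no overlapping tokens, so the count vectors are orthogonal
```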

Neural Networks: Core Concepts

In this chapter we will go over the following topics:

  • Understand neurons, layers, weights, and activations
  • Learn how backpropagation works
  • Understand hierarchical feature learning

What Is a Neural Network?

A neural network is a series of learned transformations. An example chain of transformations:

graph TD
    A[Input] --> B[Hidden Layer 1]
    B --> C[Hidden Layer 2]
    C --> D[Output]

Each neuron computes:

$$a = \sigma(Wx + b)$$

Hierarchical Learning

Deep networks build layers of understanding:

  • Layer 1 → edges, patterns, keywords
  • Layer 2 → sentiment, tone, syntax
  • Layer 3 → intent, meaning (fraud, manipulation)

This hierarchical learning makes deep networks extremely powerful.

How Neural Networks Learn

Neural networks adjust millions of weights using backpropagation. Each weight is updated by how much it contributed to the model’s error.

This allows the network to gradually shift from random noise to pattern recognition.
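As a minimal illustration of backpropagation (not part of the lab code; all values are invented), the snippet below pushes one example through a single sigmoid neuron, applies the chain rule to get the gradient of the loss with respect to the weight, and checks it against a numerical finite-difference estimate:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# One sigmoid neuron, squared-error loss on a single example.
x, target = 1.5, 1.0
w, b = 0.4, 0.1

z = w * x + b            # weighted sum
a = sigmoid(z)           # activation (prediction)
loss = (a - target) ** 2

# Chain rule: dLoss/dw = dLoss/da * da/dz * dz/dw
dloss_da = 2 * (a - target)
da_dz = a * (1 - a)      # derivative of the sigmoid
dz_dw = x
grad_w = dloss_da * da_dz * dz_dw

# Sanity check with a finite difference (numerical gradient)
eps = 1e-6
loss_plus = (sigmoid((w + eps) * x + b) - target) ** 2
numeric = (loss_plus - loss) / eps
print(abs(grad_w - numeric) < 1e-4)  # the two estimates agree
```

Frameworks like TensorFlow compute exactly this kind of gradient automatically, for every weight at once.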

Sequence Models: RNN, LSTM, GRU

In this chapter we will go over the following topics:

  • Understand why sequential models were needed
  • Learn how RNNs build memory
  • Understand LSTM and GRU gating mechanisms
  • Recognize bottlenecks that transformers eliminate

RNN — Recurrent Neural Networks

Recurrent Neural Networks Schema

RNNs introduced the idea of time-dependent memory:

graph LR
    A[Word at t] --> B[RNN Cell]
    B --> C[Word at t+1]
    B --> D[Hidden State Memory]

They track previous context using a hidden state.
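The hidden-state idea can be sketched in a few lines (illustrative only; the weights are arbitrary, not learned). Note how the same inputs in a different order produce a different final state:

```python
import math

# A single-unit RNN: the hidden state h carries context from step to step.
# h_t = tanh(w_x * x_t + w_h * h_{t-1} + b)
w_x, w_h, b = 0.8, 0.5, 0.0

def rnn_run(inputs):
    h = 0.0  # initial hidden state (no context yet)
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h + b)
    return h

# The final state depends on the whole sequence, including its order:
print(rnn_run([1.0, 0.0, 0.0]))
print(rnn_run([0.0, 0.0, 1.0]))  # same values, different order -> different state
```

Because each step depends on the previous one, these updates cannot be parallelized, which is one of the weaknesses listed below.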

Weaknesses

  • Vanishing gradients
  • Slow (sequential processing)
  • Struggles with long-range patterns

LSTM — Long Short-Term Memory

LSTMs added gates (input, forget, output) to regulate information flow.

This allowed:

  • Better long-term memory
  • Less catastrophic forgetting

For years, LSTMs were the backbone of NLP.

GRU — Gated Recurrent Unit

A simpler LSTM:

  • Fewer gates
  • Faster to train
  • Slightly less expressive

Why Sequence Models Broke at Scale

  • Cannot parallelize
  • Slow to train on large datasets
  • Memory bottlenecks
  • Poor long-term reasoning

These limitations motivated the invention of transformers.

CNNs & Feedforward Networks

In this chapter we will go over the following topics:

  • Understand why CNNs work for textual patterns
  • Explain feature maps
  • Know their limits

Feedforward Networks (MLPs)

MLPs treat input as fixed-length vectors. They ignore order, structure, and variable sequence length.

CNNs — Convolutional Neural Networks

Convolutional Neural Networks Schema

CNNs slide filters across sequences to detect local patterns:

graph TD
    A[Embedding Sequence] --> B[Convolution Filters]
    B --> C[Feature Maps]
    C --> D[Spam/Ham Output]

CNNs detect:

  • “free offer”
  • “click now”
  • “urgent response”

Weaknesses

  • They do not understand long-range dependencies.
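A rough sketch of the intuition (not the lab's model: a real CNN learns its filters, whereas this hard-codes the patterns it responds to) is to slide a window of size 2 over the tokens and mark where it matches a spam-like bigram, producing a tiny "feature map":

```python
# Hard-coded stand-in for learned convolution filters that fire on local patterns.
SPAM_BIGRAMS = {("free", "offer"), ("click", "now"), ("urgent", "response")}

def bigram_feature_map(tokens):
    """1 where the sliding window of size 2 matches a spam bigram, else 0."""
    return [1 if (tokens[i], tokens[i + 1]) in SPAM_BIGRAMS else 0
            for i in range(len(tokens) - 1)]

tokens = "claim this free offer and click now".split()
fm = bigram_feature_map(tokens)
print(fm)  # -> [0, 0, 1, 0, 0, 1]
```

The filter only ever "sees" two adjacent tokens at a time, which is exactly why CNNs struggle with long-range dependencies.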

Transformers & Self-Attention

In this chapter we will go over the following topics:

  • Understand self-attention
  • Compare transformers with RNN/LSTM
  • Learn why transformers dominate NLP today

The Self-Attention Mechanism

Transformers compare each word with every other word:

graph TD
    A[Tokens] --> B[Self Attention]
    B --> C[Contextualized Representation]
    C --> D[Output]

This allows the model to discover relationships instantly.

Example spam email:

Claim your free prize now before the offer expires.

Attention detects:

  • free ↔ prize
  • offer ↔ expires
  • urgency patterns
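Here is a toy version of this comparison (illustrative only; the 2-dimensional token vectors are invented): score each token pair with a dot product and normalize the scores with softmax to get attention weights:

```python
import math

# Toy self-attention: each token's vector is compared (dot product) with
# every other token's vector, and softmax turns the scores into weights.
tokens = ["claim", "your", "free", "prize"]
vectors = {                       # made-up 2-d vectors for illustration
    "claim": [0.9, 0.1],
    "your":  [0.1, 0.2],
    "free":  [0.8, 0.3],
    "prize": [0.7, 0.2],
}

def attention_weights(query):
    scores = [sum(q * k for q, k in zip(vectors[query], vectors[t])) for t in tokens]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return {t: e / total for t, e in zip(tokens, exps)}

w = attention_weights("free")
print(w)  # "free" attends more strongly to "claim" and "prize" than to "your"
```

The weights always sum to 1, and every token attends to every other token in a single parallel step, unlike an RNN.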

Multi-Head Attention

Each head focuses on a different pattern:

  • Head 1 → urgency
  • Head 2 → reward structures
  • Head 3 → threat/pressure

This parallel interpretation is why transformers are so strong.

Transformers Architecture

Transformer Schema

Why Transformers Replaced LSTMs

| Feature           | LSTM       | Transformer      |
| ----------------- | ---------- | ---------------- |
| Reads tokens      | sequential | parallel         |
| Long-range memory | limited    | excellent        |
| Training speed    | slow       | fast             |
| Scalability       | low        | extremely high   |
| NLP performance   | outdated   | state-of-the-art |

Embeddings & Representation Learning

In this chapter we will go over the following topics:

  • Understand why text → numbers
  • Learn how embeddings encode meaning
  • Explain vector similarity

Why Text Must Be Converted Into Numbers

Neural networks require numeric vectors. Embeddings map words to dense vectors:

"free" → [0.2, -0.7, 0.1, ...]
"winner" → [0.25, -0.82, 0.06, ...]

Semantic Vector Spaces

Embeddings encode meaning via relative positions:

king - man + woman ≈ queen
free + prize + claim ≈ spam-like semantics

This allows models to:

  • Generalize
  • Understand synonyms
  • Capture context
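A small sketch of vector similarity (the values for "free" and "winner" come from the example above, truncated to three dimensions; "meeting" and the cosine helper are invented for illustration):

```python
import math

# Tiny made-up embeddings (real ones have tens or hundreds of dimensions):
emb = {
    "free":    [0.20, -0.70, 0.10],
    "winner":  [0.25, -0.82, 0.06],
    "meeting": [-0.60, 0.40, 0.50],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Spam-like words end up close together; unrelated words do not.
print(cosine(emb["free"], emb["winner"]))   # close to 1.0
print(cosine(emb["free"], emb["meeting"]))  # negative: pointing away
```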

Embeddings Learned in This Workshop

Your model learns 10-dimensional embeddings from the dataset. They become specialized for spam semantics, such as:

  • urgency
  • reward
  • threats
  • scam structure

Prepare the Environment

Setup Google Colab

For this lab we will use Google Colab. Google Colab has many advantages over using our own hardware for this lab:

  • Free access to GPUs (at least sometimes)
  • Zero installation overhead
  • Ready-to-use scientific Python stack
  • Cloud execution
  • Notebook interface for interactive learning

This gives you a stable environment with powerful hardware to train a machine learning model.

Info

If you have strict blocking of site cookies, you may have an issue running the JavaScript necessary to run the Colab pages and render graphs. You may have to temporarily change your cookie settings in your browser.

Info

You can still run this lab on your local machine if you prefer. However, it is recommended to use a Mac with an M-series chip or a Linux machine. (This has not been tested on Windows!) In addition, it is strongly recommended to set up a Python virtual environment to avoid dependency conflicts with other Python projects on your machine.

To use Google Colab, a Google Account is required (it doesn’t matter whether this is your corporate or your private one).

  1. Go to https://colab.research.google.com/

  2. Log in to Google Colab by selecting the Sign In button at the top right corner. login_screen

  3. Select + New notebook in the wizard. new notebook

  4. Now you should be able to access the Jupyter notebook, which allows you to follow along with the lab. workspace

Info

In case your session expires or times out, you will need to re-run the code from the beginning to reinstall the dependencies and initialize everything.

Info

In every chapter/step you will find code snippets starting with #@title .... You can copy and paste these code snippets into a code cell in your Colab notebook to run them. If there is a code snippet without this, it is used to explain some functionality and does not need to be copied.

Setup dependencies

The ML Software Stack we will use in this lab consists of the following components:

  • TensorFlow
    Provides automatic differentiation, GPU computation, and model building tools.
  • Keras
    A high-level API that makes deep learning intuitive.
  • HuggingFace Datasets
    Enables easy loading and preprocessing of datasets.
  • NumPy & Pandas
    Used for numerical operations and tabular views.

Use the below code snippet to install and import the required dependencies in your Colab environment.

#@title Setup dependencies
!pip install -q datasets

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

from datasets import load_dataset
from sklearn.metrics import classification_report, confusion_matrix

Validate if a GPU is available

GPUs accelerate deep learning dramatically:

graph LR
    CPU["CPU (8–16 cores)"] -->|Slow| Train[Training Time]
    GPU["GPU (1000+ cores)"] -->|Fast| Train

Neural networks perform millions of matrix multiplications, so GPUs provide a massive speed boost.

Use the below code snippet to check if a GPU is available in your Colab environment.

#@title Check GPU
device_name = tf.test.gpu_device_name()
if device_name:
    print("✅ GPU available:", device_name)
else:
    print("⚠️ No GPU detected. Training will still work, but slower.")
Info

The free version of Google Colab does not guarantee GPU availability at all times. If no GPU is detected, you can try to enable it by going to Runtime -> Change runtime type and selecting GPU as the hardware accelerator. However, GPU availability may still be limited based on demand and usage policies. You can ignore this warning and continue with the lab, but training times will be significantly longer without a GPU.

Appendix - Deeper Dive into ML Math

Neural Node (Neuron) Equation

$$a = \sigma(Wx + b)$$

This equation describes how a single neuron produces an output.


x — Input(s)

  • The data coming into the neuron.
  • Each x is typically one feature from the dataset (e.g., pixel intensity, temperature, word embedding value).
  • Can be a single value or a vector of values.

Intuition: What the neuron “sees.”


W — Weights

  • Numbers that scale each input.
  • Learned during training.
  • Determine how important each input is.

Intuition: How much the neuron cares about each input.


b — Bias

  • A constant added to the weighted sum.
  • Allows the neuron to shift its activation left or right.
  • Prevents the neuron from being forced to pass through zero.

Intuition: The neuron’s built-in offset or threshold.


Wx + b — Weighted Sum

  • The linear combination of inputs and weights plus bias.
  • This is the raw, unactivated signal.

Intuition: The neuron’s total input signal before making a decision.


σ(·) — Activation Function

  • A non-linear function (e.g., sigmoid, ReLU, tanh).
  • Determines how strongly the neuron “fires.”
  • Enables neural networks to model complex, non-linear patterns.

Intuition: The decision rule that turns signal into output.


a — Activation (Output)

  • The final output of the neuron.
  • Passed to the next layer or used as a prediction.

Intuition: What the neuron outputs to the rest of the network.


One-Sentence Summary

A neuron multiplies inputs by learned weights, adds a bias, and passes the result through an activation function to produce an output.
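A quick worked example of the equation (illustrative; the inputs, weights, and bias are made up):

```python
import math

def neuron(x, w, b):
    """a = sigma(Wx + b) with a sigmoid activation."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b   # weighted sum Wx + b
    return 1.0 / (1.0 + math.exp(-z))              # sigmoid activation

x = [1.0, 0.5]     # inputs: what the neuron "sees"
w = [0.6, -0.4]    # weights: how much it cares about each input
b = 0.1            # bias: built-in offset

a = neuron(x, w, b)   # z = 0.6 - 0.2 + 0.1 = 0.5
print(round(a, 3))    # -> 0.622, i.e. sigmoid(0.5)
```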

Training Data Setup

In this chapter you will prepare and vectorize the data on which the machine learning model will be trained. As not every dataset looks the same, you first have to understand the structure of the data, then format it to fit the requirements of the framework, and finally vectorize it.


Prepare training, validation & test data

Load the Dataset

In this lab you will use the Enron Spam Dataset to train the model. This dataset contains about 32.7k rows of spam and non-spam emails and has already been split into a training and a test dataset. The training part (usually 70% of the dataset) will be used to train the model; the test part (the remaining 30%) will be used to evaluate what the model has learned.

To load the dataset we will use the load_dataset function. In addition, this code converts each split to a pandas DataFrame for inspection, giving you an overview of how much data is loaded and how it is formatted.

#@title Load Enron-Spam dataset

dataset = load_dataset("SetFit/enron_spam")

print(dataset)
print("Train size:", len(dataset["train"]))
print("Test size:", len(dataset["test"]))

# Convert to pandas for easier inspection / plotting
train_df = dataset["train"].to_pandas()
test_df = dataset["test"].to_pandas()

train_df.head()

To further evaluate the data, let’s look at the distribution of the “good” (ham) and “bad” (spam) emails. Use the variables from the previous step to create a bar chart of the label distribution and print it to the console. In addition, let’s look at some examples of spam and ham.

#@title Evaluate the label distribution (training data)

print(train_df["label"].value_counts())
print(train_df["label"].value_counts(normalize=True))

train_df["label_text"].value_counts().plot(kind="bar")
plt.title("Label distribution (train)")
plt.xticks(rotation=0)
plt.show()

print("Example SPAM email:")
print(train_df[train_df["label_text"] == "spam"]["text"].iloc[0][:500])

print("\nExample HAM email:")
print(train_df[train_df["label_text"] == "ham"]["text"].iloc[0][:500])

Splitting the Training Dataset

Most HuggingFace datasets provide only:

  • a train set
  • a test set

However, machine-learning workflows require a validation set as well. This code manually creates an 80/20 split of the training data. To achieve this, the parameter test_size=0.2 is used, which sets the proportion of the validation set to 20%; the remaining 80% stays training data. The validation set is required for model tuning and for detecting overfitting.

The seed=42 parameter sets a fixed random seed so the split is reproducible. Without a seed, each run would shuffle the data differently.

#@title Create train / validation / test splits

split = dataset["train"].train_test_split(test_size=0.2, seed=42)
train_ds_hf = split["train"]
val_ds_hf = split["test"]
test_ds_hf = dataset["test"]

print("Train:", len(train_ds_hf))
print("Val:", len(val_ds_hf))
print("Test:", len(test_ds_hf))

# FORCE into plain Python lists of strings
train_texts = list(train_ds_hf["text"])
val_texts   = list(val_ds_hf["text"])
test_texts  = list(test_ds_hf["text"])

train_labels = np.array(train_ds_hf["label"], dtype="int32")
val_labels   = np.array(val_ds_hf["label"], dtype="int32")
test_labels  = np.array(test_ds_hf["label"], dtype="int32")
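As a side note on seed=42, the sketch below (plain Python, not the HuggingFace API) shows what a fixed seed buys you: shuffling with the same seed always produces the same split.

```python
import random

data = list(range(10))

def split(seed):
    rng = random.Random(seed)       # fixed seed -> reproducible shuffle
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)  # 80/20 split
    return shuffled[:cut], shuffled[cut:]

train_a, val_a = split(seed=42)
train_b, val_b = split(seed=42)
print(train_a == train_b and val_a == val_b)  # True: same seed, same split
```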

Vectorize the Training Data

Text Vectorization

Modern neural networks cannot work directly with raw text. They require numerical input—typically sequences of integers. This section explains how text vectorization works, why it is necessary, and how the provided code implements it.

Text vectorization is the process of converting raw text (words, sentences, documents) into numerical representations that machine-learning models can understand.

Neural networks operate using mathematical operations on vectors and matrices. Because text is not inherently numeric, we need a way to map text → numbers in a consistent and meaningful way.

Why Do We Need Vectorization?

  1. Neural networks require fixed-size numeric tensors
     Text varies in length and contains symbolic words, so we must convert it into:
     • numbers
     • fixed-length sequences
     • uniformly structured inputs
  2. Words must be represented consistently
     Each word must map to the same numeric ID every time.

Example:

| Word       | Token ID |
| ---------- | -------- |
| "movie"    | 85       |
| "great"    | 12       |
| "terrible" | 313      |

This allows models to learn patterns like: “bad words → negative sentiment”

  3. Vocabulary control
     Datasets may contain tens or hundreds of thousands of distinct words. Vectorization allows us to choose:
     • how many words to keep
     • how to handle unknown words (OOV tokens)
     • how long each sequence should be
  4. Efficient training
     A clean numeric representation reduces memory usage and simplifies model operations.

How Does Vectorization Work Internally?

When using TextVectorization with output_mode="int", the process looks like this:

  1. Tokenization
     Text is split into tokens (usually words or subwords).

"This movie was amazing"
→ ["this", "movie", "was", "amazing"]

  2. Vocabulary Building: Turning Words into IDs

Machine learning models cannot work directly with text. Before training begins, words must be converted into numbers.

During the .adapt() step, the text vectorization layer builds a vocabulary, which is simply a dictionary that maps words to integer IDs. The layer:

  • scans all training texts
  • counts how often each word appears
  • keeps the most frequent words (up to max_tokens)
  • assigns each word a unique integer ID

At this stage, words are treated as symbols, not meanings. The model has not learned anything yet; it is only creating a lookup table.

Example vocabulary:

{"<PAD>": 0, "<OOV>": 1, "the": 2, "movie": 3, "was": 4, "good": 5, …}

"<PAD>" is used to pad sequences so all inputs have the same length.
"<OOV>" (out-of-vocabulary) represents words not seen during training.

  3. Text → Sequence conversion

["this", "movie", "was", "amazing"]
→ [14, 3, 4, 112]

  4. Padding or truncation
     All sequences are resized to max_len.
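The four steps above can be sketched in plain Python (illustrative only; the mini-vocabulary and IDs are invented, except that 0 and 1 play the same padding/OOV roles as in the Keras layer, and max_len is a toy value):

```python
# vocab maps word -> ID; 0 is reserved for <PAD>, 1 for <OOV>.
vocab = {"<PAD>": 0, "<OOV>": 1, "this": 14, "movie": 3, "was": 4, "amazing": 112}
max_len = 8  # toy value for the sketch

def vectorize(text):
    tokens = text.lower().split()                    # 1. tokenization
    ids = [vocab.get(t, 1) for t in tokens]          # 2./3. vocab lookup, OOV -> 1
    ids = ids[:max_len]                              # 4a. truncate
    return ids + [0] * (max_len - len(ids))          # 4b. pad with 0

out1 = vectorize("This movie was amazing")
print(out1)  # -> [14, 3, 4, 112, 0, 0, 0, 0]
out2 = vectorize("This movie was preposterous")
print(out2)  # -> [14, 3, 4, 1, 0, 0, 0, 0]  ("preposterous" is out-of-vocabulary)
```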

Defining Hyperparameters

max_tokens = 2000  # vocabulary size
max_len    = 40    # sequence length
  • max_tokens

    • Limits the vocabulary to the 2,000 most frequent words.
    • Rarer words are replaced with the OOV token ID.
  • max_len

    • Every sequence will be exactly 40 tokens, padded or truncated accordingly. This ensures that the neural network receives input tensors of consistent shape:
(batch_size, max_len)

Creating the Vectorization Layer

vectorize_layer = layers.TextVectorization(
    max_tokens=max_tokens,
    output_mode="int",
    output_sequence_length=max_len,
)

This layer will eventually turn each text sample into a fixed-length integer sequence, ready for embedding or model input.

Adapting the Layer

text_ds_for_adapt = tf.data.Dataset.from_tensor_slices(train_texts).batch(512)
vectorize_layer.adapt(text_ds_for_adapt)

You may ask yourself: why is this adaptation step needed? Because the layer needs to:

  • build the vocabulary
  • compute token frequencies
  • map words → token IDs

This must be done only on training data to avoid information leakage.

As a summary, this is the complete code to vectorize the Enron Spam dataset for our machine learning model.

#@title Text vectorization

max_tokens = 2000  # vocabulary size
max_len    = 40    # sequence length

vectorize_layer = layers.TextVectorization(
    max_tokens=max_tokens,
    output_mode="int",
    output_sequence_length=max_len,
)

# Adapt vectorizer on training texts
text_ds_for_adapt = tf.data.Dataset.from_tensor_slices(train_texts).batch(512)
vectorize_layer.adapt(text_ds_for_adapt)

# Quick test
sample = train_texts[0]
print("Raw text:", sample[:200])
print("Token IDs:", vectorize_layer(tf.constant([sample]))[0][:30])

Interpreting the Vectorization Output

After calling adapt(), the TextVectorization layer has built a vocabulary and can now convert raw text into numeric sequences.

Here is what the output means:

Raw text

any software just for 15 $ - 99 $ understanding oem software
lead me not into temptation ; i can find the way myself .
# 3533 . the law disregards trifles .
[   46   241   127     8   166   311  1447  1733   241  1156    47    31
   123 18000    15    43   299     2   248  1503     1     2  1094     1
     1     0     0     0     0     0]

This array is the numerical representation of the text, where:

  • Each number corresponds to a word’s ID in the vocabulary and matches its exact position
    • i.e. any = 46, software = 241, just = 127, etc.
  • The same word always maps to the same number
    • For example, "software" appears twice and maps to the same ID (241) both times
  • The value 1 represents <OOV> (out-of-vocabulary)
    • These are words not seen often enough (or at all) during training
  • The value 0 represents <PAD>
    • Padding is added at the end so all sequences have the same length

Optimize the Dataset

Optimizing the Dataset for TensorFlow

TensorFlow’s tf.data API provides a fast, scalable, and memory-efficient way to load data during training. Instead of manually feeding NumPy arrays to the model, we wrap them into Dataset pipelines, which handle:

  • batching
  • shuffling
  • prefetching
  • efficient loading on CPU/GPU

The code below converts your processed text + labels into TensorFlow-ready datasets for training, validation, and testing.

#@title Build TensorFlow Dataset

batch_size = 256  # can be tuned

def make_dataset(texts, labels, training=True):
    ds = tf.data.Dataset.from_tensor_slices((texts, labels))
    if training:
        ds = ds.shuffle(buffer_size=len(texts), reshuffle_each_iteration=True)
    ds = ds.batch(batch_size).prefetch(tf.data.AUTOTUNE)
    return ds

train_ds = make_dataset(train_texts, train_labels, training=True)
val_ds   = make_dataset(val_texts, val_labels, training=False)
test_ds  = make_dataset(test_texts, test_labels, training=False)

train_ds

You can adjust the batch_size variable if you like. A batch is a group of samples processed together in one forward/backward pass. If you change it, monitor the model’s performance, training speed, and memory usage.
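To see what batching means, here is a plain-Python sketch (illustrative only, independent of tf.data) that groups samples into batches of 256:

```python
def batches(samples, batch_size):
    """Yield consecutive groups of at most batch_size samples."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

samples = list(range(1000))
sizes = [len(b) for b in batches(samples, 256)]
print(sizes)  # -> [256, 256, 256, 232]: the last batch is smaller
```

tf.data does this for you (plus shuffling and prefetching), but the grouping idea is the same.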

Now that the data has been formatted and prepared for the machine learning model, proceed to the next chapter where you will build the actual model.

Model Building & Training

In this chapter, you will assemble the complete Transformer-based spam classifier model and train it on the dataset. You will see how to connect all the pieces together: data pipeline, embeddings, Transformer block, and final classifier.


Creating the Transformer

Building the Transformer Block

In this part of the workshop, you will construct a complete Transformer block from scratch using Keras. Transformers are the foundation of modern AI systems such as BERT, GPT, and Vision Transformers. By implementing the block yourself, you’ll understand how these models actually work under the hood.

We will build the block piece by piece and discuss what each component does and why we need it.

Start by Creating a Custom Layer

We begin by defining a new layer class called TransformerBlock. Using a custom layer gives us a reusable building block that we can stack when creating our model later. A custom layer is your own building block inside a neural network. Keras has built-in layers (Dense, Conv2D, etc.), but sometimes you need functionality that doesn’t exist yet. Transformers require a specific structure, so we create a reusable building block.

Why do we do this?

  • To encapsulate all Transformer logic in one place
  • To reuse it multiple times
  • To allow saving/loading the model later
  • To make the code cleaner and modular
class TransformerBlock(layers.Layer):

Inside this class, we will define all parts that make up a standard Transformer block.

Add Hyperparameters in the Constructor

Hyperparameters are settings you choose before training a model. They control the architecture and behavior of the layer but are not learned during training. Examples:

  • embedding dimension
  • number of attention heads
  • dropout rate

Why are hyperparameters important? They determine:

  • model capacity
  • complexity
  • memory usage
  • training stability

Choosing good hyperparameters can drastically improve performance.

Next, we prepare the configuration of the block in the init constructor:

def __init__(self, embed_dim=10, num_heads=2, ff_dim=8, rate=0.7, **kwargs):
    super().__init__(**kwargs)

Here you introduce the core parameters:

  • embed_dim – dimensionality of each input vector
  • num_heads – number of attention heads
  • ff_dim – hidden layer size in the feed-forward network
  • rate – dropout rate to reduce overfitting

Passing **kwargs to `super().__init__()` ensures Keras can handle things like serialization.

Add the Multi-Head Self-Attention Layer

Transformers rely heavily on self-attention, where each token learns which other tokens are important. Self-attention lets each input position (token) look at all other positions and decide which ones matter. Example: in the sentence "The cat sat on the mat", the word "cat" may pay more attention to "sat" than to "mat".

What is multi-head attention?

Instead of having one attention mechanism, we have multiple heads. Each head can learn different relationships.

Why is this layer here?

Self-attention is the core mechanism that makes Transformers powerful. It allows the model to learn context and relationships without convolution or recurrence.

Add the attention layer:

self.att = layers.MultiHeadAttention(
    num_heads=num_heads,
    key_dim=embed_dim // num_heads,
)

Multi-head attention lets the model look at the input from several “perspectives” at once. This is one of the reasons why Transformers are so effective.
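
To demystify the mechanics, here is a minimal NumPy sketch of single-head self-attention. It is a simplification: a real MultiHeadAttention layer learns separate query/key/value projection matrices per head, which are omitted (treated as identity) here.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    """Single-head self-attention with identity Q/K/V projections,
    just to show the scores-then-weighted-sum mechanics."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)        # how strongly each token attends to each other token
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ X                   # each output token is a weighted mix of all tokens

X = np.random.randn(5, 10)               # 5 tokens, 10-dim embeddings (like embed_dim=10)
out = self_attention(X)
print(out.shape)  # (5, 10): same shape, but each token now carries context
```

The output has the same shape as the input, which is what lets residual connections (inputs + attn_output) work later in the block.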

Build the Feed-Forward Network (FFN)

After attention, every Transformer block applies a small two-layer neural network to every token independently. Self-attention mixes information between tokens, but the model still needs a non-linear transformation to increase expressiveness.

The FFN helps:

  • refine the representation
  • build complex features
  • stabilize and enrich the internal structure of the model

This is part of every Transformer block.

self.ffn = keras.Sequential([
    layers.Dense(ff_dim, activation="relu"),
    layers.Dense(embed_dim),
])

Why do we need this?

  • attention gathers information
  • the FFN transforms and refines that information
  • this combination gives the model expressive power
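
As a rough illustration, the FFN can be written in a few lines of NumPy. This sketch mirrors the Dense(ff_dim, relu) → Dense(embed_dim) structure above with random, untrained weights; the sizes match the lab's defaults (embed_dim=10, ff_dim=8).

```python
import numpy as np

def ffn(X, W1, b1, W2, b2):
    """Position-wise feed-forward network: the same weights are applied
    to every token (row of X) independently."""
    hidden = np.maximum(0.0, X @ W1 + b1)  # Dense(ff_dim, activation="relu")
    return hidden @ W2 + b2                # Dense(embed_dim): back to the input size

rng = np.random.default_rng(0)
embed_dim, ff_dim = 10, 8
X  = rng.standard_normal((5, embed_dim))                 # 5 tokens
W1 = rng.standard_normal((embed_dim, ff_dim)); b1 = np.zeros(ff_dim)
W2 = rng.standard_normal((ff_dim, embed_dim)); b2 = np.zeros(embed_dim)

print(ffn(X, W1, b1, W2, b2).shape)  # (5, 10): shape preserved, content transformed
```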

Add Layer Normalization, Dropout & Residual Connections

What is Layer Normalization?

It normalizes the input features to stabilize training. Transformers rely heavily on this to avoid exploding/vanishing gradients.

What is Dropout?

A regularization technique that randomly drops units during training to prevent overfitting.

What are Residual Connections?

When you see inputs + something in the code, that is a skip/residual connection.

Why are these components here?

Transformers often fail to train without them. They:

  • make gradients flow better
  • improve convergence
  • prevent overfitting
  • stabilize deep architectures

This pattern (attention → normalization → FFN → normalization) is standard in modern Transformer models.

To make the block stable and trainable, we add:

  • Layer Normalization
  • Dropout
  • Residual (skip) connections
self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
self.dropout1 = layers.Dropout(rate)
self.dropout2 = layers.Dropout(rate)

Residual connections are essential—they help gradients flow through the network and prevent training from failing.
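
For intuition, here is what layer normalization computes, sketched in NumPy (ignoring the learned scale and shift parameters that the Keras layer adds on top): each token vector is rescaled to mean 0 and variance 1 across its features.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each token vector across its feature axis,
    as layers.LayerNormalization(epsilon=1e-6) does before its learned scale/shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.array([[1.0, 2.0, 3.0, 4.0]])   # one token with 4 features
y = layer_norm(x)
print(y.mean(), y.std())               # ≈ 0.0 and ≈ 1.0
```

In the Transformer block this is applied to the sum of the residual branch and the sublayer output, e.g. layer_norm(inputs + attn_output).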

Define the Forward Pass

Now we describe how data moves through the block. This is where everything comes together.

Self-Attention + Residual + Normalization

  • The input attends to itself → self-attention
  • Dropout is applied
  • Add a residual connection: inputs + attn_output
  • Normalize the result
attn_output = self.att(inputs, inputs)
attn_output = self.dropout1(attn_output, training=training)
out1 = self.layernorm1(inputs + attn_output)

Feed-Forward + Residual + Normalization

  • Apply the FFN
  • Dropout again
  • Add another residual connection
  • Apply the final normalization
ffn_output = self.ffn(out1)
ffn_output = self.dropout2(ffn_output, training=training)
return self.layernorm2(out1 + ffn_output)

You’ll notice the same pattern:

  1. transformation
  2. residual connection
  3. normalization

This two-part structure (attention sublayer, then feed-forward sublayer) is the core of every Transformer block.

Make the Layer Serializable

Finally, we add a get_config() method so Keras can save and reload the layer. Serialization means saving your model to disk so you can reload it later.

def get_config(self):
    config = super().get_config()
    config.update({
        "embed_dim": self.embed_dim,
        "num_heads": self.num_heads,
        "ff_dim": self.ff_dim,
        "rate": self.rate,
    })
    return config

This is required if you want to export the model or deploy it later.

Complete Code

#@title Define Transformer block

class TransformerBlock(layers.Layer):
    def __init__(self, embed_dim=10, num_heads=2, ff_dim=8, rate=0.7, **kwargs):
        # IMPORTANT: pass **kwargs to super() to accept 'trainable', 'dtype', etc.
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.ff_dim = ff_dim
        self.rate = rate

        # key_dim is embed_dim // num_heads -> here 5
        self.att = layers.MultiHeadAttention(
            num_heads=num_heads,
            key_dim=embed_dim // num_heads,
        )
        self.ffn = keras.Sequential(
            [
                layers.Dense(ff_dim, activation="relu"),
                layers.Dense(embed_dim),
            ]
        )
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(rate)
        self.dropout2 = layers.Dropout(rate)

    def call(self, inputs, training=False):
        attn_output = self.att(inputs, inputs)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)

        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)

    def get_config(self):
        # Let Keras know how to serialize our custom args
        config = super().get_config()
        config.update(
            {
                "embed_dim": self.embed_dim,
                "num_heads": self.num_heads,
                "ff_dim": self.ff_dim,
                "rate": self.rate,
            }
        )
        return config

Build the Classifier

Building a 10-Dimensional Transformer Spam Classifier

In this chapter, you will combine everything you’ve learned to build a full Transformer-based text classifier. This model will take raw text messages as input and predict whether the message is spam or not spam (ham).

You will see how to connect the data pipeline, embeddings, Transformer block, and final classifier into one coherent model.

Define the Model Hyperparameters

These are hyperparameters — settings that control the architecture of your model.

  • embed_dim → size of each token vector
  • num_heads → number of attention heads
  • ff_dim → internal size of the feed-forward network inside the Transformer
  • dropout_rate → percent of units randomly dropped during training

Why do we define them here?

To make the model easy to tune. Students can experiment with different values to see how they affect:

  • training stability
  • model accuracy
  • model capacity

These hyperparameters do not change during training; they shape the architecture.

embed_dim = 10   # requirement: 10-dimensional representation
num_heads = 2    # can be tuned (must divide embed_dim)
ff_dim    = 8   # feed-forward size (students can tune)
dropout_rate = 0.7

Create the Input Layer for Raw Text

text_input = layers.Input(shape=(), dtype=tf.string, name="text")

This is the entry point of your model. It tells Keras:

  • the input type → string
  • the input shape → a single text field

Why is this important?

Transformers normally operate on numeric sequences (token IDs), so we need a layer that accepts raw text before any processing.

Convert Text Into Token IDs

x = vectorize_layer(text_input)

The vectorize_layer maps text into integer token IDs.

"Hello world"  [17, 45]

Why do we do this?

Neural networks cannot operate on text directly. They only work with numbers. Vectorization is the first preprocessing step of any NLP model.
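
As a toy illustration of vectorization, here is a hypothetical miniature vocabulary and lookup (words and IDs are made up). Keras' TextVectorization reserves index 0 for padding and index 1 for out-of-vocabulary words in the same way.

```python
# Hypothetical toy version of what the vectorize_layer does.
vocab = ["", "[UNK]", "free", "money", "meeting", "tomorrow"]   # 0 = padding, 1 = unknown
word_to_id = {w: i for i, w in enumerate(vocab)}

def vectorize(text):
    """Map each word to its integer ID; unknown words become 1."""
    return [word_to_id.get(w, 1) for w in text.lower().split()]

print(vectorize("free money now"))    # → [2, 3, 1]  ("now" is out-of-vocabulary)
print(vectorize("meeting tomorrow"))  # → [4, 5]
```

The real layer additionally pads/truncates every sequence to a fixed length (max_len) so batches have uniform shape.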

Embed the Token IDs Into 10-Dimensional Vectors

An embedding layer converts token IDs into dense vectors of length embed_dim.

If embed_dim = 10, then each token becomes a 10-dimensional representation.

x = layers.Embedding(
    input_dim=max_tokens,
    output_dim=embed_dim,
    name="token_embedding"
)(x)

Why do we need embeddings?

Because:

  • token IDs are arbitrary numbers
  • embeddings allow the model to learn semantic meaning
  • similar words get similar vectors

This is the first learnable layer of the model.
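
Conceptually, an embedding layer is just a trainable lookup table: row i of the table is the vector for token ID i. A NumPy sketch with a random, untrained table (the sizes here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
max_tokens, embed_dim = 6, 10                       # tiny vocabulary, 10-dim vectors
embedding_table = rng.standard_normal((max_tokens, embed_dim))

token_ids = np.array([2, 3, 1])                     # e.g. IDs produced by vectorization
vectors = embedding_table[token_ids]                # the "embedding" is a row lookup

print(vectors.shape)  # (3, 10): one 10-dimensional vector per token
```

During training, gradient descent updates the rows of this table, which is how semantically similar words end up with similar vectors.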

Apply the Transformer Block (Encoder)

You are applying the Transformer block you implemented earlier.

x = TransformerBlock(
    embed_dim=embed_dim,
    num_heads=num_heads,
    ff_dim=ff_dim,
    rate=dropout_rate,
)(x)

Why do we use a Transformer here?

Transformers are exceptionally good at:

  • modeling sequences
  • understanding context
  • learning relationships between tokens

In spam classification, word relationships matter:

  • “free” → suspicious
  • “free” + “win money” → very suspicious
  • “free” + “trial” → maybe okay

The Transformer captures these relationships.

Reduce the Sequence to a Single Vector

This operation takes all token embeddings from the sequence and computes one single averaged vector.

Example: if your sequence has 100 tokens, each a vector of size 10, pooling turns it into one vector of size 10.

x = layers.GlobalAveragePooling1D()(x)
x = layers.Dropout(dropout_rate)(x)

Why do we do this?

Because the classifier expects one vector per input, not one vector per token. Think of it like summarizing the entire message into its most important features. Why add dropout here as well? To reduce overfitting and help generalization.
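
The pooling step itself is just a mean over the sequence axis, as this NumPy sketch shows (random stand-in data, not the lab's tensors):

```python
import numpy as np

token_vectors = np.random.randn(100, 10)   # 100 tokens, each a 10-dim vector
pooled = token_vectors.mean(axis=0)        # what GlobalAveragePooling1D computes per example

print(pooled.shape)  # (10,): one vector summarizing the whole message
```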

Add the Output Layer for Binary Classification

We create a dense output neuron that predicts:

  • 0 → ham (not spam)
  • 1 → spam
output = layers.Dense(1, activation="sigmoid", name="spam_score")(x)

Build the Final Model

We connect:

  • the input
  • preprocessing
  • embeddings
  • Transformer
  • pooling
  • classifier

into a single trainable model.

model = keras.Model(inputs=text_input, outputs=output)

To get an overview of the final model architecture, you can print a summary() at the end:

model.summary()

Complete Code

#@title Build 10-dimensional Transformer spam classifier

embed_dim = 10   # requirement: 10-dimensional representation
num_heads = 2    # can be tuned (must divide embed_dim)
ff_dim    = 8   # feed-forward size (students can tune)
dropout_rate = 0.7

# Input is raw text
text_input = layers.Input(shape=(), dtype=tf.string, name="text")

# Text -> token IDs
x = vectorize_layer(text_input)

# Token IDs -> embeddings (10-dimensional)
x = layers.Embedding(
    input_dim=max_tokens,
    output_dim=embed_dim,
    name="token_embedding"
)(x)

# Transformer encoder
x = TransformerBlock(
    embed_dim=embed_dim,
    num_heads=num_heads,
    ff_dim=ff_dim,
    rate=dropout_rate,
)(x)

# Reduce sequence to a single vector
x = layers.GlobalAveragePooling1D()(x)
x = layers.Dropout(dropout_rate)(x)

# Output layer: spam (1) vs ham (0)
output = layers.Dense(1, activation="sigmoid", name="spam_score")(x)

model = keras.Model(inputs=text_input, outputs=output)

model.summary()

Compile and Build the Model

Compile the Model

#@title Compile model

learning_rate = 1e-3  # students can tune (= 0.001)

model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
    loss="binary_crossentropy",
    metrics=[
        "accuracy",
        keras.metrics.AUC(name="auc"),
    ],
)

model.compile(...) tells Keras:

  • how to update the model (optimizer)
  • how to measure errors (loss function)
  • which metrics to compute during training/evaluation

It does not start training yet – it just configures the training procedure.

Learning rate

learning_rate = 1e-3  # = 0.001

The learning rate controls how big the steps are in gradient descent.

  • Too high → model jumps around, may never converge
  • Too low → training is very slow, may get stuck in a bad spot

Here: 1e-3 is a common default starting point.
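
You can see the effect of the learning rate on a toy problem: minimizing f(x) = x² with plain gradient descent (the gradient is 2x). This is an illustration only, not part of the lab pipeline.

```python
def descend(lr, steps=20, x=5.0):
    """Run `steps` gradient-descent updates on f(x) = x**2, starting at x."""
    for _ in range(steps):
        x = x - lr * 2 * x   # update rule: x <- x - lr * f'(x)
    return x

print(descend(0.001))  # too low: barely moves from the start (≈ 4.8)
print(descend(0.1))    # reasonable: converges toward the minimum at 0
print(descend(1.1))    # too high: each step overshoots and |x| grows (diverges)
```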

Optimizer: Adam

An optimizer tells the model how to update weights to reduce the loss using gradients.

optimizer=keras.optimizers.Adam(learning_rate=learning_rate)

The Adam optimizer is a great default to start with. It provides capabilities like:

  • Adaptive learning rates per parameter
  • Works well for many NLP and deep learning tasks
  • Handles noisy gradients better than vanilla SGD

Loss Function: Binary Crossentropy

Model Optimization & Loss Function flow

The loss measures how far the predictions are from the true labels. It’s what the optimizer tries to minimize.

loss="binary_crossentropy",

We chose binary_crossentropy because:

  • We have a binary classification problem: spam (1) vs ham (0)
  • Output is a single probability (sigmoid)
  • Binary crossentropy is the standard choice for this setup
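
For intuition, here is binary crossentropy computed by hand in NumPy on made-up predictions. Confidently wrong predictions are punished much harder than confidently right ones, which is exactly the behavior the optimizer exploits.

```python
import numpy as np

def binary_crossentropy(y_true, p):
    """Average negative log-likelihood of the true labels under predictions p."""
    p = np.clip(p, 1e-7, 1 - 1e-7)   # avoid log(0); Keras clips similarly internally
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true = np.array([1, 0, 1, 0])            # spam, ham, spam, ham
good   = np.array([0.9, 0.1, 0.8, 0.2])    # confident and correct
bad    = np.array([0.1, 0.9, 0.2, 0.8])    # confident and wrong

print(binary_crossentropy(y_true, good))   # small loss (≈ 0.16)
print(binary_crossentropy(y_true, bad))    # much larger loss (≈ 1.96)
```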

Metrics: Accuracy and AUC

Metrics are readable performance indicators shown during training. Unlike the loss, we don’t optimize them directly; we just track them.

  • Accuracy – how many predictions are correct (simple and intuitive)
  • AUC (Area Under ROC Curve) – how well the model separates spam vs ham across thresholds

AUC is a great fit here because spam detection is often about ranking messages by risk, not just picking a fixed threshold. AUC gives a good idea of separability even when the class distribution is imbalanced.

metrics=[
    "accuracy",
    keras.metrics.AUC(name="auc"),
],
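
AUC has a neat interpretation you can verify by hand: it equals the probability that a randomly chosen spam message receives a higher score than a randomly chosen ham message. A small NumPy sketch with made-up scores (pairwise ranking formulation; ties count half):

```python
import numpy as np

def auc(labels, scores):
    """AUC as the fraction of spam/ham pairs ranked correctly."""
    pos = scores[labels == 1]                      # spam scores
    neg = scores[labels == 0]                      # ham scores
    wins = (pos[:, None] > neg[None, :]).sum() \
         + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

labels = np.array([1, 1, 0, 0, 0])
scores = np.array([0.9, 0.6, 0.7, 0.3, 0.1])       # one ham (0.7) outranks one spam (0.6)

print(auc(labels, scores))  # 5/6 ≈ 0.833: 5 of the 6 spam-vs-ham pairs ranked correctly
```

Note that no threshold appears anywhere in the computation, which is why AUC is called threshold-independent.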

Train the Transformer model

In this step you will start training the model. You can tweak/tune the training by adjusting the epochs parameter. An epoch is one full pass over the training dataset.

#@title Train the Transformer model

epochs = 1  # this can be tuned

history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=epochs,
)

The real magic happens during the model.fit(...) call. What happens during the call for each epoch:

  • The model iterates over train_ds
  • For each batch:
    • makes predictions
    • computes loss
    • computes gradients
    • updates weights via the optimizer
  • After an epoch, it evaluates on val_ds to measure generalization

By passing validation_data, we can monitor how well the model generalizes to held-out data after every epoch, which lets us spot overfitting early.

Info

Note: this step may take a while - sit back and grab a coffee or have a chat with your colleagues :-)

Analyse the training progress

This step is purely for your evaluation of the training progress. Let’s get an overview of what’s good and what’s bad:

Accuracy

  • train_accuracy describes how well the model fits the training data
  • val_accuracy describes how well it generalizes to unseen data

Good sign: both curves rising and staying reasonably close

Overfitting sign: training goes up, validation gets worse or stagnates

AUC

  • A higher AUC means the model is better at distinguishing spam vs ham
  • AUC is especially useful when the classes are imbalanced

Loss

  • train_loss should go down as the model learns
  • val_loss ideally also decreases, then stabilizes

If train_loss keeps decreasing while val_loss starts increasing after some epoch, that indicates strong overfitting.

#@title Plot training curves

def plot_history(history, metric="accuracy"):
    plt.figure()
    plt.plot(history.history[metric], label=f"train_{metric}")
    plt.plot(history.history[f"val_{metric}"], label=f"val_{metric}")
    plt.xlabel("Epoch")
    plt.ylabel(metric.capitalize())
    plt.legend()
    plt.grid(True)
    plt.show()

plot_history(history, "accuracy")
plot_history(history, "auc")
plot_history(history, "loss")

Examples

Example output: training/validation curves for accuracy, AUC, and loss.

Evaluation & Testing

In this section, you will evaluate the performance of your trained spam classifier on unseen test data and learn how to save and load your model for future use. You will also test the loaded model with custom email inputs to see how well it classifies spam versus ham.


Subsections of Evaluation & Testing

Evaluation of the trained Model

Evaluate the Model based on the Test Dataset

After training the Transformer spam classifier, the final step is to evaluate how well the model performs on completely unseen data. This is done using the test set, which acts like the model’s final exam. Because the model did not encounter this data during training or validation, the test results give a realistic estimate of how the model will behave in real-world scenarios.

The first part of the code uses model.evaluate(test_ds) to compute three important metrics on the test dataset: the loss, the accuracy, and the AUC. The loss tells us how far off the predictions are on average, while accuracy shows the proportion of messages the model classified correctly. AUC (Area Under the ROC Curve) is especially helpful in spam detection because it measures how well the model separates spam from ham across all possible classification thresholds. Together, these metrics provide a high-level view of the model’s performance.

However, high-level metrics alone are not enough to understand the behavior of a classifier. To dig deeper, we collect raw predictions for each individual email in the test set. The model outputs a probability between 0 and 1 representing how likely a message is to be spam. We convert these probabilities into binary class predictions using a threshold of 0.5: values above or equal to 0.5 are labeled as spam, and everything below as ham. This allows us to compare predictions directly against the true labels.

With the predicted and actual labels, we can compute more detailed evaluation measures. The classification report provides precision, recall, and F1-scores for both classes. These metrics are vital, especially when the dataset is imbalanced, because accuracy alone may hide important weaknesses. For example, a model might achieve high accuracy by simply predicting everything as ham, but precision and recall would reveal that it is failing completely at catching spam. The precision for spam tells us how many messages predicted as spam were actually spam, while recall tells us how many real spam messages the model successfully detected. The F1-score provides a balanced measure of both.

Lastly, the confusion matrix shows the exact number of correct and incorrect predictions for each class. This helps identify systematic errors: false positives (ham incorrectly marked as spam) and false negatives (spam the model failed to detect). In a spam classifier, the balance between these two types of errors is crucial. Too many false positives annoy users with messages being wrongly filtered out, while too many false negatives let harmful or unwanted spam slip through.

This evaluation code will help you not only know how well the model performs overall, but also truly understand how it works, what types of mistakes it makes, and whether it is suitable for real-world use. Proper machine learning goes beyond accuracy and requires deeper diagnostic tools such as precision, recall, F1-score, and confusion matrices to truly understand a classifier’s strengths and weaknesses.

#@title Evaluate on test set

test_loss, test_acc, test_auc = model.evaluate(test_ds)
print(f"Test loss: {test_loss:.4f}")
print(f"Test accuracy: {test_acc:.4f}")
print(f"Test AUC: {test_auc:.4f}")

# Collect predictions for detailed metrics
test_texts_list = list(test_texts)
test_labels_array = np.array(test_labels)

pred_probs = model.predict(tf.constant(test_texts_list)).ravel()
pred_labels = (pred_probs >= 0.5).astype("int32")

print("\nClassification report:")
print(classification_report(test_labels_array, pred_labels, target_names=["ham", "spam"]))

print("\nConfusion matrix:")
print(confusion_matrix(test_labels_array, pred_labels))

Spam Classifier Interpretation Cheat Sheet

Test Metrics (Loss, Accuracy, AUC)

Loss

  • Measures how wrong the predictions are (lower is better)
  • If test loss is much higher than train loss → overfitting
  • If both losses stay high → underfitting or poor hyperparameters

Accuracy

  • Percentage of correct predictions
  • Can be misleading with imbalanced datasets

Interpretation:

  • 90%+ is good
  • Always compare accuracy with precision/recall

AUC (Area Under ROC Curve)

  • Measures how well the model separates spam vs ham
  • Threshold‑independent

Interpretation:

  • 0.5 → random guessing
  • 0.7–0.8 → decent
  • 0.8–0.9 → strong
  • 0.9+ → excellent

Classification Report

Precision

“How many predicted spams were actually spam?”

  • High precision → few false alarms
  • Low precision → many ham→spam mistakes

Interpretation:

  • Spam precision < 0.80 → too many false positives
  • Ham precision < 0.95 → misclassification risk

Recall

“How many real spam messages did we catch?”

  • High recall → strong spam detection
  • Low recall → spam slipping through

Interpretation:

  • Spam recall < 0.80 → missing too much spam

F1‑score

Balanced combination of precision and recall.

Interpretation:

  • 0.70–0.80 → OK
  • 0.80–0.90 → strong
  • >0.90 → excellent

Confusion Matrix

A typical matrix:

              Pred Ham   Pred Spam
Actual Ham    TN         FP
Actual Spam   FN         TP

TP — True Positives

Correctly identified spam.

TN — True Negatives

Correctly identified ham.

FP — False Positives

Ham → predicted spam

  • Annoys users
  • Reduce by increasing threshold or improving precision

FN — False Negatives

Spam → predicted ham

  • Dangerous (spam gets through)
  • Reduce by lowering threshold or improving recall
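
The FP/FN trade-off is easy to see with made-up scores: raising the decision threshold trades false positives for false negatives, and vice versa.

```python
import numpy as np

labels = np.array([1, 1, 1, 0, 0, 0])                    # 3 spam, 3 ham (ground truth)
probs  = np.array([0.95, 0.70, 0.40, 0.60, 0.30, 0.05])  # hypothetical model spam scores

for threshold in (0.3, 0.5, 0.8):
    pred = (probs >= threshold).astype(int)
    fp = int(((pred == 1) & (labels == 0)).sum())        # ham flagged as spam
    fn = int(((pred == 0) & (labels == 1)).sum())        # spam that slipped through
    print(f"threshold={threshold}: FP={fp}, FN={fn}")
# threshold=0.3: FP=2, FN=0   (aggressive: catches all spam, annoys users)
# threshold=0.5: FP=1, FN=1   (the default used in this lab)
# threshold=0.8: FP=0, FN=2   (conservative: no false alarms, misses spam)
```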

Quick Decision Guide

High accuracy + low spam recall

→ Model misses spam
→ Lower threshold, tune hyperparameters

Low spam precision

→ Flags too many ham messages
→ Raise threshold

Loss decreases but validation metrics worsen

→ Overfitting
→ Add dropout, reduce epochs

AUC < 0.7

→ Poor class separation
→ Improve preprocessing or Transformer setup

Ideal Metric Profile

  • Accuracy ≥ 0.90
  • Spam precision ≥ 0.85
  • Spam recall ≥ 0.85
  • AUC ≥ 0.90
  • Balanced FP/FN rates

Save/Load & Test the Model

Save the model to disk

In this task, you will save the model to disk for later use or to share with others.

#@title Save model to disk

save_path = "enron_spam_transformer.keras"

model.save(
    save_path,
    include_optimizer=True,
)

print("Model saved to:", save_path)

Load the model from disk

In this task, you will load the saved model from disk e.g for use in another script or application.

#@title Load saved model

loaded_model = keras.models.load_model(
    save_path,
    custom_objects={"TransformerBlock": TransformerBlock},
)

loaded_model.summary()

Use the loaded model to evaluate on custom input

This section demonstrates how to use the loaded model to classify new email texts as spam or ham by using your own input. Feel free to change the example inputs and add your own.

#@title Test the trained model with custom input

def classify_email(text, model=loaded_model):
    text_tensor = tf.constant([text])
    prob = model.predict(text_tensor)[0][0]
    label = "SPAM" if prob >= 0.5 else "HAM"
    print(f"Text: {text[:200]}{'...' if len(text) > 200 else ''}")
    print(f"Predicted label: {label}  (prob = {prob:.3f})")
    return prob, label

# Try some examples:
examples = [
    "Get cheap Viagra now!!! Limited offer, click here to buy.",
    "Hi team, the meeting is scheduled for tomorrow at 10 AM.",
    "You have won $1,000,000. Please send your bank details.",
    "Hi Marc, how are you doing? I hope this email finds you well.",
    "Hi, I have a super secret offer for you - 200% off of Clothes if you click this link",
]

for e in examples:
    print("=" * 80)
    classify_email(e)

Info

Update the examples with your own and test how “phishy” your messages are!

Final Challenge

Welcome to the final challenge of this course! This challenge is designed to test your understanding of the concepts and skills you’ve learned throughout the modules.

Tweak and improve your model based on the feedback and insights you’ve gathered so far. Once you’re satisfied with your model’s performance, submit it for validation.

Tip

While building your model, several variables have been left available for tweaking. For example:

  • max_tokens
  • max_len
  • batch_size
  • ff_dim
  • dropout_rate
  • epochs

Tasks

  • Review the course material and tasks
  • Tweak and improve your model
  • Upload your final model to get validated

How to Submit Your Model

To submit your final model for validation, please use the following code snippet within your Colab notebook:

#@title Submit Final Model for Validation
import requests, zipfile, os, json, time

PROVISIONER = "http://ml-workshop.ftntlab.tech/claim"

claim = requests.post(PROVISIONER, timeout=20).json()
submit_url = claim["submit_url"]
token = claim["submit_token"]

print("[*] Your submit URL:", submit_url)
print("[*] Your token:", token)

model.save("submission.keras")
!zip submission.zip submission.keras
print("[*] Model saved and zipped.")

time.sleep(5)

with open("submission.zip", "rb") as f:
    r = requests.post(
        submit_url,
        files={"file": f},
        headers={"X-Submit-Token": token},
        timeout=180
    )
print("[*] Your Result:\n")
if r.headers.get("content-type","").startswith("application/json"):
    print(json.dumps(r.json(), indent=2))

Hints and Tips

Hint 1

Have a look at the epochs parameter in the training step. Maybe slightly increasing it to something between 4 - 6 will help?

Hint 2

Have a look at the ff_dim parameter in the Transformer block. Maybe increasing it to something between 32 - 128 will help?

Hint 3

Have a look at the dropout_rate parameter in the Transformer block. Maybe decreasing it to something between 0.1 - 0.3 will help?

Hint 4

Have a look at the max_len parameter in the text vectorization step. Maybe increasing it to something between 150 - 250 will help?

Hint 5

Have a look at the batch_size parameter in the data preparation step. Maybe increasing it to something between 64 - 128 will help?

Hint 6

Have a look at the max_tokens parameter in the text vectorization step. Maybe increasing it to something between 15000 - 25000 will help?

Hint 7

Have a look at the optimizer learning_rate in the model compilation step. Maybe decreasing it to something between 0.0001 - 0.0005 will help?