Vectorize the Training Data
Text Vectorization
Modern neural networks cannot work directly with raw text. They require numerical input—typically sequences of integers. This section explains how text vectorization works, why it is necessary, and how the provided code implements it.
Text vectorization is the process of converting raw text (words, sentences, documents) into numerical representations that machine-learning models can understand.
Neural networks operate using mathematical operations on vectors and matrices. Because text is not inherently numeric, we need a way to map text → numbers in a consistent and meaningful way.
Why Do We Need Vectorization?
- Neural networks require fixed-size numeric tensors Text varies in length and contains symbolic words, so we must convert it into:
- numbers
- fixed-length sequences
- uniformly structured inputs
- Words must be represented consistently Each word must map to the same numeric ID every time.
Example:
| Word | Token ID |
|---|---|
| “movie” | 85 |
| “great” | 12 |
| “terrible” | 313 |
This allows models to learn patterns like: “bad words → negative sentiment”
- Vocabulary control Datasets may contain tens or hundreds of thousands of distinct words. Vectorization allows us to choose:
- how many words to keep
- how to handle unknown words (OOV tokens)
- how long each sequence should be
- Efficient training A clean numeric representation reduces memory usage and simplifies model operations.
How Does Vectorization Work Internally?
When using TextVectorization with output_mode=“int”, the process looks like this:
- Tokenization Text is split into tokens (usually words or subwords).
"This movie was amazing"
→ ["this", "movie", "was", "amazing"]- Vocabulary Building: Turning Words into IDs
Machine learning models cannot work directly with text. Before training begins, words must be converted into numbers.
During the .adapt() step, the text vectorization layer builds a vocabulary, which is simply a dictionary that maps words to integer IDs.
- scans all training texts
- counts how often each word appears
- keeps the most frequent words (up to max_tokens)
- assigns each word a unique integer ID
At this stage, words are treated as symbols, not meanings. The model has not learned anything yet—it is only creating a lookup table.
- Example vocabulary:
{"
"<PAD>" is used to pad sequences so all inputs have the same length
"<OOV>" (out-of-vocabulary) represents words not seen during training- Text → Sequence conversion
["this", "movie", "was", "amazing"]
→ [14, 3, 4, 112]- Padding or truncation
All sequences are resized to
max_len.
Defining Hyperparameters
max_tokens = 2000 # vocabulary size
max_len = 40 # sequence lengthmax_tokens- Limits vocabulary to the 20,000 most frequent words.
- Rare words are replaced with
token ID.
max_len- Every sequence will be exactly 200 tokens, padded or truncated accordingly. This ensures that the neural network receives input tensors of consistent shape:
(batch_size, max_len)Creating the Vectorization Layer
vectorize_layer = layers.TextVectorization(
max_tokens=max_tokens,
output_mode="int",
output_sequence_length=max_len,
)This layer will eventually turn each text sample into a fixed-length integer sequence, ready for embedding or model input.
Adapting the Layer
text_ds_for_adapt = tf.data.Dataset.from_tensor_slices(train_texts).batch(512)
vectorize_layer.adapt(text_ds_for_adapt)You may ask yourself: Why do I need the adaption? Because the layer needs to:
- build the vocabulary
- compute token frequencies
- map words → token IDs
This must be done only on training data to avoid information leakage.
As a summary, this is the complete code to vectorize the Enron Spam dataset for our machine learning model.
#@title Text vectorization
max_tokens = 2000 # vocabulary size
max_len = 40 # sequence length
vectorize_layer = layers.TextVectorization(
max_tokens=max_tokens,
output_mode="int",
output_sequence_length=max_len,
)
# Adapt vectorizer on training texts
text_ds_for_adapt = tf.data.Dataset.from_tensor_slices(train_texts).batch(512)
vectorize_layer.adapt(text_ds_for_adapt)
# Quick test
sample = train_texts[0]
print("Raw text:", sample[:200])
print("Token IDs:", vectorize_layer(tf.constant([sample]))[0][:30])Interpreting the Vectorization Output
After calling adapt(), the TextVectorization layer has built a vocabulary and can now convert raw text into numeric sequences.
Here is what the output means:
Raw text
any software just for 15 $ - 99 $ understanding oem software
lead me not into temptation ; i can find the way myself .
# 3533 . the law disregards trifles .[ 46 241 127 8 166 311 1447 1733 241 1156 47 31
123 18000 15 43 299 2 248 1503 1 2 1094 1
1 0 0 0 0 0]This array is the numerical representation of the text, where:
- Each number corresponds to a word’s ID in the vocabulary and matches its exact position
- i.e. any = 46, software = 241, just = 127, etc.
- The same word always maps to the same number
- For example,
"software"appears twice and maps to the same ID (241) both times
- For example,
- The value
1represents<OOV>(out-of-vocabulary)- These are words not seen often enough (or at all) during training
- The value
0represents<PAD>- Padding is added at the end so all sequences have the same length
