Evaluation of the Trained Model
Evaluate the Model on the Test Dataset
After training the Transformer spam classifier, the final step is to evaluate how well the model performs on completely unseen data. This is done using the test set, which acts like the model’s final exam. Because the model did not encounter this data during training or validation, the test results give a realistic estimate of how the model will behave in real-world scenarios.
The first part of the code uses model.evaluate(test_ds) to compute three metrics on the test dataset: the loss, the accuracy, and the AUC. Note that evaluate returns the loss followed by the metrics the model was compiled with, so unpacking three values assumes accuracy and AUC were specified, in that order, at compile time. The loss tells us how far off the predicted probabilities are on average, while accuracy shows the proportion of messages the model classified correctly. AUC (Area Under the ROC Curve) is especially helpful in spam detection because it measures how well the model separates spam from ham across all possible classification thresholds. Together, these metrics provide a high-level view of the model’s performance.
However, high-level metrics alone are not enough to understand the behavior of a classifier. To dig deeper, we collect raw predictions for each individual message in the test set. The model outputs a probability between 0 and 1 representing how likely a message is to be spam. We convert these probabilities into binary class predictions using a threshold of 0.5: values greater than or equal to 0.5 are labeled as spam, and everything below as ham. This allows us to compare predictions directly against the true labels.
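The thresholding step described above can be sketched in plain Python, independent of the model. The probabilities are made up for illustration, and `to_labels` is a hypothetical helper rather than part of the tutorial's code.

```python
def to_labels(probs, threshold=0.5):
    """Map spam probabilities to class labels: 1 = spam, 0 = ham."""
    # ">=" matches the rule in the text: a probability of exactly 0.5 counts as spam.
    return [1 if p >= threshold else 0 for p in probs]


# Made-up probabilities for four messages (illustrative only).
example_probs = [0.03, 0.49, 0.50, 0.97]
example_labels = to_labels(example_probs)
print(example_labels)
```

Passing a different `threshold` value changes where the ham/spam boundary falls without retraining anything.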
With the predicted and actual labels, we can compute more detailed evaluation measures. The classification report provides precision, recall, and F1-scores for both classes. These metrics are vital, especially when the dataset is imbalanced, because accuracy alone may hide important weaknesses. For example, a model might achieve high accuracy by simply predicting everything as ham, but its spam recall would be 0, revealing that it catches no spam at all (spam precision is undefined in that case, since nothing is predicted as spam). The precision for spam tells us what fraction of the messages predicted as spam were actually spam, while recall tells us what fraction of the real spam messages the model detected. The F1-score is the harmonic mean of the two, so it is high only when both are high.
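The "predict everything as ham" case can be checked with a few lines of plain Python. The data below is invented (90 ham, 10 spam), and `precision_recall_f1` is a hypothetical helper written for this sketch; it returns 0.0 whenever a ratio would divide by zero, which is a convention chosen here.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the `positive` class (1 = spam)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Convention for this sketch: report 0.0 when the denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Invented, imbalanced test set: 90 ham (0) and 10 spam (1).
y_true = [0] * 90 + [1] * 10
# A degenerate "model" that labels every message as ham.
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
spam_precision, spam_recall, spam_f1 = precision_recall_f1(y_true, y_pred)

print(f"accuracy={accuracy:.2f}  spam recall={spam_recall:.2f}  spam F1={spam_f1:.2f}")
```

Accuracy looks respectable here even though not a single spam message is caught, which is why the per-class numbers matter.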
Lastly, the confusion matrix shows the exact number of correct and incorrect predictions for each class. This helps identify systematic errors: false positives (ham incorrectly marked as spam) and false negatives (spam the model failed to detect). In a spam classifier, the balance between these two types of errors is crucial. Too many false positives annoy users with messages being wrongly filtered out, while too many false negatives let harmful or unwanted spam slip through.
This evaluation code tells you not only how well the model performs overall, but also what types of mistakes it makes and whether it is suitable for real-world use. Sound machine learning practice goes beyond accuracy and relies on diagnostic tools such as precision, recall, F1-score, and confusion matrices to understand a classifier’s strengths and weaknesses.
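#@title Evaluate on test set
import numpy as np
import tensorflow as tf
from sklearn.metrics import classification_report, confusion_matrix

# Assumes model, test_ds, test_texts, and test_labels were defined in earlier cells.
# evaluate() returns the loss followed by the compiled metrics (here: accuracy, AUC).
test_loss, test_acc, test_auc = model.evaluate(test_ds)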
print(f"Test loss: {test_loss:.4f}")
print(f"Test accuracy: {test_acc:.4f}")
print(f"Test AUC: {test_auc:.4f}")
# Collect predictions for detailed metrics
test_texts_list = list(test_texts)
test_labels_array = np.array(test_labels)
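# Predict spam probabilities from the raw strings; ravel() flattens the output to a 1-D array
pred_probs = model.predict(tf.constant(test_texts_list)).ravel()
# Apply the 0.5 decision threshold: 1 = spam, 0 = ham
pred_labels = (pred_probs >= 0.5).astype("int32")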
print("\nClassification report:")
print(classification_report(test_labels_array, pred_labels, target_names=["ham", "spam"]))
print("\nConfusion matrix:")
print(confusion_matrix(test_labels_array, pred_labels))

Spam Classifier Interpretation Cheat Sheet
Test Metrics (Loss, Accuracy, AUC)
Loss
- Measures how wrong the predictions are (lower is better)
- If test loss is much higher than train loss → overfitting
- If both losses stay high → underfitting or poor hyperparameters
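To make "how wrong the predictions are" concrete, here is a minimal sketch of binary cross-entropy, assuming that is the loss the model was trained with (the text does not say). The function name, the `eps` clipping value, and the example probabilities are all illustrative choices.

```python
import math


def binary_cross_entropy(y_true, probs, eps=1e-7):
    """Mean binary cross-entropy; labels are 0 (ham) or 1 (spam)."""
    total = 0.0
    for t, p in zip(y_true, probs):
        # Clip to avoid log(0) for probabilities of exactly 0 or 1.
        p = min(max(p, eps), 1.0 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)


# One spam message, scored confidently right vs. confidently wrong.
loss_confident_right = binary_cross_entropy([1], [0.9])
loss_confident_wrong = binary_cross_entropy([1], [0.1])
print(f"{loss_confident_right:.4f} vs {loss_confident_wrong:.4f}")
```

Under this loss a confident mistake costs far more than a confident correct prediction earns, so the loss can rise even while accuracy stays flat.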
Accuracy
- Percentage of correct predictions
- Can be misleading with imbalanced datasets
Interpretation:
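- 90%+ is good only if it clearly beats the majority-class baseline (for example, if 87% of messages are ham, always predicting ham already scores 87%)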
- Always compare accuracy with precision/recall
AUC (Area Under ROC Curve)
- Measures how well the model separates spam vs ham
- Threshold‑independent
Interpretation:
- 0.5 → random guessing
- 0.7–0.8 → decent
- 0.8–0.9 → strong
- 0.9+ → excellent
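One way to read AUC is as the probability that a randomly chosen spam message receives a higher score than a randomly chosen ham message, with ties counted as half. The sketch below computes that directly by comparing every spam/ham pair; `pairwise_auc` and the scores are invented for illustration, and the quadratic loop is only meant for small examples. Keras's AUC metric approximates the same quantity numerically, so small differences from this exact value are to be expected.

```python
def pairwise_auc(y_true, scores):
    """Exact ROC AUC via pairwise comparison of spam (1) and ham (0) scores."""
    spam_scores = [s for t, s in zip(y_true, scores) if t == 1]
    ham_scores = [s for t, s in zip(y_true, scores) if t == 0]
    if not spam_scores or not ham_scores:
        raise ValueError("AUC needs at least one spam and one ham example")
    wins = 0.0
    for spam_score in spam_scores:
        for ham_score in ham_scores:
            if spam_score > ham_score:
                wins += 1.0   # spam ranked above ham: correct ordering
            elif spam_score == ham_score:
                wins += 0.5   # tie counts as half
    return wins / (len(spam_scores) * len(ham_scores))


# Invented labels and scores: three ham messages, two spam messages.
y_true = [0, 0, 0, 1, 1]
scores = [0.10, 0.40, 0.80, 0.35, 0.90]
auc = pairwise_auc(y_true, scores)
print(f"AUC = {auc:.3f}")
```

Because only the ordering of scores matters, the result does not change with the decision threshold.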
Classification Report
Precision
“How many predicted spams were actually spam?”
- High precision → few false alarms
- Low precision → many ham→spam mistakes
Interpretation:
- Spam precision < 0.80 → too many false positives
- Ham precision < 0.95 → a noticeable share of messages labeled ham are actually spam
Recall
“How many real spam messages did we catch?”
- High recall → strong spam detection
- Low recall → spam slipping through
Interpretation:
- Spam recall < 0.80 → missing too much spam
F1‑score
Harmonic mean of precision and recall; it is high only when both are high.
Interpretation:
- 0.70–0.80 → OK
- 0.80–0.90 → strong
- >0.90 → excellent
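For reference, the three scores in the classification report can be written in terms of the spam-class counts: TP (spam correctly flagged), FP (ham flagged as spam), and FN (spam that was missed).

```latex
\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\]
```

As an illustrative example with made-up values, a precision of 0.90 and a recall of 0.60 give an F1 of 2 · 0.54 / 1.50 = 0.72, which is lower than their arithmetic mean of 0.75 because the harmonic mean is pulled toward the weaker of the two.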
Confusion Matrix
A typical matrix:
| | Pred Ham | Pred Spam |
|---|---|---|
| Actual Ham | TN | FP |
| Actual Spam | FN | TP |
TP — True Positives
Correctly identified spam.
TN — True Negatives
Correctly identified ham.
FP — False Positives
Ham → predicted spam
- Annoys users
- Reduce by raising the decision threshold (at the cost of more false negatives) or by improving the model
FN — False Negatives
Spam → predicted ham
- Dangerous (spam gets through)
- Reduce by lowering the decision threshold (at the cost of more false positives) or by improving the model
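The trade-off between false positives and false negatives can be seen by counting the four cells at different thresholds. This is a plain-Python sketch with invented labels and probabilities; `confusion_counts` is a hypothetical helper that returns the cells in the order TN, FP, FN, TP, which is the same order you get from flattening scikit-learn's binary confusion matrix with `.ravel()`.

```python
def confusion_counts(y_true, probs, threshold=0.5):
    """Return (tn, fp, fn, tp) with spam = 1 as the positive class."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, probs):
        pred = 1 if p >= threshold else 0
        if t == 1 and pred == 1:
            tp += 1   # spam correctly flagged
        elif t == 0 and pred == 0:
            tn += 1   # ham correctly passed
        elif t == 0 and pred == 1:
            fp += 1   # ham wrongly flagged as spam
        else:
            fn += 1   # spam that slipped through
    return tn, fp, fn, tp


# Invented data: four ham messages followed by four spam messages.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
probs = [0.05, 0.20, 0.45, 0.70, 0.30, 0.55, 0.80, 0.95]

results = {}
for threshold in (0.25, 0.50, 0.75):
    results[threshold] = confusion_counts(y_true, probs, threshold)
    tn, fp, fn, tp = results[threshold]
    print(f"threshold={threshold:.2f}  TN={tn} FP={fp} FN={fn} TP={tp}")
```

On this toy data, raising the threshold removes false positives but adds false negatives, and lowering it does the opposite.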
Quick Decision Guide
High accuracy + low spam recall
→ Model misses spam
→ Lower threshold, tune hyperparameters
Low spam precision
→ Flags too many ham messages
→ Raise threshold
Training loss decreases but validation metrics worsen
→ Overfitting
→ Add dropout, reduce epochs
AUC < 0.7
→ Poor class separation
→ Improve preprocessing or Transformer setup
Ideal Metric Profile
- Accuracy ≥ 0.90
- Spam precision ≥ 0.85
- Spam recall ≥ 0.85
- AUC ≥ 0.90
- Balanced FP/FN rates
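The numeric targets above can be turned into a small check, sketched here under stated assumptions: the thresholds are copied from the list, while the metric names, the `unmet_targets` helper, and the example values are hypothetical and not results from the tutorial's model.

```python
# Minimum values taken from the "Ideal Metric Profile" list.
IDEAL_PROFILE = {
    "accuracy": 0.90,
    "spam_precision": 0.85,
    "spam_recall": 0.85,
    "auc": 0.90,
}


def unmet_targets(metrics, targets=IDEAL_PROFILE):
    """Return the names of metrics that fall below their target minimum."""
    # A metric that is missing from `metrics` is treated as not met.
    return [
        name for name, minimum in targets.items()
        if metrics.get(name, 0.0) < minimum
    ]


# Hypothetical metric values, for illustration only.
example_metrics = {
    "accuracy": 0.97,
    "spam_precision": 0.93,
    "spam_recall": 0.81,
    "auc": 0.98,
}
misses = unmet_targets(example_metrics)
print(misses)
```

The balance between false positives and false negatives is not captured by these four numbers and still needs a look at the confusion matrix.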