Optimize the Dataset for TensorFlow
TensorFlow’s tf.data API provides a fast, scalable, and memory-efficient way to feed data to a model during training. Instead of manually passing NumPy arrays to the model, we wrap them in Dataset pipelines, which handle:
- batching
- shuffling
- prefetching
- efficient loading on CPU/GPU
The code below converts your processed texts and labels into TensorFlow-ready datasets for training, validation, and testing.
#@title Build TensorFlow Dataset
import tensorflow as tf

batch_size = 256  # can be tuned

def make_dataset(texts, labels, training=True):
    # Wrap the in-memory arrays in a tf.data pipeline.
    ds = tf.data.Dataset.from_tensor_slices((texts, labels))
    if training:
        # Shuffle over the full dataset, with a fresh order every epoch.
        ds = ds.shuffle(buffer_size=len(texts), reshuffle_each_iteration=True)
    # Batch the samples and prefetch the next batch while the current one trains.
    ds = ds.batch(batch_size).prefetch(tf.data.AUTOTUNE)
    return ds

train_ds = make_dataset(train_texts, train_labels, training=True)
val_ds = make_dataset(val_texts, val_labels, training=False)
test_ds = make_dataset(test_texts, test_labels, training=False)
train_ds

You can adjust the batch_size variable if you like. A batch is a group of samples processed together in one forward/backward pass.
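To sanity-check the pipeline, you can pull a single batch and inspect its shapes; this is a minimal sketch, assuming train_ds was built as above:

for text_batch, label_batch in train_ds.take(1):
    # e.g. (256,) if the texts are raw strings, or (256, seq_len) if tokenized
    print("texts:", text_batch.shape)
    print("labels:", label_batch.shape)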
If you change it, keep an eye on model performance, training speed, and memory usage.
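As a rough check of input-pipeline speed, you can time one full pass over the dataset without running the model. This is a minimal sketch using Python's standard time module:

import time

start = time.perf_counter()
for _ in train_ds:
    pass  # iterate the pipeline only; no model is involved
elapsed = time.perf_counter() - start
print(f"One pass over train_ds: {elapsed:.2f}s")

If this time drops noticeably after the first pass, caching and prefetching are doing their job; if it dominates your epoch time, the input pipeline is the bottleneck.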
Now that the data has been formatted and prepared for the machine learning model, proceed to the next chapter, where you will build the actual model.