# Building a PyTorch Training Loop

To access the data on the Hugging Face Hub and build the data loaders for our
training loop, we first import the necessary libraries.

In [None]:
from datasets import load_dataset # Loading datasets from Hugging Face Hub
import torch # PyTorch
from torch.utils.data import DataLoader # PyTorch DataLoader for creating batches
from pprint import pprint # Pretty print
from tqdm import tqdm # Progress bar

In this tutorial, we work with the
[PubChemQC-B3LYP/6-31G*//PM6
Dataset](https://huggingface.co/datasets/molssiai-hub/pubchemqc-b3lyp)
(PubChemQC-B3LYP for short) from the [PubChemQC dataset
collection](https://huggingface.co/collections/molssiai-hub/pubchemqc-datasets-669e5482260861ba7cce3d1c).
After importing the modules, we set a few variables that will be used throughout
this demo and load the dataset as shown below.

In [None]:
# path to the dataset repository on the Hugging Face Hub
path = "molssiai-hub/pubchemqc-b3lyp"

# set the dataset configuration/subset name
name = "b3lyp_pm6"

# set the dataset split
split = "train"

# load the dataset
hub_dataset = load_dataset(
    path=path,
    name=name,
    split=split,
    streaming=True,
    trust_remote_code=True,
)

Here, we set the `streaming` parameter to `True` so that the data is streamed
from the Hub instead of being downloaded to disk. In this mode, the
`load_dataset` function returns an `IterableDataset` object that yields the
samples one at a time as we iterate over it. The `trust_remote_code` argument is
also set to `True` to allow the use of a custom [load
script](https://huggingface.co/datasets/molssiai-hub/pubchemqc-b3lyp/blob/main/pubchemqc-b3lyp.py)
for the data.

By default, Hugging Face dataset objects return each sample as a native Python
object (e.g., a dictionary). However, we can use the `with_format()` method to
specify the format in which we want to access the data. In our case, we want
the `torch` format, which returns numerical features as `torch.Tensor` objects,
to build the data loaders for our training loop. Let us transform our data and
check the result.

In [None]:
# set the dataset format to PyTorch tensors
hub_dataset = hub_dataset.with_format("torch")

# fetch the first data point
next(iter(hub_dataset.take(1)))

We can see that the numerical features in our data sample have been converted
to `torch.Tensor` objects. Let us access the `coordinates` field to make this
clearer.

In [None]:
# fetch the first data point
data_point = next(iter(hub_dataset.take(1)))

# print the coordinates of the first data point and its type
data_point["coordinates"], type(data_point["coordinates"])

In the code snippet above, we passed the `IterableDataset` object returned by
`hub_dataset.take(1)` to the `iter()` function to create an iterator and called
the `next()` function once on it to fetch the first sample.
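The `next(iter(...))` pattern is lazy: only the samples that are actually requested are produced. The sketch below illustrates this with plain Python. The generator `stream_records` and the list `produced` are hypothetical stand-ins for a streamed dataset, and `itertools.islice` plays the role of `take(1)`; none of these names belong to the Hugging Face API.

```python
from itertools import islice

produced = []  # records which samples were actually generated


def stream_records():
    """Hypothetical stand-in for a streamed dataset: yields samples lazily."""
    for index in range(1_000_000):
        produced.append(index)
        yield {"index": index}


# islice plays the role of `.take(1)`; `next(iter(...))` pulls a single sample
first_sample = next(iter(islice(stream_records(), 1)))

print(first_sample)   # the first record only
print(len(produced))  # how many records were generated to serve the request
```

Although the generator could yield a million records, only the first one is generated, which is why fetching a single sample from a streamed dataset is cheap.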

Our PubChemQC-B3LYP `IterableDataset` object is divided into multiple shards
to enable multiprocessing and to help with shuffling the data.

In [None]:
print(f"the PubChemQC-B3LYP dataset has {hub_dataset.n_shards} shards")

If we shuffle our data, the order of the shards is shuffled as well. This is
important to consider when building the PyTorch data loaders for our training
loop.

In [None]:
# shuffle the dataset
hub_dataset = hub_dataset.shuffle(seed=123, buffer_size=1000)

The `buffer_size` argument controls the size of a buffer from which examples
are randomly sampled. When we call the `IterableDataset.shuffle()` function with
`buffer_size=1000`, the buffer is first filled with the first thousand examples
of the dataset. Examples are then drawn at random from the buffer, and each
drawn example is replaced with the next example from the dataset. The
`buffer_size` argument is set to 1000 by default.
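To make the buffer mechanism more concrete, here is a minimal pure-Python sketch of a shuffle buffer. It illustrates the idea only and is not the library's actual implementation; the function name `buffered_shuffle` and the toy stream `range(10)` are assumptions made for this example.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def buffered_shuffle(stream: Iterable[T], buffer_size: int, seed: int) -> Iterator[T]:
    """Approximately shuffle a stream using a fixed-size buffer (illustrative only)."""
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in stream:
        # fill the buffer with the first `buffer_size` examples
        if len(buffer) < buffer_size:
            buffer.append(item)
            continue
        # draw a random example from the buffer and replace it with the new one
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # the stream is exhausted: emit the remaining examples in random order
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffered_shuffle(range(10), buffer_size=4, seed=123))
print(shuffled)
```

Because the first example that is emitted must come from the initial buffer, a small `buffer_size` only mixes examples locally, while a larger one approaches a full shuffle at the cost of more memory.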

A nice feature of the Hugging Face dataset objects is that they can be directly
passed to PyTorch `DataLoader` objects, as shown below.

In [None]:
# create a PyTorch DataLoader with a batch size of 4
dataloader = DataLoader(hub_dataset, batch_size=4, collate_fn=lambda x: x)

By default, the `DataLoader` object uses a default collate function that stacks
the samples of a batch into `torch.Tensor` objects. For our dataset, however,
the default collate function fails because our data samples do not have the
same length (different molecules may have different numbers of atoms and
coordinates). To circumvent this problem, we pass a lambda function as
`collate_fn` that returns the batch unchanged, that is, as a list of data
points, each of which is a dictionary.
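An alternative to returning the samples untouched is to pad the variable-length fields to a common length. The sketch below is one possible approach under stated assumptions: each sample stores `coordinates` as a flat one-dimensional tensor, and the names `toy_samples`, `pad_collate`, and the `mask` key are hypothetical and introduced only for this illustration. It uses a small in-memory list instead of the streamed dataset so that it runs without network access.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# hypothetical toy samples: flattened xyz coordinates for 1, 2, and 3 atoms
toy_samples = [
    {"coordinates": torch.arange(3, dtype=torch.float32)},
    {"coordinates": torch.arange(6, dtype=torch.float32)},
    {"coordinates": torch.arange(9, dtype=torch.float32)},
]


def pad_collate(samples):
    """Pad the `coordinates` of each sample to the longest one in the batch."""
    coords = [sample["coordinates"] for sample in samples]
    lengths = torch.tensor([c.shape[0] for c in coords])
    # shape: (batch_size, max_length); shorter samples are padded with zeros
    padded = pad_sequence(coords, batch_first=True, padding_value=0.0)
    # boolean mask that is True for real values and False for padding
    mask = torch.arange(padded.shape[1])[None, :] < lengths[:, None]
    return {"coordinates": padded, "mask": mask}


toy_loader = DataLoader(toy_samples, batch_size=3, collate_fn=pad_collate)
toy_batch = next(iter(toy_loader))
print(toy_batch["coordinates"].shape)
print(toy_batch["mask"].sum(dim=1))
```

The mask lets a model ignore the padded entries. Whether padding or the identity collator is preferable depends on how the downstream model consumes the batch.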

Similar to the `hub_dataset`, we can also wrap the `dataloader` object inside an
iterator and use the `next()` function to access the first batch of data.

In [None]:
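# fetch the first batch, which is a list of four data points
batch = next(iter(dataloader))

# print the coordinates of the first data point in the batch
batch[0]["coordinates"]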

## Building a Training Loop in PyTorch

Now that we know how to access, fetch, and shuffle batches of data samples in our
PyTorch data loader, we can build a simple training loop to train a model.

In [None]:
# set up the training loop
for epoch in range(1, 4):

    # set the epoch so that the data is reshuffled in each epoch
    hub_dataset.set_epoch(epoch)

    # iterate over the batches in the DataLoader
    for i, batch in enumerate(tqdm(dataloader, total=4, desc=f"Epoch {epoch}")):
        # after four batches, print one SMILES string and end the epoch
        if i == 4:
            pprint(
                f"The isomeric SMILES from the first data point of batch {i + 1}: "
                f"{batch[0]['pubchem-isomeric-smiles']}",
                width=100,
                compact=True,
            )
            break
        print(f"Epoch: {epoch}, Batch: {i + 1}, Batch size: {len(batch)}")

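In the code snippet above, we called the `set_epoch(epoch)` function at the
beginning of each epoch. This function is often used with PyTorch data loaders
and in distributed settings: the epoch number is combined with the random seed
passed to `shuffle()`, so that the data is reshuffled in a different but
reproducible order in each epoch.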
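The effect of the epoch on the shuffling order can be sketched in plain Python. This is a simplified illustration that assumes the effective seed is the initial seed plus the current epoch; the helper `shard_order` and the shard count of 8 are hypothetical and do not mirror the library's internals.

```python
import random
from typing import List


def shard_order(n_shards: int, seed: int, epoch: int) -> List[int]:
    """Return a shard ordering derived from the seed and the epoch (illustrative)."""
    # assumption: the effective seed is the initial seed plus the epoch number
    rng = random.Random(seed + epoch)
    order = list(range(n_shards))
    rng.shuffle(order)
    return order


for epoch in range(1, 4):
    print(f"Epoch {epoch}: {shard_order(n_shards=8, seed=123, epoch=epoch)}")
```

Re-running the loop prints the same orderings again, because each ordering depends only on the seed and the epoch number. This is what makes per-epoch reshuffling reproducible across runs and consistent across workers.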