Datasets
========

The speechlm2 collection supports datasets that contain both audio and text data for training models that can understand speech and generate appropriate responses. This section describes the dataset format, preparation, and usage with the speechlm2 models.

Dataset Format
--------------

Duplex S2S models use the Lhotse framework for audio data management. The primary dataset classes are:

1. **DuplexS2SDataset**: For general duplex speech-to-speech models.
2. **SALMDataset**: Specifically for the Speech-Augmented Language Model (SALM), which processes speech and text inputs and produces text outputs.

DuplexS2S Dataset Structure
^^^^^^^^^^^^^^^^^^^^^^^^^^^

A typical dataset for speechlm2 models consists of:

1. **Audio files**: Source audio (input speech) and, possibly, target audio (output speech).
2. **Text transcriptions**: Associated text for both input and output speech.
3. **Role identifiers**: To distinguish between speakers (e.g., "user" vs. "agent").

The dataset is organized around conversation turns, with each turn containing audio and text from either a user or an agent/assistant.

The datasets are primarily managed using Lhotse's CutSet format, which provides efficient handling of audio data and annotations. A typical Lhotse manifest includes:

- Audio recording information (path, duration, sample rate)
- Supervision information (transcripts, speaker roles, timing)
- Optional additional annotations

Example of a Lhotse cut:

.. code-block:: python

    {
        "id": "conversation_1",
        "start": 0,
        "duration": 10.7,
        "channel": 0,
        "supervisions": [
            {
                "id": "conversation_1_turn_0",
                "text": "Can you help me with this problem?",
                "start": 0,
                "duration": 5.2,
                "speaker": "user"
            },
            {
                "id": "conversation_1_turn_1",
                "text": "I can help you with that.",
                "start": 5.2,
                "duration": 3.1,
                "speaker": "assistant"
            }
        ],
        "recording": {
            "id": "conversation_1_user",
            "path": "/path/to/audio/conversation_1_user.wav",
            "sampling_rate": 16000,
            "num_samples": 171200,
            "duration": 10.7
        },
        "custom": {
            "target_audio": {
                "id": "conversation_1_assistant",
                "path": "/path/to/audio/conversation_1_assistant.wav",
                "sampling_rate": 22050,
                "num_samples": 235935,
                "duration": 10.7
            }
        }
    }

The DuplexS2SDataset performs several key operations when processing data:

1. **Turn Identification**: Each cut contains a list of ``supervisions`` with objects of type ``lhotse.SupervisionSegment`` that represent conversation turns with the corresponding text and speaker information.
2. **Speaker Role Separation**: The text of each supervision is tokenized and treated as the model's output (when ``supervision.speaker`` is in ``output_roles``, e.g., "agent" or "Assistant") or as the model's input (when it is in ``input_roles``, e.g., "user" or "User").
3. **Token Sequence Generation**:

   - ``target_tokens`` and ``source_tokens`` arrays are created with a length equal to ``lhotse.utils.compute_num_frames(cut.duration, frame_length, cut.sampling_rate)``.
   - The ``frame_length`` parameter (typically 80 ms) determines the temporal resolution of token assignments.
   - Each token is assigned to a position based on the timing of its corresponding audio segment.

4. **Token Offset Calculation**:

   - The starting position for each turn's tokens is determined using ``lhotse.utils.compute_num_frames(supervision.start, frame_length, cut.sampling_rate)`` (see the sketch below).
   - This ensures tokens are aligned with their corresponding audio segments.

5. **Length Validation**:

   - If token sequences are too long compared to the audio duration, warnings are emitted.
   - Tokens that extend beyond the audio length are truncated.

This process ensures that the model can correctly align audio input with the corresponding text and learn to generate appropriate responses based on the conversation context.
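To make the frame arithmetic above concrete, here is a minimal sketch (not the dataset's actual implementation) that computes the total number of token frames per cut and the frame span of each turn with ``lhotse.utils.compute_num_frames``; the manifest path is hypothetical and ``frame_length`` matches the typical 0.08 s configuration.

.. code-block:: python

    # Minimal sketch (not the dataset implementation) of the frame arithmetic above:
    # total token-frame count for a cut and the frame span of each conversation turn.
    from lhotse import CutSet
    from lhotse.utils import compute_num_frames

    frame_length = 0.08  # seconds, i.e. 80 ms per token frame

    cuts = CutSet.from_file("path/to/manifest.jsonl.gz")  # hypothetical path
    for cut in cuts:
        total_frames = compute_num_frames(cut.duration, frame_length, cut.sampling_rate)
        for sup in cut.supervisions:
            start_frame = compute_num_frames(sup.start, frame_length, cut.sampling_rate)
            num_frames = compute_num_frames(sup.duration, frame_length, cut.sampling_rate)
            # This turn's tokens would occupy positions [start_frame, start_frame + num_frames)
            # within the total_frames-long source/target token arrays.
            print(sup.speaker, start_frame, num_frames, total_frames)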
DuplexS2SDataset
****************

This dataset class is designed for models that handle both speech understanding and speech generation. It processes audio inputs and prepares them for the model along with the corresponding text.

.. code-block:: python

    from nemo.collections.speechlm2.data import DuplexS2SDataset

    dataset = DuplexS2SDataset(
        tokenizer=model.tokenizer,            # Text tokenizer
        frame_length=0.08,                    # Frame length in seconds
        source_sample_rate=16000,             # Input audio sample rate
        target_sample_rate=22050,             # Output audio sample rate
        input_roles=["user", "User"],         # Roles considered as input
        output_roles=["agent", "Assistant"]   # Roles considered as output
    )

SALMDataset Structure
^^^^^^^^^^^^^^^^^^^^^

Data used for SALM can be either regular speech-to-text data (in any NeMo or Lhotse format) or a dataset of multi-turn conversations. For the most part, please refer to `the Configuring multimodal dataloading section `_ in the ASR documentation.

When using speech-to-text data, you'll need to read it with a special ``lhotse_as_conversation`` data reader that creates two-turn (query + response) multimodal conversation examples out of regular Lhotse cuts. This approach makes SALM training more flexible, allowing straightforward combination of single-turn and multi-turn data. Each audio turn is represented by a single token, defined by the ``audio_locator_tag`` property and automatically added to the model's tokenizer inside the model code. During the training/generation pass this token is replaced with the representation of its corresponding audio segment (see the conceptual sketch below).

Example YAML configuration using existing ASR datasets with ``lhotse_as_conversation``:

.. code-block:: yaml

    data:
      train_ds:
        prompt_format: "llama3"  # Choose based on your model
        token_equivalent_duration: 0.08
        input_cfg:
          # Example 1: Using standard ASR Lhotse manifests (JSONL)
          - type: lhotse_as_conversation
            cuts_path: /path/to/librispeech_train_clean_100.jsonl.gz
            audio_locator_tag: "<|audioplaceholder|>"
            tags:
              context: "Transcribe the following audio:"
              # Optional system prompt can be uncommented
              # system_prompt: "You are a helpful assistant that transcribes audio accurately."

          # Example 2: Using tarred NeMo manifests
          - type: lhotse_as_conversation
            manifest_filepath: /path/to/tedlium_train_manifest.jsonl.gz
            tarred_audio_filepaths: /path/to/tedlium_shards/shard-{000000..000009}.tar
            audio_locator_tag: "<|audioplaceholder|>"
            tags:
              context: "Write down what is said in this recording:"

          # Example 3: Using Lhotse SHAR format
          - type: lhotse_as_conversation
            shar_path: /path/to/fisher_shar/
            audio_locator_tag: "<|audioplaceholder|>"
            tags:
              context: "Listen to this clip and write a transcript:"
        # ... other settings

Alternatively, one can provide an existing YAML file with their dataset composition and wrap it in a ``lhotse_as_conversation`` reader as follows:

.. code-block:: yaml

    data:
      train_ds:
        input_cfg:
          - type: lhotse_as_conversation
            input_cfg: /path/to/dataset_config.yaml
            audio_locator_tag: "<|audioplaceholder|>"
            tags:
              context: "Transcribe the following audio:"
              # Optional system prompt can be uncommented
              # system_prompt: "You are a helpful assistant that transcribes audio accurately."

The ``lhotse_as_conversation`` reader automatically creates a two-turn conversation from each ASR example:

1. Optionally, if a ``system_prompt`` tag is provided, it is added as a special system turn for LLM models that support system prompts.
2. A user turn containing the audio and a text context (from the ``context`` tag).
3. An assistant turn containing the transcription (from the cut's supervision text).

If a ``context`` tag is provided in the configuration, it is added as a text turn before the audio.
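The following is a conceptual sketch only, not the SALM implementation: it shows how an audio locator token can be replaced at the embedding level by splitting the tokenized conversation at each placeholder position and splicing in the audio representation. All names in this snippet are illustrative.

.. code-block:: python

    # Conceptual sketch (not the SALM implementation): splice audio representations into
    # a sequence of text embeddings wherever the audio locator token appears.
    import torch

    def splice_audio_into_text(text_embeds, token_ids, audio_embeds, audio_locator_id):
        """
        text_embeds:      (T_text, H) embeddings of the tokenized conversation
        token_ids:        (T_text,) token ids, containing audio_locator_id once per audio turn
        audio_embeds:     list of (T_audio_i, H) tensors, one per audio turn, in order
        audio_locator_id: id of the audio placeholder token
        """
        pieces, audio_idx = [], 0
        for pos, tok in enumerate(token_ids.tolist()):
            if tok == audio_locator_id:
                pieces.append(audio_embeds[audio_idx])     # expand placeholder to T_audio_i frames
                audio_idx += 1
            else:
                pieces.append(text_embeds[pos : pos + 1])  # keep the text embedding as-is
        return torch.cat(pieces, dim=0)                    # (T_text - N_audio + sum(T_audio_i), H)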
SALMDataset
***********

This dataset class is specialized for the SALM model, which focuses on understanding speech input and generating text output.

.. code-block:: python

    from nemo.collections.speechlm2.data import SALMDataset

    dataset = SALMDataset(
        tokenizer=model.tokenizer,  # Text tokenizer
    )

DataModule
----------

The DataModule class in the speechlm2 collection manages dataset loading, preparation, and batching for PyTorch Lightning training:

.. code-block:: python

    from nemo.collections.speechlm2.data import DataModule

    datamodule = DataModule(
        cfg_data,                   # Configuration dictionary for data
        tokenizer=model.tokenizer,  # Text tokenizer
        dataset=dataset             # Instance of DuplexS2SDataset or SALMDataset
    )

The DataModule takes care of:

1. Setting up proper data parallel ranks for dataloaders
2. Instantiating the dataloaders with configuration from YAML
3. Managing multiple datasets for validation/testing

Bucketing for Efficient Training
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The DataModule supports bucketing for more efficient training. Bucketing groups samples of similar lengths together, which reduces padding and improves training efficiency.

The key bucketing parameters are:

1. **batch_duration**: Target cumulative duration (in seconds) of samples in a batch
2. **bucket_duration_bins**: List of duration thresholds for bucketing (these can be estimated from your data; see the sketch below)
3. **use_bucketing**: Flag to enable/disable bucketing
4. **num_buckets**: Number of buckets to create
5. **bucket_buffer_size**: Number of samples to load into memory for bucket assignment

Example bucketing configuration:

.. code-block:: yaml

    train_ds:
      # ... other settings
      batch_duration: 100  # Target 100 seconds per batch
      bucket_duration_bins: [8.94766, 10.1551, 11.64118, 19.30376, 42.85]  # Duration thresholds
      use_bucketing: true  # Enable bucketing
      num_buckets: 5  # Create 5 buckets
      bucket_buffer_size: 5000  # Buffer size for bucket assignment

When bucketing is enabled:

1. Samples are grouped into buckets based on their duration
2. Each batch contains samples from the same bucket
3. The actual batch size can vary to maintain a consistent total duration
4. The target ``batch_duration`` ensures efficient GPU memory usage

Bucketing helps to:

- Reduce padding and increase effective batch size
- Improve training efficiency and convergence
- Manage memory usage with variable-length inputs
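Rather than hand-tuning ``bucket_duration_bins``, the thresholds can be estimated from your own data. Below is a sketch that assumes a recent Lhotse version providing the ``estimate_duration_buckets`` helper; the manifest path is hypothetical.

.. code-block:: python

    # Sketch: estimate bucket_duration_bins from training cuts instead of hand-tuning them.
    # Assumes a recent Lhotse version that ships estimate_duration_buckets.
    from lhotse import CutSet
    from lhotse.dataset.sampling.dynamic_bucketing import estimate_duration_buckets

    cuts = CutSet.from_file("path/to/train_manifest.jsonl.gz")  # hypothetical path
    bins = estimate_duration_buckets(cuts, num_buckets=5)
    # Paste the resulting thresholds into bucket_duration_bins in the YAML config.
    print([round(b, 5) for b in bins])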
Data Configuration
------------------

A typical data configuration in YAML includes:

.. code-block:: yaml

    data:
      train_ds:
        sample_rate: ${data.target_sample_rate}
        input_cfg:
          - type: lhotse_shar
            shar_path: /path/to/train_data
        seed: 42
        shard_seed: "randomized"
        num_workers: 4
        # Optional bucketing settings
        batch_duration: 100
        bucket_duration_bins: [8.94766, 10.1551, 11.64118, 19.30376, 42.85]
        use_bucketing: true
        num_buckets: 5
        bucket_buffer_size: 5000
        # batch_size: 4  # alternative to bucketing

      validation_ds:
        datasets:
          val_set_name_0:
            shar_path: /path/to/validation_data_0
          val_set_name_1:
            shar_path: /path/to/validation_data_1
        sample_rate: ${data.target_sample_rate}
        batch_size: 4
        seed: 42
        shard_seed: "randomized"

Note that the actual dataset paths and blend are defined by the YAML config, not Python code. This makes it easy to change the dataset composition without modifying the code. To learn more about the YAML data config, see :ref:`the Extended multi-dataset configuration format ` section in the ASR documentation.

Preparing S2S Datasets
----------------------

Creating Lhotse Manifests
^^^^^^^^^^^^^^^^^^^^^^^^^

To prepare your own dataset, you'll need to create Lhotse manifests from your audio files and transcripts:

.. code-block:: python

    from lhotse import CutSet, Recording, SupervisionSegment

    # Create a recording for the user and the assistant
    recording_user = Recording(
        id="conversation_1_user",
        path="/path/to/audio/conversation_1_user.wav",
        sampling_rate=16000,
        num_samples=171200,
        duration=10.7
    )
    recording_assistant = Recording(
        id="conversation_1_assistant",
        path="/path/to/audio/conversation_1_assistant.wav",
        sampling_rate=22050,
        num_samples=235935,
        duration=10.7
    )

    # Create supervisions (conversation turns) for the user recording
    supervisions = [
        SupervisionSegment(
            id="conversation_1_turn_0",
            recording_id="conversation_1_user",
            start=0,
            duration=5.2,
            text="Can you help me with this problem?",
            speaker="user"
        ),
        SupervisionSegment(
            id="conversation_1_turn_1",
            recording_id="conversation_1_user",
            start=5.5,
            duration=3.1,
            text="I can help you with that.",
            speaker="assistant"
        ),
    ]

    # Create a CutSet.
    # The assistant's response is located in the target_audio field, which makes it easy to replace
    # when using multiple models or speakers for synthetic data generation.
    cut = recording_user.to_cut()
    cut.supervisions = supervisions
    cut.target_audio = recording_assistant
    cutset = CutSet.from_cuts([cut])

    # Save to disk
    cutset.to_file("path/to/manifest.jsonl.gz")

Converting to SHAR Format
^^^^^^^^^^^^^^^^^^^^^^^^^

For efficient training, it's recommended to convert your Lhotse manifests to SHAR (SHarded ARchive) format:

.. code-block:: python

    from lhotse import CutSet

    cutset = CutSet.from_file("path/to/manifest.jsonl.gz")
    cutset.to_shar(
        "path/to/train_shar",
        fields={"recording": "flac", "target_audio": "flac"},
        shard_size=100
    )
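To sanity-check the export, the shards can be read back with Lhotse. The sketch below assumes a recent Lhotse version (``CutSet.from_shar`` and the auto-generated ``load_target_audio()`` loader for the custom field) and reuses the hypothetical paths from above.

.. code-block:: python

    # Sketch: read the SHAR shards back and verify that both audio fields are loadable.
    # Assumes a recent Lhotse version; paths are the hypothetical ones used above.
    from lhotse import CutSet

    cuts = CutSet.from_shar(in_dir="path/to/train_shar")
    for cut in cuts:
        source_audio = cut.load_audio()         # user/source channel
        target_audio = cut.load_target_audio()  # custom "target_audio" field (agent/target channel)
        print(cut.id, source_audio.shape, target_audio.shape)
        break  # inspect just the first cut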
Training with Prepared Datasets
-------------------------------

Once your datasets are prepared, you can use them to train a model:

.. code-block:: python

    from omegaconf import OmegaConf

    from nemo.collections.speechlm2.data import DataModule, DuplexS2SDataset

    # Load configuration
    config_path = "path/to/config.yaml"
    cfg = OmegaConf.load(config_path)

    # The training data paths are available in the config file:
    # cfg.data.train_ds.input_cfg[0].shar_path = "path/to/train_shar"

    # Create dataset and datamodule
    # (assumes `model` and `trainer` have been constructed beforehand)
    dataset = DuplexS2SDataset(
        tokenizer=model.tokenizer,
        frame_length=cfg.data.frame_length,
        source_sample_rate=cfg.data.source_sample_rate,
        target_sample_rate=cfg.data.target_sample_rate,
        input_roles=cfg.data.input_roles,
        output_roles=cfg.data.output_roles,
    )
    datamodule = DataModule(cfg.data, tokenizer=model.tokenizer, dataset=dataset)

    # Train the model
    trainer.fit(model, datamodule)

Example S2S Datasets
--------------------

While there are no publicly available datasets specifically formatted for Duplex S2S models yet, you can adapt conversational datasets with audio recordings such as:

1. Fisher Corpus
2. Switchboard Corpus
3. CallHome
4. Synthetic conversation datasets generated using TTS

You would need to format these datasets as Lhotse manifests with appropriate speaker role annotations (as sketched below) to use them with the speechlm2 S2S models.
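For example, here is a minimal sketch of such an adaptation: it remaps the corpus-specific speaker labels of an existing two-speaker Lhotse manifest onto the roles expected by ``DuplexS2SDataset``. The manifest paths and the ``ROLE_MAP`` labels are hypothetical and depend on your corpus.

.. code-block:: python

    # Sketch: remap corpus-specific speaker labels onto the roles expected by DuplexS2SDataset.
    # The manifest paths and the label mapping are hypothetical; adjust them to your corpus.
    from lhotse import CutSet

    ROLE_MAP = {"A": "user", "B": "agent"}

    def assign_roles(cut):
        for sup in cut.supervisions:
            sup.speaker = ROLE_MAP.get(sup.speaker, sup.speaker)
        return cut

    cuts = CutSet.from_file("path/to/source_corpus_manifest.jsonl.gz").map(assign_roles)
    cuts.to_file("path/to/source_corpus_with_roles.jsonl.gz")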