Datasets
========

The speechlm2 collection supports datasets that contain both audio and text data for training models that can understand speech and generate appropriate responses.
This section describes the dataset format, preparation, and usage with the speechlm2 models.

Dataset Format
--------------

Duplex S2S models use the Lhotse framework for audio data management. The primary dataset classes are:

1. **DuplexS2SDataset**: For general duplex speech-to-speech models.
2. **SALMDataset**: Specifically for the Speech-Augmented Language Model (SALM), which processes speech and text inputs and produces text outputs.
DuplexS2S Dataset Structure
^^^^^^^^^^^^^^^^^^^^^^^^^^^

A typical dataset for speechlm2 models consists of:

1. **Audio files**: Source audio (input speech) and, optionally, target audio (output speech).
2. **Text transcriptions**: Associated text for both input and output speech.
3. **Role identifiers**: Labels that distinguish between speakers (e.g., "user" vs. "agent").
The dataset organization is built around the concept of conversation turns, with each turn containing audio and text from either a user or an agent/assistant.

The datasets are primarily managed using Lhotse's CutSet format, which provides efficient handling of audio data and annotations. A typical Lhotse manifest includes:

- Audio recording information (path, duration, sample rate)
- Supervision information (transcripts, speaker roles, timing)
- Optional additional annotations

Example of a Lhotse cut:
.. code-block:: python

    {
        "id": "conversation_1",
        "start": 0,
        "duration": 10.7,
        "channel": 0,
        "supervisions": [
            {
                "id": "conversation_1_turn_0",
                "text": "Can you help me with this problem?",
                "start": 0,
                "duration": 5.2,
                "speaker": "user"
            },
            {
                "id": "conversation_1_turn_1",
                "text": "I can help you with that.",
                "start": 5.2,
                "duration": 3.1,
                "speaker": "assistant"
            }
        ],
        "recording": {
            "id": "conversation_1_user",
            "path": "/path/to/audio/conversation_1_user.wav",
            "sampling_rate": 16000,
            "num_samples": 171200,
            "duration": 10.7
        },
        "custom": {
            "target_audio": {
                "id": "conversation_1_assistant",
                "path": "/path/to/audio/conversation_1_assistant.wav",
                "sampling_rate": 22050,
                "num_samples": 235935,
                "duration": 10.7
            }
        }
    }
The DuplexS2SDataset performs several key operations when processing data:

1. **Turn Identification**: Each cut contains a list of ``supervisions`` with objects of type ``lhotse.SupervisionSegment`` that represent conversation turns with corresponding text and speaker information.
2. **Speaker Role Separation**: The text of each supervision is tokenized and treated as the model's output (when ``supervision.speaker`` is in ``output_roles``, e.g., "agent" or "Assistant") or as the model's input (when it is in ``input_roles``, e.g., "user" or "User").
3. **Token Sequence Generation**:

   - ``target_tokens`` and ``source_tokens`` arrays are created with a length equal to ``lhotse.utils.compute_num_frames(cut.duration, frame_length, cut.sampling_rate)``.
   - The ``frame_length`` parameter (typically 80 ms) determines the temporal resolution of token assignments.
   - Each token is assigned to a position based on its corresponding audio segment's timing.

4. **Token Offset Calculation**:

   - The starting position for each turn's tokens is determined using ``lhotse.utils.compute_num_frames(supervision.start, frame_length, cut.sampling_rate)``.
   - This ensures tokens are aligned with their corresponding audio segments.

5. **Length Validation**:

   - If token sequences are too long compared to the audio duration, warnings are emitted.
   - Tokens that extend beyond the audio length are truncated.

This process ensures that the model can correctly align audio input with corresponding text, and learn to generate appropriate responses based on the conversation context.
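To make the token alignment concrete, here is a minimal sketch of the frame and offset arithmetic described above, using ``lhotse.utils.compute_num_frames``. The token ids and ``pad_id`` are made up for illustration; the actual bookkeeping happens inside ``DuplexS2SDataset``.

.. code-block:: python

    from lhotse.utils import compute_num_frames

    # Hypothetical values for illustration.
    frame_length = 0.08      # seconds of audio represented by one token position
    cut_duration = 10.7      # total cut duration in seconds
    sampling_rate = 16000
    pad_id = 0               # assumed padding token id

    # One token slot per 80 ms frame across the whole cut.
    num_frames = compute_num_frames(cut_duration, frame_length, sampling_rate)
    target_tokens = [pad_id] * num_frames

    # Place an agent turn's tokens at the frame matching its start time,
    # truncating anything that would extend past the end of the audio.
    turn_start = 5.2                # seconds
    turn_tokens = [101, 102, 103]   # made-up token ids from the tokenizer
    offset = compute_num_frames(turn_start, frame_length, sampling_rate)
    end = min(offset + len(turn_tokens), num_frames)
    target_tokens[offset:end] = turn_tokens[:end - offset]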
DuplexS2SDataset
****************

This dataset class is designed for models that handle both speech understanding and speech generation. It processes audio inputs and prepares them for the model along with corresponding text.

.. code-block:: python

    from nemo.collections.speechlm2.data import DuplexS2SDataset

    dataset = DuplexS2SDataset(
        tokenizer=model.tokenizer,            # Text tokenizer
        frame_length=0.08,                    # Frame length in seconds
        source_sample_rate=16000,             # Input audio sample rate
        target_sample_rate=22050,             # Output audio sample rate
        input_roles=["user", "User"],         # Roles considered as input
        output_roles=["agent", "Assistant"],  # Roles considered as output
    )
SALMDataset Structure
^^^^^^^^^^^^^^^^^^^^^

Data used for SALM can be either regular speech-to-text data (in any NeMo or Lhotse format) or a dataset of multi-turn conversations.
For the most part, please refer to `the Configuring multimodal dataloading section <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/datasets.html#configuring-multimodal-dataloading>`_ in the ASR documentation.

When using speech-to-text data, you'll need to read it with a special ``lhotse_as_conversation`` data reader
that creates two-turn, query+response, multi-modal conversation examples out of regular Lhotse cuts.
This approach makes SALM training more flexible, allowing straightforward combination of single-turn and multi-turn data.

Each audio turn is represented by a single token, defined by the ``audio_locator_tag`` property and automatically added to the model's tokenizer inside the model code.
This token is replaced during the training/generation pass with the representation of its corresponding audio segment.
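The sketch below is a conceptual illustration (not the SALM implementation) of what "replacing" the locator token means: the embedding at the placeholder position is swapped for the sequence of audio frame embeddings produced by the speech encoder. All ids and shapes are made up.

.. code-block:: python

    import torch

    audio_locator_id = 128002                       # hypothetical id of "<|audioplaceholder|>"
    prompt_ids = torch.tensor([1, 15, 128002, 27])  # made-up prompt token ids
    token_embeds = torch.randn(4, 512)              # embeddings of the 4 prompt tokens
    audio_embeds = torch.randn(37, 512)             # speech encoder output for the audio turn

    # Splice the audio frame embeddings in place of the placeholder token.
    pos = int((prompt_ids == audio_locator_id).nonzero(as_tuple=True)[0])
    inputs_embeds = torch.cat(
        [token_embeds[:pos], audio_embeds, token_embeds[pos + 1:]], dim=0
    )
    print(inputs_embeds.shape)  # torch.Size([40, 512])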
Example YAML configuration using existing ASR datasets with ``lhotse_as_conversation``:

.. code-block:: yaml

    data:
      train_ds:
        prompt_format: "llama3"  # Choose based on your model
        token_equivalent_duration: 0.08
        input_cfg:
          # Example 1: Using standard ASR Lhotse manifests (JSONL)
          - type: lhotse_as_conversation
            cuts_path: /path/to/librispeech_train_clean_100.jsonl.gz
            audio_locator_tag: "<|audioplaceholder|>"
            tags:
              context: "Transcribe the following audio:"
              # Optional system prompt can be uncommented
              # system_prompt: "You are a helpful assistant that transcribes audio accurately."
          # Example 2: Using tarred NeMo manifests
          - type: lhotse_as_conversation
            manifest_filepath: /path/to/tedlium_train_manifest.jsonl.gz
            tarred_audio_filepaths: /path/to/tedlium_shards/shard-{000000..000009}.tar
            audio_locator_tag: "<|audioplaceholder|>"
            tags:
              context: "Write down what is said in this recording:"
          # Example 3: Using Lhotse SHAR format
          - type: lhotse_as_conversation
            shar_path: /path/to/fisher_shar/
            audio_locator_tag: "<|audioplaceholder|>"
            tags:
              context: "Listen to this clip and write a transcript:"
        # ... other settings
Alternatively, one can provide an existing YAML file with their dataset composition and wrap
it in a ``lhotse_as_conversation`` reader as follows:

.. code-block:: yaml

    data:
      train_ds:
        input_cfg:
          - type: lhotse_as_conversation
            input_cfg: /path/to/dataset_config.yaml
            audio_locator_tag: "<|audioplaceholder|>"
            tags:
              context: "Transcribe the following audio:"
              # Optional system prompt can be uncommented
              # system_prompt: "You are a helpful assistant that transcribes audio accurately."
The ``lhotse_as_conversation`` reader automatically creates a two-turn conversation from each ASR example:

1. Optionally, if the ``system_prompt`` tag is provided, it is added as a special system turn for LLM models that support system prompts.
2. A user turn containing the audio and a text context (from the ``context`` tag).
3. An assistant turn containing the transcription (from the cut's supervision text).

If a ``context`` tag is provided in the configuration, it is added as a text turn before the audio, as illustrated in the schematic below.
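Schematically, and ignoring the concrete NeMo/Lhotse data classes involved, each ASR example therefore ends up looking roughly like the following. The paths, text, and plain-dictionary representation are made up for illustration:

.. code-block:: python

    # Schematic illustration only; the real reader produces NeMo/Lhotse
    # conversation objects, not plain dictionaries.
    conversation = [
        # Optional system turn, present only when the "system_prompt" tag is set.
        {"role": "system", "content": "You are a helpful assistant that transcribes audio accurately."},
        # User turn: the text context followed by the audio placeholder token,
        # which stands in for the cut's audio.
        {"role": "user", "content": "Transcribe the following audio: <|audioplaceholder|>"},
        # Assistant turn: the cut's supervision text.
        {"role": "assistant", "content": "hello world this is a test utterance"},
    ]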
SALMDataset
***********

This dataset class is specialized for the SALM model, which focuses on understanding speech input and generating text output.

.. code-block:: python

    from nemo.collections.speechlm2.data import SALMDataset

    dataset = SALMDataset(
        tokenizer=model.tokenizer,  # Text tokenizer
    )
DataModule
----------

The DataModule class in the speechlm2 collection manages dataset loading, preparation, and batching for PyTorch Lightning training:

.. code-block:: python

    from nemo.collections.speechlm2.data import DataModule

    datamodule = DataModule(
        cfg_data,                   # Configuration dictionary for data
        tokenizer=model.tokenizer,  # Text tokenizer
        dataset=dataset,            # Instance of DuplexS2SDataset or SALMDataset
    )

The DataModule takes care of:

1. Setting up proper data parallel ranks for dataloaders
2. Instantiating the dataloaders with configuration from YAML
3. Managing multiple datasets for validation/testing
Bucketing for Efficient Training
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The DataModule supports bucketing for more efficient training. Bucketing groups samples of similar lengths together, which reduces padding and improves training efficiency. The key bucketing parameters are:

1. **batch_duration**: Target cumulative duration (in seconds) of samples in a batch
2. **bucket_duration_bins**: List of duration thresholds for bucketing
3. **use_bucketing**: Flag to enable/disable bucketing
4. **num_buckets**: Number of buckets to create
5. **bucket_buffer_size**: Number of samples to load into memory for bucket assignment

Example bucketing configuration:

.. code-block:: yaml

    train_ds:
      # ... other settings
      batch_duration: 100  # Target 100 seconds per batch
      bucket_duration_bins: [8.94766, 10.1551, 11.64118, 19.30376, 42.85]  # Duration thresholds
      use_bucketing: true  # Enable bucketing
      num_buckets: 5  # Create 5 buckets
      bucket_buffer_size: 5000  # Buffer size for bucket assignment

When bucketing is enabled:

1. Samples are grouped into buckets based on their duration
2. Each batch contains samples from the same bucket
3. The actual batch size can vary to maintain a consistent total duration
4. The target ``batch_duration`` ensures efficient GPU memory usage

Bucketing helps to:

- Reduce padding and increase effective batch size
- Improve training efficiency and convergence
- Manage memory usage with variable-length inputs
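As a rough back-of-the-envelope illustration of how these settings interact (the real Lhotse dynamic bucketing sampler is more sophisticated), the per-bucket batch size implied by a 100-second budget can be estimated as follows:

.. code-block:: python

    # Rough estimate only; actual batch composition is decided by the sampler.
    batch_duration = 100.0
    bucket_duration_bins = [8.94766, 10.1551, 11.64118, 19.30376, 42.85]

    for upper in bucket_duration_bins:
        approx_batch_size = int(batch_duration // upper)
        print(f"bucket up to {upper:6.2f} s -> roughly {approx_batch_size} samples per batch")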
Data Configuration
------------------

A typical data configuration in YAML includes:

.. code-block:: yaml

    data:
      train_ds:
        sample_rate: ${data.target_sample_rate}
        input_cfg:
          - type: lhotse_shar
            shar_path: /path/to/train_data
        seed: 42
        shard_seed: "randomized"
        num_workers: 4
        # Optional bucketing settings
        batch_duration: 100
        bucket_duration_bins: [8.94766, 10.1551, 11.64118, 19.30376, 42.85]
        use_bucketing: true
        num_buckets: 5
        bucket_buffer_size: 5000
        # batch_size: 4  # alternative to bucketing

      validation_ds:
        datasets:
          val_set_name_0:
            shar_path: /path/to/validation_data_0
          val_set_name_1:
            shar_path: /path/to/validation_data_1
        sample_rate: ${data.target_sample_rate}
        batch_size: 4
        seed: 42
        shard_seed: "randomized"

Note that the actual dataset paths and blend are defined by the YAML config, not Python code. This makes it easy to change the dataset composition without modifying the code.

To learn more about the YAML data config, see :ref:`the Extended multi-dataset configuration format <asr-dataset-config-format>` section in the ASR documentation.
Preparing S2S Datasets
----------------------

Creating Lhotse Manifests
^^^^^^^^^^^^^^^^^^^^^^^^^

To prepare your own dataset, you'll need to create Lhotse manifests from your audio files and transcripts:
.. code-block:: python

    from lhotse import AudioSource, CutSet, Recording, SupervisionSegment

    # Create recordings for the user and assistant audio
    recording_user = Recording(
        id="conversation_1_user",
        sources=[AudioSource(type="file", channels=[0], source="/path/to/audio/conversation_1_user.wav")],
        sampling_rate=16000,
        num_samples=171200,
        duration=10.7,
    )
    recording_assistant = Recording(
        id="conversation_1_assistant",
        sources=[AudioSource(type="file", channels=[0], source="/path/to/audio/conversation_1_assistant.wav")],
        sampling_rate=22050,
        num_samples=235935,
        duration=10.7,
    )

    # Create supervisions (conversation turns) for the user recording
    supervisions = [
        SupervisionSegment(
            id="conversation_1_turn_0",
            recording_id="conversation_1_user",
            start=0,
            duration=5.2,
            text="Can you help me with this problem?",
            speaker="user",
        ),
        SupervisionSegment(
            id="conversation_1_turn_1",
            recording_id="conversation_1_user",
            start=5.5,
            duration=3.1,
            text="I can help you with that.",
            speaker="assistant",
        ),
    ]

    # Create a CutSet.
    # The assistant's response is stored in the target_audio field, which makes it easy to replace
    # when using multiple models or speakers for synthetic data generation.
    cut = recording_user.to_cut()
    cut.supervisions = supervisions
    cut.target_audio = recording_assistant
    cutset = CutSet.from_cuts([cut])

    # Save to disk
    cutset.to_file("path/to/manifest.jsonl.gz")
Converting to SHAR Format
^^^^^^^^^^^^^^^^^^^^^^^^^

For efficient training, it's recommended to convert your Lhotse manifests to SHAR (SHarded ARchive) format:

.. code-block:: python

    from lhotse import CutSet

    cutset = CutSet.from_file("path/to/manifest.jsonl.gz")
    cutset.to_shar("path/to/train_shar", fields={"recording": "flac", "target_audio": "flac"}, shard_size=100)
Training with Prepared Datasets
-------------------------------

Once your datasets are prepared, you can use them to train a model:

.. code-block:: python

    from omegaconf import OmegaConf

    from nemo.collections.speechlm2.data import DataModule, DuplexS2SDataset

    # Load configuration
    config_path = "path/to/config.yaml"
    cfg = OmegaConf.load(config_path)

    # The training data paths are available in the config file:
    # cfg.data.train_ds.input_cfg[0].shar_path = "path/to/train_shar"

    # Create dataset and datamodule
    dataset = DuplexS2SDataset(
        tokenizer=model.tokenizer,
        frame_length=cfg.data.frame_length,
        source_sample_rate=cfg.data.source_sample_rate,
        target_sample_rate=cfg.data.target_sample_rate,
        input_roles=cfg.data.input_roles,
        output_roles=cfg.data.output_roles,
    )
    datamodule = DataModule(cfg.data, tokenizer=model.tokenizer, dataset=dataset)

    # Train the model
    trainer.fit(model, datamodule)
Example S2S Datasets
--------------------

While there are no publicly available datasets specifically formatted for Duplex S2S models yet, you can adapt conversation datasets with audio recordings such as:

1. Fisher Corpus
2. Switchboard Corpus
3. CallHome
4. Synthetic conversation datasets generated using TTS

You would need to format these datasets as Lhotse manifests with appropriate speaker role annotations to use them with the speechlm2 S2S models.
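For example, a minimal sketch of remapping a two-speaker corpus to the roles expected by ``DuplexS2SDataset`` might look like the following. The speaker labels ("A"/"B"), the role mapping, and the file paths are hypothetical:

.. code-block:: python

    from lhotse import CutSet

    # Hypothetical mapping from corpus-specific speaker labels to speechlm2 roles.
    role_map = {"A": "user", "B": "agent"}

    def assign_roles(cut):
        # Rewrite each supervision's speaker label in place.
        for sup in cut.supervisions:
            sup.speaker = role_map.get(sup.speaker, sup.speaker)
        return cut

    cuts = CutSet.from_file("path/to/fisher_cuts.jsonl.gz")  # made-up path
    cuts = cuts.map(assign_roles)
    cuts.to_file("path/to/fisher_cuts_with_roles.jsonl.gz")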