	---
	# For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
	# Doc / guide: https://huggingface.co/docs/hub/model-cards
	library_name: scvi-tools
	license: mit
	tags:
	- biology
	- genomics
	- single-cell
	- model_cls_name:SCVI
	- scvi_version:1.2.0
	- anndata_version:0.11.1
	- modality:rna
	- tissue:None
	- annotated:True
	---

	# Model Card for Tahoe-100M-SCVI-v1

	<!-- Provide a quick summary of what the model is/does. -->

	An SCVI model and minified AnnData of the [Tahoe-100M](https://doi.org/10.1101/2025.02.20.639398) dataset from Vevo Tx.

	## Model Details

	### Model Description

	<!-- Provide a longer summary of what this model is. -->

	Tahoe-100M-SCVI-v1

	- Developed by: Vevo Tx
	- Model type: SCVI variational autoencoder
	- License: This model is licensed under the MIT License.

	### Model Architecture

	SCVI model

	Layers: 1, Hidden Units: 128, Latent Dimensions: 10

	### Parameters
	40,390,510
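The total above can be reproduced with a short tally. This is a sketch that assumes the default scvi-tools SCVI layout, which this card does not spell out: a latent encoder, a library-size encoder, and a decoder with three gene-wise output heads, each with one hidden layer followed by an affine batch norm, plus one dispersion parameter per gene. The gene count is the `n_vars` of the minified AnnData.

```python
# Assumed layout: default scvi-tools SCVI modules (not confirmed by this card).
N_GENES = 62_710   # n_vars of the minified AnnData
N_HIDDEN = 128     # hidden units
N_LATENT = 10      # latent dimensions


def linear(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def batch_norm(n_features):
    """Learnable scale and shift of an affine batch-norm layer."""
    return 2 * n_features


# Latent encoder: genes -> hidden, then mean and variance heads.
z_encoder = linear(N_GENES, N_HIDDEN) + batch_norm(N_HIDDEN) + 2 * linear(N_HIDDEN, N_LATENT)
# Library-size encoder: genes -> hidden, then scalar mean and variance heads.
l_encoder = linear(N_GENES, N_HIDDEN) + batch_norm(N_HIDDEN) + 2 * linear(N_HIDDEN, 1)
# Decoder: latent -> hidden, then three heads that each output one value per gene.
decoder = linear(N_LATENT, N_HIDDEN) + batch_norm(N_HIDDEN) + 3 * linear(N_HIDDEN, N_GENES)
# Gene-wise dispersion: one parameter per gene.
dispersion = N_GENES

total = z_encoder + l_encoder + decoder + dispersion
print(f"{total:,}")  # 40,390,510
```

Under these assumptions, nearly all parameters sit in the layers that read or write the 62,710 genes; the latent-side layers contribute only a few thousand.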

	## Intended Use

	<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

	### Direct Use

	<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

	- Decoding Tahoe-100M data representation vectors to gene expression.
	- Encoding scRNA-seq data to Tahoe-100M cell state representation space.

	### Downstream Use

	<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->

	- Adaptation to additional scRNA-seq data

	### Intended Users

	- Computational biologists analyzing gene expression responses to drug perturbations.
	- Machine learning researchers developing methods for downstream drug response prediction.

	## Bias, Risks, and Limitations

	<!-- This section is meant to convey both technical and sociotechnical limitations. -->

	Reconstructed gene expression values may be inaccurate. Calibration analysis shows that the observed counts fall within the 95% confidence intervals of the posterior predictive distribution 97.7% of the time. However, a naive baseline that produces only zero counts achieves 97.4% on the same metric, so the coverage figure alone says little about reconstruction quality.
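To see why an all-zero baseline scores so well on this metric, consider a minimal coverage calculation. The data below are made up and only mimic a sparse count matrix in which 97.4% of entries are zero; the `coverage` helper and the interval bounds are illustrative, not the calibration code used for this model.

```python
def coverage(observed, lower, upper):
    """Fraction of observed counts that fall inside [lower, upper], inclusive."""
    hits = sum(lo <= x <= hi for x, lo, hi in zip(observed, lower, upper))
    return hits / len(observed)


# Toy data (invented for illustration): 974 zeros and 26 nonzero counts.
observed = [0] * 974 + [1] * 20 + [3] * 6
n = len(observed)

# Baseline that always predicts a count of zero: the interval is [0, 0].
zero_baseline = coverage(observed, [0] * n, [0] * n)

# Hypothetical model whose 95% interval is [0, 1] for every entry.
model_like = coverage(observed, [0] * n, [1] * n)

print(zero_baseline, model_like)
```

On data this sparse, coverage is dominated by the zero entries, so it is more informative to also check coverage restricted to nonzero counts.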

	The Tahoe-100M data is based on cancer cell lines under drug treatment, and the model is trained to represent this data. The model may not be directly applicable to other forms of scRNA-seq data, such as that from primary cells.

	Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

	## How to Get Started with the Model

	Use the code below to get started with the model.

	Loading the minified AnnData requires 41 GB of storage (saved in the `cache_dir`) and RAM. The model itself requires ~1 GB of GPU memory.

	```
	> import scvi.hub

	> tahoe_hubmodel = scvi.hub.HubModel.pull_from_huggingface_hub(
	repo_name = 'vevotx/Tahoe-100M-SCVI-v1',
	cache_dir = '/path/to/cache'
	)

	> tahoe = tahoe_hubmodel.model

	> tahoe
	SCVI model with the following parameters:
	n_hidden: 128, n_latent: 10, n_layers: 1, dropout_rate: 0.1, dispersion: gene, gene_likelihood: nb,
	latent_distribution: normal.
	Training status: Trained
	Model's adata is minified?: True

	> tahoe.adata
	AnnData object with n_obs × n_vars = 95624334 × 62710
	obs: 'sample', 'species', 'gene_count', 'tscp_count', 'mread_count', 'bc1_wind', 'bc2_wind', 'bc3_wind', 'bc1_well', 'bc2_well', 'bc3_well', 'id', 'drugname_drugconc', 'drug', 'INT_ID', 'NUM.SNPS', 'NUM.READS', 'demuxlet_call', 'BEST.LLK', 'NEXT.LLK', 'DIFF.LLK.BEST.NEXT', 'BEST.POSTERIOR', 'SNG.POSTERIOR', 'SNG.BEST.LLK', 'SNG.NEXT.LLK', 'SNG.ONLY.POSTERIOR', 'DBL.BEST.LLK', 'DIFF.LLK.SNG.DBL', 'sublibrary', 'BARCODE', 'pcnt_mito', 'S_score', 'G2M_score', 'phase', 'pass_filter', 'dataset', '_scvi_batch', '_scvi_labels', '_scvi_observed_lib_size', 'plate', 'Cell_Name_Vevo', 'Cell_ID_Cellosaur'
	var: 'gene_id', 'genome', 'SUB_LIB_ID'
	uns: '_scvi_adata_minify_type', '_scvi_manager_uuid', '_scvi_uuid'
	obsm: 'X_latent_qzm', 'X_latent_qzv', '_scvi_latent_qzm', '_scvi_latent_qzv'
	layers: 'counts'

	> # Take some random genes
	> gene_list = tahoe.adata.var.sample(10).index

	> # Take some random cells
	> cell_indices = tahoe.adata.obs.sample(10).index

	> # Decode gene expression
	> gene_expression = tahoe.get_normalized_expression(tahoe.adata[cell_indices], gene_list = gene_list)
	> print(gene_expression)
	gene_name TSPAN13 ZSCAN9 ENSG00000200991 ENSG00000224901 \
	BARCODE_SUB_LIB_ID
	73_177_027-lib_2615 0.000036 0.000005 4.255257e-10 9.856240e-08
	63_080_025-lib_2087 0.000012 0.000012 3.183158e-10 1.124618e-07
	01_070_028-lib_1543 0.000005 0.000010 1.604187e-10 1.022676e-07
	07_110_046-lib_1885 0.000035 0.000018 2.597950e-09 1.063819e-07
	93_082_010-lib_2285 0.000008 0.000009 8.147555e-10 9.102466e-08
	94_154_081-lib_2562 0.000035 0.000014 5.600219e-10 6.891351e-08
	47_102_103-lib_2596 0.000021 0.000010 7.320031e-10 1.190017e-07
	92_138_169-lib_2356 0.000038 0.000015 3.393952e-10 7.600610e-08
	35_035_133-lib_2378 0.000041 0.000004 1.503101e-10 9.447428e-08
	06_084_182-lib_2611 0.000007 0.000014 5.135248e-10 7.896663e-08

	gene_name RN7SL69P ENSG00000263301 ENSG00000269886 \
	BARCODE_SUB_LIB_ID
	73_177_027-lib_2615 2.390874e-10 1.896764e-07 7.665454e-08
	63_080_025-lib_2087 1.934646e-10 2.205981e-07 6.038700e-08
	01_070_028-lib_1543 9.687608e-11 9.900592e-08 5.225622e-08
	07_110_046-lib_1885 1.694676e-09 2.274248e-07 7.741949e-08
	93_082_010-lib_2285 6.253397e-10 2.593786e-07 7.113768e-08
	94_154_081-lib_2562 3.700961e-10 2.083358e-07 6.379186e-08
	47_102_103-lib_2596 4.534019e-10 2.551739e-07 4.840992e-08
	92_138_169-lib_2356 2.018963e-10 2.067301e-07 4.144172e-08
	35_035_133-lib_2378 8.090239e-11 1.658230e-07 3.890900e-08
	06_084_182-lib_2611 3.474709e-10 1.025397e-07 4.995985e-08
	...
	47_102_103-lib_2596 1.975285e-09 7.876221e-08 1.513182e-08
	92_138_169-lib_2356 1.214693e-09 4.208334e-08 1.091937e-08
	35_035_133-lib_2378 1.049879e-09 8.961482e-08 1.650536e-08
	06_084_182-lib_2611 2.311277e-09 5.680565e-08 1.824982e-08
	```

	## Training Details

	### Training Data

	<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

	Tahoe-100M

	Zhang, Jesse, Airol A. Ubas, Richard de Borja, Valentine Svensson, Nicole Thomas, Neha Thakar, Ian Lai, et al. 2025. “Tahoe-100M: A Giga-Scale Single-Cell Perturbation Atlas for Context-Dependent Gene Function and Cellular Modeling.” bioRxiv. https://doi.org/10.1101/2025.02.20.639398.

	### Training Procedure

	<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

	The model was trained using the SCVI `.train()` method. One plate (plate 14) of the data was held out from training and used for evaluation and criticism. A callback evaluated the reconstruction error of the training set and validation set every N minibatches rather than every epoch, since a single epoch is too large to give informative training curves. An additional callback saved snapshots of the model state at every epoch.
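The evaluation and snapshot schedule described above can be sketched without any framework. This is not the callback code used to train the model; `train`, `step_fn`, `eval_fn`, and `snapshot_fn` are hypothetical names, and the stand-in functions in the usage example do no real work.

```python
def train(n_epochs, batches_per_epoch, eval_every, step_fn, eval_fn, snapshot_fn):
    """Evaluate every `eval_every` minibatches and snapshot after every epoch."""
    history = []
    global_step = 0
    for epoch in range(n_epochs):
        for _ in range(batches_per_epoch):
            step_fn()  # one optimizer step on one minibatch
            global_step += 1
            if global_step % eval_every == 0:
                # Record reconstruction error mid-epoch, keyed by global step.
                history.append((global_step, eval_fn()))
        snapshot_fn(epoch)  # save the model state at the end of the epoch
    return history


# Usage with stand-ins: 2 epochs of 5 minibatches, evaluating every 4 steps.
snapshots = []
history = train(
    n_epochs=2,
    batches_per_epoch=5,
    eval_every=4,
    step_fn=lambda: None,
    eval_fn=lambda: {"train": 0.0, "validation": 0.0},  # placeholder values
    snapshot_fn=snapshots.append,
)
print([step for step, _ in history], snapshots)
```

Counting minibatches globally, rather than per epoch, keeps the evaluation points evenly spaced even when an epoch's length is not a multiple of N.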


	#### Training Hyperparameters

	- Training regime: fp32 precision was used for training.

	#### Speeds, Sizes, Times [optional]

	<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->

	## Evaluation

	<!-- This section describes the evaluation protocols and provides the results. -->

	### Testing Data, Factors & Metrics

	#### Testing Data

	<!-- This should link to a Dataset Card if possible. -->

	Data in the minified AnnData where the 'plate' column equals '14' was held out from training and used for evaluation and criticism.

	#### Metrics

	<!-- These are the evaluation metrics being used, ideally with a description of why. -->

	The main metric is reconstruction error, defined as the average negative log likelihood of the observed counts given the representation vectors. This model uses a negative binomial likelihood.
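As a concrete illustration of this metric, the sketch below evaluates a negative binomial log-likelihood parameterized by a mean `mu` and an inverse dispersion `theta`, then averages the negative log-likelihood over cells. The parameterization and the aggregation (sum over genes, mean over cells) are assumptions made for the example, and the counts and parameters are toy values.

```python
import math


def nb_log_prob(x, mu, theta):
    """Log-probability of count x under a negative binomial with
    mean mu > 0 and inverse dispersion theta > 0."""
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )


def reconstruction_error(counts, means, thetas):
    """Negative log-likelihood summed over genes, averaged over cells."""
    per_cell = [
        -sum(nb_log_prob(x, mu, th) for x, mu, th in zip(cell, cell_mu, thetas))
        for cell, cell_mu in zip(counts, means)
    ]
    return sum(per_cell) / len(per_cell)


# Toy example: 2 cells x 2 genes, with one inverse dispersion per gene.
counts = [[0, 3], [1, 0]]
means = [[0.5, 2.0], [1.0, 0.2]]
thetas = [1.0, 5.0]
error = reconstruction_error(counts, means, thetas)
print(error)
```

Lower values mean the decoded means and dispersions assign higher probability to the observed counts.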