---
# For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
# Doc / guide: https://huggingface.co/docs/hub/model-cards
library_name: scvi-tools
license: mit
tags:
- biology
- genomics
- single-cell
- model_cls_name:SCVI
- scvi_version:1.2.0
- anndata_version:0.11.1
- modality:rna
- tissue:None
- annotated:True
---

# Model Card for Tahoe-100M-SCVI-v1

An SCVI model and minified AnnData of the [Tahoe-100M](https://doi.org/10.1101/2025.02.20.639398) dataset from Vevo Tx.

## Model Details

### Model Description

Tahoe-100M-SCVI-v1

- **Developed by:** Vevo Tx
- **Model type:** SCVI variational autoencoder
- **License:** This model is licensed under the MIT License.

### Model Architecture

SCVI model with 1 layer, 128 hidden units, and 10 latent dimensions.

### Parameters

40,390,510

## Intended Use

### Direct Use

- Decoding Tahoe-100M representation vectors to gene expression.
- Encoding scRNA-seq data into the Tahoe-100M cell state representation space.

### Downstream Use

- Adaptation to additional scRNA-seq data.

### Intended Users

- **Computational biologists** analyzing gene expression responses to drug perturbations.
- **Machine learning researchers** developing methods for downstream drug response prediction.

## Bias, Risks, and Limitations

Reconstructed gene expression values may be inaccurate. Calibration analysis shows that the 95% intervals of the model's posterior predictive distribution contain the observed counts 97.7% of the time; however, a naive baseline that predicts only zero counts achieves 97.4% on the same metric.

The Tahoe-100M data consists of cancer cell lines under drug treatment, and the model is trained to represent this data. The model may not be directly applicable to other forms of scRNA-seq data, such as data from primary cells.

## How to Get Started with the Model

Use the code below to get started with the model. Loading the minified AnnData requires 41 GB of both storage (saved in the `cache_dir`) and RAM. The model itself requires ~1 GB of GPU memory.

```
> import scvi.hub
> tahoe_hubmodel = scvi.hub.HubModel.pull_from_huggingface_hub(
      repo_name = 'vevotx/Tahoe-100M-SCVI-v1',
      cache_dir = '/path/to/cache'
  )
> tahoe = tahoe_hubmodel.model
> tahoe
SCVI model with the following parameters:
n_hidden: 128, n_latent: 10, n_layers: 1, dropout_rate: 0.1, dispersion: gene, gene_likelihood: nb, latent_distribution: normal.
Training status: Trained
Model's adata is minified?: True
> tahoe.adata
AnnData object with n_obs × n_vars = 95624334 × 62710
    obs: 'sample', 'species', 'gene_count', 'tscp_count', 'mread_count', 'bc1_wind', 'bc2_wind', 'bc3_wind', 'bc1_well', 'bc2_well', 'bc3_well', 'id', 'drugname_drugconc', 'drug', 'INT_ID', 'NUM.SNPS', 'NUM.READS', 'demuxlet_call', 'BEST.LLK', 'NEXT.LLK', 'DIFF.LLK.BEST.NEXT', 'BEST.POSTERIOR', 'SNG.POSTERIOR', 'SNG.BEST.LLK', 'SNG.NEXT.LLK', 'SNG.ONLY.POSTERIOR', 'DBL.BEST.LLK', 'DIFF.LLK.SNG.DBL', 'sublibrary', 'BARCODE', 'pcnt_mito', 'S_score', 'G2M_score', 'phase', 'pass_filter', 'dataset', '_scvi_batch', '_scvi_labels', '_scvi_observed_lib_size', 'plate', 'Cell_Name_Vevo', 'Cell_ID_Cellosaur'
    var: 'gene_id', 'genome', 'SUB_LIB_ID'
    uns: '_scvi_adata_minify_type', '_scvi_manager_uuid', '_scvi_uuid'
    obsm: 'X_latent_qzm', 'X_latent_qzv', '_scvi_latent_qzm', '_scvi_latent_qzv'
    layers: 'counts'
> # Take some random genes
> gene_list = tahoe.adata.var.sample(10).index
> # Take some random cells
> cell_indices = tahoe.adata.obs.sample(10).index
> # Decode gene expression
> gene_expression = tahoe.get_normalized_expression(tahoe.adata[cell_indices], gene_list = gene_list)
> print(gene_expression)
gene_name             TSPAN13    ZSCAN9  ENSG00000200991  ENSG00000224901  \
BARCODE_SUB_LIB_ID
73_177_027-lib_2615  0.000036  0.000005     4.255257e-10     9.856240e-08
63_080_025-lib_2087  0.000012  0.000012     3.183158e-10     1.124618e-07
01_070_028-lib_1543  0.000005  0.000010     1.604187e-10     1.022676e-07
07_110_046-lib_1885  0.000035  0.000018     2.597950e-09     1.063819e-07
93_082_010-lib_2285  0.000008  0.000009     8.147555e-10     9.102466e-08
94_154_081-lib_2562  0.000035  0.000014     5.600219e-10     6.891351e-08
47_102_103-lib_2596  0.000021  0.000010     7.320031e-10     1.190017e-07
92_138_169-lib_2356  0.000038  0.000015     3.393952e-10     7.600610e-08
35_035_133-lib_2378  0.000041  0.000004     1.503101e-10     9.447428e-08
06_084_182-lib_2611  0.000007  0.000014     5.135248e-10     7.896663e-08

gene_name                RN7SL69P  ENSG00000263301  ENSG00000269886  \
BARCODE_SUB_LIB_ID
73_177_027-lib_2615  2.390874e-10     1.896764e-07     7.665454e-08
63_080_025-lib_2087  1.934646e-10     2.205981e-07     6.038700e-08
01_070_028-lib_1543  9.687608e-11     9.900592e-08     5.225622e-08
07_110_046-lib_1885  1.694676e-09     2.274248e-07     7.741949e-08
93_082_010-lib_2285  6.253397e-10     2.593786e-07     7.113768e-08
94_154_081-lib_2562  3.700961e-10     2.083358e-07     6.379186e-08
47_102_103-lib_2596  4.534019e-10     2.551739e-07     4.840992e-08
92_138_169-lib_2356  2.018963e-10     2.067301e-07     4.144172e-08
35_035_133-lib_2378  8.090239e-11     1.658230e-07     3.890900e-08
06_084_182-lib_2611  3.474709e-10     1.025397e-07     4.995985e-08
...
47_102_103-lib_2596  1.975285e-09     7.876221e-08     1.513182e-08
92_138_169-lib_2356  1.214693e-09     4.208334e-08     1.091937e-08
35_035_133-lib_2378  1.049879e-09     8.961482e-08     1.650536e-08
06_084_182-lib_2611  2.311277e-09     5.680565e-08     1.824982e-08
```

## Training Details

### Training Data

Tahoe-100M:

Zhang, Jesse, Airol A. Ubas, Richard de Borja, Valentine Svensson, Nicole Thomas, Neha Thakar, Ian Lai, et al. 2025. “Tahoe-100M: A Giga-Scale Single-Cell Perturbation Atlas for Context-Dependent Gene Function and Cellular Modeling.” bioRxiv. https://doi.org/10.1101/2025.02.20.639398.

### Training Procedure

The model was trained using the SCVI `.train()` method. One plate (plate 14) of the training data was held out from training to be used for evaluation and criticism.
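For reference, the held-out cells can be selected from the minified AnnData via the `plate` column in `obs`. The snippet below is an illustrative sketch rather than the original training code, and it assumes the `tahoe` object from the quick-start example above:

```
# Illustrative sketch: split the minified AnnData into the held-out
# evaluation plate (plate 14) and the cells used for training.
# AnnData views are used here, so no data is copied.
eval_mask = tahoe.adata.obs["plate"] == "14"
adata_eval = tahoe.adata[eval_mask]    # held out for evaluation and criticism
adata_train = tahoe.adata[~eval_mask]  # plates seen during training
print(adata_eval.n_obs, adata_train.n_obs)
```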
A callback was used to evaluate the reconstruction error on the training and validation sets every N minibatches rather than every epoch, since a single epoch is too large to give informative training curves. An additional callback was used to save a snapshot of the model state at every epoch.

#### Training Hyperparameters

- **Training regime:** fp32 precision was used for training.

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

Data in the minified AnnData where the 'plate' column equals '14' was held out from training and used for evaluation and criticism.

#### Metrics

The main metric is reconstruction error, defined as the average negative log likelihood of the observed counts given the representation vectors. This model uses a negative binomial likelihood.
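As a rough illustration of how this metric can be recomputed, the sketch below applies the scvi-tools `get_reconstruction_error` helper to the held-out plate. This is an assumption about how the evaluation could be reproduced, not the original evaluation code; the helper's behaviour on a minified AnnData and its sign convention (log likelihood versus negative log likelihood) should be checked against the installed scvi-tools version.

```
import numpy as np

# Illustrative sketch (not the original evaluation code): estimate the
# reconstruction error on the held-out plate, i.e. the likelihood of the
# observed counts under the negative binomial decoder, assuming `tahoe`
# from the quick-start example above.
eval_idx = np.where(tahoe.adata.obs["plate"] == "14")[0]
recon = tahoe.get_reconstruction_error(indices=eval_idx)
print(recon)
```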