---
# For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
# Doc / guide: https://huggingface.co/docs/hub/model-cards
library_name: scvi-tools
license: mit
tags:
- biology
- genomics
- single-cell
- model_cls_name:SCVI
- scvi_version:1.2.0
- anndata_version:0.11.1
- modality:rna
- tissue:None
- annotated:True
---
# Model Card for Tahoe-100M-SCVI-v1
<!-- Provide a quick summary of what the model is/does. -->
An SCVI model and minified AnnData of the [Tahoe-100M](https://doi.org/10.1101/2025.02.20.639398) dataset from Vevo Tx.
## Model Details
### Model Description
<!-- Provide a longer summary of what this model is. -->
Tahoe-100M-SCVI-v1
- **Developed by:** Vevo Tx
- **Model type:** SCVI variational autoencoder
- **License:** This model is licensed under the MIT License.
### Model Architecture
SCVI model
Layers: 1, Hidden Units: 128, Latent Dimensions: 10
### Parameters
40,390,510
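The parameter count can be reproduced from the architecture settings above. The sketch below is one decomposition that arrives at the same figure. The layer layout is an assumption about how scvi-tools builds the module, not something stated in this card: a latent encoder, a library-size encoder, three gene-sized decoder heads, batch normalization after each hidden layer, and no batch covariate fed into the layers.

```python
# Hypothetical breakdown of trainable parameters. The layer layout is an
# assumption about the scvi-tools module, not taken from this model card.
n_genes = 62_710   # n_vars of the minified AnnData
n_hidden = 128
n_latent = 10


def linear(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


batch_norm = 2 * n_hidden  # learnable scale and shift per hidden unit

# Encoder for z: one hidden layer, then mean and variance heads.
z_encoder = linear(n_genes, n_hidden) + batch_norm + 2 * linear(n_hidden, n_latent)

# Library-size encoder: same hidden layer, scalar mean and variance heads.
l_encoder = linear(n_genes, n_hidden) + batch_norm + 2 * linear(n_hidden, 1)

# Decoder: one hidden layer, three gene-sized output heads, and a
# per-gene dispersion vector.
decoder = (
    linear(n_latent, n_hidden) + batch_norm
    + 3 * linear(n_hidden, n_genes)
    + n_genes
)

total = z_encoder + l_encoder + decoder
print(f"{total:,}")  # 40,390,510
```

Under this assumed layout, almost all parameters sit in the layers whose input or output dimension is the number of genes (62,710).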
## Intended Use
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
### Direct Use
<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
- Decoding Tahoe-100M data representation vectors to gene expression.
- Encoding scRNA-seq data to Tahoe-100M cell state representation space.
### Downstream Use
<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
- Adaptation to additional scRNA-seq data
### Intended Users
- **Computational biologists** analyzing gene expression responses to drug perturbations.
- **Machine learning researchers** developing methods for downstream drug response prediction.
## Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
Reconstructed gene expression values may be inaccurate. Calibration analysis shows that the 95% intervals of the posterior predictive distribution contain the observed counts 97.7% of the time. However, a naive baseline that produces only zero counts achieves 97.4% on the same metric, so the coverage figure by itself is weak evidence of accurate reconstruction.
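The coverage metric and the zero-count baseline can be illustrated with a minimal sketch. The counts and interval bounds below are made up for illustration and are not Tahoe-100M data; `interval_coverage` is a hypothetical helper, not part of scvi-tools.

```python
def interval_coverage(observed, lower, upper):
    """Fraction of observed counts that fall inside [lower, upper]."""
    hits = sum(lo <= x <= hi for x, lo, hi in zip(observed, lower, upper))
    return hits / len(observed)


# Made-up sparse counts (most entries are zero, as in scRNA-seq data).
observed = [0, 0, 0, 3, 0, 0, 1, 0, 0, 0]

# Hypothetical model intervals, e.g. the 2.5% and 97.5% quantiles of
# posterior predictive samples for each entry.
model_lower = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
model_upper = [1, 0, 1, 4, 0, 1, 2, 0, 1, 0]

# Naive baseline: always predict exactly zero.
zeros = [0] * len(observed)

model_cov = interval_coverage(observed, model_lower, model_upper)
baseline_cov = interval_coverage(observed, zeros, zeros)
print(model_cov, baseline_cov)  # 1.0 0.8
```

On sparse data the baseline's coverage equals the fraction of zeros, which is why it scores highly without modeling any expressed gene.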
The Tahoe-100M data is based on cancer cell lines under drug treatment, and the model is trained to represent this data. The model may not be directly applicable to other forms of scRNA-seq data, such as that from primary cells.
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
## How to Get Started with the Model
Use the code below to get started with the model.
Loading the minified AnnData requires 41 GB of storage (saved in the `cache_dir`) and RAM. The model itself requires ~1 GB of GPU memory.
```
> import scvi.hub
> tahoe_hubmodel = scvi.hub.HubModel.pull_from_huggingface_hub(
    repo_name = 'vevotx/Tahoe-100M-SCVI-v1',
    cache_dir = '/path/to/cache'
  )
> tahoe = tahoe_hubmodel.model
> tahoe
SCVI model with the following parameters:
n_hidden: 128, n_latent: 10, n_layers: 1, dropout_rate: 0.1, dispersion: gene, gene_likelihood: nb,
latent_distribution: normal.
Training status: Trained
Model's adata is minified?: True
> tahoe.adata
AnnData object with n_obs × n_vars = 95624334 × 62710
obs: 'sample', 'species', 'gene_count', 'tscp_count', 'mread_count', 'bc1_wind', 'bc2_wind', 'bc3_wind', 'bc1_well', 'bc2_well', 'bc3_well', 'id', 'drugname_drugconc', 'drug', 'INT_ID', 'NUM.SNPS', 'NUM.READS', 'demuxlet_call', 'BEST.LLK', 'NEXT.LLK', 'DIFF.LLK.BEST.NEXT', 'BEST.POSTERIOR', 'SNG.POSTERIOR', 'SNG.BEST.LLK', 'SNG.NEXT.LLK', 'SNG.ONLY.POSTERIOR', 'DBL.BEST.LLK', 'DIFF.LLK.SNG.DBL', 'sublibrary', 'BARCODE', 'pcnt_mito', 'S_score', 'G2M_score', 'phase', 'pass_filter', 'dataset', '_scvi_batch', '_scvi_labels', '_scvi_observed_lib_size', 'plate', 'Cell_Name_Vevo', 'Cell_ID_Cellosaur'
var: 'gene_id', 'genome', 'SUB_LIB_ID'
uns: '_scvi_adata_minify_type', '_scvi_manager_uuid', '_scvi_uuid'
obsm: 'X_latent_qzm', 'X_latent_qzv', '_scvi_latent_qzm', '_scvi_latent_qzv'
layers: 'counts'
> # Take some random genes
> gene_list = tahoe.adata.var.sample(10).index
> # Take some random cells
> cell_indices = tahoe.adata.obs.sample(10).index
> # Decode gene expression
> gene_expression = tahoe.get_normalized_expression(tahoe.adata[cell_indices], gene_list = gene_list)
> print(gene_expression)
gene_name TSPAN13 ZSCAN9 ENSG00000200991 ENSG00000224901 \
BARCODE_SUB_LIB_ID
73_177_027-lib_2615 0.000036 0.000005 4.255257e-10 9.856240e-08
63_080_025-lib_2087 0.000012 0.000012 3.183158e-10 1.124618e-07
01_070_028-lib_1543 0.000005 0.000010 1.604187e-10 1.022676e-07
07_110_046-lib_1885 0.000035 0.000018 2.597950e-09 1.063819e-07
93_082_010-lib_2285 0.000008 0.000009 8.147555e-10 9.102466e-08
94_154_081-lib_2562 0.000035 0.000014 5.600219e-10 6.891351e-08
47_102_103-lib_2596 0.000021 0.000010 7.320031e-10 1.190017e-07
92_138_169-lib_2356 0.000038 0.000015 3.393952e-10 7.600610e-08
35_035_133-lib_2378 0.000041 0.000004 1.503101e-10 9.447428e-08
06_084_182-lib_2611 0.000007 0.000014 5.135248e-10 7.896663e-08
gene_name RN7SL69P ENSG00000263301 ENSG00000269886 \
BARCODE_SUB_LIB_ID
73_177_027-lib_2615 2.390874e-10 1.896764e-07 7.665454e-08
63_080_025-lib_2087 1.934646e-10 2.205981e-07 6.038700e-08
01_070_028-lib_1543 9.687608e-11 9.900592e-08 5.225622e-08
07_110_046-lib_1885 1.694676e-09 2.274248e-07 7.741949e-08
93_082_010-lib_2285 6.253397e-10 2.593786e-07 7.113768e-08
94_154_081-lib_2562 3.700961e-10 2.083358e-07 6.379186e-08
47_102_103-lib_2596 4.534019e-10 2.551739e-07 4.840992e-08
92_138_169-lib_2356 2.018963e-10 2.067301e-07 4.144172e-08
35_035_133-lib_2378 8.090239e-11 1.658230e-07 3.890900e-08
06_084_182-lib_2611 3.474709e-10 1.025397e-07 4.995985e-08
...
47_102_103-lib_2596 1.975285e-09 7.876221e-08 1.513182e-08
92_138_169-lib_2356 1.214693e-09 4.208334e-08 1.091937e-08
35_035_133-lib_2378 1.049879e-09 8.961482e-08 1.650536e-08
06_084_182-lib_2611 2.311277e-09 5.680565e-08 1.824982e-08
```
## Training Details
### Training Data
<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
Tahoe-100M
Zhang, Jesse, Airol A. Ubas, Richard de Borja, Valentine Svensson, Nicole Thomas, Neha Thakar, Ian Lai, et al. 2025. “Tahoe-100M: A Giga-Scale Single-Cell Perturbation Atlas for Context-Dependent Gene Function and Cellular Modeling.” bioRxiv. https://doi.org/10.1101/2025.02.20.639398.
### Training Procedure
<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
The model was trained using the SCVI `.train()` method. One plate (plate 14) was held out from training and used for evaluation and criticism. A callback evaluated the reconstruction error on the training set and validation set every N minibatches rather than every epoch, since a single epoch is too large to give informative training curves. An additional callback saved snapshots of the model state at every epoch.
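The idea of evaluating on a minibatch schedule instead of an epoch schedule can be sketched in plain Python. This is not the callback used for this model; the class, the hook name `on_train_batch_end`, and the interval of 100 are hypothetical stand-ins chosen for illustration.

```python
class EveryNStepsEvaluator:
    """Run an evaluation function every `every_n_steps` minibatches."""

    def __init__(self, evaluate, every_n_steps):
        self.evaluate = evaluate
        self.every_n_steps = every_n_steps
        self.history = []  # (global_step, metric) pairs

    def on_train_batch_end(self, global_step):
        # Evaluate only when the step count hits a multiple of N.
        if global_step % self.every_n_steps == 0:
            self.history.append((global_step, self.evaluate()))


def dummy_reconstruction_error():
    # Stand-in for computing reconstruction error on held-out cells.
    return 0.0


evaluator = EveryNStepsEvaluator(dummy_reconstruction_error, every_n_steps=100)

for step in range(1, 351):  # a toy "epoch" of 350 minibatches
    # ... one optimizer step would happen here ...
    evaluator.on_train_batch_end(step)

logged_steps = [step for step, _ in evaluator.history]
print(logged_steps)  # [100, 200, 300]
```

With tens of millions of cells per epoch, a step-based schedule like this yields many points on the training curve within a single pass over the data.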
#### Training Hyperparameters
- **Training regime:** fp32 precision was used for training.
#### Speeds, Sizes, Times [optional]
<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
## Evaluation
<!-- This section describes the evaluation protocols and provides the results. -->
### Testing Data, Factors & Metrics
#### Testing Data
<!-- This should link to a Dataset Card if possible. -->
Data in the minified AnnData where the 'plate' column equals '14' was held out from training and used for evaluation and criticism.
#### Metrics
<!-- These are the evaluation metrics being used, ideally with a description of why. -->
The main metric is reconstruction error, defined as the average negative log likelihood of the observed counts given the representation vectors. This model uses a negative binomial likelihood.
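As a rough illustration of this metric, the sketch below computes a negative binomial negative log-likelihood in plain Python. It assumes the mean and inverse-dispersion parameterization and sums over genes before averaging over cells, which is one common convention; the card does not state the exact reduction. The toy counts, means, and dispersions are made up.

```python
import math


def nb_log_prob(x, mu, theta):
    """Log-probability of count x under a negative binomial with
    mean mu > 0 and inverse dispersion theta > 0."""
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )


def reconstruction_error(counts, means, thetas):
    """Negative log-likelihood summed over genes, averaged over cells."""
    per_cell = [
        -sum(nb_log_prob(x, mu, th) for x, mu, th in zip(row_x, row_mu, thetas))
        for row_x, row_mu in zip(counts, means)
    ]
    return sum(per_cell) / len(per_cell)


# Toy example: 2 cells x 2 genes (made-up values).
counts = [[0, 2], [1, 0]]
means = [[0.5, 1.5], [1.0, 0.2]]   # decoded means for each cell and gene
thetas = [2.0, 2.0]                # one inverse dispersion per gene

error = reconstruction_error(counts, means, thetas)
print(error)
```

Lower values mean the decoded means and dispersions assign higher probability to the observed counts.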