M Christenson committed on
Commit d33e9c9 · verified · 1 Parent(s): 35ff507

Upload folder using huggingface_hub

Files changed (4)
  1. README.md +9 -204
  2. config.json +12 -3
  3. model.safetensors +1 -1
  4. processor_config.json +0 -0
README.md CHANGED
@@ -1,204 +1,9 @@
- # SCimilarity Model
-
- ## Model Details
-
- - **Model Name**: SCimilarity
- - **Version**: 1.0 [deeplife version]
- - **Type**: Metric learning framework for single-cell RNA-seq data
- - **Paper**: [Scalable querying of human cell atlases via a foundational model reveals commonalities across fibrosis-associated macrophages
- ](https://www.biorxiv.org/content/10.1101/2023.07.18.549537v1)
- - **Original Implementation**: [SCimilarity GitHub Repository](https://github.com/genentech/scimilarity)
-
- ## Model Description
-
- SCimilarity is a metric learning framework that learns and searches a unified and interpretable representation of single-cell RNA-seq data. It enables annotation of cell types and instant querying for cell states across tens of millions of profiles. In the context of DeepLife ML Infra, we focus on its cell embedding capabilities.
-
- ### Abstract
-
- Single-cell RNA-seq (scRNA-seq) studies have profiled over 100 million human cells across diseases, developmental stages, and perturbations to date. A singular view of this vast and growing expression landscape could help reveal novel associations between cell states and diseases, discover cell states in unexpected tissue contexts, and relate in vivo cells to in vitro models. However, these require a common, scalable representation of cell profiles from across the body, a general measure of their similarity, and an efficient way to query these data. Here, we present SCimilarity, a metric learning framework to learn and search a unified and interpretable representation that annotates cell types and instantaneously queries for a cell state across tens of millions of profiles. We demonstrate SCimilarity on a 22.7 million cell corpus assembled across 399 published scRNA-seq studies, showing accurate integration, annotation and querying. We experimentally validated SCimilarity by querying across tissues for a macrophage subset originally identified in interstitial lung disease, and showing that cells with similar profiles are found in other fibrotic diseases, tissues, and a 3D hydrogel system, which we then repurposed to yield this cell state in vitro. SCimilarity serves as a foundational model for single cell gene expression data and enables researchers to query for similar cellular states across the entire human body, providing a powerful tool for generating novel biological insights from the growing Human Cell Atlas.
-
- ### Key Features
-
- - Generates unified embeddings for single-cell expression profiles
- - Enables efficient querying and annotation across large-scale datasets
- - Generalizes to new studies without retraining
- - Supports discovery of novel cell state associations across diseases and tissues
-
- ## Intended Use
-
- SCimilarity is designed for researchers working with single-cell RNA sequencing (scRNA-seq) data. Within the DeepLife ML Infra framework, it can be used for:
-
- - Generating cell embeddings from scRNA-seq data
- - Querying for similar cell states across large datasets
- - Annotating cell types in new datasets
- - Discovering novel associations between cell states and diseases
-
- ## Training Data
-
- The model was trained on a corpus of 22.7 million cells assembled from 399 published scRNA-seq studies. For detailed information about the training data, please refer to the original paper.
-
- ## Performance
-
- SCimilarity has demonstrated:
-
- - Accurate integration and annotation across a large corpus of cells
- - Efficient querying for similar cell states across tissues and diseases
- - Ability to reveal novel biological insights, as validated experimentally
-
- For specific performance metrics, please refer to the original paper.
-
- ## Limitations
-
- - The model's performance may vary for cell types or states that are underrepresented in the training data
- - As with any embedding model, care should be taken when interpreting similarities, especially across different experimental conditions or protocols
-
- ## Ethical Considerations
-
- Users should be aware that while the data used to train SCimilarity is from public sources, it represents human tissue samples and should be treated with appropriate respect and consideration. Researchers using this model should adhere to ethical guidelines for human subjects research.
-
- ## Usage
-
- To use the SCimilarity model within the DeepLife ML Infra:
-
- 1. Install the package:
- ```
- pip install deeplife-mlinfra
- ```
-
- 2. Import and use the model:
- ```python
- import anndata as ad
- from huggingface_hub import hf_hub_download
- from dl_models.models.scimilarity.model import SCimilarityEmbedModel
- from dl_models.models.scimilarity.processor import SCimilarityProcessor
-
- # Load the model and preprocessor
- model = SCimilarityEmbedModel.from_pretrained("deeplife/scimilarity_model")
- preprocessor = SCimilarityProcessor.from_pretrained("deeplife/scimilarity_model")
- model.eval()
-
- # Load your data (example using a sample dataset)
- filepath = hf_hub_download(
-     repo_id="deeplife/h5ad_samples",
-     filename="GSE136831small.h5ad",
-     repo_type="dataset",
- )
- adata = ad.read_h5ad(filepath)
-
- # Preprocess and create a dataloader
- dataloader = preprocessor.transform_to_dataloader(adata, batch_size=256)
-
- # Get embeddings
- for batch in dataloader:
-     embed = model.get_cell_embeddings(batch)
-     break # This gets embeddings for the first batch
-
- # You can now use these embeddings for downstream tasks
- ```
-
- For visualization of the embeddings, you can use techniques like PCA or UMAP:
-
- ```python
- import numpy as np
- from sklearn.decomposition import PCA
- import matplotlib.pyplot as plt
- import umap
-
- # Convert embed to numpy
- embed_np = embed.detach().cpu().numpy()
-
- # Perform PCA
- pca = PCA(n_components=2)
- embed_pca = pca.fit_transform(embed_np)
-
- # Perform UMAP
- umap_reducer = umap.UMAP(n_components=2, random_state=42)
- embed_umap = umap_reducer.fit_transform(embed_np)
-
- # Plot the results
- fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 8))
-
- # PCA plot
- scatter1 = ax1.scatter(embed_pca[:, 0], embed_pca[:, 1], alpha=0.7)
- ax1.set_title('SCimilarity Embeddings - PCA')
- ax1.set_xlabel('PC1')
- ax1.set_ylabel('PC2')
- plt.colorbar(scatter1, ax=ax1)
-
- # UMAP plot
- scatter2 = ax2.scatter(embed_umap[:, 0], embed_umap[:, 1], alpha=0.7)
- ax2.set_title('SCimilarity Embeddings - UMAP')
- ax2.set_xlabel('UMAP1')
- ax2.set_ylabel('UMAP2')
- plt.colorbar(scatter2, ax=ax2)
-
- plt.tight_layout()
- plt.show()
- ```
-
- For more detailed usage instructions, please refer to the [documentation](https://github.com/deeplifeai/deeplife-mlinfra).
-
- ## Citation
-
- If you use this model in your research, please cite both the original SCimilarity paper and the DeepLife ML Infra package:
-
- ```
- @article{yoo2023scimilarity,
- title={SCimilarity: a scalable and universal cell state similarity metric for single cell RNA-sequencing data},
- author={Yoo, Byungjin and Nawy, Tal and Hu, Yuanjie and Szeto, Gregory L and Wuster, Arthur},
- journal={bioRxiv},
- pages={2023.07.18.549537},
- year={2023},
- publisher={Cold Spring Harbor Laboratory}
- }
-
- @software{deeplife_mlinfra,
- title={DeepLife ML Infra: Infrastructure for Biological Deep Learning Models},
- author={DeepLife AI Team},
- year={2023},
- url={https://github.com/deeplifeai/deeplife-mlinfra},
- version={1.0.0}
- }
- ```
-
- ## License
-
- ### Code License
-
- The SCimilarity code is licensed under the Apache License, Version 2.0. The full text of the license is as follows:
-
- ```
- Copyright 2023 Genentech, Inc.
-
- Licensed under the Apache License, Version 2.0 (the "License");
- you may not use this file except in compliance with the License.
- You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing, software
- distributed under the License is distributed on an "AS IS" BASIS,
- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- See the License for the specific language governing permissions and
- limitations under the License.
- ```
-
- ### Model Weights License
-
- The SCimilarity model weights are licensed under the Creative Commons Attribution Share Alike 4.0 International license. Users are free to share and adapt the material under the following terms:
-
- - Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made.
- - ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
-
- For the full text of this license, please visit: [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/)
-
- ## Additional Resources
-
- - [SCimilarity Documentation](https://genentech.github.io/scimilarity/index.html)
- - [Pretrained Model Weights and Data](https://zenodo.org/records/10685499)
-
- ## Contact
-
- For questions or issues related to this model implementation in DeepLife ML Infra, please open an issue in the [repository](https://github.com/deeplifeai/deeplife-mlinfra).
-
- For questions about the original SCimilarity model, please refer to the [original repository](https://github.com/genentech/scimilarity).
 
+ ---
+ tags:
+ - model_hub_mixin
+ - pytorch_model_hub_mixin
+ ---
+
+ This model has been pushed to the Hub using the [PytorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:
+ - Library: [More Information Needed]
+ - Docs: [More Information Needed]
 
config.json CHANGED
@@ -1,12 +1,21 @@
  {
-   "n_genes": 28231,
-   "latent_dim": 128,
+   "_class_module": "dl_models.models.scimilarity.model",
+   "_class_name": "SCimilarityEmbedModelConfig",
+   "dropout": 0.5,
+   "git_repo_info": {
+     "author": "gucky92",
+     "branch": "inference_handler_improvement",
+     "commit_message": "update to inference handlers",
+     "sha": "474710e7ac3c44febe5bd1ba303a23439321549e",
+     "timestamp": "2024-09-27T13:17:43+02:00"
+   },
    "hidden_dim": [
      1024,
      1024,
      1024
    ],
-   "dropout": 0.5,
    "input_dropout": 0.4,
+   "latent_dim": 128,
+   "n_genes": 28231,
    "residual": false
  }
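Review note: the config.json hunk mostly reorders keys, so it can be hard to see at a glance whether any hyperparameter actually changed. The sketch below compares the two sides of the hunk as plain dictionaries. The values are copied from the diff; the helper name `diff_configs` is hypothetical and is not part of `dl_models` or `huggingface_hub`.

```python
import json

# Old side of the config.json hunk, as shown in the diff.
OLD_CONFIG = json.loads("""
{
  "n_genes": 28231,
  "latent_dim": 128,
  "hidden_dim": [1024, 1024, 1024],
  "dropout": 0.5,
  "input_dropout": 0.4,
  "residual": false
}
""")

# New side of the config.json hunk, as shown in the diff.
NEW_CONFIG = json.loads("""
{
  "_class_module": "dl_models.models.scimilarity.model",
  "_class_name": "SCimilarityEmbedModelConfig",
  "dropout": 0.5,
  "git_repo_info": {
    "author": "gucky92",
    "branch": "inference_handler_improvement",
    "commit_message": "update to inference handlers",
    "sha": "474710e7ac3c44febe5bd1ba303a23439321549e",
    "timestamp": "2024-09-27T13:17:43+02:00"
  },
  "hidden_dim": [1024, 1024, 1024],
  "input_dropout": 0.4,
  "latent_dim": 128,
  "n_genes": 28231,
  "residual": false
}
""")


def diff_configs(old, new):
    """Return (added, removed, changed) top-level keys between two config dicts.

    Hypothetical helper for illustration only.
    """
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = sorted(k for k in set(old) & set(new) if old[k] != new[k])
    return added, removed, changed


added, removed, changed = diff_configs(OLD_CONFIG, NEW_CONFIG)
print("added:  ", added)
print("removed:", removed)
print("changed:", changed)
```

Under this comparison, only metadata keys (`_class_module`, `_class_name`, `git_repo_info`) are added; no existing key is removed or given a different value.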
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:d7a4124e4905dbf5dad6f99fb6f3dfb51fcd058927ce85de91ef5d41f41684ec
+ oid sha256:e6192cc824ecfa6a2ed1aa28790dad815a334cb7a594b00c77d62ce6efa7c1f5
  size 124611380
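Review note: the model.safetensors hunk changes only the Git LFS pointer's `oid`, while `size` stays at 124611380 bytes, so the weights file was re-uploaded with different contents of identical length. A minimal sketch, using only the Python standard library, of how a locally downloaded file could be checked against such a pointer. The function names `parse_lfs_pointer` and `matches_pointer` are hypothetical, and the sketch assumes the three-field pointer layout shown in the diff.

```python
import hashlib
import os


def parse_lfs_pointer(text):
    """Parse a Git LFS pointer (version / oid / size lines) into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "oid": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Return True if the file at `path` has the pointer's size and digest."""
    if os.path.getsize(path) != pointer["size"]:
        return False
    hasher = hashlib.new(pointer["algo"])
    with open(path, "rb") as handle:
        # Read in chunks so large weight files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["oid"]


# New side of the pointer, copied from the diff.
NEW_POINTER = parse_lfs_pointer("""
version https://git-lfs.github.com/spec/v1
oid sha256:e6192cc824ecfa6a2ed1aa28790dad815a334cb7a594b00c77d62ce6efa7c1f5
size 124611380
""")
```

With a local copy of the weights, `matches_pointer("model.safetensors", NEW_POINTER)` would indicate whether that copy corresponds to this commit rather than its parent.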
processor_config.json CHANGED
The diff for this file is too large to render. See raw diff