avelezarce committed
Commit dab34db · verified · 1 Parent(s): 8180984

Update README.md

Files changed (1)
  1. README.md +44 -1
@@ -4,6 +4,43 @@ tags:
  - single-cell
  - biology
  ---
+ # scGPT
+ scGPT is a foundation model for single-cell biology, based on a generative pretrained transformer trained across a repository of over 33 million cells.
+
+ # Abstract
+ Generative pretrained models have achieved remarkable success in various domains such as language and computer vision. Specifically, the combination of large-scale diverse datasets and pretrained transformers has emerged as a promising approach for developing foundation models. Drawing parallels between language and cellular biology (in which texts comprise words; similarly, cells are defined by genes), our study probes the applicability of foundation models to advance cellular biology and genetic research. Using burgeoning single-cell sequencing data, we have constructed a foundation model for single-cell biology, scGPT, based on a generative pretrained transformer across a repository of over 33 million cells. Our findings illustrate that scGPT effectively distills critical biological insights concerning genes and cells. Through further adaptation of transfer learning, scGPT can be optimized to achieve superior performance across diverse downstream applications. This includes tasks such as cell type annotation, multi-batch integration, multi-omic integration, perturbation response prediction and gene network inference.
+
+ # Code
+
+ ```python
+ from tdc.multi_pred.anndata_dataset import DataLoader
+ from tdc import tdc_hf_interface
+ from tdc.model_server.tokenizers.scgpt import scGPTTokenizer
+ import torch
+
+ scgpt = tdc_hf_interface("scGPT")
+ model = scgpt.load()  # this line can cause a segmentation fault
+ tokenizer = scGPTTokenizer()
+ gene_ids = adata.var["feature_name"].to_numpy()  # gene names as a NumPy array; `adata` is assumed to be an AnnData object loaded beforehand
+ tokenized_data = tokenizer.tokenize_cell_vectors(
+     adata.X.toarray(), gene_ids)  # tokenize the dense expression matrix
+ embeds = model(torch.tensor([x[1] for x in tokenized_data])).last_hidden_state  # embed the second element of each tokenized cell
+ ```
+
+ # TDC Citation
+ ```
+ @inproceedings{
+ velez-arce2024signals,
+ title={Signals in the Cells: Multimodal and Contextualized Machine Learning Foundations for Therapeutics},
+ author={Alejandro Velez-Arce and Kexin Huang and Michelle M Li and Xiang Lin and Wenhao Gao and Bradley Pentelute and Tianfan Fu and Manolis Kellis and Marinka Zitnik},
+ booktitle={NeurIPS 2024 Workshop on AI for New Drug Modalities},
+ year={2024},
+ url={https://openreview.net/forum?id=kL8dlYp6IM}
+ }
+ ```
+ # Additional Citations
+
+ - Cui, H., Wang, C., Maan, H. et al. scGPT: toward building a foundation model for single-cell multi-omics using generative AI. Nat Methods 21, 1470–1480 (2024). https://doi.org/10.1038/s41592-024-02201-0

  All rights belong to:

@@ -14,4 +51,10 @@ All rights belong to:
  year={2020},
  eprint={2002.12328},
  primaryClass={cs.CL}
- }
+ }
+
+ # Model Homepage
+ https://huggingface.co/metehergul/scgpt
+
+ # Model Github
+ https://github.com/bowang-lab/scGPT
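
The inference snippet added to the README stacks `x[1]` from every tokenized cell into a single tensor, which only works when all cells yield vectors of the same length. The following is a minimal, dependency-free sketch of one way to pad such vectors to a common length and derive a matching mask before batching. It assumes each tokenized cell is a `(gene_ids, values)` pair, as the `x[1]` indexing suggests; `pad_cells`, `pad_value`, and the toy data are hypothetical and are not part of TDC or scGPT.

```python
from typing import List, Sequence, Tuple


def pad_cells(
    cells: Sequence[Tuple[Sequence[int], Sequence[float]]],
    pad_value: float = 0.0,
) -> Tuple[List[List[float]], List[List[bool]]]:
    """Pad per-cell value vectors to a common length and build a mask.

    Each cell is assumed to be a (gene_ids, values) pair. The mask is True
    for real entries and False for padding.
    """
    max_len = max((len(values) for _, values in cells), default=0)
    padded: List[List[float]] = []
    mask: List[List[bool]] = []
    for _, values in cells:
        n_pad = max_len - len(values)
        padded.append([float(v) for v in values] + [pad_value] * n_pad)
        mask.append([True] * len(values) + [False] * n_pad)
    return padded, mask


# Toy stand-in for tokenizer output: two cells with different lengths.
toy_cells = [([3, 7, 9], [1.0, 2.5, 0.5]), ([4], [6.0])]
padded_values, attention_mask = pad_cells(toy_cells)
```

The resulting rectangular lists can be handed to `torch.tensor(...)` without a ragged-shape error. Whether the loaded model accepts a mask argument is not confirmed by the README, so the mask here is illustrative only.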