### Running Instructions
##### 1. Set reference dataset
The application compares the target sequences with the average embedding of the reference sequences. The reference dataset can be `Omicron` or `Other`:
* If you select `Omicron`, the application uses the precomputed average embedding of [2000 Omicron sequences](https://huggingface.co/spaces/smtnkc/cov-snn-app/resolve/main/omicron.csv), so you do not need to upload a reference dataset.
* If you select `Other`, upload a CSV file with the reference sequences. The CSV file must have a ``sequence`` column. The model then generates the average embedding of the given sequences and uses it as the reference.
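For the `Other` option, a valid reference CSV can be produced with pandas. A minimal sketch, where the file name `reference.csv` and the toy sequences are purely illustrative (real inputs would be full spike protein sequences):

```python
import pandas as pd

# Two toy spike-like sequences; real inputs would be full-length
# SARS-CoV-2 spike protein sequences.
reference = pd.DataFrame({
    "sequence": [
        "MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSF",
        "MFVFLVLLPLVSSQCVNLRTRTQLPPAYTNSF",
    ]
})

# The app only requires a `sequence` column in the uploaded file.
reference.to_csv("reference.csv", index=False)
```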
##### 2. Set target dataset
Upload a CSV file with the target sequences. The CSV file must have ``accession_id`` and ``sequence`` columns.
See [the example target file](https://huggingface.co/spaces/smtnkc/cov-snn-app/resolve/main/target.csv), which includes 10 Omicron (``EPI_ISL_177...``) and 10 Eris (``EPI_ISL_189...``) sequences. Note that the selected Omicron sequences are currently circulating and are not included in the training dataset.
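A quick way to catch schema problems before uploading is to check the required columns locally. A minimal sketch with pandas; the helper `validate_target`, the file name, and the toy rows are illustrative and not part of the app:

```python
import pandas as pd

def validate_target(path: str) -> pd.DataFrame:
    """Check that a target CSV has the columns the app expects."""
    df = pd.read_csv(path)
    missing = {"accession_id", "sequence"} - set(df.columns)
    if missing:
        raise ValueError(f"target CSV is missing columns: {sorted(missing)}")
    return df

# An illustrative target file with the two required columns.
pd.DataFrame({
    "accession_id": ["EPI_ISL_0000001", "EPI_ISL_0000002"],
    "sequence": ["MFVFLVLLPLVSSQCV", "MFVFLVLLPLVSSQRV"],
}).to_csv("target.csv", index=False)

df = validate_target("target.csv")
```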
##### 3. Get output
The output will be a dataframe with the following columns:
| Column | Description |
|------------------|-------------------------------------------------------------|
| `accession_id` | Accession ID |
| `log10(sc)` | Log-scaled semantic change |
| `log10(sp)` | Log-scaled sequence probability |
| `log10(ip)` | Log-scaled inverse perplexity |
| `rank_by_sc` | Rank by semantic change |
| `rank_by_sp` | Rank by sequence probability |
| `rank_by_ip` | Rank by inverse perplexity |
| `rank_by_scsp` | Rank by semantic change + Rank by sequence probability |
| `rank_by_scip` | Rank by semantic change + Rank by inverse perplexity |
**Note:** All ranks are assigned in descending order (rank 1 corresponds to the highest metric value), and the output is sorted by `rank_by_scip` by default.
See [the output](https://huggingface.co/spaces/smtnkc/cov-snn-app/resolve/main/output.csv) for [the example target file](https://huggingface.co/spaces/smtnkc/cov-snn-app/resolve/main/target.csv).
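The combined ranks can be reproduced from the three log-scaled metrics with pandas. A minimal sketch with made-up scores (not real model output); rank 1 goes to the highest value of each metric, and the summed ranks are sorted ascending so the top row has the highest escape potential:

```python
import pandas as pd

# Made-up scores for three sequences (not real model output).
df = pd.DataFrame({
    "accession_id": ["A", "B", "C"],
    "log10(sc)": [-0.8, -1.2, -1.5],
    "log10(sp)": [-3.1, -2.9, -3.4],
    "log10(ip)": [-0.7, -0.9, -1.1],
})

# Rank 1 goes to the highest value of each metric (descending ranks).
for metric, col in [("sc", "log10(sc)"), ("sp", "log10(sp)"), ("ip", "log10(ip)")]:
    df[f"rank_by_{metric}"] = df[col].rank(ascending=False).astype(int)

# Combined ranks: sum of individual ranks; smaller = higher escape potential.
df["rank_by_scsp"] = df["rank_by_sc"] + df["rank_by_sp"]
df["rank_by_scip"] = df["rank_by_sc"] + df["rank_by_ip"]

# Default sorting: smallest rank_by_scip first.
df = df.sort_values("rank_by_scip").reset_index(drop=True)
```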
### The Ranking Mechanism
In the original implementation of [Constrained Semantic Change Search](https://www.science.org/doi/10.1126/science.abd7331) (CSCS), grammaticality (`gr`) is determined by sequence probability (`sp`). We propose using inverse perplexity (`ip`) as a more robust metric for grammaticality.
Sequences with both high semantic change (`sc`) and high grammaticality (`gr`) are expected to have a greater escape potential. We rank the sequences in descending order, assigning smaller rank values to those with higher escape potential. Consequently, the output is sorted based on `rank_by_scip`, with the top element possessing the smallest `rank_by_scip` and indicating the sequence with the highest escape potential.
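Inverse perplexity here presumably follows the standard definition: the reciprocal of perplexity, i.e. the geometric mean of the per-token probabilities, `exp(mean log p)`. A minimal sketch of that assumed definition (the exact implementation lives in the repository):

```python
import math

def inverse_perplexity(token_probs):
    """Inverse perplexity = exp(mean log p), the geometric mean
    of the per-token probabilities assigned by the language model."""
    mean_log_p = sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_log_p)

# A sequence whose tokens all receive high probability scores
# close to 1.0 in inverse perplexity (i.e. low perplexity).
ip = inverse_perplexity([0.9, 0.8, 0.95])
```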
### Model Details
This application uses the pre-trained model with the highest zero-shot test accuracy (91.5%).
#### Training Parameters:
| Parameter | Value |
|---------|-----|
| Base Model | CoV-RoBERTa_2048 |
| Loss function | Contrastive |
| Max Seq Len | 1280 |
| Positive Set | Omicron |
| Negative Set | Delta |
| Pooling | Max |
| ReLU | 0.2 |
| Dropout | 0.1 |
| Learning Rate | 0.001 |
| Batch size | 32 |
| Margin | 2.0 |
| Epochs | [0, 9] |
To train models for specific use cases, please refer to the instructions in [our GitHub repository](https://github.com/smtnkc/CoV-SNN).
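The contrastive loss with margin 2.0 listed above, in its standard formulation (Hadsell et al.), penalizes positive pairs by their squared embedding distance and negative pairs by how far they fall inside the margin. A plain-Python sketch of that standard loss, not the repository's exact code:

```python
def contrastive_loss(dist: float, label: int, margin: float = 2.0) -> float:
    """Standard contrastive loss on an embedding distance.

    label = 1 for a positive pair (e.g. both Omicron),
    label = 0 for a negative pair (e.g. Omicron vs. Delta).
    """
    if label == 1:
        return dist ** 2                      # pull positives together
    return max(margin - dist, 0.0) ** 2       # push negatives past the margin

# A negative pair at distance 0.5 with margin 2.0:
contrastive_loss(0.5, 0)  # (2.0 - 0.5)^2 = 2.25
```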
#### Training Results:
| Checkpoint | Test Loss | Test Acc | Zero-shot Loss | Zero-shot Acc |
|---------|-----|-----|-----|-----|
| 2 | 0.0236 | 99.7% | 1.0627 | 50.0% |
| 4 | 0.2941 | 89.4% | 0.2286 | 91.5% |
### Dependencies
```bash
conda create -n spaces python=3.8
conda activate spaces
pip install -r requirements.txt
```
**requirements.txt**
```bash
numpy==1.21.0
pandas==2.0.2
sentence-transformers==2.2.2
transformers==4.30.2
tokenizers==0.13.3
scanpy==1.9.3
scikit-learn==1.2.2
scipy==1.10.1
plotly==5.24.1
huggingface-hub==0.25.2
torch-optimizer==0.3.0
torchmetrics==0.9.0
torch==1.12.1+cu113
torchvision==0.13.1+cu113
torchaudio==0.12.1
--extra-index-url https://download.pytorch.org/whl/cu113
```
### More Information
[[email protected]](mailto:[email protected])