### Running Instructions

##### 1. Set reference dataset

The application compares the target sequences against the average embedding of the reference sequences. The reference dataset can be `Omicron` or `Other`:

* If you select `Omicron`, the application uses the average embedding of [2000 Omicron sequences](https://huggingface.co/spaces/smtnkc/cov-snn-app/resolve/main/omicron.csv). This embedding is pre-computed, so you do not need to upload a reference dataset.

* If you select `Other`, upload a CSV file with the reference sequences. The CSV file must have a ``sequence`` column. The model then generates the average embedding of the given sequences and uses it as the reference.
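For the `Other` option, the reference embedding is simply the mean of the per-sequence embeddings. A minimal sketch of that step, where `encode_fn` stands in for the app's encoder (e.g. a SentenceTransformer's `.encode`) and is an assumption, not part of the documented interface:

```python
import numpy as np
import pandas as pd

def average_reference_embedding(csv_path, encode_fn):
    """Average the embeddings of the reference sequences.

    `encode_fn` must map a list of sequences to a 2-D array of
    embeddings; both the function name and this signature are
    illustrative, not the app's actual API.
    """
    df = pd.read_csv(csv_path)
    if "sequence" not in df.columns:
        raise ValueError("reference CSV must have a 'sequence' column")
    embeddings = np.asarray(encode_fn(df["sequence"].tolist()))
    return embeddings.mean(axis=0)  # one vector: the reference embedding
```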

##### 2. Set target dataset

Upload a CSV file with the target sequences. The CSV file must have ``accession_id`` and ``sequence`` columns.

See [the example target file](https://huggingface.co/spaces/smtnkc/cov-snn-app/resolve/main/target.csv), which includes 10 Omicron (``EPI_ISL_177...``) and 10 Eris (``EPI_ISL_189...``) sequences. Note that the selected Omicron sequences are currently circulating and are not included in the training dataset.
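A quick way to check a target CSV for the required columns before uploading (the helper name and the check itself are illustrative, not part of the app):

```python
import pandas as pd

REQUIRED_COLUMNS = {"accession_id", "sequence"}

def validate_target_csv(csv_path):
    """Ensure the target CSV has the columns the app expects."""
    df = pd.read_csv(csv_path)
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"target CSV is missing columns: {sorted(missing)}")
    return df
```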

##### 3. Get output

The output is a dataframe with the following columns:

| Column           | Description                                                 |
|------------------|-------------------------------------------------------------|
| `accession_id`   | Accession ID                                                |
| `log10(sc)`      | Log-scaled semantic change                                  |
| `log10(sp)`      | Log-scaled sequence probability                             |
| `log10(ip)`      | Log-scaled inverse perplexity                               |
| `rank_by_sc`     | Rank by semantic change                                     |
| `rank_by_sp`     | Rank by sequence probability                                |
| `rank_by_ip`     | Rank by inverse perplexity                                  |
| `rank_by_scsp`   | Rank by semantic change + Rank by sequence probability      |
| `rank_by_scip`   | Rank by semantic change + Rank by inverse perplexity        |

**Note:** All ranks are in descending order; the default sorting metric is `rank_by_scip`.

See [the output](https://huggingface.co/spaces/smtnkc/cov-snn-app/resolve/main/output.csv) for [the example target file](https://huggingface.co/spaces/smtnkc/cov-snn-app/resolve/main/target.csv).
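The combined columns are plain sums of the individual ranks. A sketch of how they could be computed with pandas, using the column names from the table above; the app's exact tie-breaking rule is not specified, so `method="min"` is an assumption:

```python
import pandas as pd

def add_ranks(df):
    """Rank in descending order of each log-scaled metric, then combine.

    Smaller rank values mean higher metric values. `rank_by_scsp` and
    `rank_by_scip` are sums of the corresponding individual ranks.
    """
    out = df.copy()
    for m in ("sc", "sp", "ip"):
        out[f"rank_by_{m}"] = (
            out[f"log10({m})"].rank(ascending=False, method="min").astype(int)
        )
    out["rank_by_scsp"] = out["rank_by_sc"] + out["rank_by_sp"]
    out["rank_by_scip"] = out["rank_by_sc"] + out["rank_by_ip"]
    return out.sort_values("rank_by_scip")  # default sorting metric
```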

### The Ranking Mechanism

In the original implementation of [Constrained Semantic Change Search](https://www.science.org/doi/10.1126/science.abd7331) (CSCS), grammaticality (`gr`) is determined by sequence probability (`sp`). We propose using inverse perplexity (`ip`) as a more robust metric for grammaticality.

Sequences with both high semantic change (`sc`) and high grammaticality (`gr`) are expected to have greater escape potential. We rank the sequences in descending order, assigning smaller rank values to those with higher escape potential. The output is therefore sorted by `rank_by_scip`: the top row has the smallest `rank_by_scip` and corresponds to the sequence with the highest escape potential.
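Mathematically, inverse perplexity is a length-normalized sequence probability (the geometric mean of the per-token probabilities), which makes it comparable across sequences of different lengths. A sketch of both log-scaled quantities, under the assumption that per-token natural-log probabilities are available from the language model:

```python
import math

def log10_sp_and_ip(token_log_probs):
    """Log-scaled sequence probability and inverse perplexity.

    Input: per-token natural-log probabilities (an assumed format).
      sp = prod(p_i)           -> log10(sp) = sum(ln p_i) / ln 10
      ip = 1 / perplexity
         = exp(mean(ln p_i))   -> log10(ip) = mean(ln p_i) / ln 10
    """
    total = sum(token_log_probs)
    log10_sp = total / math.log(10)
    log10_ip = total / (len(token_log_probs) * math.log(10))
    return log10_sp, log10_ip
```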

### Model Details

This application uses the pre-trained model with the highest zero-shot test accuracy (91.5%).

#### Training Parameters:
 
| Parameter | Value |
|---------|-----| 
| Base Model | CoV-RoBERTa_2048 |
| Loss function | Contrastive |
| Max Seq Len | 1280 |
| Positive Set | Omicron |
| Negative Set | Delta |
| Pooling | Max |
| ReLU | 0.2 |
| Dropout | 0.1 |
| Learning Rate | 0.001 |
| Batch size | 32 |
| Margin | 2.0 |
| Epochs | [0, 9] |

To train models for specific use cases, please refer to the instructions in [our GitHub repository](https://github.com/smtnkc/CoV-SNN).

#### Training Results:

| Checkpoint | Test Loss | Test Acc | Zero-shot Loss | Zero-shot Acc |
|---------|-----|-----|-----|-----|
| 2 | 0.0236 | 99.7% | 1.0627 | 50.0% | 
| 4 | 0.2941 | 89.4% | 0.2286 | 91.5% | 


### Dependencies

```bash
conda create -n spaces python=3.8
conda activate spaces
pip install -r requirements.txt
```

**requirements.txt**
```text
numpy==1.21.0
pandas==2.0.2
sentence-transformers==2.2.2
transformers==4.30.2
tokenizers==0.13.3
scanpy==1.9.3
scikit-learn==1.2.2
scipy==1.10.1
plotly==5.24.1
huggingface-hub==0.25.2
torch-optimizer==0.3.0
torchmetrics==0.9.0
torch==1.12.1+cu113
torchvision==0.13.1+cu113
torchaudio==0.12.1
--extra-index-url https://download.pytorch.org/whl/cu113
```

### More Information
[[email protected]](mailto:[email protected])