### Running Instructions

##### 1. Set the reference dataset

The application compares the target sequences against the average embedding of the reference sequences. The reference can be `Omicron` or `Other`:

* If you select `Omicron`, the application uses the average embedding of [2000 Omicron sequences](https://huggingface.co/spaces/smtnkc/cov-snn-app/resolve/main/omicron.csv). This embedding is precomputed, so you do not need to upload a reference dataset.
* If you select `Other`, upload a CSV file with the reference sequences. The CSV file must have a ``sequence`` column. The model then generates the average embedding of the given sequences and uses it as the reference.

##### 2. Set the target dataset

Upload a CSV file with the target sequences. The CSV file must have ``accession_id`` and ``sequence`` columns. See [the example target file](https://huggingface.co/spaces/smtnkc/cov-snn-app/resolve/main/target.csv), which includes 10 Omicron (``EPI_ISL_177...``) and 10 Eris (``EPI_ISL_189...``) sequences. Note that the selected Omicron sequences are currently circulating and are not included in the training dataset.

##### 3. Get the output

The output is a dataframe with the following columns:

| Column | Description |
|------------------|-------------------------------------------------------------|
| `accession_id` | Accession ID |
| `log10(sc)` | Log-scaled semantic change |
| `log10(sp)` | Log-scaled sequence probability |
| `log10(ip)` | Log-scaled inverse perplexity |
| `rank_by_sc` | Rank by semantic change |
| `rank_by_sp` | Rank by sequence probability |
| `rank_by_ip` | Rank by inverse perplexity |
| `rank_by_scsp` | Rank by semantic change + rank by sequence probability |
| `rank_by_scip` | Rank by semantic change + rank by inverse perplexity |

**Note:** All ranks are in descending order, and the default sorting metric is `rank_by_scip`.
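As a minimal sketch (the accession IDs and rank values below are made up, not taken from the example output), the combined rank that drives the default sorting can be reproduced with pandas:

```python
import pandas as pd

# Hypothetical per-sequence ranks; in the app these come from the model.
df = pd.DataFrame({
    "accession_id": ["EPI_ISL_A", "EPI_ISL_B", "EPI_ISL_C"],
    "rank_by_sc": [2, 1, 3],
    "rank_by_ip": [3, 1, 2],
})

# rank_by_scip = rank_by_sc + rank_by_ip; smaller = higher escape potential.
df["rank_by_scip"] = df["rank_by_sc"] + df["rank_by_ip"]
df = df.sort_values("rank_by_scip").reset_index(drop=True)
print(df.loc[0, "accession_id"])  # prints EPI_ISL_B
```

Because smaller rank values mean higher escape potential, the top row after sorting is the most promising candidate.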
See [the output](https://huggingface.co/spaces/smtnkc/cov-snn-app/resolve/main/output.csv) for [the example target file](https://huggingface.co/spaces/smtnkc/cov-snn-app/resolve/main/target.csv).

### The Ranking Mechanism

In the original implementation of [Constrained Semantic Change Search](https://www.science.org/doi/10.1126/science.abd7331) (CSCS), grammaticality (`gr`) is determined by sequence probability (`sp`). We propose inverse perplexity (`ip`) as a more robust grammaticality metric. Sequences with both high semantic change (`sc`) and high grammaticality (`gr`) are expected to have greater escape potential. We rank the sequences in descending order, assigning smaller rank values to those with higher escape potential. Consequently, the output is sorted by `rank_by_scip`: the top element has the smallest `rank_by_scip` and indicates the sequence with the highest escape potential.

### Model Details

This application uses the pre-trained model with the highest zero-shot test accuracy (91.5%).

#### Training Parameters:

| Parameter | Value |
|---------|-----|
| Base model | CoV-RoBERTa_2048 |
| Loss function | Contrastive |
| Max sequence length | 1280 |
| Positive set | Omicron |
| Negative set | Delta |
| Pooling | Max |
| ReLU | 0.2 |
| Dropout | 0.1 |
| Learning rate | 0.001 |
| Batch size | 32 |
| Margin | 2.0 |
| Epochs | [0, 9] |

To train models for specific use cases, please refer to the instructions in [our GitHub repository](https://github.com/smtnkc/CoV-SNN).
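To illustrate the contrastive objective with margin 2.0 listed above, here is a simplified plain-Python sketch (not the repository's training code; the distances are made-up examples):

```python
def contrastive_loss(dist, is_positive_pair, margin=2.0):
    """Contrastive loss on an embedding distance: positive pairs are
    pulled together, negative pairs pushed at least `margin` apart."""
    if is_positive_pair:
        return 0.5 * dist ** 2
    return 0.5 * max(margin - dist, 0.0) ** 2

# A positive (Omicron/Omicron) pair at distance 0.4
# and a negative (Omicron/Delta) pair at distance 1.5:
print(contrastive_loss(0.4, True))   # ≈ 0.08
print(contrastive_loss(1.5, False))  # 0.125
```

Negative pairs already farther apart than the margin contribute zero loss, which is what pushes the Omicron and Delta embeddings apart during training.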
#### Training Results:

| Checkpoint | Test Loss | Test Acc | Zero-shot Loss | Zero-shot Acc |
|---------|-----|-----|-----|-----|
| 2 | 0.0236 | 99.7% | 1.0627 | 50.0% |
| 4 | 0.2941 | 89.4% | 0.2286 | 91.5% |

### Dependencies

```bash
conda create -n spaces python=3.8
conda activate spaces
pip install -r requirements.txt
```

**requirements.txt**

```text
numpy==1.21.0
pandas==2.0.2
sentence-transformers==2.2.2
transformers==4.30.2
tokenizers==0.13.3
scanpy==1.9.3
scikit-learn==1.2.2
scipy==1.10.1
plotly==5.24.1
huggingface-hub==0.25.2
torch-optimizer==0.3.0
torchmetrics==0.9.0
torch==1.12.1+cu113
torchvision==0.13.1+cu113
torchaudio==0.12.1
--extra-index-url https://download.pytorch.org/whl/cu113
```

### More Information

[samettenekeci@gmail.com](mailto:samettenekeci@gmail.com)