### Running Instructions

##### 1. Set the reference dataset

The application compares the target sequences against the average embedding of the reference sequences. The reference can be `Omicron` or `Other`:

* If you select `Omicron`, the application uses the average embedding of [2000 Omicron sequences](https://huggingface.co/spaces/smtnkc/cov-snn-app/resolve/main/omicron.csv). This embedding is precomputed, so you do not need to upload a reference dataset.
* If you select `Other`, upload a CSV file with the reference sequences. The CSV file must have a ``sequence`` column. The model then generates the average embedding of the given sequences and uses it as the reference.

##### 2. Set the target dataset

Upload a CSV file with the target sequences. The CSV file must have ``accession_id`` and ``sequence`` columns. See [the example target file](https://huggingface.co/spaces/smtnkc/cov-snn-app/resolve/main/target.csv), which includes 10 Omicron (``EPI_ISL_177...``) and 10 Eris (``EPI_ISL_189...``) sequences. Note that the selected Omicron sequences are currently circulating and are not included in the training dataset.

##### 3. Get the output

The output is a dataframe with the following columns:

| Column | Description |
|------------------|-------------------------------------------------------------|
| `accession_id` | Accession ID |
| `log10(sc)` | Log-scaled semantic change |
| `log10(sp)` | Log-scaled sequence probability |
| `log10(ip)` | Log-scaled inverse perplexity |
| `rank_by_sc` | Rank by semantic change |
| `rank_by_sp` | Rank by sequence probability |
| `rank_by_ip` | Rank by inverse perplexity |
| `rank_by_scsp` | Rank by semantic change + rank by sequence probability |
| `rank_by_scip` | Rank by semantic change + rank by inverse perplexity |

**Note:** All ranks are in descending order, and the default sorting metric is `rank_by_scip`.
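As a minimal sketch (the accession IDs and rank values below are made up, not taken from the example output), the combined rank that drives the default sorting can be reproduced with pandas:

```python
import pandas as pd

# Hypothetical per-sequence ranks; in the app these come from the model.
df = pd.DataFrame({
    "accession_id": ["EPI_ISL_A", "EPI_ISL_B", "EPI_ISL_C"],
    "rank_by_sc": [2, 1, 3],
    "rank_by_ip": [3, 1, 2],
})

# rank_by_scip = rank_by_sc + rank_by_ip; smaller = higher escape potential.
df["rank_by_scip"] = df["rank_by_sc"] + df["rank_by_ip"]
df = df.sort_values("rank_by_scip").reset_index(drop=True)
print(df.loc[0, "accession_id"])  # prints EPI_ISL_B
```

Because smaller rank values mean higher escape potential, the top row after sorting is the most promising candidate.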
See [the output](https://huggingface.co/spaces/smtnkc/cov-snn-app/resolve/main/output.csv) for [the example target file](https://huggingface.co/spaces/smtnkc/cov-snn-app/resolve/main/target.csv).

### The Ranking Mechanism

In the original implementation of [Constrained Semantic Change Search](https://www.science.org/doi/10.1126/science.abd7331) (CSCS), grammaticality (`gr`) is determined by sequence probability (`sp`). We propose inverse perplexity (`ip`) as a more robust grammaticality metric. Sequences with both high semantic change (`sc`) and high grammaticality (`gr`) are expected to have greater escape potential. We rank the sequences in descending order, assigning smaller rank values to those with higher escape potential. Consequently, the output is sorted by `rank_by_scip`: the top element has the smallest `rank_by_scip` and indicates the sequence with the highest escape potential.

### Model Details

This application uses the pre-trained model with the highest zero-shot test accuracy (91.5%).

#### Training Parameters:

| Parameter | Value |
|---------|-----|
| Base model | CoV-RoBERTa_2048 |
| Loss function | Contrastive |
| Max sequence length | 1280 |
| Positive set | Omicron |
| Negative set | Delta |
| Pooling | Max |
| ReLU | 0.2 |
| Dropout | 0.1 |
| Learning rate | 0.001 |
| Batch size | 32 |
| Margin | 2.0 |
| Epochs | [0, 9] |

To train models for specific use cases, please refer to the instructions in [our GitHub repository](https://github.com/smtnkc/CoV-SNN).
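To illustrate the contrastive objective with margin 2.0 listed above, here is a simplified plain-Python sketch (not the repository's training code; the distances are made-up examples):

```python
def contrastive_loss(dist, is_positive_pair, margin=2.0):
    """Contrastive loss on an embedding distance: positive pairs are
    pulled together, negative pairs pushed at least `margin` apart."""
    if is_positive_pair:
        return 0.5 * dist ** 2
    return 0.5 * max(margin - dist, 0.0) ** 2

# A positive (Omicron/Omicron) pair at distance 0.4
# and a negative (Omicron/Delta) pair at distance 1.5:
print(contrastive_loss(0.4, True))   # ≈ 0.08
print(contrastive_loss(1.5, False))  # 0.125
```

Negative pairs already farther apart than the margin contribute zero loss, which is what pushes the Omicron and Delta embeddings apart during training.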
#### Training Results:

| Checkpoint | Test Loss | Test Acc | Zero-shot Loss | Zero-shot Acc |
|---------|-----|-----|-----|-----|
| 2 | 0.0236 | 99.7% | 1.0627 | 50.0% |
| 4 | 0.2941 | 89.4% | 0.2286 | 91.5% |

### Dependencies

```bash
conda create -n spaces python=3.8
conda activate spaces
pip install -r requirements.txt
```

**requirements.txt**

```text
numpy==1.21.0
pandas==2.0.2
sentence-transformers==2.2.2
transformers==4.30.2
tokenizers==0.13.3
scanpy==1.9.3
scikit-learn==1.2.2
scipy==1.10.1
plotly==5.24.1
huggingface-hub==0.25.2
torch-optimizer==0.3.0
torchmetrics==0.9.0
torch==1.12.1+cu113
torchvision==0.13.1+cu113
torchaudio==0.12.1
--extra-index-url https://download.pytorch.org/whl/cu113
```

### More Information

[samettenekeci@gmail.com](mailto:samettenekeci@gmail.com)