cov-snn-app

Running Instructions

1. Set reference dataset

The application compares the target sequences against the average embedding of the reference sequences. The reference sequences can be Omicron or Other:

  • If you select Omicron, the application uses the pre-computed average embedding of 2000 Omicron sequences, so you do not need to upload a reference dataset.

  • If you select Other, upload a CSV file containing the reference sequences. The CSV file must have a sequence column. The model then generates the average embedding of the given sequences and uses it as the reference (see the sketch below).
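For illustration, a reference CSV with a single sequence column could be prepared as follows. This is a minimal sketch: the file name reference.csv and the truncated placeholder sequences are not part of the app.

```python
import pandas as pd

# Placeholder spike protein fragments; replace with your full reference sequences.
reference = pd.DataFrame({
    "sequence": [
        "MFVFLVLLPLVSSQCVNLT",
        "MFVFLVLLPLVSSQCVNLI",
    ]
})

# The uploaded file only needs the "sequence" column.
reference.to_csv("reference.csv", index=False)
```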

2. Set target dataset

Upload a CSV file containing the target sequences. The CSV file must have accession_id and sequence columns.

See the example target file, which includes 10 Omicron (EPI_ISL_177...) and 10 Eris (EPI_ISL_189...) sequences. Note that the selected Omicron sequences are currently circulating and are not included in the training dataset.
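Before uploading, a quick sanity check that the target CSV has the required columns can save a failed run. This is a minimal sketch; target.csv is a placeholder file name.

```python
import pandas as pd

target = pd.read_csv("target.csv")

# The app expects both of these columns to be present.
missing = {"accession_id", "sequence"} - set(target.columns)
if missing:
    raise ValueError(f"Target CSV is missing columns: {missing}")

print(f"{len(target)} target sequences ready for upload")
```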

3. Get output

The output will be a dataframe with the following columns:

| Column | Description |
| --- | --- |
| accession_id | Accession ID |
| log10(sc) | Log-scaled semantic change |
| log10(sp) | Log-scaled sequence probability |
| log10(ip) | Log-scaled inverse perplexity |
| rank_by_sc | Rank by semantic change |
| rank_by_sp | Rank by sequence probability |
| rank_by_ip | Rank by inverse perplexity |
| rank_by_scsp | Rank by semantic change + rank by sequence probability |
| rank_by_scip | Rank by semantic change + rank by inverse perplexity |

Note: Ranks are assigned in descending order of the underlying metric (rank 1 corresponds to the highest value), and the output is sorted by rank_by_scip by default.

See the output for the example target file.
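The downloaded output can also be inspected programmatically. The sketch below assumes the column names listed above and a placeholder file name output.csv; it lists the candidates with the highest predicted escape potential.

```python
import pandas as pd

results = pd.read_csv("output.csv")

# Smaller rank_by_scip means higher predicted escape potential,
# so the top rows are the most interesting candidates.
top10 = results.sort_values("rank_by_scip").head(10)
print(top10[["accession_id", "log10(sc)", "log10(ip)", "rank_by_scip"]])
```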

The Ranking Mechanism

In the original implementation of Constrained Semantic Change Search (CSCS), grammaticality (gr) is determined by sequence probability (sp). We propose using inverse perplexity (ip) as a more robust metric for grammaticality.
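As background (not necessarily the app's exact computation), both quantities can be derived from the per-token probabilities assigned by the language model: sequence probability is their product, while inverse perplexity is their geometric mean, i.e. a length-normalized variant that is less sensitive to sequence length.

```python
import numpy as np

def log_scores(token_probs):
    # Illustrative only: token_probs are per-residue probabilities from the model.
    log_p = np.log10(token_probs)
    log10_sp = log_p.sum()   # log-scaled sequence probability (product of probabilities)
    log10_ip = log_p.mean()  # log-scaled inverse perplexity (geometric mean)
    return log10_sp, log10_ip

print(log_scores(np.array([0.9, 0.8, 0.95, 0.7])))
```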

Sequences with both high semantic change (sc) and high grammaticality (gr) are expected to have greater escape potential. We rank the sequences in descending order of each metric, assigning smaller rank values to those with higher escape potential. Consequently, the output is sorted by rank_by_scip, so the top row has the smallest rank_by_scip and corresponds to the sequence with the highest predicted escape potential.
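A plausible reading of the column definitions above (a sketch, not the app's exact code) is that each combined rank is the sum of two individual ranks, and the table is then sorted so the smallest combined rank comes first:

```python
import pandas as pd

df = pd.DataFrame({
    "accession_id": ["SEQ_A", "SEQ_B", "SEQ_C"],  # hypothetical IDs
    "log10(sc)": [-1.2, -0.8, -1.5],
    "log10(ip)": [-0.30, -0.25, -0.40],
})

# Higher metric value -> smaller rank (rank 1 is best).
df["rank_by_sc"] = df["log10(sc)"].rank(ascending=False)
df["rank_by_ip"] = df["log10(ip)"].rank(ascending=False)
df["rank_by_scip"] = df["rank_by_sc"] + df["rank_by_ip"]

# The candidate with the highest predicted escape potential ends up on top.
print(df.sort_values("rank_by_scip"))
```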

Model Details

This application uses the pre-trained model with the highest zero-shot test accuracy (91.5%).

Training Parameters:

| Parameter | Value |
| --- | --- |
| Base Model | CoV-RoBERTa_2048 |
| Loss function | Contrastive |
| Max Seq Len | 1280 |
| Positive Set | Omicron |
| Negative Set | Delta |
| Pooling | Max |
| ReLU | 0.2 |
| Dropout | 0.1 |
| Learning Rate | 0.001 |
| Batch size | 32 |
| Margin | 2.0 |
| Epochs | [0, 9] |

To train models for specific use cases, please refer to the instructions in our GitHub repository.

Training Results:

| Checkpoint | Test Loss | Test Acc | Zero-shot Loss | Zero-shot Acc |
| --- | --- | --- | --- | --- |
| 2 | 0.0236 | 99.7% | 1.0627 | 50.0% |
| 4 | 0.2941 | 89.4% | 0.2286 | 91.5% |

Dependencies

conda create -n spaces python=3.8
conda activate spaces
pip install -r requirements.txt

requirements.txt

numpy==1.21.0
pandas==2.0.2
sentence-transformers==2.2.2
transformers==4.30.2
tokenizers==0.13.3
scanpy==1.9.3
scikit-learn==1.2.2
scipy==1.10.1
plotly==5.24.1
huggingface-hub==0.25.2
torch-optimizer==0.3.0
torchmetrics==0.9.0
torch==1.12.1+cu113
torchvision==0.13.1+cu113
torchaudio==0.12.1
--extra-index-url https://download.pytorch.org/whl/cu113

More Information

[email protected]