cov-snn-app

Running Instructions

1. Set reference dataset

The application compares the target sequences against the average embedding of the reference sequences. The reference sequences can be Omicron or Other:

  • If you select Omicron, the application uses the pre-computed average embedding of 2000 Omicron sequences, so you do not need to upload a reference dataset.

  • If you select Other, upload a CSV file containing the reference sequences. The CSV file must have a sequence column. The model then generates the average embedding of the given sequences and uses it as the reference (see the sketch below).
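For illustration, a reference CSV with a single sequence column could be prepared as follows. This is a minimal sketch: the file name reference.csv and the truncated placeholder sequences are not part of the app.

```python
import pandas as pd

# Placeholder spike protein fragments; replace with your full reference sequences.
reference = pd.DataFrame({
    "sequence": [
        "MFVFLVLLPLVSSQCVNLT",
        "MFVFLVLLPLVSSQCVNLI",
    ]
})

# The uploaded file only needs the "sequence" column.
reference.to_csv("reference.csv", index=False)
```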

2. Set target dataset

Upload a CSV file containing the target sequences. The CSV file must have accession_id and sequence columns.

See the example target file, which includes 10 Omicron (EPI_ISL_177...) and 10 Eris (EPI_ISL_189...) sequences. Note that the selected Omicron sequences are currently circulating and are not included in the training dataset.
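Before uploading, a quick sanity check that the target CSV has the required columns can save a failed run. This is a minimal sketch; target.csv is a placeholder file name.

```python
import pandas as pd

target = pd.read_csv("target.csv")

# The app expects both of these columns to be present.
missing = {"accession_id", "sequence"} - set(target.columns)
if missing:
    raise ValueError(f"Target CSV is missing columns: {missing}")

print(f"{len(target)} target sequences ready for upload")
```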

3. Get output

The output will be a dataframe with the following columns:

| Column | Description |
| --- | --- |
| accession_id | Accession ID |
| log10(sc) | Log-scaled semantic change |
| log10(sp) | Log-scaled sequence probability |
| log10(ip) | Log-scaled inverse perplexity |
| rank_by_sc | Rank by semantic change |
| rank_by_sp | Rank by sequence probability |
| rank_by_ip | Rank by inverse perplexity |
| rank_by_scsp | Rank by semantic change + rank by sequence probability |
| rank_by_scip | Rank by semantic change + rank by inverse perplexity |

Note: Ranks are assigned in descending order of the underlying metric (rank 1 corresponds to the highest value), and the output is sorted by rank_by_scip by default.

See the output for the example target file.
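The downloaded output can also be inspected programmatically. The sketch below assumes the column names listed above and a placeholder file name output.csv; it lists the candidates with the highest predicted escape potential.

```python
import pandas as pd

results = pd.read_csv("output.csv")

# Smaller rank_by_scip means higher predicted escape potential,
# so the top rows are the most interesting candidates.
top10 = results.sort_values("rank_by_scip").head(10)
print(top10[["accession_id", "log10(sc)", "log10(ip)", "rank_by_scip"]])
```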

The Ranking Mechanism

In the original implementation of Constrained Semantic Change Search (CSCS), grammaticality (gr) is determined by sequence probability (sp). We propose using inverse perplexity (ip) as a more robust metric for grammaticality.
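As background (not necessarily the app's exact computation), both quantities can be derived from the per-token probabilities assigned by the language model: sequence probability is their product, while inverse perplexity is their geometric mean, i.e. a length-normalized variant that is less sensitive to sequence length.

```python
import numpy as np

def log_scores(token_probs):
    # Illustrative only: token_probs are per-residue probabilities from the model.
    log_p = np.log10(token_probs)
    log10_sp = log_p.sum()   # log-scaled sequence probability (product of probabilities)
    log10_ip = log_p.mean()  # log-scaled inverse perplexity (geometric mean)
    return log10_sp, log10_ip

print(log_scores(np.array([0.9, 0.8, 0.95, 0.7])))
```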

Sequences with both high semantic change (sc) and high grammaticality (gr) are expected to have greater escape potential. We rank the sequences in descending order of each metric, assigning smaller rank values to those with higher escape potential. Consequently, the output is sorted by rank_by_scip, so the top row has the smallest rank_by_scip and corresponds to the sequence with the highest predicted escape potential.
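A plausible reading of the column definitions above (a sketch, not the app's exact code) is that each combined rank is the sum of two individual ranks, and the table is then sorted so the smallest combined rank comes first:

```python
import pandas as pd

df = pd.DataFrame({
    "accession_id": ["SEQ_A", "SEQ_B", "SEQ_C"],  # hypothetical IDs
    "log10(sc)": [-1.2, -0.8, -1.5],
    "log10(ip)": [-0.30, -0.25, -0.40],
})

# Higher metric value -> smaller rank (rank 1 is best).
df["rank_by_sc"] = df["log10(sc)"].rank(ascending=False)
df["rank_by_ip"] = df["log10(ip)"].rank(ascending=False)
df["rank_by_scip"] = df["rank_by_sc"] + df["rank_by_ip"]

# The candidate with the highest predicted escape potential ends up on top.
print(df.sort_values("rank_by_scip"))
```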

Model Details

This application uses the pre-trained model with the highest zero-shot test accuracy (91.5%).

Training Parameters:

| Parameter | Value |
| --- | --- |
| Base Model | CoV-RoBERTa_2048 |
| Loss function | Contrastive |
| Max Seq Len | 1280 |
| Positive Set | Omicron |
| Negative Set | Delta |
| Pooling | Max |
| ReLU | 0.2 |
| Dropout | 0.1 |
| Learning Rate | 0.001 |
| Batch size | 32 |
| Margin | 2.0 |
| Epochs | [0, 9] |

To train models for specific use cases, please refer to the instructions in our GitHub repository.

Training Results:

| Checkpoint | Test Loss | Test Acc | Zero-shot Loss | Zero-shot Acc |
| --- | --- | --- | --- | --- |
| 2 | 0.0236 | 99.7% | 1.0627 | 50.0% |
| 4 | 0.2941 | 89.4% | 0.2286 | 91.5% |

Dependencies

conda create -n spaces python=3.8
conda activate spaces
pip install -r requirements.txt

requirements.txt

numpy==1.21.0
pandas==2.0.2
sentence-transformers==2.2.2
transformers==4.30.2
tokenizers==0.13.3
scanpy==1.9.3
scikit-learn==1.2.2
scipy==1.10.1
plotly==5.24.1
huggingface-hub==0.25.2
torch-optimizer==0.3.0
torchmetrics==0.9.0
torch==1.12.1+cu113
torchvision==0.13.1+cu113
torchaudio==0.12.1
--extra-index-url https://download.pytorch.org/whl/cu113

More Information

[email protected]