## Running Instructions
### 1. Set reference dataset

The application compares the target sequences with the average embedding of the reference sequences. The reference sequences can be **Omicron** or **Other**:

- If you select **Omicron**, the application uses the average embedding of 2000 Omicron sequences. This embedding is already generated, so you do not need to upload a reference dataset.
- If you select **Other**, you should upload a CSV file with the reference sequences. The CSV file must have a `sequence` column. The model then generates the average embedding of the given sequences and uses it as the reference.
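A minimal sketch of this step in Python, assuming the checkpoint is exported in sentence-transformers format (the model path below is hypothetical; this README does not name the checkpoint):

```python
import pandas as pd
from sentence_transformers import SentenceTransformer

# Hypothetical checkpoint path; substitute the actual model used by the app.
model = SentenceTransformer("path/to/cov-roberta-checkpoint")

ref = pd.read_csv("reference.csv")                # must contain a `sequence` column
ref_emb = model.encode(ref["sequence"].tolist())  # one embedding per sequence
reference = ref_emb.mean(axis=0)                  # average reference embedding
```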
### 2. Set target dataset

Upload a CSV file with the target sequences. The CSV file must have `accession_id` and `sequence` columns.

See the example target file, which includes 10 Omicron (`EPI_ISL_177...`) and 10 Eris (`EPI_ISL_189...`) sequences. Note that the selected Omicron sequences are currently circulating and are not included in the training dataset.
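For illustration, a target CSV has the following shape (the accession IDs and sequences below are truncated placeholders, not real records):

```csv
accession_id,sequence
EPI_ISL_177...,MFVFLVLLPLVSSQ...
EPI_ISL_189...,MFVFLVLLPLVSSQ...
```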
### 3. Get output

The output will be a dataframe with the following columns:

| Column | Description |
|---|---|
| `accession_id` | Accession ID |
| `log10(sc)` | Log-scaled semantic change |
| `log10(sp)` | Log-scaled sequence probability |
| `log10(ip)` | Log-scaled inverse perplexity |
| `rank_by_sc` | Rank by semantic change |
| `rank_by_sp` | Rank by sequence probability |
| `rank_by_ip` | Rank by inverse perplexity |
| `rank_by_scsp` | Rank by semantic change + rank by sequence probability |
| `rank_by_scip` | Rank by semantic change + rank by inverse perplexity |
Note: All ranks are in descending order, with the default sorting metric being `rank_by_scip`.
See the output for the example target file.
## The Ranking Mechanism
In the original implementation of Constrained Semantic Change Search (CSCS), grammaticality (`gr`) is determined by sequence probability (`sp`). We propose using inverse perplexity (`ip`) as a more robust metric for grammaticality.
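For reference, inverse perplexity is the reciprocal of the model's perplexity over a sequence. A standard formulation, written with the masked-token conditional appropriate to RoBERTa-style models (i.e. pseudo-perplexity; the app's exact formulation is not spelled out in this README), is:

$$
\mathrm{ip}(x) = \mathrm{PP}(x)^{-1} = \exp\left(\frac{1}{L}\sum_{i=1}^{L}\log p\left(x_i \mid x_{\setminus i}\right)\right)
$$

Higher `ip` (equivalently, lower perplexity) means the model considers the sequence more grammatical.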
Sequences with both high semantic change (`sc`) and high grammaticality (`gr`) are expected to have a greater escape potential. We rank the sequences in descending order, assigning smaller rank values to those with higher escape potential. Consequently, the output is sorted by `rank_by_scip`, with the top element possessing the smallest `rank_by_scip` and indicating the sequence with the highest escape potential.
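A minimal sketch of this ranking logic with pandas (it assumes a dataframe `df` that already holds per-sequence `sc`, `sp`, and `ip` scores, which is an assumption about the app's internals):

```python
import pandas as pd

def add_ranks(df: pd.DataFrame) -> pd.DataFrame:
    """Rank sequences so that rank 1 marks the highest escape potential."""
    # Higher score -> smaller rank value (descending ranking).
    for col in ("sc", "sp", "ip"):
        df[f"rank_by_{col}"] = df[col].rank(ascending=False, method="min").astype(int)
    # Combined ranks are simple sums of the individual ranks.
    df["rank_by_scsp"] = df["rank_by_sc"] + df["rank_by_sp"]
    df["rank_by_scip"] = df["rank_by_sc"] + df["rank_by_ip"]
    # Default presentation: smallest combined rank first.
    return df.sort_values("rank_by_scip")
```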
## Model Details
This application uses the pre-trained model with the highest zero-shot test accuracy (91.5%).
Training Parameters:

| Parameter | Value |
|---|---|
| Base Model | CoV-RoBERTa_2048 |
| Loss function | Contrastive |
| Max Seq Len | 1280 |
| Positive Set | Omicron |
| Negative Set | Delta |
| Pooling | Max |
| ReLU | 0.2 |
| Dropout | 0.1 |
| Learning Rate | 0.001 |
| Batch size | 32 |
| Margin | 2.0 |
| Epochs | [0, 9] |
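As context for the `Contrastive` loss and `Margin` entries above, a standard margin-based contrastive loss looks like the sketch below (a generic formulation of the technique, not necessarily the exact implementation in the repository):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb_a: torch.Tensor,
                     emb_b: torch.Tensor,
                     label: torch.Tensor,
                     margin: float = 2.0) -> torch.Tensor:
    """Margin-based contrastive loss over embedding pairs.

    label = 1 for positive pairs (e.g. Omicron/Omicron),
    label = 0 for negative pairs (e.g. Omicron/Delta).
    """
    dist = F.pairwise_distance(emb_a, emb_b)           # Euclidean distance
    pos = label * dist.pow(2)                          # pull positives together
    neg = (1 - label) * F.relu(margin - dist).pow(2)   # push negatives beyond margin
    return 0.5 * (pos + neg).mean()
```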
To train models for specific use cases, please refer to the instructions in our GitHub repository.
Training Results:

| Checkpoint | Test Loss | Test Acc | Zero-shot Loss | Zero-shot Acc |
|---|---|---|---|---|
| 2 | 0.0236 | 99.7% | 1.0627 | 50.0% |
| 4 | 0.2941 | 89.4% | 0.2286 | 91.5% |
## Dependencies

```bash
conda create -n spaces python=3.8
conda activate spaces
pip install -r requirements.txt
```

`requirements.txt`:

```text
numpy==1.21.0
pandas==2.0.2
sentence-transformers==2.2.2
transformers==4.30.2
tokenizers==0.13.3
scanpy==1.9.3
scikit-learn==1.2.2
scipy==1.10.1
plotly==5.24.1
huggingface-hub==0.25.2
torch-optimizer==0.3.0
torchmetrics==0.9.0
torch==1.12.1+cu113
torchvision==0.13.1+cu113
torchaudio==0.12.1
--extra-index-url https://download.pytorch.org/whl/cu113
```