---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
datasets:
- CyCraftAI/CyPHER
extra_gated_fields:
  First Name: text
  Last Name: text
  Date of birth: date_picker
  Country: country
  Affiliation: text
  Job title:
    type: select
    options:
    - Student
    - Research Graduate
    - AI researcher
    - AI developer/engineer
    - Reporter
    - Other
  geo: ip_location
---

# CmdCaliper-base

## [[Dataset](https://huggingface.co/datasets/CyCraftAI/CyPHER)] [[Code](https://github.com/cycraft-corp/CmdCaliper)] [[Paper](https://arxiv.org/abs/2411.01176)]

The CmdCaliper models, developed by CyCraft AI Lab, are the first embedding models designed specifically for command lines. Our evaluation results demonstrate that even the smallest version of CmdCaliper, with approximately 30 million parameters, outperforms state-of-the-art sentence embedding models more than ten times its size (335 million parameters) across various command-line-specific tasks.

CmdCaliper comes in three sizes: CmdCaliper-large, CmdCaliper-base, and CmdCaliper-small, providing flexible options for different hardware resource constraints.

CmdCaliper was introduced in the EMNLP 2024 paper "CmdCaliper: A Semantic-Aware Command-Line Embedding Model and Dataset for Security Research".

## Metrics

| Methods | Model Parameters | MRR @3 | MRR @10 | Top @3 | Top @10 |
|----------------------|-----------------|-----------|-----------|-----------|-----------|
| Levenshtein distance | - | 71.23 | 72.45 | 74.99 | 81.83 |
| Word2Vec | - | 45.83 | 46.93 | 48.49 | 54.86 |
| | | | | | |
| E5-small | Small (0.03B) | 81.59 | 82.60 | 84.97 | 90.59 |
| GTE-small | Small (0.03B) | 82.35 | 83.28 | 85.39 | 90.84 |
| CmdCaliper-small | Small (0.03B) | **86.81** | **87.78** | **89.21** | **94.76** |
| | | | | | |
| BGE-en-base | Base (0.11B) | 79.49 | 80.41 | 82.33 | 87.39 |
| E5-base | Base (0.11B) | 83.16 | 84.07 | 86.14 | 91.56 |
| GTR-base | Base (0.11B) | 81.55 | 82.51 | 84.54 | 90.10 |
| GTE-base | Base (0.11B) | 78.20 | 79.07 | 81.22 | 86.14 |
| CmdCaliper-base | Base (0.11B) | **87.56** | **88.47** | **90.27** | **95.26** |
| | | | | | |
| BGE-en-large | Large (0.34B) | 84.11 | 84.92 | 86.64 | 91.09 |
| E5-large | Large (0.34B) | 84.12 | 85.04 | 87.32 | 92.59 |
| GTR-large | Large (0.34B) | 88.09 | 88.68 | 91.27 | 94.58 |
| GTE-large | Large (0.34B) | 84.26 | 85.03 | 87.14 | 91.41 |
| CmdCaliper-large | Large (0.34B) | **89.12** | **89.91** | **91.45** | **95.65** |
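Here, MRR @k is the mean reciprocal rank of the ground-truth command, counted only when it is retrieved within the top k results, and Top @k is the fraction of queries whose ground-truth command appears in the top k. A minimal sketch of both metrics, assuming these standard definitions; the `ranks` values below are hypothetical and are not taken from the CyPHER evaluation:

```python
from typing import Sequence

def mrr_at_k(ranks: Sequence[int], k: int) -> float:
    """Mean reciprocal rank, counting only hits ranked within the top k.

    Each entry of `ranks` is the 1-indexed rank at which the ground-truth
    command was retrieved for one query.
    """
    return sum(1.0 / r for r in ranks if r <= k) / len(ranks)

def top_at_k(ranks: Sequence[int], k: int) -> float:
    """Fraction of queries whose ground-truth command appears in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Hypothetical ranks for five queries (illustration only).
ranks = [1, 3, 2, 11, 1]
print(f"{mrr_at_k(ranks, 3):.4f}")   # (1 + 1/3 + 1/2 + 0 + 1) / 5 = 0.5667
print(f"{top_at_k(ranks, 10):.4f}")  # 4 of 5 hits within the top 10 = 0.8000
```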
## Usage

### HuggingFace Transformers

```python
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def average_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    # Zero out padded positions, then average over the sequence length
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


# Raw strings keep Python from treating backslashes in Windows paths as escapes
input_texts = [
    'cronjob schedule daily 00:00 ./program.exe',
    r'schtasks /create /tn "TaskName" /tr "C:\program.exe" /sc daily /st 00:00',
    r'xcopy C:\Program Files (x86) E:\Program Files /E /H /K /O /X',
]

tokenizer = AutoTokenizer.from_pretrained("CyCraftAI/CmdCaliper-base")
model = AutoModel.from_pretrained("CyCraftAI/CmdCaliper-base")

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# (Optionally) normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
```

### Sentence Transformers

```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("CyCraftAI/CmdCaliper-base")

# Run inference
sentences = [
    'cronjob schedule daily 00:00 ./program.exe',
    r'schtasks /create /tn "TaskName" /tr "C:\program.exe" /sc daily /st 00:00',
    r'xcopy C:\Program Files (x86) E:\Program Files /E /H /K /O /X',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
```

## Limitations

This model focuses exclusively on Windows command lines. Additionally, lengthy inputs are truncated to a maximum of 512 tokens.

## Citation

```
@inproceedings{huang2024cmdcaliper,
  title={CmdCaliper: A Semantic-Aware Command-Line Embedding Model and Dataset for Security Research},
  author={SianYao Huang and ChengLin Yang and CheYu Lin and ChunYing Huang},
  booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing},
  year={2024}
}
```