# Clinical Contrastive ModernBERT

This is a Clinical ModernBERT model fine-tuned with contrastive learning to produce clinical note embeddings.
## Model Details
- Base Model: Simonlee711/Clinical_ModernBERT
- Architecture: ModernBERT with contrastive learning head
- Training Method: Triplet loss contrastive learning
- Vocabulary Size: 50370 tokens
- Special Tokens: Includes the [ENTITY] token (ID: 50368)
- Max Sequence Length: 8192 tokens
- Hidden Size: 768
- Layers: 22
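These details can be confirmed directly from the published checkpoint. A minimal check, assuming the repository exposes the standard `AutoConfig`/`AutoTokenizer` interfaces:

```python
from transformers import AutoConfig, AutoTokenizer

repo = "nikhil061307/contrastive-learning-bert-added-token"
config = AutoConfig.from_pretrained(repo)
tokenizer = AutoTokenizer.from_pretrained(repo)

print(len(tokenizer))                               # vocabulary size, expected 50370
print(config.hidden_size)                           # expected 768
print(config.num_hidden_layers)                     # expected 22
print(config.max_position_embeddings)               # context length, expected 8192
print(tokenizer.convert_tokens_to_ids("[ENTITY]"))  # expected 50368
```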
## Special Features

- Extended Vocabulary: Custom tokens for clinical text processing
- Entity Masking: [ENTITY] token for anonymizing sensitive information
- Contrastive Learning: Trained to produce semantically meaningful embeddings
- Clinical Domain: Specialized for medical/clinical text understanding
## Performance
The model achieves:
- Cosine Similarity: 0.85 (on clinical note similarity tasks)
- Triplet Accuracy: 0.92 (on contrastive learning validation)
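For reference, triplet accuracy is the fraction of (anchor, positive, negative) triplets in which the anchor embedding is closer to the positive than to the negative. A minimal evaluation sketch, using the `get_embeddings` helper defined in the Usage section below and illustrative triplets rather than the actual validation data:

```python
import torch

def triplet_accuracy(triplets):
    """triplets: list of (anchor, positive, negative) note strings."""
    correct = 0
    for anchor, positive, negative in triplets:
        emb_a = get_embeddings(anchor)
        emb_p = get_embeddings(positive)
        emb_n = get_embeddings(negative)
        sim_pos = torch.cosine_similarity(emb_a, emb_p).item()
        sim_neg = torch.cosine_similarity(emb_a, emb_n).item()
        correct += int(sim_pos > sim_neg)
    return correct / len(triplets)

# Illustrative triplet, not the reported benchmark data
triplets = [
    ("Patient has acute myocardial infarction.",
     "Patient diagnosed with heart attack.",
     "Patient treated for a fractured wrist."),
]
print(f"Triplet accuracy: {triplet_accuracy(triplets):.2f}")
```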
## Usage

### Basic Usage
```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("nikhil061307/contrastive-learning-bert-added-token")
model = AutoModel.from_pretrained("nikhil061307/contrastive-learning-bert-added-token")

def get_embeddings(text, max_length=512):
    # Tokenize
    inputs = tokenizer(
        text,
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors='pt'
    )

    # Get embeddings
    with torch.no_grad():
        outputs = model(**inputs)

    # Mean pooling over non-padding tokens
    attention_mask = inputs['attention_mask']
    token_embeddings = outputs.last_hidden_state
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    embeddings = torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

    # Normalize (important for contrastive learning models)
    embeddings = F.normalize(embeddings, p=2, dim=1)
    return embeddings

# Example usage
clinical_note = "Patient presents with chest pain and shortness of breath. Vital signs stable."
embeddings = get_embeddings(clinical_note)
print(f"Embedding shape: {embeddings.shape}")
```
### Entity Masking

```python
# Use the [ENTITY] token for anonymization
text_with_entities = "Patient [ENTITY] presents with chest pain."
embeddings = get_embeddings(text_with_entities)

# Check that the [ENTITY] token is in the vocabulary
entity_token_id = tokenizer.convert_tokens_to_ids('[ENTITY]')
print(f"[ENTITY] token ID: {entity_token_id}")
```
Similarity Comparison
def compute_similarity(text1, text2):
emb1 = get_embeddings(text1)
emb2 = get_embeddings(text2)
# Cosine similarity
similarity = torch.cosine_similarity(emb1, emb2)
return similarity.item()
# Compare clinical notes
note1 = "Patient has acute myocardial infarction."
note2 = "Patient diagnosed with heart attack."
similarity = compute_similarity(note1, note2)
print(f"Similarity: {similarity:.3f}")
## Training Details
This model was fine-tuned using:
- Loss Function: Triplet loss with margin
- Training Data: Clinical notes with positive/negative pairs
- Optimization: Contrastive learning approach
- Special Tokens: Added [ENTITY] and [EMPTY] tokens
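The training script itself is not included, but the triplet objective listed above can be sketched with PyTorch's built-in loss. The margin of 0.5 and the random embeddings are placeholders, not the values used for this checkpoint:

```python
import torch
import torch.nn.functional as F

# Triplet margin loss on L2-normalized embeddings (margin value is illustrative)
triplet_loss = torch.nn.TripletMarginLoss(margin=0.5, p=2)

anchor   = F.normalize(torch.randn(8, 768), dim=1)  # anchor note embeddings
positive = F.normalize(torch.randn(8, 768), dim=1)  # semantically matching notes
negative = F.normalize(torch.randn(8, 768), dim=1)  # unrelated notes

loss = triplet_loss(anchor, positive, negative)
print(loss.item())  # during training, this loss is backpropagated through the encoder
```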
## Files Included

- tokenizer_config.json
- special_tokens_map.json
- tokenizer.json
- model.safetensors
- pytorch_model.bin
- training_args.bin
## Technical Specifications
- Model Type: ModernBERT
- Parameters: ~149M (22 layers, hidden size 768)
- Precision: float32
- Framework: PyTorch + Transformers
- Compatible: transformers >= 4.48.0 (the first release with ModernBERT support)
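A quick environment check before loading the checkpoint (a minimal sketch; it only verifies that the installed transformers release is new enough for the ModernBERT architecture):

```python
from importlib.metadata import version
from packaging.version import Version

installed = Version(version("transformers"))
assert installed >= Version("4.48.0"), f"transformers {installed} is too old for ModernBERT"
print(f"transformers {installed} OK")
```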
## Citation

If you use this model, please cite:

```bibtex
@misc{clinical-contrastive-modernbert,
  title={Clinical Contrastive ModernBERT},
  author={Your Name},
  year={2025},
  url={https://huggingface.co/nikhil061307/contrastive-learning-bert-added-token}
}
```
## License

This model follows the same license as the base model, Simonlee711/Clinical_ModernBERT.