rootxhacker/arthemis-embedding

This is a text embedding model finetuned from arthemislm-base on the all-nli-pair, all-nli-pair-class, all-nli-pair-score, all-nli-triplet, stsb, quora and natural-questions datasets. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

The Arthemis Embedding model is a 155.8M parameter text embedding model that incorporates Spiking Neural Networks (SNNs) and Liquid Time Constants (LTCs) for enhanced temporal dynamics and semantic representation learning. This neuromorphic architecture provides unique advantages in classification tasks while maintaining competitive performance across various text understanding benchmarks.

This embedding model performs on par with jinaai/jina-embeddings-v2-base-en on MTEB.

Model Details

Model Type: Text Embedding
Supported Languages: English
Number of Parameters: 155.8M
Context Length: 1024 tokens
Embedding Dimension: 768
Base Model: arthemislm-base
Training Data: all-nli-pair, all-nli-pair-class, all-nli-pair-score, all-nli-triplet, stsb, quora, natural-questions

Architecture Features

  • Spiking Neural Networks in attention mechanisms for temporal processing
  • Liquid Time Constants in feed-forward layers for adaptive dynamics
  • 12-layer transformer backbone with neuromorphic enhancements
  • RoPE positional encoding for sequence understanding
  • Surrogate gradient training for differentiable spike computation (see the sketch after this list)
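
The surrogate gradient idea is what makes the spiking layers trainable: the forward pass applies a hard threshold, while the backward pass substitutes a smooth approximation so gradients can flow through the spike. Below is a minimal PyTorch sketch of this technique; the fast-sigmoid surrogate and the class name are illustrative choices, not the model's actual implementation (only the threshold of 1.0 comes from the specifications below).

import torch

class SpikeFunction(torch.autograd.Function):
    """Heaviside spike in the forward pass, smooth surrogate in the backward pass."""

    @staticmethod
    def forward(ctx, membrane_potential, threshold):
        ctx.save_for_backward(membrane_potential)
        ctx.threshold = threshold
        # Hard, non-differentiable spike: 1.0 wherever the potential crosses the threshold
        return (membrane_potential >= threshold).float()

    @staticmethod
    def backward(ctx, grad_output):
        (membrane_potential,) = ctx.saved_tensors
        # Fast-sigmoid surrogate derivative, peaked at the threshold
        surrogate = 1.0 / (1.0 + 10.0 * (membrane_potential - ctx.threshold).abs()) ** 2
        return grad_output * surrogate, None  # no gradient for the threshold

# Spikes are binary, yet the call is differentiable thanks to the surrogate
potentials = torch.randn(4, 768, requires_grad=True)
spikes = SpikeFunction.apply(potentials, 1.0)
spikes.sum().backward()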

Inference

Inference code for this embedding model is available in the following gist:

https://gist.github.com/harishsg993010/220c24f0b2c41a6287a8579cd17c838f

Usage (Python)

Using this model with the custom implementation:

# The custom encoder class comes from the inference gist linked above,
# saved locally as mteb_benchmark_snn_ltc.py
from mteb_benchmark_snn_ltc import MTEBLlamaSNNLTCEncoder

# Load the model from the Hugging Face Hub
model = MTEBLlamaSNNLTCEncoder('rootxhacker/arthemis-embedding')

# Encode sentences into 768-dimensional vectors
sentences = ["This is an example sentence", "Each sentence is converted"]
embeddings = model.encode(sentences, task_name="similarity")

print(f"Embeddings shape: {embeddings.shape}")  # (2, 768)
print(f"Embedding dimension: {embeddings.shape[1]}")

Usage (Custom Implementation)

For direct usage with the neuromorphic architecture:

from scipy.spatial.distance import cosine
from transformers import AutoTokenizer

# The custom encoder class comes from the inference gist,
# saved locally as mteb_benchmark_snn_ltc.py
from mteb_benchmark_snn_ltc import MTEBLlamaSNNLTCEncoder

# Initialize the tokenizer (EOS doubles as the padding token)
tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
tokenizer.pad_token = tokenizer.eos_token

# Load the model
model = MTEBLlamaSNNLTCEncoder('rootxhacker/arthemis-embedding')

# Encode text
sentences = ['This is an example sentence', 'Each sentence is converted']
embeddings = model.encode(sentences, task_name="embedding_task")

# Compare the two embeddings with cosine similarity
similarity = 1 - cosine(embeddings[0], embeddings[1])
print(f"Cosine similarity: {similarity:.4f}")

Evaluation

The model has been evaluated on 41 tasks from the MTEB (Massive Text Embedding Benchmark):

MTEB Performance

Task Type        Average Score   Task Count   Best Individual Score
Classification   42.78           8            Amazon Counterfactual: 65.43
STS              39.96           8            STS17: 58.48
Clustering       28.54           8            ArXiv Hierarchical: 49.82
Retrieval        12.41           5            Twitter URL: 53.78
Other            13.07           12           Ask Ubuntu: 43.56

Overall MTEB Score: 27.05 (across 41 tasks)

Notable Individual Results

Task                                   Score   Task Type
Amazon Counterfactual Classification   65.43   Classification
STS17                                  58.48   Semantic Similarity
Toxic Conversations Classification     55.54   Classification
IMDB Classification                    51.69   Classification
SICK-R                                 49.24   Semantic Similarity
ArXiv Hierarchical Clustering          49.82   Clustering
Banking77 Classification               29.98   Classification
STSBenchmark                           36.82   Semantic Similarity

Model Strengths

  • Classification Strength: The model's best task family, averaging 42.78 across eight MTEB classification tasks
  • Semantic Understanding: Solid semantic textual similarity performance (39.96 average across eight STS tasks)
  • Neuromorphic Architecture: Spiking dynamics provide an alternative pattern-recognition mechanism in attention
  • Temporal Processing: Liquid time constants enable adaptive, input-dependent sequence processing
  • Robust Embeddings: 768-dimensional vectors capture rich semantic representations

Applications

  • Text Classification: Financial intent detection, sentiment analysis, content moderation
  • Semantic Search: Document retrieval and similarity matching
  • Clustering: Automatic text organization and topic discovery (see the sketch after this list)
  • Content Safety: Toxic content detection and content moderation
  • Question Answering: Similarity-based answer retrieval
  • Paraphrase Mining: Finding semantically equivalent text pairs
  • Semantic Textual Similarity: Measuring text similarity for various applications
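
For instance, the clustering use case can be as simple as running k-means over the embedding vectors. A hedged sketch with scikit-learn, reusing the encoder from the Usage section (the task_name string and cluster count are illustrative):

from sklearn.cluster import KMeans

texts = [
    "How do I reset my password?",
    "My card payment was declined",
    "Forgot login credentials",
    "Why was my transaction rejected?",
]

# Encode and group the texts into two topics by k-means over the 768-d vectors
embeddings = model.encode(texts, task_name="clustering")
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)

for text, label in zip(texts, labels):
    print(label, text)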

Training Details

The model was finetuned from the arthemislm-base foundation model on the following datasets (a simplified sketch of triplet-style training follows the list):

  • all-nli-pair: Natural Language Inference pair datasets
  • all-nli-pair-class: Classification variants of NLI pairs
  • all-nli-pair-score: Scored NLI pairs for similarity learning
  • all-nli-triplet: Triplet learning from NLI data
  • stsb: Semantic Textual Similarity Benchmark
  • quora: Quora Question Pairs for paraphrase detection
  • natural-questions: Google's Natural Questions dataset
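
The NLI triplet data lends itself to contrastive training: each anchor sentence is pulled toward its entailed positive and pushed away from its contradictory negative. A simplified sketch of a cosine triplet-margin objective over pooled embeddings (illustrative only; the actual loss, margin, and optimization details of this model are not documented here):

import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Require the anchor to be closer to the positive than to the negative by `margin`."""
    anchor, positive, negative = (F.normalize(t, dim=-1) for t in (anchor, positive, negative))
    pos_sim = (anchor * positive).sum(-1)  # cosine similarity to the positive
    neg_sim = (anchor * negative).sum(-1)  # cosine similarity to the negative
    return F.relu(margin - pos_sim + neg_sim).mean()

# Toy batch of 768-d vectors standing in for encoder outputs
a, p, n = (torch.randn(8, 768, requires_grad=True) for _ in range(3))
loss = triplet_loss(a, p, n)
loss.backward()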

The neuromorphic enhancements were integrated during training to provide:

  • Spiking neuron dynamics in attention layers
  • Liquid time constant adaptation in feed-forward networks (sketched after this list)
  • Surrogate gradient optimization for spike-based learning
  • Enhanced temporal pattern recognition capabilities
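
For intuition, a liquid time constant layer lets the decay rate of its hidden state depend on the current input instead of being fixed. A minimal sketch of one Euler step of such a cell (the update rule and names are illustrative; only the 768/256 sizes come from the specifications below):

import torch
import torch.nn as nn
import torch.nn.functional as F

class LTCCell(nn.Module):
    """One Euler step of h' = h + dt * (-h + f(x)) / tau(x), with an input-dependent tau."""

    def __init__(self, dim=768, ltc_hidden=256):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, ltc_hidden), nn.Tanh(), nn.Linear(ltc_hidden, dim))
        self.tau = nn.Sequential(nn.Linear(dim, ltc_hidden), nn.Tanh(), nn.Linear(ltc_hidden, dim))

    def forward(self, h, x, dt=1.0):
        tau = F.softplus(self.tau(x)) + 1.0  # adaptive time constant, always >= 1
        return h + dt * (-h + self.f(x)) / tau

cell = LTCCell()
h = torch.zeros(2, 768)   # hidden state
x = torch.randn(2, 768)   # token representation entering the layer
h = cell(h, x)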

Technical Specifications

Architecture: Transformer with SNN/LTC enhancements
Hidden Size: 768
Intermediate Size: 2048  
Attention Heads: 12
Layers: 12
Max Position Embeddings: 1024
Vocabulary Size: 50,257
Spiking Threshold: 1.0
LTC Hidden Size: 256
Training Precision: FP32
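
For convenience, the same specifications as a plain Python dictionary (key names are illustrative and not necessarily the model's actual config schema):

arthemis_config = {
    "hidden_size": 768,
    "intermediate_size": 2048,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "max_position_embeddings": 1024,
    "vocab_size": 50257,
    "spiking_threshold": 1.0,
    "ltc_hidden_size": 256,
    "torch_dtype": "float32",
}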

Citation

@misc{arthemis-embedding-2024,
  title={Arthemis Embedding: A Neuromorphic Text Embedding Model},
  author={rootxhacker},
  year={2024},
  howpublished={\url{https://huggingface.co/rootxhacker/arthemis-embedding}}
}

License

Please refer to the model files for licensing information.
