---
tags:
- transformers
- text-classification
- russian
- constructicon
- nlp
- linguistics
base_model: intfloat/multilingual-e5-large
language:
- ru
pipeline_tag: text-classification
widget:
- text: "passage: NP-Nom так и VP-Pfv[Sep]query: Петр так и замер."
  example_title: "Positive example"
- text: "passage: NP-Nom так и VP-Pfv[Sep]query: Мы хорошо поработали."
  example_title: "Negative example"
- text: "passage: мягко говоря, Cl[Sep]query: Мягко говоря, это была ошибка."
  example_title: "Positive example"
---

# Russian Constructicon Classifier

A binary classification model for determining whether a Russian Constructicon pattern is present in a given text example. Fine-tuned from [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) in two stages: first as a semantic model on Russian Constructicon data, then for binary classification.

## Model Details

- **Base model:** intfloat/multilingual-e5-large
- **Task:** Binary text classification
- **Language:** Russian
- **Training:** Two-stage fine-tuning on Russian Constructicon data

## Usage

### Primary Usage (RusCxnPipe Library)

This model is designed for use with the [RusCxnPipe](https://github.com/Futyn-Maker/ruscxnpipe) library:

```python
from ruscxnpipe import ConstructionClassifier

classifier = ConstructionClassifier(
    model_name="Futyn-Maker/ruscxn-classifier"
)

# Classify candidates (output from semantic search)
queries = ["Петр так и замер."]
candidates = [[{"id": "pattern1", "pattern": "NP-Nom так и VP-Pfv"}]]

results = classifier.classify_candidates(queries, candidates)
print(results[0][0]['is_present'])  # 1 if present, 0 if absent
```

### Direct Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model = AutoModelForSequenceClassification.from_pretrained("Futyn-Maker/ruscxn-classifier")
tokenizer = AutoTokenizer.from_pretrained("Futyn-Maker/ruscxn-classifier")

# Format: "passage: [pattern][Sep]query: [example]"
text = "passage: NP-Nom так и VP-Pfv[Sep]query: Петр так и замер."
inputs = tokenizer(text, return_tensors="pt", truncation=True)

with torch.no_grad():
    outputs = model(**inputs)
    prediction = torch.softmax(outputs.logits, dim=-1)
    is_present = torch.argmax(prediction, dim=-1).item()

print(f"Construction present: {is_present}")  # 1 = present, 0 = absent
```

## Input Format

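The model expects a single string in the format `"passage: [pattern][Sep]query: [example]"`. The pattern comes first, and `[Sep]` is written literally, as in the usage examples above.

- **passage:** The Constructicon pattern to check for
- **query:** The Russian text to analyze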
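
For direct usage, assembling the input string by hand is easy to get wrong (for example, by swapping the two prefixes). The following is a minimal sketch of a helper that builds the string from a pattern and an example. The function name `build_classifier_input` is hypothetical and is not part of RusCxnPipe or of this model's repository.

```python
def build_classifier_input(pattern: str, example: str) -> str:
    """Build the model input string for one (pattern, example) pair.

    Hypothetical helper, not part of RusCxnPipe: it only reproduces the
    "passage: [pattern][Sep]query: [example]" layout described above.
    """
    return f"passage: {pattern}[Sep]query: {example}"


# One example checked against several candidate patterns
example = "Петр так и замер."
patterns = ["NP-Nom так и VP-Pfv", "мягко говоря, Cl"]
texts = [build_classifier_input(p, example) for p in patterns]
print(texts[0])
```

The resulting list of strings can be passed to the tokenizer in one call (with `padding=True`) to score several candidate patterns for the same sentence.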

## Training

1. **Stage 1:** Semantic embedding training on Russian Constructicon examples and patterns
2. **Stage 2:** Binary classification fine-tuning to predict construction presence

## Output

- **Label 0:** Construction is NOT present in the text
- **Label 1:** Construction IS present in the text
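
If a confidence score is needed alongside the label, the two logits can be converted to a probability for Label 1. The sketch below uses only the standard library and assumes the logits have already been pulled out of the model output as a plain list, for example with `outputs.logits[0].tolist()`. The function name `interpret_logits` and the `threshold` parameter are illustrative assumptions, not part of the model or of RusCxnPipe.

```python
import math


def interpret_logits(logits, threshold=0.5):
    """Map two raw logits [absent, present] to a label and a probability.

    Hypothetical helper: `threshold` is an illustrative knob. At the
    default of 0.5 the result matches taking the argmax of the logits.
    """
    # Numerically stable softmax over the two logits
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    p_present = exps[1] / total
    return {
        "is_present": int(p_present > threshold),  # 1 = present, 0 = absent
        "probability": p_present,
    }


# Made-up logits for illustration only, not real model output
result = interpret_logits([-1.2, 2.3])
print(result["is_present"], round(result["probability"], 3))
```

Raising `threshold` above 0.5 trades recall for precision; whether that is worthwhile depends on the downstream use and is not something this model card specifies.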

## Framework Versions

- Transformers: 4.51.3
- PyTorch: 2.7.0+cu126
- Python: 3.10.12