---
language:
  - en
tags:
  - charboundary
  - sentence-boundary-detection
  - paragraph-detection
  - legal-text
  - legal-nlp
  - text-segmentation
  - cpu
  - document-processing
  - rag
license: mit
library_name: charboundary
pipeline_tag: text-classification
datasets:
  - alea-institute/alea-legal-benchmark-sentence-paragraph-boundaries
  - alea-institute/kl3m-data-snapshot-20250324
metrics:
  - accuracy
  - f1
  - precision
  - recall
  - throughput
papers:
  - https://arxiv.org/abs/2504.04131
---

# CharBoundary Small Model

This is the small model for the [CharBoundary](https://github.com/alea-institute/charboundary) library (v0.5.0),
a fast character-based sentence and paragraph boundary detection system optimized for legal text.

## Model Details

- **Size**: small
- **Model Size**: 3.0 MB (SKOPS compressed)
- **Memory Usage**: 1026 MB at runtime
- **Training Data**: ~50,000 samples of legal text from the [KL3M dataset](https://huggingface.co/datasets/alea-institute/kl3m-data-snapshot-20250324)
- **Model Type**: Random Forest (32 trees, max depth 16)
- **Format**: scikit-learn model (serialized with skops)
- **Task**: Character-level boundary detection for text segmentation
- **License**: MIT
- **Throughput**: ~748K characters/second

## Usage

> **Important:** When loading models from Hugging Face Hub, you must set `trust_model=True` to allow loading custom class types.
> 
> **Security Note:** The ONNX model variants are recommended in security-sensitive environments as they don't require bypassing skops security measures with `trust_model=True`. See the [ONNX versions](https://huggingface.co/alea-institute/charboundary-small-onnx) for a safer alternative.

```python
# pip install charboundary
from huggingface_hub import hf_hub_download
from charboundary import TextSegmenter

# Download the model
model_path = hf_hub_download(repo_id="alea-institute/charboundary-small", filename="model.pkl")

# Load the model (trust_model=True is required when loading from external sources)
segmenter = TextSegmenter.load(model_path, trust_model=True)

# Use the model
text = "This is a test sentence. Here's another one!"
sentences = segmenter.segment_to_sentences(text)
print(sentences)
# Output: ['This is a test sentence.', " Here's another one!"]

# Segment to spans
sentence_spans = segmenter.get_sentence_spans(text)
print(sentence_spans)
# Output: [(0, 24), (24, 44)]
```
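
The spans returned by `get_sentence_spans` are character offsets into the original string. Judging from the output above, they appear to be half-open `(start, end)` pairs, so slicing the text with them should reproduce the sentences, leading whitespace included. The sketch below hard-codes the spans from the example instead of calling the model, so it runs without downloading anything; `spans_to_sentences` is a hypothetical helper for illustration, not part of the library.

```python
# Spans copied from the example output above; no model is loaded here.
text = "This is a test sentence. Here's another one!"
sentence_spans = [(0, 24), (24, 44)]


def spans_to_sentences(text, spans, strip=False):
    """Slice `text` with half-open (start, end) character spans.

    Hypothetical helper for illustration; not part of charboundary.
    """
    pieces = [text[start:end] for start, end in spans]
    return [p.strip() for p in pieces] if strip else pieces


print(spans_to_sentences(text, sentence_spans))
print(spans_to_sentences(text, sentence_spans, strip=True))
```

Because the spans cover the text without gaps, joining the unstripped pieces gives back the input, which is useful when offsets must stay aligned with the source document (for example, for citation highlighting in a RAG pipeline).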

## Performance

The model is a character-based random forest classifier. Its context window and overall evaluation metrics are:
- Window Size: 5 characters before, 3 characters after potential boundary
- Accuracy: 0.9970
- F1 Score: 0.7730
- Precision: 0.7460
- Recall: 0.9870
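
To make the window size concrete, here is a minimal sketch of how a context of 5 characters before and 3 characters after a candidate position could be extracted. Treating `.`, `!`, and `?` as the candidates, centering the window on the punctuation character, and padding with spaces at the text edges are all assumptions made for illustration; the library's actual feature extraction may differ.

```python
LEFT, RIGHT = 5, 3  # window size from the configuration above
PAD = " "           # assumed padding character for positions outside the text
CANDIDATES = ".!?"  # assumed candidate characters; illustrative only


def char_window(text, i, left=LEFT, right=RIGHT, pad=PAD):
    """Return (before, char, after) around the character at index i."""
    before = text[max(0, i - left):i].rjust(left, pad)
    after = text[i + 1:i + 1 + right].ljust(right, pad)
    return before, text[i], after


text = "See 5 U.S.C. 552. Then stop."
for i, ch in enumerate(text):
    if ch in CANDIDATES:
        # Each tuple is the local context a classifier would see for this candidate.
        print(i, char_window(text, i))
```

The example string shows why local context matters in legal text: the periods inside `U.S.C.` and the one ending the sentence look identical on their own and can only be told apart by the characters around them.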

### Dataset-specific Performance

| Dataset | Precision | Recall | F1 |
|---------|-----------|--------|-------|
| ALEA SBD Benchmark | 0.624 | 0.845 | 0.718 |
| SCOTUS | 0.926 | 0.664 | 0.773 |
| Cyber Crime | 0.939 | 0.755 | 0.837 |
| BVA | 0.937 | 0.812 | 0.870 |
| Intellectual Property | 0.927 | 0.843 | 0.883 |
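
The F1 column is the harmonic mean of precision and recall. The short sketch below recomputes it from the precision and recall values in the table above, as a way to see how the two trade off per dataset; the function name `f1_score` is just a local helper here, not an import from any library.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the table above
rows = {
    "ALEA SBD Benchmark": (0.624, 0.845),
    "SCOTUS": (0.926, 0.664),
    "Cyber Crime": (0.939, 0.755),
    "BVA": (0.937, 0.812),
    "Intellectual Property": (0.927, 0.843),
}

for name, (p, r) in rows.items():
    print(f"{name}: F1 = {f1_score(p, r):.3f}")
```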

## Available Models

CharBoundary comes in three sizes, balancing accuracy and efficiency:

| Model | Format | Size (MB) | Memory (MB) | Throughput (chars/sec) | F1 Score |
|-------|--------|-----------|-------------|------------------------|----------|
| Small | [SKOPS](https://huggingface.co/alea-institute/charboundary-small) / [ONNX](https://huggingface.co/alea-institute/charboundary-small-onnx) | 3.0 / 0.5 | 1,026 | ~748K | 0.773 |
| Medium | [SKOPS](https://huggingface.co/alea-institute/charboundary-medium) / [ONNX](https://huggingface.co/alea-institute/charboundary-medium-onnx) | 13.0 / 2.6 | 1,897 | ~587K | 0.779 |
| Large | [SKOPS](https://huggingface.co/alea-institute/charboundary-large) / [ONNX](https://huggingface.co/alea-institute/charboundary-large-onnx) | 60.0 / 13.0 | 5,734 | ~518K | 0.782 |
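
One way to read this table is as a lookup for choosing a model under a memory budget. The sketch below copies the memory, throughput, and F1 figures from the table and picks the highest-F1 model that fits; `pick_model` and `estimated_seconds` are hypothetical helpers, and the time estimate is only a rough guide, since real throughput depends on hardware and input.

```python
# Figures copied from the table above.
MODELS = [
    # (name, memory_mb, chars_per_sec, f1)
    ("small", 1026, 748_000, 0.773),
    ("medium", 1897, 587_000, 0.779),
    ("large", 5734, 518_000, 0.782),
]


def pick_model(memory_budget_mb):
    """Return the highest-F1 model whose runtime memory fits the budget, or None."""
    fitting = [m for m in MODELS if m[1] <= memory_budget_mb]
    return max(fitting, key=lambda m: m[3]) if fitting else None


def estimated_seconds(model, n_chars):
    """Rough single-process time estimate from the reported throughput."""
    return n_chars / model[2]


model = pick_model(2048)  # e.g. a worker with about 2 GB available
print(model[0], round(estimated_seconds(model, 100_000_000), 1), "seconds")
```

Since F1 differs by less than 0.01 across sizes while memory grows more than fivefold, the small model is a reasonable default unless the last bit of accuracy matters.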

## Paper and Citation

This model is part of the research presented in the following paper:

```bibtex
@article{bommarito2025precise,
  title={Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary},
  author={Bommarito, Michael J and Katz, Daniel Martin and Bommarito, Jillian},
  journal={arXiv preprint arXiv:2504.04131},
  year={2025}
}
```

For more details on the model architecture, training, and evaluation, please see:
- [Paper: "Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary"](https://arxiv.org/abs/2504.04131)
- [CharBoundary GitHub repository](https://github.com/alea-institute/charboundary)
- [Annotated training data](https://huggingface.co/datasets/alea-institute/alea-legal-benchmark-sentence-paragraph-boundaries)

## Contact

This model is developed and maintained by the [ALEA Institute](https://aleainstitute.ai). 

For technical support, collaboration opportunities, or general inquiries:
 
- GitHub: https://github.com/alea-institute/kl3m-model-research
- Email: [email protected]
- Website: https://aleainstitute.ai

For any questions, please contact the [ALEA Institute](https://aleainstitute.ai) at [email protected], or
create an issue on this repository or on [GitHub](https://github.com/alea-institute/kl3m-model-research).

![https://aleainstitute.ai](https://aleainstitute.ai/images/alea-logo-ascii-1x1.png)