Update README for small model
license: mit
library_name: charboundary
pipeline_tag: text-classification
datasets:
- alea-institute/alea-legal-benchmark-sentence-paragraph-boundaries
- alea-institute/kl3m-data-snapshot-20250324
metrics:
- accuracy

# CharBoundary small Model

This is the small model for the [CharBoundary](https://github.com/alea-institute/charboundary) library (v0.5.0), a fast character-based sentence and paragraph boundary detection system optimized for legal text.

## Model Details

- **Size**: small
- **Model Size**: 3.0 MB (SKOPS compressed)
- **Memory Usage**: 1,026 MB at runtime
- **Training Data**: Legal text with ~50,000 samples from the [KL3M dataset](https://huggingface.co/datasets/alea-institute/kl3m-data-snapshot-20250324)
- **Model Type**: Random Forest (32 trees, max depth 16)
- **Format**: scikit-learn model (serialized with skops)
- **Task**: Character-level boundary detection for text segmentation
- **License**: MIT
- **Throughput**: ~748K characters/second
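For intuition about the shape of this classifier, here is a minimal, hypothetical scikit-learn sketch matching the configuration above (32 trees, max depth 16). The byte-window features and two-row dataset are invented for illustration only; they are not the library's actual feature pipeline or training data.

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-in features: byte values of an 8-character window
# around a candidate boundary (the real pipeline's features differ).
X = [
    [ord(c) for c in "ence. He"],  # period ending a sentence -> boundary
    [ord(c) for c in "e.g. the"],  # period in an abbreviation -> not one
]
y = [1, 0]

# Same forest shape as this model card: 32 trees, max depth 16.
clf = RandomForestClassifier(n_estimators=32, max_depth=16, random_state=0)
clf.fit(X, y)
print(clf.predict(X).tolist())
```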

## Usage
```python
# ... (installation and model loading, elided in this diff)
segmenter = TextSegmenter.load(model_path)

text = "This is a test sentence. Here's another one!"
sentences = segmenter.segment_to_sentences(text)
print(sentences)

# Segment to paragraphs
paragraphs = segmenter.segment_to_paragraphs(text)
print(paragraphs)

# Get character-level spans
sentence_spans = segmenter.segment_to_sentence_spans(text)
print(sentence_spans)  # [(0, 24), (25, 42)]
```
## Performance

The model uses a character-based random forest classifier with the following configuration:

- Window Size: 5 characters before, 3 characters after potential boundary
- Accuracy: 0.9970
- F1 Score: 0.7730
- Precision: 0.7460
- Recall: 0.9870
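The window configuration above can be made concrete with a small sketch. This `char_window` helper is hypothetical (it is not part of the CharBoundary API); it only shows what a 5-before/3-after character window around a candidate boundary position looks like.

```python
def char_window(text: str, pos: int, before: int = 5, after: int = 3) -> list[str]:
    """Collect `before` characters preceding `pos` and `after` characters
    from `pos` onward, padding with NUL where the window leaves the text."""
    pad = "\x00"
    left = text[max(0, pos - before):pos].rjust(before, pad)
    right = text[pos:pos + after].ljust(after, pad)
    return list(left + right)

# Candidate boundary right after the period in "Dr." (position 3).
print(char_window("Dr. Smith arrived.", 3))
# ['\x00', '\x00', 'D', 'r', '.', ' ', 'S', 'm']
```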
### Dataset-specific Performance

| Dataset | Precision | F1 | Recall |
|---------|-----------|-------|--------|
| ALEA SBD Benchmark | 0.624 | 0.718 | 0.845 |
| SCOTUS | 0.926 | 0.773 | 0.664 |
| Cyber Crime | 0.939 | 0.837 | 0.755 |
| BVA | 0.937 | 0.870 | 0.812 |
| Intellectual Property | 0.927 | 0.883 | 0.843 |
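The F1 column in the table above is the harmonic mean of precision and recall, which can be checked directly:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) rows from the table above.
rows = [
    ("ALEA SBD Benchmark", 0.624, 0.845, 0.718),
    ("SCOTUS", 0.926, 0.664, 0.773),
    ("Cyber Crime", 0.939, 0.755, 0.837),
    ("BVA", 0.937, 0.812, 0.870),
    ("Intellectual Property", 0.927, 0.843, 0.883),
]
for name, p, r, reported in rows:
    # Each reported F1 agrees with 2PR/(P+R) to three decimal places.
    assert abs(f1(p, r) - reported) < 1e-3, name
```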
## Available Models

CharBoundary comes in three sizes, balancing accuracy and efficiency:

| Model | Size (MB) | Memory (MB) | Throughput (chars/sec) | F1 Score |
|-------|-----------|-------------|------------------------|----------|
| [Small](https://huggingface.co/alea-institute/charboundary-small) | 3.0 | 1,026 | ~748K | 0.773 |
| [Medium](https://huggingface.co/alea-institute/charboundary-medium) | 13.0 | 1,897 | ~587K | 0.779 |
| [Large](https://huggingface.co/alea-institute/charboundary-large) | 60.0 | 5,734 | ~518K | 0.782 |

ONNX-optimized versions of each model are also available:

- [Small ONNX](https://huggingface.co/alea-institute/charboundary-small-onnx)
- [Medium ONNX](https://huggingface.co/alea-institute/charboundary-medium-onnx)
- [Large ONNX](https://huggingface.co/alea-institute/charboundary-large-onnx)
## Paper and Citation

This model is part of the research presented in the following paper:

For more details on the model architecture, training, and evaluation, please see:

- [Paper: "Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary"](https://arxiv.org/abs/2504.04131)
- [CharBoundary GitHub repository](https://github.com/alea-institute/charboundary)
- [Annotated training data](https://huggingface.co/datasets/alea-institute/alea-legal-benchmark-sentence-paragraph-boundaries)