Update README for small model
license: mit
library_name: charboundary
pipeline_tag: text-classification
datasets:
- alea-institute/alea-legal-benchmark-sentence-paragraph-boundaries
- alea-institute/kl3m-data-snapshot-20250324
metrics:
- accuracy

# CharBoundary small Model

This is the small model for the [CharBoundary](https://github.com/alea-institute/charboundary) library (v0.5.0), a fast character-based sentence and paragraph boundary detection system optimized for legal text.

## Model Details

- **Size**: small
- **Model Size**: 3.0 MB (SKOPS compressed)
- **Memory Usage**: 1,026 MB at runtime
- **Training Data**: Legal text with ~50,000 samples from the [KL3M dataset](https://huggingface.co/datasets/alea-institute/kl3m-data-snapshot-20250324)
- **Model Type**: Random Forest (32 trees, max depth 16)
- **Format**: scikit-learn model (serialized with skops)
- **Task**: Character-level boundary detection for text segmentation
- **License**: MIT
- **Throughput**: ~748K characters/second
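For intuition about the shape of this classifier, here is a minimal, hypothetical scikit-learn sketch matching the configuration above (32 trees, max depth 16). The byte-window features and two-row dataset are invented for illustration only; they are not the library's actual feature pipeline or training data.

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-in features: byte values of an 8-character window
# around a candidate boundary (the real pipeline's features differ).
X = [
    [ord(c) for c in "ence. He"],  # period ending a sentence -> boundary
    [ord(c) for c in "e.g. the"],  # period in an abbreviation -> not one
]
y = [1, 0]

# Same forest shape as this model card: 32 trees, max depth 16.
clf = RandomForestClassifier(n_estimators=32, max_depth=16, random_state=0)
clf.fit(X, y)
print(clf.predict(X).tolist())
```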

## Usage
```python
# ... (installation and model loading, elided in this diff)
segmenter = TextSegmenter.load(model_path)

text = "This is a test sentence. Here's another one!"
sentences = segmenter.segment_to_sentences(text)
print(sentences)

# Segment to paragraphs
paragraphs = segmenter.segment_to_paragraphs(text)
print(paragraphs)

# Get character-level spans
sentence_spans = segmenter.segment_to_sentence_spans(text)
print(sentence_spans)  # [(0, 24), (25, 42)]
```
## Performance

The model uses a character-based random forest classifier with the following configuration:

- Window Size: 5 characters before, 3 characters after potential boundary
- Accuracy: 0.9970
- F1 Score: 0.7730
- Precision: 0.7460
- Recall: 0.9870
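The window configuration above can be made concrete with a small sketch. This `char_window` helper is hypothetical (it is not part of the CharBoundary API); it only shows what a 5-before/3-after character window around a candidate boundary position looks like.

```python
def char_window(text: str, pos: int, before: int = 5, after: int = 3) -> list[str]:
    """Collect `before` characters preceding `pos` and `after` characters
    from `pos` onward, padding with NUL where the window leaves the text."""
    pad = "\x00"
    left = text[max(0, pos - before):pos].rjust(before, pad)
    right = text[pos:pos + after].ljust(after, pad)
    return list(left + right)

# Candidate boundary right after the period in "Dr." (position 3).
print(char_window("Dr. Smith arrived.", 3))
# ['\x00', '\x00', 'D', 'r', '.', ' ', 'S', 'm']
```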
### Dataset-specific Performance

| Dataset | Precision | F1 | Recall |
|---------|-----------|-------|--------|
| ALEA SBD Benchmark | 0.624 | 0.718 | 0.845 |
| SCOTUS | 0.926 | 0.773 | 0.664 |
| Cyber Crime | 0.939 | 0.837 | 0.755 |
| BVA | 0.937 | 0.870 | 0.812 |
| Intellectual Property | 0.927 | 0.883 | 0.843 |
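The F1 column in the table above is the harmonic mean of precision and recall, which can be checked directly:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) rows from the table above.
rows = [
    ("ALEA SBD Benchmark", 0.624, 0.845, 0.718),
    ("SCOTUS", 0.926, 0.664, 0.773),
    ("Cyber Crime", 0.939, 0.755, 0.837),
    ("BVA", 0.937, 0.812, 0.870),
    ("Intellectual Property", 0.927, 0.843, 0.883),
]
for name, p, r, reported in rows:
    # Each reported F1 agrees with 2PR/(P+R) to three decimal places.
    assert abs(f1(p, r) - reported) < 1e-3, name
```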
## Available Models

CharBoundary comes in three sizes, balancing accuracy and efficiency:

| Model | Size (MB) | Memory (MB) | Throughput (chars/sec) | F1 Score |
|-------|-----------|-------------|------------------------|----------|
| [Small](https://huggingface.co/alea-institute/charboundary-small) | 3.0 | 1,026 | ~748K | 0.773 |
| [Medium](https://huggingface.co/alea-institute/charboundary-medium) | 13.0 | 1,897 | ~587K | 0.779 |
| [Large](https://huggingface.co/alea-institute/charboundary-large) | 60.0 | 5,734 | ~518K | 0.782 |

ONNX-optimized versions of each model are also available:

- [Small ONNX](https://huggingface.co/alea-institute/charboundary-small-onnx)
- [Medium ONNX](https://huggingface.co/alea-institute/charboundary-medium-onnx)
- [Large ONNX](https://huggingface.co/alea-institute/charboundary-large-onnx)
## Paper and Citation

This model is part of the research presented in the following paper:

For more details on the model architecture, training, and evaluation, please see:

- [Paper: "Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary"](https://arxiv.org/abs/2504.04131)
- [CharBoundary GitHub repository](https://github.com/alea-institute/charboundary)
- [Annotated training data](https://huggingface.co/datasets/alea-institute/alea-legal-benchmark-sentence-paragraph-boundaries)