alea-institute committed on
Commit 467d115 · verified · 1 Parent(s): e53fdfc

Update README for small model

Files changed (1)
  1. README.md +43 -5
README.md CHANGED
@@ -15,6 +15,7 @@ license: mit
  library_name: charboundary
  pipeline_tag: text-classification
  datasets:
+ - alea-institute/alea-legal-benchmark-sentence-paragraph-boundaries
  - alea-institute/kl3m-data-snapshot-20250324
  metrics:
  - accuracy
@@ -29,16 +30,19 @@ papers:
  # CharBoundary small Model
 
  This is the small model for the [CharBoundary](https://github.com/alea-institute/charboundary) library (v0.5.0),
- a fast character-based sentence and paragraph boundary detection system.
+ a fast character-based sentence and paragraph boundary detection system optimized for legal text.
 
  ## Model Details
 
  - **Size**: small
- - **Training Data**: Legal text with ~50,000 samples
+ - **Model Size**: 3.0 MB (SKOPS compressed)
+ - **Memory Usage**: 1026 MB at runtime
+ - **Training Data**: Legal text with ~50,000 samples from [KL3M dataset](https://huggingface.co/datasets/alea-institute/kl3m-data-snapshot-20250324)
  - **Model Type**: Random Forest (32 trees, max depth 16)
  - **Format**: scikit-learn model (serialized with skops)
  - **Task**: Character-level boundary detection for text segmentation
  - **License**: MIT
+ - **Throughput**: ~748K characters/second
 
  ## Usage
 
@@ -56,17 +60,50 @@ segmenter = TextSegmenter.load(model_path)
  text = "This is a test sentence. Here's another one!"
  sentences = segmenter.segment_to_sentences(text)
  print(sentences)
+
+ # Segment to paragraphs
+ paragraphs = segmenter.segment_to_paragraphs(text)
+ print(paragraphs)
+
+ # Get character-level spans
+ sentence_spans = segmenter.segment_to_sentence_spans(text)
+ print(sentence_spans) # [(0, 24), (25, 42)]
  ```
 
  ## Performance
 
- The model uses a random forest classifier with the following configuration:
+ The model uses a character-based random forest classifier with the following configuration:
  - Window Size: 5 characters before, 3 characters after potential boundary
  - Accuracy: 0.9970
- - F1 Score: 0.9880
- - Precision: 0.9900
+ - F1 Score: 0.7730
+ - Precision: 0.7460
  - Recall: 0.9870
 
+ ### Dataset-specific Performance
+
+ | Dataset | Precision | F1 | Recall |
+ |---------|-----------|-------|--------|
+ | ALEA SBD Benchmark | 0.624 | 0.718 | 0.845 |
+ | SCOTUS | 0.926 | 0.773 | 0.664 |
+ | Cyber Crime | 0.939 | 0.837 | 0.755 |
+ | BVA | 0.937 | 0.870 | 0.812 |
+ | Intellectual Property | 0.927 | 0.883 | 0.843 |
+
+ ## Available Models
+
+ CharBoundary comes in three sizes, balancing accuracy and efficiency:
+
+ | Model | Size (MB) | Memory (MB) | Throughput (chars/sec) | F1 Score |
+ |-------|-----------|-------------|------------------------|----------|
+ | [Small](https://huggingface.co/alea-institute/charboundary-small) | 3.0 | 1,026 | ~748K | 0.773 |
+ | [Medium](https://huggingface.co/alea-institute/charboundary-medium) | 13.0 | 1,897 | ~587K | 0.779 |
+ | [Large](https://huggingface.co/alea-institute/charboundary-large) | 60.0 | 5,734 | ~518K | 0.782 |
+
+ ONNX-optimized versions of each model are also available:
+ - [Small ONNX](https://huggingface.co/alea-institute/charboundary-small-onnx)
+ - [Medium ONNX](https://huggingface.co/alea-institute/charboundary-medium-onnx)
+ - [Large ONNX](https://huggingface.co/alea-institute/charboundary-large-onnx)
+
  ## Paper and Citation
 
  This model is part of the research presented in the following paper:
@@ -83,3 +120,4 @@ This model is part of the research presented in the following paper:
  For more details on the model architecture, training, and evaluation, please see:
  - [Paper: "Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary"](https://arxiv.org/abs/2504.04131)
  - [CharBoundary GitHub repository](https://github.com/alea-institute/charboundary)
+ - [Annotated training data](https://huggingface.co/datasets/alea-institute/alea-legal-benchmark-sentence-paragraph-boundaries)
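
The updated README reports precision, recall, and F1 together. Since F1 is the harmonic mean of precision and recall, the reported figures can be cross-checked with a few lines of Python. The sketch below is illustrative only: `f1_score` and `find_mismatches` are hypothetical helper names, not part of the charboundary library, and the tolerance of 0.002 is an assumption chosen to absorb three-decimal rounding. The numbers are copied from the Performance list and the dataset table in this commit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1), copied from the README text in this commit.
reported = {
    "ALEA SBD Benchmark": (0.624, 0.845, 0.718),
    "SCOTUS": (0.926, 0.664, 0.773),
    "Cyber Crime": (0.939, 0.755, 0.837),
    "BVA": (0.937, 0.812, 0.870),
    "Intellectual Property": (0.927, 0.843, 0.883),
    "Performance list": (0.7460, 0.9870, 0.7730),
}


def find_mismatches(rows, tolerance=0.002):
    """List the row names whose reported F1 differs from the recomputed F1."""
    return [
        name
        for name, (precision, recall, f1) in rows.items()
        if abs(f1_score(precision, recall) - f1) > tolerance
    ]


mismatches = find_mismatches(reported)
print(mismatches)  # ['Performance list']
```

With the values as listed, the five dataset rows agree with the formula, while the Performance list does not: precision 0.7460 and recall 0.9870 give an F1 of roughly 0.850, not 0.7730. The recall line is unchanged context in this diff, so it may simply not have been updated alongside precision and F1; that reading is an inference from the diff, not something the commit states.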