1-800-BAD-CODE/sentence_boundary_detection_multilang

Tags: ONNX, NeMo, sentence boundary detection, token classification, nlp


1-800-BAD-CODE committed on Mar 5, 2023

Commit b21595d (1 parent: 5bda556)

Update README.md


README.md +79 -104


@@ -80,130 +80,105 @@ For each input subword `t`, this model predicts the probability that `t` is the
 # Example Usage
-This model has been exported to `ONNX` (opset 17) alongside the associated `SentencePiece` tokenizer.
-This model is intended to be downloaded and used in the native frameworks, rather than HF's API.
-First, let's download and prepare the ONNX and SentencePiece models:
 ```python
-from sentencepiece import SentencePieceProcessor
-import onnxruntime as ort
-import numpy as np
-from huggingface_hub import hf_hub_download
 from typing import List
-spe_path = hf_hub_download(
-    repo_id="1-800-BAD-CODE/sentence_boundary_detection_multilang", filename="spe_mixed_case_64k_49lang.model"
-)
-onnx_path = hf_hub_download(
-    repo_id="1-800-BAD-CODE/sentence_boundary_detection_multilang", filename="sbd_49lang_bert_small.onnx"
-)
-tokenizer: SentencePieceProcessor = SentencePieceProcessor(spe_path)
-ort_session: ort.InferenceSession = ort.InferenceSession(onnx_path)
-```
-Next, let's define a simple function that runs inference on one text input and prints the predictions:
-```python
-def run_infer(text: str, threshold: float = 0.5):
-    # Encode as IDs for the model input; add BOS/EOS tags.
-    ids = tokenizer.EncodeAsIds(text)
-    input_ids = np.array([[tokenizer.bos_id()] + ids + [tokenizer.eos_id()]])
-    # Run inference; get probability of each token being a sentence boundary
-    outputs = ort_session.run(None, {"input_ids": input_ids})
-    # Shape [B, T]
-    probs = outputs[0]
-    # Single input is batched; keep only first element
-    probs = probs[0]
-    # Trim BOS/EOS
-    probs = probs[1:-1]
-    # Find all positions that exceed the threshold as a sentence boundary
-    break_points: List[int] = np.squeeze(np.argwhere(probs > threshold), axis=1).tolist()  # noqa
-    # Add the final token to the break points, to not have leftover tokens after the loop
-    if (not break_points) or (break_points[-1] != len(ids) - 1):
-        break_points.append(len(ids) - 1)
-    # Break tokens at boundaries, convert back to text
-    print(f"Input: {text}")
-    for i, break_point in enumerate(break_points):
-        start = 0 if i == 0 else (break_points[i - 1] + 1)
-        sub_ids = ids[start : break_point + 1]
-        sub_text = tokenizer.DecodeIds(sub_ids)
-        print(f"\tSentence {i}: {sub_text}")
 ```
-Now let's run some examples. These are all from the OpenSubtitles test set.
-Two behaviors are worth noting: English acronyms (a period alone is not sufficient to mark a sentence boundary) and Thai (spaces serve as full stops in Thai, and the model detects this automatically).
-All inputs are lower-cased to make the task harder, but the model is trained on mixed-case data, so true-cased inputs will work as well.
-```python
-# English with a lot of acronyms
-run_infer(
-    "the new d.n.a. sample has been multiplexed, and the gametes are already dividing. let's get the c.p.d. over "
-    "there. dinner's at 630 p.m. see that piece on you in the l.a. times? chicago p.d. will eat him alive."
-)
-# Chinese
-run_infer("魔鬼兵團都死了？但是如果这让你不快乐就别做了。您就不能发个电报吗。我們都準備好了。")
-# Spanish
-run_infer("él es uno de aquellos. ¿tiene algo de beber? cómo el aislamiento no vale la pena.")
-# Thai
-run_infer(
-    "พวกเขาต้องโกรธมากเลยใช่ไหม โทษทีนะลูกของเราไม่เป็นอะไรใช่ไหม ถึงเจ้าจะลากข้าไปเจ้าก็ไม่ได้อะไรอยู่ดี ผมคิดว่าจะดีกว่านะถ้าคุณไม่ออกไปไหน"
-)
-# Ukrainian
-run_infer(
-    "розігни і зігни, будь ласка. я знаю, ваши люди храбры. было приятно, правда? для начала, тебе нужен собственный "
-    "свой самолет."
-)
-# Polish
-run_infer(
-    "szedłem tylko do. pamiętaj, nigdy się nie obawiaj żyć na krawędzi ryzyka. ćwiczę już od dwóch tygodni a byłem "
-    "zabity tylko raz."
-)
-```
-Expected output:
 ```text
 Input: the new d.n.a. sample has been multiplexed, and the gametes are already dividing. let's get the c.p.d. over there. dinner's at 630 p.m. see that piece on you in the l.a. times? chicago p.d. will eat him alive.
-        Sentence 0: the new d.n.a. sample has been multiplexed, and the gametes are already dividing.
-        Sentence 1: let's get the c.p.d. over there.
-        Sentence 2: dinner's at 630 p.m.
-        Sentence 3: see that piece on you in the l.a. times?
-        Sentence 4: chicago p.d. will eat him alive.
 Input: 魔鬼兵團都死了？但是如果这让你不快乐就别做了。您就不能发个电报吗。我們都準備好了。
-        Sentence 0: 魔鬼兵團都死了？
-        Sentence 1: 但是如果这让你不快乐就别做了。
-        Sentence 2: 您就不能发个电报吗。
-        Sentence 3: 我們都準備好了。
 Input: él es uno de aquellos. ¿tiene algo de beber? cómo el aislamiento no vale la pena.
-        Sentence 0: él es uno de aquellos.
-        Sentence 1: ¿tiene algo de beber?
-        Sentence 2: cómo el aislamiento no vale la pena.
 Input: พวกเขาต้องโกรธมากเลยใช่ไหม โทษทีนะลูกของเราไม่เป็นอะไรใช่ไหม ถึงเจ้าจะลากข้าไปเจ้าก็ไม่ได้อะไรอยู่ดี ผมคิดว่าจะดีกว่านะถ้าคุณไม่ออกไปไหน
-        Sentence 0: พวกเขาต้องโกรธมากเลยใช่ไหม
-        Sentence 1: โทษทีนะลูกของเราไม่เป็นอะไรใช่ไหม
-        Sentence 2: ถึงเจ้าจะลากข้าไปเจ้าก็ไม่ได้อะไรอยู่ดี
-        Sentence 3: ผมคิดว่าจะดีกว่านะถ้าคุณไม่ออกไปไหน
 Input: розігни і зігни, будь ласка. я знаю, ваши люди храбры. было приятно, правда? для начала, тебе нужен собственный свой самолет.
-        Sentence 0: розігни і зігни, будь ласка.
-        Sentence 1: я знаю, ваши люди храбры.
-        Sentence 2: было приятно, правда?
-        Sentence 3: для начала, тебе нужен собственный свой самолет.
 Input: szedłem tylko do. pamiętaj, nigdy się nie obawiaj żyć na krawędzi ryzyka. ćwiczę już od dwóch tygodni a byłem zabity tylko raz.
-        Sentence 0: szedłem tylko do.
-        Sentence 1: pamiętaj, nigdy się nie obawiaj żyć na krawędzi ryzyka.
-        Sentence 2: ćwiczę już od dwóch tygodni a byłem zabity tylko raz.
 ```
 # Model Architecture
 This is a data-driven approach to SBD. The model uses a `SentencePiece` tokenizer, a BERT-style encoder, and a linear classifier to predict which subwords are sentence boundaries.

 # Example Usage
+The easiest way to use this model is to install [punctuators](https://github.com/1-800-BAD-CODE/punctuators):
+```bash
+$ pip install punctuators
+```
+<details open>
+  <summary>Example Usage</summary>
 ```python
 from typing import List
+from punctuators.models import SBDModelONNX
+# Instantiate this model
+# This will download the ONNX and SPE models. To clean up, delete this model from your HF cache directory.
+m = SBDModelONNX.from_pretrained("sbd_multi_lang")
+input_texts: List[str] = [
+    # English (with a lot of acronyms)
+    "the new d.n.a. sample has been multiplexed, and the gametes are already dividing. let's get the c.p.d. over there. dinner's at 630 p.m. see that piece on you in the l.a. times? chicago p.d. will eat him alive.",
+    # Chinese
+    "魔鬼兵團都死了？但是如果这让你不快乐就别做了。您就不能发个电报吗。我們都準備好了。",
+    # Spanish
+    "él es uno de aquellos. ¿tiene algo de beber? cómo el aislamiento no vale la pena.",
+    # Thai
+    "พวกเขาต้องโกรธมากเลยใช่ไหม โทษทีนะลูกของเราไม่เป็นอะไรใช่ไหม ถึงเจ้าจะลากข้าไปเจ้าก็ไม่ได้อะไรอยู่ดี ผมคิดว่าจะดีกว่านะถ้าคุณไม่ออกไปไหน",
+    # Ukrainian
+    "розігни і зігни, будь ласка. я знаю, ваши люди храбры. было приятно, правда? для начала, тебе нужен собственный свой самолет.",
+    # Polish
+    "szedłem tylko do. pamiętaj, nigdy się nie obawiaj żyć na krawędzi ryzyka. ćwiczę już od dwóch tygodni a byłem zabity tylko raz.",
+]
+# Run inference
+results: List[List[str]] = m.infer(input_texts)
+# Print each input and its segmented outputs
+for input_text, output_texts in zip(input_texts, results):
+    print(f"Input: {input_text}")
+    print("Outputs:")
+    for text in output_texts:
+        print(f"\t{text}")
+    print()
 ```
+</details>
+<details open>
+  <summary>Expected outputs</summary>
 ```text
 Input: the new d.n.a. sample has been multiplexed, and the gametes are already dividing. let's get the c.p.d. over there. dinner's at 630 p.m. see that piece on you in the l.a. times? chicago p.d. will eat him alive.
+Outputs:
+	the new d.n.a. sample has been multiplexed, and the gametes are already dividing.
+	let's get the c.p.d. over there.
+	dinner's at 630 p.m.
+	see that piece on you in the l.a. times?
+	chicago p.d. will eat him alive.
 Input: 魔鬼兵團都死了？但是如果这让你不快乐就别做了。您就不能发个电报吗。我們都準備好了。
+Outputs:
+	魔鬼兵團都死了？
+	但是如果这让你不快乐就别做了。
+	您就不能发个电报吗。
+	我們都準備好了。
 Input: él es uno de aquellos. ¿tiene algo de beber? cómo el aislamiento no vale la pena.
+Outputs:
+	él es uno de aquellos.
+	¿tiene algo de beber?
+	cómo el aislamiento no vale la pena.
 Input: พวกเขาต้องโกรธมากเลยใช่ไหม โทษทีนะลูกของเราไม่เป็นอะไรใช่ไหม ถึงเจ้าจะลากข้าไปเจ้าก็ไม่ได้อะไรอยู่ดี ผมคิดว่าจะดีกว่านะถ้าคุณไม่ออกไปไหน
+Outputs:
+	พวกเขาต้องโกรธมากเลยใช่ไหม
+	โทษทีนะลูกของเราไม่เป็นอะไรใช่ไหม
+	ถึงเจ้าจะลากข้าไปเจ้าก็ไม่ได้อะไรอยู่ดี
+	ผมคิดว่าจะดีกว่านะถ้าคุณไม่ออกไปไหน
 Input: розігни і зігни, будь ласка. я знаю, ваши люди храбры. было приятно, правда? для начала, тебе нужен собственный свой самолет.
+Outputs:
+	розігни і зігни, будь ласка.
+	я знаю, ваши люди храбры.
+	было приятно, правда?
+	для начала, тебе нужен собственный свой самолет.
 Input: szedłem tylko do. pamiętaj, nigdy się nie obawiaj żyć na krawędzi ryzyka. ćwiczę już od dwóch tygodni a byłem zabity tylko raz.
+Outputs:
+	szedłem tylko do.
+	pamiętaj, nigdy się nie obawiaj żyć na krawędzi ryzyka.
+	ćwiczę już od dwóch tygodni a byłem zabity tylko raz.
 ```
+</details>
 # Model Architecture
 This is a data-driven approach to SBD. The model uses a `SentencePiece` tokenizer, a BERT-style encoder, and a linear classifier to predict which subwords are sentence boundaries.
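
Since the classifier produces one boundary probability per subword, turning predictions into sentences amounts to a thresholding pass. The following is a minimal sketch of that step only, assuming a threshold of 0.5 (the default used in the earlier version of the README's `run_infer`). The function name `split_at_boundaries` and the toy tokens and probabilities are hypothetical and made up for illustration; this is not the `punctuators` implementation, and real inputs would be SentencePiece subwords rather than whole words.

```python
from typing import List, Sequence


def split_at_boundaries(
    tokens: Sequence[str], probs: Sequence[float], threshold: float = 0.5
) -> List[List[str]]:
    """Group tokens into sentences, closing a sentence after every token
    whose boundary probability exceeds `threshold`."""
    if len(tokens) != len(probs):
        raise ValueError("tokens and probs must have the same length")
    sentences: List[List[str]] = []
    current: List[str] = []
    for token, prob in zip(tokens, probs):
        current.append(token)
        if prob > threshold:
            # This token is predicted to end a sentence.
            sentences.append(current)
            current = []
    # Tokens left over after the last predicted boundary form a final sentence.
    if current:
        sentences.append(current)
    return sentences


# Hypothetical tokens and probabilities, invented for this example.
tokens = ["hello", "world", ".", "how", "are", "you", "?", "fine"]
probs = [0.01, 0.02, 0.97, 0.03, 0.01, 0.05, 0.99, 0.40]
for sentence in split_at_boundaries(tokens, probs):
    print(" ".join(sentence))
```

Appending the leftover tokens as a final sentence mirrors the earlier example's choice to add the last token to the break points, so no input text is dropped when the model does not predict a boundary at the end.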