shibing624/mengzi-t5-base-chinese-correction

shibing624 committed on Jun 17, 2022

Commit 158150b · 1 Parent(s): 462acb6

Update README.md

Files changed (1)

README.md +96 -1

README.md CHANGED

@@ -1,3 +1,98 @@
 ---
-license: apache-2.0
 ---

 ---
+language:
+- zh
+tags:
+- t5
+- pytorch
+- zh
+license: "apache-2.0"
 ---
+# T5 for Chinese Spelling Correction Model
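+A Chinese spelling correction (CSC) model.
+Evaluation of `shibing624/mengzi-t5-base-chinese-correction` on the SIGHAN2015 test set:
+- Sentence Level: precision:0.8321, recall:0.6390, f1:0.7229
+Because the training data includes the SIGHAN2015 training set (to reproduce the paper), the model reaches near-SOTA performance on the SIGHAN2015 test set.
+The model architecture is unchanged; it is only fine-tuned on a Chinese correction dataset. Evaluation shows strong correction performance, and the model has considerable potential.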
+## Usage
+This model is open-sourced as part of the Chinese text correction project [pycorrector](https://github.com/shibing624/pycorrector), which supports T5 models. It can be called as follows:
+```python
+from pycorrector.t5.t5_corrector import T5Corrector
+nlp = T5Corrector("shibing624/mengzi-t5-base-chinese-correction").batch_t5_correct
+i = "今天新情很好"
+print(i, ' => ', nlp([i]))
+```
+output:
+```shell
+今天新情很好  =>  今天心情很好 [('新', '心', 2, 3)]
+```
+Model files:
+```
+mengzi-t5-base-chinese-correction
+|-- config.json
+|-- pytorch_model.bin
+|-- special_tokens_map.json
+|-- spiece.model
+|-- tokenizer_config.json
+`-- tokenizer.json
+```
+### Training datasets
+#### SIGHAN+Wang271K Chinese correction dataset
+| Dataset | Corpus | Download link | Archive size |
+| :------- | :--------- | :---------: | :---------: |
+| **`SIGHAN+Wang271K Chinese correction dataset`** | SIGHAN+Wang271K (270K samples) | [Baidu Netdisk (password: 01b9)](https://pan.baidu.com/s/1BV5tr9eONZCI0wERFvr0gQ)| 106M |
+| **`Original SIGHAN dataset`** | SIGHAN13 14 15 | [official csc.html](http://nlp.ee.ncu.edu.tw/resource/csc.html)| 339K |
+| **`Original Wang271K dataset`** | Wang271K | [Automatic-Corpus-Generation, provided by dimmywang](https://github.com/wdimmy/Automatic-Corpus-Generation/blob/master/corpus/train.sgml)| 93M |
+Data format of the SIGHAN+Wang271K Chinese correction dataset:
+```json
+[
+    {
+        "id": "B2-4029-3",
+        "original_text": "晚间会听到嗓音，白天的时候大家都不会太在意，但是在睡觉的时候这嗓音成为大家的恶梦。",
+        "wrong_ids": [
+            5,
+            31
+        ],
+        "correct_text": "晚间会听到噪音，白天的时候大家都不会太在意，但是在睡觉的时候这噪音成为大家的恶梦。"
+    }
+]
+```
+```shell
+macbert4csc
+    ├── config.json
+    ├── pytorch_model.bin
+    ├── special_tokens_map.json
+    ├── tokenizer_config.json
+    └── vocab.txt
+```
+To train t5-correction, see [https://github.com/shibing624/pycorrector/tree/master/pycorrector/t5](https://github.com/shibing624/pycorrector/tree/master/pycorrector/t5).
+## Citation
+```latex
+@software{pycorrector,
+  author = {Xu Ming},
+  title = {pycorrector: Text Error Correction Tool},
+  year = {2021},
+  url = {https://github.com/shibing624/pycorrector},
+}
+```
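
A note on the README in this diff: the `wrong_ids` field in the dataset sample and the `(wrong, right, start, end)` tuples in the usage output both appear to describe character-level substitutions by position. The sketch below shows one way such positions could be derived from an original/corrected pair. `find_substitutions` is a hypothetical helper written for illustration, not part of pycorrector, and it assumes the two strings have equal length (substitution-only errors, no insertions or deletions).

```python
from typing import List, Tuple


def find_substitutions(original: str, corrected: str) -> List[Tuple[str, str, int, int]]:
    """Return (wrong_char, right_char, start, end) for each differing position.

    Hypothetical helper for illustration only. It assumes both strings have
    the same length, i.e. every error is a single-character substitution.
    """
    if len(original) != len(corrected):
        raise ValueError("texts must have equal length for a substitution-only comparison")
    return [
        (wrong, right, idx, idx + 1)
        for idx, (wrong, right) in enumerate(zip(original, corrected))
        if wrong != right
    ]


# Sentence pair taken from the usage example in the README.
edits = find_substitutions("今天新情很好", "今天心情很好")
# Zero-based start offsets, in the same spirit as the dataset's `wrong_ids`.
wrong_ids = [start for _, _, start, _ in edits]
print(edits)      # [('新', '心', 2, 3)]
print(wrong_ids)  # [2]
```

Applied to the dataset sample shown in the README, the same comparison yields the offsets 5 and 31 listed in its `wrong_ids`, which suggests those ids are zero-based character indices into `original_text`.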