---
license: apache-2.0
language:
- en
- fr
- de
---

**OCRerrcr** is a small language model specialized in detecting OCR errors.

OCRerrcr was trained by Eliot Jones for PleIAs on a sample of 1,000 documents with labelled OCR errors, drawn from open data documents (Finance Commons) and cultural heritage sources (Common Corpus).

To date, OCRerrcr provides the most accurate agnostic OCR error rate estimate. PleIAs has also developed an alternative pipeline for this task, [OCRoscope](https://github.com/Pleias/OCRoscope), which scales significantly better but is also significantly less accurate, especially for documents with fewer mistakes.

The name OCRerrcr (instead of OCRerror) is a playful allusion to a common OCR misreading.

## Example

The following is a low-error example sentence taken from Common Corpus:

> They did not approach cer, but turned away and passed irom her presence, filled with sorrow and moved with sympathy,
> which her intense emotions seemed to communicate to even these thoughtless young men of the tho plains.

And the OCRerrcr detection (with formatting for clarity):

> They did not approach \<er\>cer,\</er\> but turned away and passed \<er\>irom\</er\> her presence, filled with sorrow and moved with sympathy,
> which her intense emotions seemed to communicate to even these thoughtless young men of the \<er\>tho\</er\> plains.
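
As a rough illustration of how a tagged detection like the one in the example could be turned into an error rate estimate, here is a minimal sketch. It assumes the detection is available as plain text with `<er>`…`</er>` markers; the example was formatted for clarity, so the model's raw output may look different. The helper name `estimate_error_rate` and the word-level ratio are illustrative assumptions, not part of OCRerrcr itself.

```python
import re

# Assumption: errors are wrapped in <er>...</er> markers, as in the formatted
# example. The raw model output may use a different convention.
ER_SPAN = re.compile(r"<er>(.*?)</er>", re.DOTALL)


def estimate_error_rate(tagged_text: str) -> float:
    """Return the share of whitespace-separated words flagged as OCR errors."""
    # Count the words that sit inside <er>...</er> spans.
    flagged_words = sum(len(span.split()) for span in ER_SPAN.findall(tagged_text))
    # Strip the markers (keeping their content) to count all words in the text.
    plain_text = ER_SPAN.sub(lambda m: m.group(1), tagged_text)
    total_words = len(plain_text.split())
    return flagged_words / total_words if total_words else 0.0


tagged = (
    "They did not approach <er>cer,</er> but turned away and passed "
    "<er>irom</er> her presence, filled with sorrow and moved with sympathy, "
    "which her intense emotions seemed to communicate to even these "
    "thoughtless young men of the <er>tho</er> plains."
)
rate = estimate_error_rate(tagged)
print(f"{rate:.1%} of words flagged")
```

A word-level ratio is only one possible definition; a character-level ratio would be an equally reasonable choice, depending on how the estimate is meant to be used.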