PleIAs/OCRerrcr

Token Classification

	---
	license: apache-2.0
	language:
	- en
	- fr
	- de
	---
	OCRerrcr is a small language model specialized in detecting OCR errors.

	OCRerrcr was trained by Eliot Jones for PleIAs on a sample of 1000 documents with labelled OCR errors from open data documents (Finance Commons) and cultural heritage sources (Common Corpus).

	To date, OCRerrcr provides the most accurate agnostic OCR error rate estimate. PleIAs has also developed an alternative pipeline for this task, [OCRoscope](https://github.com/Pleias/OCRoscope), which scales significantly better but is also significantly less accurate, especially for documents with few mistakes.

	The name OCRerrcr (instead of OCRerror) is a playful allusion to a common OCR misreading.

	## Example

	The following is a low-error example sentence taken from Common Corpus:

	> They did not approach cer, but turned away and passed irom her presence, filled with sorrow and moved with sympathy,
	> which her intense emotions seemed to communicate to even these thoughtless young men of the tho plains.

	OCRerrcr flags the following spans (formatting added for clarity):

	> They did not approach \<er\>cer,\</er\> but turned away and passed \<er\>irom\</er\> her presence, filled with sorrow and moved with sympathy,
	> which her intense emotions seemed to communicate to even these thoughtless young men of the \<er\>tho\</er\> plains.
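
To illustrate how a tagged output like the one above could be turned into an error rate estimate, here is a minimal Python sketch. It assumes the flagged spans are wrapped in `<er>…</er>` exactly as in the formatted example; that markup is only the presentation used in this card, not necessarily the model's raw output. The helper names `flagged_spans` and `error_rate` are hypothetical, and the word-level ratio is one simple way to define the rate, not necessarily the one used by PleIAs.

```python
import re

# Assumption: flagged spans are wrapped in <er>...</er>, as in the formatted example.
ER_SPAN = re.compile(r"<er>(.*?)</er>", re.DOTALL)


def flagged_spans(tagged: str) -> list[str]:
    """Return the text of every span marked as an OCR error."""
    return ER_SPAN.findall(tagged)


def error_rate(tagged: str) -> float:
    """Share of whitespace-separated words that fall inside a flagged span."""
    flagged = sum(len(span.split()) for span in flagged_spans(tagged))
    plain = re.sub(r"</?er>", "", tagged)  # drop the tags, keep the text
    total = len(plain.split())
    return flagged / total if total else 0.0


tagged = (
    "They did not approach <er>cer,</er> but turned away and passed <er>irom</er> "
    "her presence, filled with sorrow and moved with sympathy, which her intense "
    "emotions seemed to communicate to even these thoughtless young men of the "
    "<er>tho</er> plains."
)

print(flagged_spans(tagged))
print(f"{error_rate(tagged):.3f}")
```

A word-level ratio is easy to compare across documents of different lengths; a character-level ratio would be an equally reasonable choice if finer granularity is needed.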