language:
- en
- de
- fr
- it
- nl
- multilingual
license: mit
tags:
- punctuation prediction
- punctuation
datasets:
- wmt/europarl
- SoNaR
metrics:
- f1
widget:
- text: Ho sentito che ti sei laureata il che mi fa molto piacere
  example_title: Italian
- text: Tous les matins vers quatre heures mon père ouvrait la porte de ma chambre
  example_title: French
- text: Ist das eine Frage Frau Müller
  example_title: German
- text: My name is Clara and I live in Berkeley California
  example_title: English
- text: >-
    hervatting van de zitting ik verklaar de zitting van het europees
    parlement die op vrijdag 17 december werd onderbroken te zijn hervat
  example_title: Dutch
This model predicts the punctuation of English, Italian, French, German and Dutch texts. We developed it to restore the punctuation of transcribed spoken language.
This multilingual model was trained on the Europarl dataset provided by the SEPP-NLG Shared Task; for Dutch, we additionally included the SoNaR dataset. Please note that the Europarl data consists of political speeches, so the model might perform differently on texts from other domains.
The model restores the following punctuation markers: "." "," "?" "-" ":"
Sample Code
We provide a simple Python package that allows you to process text of any length.
Install
To get started, install the package from PyPI:
pip install deepmultilingualpunctuation
Restore Punctuation
from deepmultilingualpunctuation import PunctuationModel
model = PunctuationModel(model="oliverguhr/fullstop-punctuation-multilingual-sonar-base")
text = "My name is Clara and I live in Berkeley California Ist das eine Frage Frau Müller"
result = model.restore_punctuation(text)
print(result)
output
My name is Clara and I live in Berkeley, California. Ist das eine Frage, Frau Müller?
Predict Labels
from deepmultilingualpunctuation import PunctuationModel
model = PunctuationModel(model="oliverguhr/fullstop-punctuation-multilingual-sonar-base")
text = "My name is Clara and I live in Berkeley California Ist das eine Frage Frau Müller"
clean_text = model.preprocess(text)
labeled_words = model.predict(clean_text)
print(labeled_words)
output
[['My', '0', 0.99998856], ['name', '0', 0.9999708], ['is', '0', 0.99975926], ['Clara', '0', 0.6117834], ['and', '0', 0.9999014], ['I', '0', 0.9999808], ['live', '0', 0.9999666], ['in', '0', 0.99990165], ['Berkeley', ',', 0.9941764], ['California', '.', 0.9952892], ['Ist', '0', 0.9999577], ['das', '0', 0.9999678], ['eine', '0', 0.99998224], ['Frage', ',', 0.9952265], ['Frau', '0', 0.99995995], ['Müller', '?', 0.972517]]
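Each entry in the predicted list is a `[word, label, score]` triple, where the label `'0'` means that no punctuation follows the word. As a minimal sketch of how these triples could be post-processed, the hypothetical helper below (`apply_labels` and its `min_score` parameter are illustrative and not part of the package) rebuilds the text and drops marks whose score falls below a threshold. It assumes the output format shown above and uses a shortened copy of that output as sample data.

```python
def apply_labels(labeled_words, min_score=0.0):
    """Join words, appending each predicted mark whose score is at least min_score.

    Hypothetical helper: assumes [word, label, score] triples as printed above.
    """
    tokens = []
    for word, label, score in labeled_words:
        # "0" means no punctuation after this word.
        if label != "0" and score >= min_score:
            tokens.append(word + label)
        else:
            tokens.append(word)
    return " ".join(tokens)


# Shortened copy of the prediction output shown above.
sample = [
    ["Ist", "0", 0.9999577],
    ["das", "0", 0.9999678],
    ["eine", "0", 0.99998224],
    ["Frage", ",", 0.9952265],
    ["Frau", "0", 0.99995995],
    ["Müller", "?", 0.972517],
]

print(apply_labels(sample))                  # Ist das eine Frage, Frau Müller?
print(apply_labels(sample, min_score=0.99))  # Ist das eine Frage, Frau Müller
```

Raising the threshold trades recall for precision: in the second call the question mark (score 0.972517) is dropped, while the comma (score 0.9952265) is kept.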
Results
Performance varies by punctuation marker: hyphens and colons are often optional and can be replaced by either a comma or a full stop, which makes them harder to predict. The model achieves the following F1 scores for the different languages (the label "0" denotes no punctuation after a word):
Label | English | German | French | Italian | Dutch |
---|---|---|---|---|---|
0 | 0.990 | 0.996 | 0.991 | 0.988 | 0.994 |
. | 0.924 | 0.951 | 0.921 | 0.917 | 0.959 |
? | 0.825 | 0.829 | 0.800 | 0.736 | 0.817 |
, | 0.798 | 0.937 | 0.811 | 0.778 | 0.813 |
: | 0.535 | 0.608 | 0.578 | 0.544 | 0.657 |
- | 0.345 | 0.384 | 0.353 | 0.344 | 0.464 |
macro average | 0.736 | 0.784 | 0.742 | 0.718 | 0.784 |
micro average | 0.975 | 0.987 | 0.977 | 0.972 | 0.983 |
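The macro average row can be reproduced from the per-label rows: it is the unweighted mean of the six per-label F1 scores. The sketch below recomputes it from the values copied out of the table; the variable and function names are illustrative only.

```python
# Per-label F1 scores copied from the table above, in row order:
# "0", ".", "?", ",", ":", "-"
f1_scores = {
    "English": [0.990, 0.924, 0.825, 0.798, 0.535, 0.345],
    "German":  [0.996, 0.951, 0.829, 0.937, 0.608, 0.384],
    "French":  [0.991, 0.921, 0.800, 0.811, 0.578, 0.353],
    "Italian": [0.988, 0.917, 0.736, 0.778, 0.544, 0.344],
    "Dutch":   [0.994, 0.959, 0.817, 0.813, 0.657, 0.464],
}


def macro_average(scores):
    """Unweighted mean: every label counts equally, however rare it is."""
    return sum(scores) / len(scores)


macro = {lang: round(macro_average(s), 3) for lang, s in f1_scores.items()}
for lang, value in macro.items():
    print(f"{lang}: {value:.3f}")
```

Because each label carries the same weight, the low scores for ":" and "-" pull the macro average well below the micro average, which counts every token equally and so reflects mostly the frequent "0" label.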
Languages
Models
Languages | Model |
---|---|
English, Italian, French and German | oliverguhr/fullstop-punctuation-multilang-large |
English, Italian, French, German and Dutch | oliverguhr/fullstop-punctuation-multilingual-sonar-base |
Dutch | oliverguhr/fullstop-dutch-sonar-punctuation-prediction |
Community Models
Languages | Model |
---|---|
English, German, French, Spanish, Bulgarian, Italian, Polish, Dutch, Czech, Portuguese, Slovak, Slovenian | kredor/punctuate-all |
Catalan | softcatala/fullstop-catalan-punctuation-prediction |
You can use different models by setting the model parameter:
model = PunctuationModel(model = "oliverguhr/fullstop-dutch-punctuation-prediction")
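If you switch between models programmatically, a small lookup table can help. The sketch below is a hypothetical helper, not part of the package: the model names are copied from the first table above, the language codes follow the card metadata, and `models_for` and `load_model` are illustrative names. The package import is deferred so that the lookup also works where the package is not installed.

```python
# Model names copied from the "Models" table; the helpers are hypothetical.
MODELS = {
    "oliverguhr/fullstop-punctuation-multilang-large": {"en", "it", "fr", "de"},
    "oliverguhr/fullstop-punctuation-multilingual-sonar-base": {"en", "it", "fr", "de", "nl"},
    "oliverguhr/fullstop-dutch-sonar-punctuation-prediction": {"nl"},
}


def models_for(language):
    """Return the listed model names that cover the given language code."""
    return [name for name, langs in MODELS.items() if language in langs]


def load_model(name):
    """Load a model by name using the `model` parameter shown above."""
    # Imported lazily so models_for() works without the package installed.
    from deepmultilingualpunctuation import PunctuationModel
    return PunctuationModel(model=name)


print(models_for("nl"))
```

For Dutch, this prints the multilingual SoNaR model and the Dutch-only model; pass either name to `load_model` to get a `PunctuationModel` instance.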