 ---
 license: cc0-1.0
+task_categories:
+- token-classification
+language:
+- en
+tags:
+- named-entity-recognition
+- ner
+- scientific
+- unit-conversion
+- units
+- measurement
+- natural-language-understanding
+- automatic-annotations
 ---
+# DistilBERT Token Classification Model for Unit Conversion
+### Model Overview
+This model is a fine-tuned version of `distilbert/distilbert-base-uncased` for token classification on unit conversion-related text. It is designed to recognize unit values and conversion entities, facilitating automatic extraction of unit-related data.
+### Dataset
+The model is trained on the `maliknaik/natural_unit_conversion` dataset, which contains:
+- **Training set**: 583,863 examples
+- **Validation set**: 100,091 examples
+- **Test set**: 150,137 examples
+Each example consists of:
+- **text**: The input sentence containing unit-related phrases.
+- **entities**: The labeled entities specifying unit values and types.
+Dataset URL: [https://huggingface.co/datasets/maliknaik/natural_unit_conversion](https://huggingface.co/datasets/maliknaik/natural_unit_conversion)
+### Labels
+The model classifies tokens into the following categories:
+- `B-FROM_UNIT`: Beginning of the source unit
+- `I-FROM_UNIT`: Inside the source unit
+- `B-TO_UNIT`: Beginning of the target unit
+- `I-TO_UNIT`: Inside the target unit
+- `B-FEET_VALUE`: Beginning of feet value
+- `I-FEET_VALUE`: Inside feet value
+- `B-INCH_VALUE`: Beginning of inch value
+- `I-INCH_VALUE`: Inside inch value
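The eight tags above follow the BIO scheme. The sketch below shows how such a label set could be turned into the `id2label` / `label2id` mappings that token classification models in Transformers expect. The `O` (outside) tag and the index order are assumptions made for illustration; the mapping actually used by this model is stored in its `config.json`.

```python
# Entity types taken from the label list above.
ENTITY_TYPES = ["FROM_UNIT", "TO_UNIT", "FEET_VALUE", "INCH_VALUE"]


def build_label_maps(entity_types):
    """Build BIO label <-> id mappings (assumed order: O first, then B-/I- pairs)."""
    labels = ["O"]  # assumption: an "outside" tag for non-entity tokens
    for entity in entity_types:
        labels.extend([f"B-{entity}", f"I-{entity}"])
    id2label = dict(enumerate(labels))
    label2id = {label: idx for idx, label in id2label.items()}
    return id2label, label2id


id2label, label2id = build_label_maps(ENTITY_TYPES)
print(len(id2label))          # number of labels, including the assumed "O"
print(label2id["B-TO_UNIT"])  # index under the assumed ordering
```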
+### Training Details
+- **Base Model**: `distilbert/distilbert-base-uncased`
+- **Tokenization**: `AutoTokenizer` from Hugging Face Transformers
+- **Training Framework**: Hugging Face `Trainer`
+- **Data Collator**: `DataCollatorForTokenClassification`
+- **Loss Function**: CrossEntropyLoss
+- **Batch Size**: 64
+- **Epochs**: 10
+- **GPU**: 1x NVIDIA Tesla P4 (8GB GDDR5)
+- **CPU**: 56 vCPUs
+- **RAM**: 283GB
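`DataCollatorForTokenClassification` pads label sequences with `-100` by default, which is also the default `ignore_index` of PyTorch's `CrossEntropyLoss`. The same value is commonly used to mask special tokens and subword continuations when word-level labels are aligned to tokens. The function below is a sketch of that alignment step, not the author's preprocessing code: `align_labels` is a hypothetical helper, and the `word_ids` list mimics what a fast tokenizer's `word_ids()` returns.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def align_labels(word_ids, word_labels):
    """Map word-level label ids onto tokens.

    word_ids: per-token word index, or None for special tokens.
    word_labels: one label id per word.
    Only the first subword of each word keeps its label (an assumed convention).
    """
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)  # special tokens such as [CLS] and [SEP]
        elif word_id != previous_word:
            aligned.append(word_labels[word_id])  # first subword of a word
        else:
            aligned.append(IGNORE_INDEX)  # later subwords are masked
        previous_word = word_id
    return aligned


# Hypothetical example: three words, the last one split into two subwords.
word_ids = [None, 0, 1, 2, 2, None]
word_labels = [0, 0, 1]
print(align_labels(word_ids, word_labels))
```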
+### Usage
+To use this model for inference:
+```python
+from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
+
+model_name = 'maliknaik/distilbert-natural-unit-conversion'
+model = AutoModelForTokenClassification.from_pretrained(model_name)
+# The tokenizer is loaded from the base checkpoint the model was fine-tuned from.
+tokenizer = AutoTokenizer.from_pretrained('distilbert/distilbert-base-uncased')
+
+text = 'How many miles are there in 50 kilometers?'
+# aggregation_strategy='simple' merges B-/I- tokens into the `entity_group` spans shown below.
+unit_pipeline = pipeline('ner', model=model, tokenizer=tokenizer, aggregation_strategy='simple')
+print(unit_pipeline(text))
+```
+Output:
+```python
+[{'entity_group': 'TO_UNIT',
+  'score': np.float32(0.9999982),
+  'word': 'miles',
+  'start': 9,
+  'end': 14},
+ {'entity_group': 'FROM_UNIT',
+  'score': np.float32(0.9999473),
+  'word': 'kilometers',
+  'start': 31,
+  'end': 41}]
+```
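Downstream code typically needs the source and target units as plain strings. The sketch below shows one way to reduce a result list shaped like the output above to a dictionary keyed by entity group. `collect_units` is a hypothetical helper, and keeping only the highest-scoring span per group is an assumption, not part of the model.

```python
def collect_units(entities):
    """Keep the highest-scoring span for each entity group."""
    best = {}
    for entity in entities:
        group = entity["entity_group"]
        if group not in best or entity["score"] > best[group]["score"]:
            best[group] = entity
    return {group: entity["word"] for group, entity in best.items()}


# Sample input copied from the output above (scores written as plain floats).
sample = [
    {"entity_group": "TO_UNIT", "score": 0.9999982, "word": "miles", "start": 9, "end": 14},
    {"entity_group": "FROM_UNIT", "score": 0.9999473, "word": "kilometers", "start": 31, "end": 41},
]
print(collect_units(sample))
```

Character offsets (`start`, `end`) are left untouched here, so the original spans can still be sliced out of the input text if the lowercased `word` is not enough.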
+### Performance
+The model achieves a high F1 score when identifying unit values and conversion entities. The F1 scores on the validation
+and test sets are expected to improve with further optimization.
+### Intended Use
+This model can be used for named entity recognition (NER), especially for tasks related to unit
+conversion and natural language understanding.
+### License
+This model is available under the CC0-1.0 license. It is free to use for any purpose without any restrictions.
+### Contributions
+Developed by [Malik N. Mohammed](https://maliknaik.me/), leveraging **DistilBERT** for efficient NLP token classification.
+### Citation
+If you use this model in your work, please cite it as follows:
+```
+@misc{unit-conversion-dataset,
+  author = {Malik N. Mohammed},
+  title = {Natural Language Unit Conversion Model for Named-Entity Recognition},
+  year = {2025},
+  publisher = {HuggingFace},
+  journal = {HuggingFace repository},
+  howpublished = {\url{https://huggingface.co/maliknaik/distilbert-natural-unit-conversion/}}
+}
+```