---
license: cc0-1.0
task_categories:
- token-classification
language:
- en
tags:
- named-entity-recognition
- ner
- scientific
- unit-conversion
- units
- measurement
- natural-language-understanding
- automatic-annotations
---

# DistilBERT Token Classification Model for Unit Conversion

### Model Overview

This model is a fine-tuned version of `distilbert/distilbert-base-uncased` for token classification on unit-conversion text. It is designed to recognize unit values and conversion entities, facilitating automatic extraction of unit-related data.

### Dataset

The model is trained on the `maliknaik/natural_unit_conversion` dataset, which contains:

- **Training set**: 583,863 examples
- **Validation set**: 100,091 examples
- **Test set**: 150,137 examples

Each example consists of:

- **text**: The input sentence containing unit-related phrases.
- **entities**: The labeled entities specifying unit values and types.

Dataset URL: [https://huggingface.co/datasets/maliknaik/natural_unit_conversion](https://huggingface.co/datasets/maliknaik/natural_unit_conversion)
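
For illustration, a single record might look like the following. This is a hypothetical sketch shaped after the schema above — the sentence, offsets, and exact entity encoding are invented and may differ from the actual dataset:

```python
# Hypothetical record following the text/entities schema described above.
# The entity encoding in the real dataset may differ.
example = {
    'text': 'Convert 50 kilometers to miles',
    'entities': [
        {'start': 11, 'end': 21, 'label': 'FROM_UNIT'},  # 'kilometers'
        {'start': 25, 'end': 30, 'label': 'TO_UNIT'},    # 'miles'
    ],
}

# Character offsets slice the labeled surface forms out of the text
for ent in example['entities']:
    print(ent['label'], '->', example['text'][ent['start']:ent['end']])
```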
38
+
39
+ ### Labels
40
+
41
+ The model classifies tokens into the following categories:
42
+
43
+ - `B-FROM_UNIT`: Beginning of the source unit
44
+ - `I-FROM_UNIT`: Inside the source unit
45
+ - `B-TO_UNIT`: Beginning of the target unit
46
+ - `I-TO_UNIT`: Inside the target unit
47
+ - `B-FEET_VALUE`: Beginning of feet value
48
+ - `I-FEET_VALUE`: Inside feet value
49
+ - `B-INCH_VALUE`: Beginning of inch value
50
+ - `I-INCH_VALUE`: Inside inch value
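
As a plain-Python sketch of how these BIO tags group into entity spans (independent of the model; the token/tag sequence below is an illustrative example):

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    spans, current_type, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith('B-'):
            # A B- tag always starts a fresh span, closing any open one
            if current_type is not None:
                spans.append((current_type, ' '.join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith('I-') and current_type == tag[2:]:
            current_tokens.append(token)  # continue the open span
        else:
            # 'O' (or an inconsistent I- tag) ends the current span
            if current_type is not None:
                spans.append((current_type, ' '.join(current_tokens)))
            current_type, current_tokens = None, []
    if current_type is not None:
        spans.append((current_type, ' '.join(current_tokens)))
    return spans

tokens = ['How', 'many', 'miles', 'are', 'there', 'in', '50', 'kilometers', '?']
tags   = ['O', 'O', 'B-TO_UNIT', 'O', 'O', 'O', 'O', 'B-FROM_UNIT', 'O']
print(bio_to_spans(tokens, tags))
# [('TO_UNIT', 'miles'), ('FROM_UNIT', 'kilometers')]
```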

### Training Details

- **Base Model**: `distilbert/distilbert-base-uncased`
- **Tokenization**: `AutoTokenizer` from Hugging Face Transformers
- **Training Framework**: Hugging Face `Trainer`
- **Data Collator**: `DataCollatorForTokenClassification`
- **Loss Function**: CrossEntropyLoss
- **Batch Size**: 64
- **Epochs**: 10
- **GPU**: 1x NVIDIA Tesla P4 (8GB GDDR5)
- **CPU**: 56 vCPUs
- **RAM**: 283GB
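
Training a token classifier on this setup requires aligning word-level labels to subword tokens so that CrossEntropyLoss only scores the first subword of each word. A minimal sketch of that alignment step (pure Python; the `word_ids` list mimics what a fast Hugging Face tokenizer returns, but the values here are illustrative):

```python
def align_labels(word_labels, word_ids, ignore_index=-100):
    """Map word-level label ids onto subword tokens.

    Special tokens (word_id None) and non-first subwords receive
    ignore_index so CrossEntropyLoss skips them.
    """
    labels, previous = [], None
    for word_id in word_ids:
        if word_id is None:
            labels.append(ignore_index)      # e.g. [CLS] / [SEP]
        elif word_id != previous:
            labels.append(word_labels[word_id])  # first subword of a word
        else:
            labels.append(ignore_index)      # continuation subword
        previous = word_id
    return labels

# Hypothetical mapping: word 2 split into two subwords, CLS/SEP map to None
word_ids = [None, 0, 1, 2, 2, None]
word_labels = [0, 0, 5]  # e.g. O, O, B-FROM_UNIT as label ids
print(align_labels(word_labels, word_ids))
# [-100, 0, 0, 5, -100, -100]
```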


### Usage

To use this model for inference:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = 'maliknaik/distilbert-natural-unit-conversion'

model = AutoModelForTokenClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained('distilbert/distilbert-base-uncased')

text = 'How many miles are there in 50 kilometers?'

# aggregation_strategy='simple' merges subword tokens into whole entities,
# producing the 'entity_group' keys shown in the output below
unit_pipeline = pipeline('ner', model=model, tokenizer=tokenizer, aggregation_strategy='simple')
print(unit_pipeline(text))
```

Output:

```python
[{'entity_group': 'TO_UNIT',
  'score': np.float32(0.9999982),
  'word': 'miles',
  'start': 9,
  'end': 14},
 {'entity_group': 'FROM_UNIT',
  'score': np.float32(0.9999473),
  'word': 'kilometers',
  'start': 31,
  'end': 41}]
```
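
The `start`/`end` fields are character offsets into the input string, so the matched spans can be recovered by slicing (using the offsets from the output above):

```python
text = 'How many miles are there in 50 kilometers?'

# Offsets taken from the pipeline output above
entities = [
    {'entity_group': 'TO_UNIT', 'start': 9, 'end': 14},
    {'entity_group': 'FROM_UNIT', 'start': 31, 'end': 41},
]

for ent in entities:
    print(ent['entity_group'], '->', text[ent['start']:ent['end']])
# TO_UNIT -> miles
# FROM_UNIT -> kilometers
```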

### Performance

The model achieves a high F1 score in identifying unit values and conversion entities. The F1 scores on the validation and test sets are expected to improve with further optimization.
101
+
102
+
103
+ ### Usage
104
+ This dataset can be used for training named entity recognition (NER) models, especially for tasks related to unit
105
+ conversion and natural language understanding.

### License

This model is available under the CC0-1.0 license. It is free to use for any purpose without restrictions.

### Contributions

Developed by [Malik N. Mohammed](https://maliknaik.me/), leveraging **DistilBERT** for efficient NLP token classification.

### Citation

If you use this model in your work, please cite it as follows:

```bibtex
@misc{unit-conversion-dataset,
  author = {Malik N. Mohammed},
  title = {Natural Language Unit Conversion Model for Named-Entity Recognition},
  year = {2025},
  publisher = {HuggingFace},
  journal = {HuggingFace repository},
  howpublished = {\url{https://huggingface.co/maliknaik/distilbert-natural-unit-conversion/}}
}
```