# Model Details

##### Model Name: NumericBERT

##### Model Type: Transformer

##### Architecture: BERT

##### Training Method: Masked Language Modeling (MLM)

##### Training Data: MIMIC-IV lab values data

##### Training Hyperparameters:

- Optimizer: AdamW
- Learning Rate: 5e-5
- Masking Rate: 20%

##### Tokenization:

- Tokenizer: Custom numeric-to-text mapping using the TextEncoder class (a configuration sketch follows below)
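
These settings correspond to a standard masked-language-modeling setup. The snippet below is a minimal sketch using PyTorch and the Hugging Face `transformers` library; aside from the optimizer, learning rate, and masking rate stated above, everything in it (including training BERT from scratch with a default configuration) is an assumption, not the authors' training script.

```python
# Minimal sketch of the stated training configuration (assumed setup, not the
# authors' script): AdamW optimizer, learning rate 5e-5, 20% masking for MLM.
import torch
from transformers import BertConfig, BertForMaskedLM

model = BertForMaskedLM(BertConfig())                       # BERT architecture, trained from scratch
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)  # AdamW with learning rate 5e-5
MASKING_RATE = 0.20                                         # fraction of tokens masked during MLM training
```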

### Text Encoding Process

The encoder converts non-negative integers into uppercase letter-based representations, allowing numerical values to be expressed as sequences of letters.

Each lab value is first scaled and then converted into its corresponding letter sequence according to this predefined mapping.

Finally, the corresponding lab ID is attached to each encoded value for the specified columns ('Bic', 'Crt', 'Pot', 'Sod', 'Ure', 'Hgb', 'Plt', 'Wbc').
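
The sketch below illustrates what such an encoder could look like. The base-26 letter mapping, the scale factor of 100, and the way the lab ID is prefixed to each encoded value are assumptions made for this example; the actual TextEncoder implementation may differ.

```python
# Illustrative numeric-to-text encoder in the spirit of the TextEncoder class.
# The base-26 mapping, scale factor, and "<lab>_<letters>" token format are
# assumptions for this example, not the model's actual implementation.
LAB_COLUMNS = ['Bic', 'Crt', 'Pot', 'Sod', 'Ure', 'Hgb', 'Plt', 'Wbc']

def int_to_letters(n: int) -> str:
    """Convert a non-negative integer to an uppercase letter sequence (base-26, A = 0)."""
    if n == 0:
        return 'A'
    letters = []
    while n > 0:
        n, rem = divmod(n, 26)
        letters.append(chr(ord('A') + rem))
    return ''.join(reversed(letters))

def encode_value(value: float, scale: int = 100) -> str:
    """Scale a numeric lab value to a non-negative integer, then convert it to letters."""
    return int_to_letters(int(round(value * scale)))

def encode_row(row) -> str:
    """Attach each lab ID to its encoded value, producing one text token per column."""
    return ' '.join(f"{lab}_{encode_value(row[lab])}" for lab in LAB_COLUMNS)

# Example: encode one set of lab values into a text sequence.
print(encode_row({'Bic': 24.0, 'Crt': 1.1, 'Pot': 4.2, 'Sod': 140.0,
                  'Ure': 18.0, 'Hgb': 13.5, 'Plt': 250.0, 'Wbc': 7.8}))
```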

### Training Data Preprocessing

- Column Selection: numerical values are taken from the following lab columns: 'Bic', 'Crt', 'Pot', 'Sod', 'Ure', 'Hgb', 'Plt', 'Wbc'.
- Text Encoding: the numeric values are encoded into text using the mapping described above (see the pipeline sketch after this list).
- Masking: 20% of the data is randomly masked during training.
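
The sketch below puts these steps together, assuming a pandas DataFrame of lab values and the illustrative `encode_row` helper from the previous example; the `[MASK]` placeholder and the `preprocess` function are assumptions for this example, not the authors' preprocessing code.

```python
# Sketch of the preprocessing steps: column selection, text encoding, and 20% random
# masking. Relies on the illustrative encode_row() helper defined above; the [MASK]
# placeholder and the function name are assumptions for this example.
import random
import pandas as pd

LAB_COLUMNS = ['Bic', 'Crt', 'Pot', 'Sod', 'Ure', 'Hgb', 'Plt', 'Wbc']

def preprocess(df: pd.DataFrame, mask_rate: float = 0.20) -> list:
    """Select the lab columns, encode each row to text, and randomly mask ~20% of tokens."""
    texts = []
    for _, row in df[LAB_COLUMNS].iterrows():   # column selection
        tokens = encode_row(row).split()        # text encoding
        masked = ['[MASK]' if random.random() < mask_rate else tok for tok in tokens]
        texts.append(' '.join(masked))
    return texts
```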

### Model Output

During training, the model outputs predictions for the masked values.

The output is returned in the same encoded-text representation described above.
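
For illustration, the snippet below shows one way such predictions could be read out with a masked-language-model head from the `transformers` library. The `model` and `tokenizer` objects are placeholders (the card does not publish a loading API), and the predicted tokens would still need to be mapped back to numeric values by inverting the text encoding.

```python
# Illustrative read-out of predictions at masked positions. `model` and `tokenizer`
# are placeholders: a BertForMaskedLM and a compatible tokenizer are assumed; this
# is not a published API for NumericBERT.
import torch

def predict_masked_tokens(model, tokenizer, masked_text: str) -> list:
    """Return the predicted encoded-text tokens at each masked position."""
    inputs = tokenizer(masked_text, return_tensors='pt')
    with torch.no_grad():
        logits = model(**inputs).logits                          # (1, seq_len, vocab_size)
    mask_positions = (inputs['input_ids'][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    predicted_ids = logits[0, mask_positions].argmax(dim=-1)
    return tokenizer.convert_ids_to_tokens(predicted_ids.tolist())
```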

### Limitations and Considerations

Numeric Data Representation: the model relies on a custom text representation of numeric data, which may not capture all of the complex patterns present in the original numeric values.

Training Data Source: the model is trained on MIMIC-IV numeric data, so its performance may reflect the characteristics and biases present in that dataset.

### Contact Information

For inquiries or additional information, please contact:

David Restrepo

[email protected]

MIT Critical Data

---
license: mit
---