---
tags:
- PretrainModel
- TCM
- transformer
- herberta
- text-embedding
license: apache-2.0
language:
- zh
- en
metrics:
- accuracy
base_model:
- hfl/chinese-roberta-wwm-ext-large
new_version: XiaoEnn/herberta_seq_512_V2
inference: true
library_name: transformers
---
# Herberta: A Pretrained Model for TCM Herbal Medicine and Downstream Tasks
## Introduction
Herberta is a pretrained model developed by the Angelpro Team to advance representation learning and modeling for Traditional Chinese Medicine (TCM). Built on **chinese-roberta-wwm-ext-large**, Herberta is further pretrained with a masked language modeling (MLM) objective on **700 ancient books (538.95M)** and **48 modern Chinese medicine textbooks (54M)**, yielding a robust model for embedding generation and TCM-specific downstream tasks.
We named the model "Herberta" by combining "Herb" and "Roberta" to signify its purpose in herbal medicine research. Herberta is ideal for applications such as:
- **Encoder for Herbal Formulas**: Generating meaningful embeddings for TCM formulations.
- **Domain-Specific Word Embedding**: Serving the Chinese medicine text domain.
- **Support for TCM Downstream Tasks**: Including classification, labeling, and more.
---
## Pretraining Experiments
### Dataset
| Data Type | Quantity | Data Size |
|------------------------|-------------|------------------|
| **Ancient TCM Books** | 700 books | ~538.95M |
| **Modern TCM Textbooks** | 48 books | ~54M |
| **Mixed-Type Dataset** | Combined dataset | ~637.8M |
### Pretraining Results
| Model | eval_accuracy | Loss/epoch_valid | Perplexity_valid |
|-----------------------|---------------|------------------|------------------|
| **herberta_seq_512_v2** | 0.9841        | 0.04367          | 1.083            |
| **herberta_seq_128_v2** | 0.9406        | 0.2877           | 1.333            |
| **herberta_seq_512_v3** | 0.755         | 1.100            | 3.010            |
#### Metrics Comparison



### Pretraining Configuration
#### Ancient Books
- Pretraining Strategy: BERT-style MASK (15% tokens masked)
- Sequence Length: 512
- Batch Size: 32
- Learning Rate: `1e-5` with an epoch-based decay (`epoch * 0.1`)
- Tokenization: Sentence-based tokenization with padding for sequences <512 tokens.
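
For reference, the sketch below shows how this MLM pretraining setup could be reproduced with the Hugging Face `Trainer`. The corpus file `tcm_corpus.txt`, the output directory, and the epoch count are placeholders, and the exact epoch-based learning-rate decay used in our runs is not reproduced here.

```python
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

# Base checkpoint named in this card.
base_model = "hfl/chinese-roberta-wwm-ext-large"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForMaskedLM.from_pretrained(base_model)

# Placeholder corpus: one TCM sentence/passage per line.
dataset = load_dataset("text", data_files={"train": "tcm_corpus.txt"})

def tokenize(batch):
    # Pad/truncate to the 512-token sequence length used for the ancient-book run.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# BERT-style dynamic masking: 15% of tokens masked each step.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="herberta_mlm",         # placeholder output path
    per_device_train_batch_size=32,    # batch size from this card
    learning_rate=1e-5,                # initial learning rate from this card
    num_train_epochs=10,               # illustrative; the card does not state the epoch count
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```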
---
## Downstream Task: TCM Pattern Classification
### Task Definition
Using **321 pattern descriptions** extracted from TCM internal medicine textbooks, we evaluated the classification performance on four models:
1. **Herberta_seq_512_v2**: Pretrained on 700 ancient TCM books.
2. **Herberta_seq_512_v3**: Pretrained on 48 modern TCM textbooks.
3. **Herberta_seq_128_v2**: Pretrained on 700 ancient TCM books (128-length sequences).
4. **Roberta**: Baseline model without TCM-specific pretraining.
### Training Configuration
- Max Sequence Length: 512
- Batch Size: 16
- Epochs: 30
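
A minimal sketch of this fine-tuning setup is shown below. The CSV files, their `text`/`label` column names, and the output directory are hypothetical; only the sequence length, batch size, and epoch count come from the configuration above.

```python
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

# Hypothetical CSVs with columns "text" (pattern description) and "label" (pattern id).
dataset = load_dataset("csv", data_files={"train": "patterns_train.csv", "test": "patterns_test.csv"})
num_labels = len(set(dataset["train"]["label"]))

model_name = "XiaoEnn/herberta"  # or any of the compared checkpoints
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="herberta_pattern_cls",  # placeholder output path
    per_device_train_batch_size=16,     # batch size from this card
    num_train_epochs=30,                # epochs from this card
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
)
trainer.train()
```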
### Results
| Model Name | Eval Accuracy | Eval F1 | Eval Precision | Eval Recall |
|--------------------------|---------------|-----------|----------------|-------------|
| **Herberta_seq_512_v2** | **0.9454** | **0.9293** | **0.9221** | **0.9454** |
| **Herberta_seq_512_v3** | 0.8989 | 0.8704 | 0.8583 | 0.8989 |
| **Herberta_seq_128_v2** | 0.8716 | 0.8443 | 0.8351 | 0.8716 |
| **Roberta** | 0.8743 | 0.8425 | 0.8311 | 0.8743 |

#### Summary
The **Herberta_seq_512_v2** model, pretrained on 700 ancient TCM books, exhibited superior performance across all evaluation metrics. This highlights the significance of domain-specific pretraining on larger and historically richer datasets for TCM applications.
---
## Quickstart
### Use Hugging Face
```python
import torch
from transformers import AutoTokenizer, AutoModel
model_name = "XiaoEnn/herberta"
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
# Input text
text = "中医理论是我国传统文化的瑰宝。"
# Tokenize and prepare input
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding="max_length", max_length=128)
# Get the model's outputs
with torch.no_grad():
outputs = model(**inputs)
# Get the embedding (sentence-level average pooling)
sentence_embedding = outputs.last_hidden_state.mean(dim=1)
print("Embedding shape:", sentence_embedding.shape)
print("Embedding vector:", sentence_embedding)
```
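
Building on the snippet above, one illustrative way to compare two formula or pattern descriptions is cosine similarity over the mean-pooled embeddings. The two example texts below are placeholders chosen for illustration.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

model_name = "XiaoEnn/herberta"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

def embed(text):
    # Mean-pool the last hidden state, as in the Quickstart above.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1)

# Placeholder TCM formula descriptions for illustration.
emb_a = embed("桂枝汤：桂枝、芍药、生姜、大枣、甘草。")  # Guizhi Decoction
emb_b = embed("麻黄汤：麻黄、桂枝、杏仁、甘草。")        # Mahuang Decoction
print("cosine similarity:", F.cosine_similarity(emb_a, emb_b).item())
```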
If you find our work helpful, please consider citing us:
```bibtex
@misc{herberta-embedding,
  title  = {Herberta: A Pretrained Model for TCM Herbal Medicine and Downstream Tasks as Text Embedding Generation},
  url    = {https://github.com/15392778677/herberta},
  author = {Yehan Yang and Xinhan Zheng},
  month  = {December},
  year   = {2024}
}

@article{herberta-technical-report,
  title       = {Herberta: A Pretrained Model for TCM Herbal Medicine and Downstream Tasks as Text Embedding Generation},
  author      = {Yehan Yang and Xinhan Zheng},
  institution = {Beijing Angelpro Technology Co., Ltd.},
  year        = {2024},
  note        = {Presented at the 2024 Machine Learning Applications Conference (MLAC)}
}
```