---
tags:
- PretrainModel
- TCM
- transformer
- herberta
- text-embedding
license: apache-2.0
language:
- zh
- en
metrics:
- accuracy
base_model:
- hfl/chinese-roberta-wwm-ext-large
new_version: XiaoEnn/herberta_seq_512_V2
inference: true
library_name: transformers
---

# Herberta: A Pretrained Model for TCM Herbal Medicine and Downstream Tasks

## Introduction

Herberta is a pretrained model developed by the Angelpro Team to advance representation learning and modeling for Traditional Chinese Medicine (TCM). Built on **chinese-roberta-wwm-ext-large**, it is further pretrained with a masked language modeling (MLM) objective on **700 ancient books (538.95M)** and **48 modern Chinese medicine textbooks (54M)**, yielding a robust model for embedding generation and TCM-specific downstream tasks.

The name "Herberta" combines "Herb" and "RoBERTa", reflecting its focus on herbal medicine research. Herberta is well suited for applications such as:

- **Encoder for Herbal Formulas**: Generating meaningful embeddings for TCM formulations.
- **Domain-Specific Word Embedding**: Serving the Chinese medicine text domain.
- **Support for TCM Downstream Tasks**: Including classification, labeling, and more.

---

## Pretraining Experiments

### Dataset

| Data Type                | Quantity             | Data Size |
|--------------------------|----------------------|-----------|
| **Ancient TCM Books**    | 700 books            | ~538.95M  |
| **Modern TCM Textbooks** | 48 books             | ~54M      |
| **Mixed-Type Dataset**   | 748 books (combined) | ~637.8M   |

### Pretraining Results


| Model                   | Eval Accuracy | Validation Loss | Validation Perplexity |
|-------------------------|---------------|-----------------|-----------------------|
| **herberta_seq_512_v2** | 0.9841        | 0.04367         | 1.083                 |
| **herberta_seq_128_v2** | 0.9406        | 0.2877          | 1.333                 |
| **herberta_seq_512_v3** | 0.755         | 1.100           | 3.010                 |

#### Metrics Comparison

![Accuracy](https://cdn-uploads.huggingface.co/production/uploads/6564baaa393bae9c194fc32e/RDgI-0Ro2kMiwV853Wkgx.png)
![Loss](https://cdn-uploads.huggingface.co/production/uploads/6564baaa393bae9c194fc32e/BJ7enbRg13IYAZuxwraPP.png)
![Perplexity](https://cdn-uploads.huggingface.co/production/uploads/6564baaa393bae9c194fc32e/lOohRMIctPJZKM5yEEcQ2.png)


### Pretraining Configuration

#### Ancient Books
- Pretraining Objective: BERT-style masked language modeling (15% of tokens masked)
- Sequence Length: 512
- Batch Size: 32
- Learning Rate: `1e-5` with an epoch-based decay (`epoch * 0.1`)
- Tokenization: sentence-level tokenization, with sequences shorter than 512 tokens padded (see the sketch below)
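
A minimal sketch of this MLM pretraining setup with the Hugging Face `Trainer` is shown below. The corpus file name and epoch count are assumptions for illustration, and the custom epoch-based learning-rate decay is omitted for brevity; this is not the exact training script.

```python
# Minimal sketch of the MLM pretraining setup described above, assuming a plain-text
# corpus file ("tcm_corpus.txt") with one passage per line. The epoch count is
# illustrative, and the epoch-based learning-rate decay is omitted for brevity.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_model = "hfl/chinese-roberta-wwm-ext-large"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForMaskedLM.from_pretrained(base_model)

# Hypothetical corpus: one TCM passage per line.
dataset = load_dataset("text", data_files={"train": "tcm_corpus.txt"})

def tokenize(batch):
    # Sentence-level examples padded/truncated to the 512-token window.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# BERT-style dynamic masking of 15% of tokens.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="herberta_mlm",
    per_device_train_batch_size=32,
    learning_rate=1e-5,
    num_train_epochs=3,  # illustrative; not specified above
)

trainer = Trainer(model=model, args=args, train_dataset=tokenized["train"], data_collator=collator)
trainer.train()
```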

---

## Downstream Task: TCM Pattern Classification

### Task Definition
Using **321 pattern descriptions** extracted from TCM internal medicine textbooks, we evaluated the classification performance of four models:

1. **Herberta_seq_512_v2**: Pretrained on 700 ancient TCM books.
2. **Herberta_seq_512_v3**: Pretrained on 48 modern TCM textbooks.
3. **Herberta_seq_128_v2**: Pretrained on 700 ancient TCM books (128-length sequences).
4. **Roberta**: Baseline model without TCM-specific pretraining.

### Training Configuration
- Max Sequence Length: 512
- Batch Size: 16
- Epochs: 30
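
The following is a minimal fine-tuning sketch matching the configuration above. The dataset files, column names, and the number of pattern classes (`num_labels`) are hypothetical placeholders, since the card only specifies the 321 pattern descriptions and the hyperparameters listed.

```python
# Minimal fine-tuning sketch for the pattern-classification task. The CSV files,
# column names ("text", "label"), and num_labels are hypothetical placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "XiaoEnn/herberta"  # or a locally pretrained Herberta checkpoint
num_labels = 48                  # hypothetical number of TCM pattern classes

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)

dataset = load_dataset(
    "csv", data_files={"train": "patterns_train.csv", "eval": "patterns_eval.csv"}
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="herberta_pattern_cls",
    per_device_train_batch_size=16,
    num_train_epochs=30,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["eval"],
)
trainer.train()
print(trainer.evaluate())  # reports eval loss; accuracy/F1 need a compute_metrics function
```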

### Results

| Model Name              | Eval Accuracy | Eval F1   | Eval Precision | Eval Recall |
|--------------------------|---------------|-----------|----------------|-------------|
| **Herberta_seq_512_v2** | **0.9454**    | **0.9293** | **0.9221**     | **0.9454**  |
| **Herberta_seq_512_v3** | 0.8989        | 0.8704    | 0.8583         | 0.8989      |
| **Herberta_seq_128_v2** | 0.8716        | 0.8443    | 0.8351         | 0.8716      |
| **Roberta**             | 0.8743        | 0.8425    | 0.8311         | 0.8743      |

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6564baaa393bae9c194fc32e/1yG96YdzXuxQlTfjOmXqg.png)


#### Summary
The **Herberta_seq_512_v2** model, pretrained on 700 ancient TCM books, exhibited superior performance across all evaluation metrics. This highlights the significance of domain-specific pretraining on larger and historically richer datasets for TCM applications.

---

## Quickstart

### Use Hugging Face

```python
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "XiaoEnn/herberta"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Input text (Chinese: "TCM theory is a treasure of China's traditional culture.")
text = "中医理论是我国传统文化的瑰宝。"

# Tokenize and prepare input
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding="max_length", max_length=128)

# Get the model's outputs
with torch.no_grad():
    outputs = model(**inputs)

# Get the embedding (sentence-level average pooling)
sentence_embedding = outputs.last_hidden_state.mean(dim=1)

print("Embedding shape:", sentence_embedding.shape)
print("Embedding vector:", sentence_embedding)

```
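
As a follow-up, the sketch below illustrates the "Encoder for Herbal Formulas" use case from the introduction by comparing two formula descriptions via cosine similarity of their mean-pooled embeddings. The formula texts are illustrative; unlike the snippet above, it skips fixed-length padding so that padding tokens do not dilute the average.

```python
# Follow-up sketch: compare two hypothetical formula descriptions by cosine similarity
# of their mean-pooled embeddings. No fixed-length padding is used, so pad tokens do
# not dilute the average.
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "XiaoEnn/herberta"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

def embed(text: str) -> torch.Tensor:
    # Mean-pool the last hidden state over the token dimension.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

a = embed("桂枝汤：桂枝、芍药、生姜、大枣、甘草。")  # Guizhi Decoction and its herbs
b = embed("麻黄汤：麻黄、桂枝、杏仁、甘草。")        # Mahuang Decoction and its herbs

similarity = torch.nn.functional.cosine_similarity(a, b, dim=0)
print(f"Cosine similarity: {similarity.item():.4f}")
```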

If you find our work helpful, please consider citing us:

@misc{herberta-embedding,
  title = {Herberta: A Pretrained Model for TCM Herbal Medicine and Downstream Tasks as Text Embedding Generation},
  url = {https://github.com/15392778677/herberta},
  author = {Yehan Yang and Xinhan Zheng},
  month = {December},
  year = {2024}
}

@article{herberta-technical-report,
  title={Herberta: A Pretrained Model for TCM Herbal Medicine and Downstream Tasks as Text Embedding Generation},
  author={Yehan Yang and Xinhan Zheng},
  institution={Beijing Angelpro Technology Co., Ltd.},
  year={2024},
  note={Presented at the 2024 Machine Learning Applications Conference (MLAC)}
}