---
license: apache-2.0
---

# Pre-trained Language Model for the Humanities and Social Sciences in Chinese

## Introduction

Research on Chinese social science texts needs the support of natural language processing tools.

Pre-trained language models have greatly improved the accuracy of text mining on general-domain texts. There is now an urgent need for a pre-trained language model designed specifically for the automatic processing of scientific texts in the Chinese social sciences.

We used abstracts of social science research articles as the training corpus. Based on the BERT deep language model framework, we built the CSSCI_ABS_BERT, CSSCI_ABS_roberta and CSSCI_ABS_roberta-wwm pre-trained language models with [transformers/run_mlm.py](https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_mlm.py) and [transformers/mlm_wwm](https://github.com/huggingface/transformers/tree/main/examples/research_projects/mlm_wwm).
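
The following minimal sketch illustrates the same kind of domain-adaptive masked-language-model pretraining using the Transformers `Trainer` API rather than the scripts linked above; the corpus file name and hyperparameters are illustrative only, not the exact settings used for the released checkpoints.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from the general-domain Chinese BERT and continue masked-language-model
# pretraining on a plain-text corpus with one abstract per line.
# "cssci_abstracts.txt" is an illustrative file name, not a released resource.
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForMaskedLM.from_pretrained("bert-base-chinese")

dataset = load_dataset("text", data_files={"train": "cssci_abstracts.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking: 15% of tokens are masked on the fly in every batch.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="cssci_abs_bert",
        per_device_train_batch_size=16,
        num_train_epochs=3,
    ),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```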

We designed four downstream text classification tasks on different Chinese social science article corpora to verify the performance of the models.

- CSSCI_ABS_BERT, CSSCI_ABS_roberta and CSSCI_ABS_roberta-wwm are trained on abstracts of articles published in CSSCI journals. The training set used in the experiments contains a total of `510,956,094 words`.
- Following the idea of domain-adaptive pretraining, `CSSCI_ABS_BERT` and `CSSCI_ABS_roberta` combine a large corpus of Chinese scientific-article abstracts with the BERT architecture: the BERT and Chinese-RoBERTa models are further trained on these abstracts to obtain pre-trained models for the automatic processing of Chinese social science research texts.

## News

- 2022-06-15: CSSCI_ABS_BERT, CSSCI_ABS_roberta and CSSCI_ABS_roberta-wwm were released for the first time.

## How to use

### Huggingface Transformers

CSSCI_ABS_BERT, CSSCI_ABS_roberta and CSSCI_ABS_roberta-wwm can be obtained directly online with the `from_pretrained` method of [Huggingface Transformers](https://github.com/huggingface/transformers).

- CSSCI_ABS_BERT

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("KM4STfulltext/CSSCI_ABS_BERT")
model = AutoModel.from_pretrained("KM4STfulltext/CSSCI_ABS_BERT")
```

- CSSCI_ABS_roberta

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("KM4STfulltext/CSSCI_ABS_roberta")
model = AutoModel.from_pretrained("KM4STfulltext/CSSCI_ABS_roberta")
```
54
+
55
+ - CSSCI_ABS_roberta-wwm
56
+
57
+ ```python
58
+ from transformers import AutoTokenizer, AutoModel
59
+
60
+ tokenizer = AutoTokenizer.from_pretrained("KM4STfulltext/CSSCI_ABS_roberta_wwm")
61
+
62
+ model = AutoModel.from_pretrained("KM4STfulltext/CSSCI_ABS_roberta_wwm")
63
+ ```
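
After loading, the models can be used as ordinary BERT-style encoders. As a minimal usage sketch (the example sentence is illustrative), the `[CLS]` hidden state can serve as a sentence-level feature for downstream tasks:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("KM4STfulltext/CSSCI_ABS_BERT")
model = AutoModel.from_pretrained("KM4STfulltext/CSSCI_ABS_BERT")
model.eval()

# Encode one abstract sentence and take the [CLS] hidden state as a sentence feature.
inputs = tokenizer("本文基于CSSCI期刊论文摘要训练预训练语言模型。", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
cls_embedding = outputs.last_hidden_state[:, 0, :]
print(cls_embedding.shape)  # (1, hidden_size), e.g. (1, 768)
```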

### Download Models

- The models are provided as `PyTorch` checkpoints.

### From Huggingface

- Download directly from the Huggingface Hub:
  - [KM4STfulltext/CSSCI_ABS_BERT](https://huggingface.co/KM4STfulltext/CSSCI_ABS_BERT)
  - [KM4STfulltext/CSSCI_ABS_roberta](https://huggingface.co/KM4STfulltext/CSSCI_ABS_roberta)
  - [KM4STfulltext/CSSCI_ABS_roberta_wwm](https://huggingface.co/KM4STfulltext/CSSCI_ABS_roberta_wwm)
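
To keep a full local copy of a repository, a minimal sketch with the `huggingface_hub` library is shown below; the choice of repository is illustrative.

```python
from huggingface_hub import snapshot_download

# Download the whole model repository (config, vocabulary, PyTorch weights)
# into the local cache and return its path.
local_dir = snapshot_download(repo_id="KM4STfulltext/CSSCI_ABS_BERT")
print(local_dir)
```

The returned path can then be passed to `from_pretrained` in place of the model name.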

## Evaluation & Results

- We used CSSCI_ABS_BERT, CSSCI_ABS_roberta and CSSCI_ABS_roberta-wwm to perform text classification on different social science research corpora; a minimal fine-tuning sketch is shown below, followed by the experimental results.
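
The following is a minimal sketch of how such a fine-tuning run could be set up with the Transformers `Trainer`; the CSV files, column names and number of labels are illustrative, not our experimental configuration.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

# Illustrative CSV files with a text column "abstract" and an integer column "label".
dataset = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

tokenizer = AutoTokenizer.from_pretrained("KM4STfulltext/CSSCI_ABS_BERT")
model = AutoModelForSequenceClassification.from_pretrained(
    "KM4STfulltext/CSSCI_ABS_BERT",
    num_labels=10,  # illustrative number of target classes
)

def tokenize(batch):
    return tokenizer(batch["abstract"], truncation=True, max_length=512)

encoded = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="cssci_text_classification",
        per_device_train_batch_size=16,
        num_train_epochs=3,
    ),
    train_dataset=encoded["train"],
    eval_dataset=encoded["test"],
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
)
trainer.train()
print(trainer.evaluate())
```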

#### Discipline classification experiments on articles published in CSSCI journals

See [S-T-Full-Text-Knowledge-Mining/CSSCI-BERT](https://github.com/S-T-Full-Text-Knowledge-Mining/CSSCI-BERT) for the results of this experiment.

#### Movement recognition experiments on data analysis and knowledge discovery abstracts

| Tag          | bert-base-chinese | chinese-roberta-wwm-ext | CSSCI_ABS_BERT | CSSCI_ABS_roberta | CSSCI_ABS_roberta_wwm | support |
| ------------ | ----------------- | ----------------------- | -------------- | ----------------- | --------------------- | ------- |
| Abstract     | 55.23             | 62.44                   | 56.8           | 57.96             | 58.26                 | 223     |
| Location     | 61.61             | 54.38                   | 61.83          | 61.4              | 61.94                 | 2866    |
| Metric       | 45.08             | 41                      | 45.27          | 46.74             | 47.13                 | 622     |
| Organization | 46.85             | 35.29                   | 45.72          | 45.44             | 44.65                 | 327     |
| Person       | 88.66             | 82.79                   | 88.21          | 88.29             | 88.51                 | 4850    |
| Thing        | 71.68             | 65.34                   | 71.88          | 71.68             | 71.81                 | 5993    |
| Time         | 65.35             | 60.38                   | 64.15          | 65.26             | 66.03                 | 1272    |
| avg          | 72.69             | 66.62                   | 72.59          | 72.61             | 72.89                 | 16153   |

#### Chinese literary entity recognition

| Tag          | bert-base-chinese | chinese-roberta-wwm-ext | CSSCI_ABS_BERT | CSSCI_ABS_roberta | CSSCI_ABS_roberta_wwm | support |
| ------------ | ----------------- | ----------------------- | -------------- | ----------------- | --------------------- | ------- |
| Abstract     | 55.23             | 62.44                   | 56.8           | 57.96             | 58.26                 | 223     |
| Location     | 61.61             | 54.38                   | 61.83          | 61.4              | 61.94                 | 2866    |
| Metric       | 45.08             | 41                      | 45.27          | 46.74             | 47.13                 | 622     |
| Organization | 46.85             | 35.29                   | 45.72          | 45.44             | 44.65                 | 327     |
| Person       | 88.66             | 82.79                   | 88.21          | 88.29             | 88.51                 | 4850    |
| Thing        | 71.68             | 65.34                   | 71.88          | 71.68             | 71.81                 | 5993    |
| Time         | 65.35             | 60.38                   | 64.15          | 65.26             | 66.03                 | 1272    |
| avg          | 72.69             | 66.62                   | 72.59          | 72.61             | 72.89                 | 16153   |

## Citation

- If our work is helpful for your research, please cite it in your article.
- Until our paper is published, you can cite this repository instead: [S-T-Full-Text-Knowledge-Mining/CSSCI-BERT](https://github.com/S-T-Full-Text-Knowledge-Mining/CSSCI-BERT).

## Disclaimer

- The experimental results reflect performance only under a specific dataset and hyperparameter combination and do not characterize each model in general. Results may vary with random seeds and computing hardware.
- **Users may use the models freely within the scope of the license, but we are not responsible for any direct or indirect losses caused by using the content of this project.**

## Acknowledgment

- CSSCI_ABS_BERT was trained based on [BERT-Base-Chinese](https://github.com/google-research/bert).
- CSSCI_ABS_roberta and CSSCI_ABS_roberta-wwm were trained based on [RoBERTa-wwm-ext, Chinese](https://github.com/ymcui/Chinese-BERT-wwm).