clint-holt committed
Commit 6e6e48e · 1 Parent(s): 648eed0

feat: Add AbLangPDB1 model files and code
.gitattributes CHANGED
@@ -1,35 +1 @@
- *.7z filter=lfs diff=lfs merge=lfs -text
- *.arrow filter=lfs diff=lfs merge=lfs -text
- *.bin filter=lfs diff=lfs merge=lfs -text
- *.bz2 filter=lfs diff=lfs merge=lfs -text
- *.ckpt filter=lfs diff=lfs merge=lfs -text
- *.ftz filter=lfs diff=lfs merge=lfs -text
- *.gz filter=lfs diff=lfs merge=lfs -text
- *.h5 filter=lfs diff=lfs merge=lfs -text
- *.joblib filter=lfs diff=lfs merge=lfs -text
- *.lfs.* filter=lfs diff=lfs merge=lfs -text
- *.mlmodel filter=lfs diff=lfs merge=lfs -text
- *.model filter=lfs diff=lfs merge=lfs -text
- *.msgpack filter=lfs diff=lfs merge=lfs -text
- *.npy filter=lfs diff=lfs merge=lfs -text
- *.npz filter=lfs diff=lfs merge=lfs -text
- *.onnx filter=lfs diff=lfs merge=lfs -text
- *.ot filter=lfs diff=lfs merge=lfs -text
- *.parquet filter=lfs diff=lfs merge=lfs -text
- *.pb filter=lfs diff=lfs merge=lfs -text
- *.pickle filter=lfs diff=lfs merge=lfs -text
- *.pkl filter=lfs diff=lfs merge=lfs -text
- *.pt filter=lfs diff=lfs merge=lfs -text
- *.pth filter=lfs diff=lfs merge=lfs -text
- *.rar filter=lfs diff=lfs merge=lfs -text
  *.safetensors filter=lfs diff=lfs merge=lfs -text
- saved_model/**/* filter=lfs diff=lfs merge=lfs -text
- *.tar.* filter=lfs diff=lfs merge=lfs -text
- *.tar filter=lfs diff=lfs merge=lfs -text
- *.tflite filter=lfs diff=lfs merge=lfs -text
- *.tgz filter=lfs diff=lfs merge=lfs -text
- *.wasm filter=lfs diff=lfs merge=lfs -text
- *.xz filter=lfs diff=lfs merge=lfs -text
- *.zip filter=lfs diff=lfs merge=lfs -text
- *.zst filter=lfs diff=lfs merge=lfs -text
- *tfevents* filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1,2 @@
+ __pycache__/
+ *.pyc
README.md CHANGED
@@ -1,3 +1,141 @@
- ---
- license: mit
- ---
+ ---
+ license: mit
+ language: en
+ tags:
+ - pytorch
+ - feature-extraction
+ - biology
+ - protein-sequences
+ - antibodies
+ - ablang
+ - PDB
+ - contrastive-learning
+ library_name: transformers
+ ---
+
+ # AbLangPDB1: Contrastive-Learned Antibody Embeddings for General Epitope-Overlap Predictions
+
+ This repository contains the model, code, and tokenizers for **AbLangPDB1**.
+
+ ## Model Description
+ **AbLangPDB1** is a fine-tuned antibody language model that generates antibody embeddings for finding epitope/antigen-specificity matches to reference antibodies.
+
+ The model was developed using contrastive learning on paired heavy and light chain sequences, as described in our paper:
+
+ > [Contrastive Learning Enables Epitope Overlap Predictions for Targeted Antibody Discovery](https://doi.org/10.1101/2025.02.25.640114). Clinton M. Holt, Alexis K. Janke, Parastoo Amlashi, Parker J. Jamieson, Toma M. Marinov, Ivelin S. Georgiev. *bioRxiv*, 2025. https://doi.org/10.1101/2025.02.25.640114
+
+ ### Model Architecture
+
+ ```
+ Heavy Chain Seq -> [AbLang Heavy] -> 768-dim -> |
+                                                 | -> [Concatenate] -> [Mixer Network] -> 1536-dim Paired Embedding
+ Light Chain Seq -> [AbLang Light] -> 768-dim -> |
+ ```
+
+ The `AbLangPDB1` model uses the AbLangPaired architecture, a custom class that processes the heavy and light chains of an antibody independently with the pre-trained **AbLang** models before fusing their embeddings. The resulting embeddings from the two AbLang models are concatenated and passed through a custom **Mixer** network (6 fully connected feed-forward layers) to produce a final, unified 1536-dimensional embedding for the paired antibody.
+
+ The pretrained heavy model is [AbLang_heavy](https://huggingface.co/qilowoq/AbLang_heavy) and the pretrained light model is [AbLang_light](https://huggingface.co/qilowoq/AbLang_light). In brief, these use the RoBERTa architecture pretrained with the masked language modeling objective. Each model has 12 transformer blocks with 12 attention heads, an inner hidden size of 3072, and a hidden size of 768, and uses a learned positional embedding specific to antibodies with a maximum length of 160. The 768-dimensional embedding from each model is generated by mean pooling over all residue-level embeddings.
+
+ During training, these pretrained models were frozen and a QLoRA adapter was added.
+
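+ Below is a minimal sketch of this fusion step, for illustration only: mean-pool each chain's residue embeddings, concatenate, and pass the result through a 6-layer mixer. The pooling here is deliberately simplified (unlike `get_sequence_embeddings` in `ablangpaired_model.py`, it does not exclude the [CLS]/[SEP] positions), and all tensors are random stand-ins for AbLang outputs.
+
+ ```python
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+
+ def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
+     """Average residue-level embeddings over unmasked positions -> (batch, 768)."""
+     mask = attention_mask.unsqueeze(-1).float()      # (batch, seq_len, 1)
+     summed = (last_hidden_state * mask).sum(dim=1)   # (batch, 768)
+     counts = mask.sum(dim=1).clamp(min=1e-9)         # (batch, 1)
+     return summed / counts
+
+ # Random stand-ins for AbLang last_hidden_state of one antibody (max length 160, hidden size 768)
+ heavy_hidden, light_hidden = torch.randn(1, 160, 768), torch.randn(1, 160, 768)
+ heavy_mask, light_mask = torch.ones(1, 160), torch.ones(1, 160)
+
+ h_emb = mean_pool(heavy_hidden, heavy_mask)    # (1, 768)
+ l_emb = mean_pool(light_hidden, light_mask)    # (1, 768)
+ paired = torch.cat([h_emb, l_emb], dim=1)      # (1, 1536)
+
+ # Mixer: 6 fully connected layers (ReLU between them) over the concatenated embedding
+ mixer_layers = []
+ for _ in range(5):
+     mixer_layers += [nn.Linear(1536, 1536), nn.ReLU()]
+ mixer_layers.append(nn.Linear(1536, 1536))
+ mixer = nn.Sequential(*mixer_layers)
+
+ embedding = F.normalize(mixer(paired), p=2, dim=1)  # final 1536-dim paired embedding
+ print(embedding.shape)  # torch.Size([1, 1536])
+ ```
+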
+ ## Intended uses & limitations
+ The model is intended to generate epitope-information-rich embeddings of antibodies, though a prediction head could be added to make other predictions, such as neutralization capacity. Expect accuracy to be significantly better when comparing antibodies to those within the PDB.
+
+ 1. **Epitope Classification**: Antibodies with unknown epitopes can be embedded and compared against a reference database of antibodies with known epitopes. The reference antibody with the highest cosine similarity is predicted to have the epitope most similar to that of the query antibody (see the sketch after this list).
+ **Limitation**: Mouse BCRs are unlikely to perform well here, and BCRs which do not bind a Pfam domain used in training are likely to have reduced classification accuracy.
+
+ 2. **Antibody Search**: A reference antibody sequence can be embedded along with a large search database. Antibodies with high cosine similarities in the search database can be assumed to have similar epitope targets; representative candidates can then be chosen from each cluster of hits for downstream characterization.
+
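+ Both use cases reduce to a cosine-similarity lookup against a reference set. A minimal sketch with random stand-in tensors (in practice, `query_embedding` and `reference_embeddings` would come from the model as shown in the How to Use section below, and the labels would be your known epitopes):
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ # Illustrative tensors: one query antibody and 1,000 reference antibodies with known epitopes
+ query_embedding = F.normalize(torch.randn(1, 1536), dim=1)
+ reference_embeddings = F.normalize(torch.randn(1000, 1536), dim=1)
+ reference_epitopes = [f"epitope_{i % 25}" for i in range(1000)]  # placeholder labels
+
+ # AbLangPDB1 embeddings are L2-normalized, so cosine similarity is just a dot product
+ similarities = query_embedding @ reference_embeddings.T  # (1, 1000)
+ top = torch.topk(similarities.squeeze(0), k=5)
+
+ for score, idx in zip(top.values.tolist(), top.indices.tolist()):
+     print(f"{reference_epitopes[idx]}\tcosine similarity = {score:.3f}")
+ ```
+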
+ ## Training data
+ For AbLang-PDB, we curated 1,909 non-redundant human antibodies from the [Structural Antibody Database (SAbDab)](https://doi.org/10.1093/nar/gkt1043) with a February 19, 2024 cutoff date. These were assigned antigen domains using the [pfam_scan software](https://github.com/aziele/pfam_scan), such that two antibodies containing at least one shared Pfam domain were considered to be in the same category. For partitioning antibodies between training (80%), validation (10%), and test (10%) splits, antibodies sharing both heavy and light V-genes and >70% CDRH3 amino acid identity were assigned to the same clone group, and clone groups were distributed such that no group was present in both the training and test sets. Additionally, pairs with >92.5% sequence identity in either chain were excluded to maintain diversity.
+
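+ As an illustration of the clone-grouping rule only (not the exact code used for the paper), the check could look like the following; the dictionary keys are hypothetical, the sequences are made up, and `SequenceMatcher` is merely a stand-in for a proper pairwise identity calculation:
+
+ ```python
+ from difflib import SequenceMatcher
+
+ def same_clone_group(ab1: dict, ab2: dict, cdrh3_cutoff: float = 0.70) -> bool:
+     """Clone-grouping rule: shared heavy and light V-genes plus CDRH3 identity above the cutoff."""
+     if ab1["HV_GENE"] != ab2["HV_GENE"] or ab1["LV_GENE"] != ab2["LV_GENE"]:
+         return False
+     cdrh3_identity = SequenceMatcher(None, ab1["CDRH3_AA"], ab2["CDRH3_AA"]).ratio()
+     return cdrh3_identity > cdrh3_cutoff
+
+ ab_a = {"HV_GENE": "IGHV3-23", "LV_GENE": "IGKV1-39", "CDRH3_AA": "ARGRWYRRALDY"}
+ ab_b = {"HV_GENE": "IGHV3-23", "LV_GENE": "IGKV1-39", "CDRH3_AA": "ARGRWYRKALDY"}
+ print(same_clone_group(ab_a, ab_b))  # True: same V-genes and >70% CDRH3 identity
+ ```
+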
+ ## Training Procedure
+ The AbLang-PDB model was trained with a mean squared error loss between the cosine similarity of a pair of antibody embeddings and the ground-truth amount of epitope overlap for that pair. The epitope-overlap labels place antibodies that bind the same antigen protein family in the same general vicinity of embedding space, while pushing antibodies that bind overlapping epitopes progressively closer together.
+
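+ A minimal sketch of this objective (tensor names and label values are illustrative; during training, gradients would flow into the QLoRA adapters and the Mixer rather than into random tensors):
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ # Illustrative batch: 8 antibody pairs with 1536-dim embeddings and a ground-truth
+ # epitope-overlap label in [0, 1] for each pair (higher = more overlapping epitopes)
+ emb_a = F.normalize(torch.randn(8, 1536, requires_grad=True), dim=1)
+ emb_b = F.normalize(torch.randn(8, 1536, requires_grad=True), dim=1)
+ target_overlap = torch.rand(8)
+
+ cosine = F.cosine_similarity(emb_a, emb_b, dim=1)  # (8,)
+ loss = F.mse_loss(cosine, target_overlap)          # MSE between cosine similarity and overlap label
+ loss.backward()
+ print(float(loss))
+ ```
+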
+ ## How to Use
+ To use this model, first set up the repository and its dependencies, then generate embeddings as shown below.
+
+ ### 1. Setup
+ First, clone this repository and install the required libraries.
+
+ ```bash
+ # Clone the repository to get the model script, weights, and tokenizers
+ git clone https://huggingface.co/clint-holt/AbLangPDB1
+ cd AbLangPDB1
+
+ # Install dependencies
+ pip install torch pandas "transformers>=4.30.0" safetensors
+ ```
+
+ ### 2. Generate Embeddings
+ Then run the following code:
+
+ ```python
+ import torch
+ import pandas as pd
+ from transformers import AutoTokenizer
+
+ # Import the custom model class and config from the cloned repository
+ from ablangpaired_model import AbLangPaired, AbLangPairedConfig
+
+ # 1. Load Model and Tokenizers
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+ model_dir = "."  # Assumes you are running this script from the cloned directory
+
+ # Configure the model to load the local weights
+ # The AbLangPairedConfig specifies the base AbLang models and the local checkpoint file
+ model_config = AbLangPairedConfig(checkpoint_filename=f"{model_dir}/ablangpdb_model.safetensors")
+ model = AbLangPaired(model_config, device).to(device)
+ model.eval()
+
+ # Tokenizers are stored in subdirectories
+ heavy_tokenizer = AutoTokenizer.from_pretrained(f"{model_dir}/heavy_tokenizer")
+ light_tokenizer = AutoTokenizer.from_pretrained(f"{model_dir}/light_tokenizer")
+
+ # 2. Prepare Antibody Sequences
+ data = {
+     'HC_AA': ["EVQLVESGGGLVQPGGSLRLSCAASGFNLYYYSIHWVRQAPGKGLEWVASISPYSSSTSYADSVKGRFTISADTSKNTAYLQMNSLRAEDTAVYYCARGRWYRRALDYWGQGTLVTVSS"],
+     'LC_AA': ["DIQMTQSPSSLSASVGDRVTITCRASQSVSSAVAWYQQKPGKAPKLLIYSASSLYSGVPSRFSGSRSGTDFTLTISSLQPEDFATYYCQQYPYYSSLITFGQGTKVEIK"]
+ }
+ df = pd.DataFrame(data)
+
+ # Pre-process sequences by adding spaces between amino acids
+ df["PREPARED_HC_SEQ"] = df["HC_AA"].apply(lambda x: " ".join(list(x)))
+ df["PREPARED_LC_SEQ"] = df["LC_AA"].apply(lambda x: " ".join(list(x)))
+
+ # 3. Tokenize and Embed
+ h_tokens = heavy_tokenizer(df["PREPARED_HC_SEQ"].tolist(), padding='longest', return_tensors="pt")
+ l_tokens = light_tokenizer(df["PREPARED_LC_SEQ"].tolist(), padding='longest', return_tensors="pt")
+
+ with torch.no_grad():
+     embeddings = model(
+         h_input_ids=h_tokens['input_ids'].to(device),
+         h_attention_mask=h_tokens['attention_mask'].to(device),
+         l_input_ids=l_tokens['input_ids'].to(device),
+         l_attention_mask=l_tokens['attention_mask'].to(device)
+     )
+
+ print("Embedding generation complete! ✅")
+ print("Shape of embeddings tensor:", embeddings.shape)
+ # Expected output shape: (1, 1536)
+ ```
+
+ ## Citation
+ If you use this model or code in your research, please cite our paper:
+
+ ```bibtex
+ @article{Holt2025.02.25.640114,
+     author = {Holt, Clinton M. and Janke, Alexis K. and Amlashi, Parastoo and Jamieson, Parker J. and Marinov, Toma M. and Georgiev, Ivelin S.},
+     title = {Contrastive Learning Enables Epitope Overlap Predictions for Targeted Antibody Discovery},
+     elocation-id = {2025.02.25.640114},
+     year = {2025},
+     doi = {10.1101/2025.02.25.640114},
+     publisher = {Cold Spring Harbor Laboratory},
+     URL = {https://www.biorxiv.org/content/early/2025/04/01/2025.02.25.640114},
+     eprint = {https://www.biorxiv.org/content/early/2025/04/01/2025.02.25.640114.full.pdf},
+     journal = {bioRxiv}
+ }
+ ```
ablangpaired_model.py ADDED
@@ -0,0 +1,115 @@
+ # ablangpaired_model.py
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+ from transformers import PreTrainedModel, PretrainedConfig, AutoModel, AutoConfig
+ from safetensors.torch import load_file
+
+ import typing as T
+
+
+ class Mixer(nn.Module):
+     def __init__(self, in_d: int = 1536):
+         super(Mixer, self).__init__()
+         self.layers = nn.Sequential(
+             nn.Linear(in_d, in_d),  # First layer
+             nn.ReLU(),              # First activation function
+             nn.Linear(in_d, in_d),  # Second layer
+             nn.ReLU(),              # Second activation function
+             nn.Linear(in_d, in_d),  # Third layer
+             nn.ReLU(),              # Third activation function
+             nn.Linear(in_d, in_d),  # Fourth layer
+             nn.ReLU(),              # Fourth activation function
+             nn.Linear(in_d, in_d),  # Fifth layer
+             nn.ReLU(),              # Fifth activation function
+             nn.Linear(in_d, in_d)   # Output layer
+             # No activation here; apply softmax or sigmoid externally if needed, depending on your loss function
+         )
+
+     def forward(self, x):
+         return self.layers(x)
+
+
+ def get_sequence_embeddings(mask, model_output):
+     mask = mask.float()
+     # dict of sep tokens: k = antibody index, v = index of the final position where mask == 1
+     d = {k: v for k, v in torch.nonzero(mask).cpu().numpy()}
+     # make sep token invisible
+     for i in d:
+         mask[i, d[i]] = 0
+     mask[:, 0] = 0.0  # make cls token invisible
+     mask = mask.unsqueeze(-1).expand(model_output.last_hidden_state.size())
+     sum_embeddings = torch.sum(model_output.last_hidden_state * mask, 1)
+     sum_mask = torch.clamp(mask.sum(1), min=1e-9)
+     return sum_embeddings / sum_mask  # sum_mask is the number of unmasked positions
+
+
+ class AbLangPairedConfig(PretrainedConfig):
+     model_type = "ablang_paired"
+
+     def __init__(
+         self,
+         checkpoint_filename: str,
+         heavy_model_id='qilowoq/AbLang_heavy',
+         heavy_revision='ecac793b0493f76590ce26d48f7aac4912de8717',
+         light_model_id='qilowoq/AbLang_light',
+         light_revision='ce0637166f5e6e271e906d29a8415d9fdc30e377',
+         mixer_hidden_dim: int = 1536,
+         **kwargs
+     ):
+         super().__init__(**kwargs)
+         self.checkpoint_filename = checkpoint_filename
+         self.heavy_model_id = heavy_model_id
+         self.heavy_revision = heavy_revision
+         self.light_model_id = light_model_id
+         self.light_revision = light_revision
+         self.mixer_hidden_dim = mixer_hidden_dim
+
+
+ class AbLangPaired(PreTrainedModel):
+
+     def __init__(self, personal_config: AbLangPairedConfig, device: T.Union[str, torch.device] = "cpu"):
+         # During training I used the AbLang_heavy config as AbLangPaired's config
+         # This may be why it is very hard to integrate this into the Hugging Face AutoModel system
+         self.config = AutoConfig.from_pretrained(personal_config.heavy_model_id, revision=personal_config.heavy_revision)
+         super().__init__(self.config)
+
+         self.roberta_heavy = AutoModel.from_pretrained(
+             personal_config.heavy_model_id,
+             revision=personal_config.heavy_revision,  # Specific commit hash
+             trust_remote_code=True
+         )
+
+         self.roberta_light = AutoModel.from_pretrained(
+             personal_config.light_model_id,
+             revision=personal_config.light_revision,  # Specific commit hash
+             trust_remote_code=True
+         )
+
+         self.mixer = Mixer(in_d=1536)
+
+         # Load either a torch checkpoint or a safetensors file
+         if personal_config.checkpoint_filename.endswith('.safetensors'):
+             state_dict = load_file(personal_config.checkpoint_filename)
+         else:
+             state_dict = torch.load(personal_config.checkpoint_filename, map_location=device)
+
+         load_result = self.load_state_dict(state_dict, strict=False)
+         self.to(device)
+         self.eval()
+
+     def forward(self, h_input_ids, h_attention_mask, l_input_ids, l_attention_mask, **kwargs):
+         # Run chains through separate streams
+         outputs_h = self.roberta_heavy(input_ids=h_input_ids.to(torch.int64), attention_mask=h_attention_mask)
+         outputs_l = self.roberta_light(input_ids=l_input_ids.to(torch.int64), attention_mask=l_attention_mask)
+
+         # Mean pool
+         pooled_output_h = get_sequence_embeddings(h_attention_mask, outputs_h)
+         pooled_output_l = get_sequence_embeddings(l_attention_mask, outputs_l)
+
+         # Concatenate and then do 6 fully connected layers to pick up on cross-chain features
+         pooled_output = torch.cat([pooled_output_h, pooled_output_l], dim=1)
+         pooled_output = self.mixer(pooled_output)
+         embedding = F.normalize(pooled_output, p=2, dim=1)
+         return embedding
ablangpdb_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:049b60d04449bd39da23b59205e48b9b59425aa7ae14a91414b8dfe7483856e4
+ size 738301704
config.json ADDED
@@ -0,0 +1,17 @@
+ {
+   "architectures": [
+     "AbLangPaired"
+   ],
+   "heavy_model_id": "qilowoq/AbLang_heavy",
+   "heavy_revision": "ecac793b0493f76590ce26d48f7aac4912de8717",
+   "light_model_id": "qilowoq/AbLang_light",
+   "light_revision": "ce0637166f5e6e271e906d29a8415d9fdc30e377",
+   "mixer_hidden_dim": 1536,
+   "model_type": "ablang_paired",
+   "torch_dtype": "float32",
+   "transformers_version": "4.37.2",
+   "auto_map": {
+     "AutoConfig": "ablangpaired_model.AbLangPairedConfig",
+     "AutoModel": "ablangpaired_model.AbLangPaired"
+   }
+ }
heavy_tokenizer/special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
1
+ {
2
+ "cls_token": {
3
+ "content": "[CLS]",
4
+ "lstrip": false,
5
+ "normalized": false,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "mask_token": {
10
+ "content": "[MASK]",
11
+ "lstrip": false,
12
+ "normalized": false,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "pad_token": {
17
+ "content": "[PAD]",
18
+ "lstrip": false,
19
+ "normalized": false,
20
+ "rstrip": false,
21
+ "single_word": false
22
+ },
23
+ "sep_token": {
24
+ "content": "[SEP]",
25
+ "lstrip": false,
26
+ "normalized": false,
27
+ "rstrip": false,
28
+ "single_word": false
29
+ },
30
+ "unk_token": {
31
+ "content": "[UNK]",
32
+ "lstrip": false,
33
+ "normalized": false,
34
+ "rstrip": false,
35
+ "single_word": false
36
+ }
37
+ }
heavy_tokenizer/tokenizer.json ADDED
@@ -0,0 +1,175 @@
1
+ {
2
+ "version": "1.0",
3
+ "truncation": null,
4
+ "padding": null,
5
+ "added_tokens": [
6
+ {
7
+ "id": 0,
8
+ "content": "[CLS]",
9
+ "single_word": false,
10
+ "lstrip": false,
11
+ "rstrip": false,
12
+ "normalized": false,
13
+ "special": true
14
+ },
15
+ {
16
+ "id": 21,
17
+ "content": "[PAD]",
18
+ "single_word": false,
19
+ "lstrip": false,
20
+ "rstrip": false,
21
+ "normalized": false,
22
+ "special": true
23
+ },
24
+ {
25
+ "id": 22,
26
+ "content": "[SEP]",
27
+ "single_word": false,
28
+ "lstrip": false,
29
+ "rstrip": false,
30
+ "normalized": false,
31
+ "special": true
32
+ },
33
+ {
34
+ "id": 23,
35
+ "content": "[MASK]",
36
+ "single_word": false,
37
+ "lstrip": false,
38
+ "rstrip": false,
39
+ "normalized": false,
40
+ "special": true
41
+ },
42
+ {
43
+ "id": 24,
44
+ "content": "[UNK]",
45
+ "single_word": false,
46
+ "lstrip": false,
47
+ "rstrip": false,
48
+ "normalized": false,
49
+ "special": true
50
+ }
51
+ ],
52
+ "normalizer": {
53
+ "type": "BertNormalizer",
54
+ "clean_text": true,
55
+ "handle_chinese_chars": true,
56
+ "strip_accents": null,
57
+ "lowercase": false
58
+ },
59
+ "pre_tokenizer": {
60
+ "type": "BertPreTokenizer"
61
+ },
62
+ "post_processor": {
63
+ "type": "TemplateProcessing",
64
+ "single": [
65
+ {
66
+ "SpecialToken": {
67
+ "id": "[CLS]",
68
+ "type_id": 0
69
+ }
70
+ },
71
+ {
72
+ "Sequence": {
73
+ "id": "A",
74
+ "type_id": 0
75
+ }
76
+ },
77
+ {
78
+ "SpecialToken": {
79
+ "id": "[SEP]",
80
+ "type_id": 0
81
+ }
82
+ }
83
+ ],
84
+ "pair": [
85
+ {
86
+ "SpecialToken": {
87
+ "id": "[CLS]",
88
+ "type_id": 0
89
+ }
90
+ },
91
+ {
92
+ "Sequence": {
93
+ "id": "A",
94
+ "type_id": 0
95
+ }
96
+ },
97
+ {
98
+ "SpecialToken": {
99
+ "id": "[SEP]",
100
+ "type_id": 0
101
+ }
102
+ },
103
+ {
104
+ "Sequence": {
105
+ "id": "B",
106
+ "type_id": 1
107
+ }
108
+ },
109
+ {
110
+ "SpecialToken": {
111
+ "id": "[SEP]",
112
+ "type_id": 1
113
+ }
114
+ }
115
+ ],
116
+ "special_tokens": {
117
+ "[CLS]": {
118
+ "id": "[CLS]",
119
+ "ids": [
120
+ 0
121
+ ],
122
+ "tokens": [
123
+ "[CLS]"
124
+ ]
125
+ },
126
+ "[SEP]": {
127
+ "id": "[SEP]",
128
+ "ids": [
129
+ 22
130
+ ],
131
+ "tokens": [
132
+ "[SEP]"
133
+ ]
134
+ }
135
+ }
136
+ },
137
+ "decoder": {
138
+ "type": "WordPiece",
139
+ "prefix": "##",
140
+ "cleanup": true
141
+ },
142
+ "model": {
143
+ "type": "WordPiece",
144
+ "unk_token": "[UNK]",
145
+ "continuing_subword_prefix": "##",
146
+ "max_input_chars_per_word": 100,
147
+ "vocab": {
148
+ "[CLS]": 0,
149
+ "M": 1,
150
+ "R": 2,
151
+ "H": 3,
152
+ "K": 4,
153
+ "D": 5,
154
+ "E": 6,
155
+ "S": 7,
156
+ "T": 8,
157
+ "N": 9,
158
+ "Q": 10,
159
+ "C": 11,
160
+ "G": 12,
161
+ "P": 13,
162
+ "A": 14,
163
+ "V": 15,
164
+ "I": 16,
165
+ "F": 17,
166
+ "Y": 18,
167
+ "W": 19,
168
+ "L": 20,
169
+ "[PAD]": 21,
170
+ "[SEP]": 22,
171
+ "[MASK]": 23,
172
+ "[UNK]": 24
173
+ }
174
+ }
175
+ }
heavy_tokenizer/tokenizer_config.json ADDED
@@ -0,0 +1,57 @@
1
+ {
2
+ "added_tokens_decoder": {
3
+ "0": {
4
+ "content": "[CLS]",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false,
9
+ "special": true
10
+ },
11
+ "21": {
12
+ "content": "[PAD]",
13
+ "lstrip": false,
14
+ "normalized": false,
15
+ "rstrip": false,
16
+ "single_word": false,
17
+ "special": true
18
+ },
19
+ "22": {
20
+ "content": "[SEP]",
21
+ "lstrip": false,
22
+ "normalized": false,
23
+ "rstrip": false,
24
+ "single_word": false,
25
+ "special": true
26
+ },
27
+ "23": {
28
+ "content": "[MASK]",
29
+ "lstrip": false,
30
+ "normalized": false,
31
+ "rstrip": false,
32
+ "single_word": false,
33
+ "special": true
34
+ },
35
+ "24": {
36
+ "content": "[UNK]",
37
+ "lstrip": false,
38
+ "normalized": false,
39
+ "rstrip": false,
40
+ "single_word": false,
41
+ "special": true
42
+ }
43
+ },
44
+ "clean_up_tokenization_spaces": true,
45
+ "cls_token": "[CLS]",
46
+ "do_basic_tokenize": true,
47
+ "do_lower_case": false,
48
+ "mask_token": "[MASK]",
49
+ "model_max_length": 160,
50
+ "never_split": null,
51
+ "pad_token": "[PAD]",
52
+ "sep_token": "[SEP]",
53
+ "strip_accents": null,
54
+ "tokenize_chinese_chars": true,
55
+ "tokenizer_class": "BertTokenizer",
56
+ "unk_token": "[UNK]"
57
+ }
heavy_tokenizer/vocab.txt ADDED
@@ -0,0 +1,25 @@
1
+ [CLS]
2
+ M
3
+ R
4
+ H
5
+ K
6
+ D
7
+ E
8
+ S
9
+ T
10
+ N
11
+ Q
12
+ C
13
+ G
14
+ P
15
+ A
16
+ V
17
+ I
18
+ F
19
+ Y
20
+ W
21
+ L
22
+ [PAD]
23
+ [SEP]
24
+ [MASK]
25
+ [UNK]
light_tokenizer/special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
1
+ {
2
+ "cls_token": {
3
+ "content": "[CLS]",
4
+ "lstrip": false,
5
+ "normalized": false,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "mask_token": {
10
+ "content": "[MASK]",
11
+ "lstrip": false,
12
+ "normalized": false,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "pad_token": {
17
+ "content": "[PAD]",
18
+ "lstrip": false,
19
+ "normalized": false,
20
+ "rstrip": false,
21
+ "single_word": false
22
+ },
23
+ "sep_token": {
24
+ "content": "[SEP]",
25
+ "lstrip": false,
26
+ "normalized": false,
27
+ "rstrip": false,
28
+ "single_word": false
29
+ },
30
+ "unk_token": {
31
+ "content": "[UNK]",
32
+ "lstrip": false,
33
+ "normalized": false,
34
+ "rstrip": false,
35
+ "single_word": false
36
+ }
37
+ }
light_tokenizer/tokenizer.json ADDED
@@ -0,0 +1,175 @@
1
+ {
2
+ "version": "1.0",
3
+ "truncation": null,
4
+ "padding": null,
5
+ "added_tokens": [
6
+ {
7
+ "id": 0,
8
+ "content": "[CLS]",
9
+ "single_word": false,
10
+ "lstrip": false,
11
+ "rstrip": false,
12
+ "normalized": false,
13
+ "special": true
14
+ },
15
+ {
16
+ "id": 21,
17
+ "content": "[PAD]",
18
+ "single_word": false,
19
+ "lstrip": false,
20
+ "rstrip": false,
21
+ "normalized": false,
22
+ "special": true
23
+ },
24
+ {
25
+ "id": 22,
26
+ "content": "[SEP]",
27
+ "single_word": false,
28
+ "lstrip": false,
29
+ "rstrip": false,
30
+ "normalized": false,
31
+ "special": true
32
+ },
33
+ {
34
+ "id": 23,
35
+ "content": "[MASK]",
36
+ "single_word": false,
37
+ "lstrip": false,
38
+ "rstrip": false,
39
+ "normalized": false,
40
+ "special": true
41
+ },
42
+ {
43
+ "id": 24,
44
+ "content": "[UNK]",
45
+ "single_word": false,
46
+ "lstrip": false,
47
+ "rstrip": false,
48
+ "normalized": false,
49
+ "special": true
50
+ }
51
+ ],
52
+ "normalizer": {
53
+ "type": "BertNormalizer",
54
+ "clean_text": true,
55
+ "handle_chinese_chars": true,
56
+ "strip_accents": null,
57
+ "lowercase": false
58
+ },
59
+ "pre_tokenizer": {
60
+ "type": "BertPreTokenizer"
61
+ },
62
+ "post_processor": {
63
+ "type": "TemplateProcessing",
64
+ "single": [
65
+ {
66
+ "SpecialToken": {
67
+ "id": "[CLS]",
68
+ "type_id": 0
69
+ }
70
+ },
71
+ {
72
+ "Sequence": {
73
+ "id": "A",
74
+ "type_id": 0
75
+ }
76
+ },
77
+ {
78
+ "SpecialToken": {
79
+ "id": "[SEP]",
80
+ "type_id": 0
81
+ }
82
+ }
83
+ ],
84
+ "pair": [
85
+ {
86
+ "SpecialToken": {
87
+ "id": "[CLS]",
88
+ "type_id": 0
89
+ }
90
+ },
91
+ {
92
+ "Sequence": {
93
+ "id": "A",
94
+ "type_id": 0
95
+ }
96
+ },
97
+ {
98
+ "SpecialToken": {
99
+ "id": "[SEP]",
100
+ "type_id": 0
101
+ }
102
+ },
103
+ {
104
+ "Sequence": {
105
+ "id": "B",
106
+ "type_id": 1
107
+ }
108
+ },
109
+ {
110
+ "SpecialToken": {
111
+ "id": "[SEP]",
112
+ "type_id": 1
113
+ }
114
+ }
115
+ ],
116
+ "special_tokens": {
117
+ "[CLS]": {
118
+ "id": "[CLS]",
119
+ "ids": [
120
+ 0
121
+ ],
122
+ "tokens": [
123
+ "[CLS]"
124
+ ]
125
+ },
126
+ "[SEP]": {
127
+ "id": "[SEP]",
128
+ "ids": [
129
+ 22
130
+ ],
131
+ "tokens": [
132
+ "[SEP]"
133
+ ]
134
+ }
135
+ }
136
+ },
137
+ "decoder": {
138
+ "type": "WordPiece",
139
+ "prefix": "##",
140
+ "cleanup": true
141
+ },
142
+ "model": {
143
+ "type": "WordPiece",
144
+ "unk_token": "[UNK]",
145
+ "continuing_subword_prefix": "##",
146
+ "max_input_chars_per_word": 100,
147
+ "vocab": {
148
+ "[CLS]": 0,
149
+ "M": 1,
150
+ "R": 2,
151
+ "H": 3,
152
+ "K": 4,
153
+ "D": 5,
154
+ "E": 6,
155
+ "S": 7,
156
+ "T": 8,
157
+ "N": 9,
158
+ "Q": 10,
159
+ "C": 11,
160
+ "G": 12,
161
+ "P": 13,
162
+ "A": 14,
163
+ "V": 15,
164
+ "I": 16,
165
+ "F": 17,
166
+ "Y": 18,
167
+ "W": 19,
168
+ "L": 20,
169
+ "[PAD]": 21,
170
+ "[SEP]": 22,
171
+ "[MASK]": 23,
172
+ "[UNK]": 24
173
+ }
174
+ }
175
+ }
light_tokenizer/tokenizer_config.json ADDED
@@ -0,0 +1,57 @@
1
+ {
2
+ "added_tokens_decoder": {
3
+ "0": {
4
+ "content": "[CLS]",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false,
9
+ "special": true
10
+ },
11
+ "21": {
12
+ "content": "[PAD]",
13
+ "lstrip": false,
14
+ "normalized": false,
15
+ "rstrip": false,
16
+ "single_word": false,
17
+ "special": true
18
+ },
19
+ "22": {
20
+ "content": "[SEP]",
21
+ "lstrip": false,
22
+ "normalized": false,
23
+ "rstrip": false,
24
+ "single_word": false,
25
+ "special": true
26
+ },
27
+ "23": {
28
+ "content": "[MASK]",
29
+ "lstrip": false,
30
+ "normalized": false,
31
+ "rstrip": false,
32
+ "single_word": false,
33
+ "special": true
34
+ },
35
+ "24": {
36
+ "content": "[UNK]",
37
+ "lstrip": false,
38
+ "normalized": false,
39
+ "rstrip": false,
40
+ "single_word": false,
41
+ "special": true
42
+ }
43
+ },
44
+ "clean_up_tokenization_spaces": true,
45
+ "cls_token": "[CLS]",
46
+ "do_basic_tokenize": true,
47
+ "do_lower_case": false,
48
+ "mask_token": "[MASK]",
49
+ "model_max_length": 160,
50
+ "never_split": null,
51
+ "pad_token": "[PAD]",
52
+ "sep_token": "[SEP]",
53
+ "strip_accents": null,
54
+ "tokenize_chinese_chars": true,
55
+ "tokenizer_class": "BertTokenizer",
56
+ "unk_token": "[UNK]"
57
+ }
light_tokenizer/vocab.txt ADDED
@@ -0,0 +1,25 @@
1
+ [CLS]
2
+ M
3
+ R
4
+ H
5
+ K
6
+ D
7
+ E
8
+ S
9
+ T
10
+ N
11
+ Q
12
+ C
13
+ G
14
+ P
15
+ A
16
+ V
17
+ I
18
+ F
19
+ Y
20
+ W
21
+ L
22
+ [PAD]
23
+ [SEP]
24
+ [MASK]
25
+ [UNK]