mao jiashun committed on
Commit 295ff14
1 Parent(s): 1286756

Upload 58 files

This view is limited to 50 files because it contains too many changes. See raw diff
Files changed (50)
  1. .gitattributes +3 -0
  2. iupac-gpt/.gitignore +5 -0
  3. iupac-gpt/LICENSE +32 -0
  4. iupac-gpt/README.md +62 -0
  5. iupac-gpt/checkpoints/iupac/config.json +33 -0
  6. iupac-gpt/checkpoints/iupac/pytorch_model.bin +3 -0
  7. iupac-gpt/class.txt +3 -0
  8. iupac-gpt/data/bbbp.csv +0 -0
  9. iupac-gpt/data/iupacs_logp.csv +0 -0
  10. iupac-gpt/environment.yml +19 -0
  11. iupac-gpt/iupac.txt +3 -0
  12. iupac-gpt/iupacGPT2-gen50K.csv +0 -0
  13. iupac-gpt/iupac_gpt/__init__.py +21 -0
  14. iupac-gpt/iupac_gpt/__pycache__/__init__.cpython-37.pyc +0 -0
  15. iupac-gpt/iupac_gpt/__pycache__/__init__.cpython-38.pyc +0 -0
  16. iupac-gpt/iupac_gpt/__pycache__/classification.cpython-37.pyc +0 -0
  17. iupac-gpt/iupac_gpt/__pycache__/classification.cpython-38.pyc +0 -0
  18. iupac-gpt/iupac_gpt/__pycache__/data.cpython-38.pyc +0 -0
  19. iupac-gpt/iupac_gpt/__pycache__/iupac_dataset.cpython-38.pyc +0 -0
  20. iupac-gpt/iupac_gpt/__pycache__/iupac_dataset_class.cpython-38.pyc +0 -0
  21. iupac-gpt/iupac_gpt/__pycache__/iupac_dataset_pro.cpython-38.pyc +0 -0
  22. iupac-gpt/iupac_gpt/__pycache__/iupac_tokenization.cpython-38.pyc +0 -0
  23. iupac-gpt/iupac_gpt/__pycache__/iupac_tokenization_class.cpython-38.pyc +0 -0
  24. iupac-gpt/iupac_gpt/__pycache__/iupac_tokenization_iupac.cpython-38.pyc +0 -0
  25. iupac-gpt/iupac_gpt/__pycache__/iupac_tokenization_pro.cpython-38.pyc +0 -0
  26. iupac-gpt/iupac_gpt/__pycache__/language_modeling.cpython-38.pyc +0 -0
  27. iupac-gpt/iupac_gpt/__pycache__/tokenization.cpython-38.pyc +0 -0
  28. iupac-gpt/iupac_gpt/classification.py +362 -0
  29. iupac-gpt/iupac_gpt/data.py +269 -0
  30. iupac-gpt/iupac_gpt/iupac_dataset.py +121 -0
  31. iupac-gpt/iupac_gpt/iupac_dataset_class.py +128 -0
  32. iupac-gpt/iupac_gpt/iupac_dataset_pro.py +124 -0
  33. iupac-gpt/iupac_gpt/iupac_spm.model +3 -0
  34. iupac-gpt/iupac_gpt/iupac_spm.vocab +1391 -0
  35. iupac-gpt/iupac_gpt/iupac_tokenization.py +131 -0
  36. iupac-gpt/iupac_gpt/iupac_tokenization_class.py +131 -0
  37. iupac-gpt/iupac_gpt/iupac_tokenization_iupac.py +134 -0
  38. iupac-gpt/iupac_gpt/iupac_tokenization_pro.py +131 -0
  39. iupac-gpt/iupac_gpt/iupacs_logp.csv +0 -0
  40. iupac-gpt/iupac_gpt/language_modeling.py +68 -0
  41. iupac-gpt/iupac_gpt/pubchem_iupac_smile_gpt.csv +3 -0
  42. iupac-gpt/iupac_gpt/real_iupac_tokenizer.pt +3 -0
  43. iupac-gpt/iupac_gpt/tokenization.py +193 -0
  44. iupac-gpt/nohup.out +0 -0
  45. iupac-gpt/notebooks/.ipynb_checkpoints/language-modeling-checkpoint.ipynb +0 -0
  46. iupac-gpt/notebooks/iupac_head_view.html +0 -0
  47. iupac-gpt/notebooks/iupac_language-modeling.py +236 -0
  48. iupac-gpt/notebooks/iupac_language-modeling_retrain.py +224 -0
  49. iupac-gpt/notebooks/iupac_language-modeling_train.ipynb +0 -0
  50. iupac-gpt/notebooks/iupac_language-modeling_train.py +231 -0
.gitattributes CHANGED
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ iupac-gpt/class.txt filter=lfs diff=lfs merge=lfs -text
+ iupac-gpt/iupac_gpt/pubchem_iupac_smile_gpt.csv filter=lfs diff=lfs merge=lfs -text
+ iupac-gpt/iupac.txt filter=lfs diff=lfs merge=lfs -text
iupac-gpt/.gitignore ADDED
@@ -0,0 +1,5 @@
+ **/__pycache__/*
+ **/.idea/*
+ **/.ipynb_checkpoints/*
+ **/lightning_logs/*
+ *.log
iupac-gpt/LICENSE ADDED
@@ -0,0 +1,32 @@
+ The Clear BSD License
+
+ Copyright (c) 2021 Sanjar Adilov
+ All rights reserved.
+
+ Redistribution and use in source and binary forms, with or without
+ modification, are permitted (subject to the limitations in the disclaimer
+ below) provided that the following conditions are met:
+
+ * Redistributions of source code must retain the above copyright notice,
+ this list of conditions and the following disclaimer.
+
+ * Redistributions in binary form must reproduce the above copyright
+ notice, this list of conditions and the following disclaimer in the
+ documentation and/or other materials provided with the distribution.
+
+ * Neither the name of the copyright holder nor the names of its
+ contributors may be used to endorse or promote products derived from this
+ software without specific prior written permission.
+
+ NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE GRANTED BY
+ THIS LICENSE. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND
+ CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
+ PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR
+ CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+ EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+ PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
+ BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER
+ IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+ POSSIBILITY OF SUCH DAMAGE.
iupac-gpt/README.md ADDED
@@ -0,0 +1,62 @@
+ # Generative Pre-Training from Molecules
+
+ Autoregressive transformer language model for drug discovery. (Pre)trained on a large
+ SMILES corpus. Evaluated on molecular property prediction and low-data de novo design
+ tasks.
+
+
+ ## Installation
+
+ Set up [conda](https://conda.io/en/latest/index.html) and create a new environment from
+ `environment.yml` (if needed, make corresponding edits for GPU-compatibility).
+ ```shell
+ conda env create -f environment.yml
+ conda activate smiles-gpt
+ git clone https://github.com/sanjaradylov/smiles-gpt.git
+ cd smiles-gpt
+ ```
+
+
+ ## Benchmark
+
+ ### Checkpoint
+ [checkpoints/benchmark-5m](https://github.com/sanjaradylov/smiles-gpt/tree/master/checkpoints/benchmark-5m)
+ stores the serialized model, tokenizer, and configuration. Do not modify them. Use
+ the `from_pretrained` method to load HuggingFace objects, e.g.,
+ ```python
+ from transformers import GPT2Config, GPT2LMHeadModel, PreTrainedTokenizerFast
+
+ checkpoint = "checkpoints/benchmark-5m"
+
+ config = GPT2Config.from_pretrained(checkpoint)
+ model = GPT2LMHeadModel.from_pretrained(checkpoint)
+ tokenizer = PreTrainedTokenizerFast.from_pretrained(checkpoint)
+ ```
+
+ ### Data
+ [data](https://github.com/sanjaradylov/smiles-gpt/tree/master/data) stores the
+ [Blood-Brain Barrier Penetration](https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/BBBP.csv)
+ classification dataset and a 10K subset of ChemBERTa's
+ [PubChem-10M](https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/pubchem_10m.txt.zip).
+ See [Examples](#Examples).
+
+ ### Output
+
+ [output](https://github.com/sanjaradylov/smiles-gpt/tree/master/output) stores generated
+ SMILES strings.
+
+ ## Examples
+
+ Adapter training for molecular property prediction
+ (replace the `data/bbbp.csv` and `p_np` arguments with your dataset and task name(s),
+ respectively):
+ ```shell
+ python3 scripts/classification.py checkpoints/benchmark-5m data/bbbp.csv p_np
+ ```
+ For language model pretraining, see
+ [notebooks](https://github.com/sanjaradylov/smiles-gpt/tree/master/notebooks).
+
+ ## Citation
+
+ If you use `smiles-gpt` in your research, please consider citing
+ > https://doi.org/10.33774/chemrxiv-2021-5fwjd
iupac-gpt/checkpoints/iupac/config.json ADDED
@@ -0,0 +1,33 @@
+ {
+   "activation_function": "gelu_new",
+   "adapters": {
+     "adapters": {},
+     "config_map": {}
+   },
+   "architectures": [
+     "GPT2LMHeadModel"
+   ],
+   "attn_pdrop": 0.1,
+   "bos_token_id": 2,
+   "embd_pdrop": 0.1,
+   "eos_token_id": 1,
+   "gradient_checkpointing": false,
+   "initializer_range": 0.02,
+   "layer_norm_epsilon": 1e-05,
+   "model_type": "gpt2",
+   "n_ctx": 1280,
+   "n_embd": 256,
+   "n_head": 8,
+   "n_inner": null,
+   "n_layer": 8,
+   "n_positions": 1280,
+   "resid_pdrop": 0.1,
+   "summary_activation": null,
+   "summary_first_dropout": 0.1,
+   "summary_proj_to_labels": true,
+   "summary_type": "cls_index",
+   "summary_use_proj": true,
+   "transformers_version": "2.0.1",
+   "use_cache": true,
+   "vocab_size": 1491
+ }
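The configuration above describes a small GPT-2: 8 layers, 8 attention heads, 256-dimensional embeddings, a 1280-token context, and a 1491-token vocabulary, serialized by adapter-transformers 2.0.1. A minimal loading sketch with HuggingFace `transformers` (the relative path and the plain `GPT2LMHeadModel` route are assumptions; extra keys such as `adapters` are simply carried along on the config object):

```python
# Minimal sketch -- the checkpoint path is an assumption based on the files in this commit.
from transformers import GPT2Config, GPT2LMHeadModel

checkpoint = "iupac-gpt/checkpoints/iupac"  # directory with config.json + pytorch_model.bin (LFS)

config = GPT2Config.from_pretrained(checkpoint)      # n_layer=8, n_embd=256, vocab_size=1491
model = GPT2LMHeadModel.from_pretrained(checkpoint)  # weights from the ~41 MB pytorch_model.bin
print(config.n_positions, model.num_parameters())
```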
iupac-gpt/checkpoints/iupac/pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0aea87ea0c2a89b9a15bfd4682615df1d64f37c25748684d117ecc933153950f
+ size 41264861
iupac-gpt/class.txt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:83bbfd23faf47bacfdec9db23eed87b0d20488da7d2c84d838d0d77e7f2c58d5
+ size 12317646
iupac-gpt/data/bbbp.csv ADDED
The diff for this file is too large to render. See raw diff
 
iupac-gpt/data/iupacs_logp.csv ADDED
The diff for this file is too large to render. See raw diff
 
iupac-gpt/environment.yml ADDED
@@ -0,0 +1,19 @@
+ name: smiles-gpt
+ channels:
+   - pytorch
+   - anaconda
+   - conda-forge
+ dependencies:
+   - python=3.8
+   - pip
+   - pandas
+   - rdkit
+   - pytorch
+   - torchvision
+   - torchaudio
+   - cpuonly
+   - pip:
+     - tokenizers
+     - adapter-transformers
+     - pytorch-lightning
+     - bertviz
iupac-gpt/iupac.txt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9bead0044a324634255bf5675f623fbd3a0b6babb51da7ca63870b6bf87f800a
+ size 156486208
iupac-gpt/iupacGPT2-gen50K.csv ADDED
The diff for this file is too large to render. See raw diff
 
iupac-gpt/iupac_gpt/__init__.py ADDED
@@ -0,0 +1,21 @@
+ """`smiles_gpt` implements transformer models for molecule generation and molecular-
+ property prediction.
+ """
+
+ __author__ = "Sanjar Ad[iy]lov"
+ __version__ = "1.0.0-pub"
+
+ from . import classification, data, language_modeling, tokenization
+ from .classification import (ClassifierLitModel, RegressorLitModel,
+                              GPT2ForSequenceClassification)
+ from .data import CSVDataModule, CVSplitter, LMDataModule
+ from .language_modeling import GPT2LitModel
+ from .tokenization import SMILESBPETokenizer, SMILESAlphabet
+ from .iupac_tokenization_iupac import get_data_loader, prepare_input
+ from .iupac_tokenization_pro import get_data_loader_pro, prepare_input_pro
+ from .iupac_tokenization_class import get_data_loader_class, prepare_input_class
+
+ __all__ = ("classification", "data", "tokenization",
+            "ClassifierLitModel", "CSVDataModule", "CVSplitter",
+            "GPT2ForSequenceClassification", "GPT2LitModel", "LMDataModule",
+            "RegressorLitModel", "SMILESBPETokenizer", "SMILESAlphabet")
iupac-gpt/iupac_gpt/__pycache__/__init__.cpython-37.pyc ADDED
Binary file (1.07 kB).

iupac-gpt/iupac_gpt/__pycache__/__init__.cpython-38.pyc ADDED
Binary file (1.08 kB).

iupac-gpt/iupac_gpt/__pycache__/classification.cpython-37.pyc ADDED
Binary file (13.3 kB).

iupac-gpt/iupac_gpt/__pycache__/classification.cpython-38.pyc ADDED
Binary file (13.3 kB).

iupac-gpt/iupac_gpt/__pycache__/data.cpython-38.pyc ADDED
Binary file (11.1 kB).

iupac-gpt/iupac_gpt/__pycache__/iupac_dataset.cpython-38.pyc ADDED
Binary file (3.09 kB).

iupac-gpt/iupac_gpt/__pycache__/iupac_dataset_class.cpython-38.pyc ADDED
Binary file (3.22 kB).

iupac-gpt/iupac_gpt/__pycache__/iupac_dataset_pro.cpython-38.pyc ADDED
Binary file (3.2 kB).

iupac-gpt/iupac_gpt/__pycache__/iupac_tokenization.cpython-38.pyc ADDED
Binary file (5.1 kB).

iupac-gpt/iupac_gpt/__pycache__/iupac_tokenization_class.cpython-38.pyc ADDED
Binary file (5.08 kB).

iupac-gpt/iupac_gpt/__pycache__/iupac_tokenization_iupac.cpython-38.pyc ADDED
Binary file (5.09 kB).

iupac-gpt/iupac_gpt/__pycache__/iupac_tokenization_pro.cpython-38.pyc ADDED
Binary file (5.11 kB).

iupac-gpt/iupac_gpt/__pycache__/language_modeling.cpython-38.pyc ADDED
Binary file (3.37 kB).

iupac-gpt/iupac_gpt/__pycache__/tokenization.cpython-38.pyc ADDED
Binary file (7.44 kB).
iupac-gpt/iupac_gpt/classification.py ADDED
@@ -0,0 +1,362 @@
1
+ """HuggingFace-compatible classification and regression models including
2
+ pytorch-lightning models.
3
+ """
4
+
5
+ __all__ = ("BypassNet", "ClassificationHead", "ClassifierLitModel",
6
+ "GPT2ForSequenceClassification", "RegressorLitModel",
7
+ "SequenceClassifierOutput")
8
+
9
+ from dataclasses import dataclass
10
+ from typing import List, Optional
11
+
12
+ import pytorch_lightning as pl
13
+ import torch
14
+ import torch.nn as nn
15
+ import torch.nn.functional as F
16
+ from torchmetrics import AUROC, AveragePrecision
17
+ from transformers import AdamW, GPT2Model, GPT2PreTrainedModel
18
+ from transformers.modeling_outputs import SequenceClassifierOutputWithPast
19
+ from transformers.adapters.model_mixin import ModelWithHeadsAdaptersMixin
20
+
21
+
22
+ @dataclass
23
+ class SequenceClassifierOutput(SequenceClassifierOutputWithPast):
24
+ target: Optional[torch.LongTensor] = None
25
+
26
+
27
+ class GPT2ForSequenceClassification(ModelWithHeadsAdaptersMixin, GPT2PreTrainedModel):
28
+ """HuggingFace-compatible single- and multi-output (-task) classification model.
29
+ `config` must be a `GPT2Config` instance with additional `num_tasks` and `num_labels`
30
+ properties. For multi-task classification, the output is Bypass network with the
31
+ reduction factor = `config.n_embd // config.n_head`.
32
+ """
33
+
34
+ _keys_to_ignore_on_load_missing = [
35
+ r"h\.\d+\.attn\.masked_bias", r"lm_head\.weight", r"output\..*"]
36
+
37
+ def __init__(self, config):
38
+ super().__init__(config)
39
+
40
+ self.num_tasks = config.num_tasks
41
+ self.num_labels = config.num_labels
42
+
43
+ self.transformer = GPT2Model(config)
44
+
45
+ if self.num_tasks > 1:
46
+ self.output = BypassNet(
47
+ config.n_embd, config.n_embd // config.n_head,
48
+ config.num_tasks, config.num_labels,
49
+ config.embd_pdrop)
50
+ else:
51
+ self.output = ClassificationHead(
52
+ config.n_embd, config.n_embd // config.n_head,
53
+ config.num_labels, config.embd_pdrop)
54
+
55
+ self.init_weights()
56
+
57
+ def forward(self, input_ids=None, past_key_values=None, attention_mask=None,
58
+ token_type_ids=None, position_ids=None, head_mask=None,
59
+ inputs_embeds=None, labels=None, use_cache=None, output_attentions=None,
60
+ output_hidden_states=None, return_dict=None, adapter_names=None,
61
+ label_mask=None):
62
+ return_dict = return_dict or self.config.use_return_dict
63
+
64
+ transformer_outputs = self.transformer(
65
+ input_ids, past_key_values=past_key_values, attention_mask=attention_mask,
66
+ token_type_ids=token_type_ids, position_ids=position_ids,
67
+ head_mask=head_mask, inputs_embeds=inputs_embeds, use_cache=use_cache,
68
+ output_attentions=output_attentions,
69
+ output_hidden_states=output_hidden_states, return_dict=return_dict,
70
+ adapter_names=adapter_names)
71
+
72
+ hidden_states = transformer_outputs[0]
73
+
74
+ if input_ids is not None:
75
+ batch_size, sequence_length = input_ids.shape[:2]
76
+ else:
77
+ batch_size, sequence_length = inputs_embeds.shape[:2]
78
+
79
+ assert self.config.pad_token_id is not None or batch_size == 1, \
80
+ "Cannot handle batch sizes > 1 if no padding token is defined."
81
+ if self.config.pad_token_id is None:
82
+ sequence_lengths = -1
83
+ else:
84
+ if input_ids is not None:
85
+ sequence_lengths = torch.ne(
86
+ input_ids, self.config.pad_token_id).sum(-1) - 1
87
+ else:
88
+ sequence_lengths = -1
89
+
90
+ if self.num_tasks == 1:
91
+ logits = self.output(hidden_states)[range(batch_size), sequence_lengths]
92
+ else:
93
+ logits = self.output(hidden_states, batch_size, sequence_lengths)
94
+
95
+ loss = None
96
+ if labels is not None:
97
+ if self.num_labels == 2:
98
+ if label_mask is not None:
99
+ nonempty_tasks = (label_mask == 1).view(-1)
100
+ nonempty_logits = logits.view(-1, self.num_labels)[nonempty_tasks, :]
101
+ nonempty_labels = labels.view(-1)[nonempty_tasks]
102
+ else:
103
+ nonempty_logits = logits.view(-1, self.num_labels)
104
+ nonempty_labels = labels.view(-1)
105
+
106
+ if len(labels.size()) == 1:
107
+ labels = labels.reshape(1, -1)
108
+
109
+ loss = F.cross_entropy(nonempty_logits, nonempty_labels)
110
+ elif self.num_labels == 1:
111
+ loss = F.mse_loss(logits.view(-1), labels.view(-1))
112
+ else:
113
+ raise NotImplementedError(
114
+ "Only binary classification and regression supported.")
115
+
116
+ if self.num_tasks > 1:
117
+ logits = logits.transpose(1, 2)
118
+
119
+ if labels is not None and self.num_labels == 2 and self.num_tasks == 1:
120
+ if label_mask is not None:
121
+ labels = labels.view(-1)
122
+ else:
123
+ labels = nonempty_labels
124
+
125
+ if not return_dict:
126
+ output = (logits,) + transformer_outputs[1:]
127
+ return ((loss,) + output) if loss is not None else output
128
+
129
+ return SequenceClassifierOutput(
130
+ loss=loss, logits=logits, target=labels,
131
+ past_key_values=transformer_outputs.past_key_values,
132
+ hidden_states=transformer_outputs.hidden_states,
133
+ attentions=transformer_outputs.attentions)
134
+
135
+
136
+ class BypassNet(nn.Module):
137
+ """Bypass multi-task network from MoleculeNet project [Wu et al., 2018].
138
+ """
139
+
140
+ def __init__(self, hidden_size: int, intermediate_size: int,
141
+ num_tasks: int, num_labels: int = 2,
142
+ dropout: float = 0.2, use_bias: bool = False):
143
+ super().__init__()
144
+ self.independent = nn.ModuleList([
145
+ ClassificationHead(hidden_size, intermediate_size,
146
+ num_labels, dropout, use_bias)
147
+ for _ in range(num_tasks)])
148
+ self.shared = ClassificationHead(hidden_size, intermediate_size,
149
+ num_labels, dropout, use_bias)
150
+
151
+ def forward(self, hidden_states, batch_size, sequence_lengths):
152
+ logits_list: List[torch.Tensor] = []
153
+ for layer in self.independent:
154
+ logits_list.append(layer(hidden_states))
155
+ shared_logits: torch.Tensor = self.shared(hidden_states)
156
+ for i in range(len(logits_list)):
157
+ logits_list[i] = (logits_list[i] + shared_logits)[range(batch_size),
158
+ sequence_lengths]
159
+ return torch.stack(logits_list, dim=1)
160
+
161
+
162
+ class ClassificationHead(nn.Module):
163
+ """Two-layer feed-forward network with GELU activation and intermediate dropout.
164
+ """
165
+
166
+ def __init__(self, hidden_size: int, intermediate_size: int,
167
+ num_labels: int, dropout: float = 0.0, use_bias: bool = False):
168
+ super().__init__()
169
+ self.dense = nn.Linear(hidden_size, intermediate_size, bias=use_bias)
170
+ self.act = nn.GELU()
171
+ self.dropout = nn.Dropout(dropout)
172
+ self.out_proj = nn.Linear(intermediate_size, num_labels, bias=use_bias)
173
+
174
+ def forward(self, x, *args, **kwargs):
175
+ x = self.dense(x)
176
+ x = self.act(x)
177
+ x = self.dropout(x)
178
+ return self.out_proj(x)
179
+
180
+
181
+ class ClassifierLitModel(pl.LightningModule):
182
+ """Pytorch-lightning module for single- or multi-task classification. Trains GPT2
183
+ model using `AdamW` optimizer with exponential LR scheduler. Evaluates valid and
184
+ test data on AUC-ROC and AUC-PRC.
185
+
186
+ Args:
187
+ transformer (`GPT2Model`): (Pretrained) HuggingFace GPT2 model.
188
+ num_tasks (int): The number of classification tasks.
189
+ has_empty_labels (bool)
190
+ batch_size (int)
191
+ learning_rate (float)
192
+ scheduler_lambda (float)
193
+ scheduler_step (int)
194
+ weight_decay (float)
195
+ """
196
+
197
+ def __init__(self, transformer: GPT2Model, num_tasks: int, has_empty_labels: bool,
198
+ batch_size: int, learning_rate: float, scheduler_lambda: float,
199
+ scheduler_step: int, weight_decay: float, *args, **kwargs):
200
+ super().__init__()
201
+
202
+ self.save_hyperparameters(ignore=("transformer", "num_tasks", "has_empty_labels"))
203
+ self.transformer = transformer
204
+ self.num_tasks = num_tasks
205
+
206
+ def get_metrics(metric_cls):
207
+ return [metric_cls(num_classes=2) for _ in range(num_tasks)]
208
+
209
+ if has_empty_labels:
210
+ self.train_roc = get_metrics(AUROC)
211
+ self.val_roc = get_metrics(AUROC)
212
+ self.test_roc = get_metrics(AUROC)
213
+
214
+ self.train_prc = get_metrics(AveragePrecision)
215
+ self.val_prc = get_metrics(AveragePrecision)
216
+ self.test_prc = get_metrics(AveragePrecision)
217
+
218
+ self.step = self._step_empty
219
+ self.epoch_end = self._epoch_end_empty
220
+ else:
221
+ #self.train_roc = AUROC(num_classes=2)
222
+ #self.val_roc = AUROC(num_classes=2)
223
+ #self.test_roc = AUROC(num_classes=2)
224
+
225
+ #self.train_prc = AveragePrecision(num_classes=2)
226
+ #self.val_prc = AveragePrecision(num_classes=2)
227
+ #self.test_prc = AveragePrecision(num_classes=2)
228
+
229
+ self.train_roc = AUROC(task='multiclass',num_classes=2)
230
+ self.val_roc = AUROC(task='multiclass',num_classes=2)
231
+ self.test_roc = AUROC(task='multiclass',num_classes=2)
232
+
233
+ self.train_prc = AveragePrecision(task='multiclass',num_classes=2)
234
+ self.val_prc = AveragePrecision(task='multiclass',num_classes=2)
235
+ self.test_prc = AveragePrecision(task='multiclass',num_classes=2)
236
+
237
+ self.step = self._step_nonempty
238
+ self.epoch_end = self._epoch_end_nonempty
239
+
240
+ def forward(self, *args, **kwargs):
241
+ return self.transformer(*args, **kwargs)
242
+
243
+ def _step_empty(self, batch, batch_idx, roc, prc):
244
+ outputs = self(**batch)
245
+
246
+ if self.num_tasks == 1:
247
+ outputs["target"] = outputs["target"][:, None]
248
+ outputs["logits"] = outputs["logits"][:, :, None]
249
+
250
+ for task_id in range(self.num_tasks):
251
+ target = outputs["target"][:, task_id]
252
+ nonempty_entries = target != -1
253
+ target = target[nonempty_entries]
254
+
255
+ if target.unique().size(0) > 1:
256
+ logits = outputs["logits"][:, :, task_id][nonempty_entries]
257
+
258
+ roc[task_id](logits, target)
259
+ prc[task_id](logits, target)
260
+
261
+ return {"loss": outputs["loss"]}
262
+
263
+ def _step_nonempty(self, batch, batch_idx, roc, prc):
264
+ outputs = self(**batch)
265
+
266
+ logits, target = outputs["logits"], outputs["target"]
267
+ if target.unique().size(0) > 1:
268
+ roc(logits, target)
269
+ prc(logits, target)
270
+
271
+ return {"loss": outputs["loss"]}
272
+
273
+ def _epoch_end_empty(self, outputs_ignored, roc, prc, prefix):
274
+ mean_roc = sum(a.compute() for a in roc) / self.num_tasks
275
+ self.log(f"{prefix}_roc", mean_roc, on_step=False, on_epoch=True, prog_bar=True)
276
+ mean_prc = sum(p.compute() for p in prc) / self.num_tasks #p.compute()[1]
277
+ self.log(f"{prefix}_prc", mean_prc, on_step=False, on_epoch=True, prog_bar=True)
278
+
279
+ def _epoch_end_nonempty(self, outputs, roc, prc, prefix):
280
+ self.log(f"{prefix}_roc", roc.compute(),
281
+ on_step=False, on_epoch=True, prog_bar=True)
282
+ self.log(f"{prefix}_prc", prc.compute(), #prc.compute()[1]
283
+ on_step=False, on_epoch=True, prog_bar=True)
284
+
285
+ def training_step(self, batch, batch_idx):
286
+ return self.step(batch, batch_idx, self.train_roc, self.train_prc)
287
+
288
+ def training_epoch_end(self, outputs):
289
+ self.epoch_end(outputs, self.train_roc, self.train_prc, "train")
290
+
291
+ def validation_step(self, batch, batch_idx):
292
+ return self.step(batch, batch_idx, self.val_roc, self.val_prc)
293
+
294
+ def validation_epoch_end(self, outputs):
295
+ self.epoch_end(outputs, self.val_roc, self.val_prc, "val")
296
+
297
+ def test_step(self, batch, batch_idx):
298
+ self.step(batch, batch_idx, self.test_roc, self.test_prc)
299
+
300
+ def test_epoch_end(self, outputs):
301
+ self.epoch_end(outputs, self.test_roc, self.test_prc, "test")
302
+
303
+ def configure_optimizers(self):
304
+ optimizer = AdamW(self.parameters(), lr=self.hparams.learning_rate,
305
+ weight_decay=self.hparams.weight_decay)
306
+ lr_scheduler = torch.optim.lr_scheduler.ExponentialLR(
307
+ optimizer, self.hparams.scheduler_lambda)
308
+ return {"optimizer": optimizer,
309
+ "lr_scheduler": {"scheduler": lr_scheduler,
310
+ "interval": "step",
311
+ "frequency": self.hparams.scheduler_step}}
312
+
313
+
314
+ class RegressorLitModel(pl.LightningModule):
315
+ def __init__(self, transformer: GPT2Model,
316
+ batch_size: int, learning_rate: float, scheduler_lambda: float,
317
+ scheduler_step: int, weight_decay: float, *args, **kwargs):
318
+ super().__init__()
319
+
320
+ self.save_hyperparameters(ignore="transformer")
321
+ self.transformer = transformer
322
+
323
+ def forward(self, *args, **kwargs):
324
+ return self.transformer(*args, **kwargs)
326
+
327
+ def step(self, batch, batch_idx):
328
+ outputs = self(**batch)
329
+ rmse_loss = torch.sqrt(outputs["loss"])
330
+ return {"loss": rmse_loss}
331
+
332
+ def epoch_end(self, outputs, prefix):
333
+ mean_rmse = torch.mean(torch.tensor([out["loss"] for out in outputs]))
334
+ self.log(f"{prefix}_rmse", mean_rmse, on_step=False, on_epoch=True, prog_bar=True)
335
+
336
+ def training_step(self, batch, batch_idx):
337
+ return self.step(batch, batch_idx)
338
+
339
+ def training_epoch_end(self, outputs):
340
+ self.epoch_end(outputs, "train")
341
+
342
+ def validation_step(self, batch, batch_idx):
343
+ return self.step(batch, batch_idx)
344
+
345
+ def validation_epoch_end(self, outputs):
346
+ self.epoch_end(outputs, "val")
347
+
348
+ def test_step(self, batch, batch_idx):
349
+ return self.step(batch, batch_idx)
350
+
351
+ def test_epoch_end(self, outputs):
352
+ self.epoch_end(outputs, "test")
353
+
354
+ def configure_optimizers(self):
355
+ optimizer = AdamW(self.parameters(), lr=self.hparams.learning_rate,
356
+ weight_decay=self.hparams.weight_decay)
357
+ lr_scheduler = torch.optim.lr_scheduler.ExponentialLR(
358
+ optimizer, self.hparams.scheduler_lambda)
359
+ return {"optimizer": optimizer,
360
+ "lr_scheduler": {"scheduler": lr_scheduler,
361
+ "interval": "step",
362
+ "frequency": self.hparams.scheduler_step}}
iupac-gpt/iupac_gpt/data.py ADDED
@@ -0,0 +1,269 @@
1
+ """Loads torch-compatible data sets and lightning-compatible data modules.
2
+ """
3
+
4
+ __all__ = ("CSVDataset", "CSVDataModule", "CVSplitter", "LMDataset", "LMDataModule")
5
+
6
+ from collections import defaultdict
7
+ from dataclasses import dataclass
8
+ from functools import partial
9
+ from pathlib import Path
10
+ from typing import Any, Callable, Dict, List, Literal, Optional, Sequence, Tuple, Union
11
+
12
+ import torch
13
+ from pytorch_lightning import LightningDataModule
14
+ from sklearn.model_selection import ShuffleSplit
15
+ from tokenizers.implementations import BaseTokenizer
16
+ from transformers import PreTrainedTokenizerFast
17
+ from transformers import DataCollatorForLanguageModeling, DataCollatorWithPadding
18
+ from torch.utils.data import Dataset, DataLoader
19
+
20
+
21
+ @dataclass(init=True, repr=True, eq=False, frozen=False)
22
+ class CSVDataset(Dataset):
23
+ """Stores `pandas.DataFrame` instance of tabular data and retrieves encoded token
24
+ ids and attention mask. Optionally returns labels and their masks.
25
+
26
+ Args:
27
+ dataframe (`pandas.DataFrame`):
28
+ Data frame of SMILES strings and their (multi-task) labels.
29
+ tokenizer (`tokenizers.BaseTokenizer` or `SMILESBPETokenizer`)
30
+ SMILES tokenizer.
31
+ smiles_column (`str`, defaults to "smiles"):
32
+ Column name of SMILES strings in `dataframe`.
33
+ target_column (`str` or `list` of `str`, defaults to `None`):
34
+ Target column(s). If `None`, labels are ignored.
35
+ has_empty_target (`bool`, defaults to `False`):
36
+ Whether entries have empty target values. If `True`, additionally retrieves
37
+ a target mask.
38
+ task_type ("classification" or "regression", defaults to "classification")
39
+ encode_kwargs (dict, defaults to {"truncation": True})
40
+ Positional arguments for `tokenizer` encoding, e.g. {"padding": True}.
41
+ """
42
+
43
+ dataframe: "pandas.DataFrame"
44
+ tokenizer: BaseTokenizer
45
+ smiles_column: str = 'smiles'
46
+ target_column: Union[None, str, List[str]] = None
47
+ has_empty_target: bool = False
48
+ task_type: Literal["classification", "regression"] = "classification"
49
+ encode_kwargs: Optional[Dict[str, Any]] = None
50
+
51
+ def __post_init__(self) -> None:
52
+ if isinstance(self.tokenizer, PreTrainedTokenizerFast):
53
+ self._encode = partial(self.tokenizer.__call__, add_special_tokens=False)
54
+ self._id_key = "input_ids"
55
+ else:
56
+ self._encode = self.tokenizer.encode
57
+ self._id_key = "ids"
58
+ self.encode_kwargs = self.encode_kwargs or {"truncation": True}
59
+ self._encode = partial(self._encode, **self.encode_kwargs)
60
+
61
+ def __getitem__(self, index: int) -> Dict[str, torch.Tensor]:
62
+ """Returns dict of encoded token IDs, attention mask, and optionally labels
63
+ and label mask.
64
+ """
65
+ item: Dict[str, torch.Tensor] = {}
66
+
67
+ smiles = self.dataframe.iloc[index][self.smiles_column]
68
+ encodings = self._encode(smiles)
69
+ item["input_ids"] = torch.LongTensor(getattr(encodings, self._id_key))
70
+ item["attention_mask"] = torch.LongTensor(getattr(encodings, "attention_mask"))
71
+
72
+ if self.target_column is not None:
73
+ labels = self.dataframe.iloc[index][self.target_column]
74
+ if self.has_empty_target:
75
+ label_mask = ~labels.isna()
76
+ labels = labels.fillna(-1)
77
+ item["label_mask"] = torch.BoolTensor(label_mask)
78
+ if self.task_type == "regression":
79
+ tensor_type = torch.FloatTensor
80
+ elif self.task_type == "classification":
81
+ tensor_type = torch.LongTensor
82
+ else:
83
+ raise NotImplementedError("`CSVDataset` supports only classification and "
84
+ "regression tasks")
85
+ item["labels"] = tensor_type(labels)
86
+
87
+ return item
88
+
89
+ def __len__(self) -> int:
90
+ return self.dataframe.shape[0]
91
+
92
+
93
+ @dataclass(init=True, eq=True, repr=True, frozen=False)
94
+ class CVSplitter:
95
+ """Splits series of SMILES data with either random or scaffold splitting.
96
+ """
97
+
98
+ mode: str = "random"
99
+ train_size: float = 0.8
100
+ val_size: float = 0.1
101
+ test_size: float = 0.1
102
+
103
+ def __post_init__(self) -> None:
104
+ if self.mode == "scaffold":
105
+ self.train_val_test_split = self.scaffold_split
106
+ elif self.mode == "random":
107
+ self.train_val_test_split = self.random_split
108
+
109
+ @staticmethod
110
+ def get_sorted_scaffolds(smiles_seqs: Sequence[str]):
111
+ from rdkit.Chem import MolFromSmiles
112
+ from rdkit.Chem.Scaffolds.MurckoScaffold import MurckoScaffoldSmiles
113
+
114
+ scaffolds: Dict[str, List[int]] = defaultdict(list)
115
+ molecules = (MolFromSmiles(s, sanitize=True) for s in smiles_seqs)
116
+
117
+ for i, molecule in enumerate(molecules):
118
+ try:
119
+ scaffold = MurckoScaffoldSmiles(mol=molecule, includeChirality=False)
120
+ scaffolds[scaffold].append(i)
121
+ except Exception: # Really don't know what exception is raised...
122
+ pass
123
+
124
+ scaffolds = {scaffold: sorted(ids) for scaffold, ids in scaffolds.items()}
125
+ scaffold_sets = [scaffold_set
126
+ for scaffold, scaffold_set in
127
+ sorted(scaffolds.items(), key=lambda x: (len(x[1]), x[1][0]),
128
+ reverse=True)]
129
+ return scaffold_sets
130
+
131
+ def scaffold_split(self, smiles_seqs: Sequence[str]) \
132
+ -> Tuple[List[int], List[int], List[int]]:
133
+ scaffold_sets = self.get_sorted_scaffolds(smiles_seqs)
134
+
135
+ n_samples = len(smiles_seqs)
136
+ train_idx, val_idx, test_idx = [], [], []
137
+ train_cutoff = int(self.train_size * n_samples)
138
+ val_cutoff = int((self.train_size + self.val_size) * n_samples)
139
+
140
+ for group_indices in scaffold_sets:
141
+ n_group = len(group_indices)
142
+ n_train = len(train_idx)
143
+ if n_train + n_group > train_cutoff:
144
+ n_val = len(val_idx)
145
+ if n_train + n_val + n_group > val_cutoff:
146
+ test_idx.extend(group_indices)
147
+ else:
148
+ val_idx.extend(group_indices)
149
+ else:
150
+ train_idx.extend(group_indices)
151
+
152
+ return train_idx, val_idx, test_idx
153
+
154
+ def random_split(self, smiles_seqs: "pandas.Series") \
155
+ -> Tuple["numpy.array", "numpy.array", "numpy.array"]:
156
+ cv = ShuffleSplit(train_size=self.train_size + self.val_size)
157
+ train_idx, val_idx = next(cv.split(smiles_seqs))
158
+ cv.train_size = 1 - self.test_size / (self.train_size + self.val_size)
159
+ train_idx, test_idx = next(cv.split(smiles_seqs.iloc[train_idx]))
160
+
161
+ return train_idx, val_idx, test_idx
162
+
163
+
164
+ @dataclass(init=True, repr=True, eq=False, frozen=False)
165
+ class CSVDataModule(LightningDataModule):
166
+ """Lightning data module for tabular data. Accepts pandas `dataframe`, splits the
167
+ data into train/valid/test with `splitter`, creates `CSVDataset`s and Pytorch
168
+ `DataLoader`s with `DataCollatorWithPadding` collate function.
169
+ """
170
+
171
+ dataframe: "pandas.DataFrame"
172
+ tokenizer: BaseTokenizer
173
+ smiles_column: str = "smiles"
174
+ target_column: Union[None, str, List[str]] = None
175
+ has_empty_target: bool = False
176
+ task_type: Literal["classification", "regression"] = "classification"
177
+ splitter: CVSplitter = CVSplitter()
178
+ batch_size: int = 16
179
+ num_workers: int = 0
180
+
181
+ def __post_init__(self) -> None:
182
+ super().__init__()
183
+ self.train_dataset: Optional[CSVDataset] = None
184
+ self.val_dataset: Optional[CSVDataset] = None
185
+ self.test_dataset: Optional[CSVDataset] = None
186
+ self.collate_fn: Callable = DataCollatorWithPadding(self.tokenizer)
187
+
188
+ def setup(self, stage: Optional[str] = None) -> None:
189
+ train_idx, val_idx, test_idx = self.splitter.train_val_test_split(
190
+ self.dataframe[self.smiles_column])
191
+
192
+ train_dataframe = self.dataframe.iloc[train_idx].reset_index(drop=True)
193
+ self.train_dataset = CSVDataset(train_dataframe, self.tokenizer,
194
+ self.smiles_column, self.target_column,
195
+ self.has_empty_target, self.task_type)
196
+ valid_dataframe = self.dataframe.iloc[val_idx].reset_index(drop=True)
197
+ self.val_dataset = CSVDataset(valid_dataframe, self.tokenizer,
198
+ self.smiles_column, self.target_column,
199
+ self.has_empty_target, self.task_type)
200
+ test_dataframe = self.dataframe.iloc[test_idx].reset_index(drop=True)
201
+ self.test_dataset = CSVDataset(test_dataframe, self.tokenizer,
202
+ self.smiles_column, self.target_column,
203
+ self.has_empty_target, self.task_type)
204
+
205
+ def train_dataloader(self) -> Union[DataLoader, List[DataLoader],
206
+ Dict[str, DataLoader]]:
207
+ return DataLoader(self.train_dataset, batch_size=self.batch_size, shuffle=True,
208
+ collate_fn=self.collate_fn, num_workers=self.num_workers)
209
+
210
+ def val_dataloader(self) -> Union[DataLoader, List[DataLoader],
211
+ Dict[str, DataLoader]]:
212
+ return DataLoader(self.val_dataset, batch_size=self.batch_size, shuffle=False,
213
+ collate_fn=self.collate_fn, num_workers=self.num_workers)
214
+
215
+ def test_dataloader(self) -> Union[DataLoader, List[DataLoader],
216
+ Dict[str, DataLoader]]:
217
+ return DataLoader(self.test_dataset, batch_size=self.batch_size, shuffle=False,
218
+ collate_fn=self.collate_fn, num_workers=self.num_workers)
219
+
220
+
221
+ @dataclass(init=True, eq=False, repr=True, frozen=False)
222
+ class LMDataset(Dataset):
223
+ """Simple sequential dataset for autoregressive language modeling.
224
+ """
225
+
226
+ filename: str
227
+ tokenizer: BaseTokenizer
228
+
229
+ def __post_init__(self) -> None:
230
+ self.smiles_strings = Path(self.filename).read_text(encoding='ascii').splitlines()
231
+
232
+ if isinstance(self.tokenizer, PreTrainedTokenizerFast):
233
+ self._encode = partial(self.tokenizer.__call__, truncation=True)
234
+ self._id_key = "input_ids"
235
+ else:
236
+ self._encode = self.tokenizer.encode
237
+ self._id_key = "ids"
238
+
239
+ def __len__(self) -> int:
240
+ return len(self.smiles_strings)
241
+
242
+ def __getitem__(self, i: int) -> torch.Tensor:
243
+ encodings = self._encode(self.smiles_strings[i])
244
+ return torch.LongTensor(getattr(encodings, self._id_key))
245
+
246
+
247
+ @dataclass(init=True, repr=True, eq=False, frozen=False)
248
+ class LMDataModule(LightningDataModule):
249
+ """Lightning data module for autoregressive language modeling.
250
+ """
251
+
252
+ filename: str
253
+ tokenizer: BaseTokenizer
254
+ batch_size: int = 128
255
+ num_workers: int = 0
256
+ collate_fn: Union[None, Literal["default"], Callable] = "default"
257
+
258
+ def __post_init__(self) -> None:
259
+ super().__init__()
260
+ if self.collate_fn == "default":
261
+ self.collate_fn = DataCollatorForLanguageModeling(self.tokenizer, mlm=False)
262
+
263
+ def setup(self, stage: Optional[str] = None) -> None:
264
+ self.dataset = LMDataset(self.filename, self.tokenizer)
265
+
266
+ def train_dataloader(self) -> Union[DataLoader, List[DataLoader],
267
+ Dict[str, DataLoader]]:
268
+ return DataLoader(self.dataset, batch_size=self.batch_size, shuffle=True,
269
+ collate_fn=self.collate_fn, num_workers=self.num_workers)
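`CSVDataModule` above glues a pandas frame, a tokenizer, and a `CVSplitter` into padded train/val/test `DataLoader`s. A usage sketch under stated assumptions: the BBBP file exposes `smiles` and `p_np` columns (as in the README example), and a fast tokenizer with a padding token is available; the tokenizer path below is hypothetical, since the iupac checkpoint in this commit ships no tokenizer files:

```python
# Sketch: the tokenizer path and column names are assumptions.
import pandas as pd
from transformers import PreTrainedTokenizerFast
from iupac_gpt import CSVDataModule, CVSplitter

tokenizer = PreTrainedTokenizerFast.from_pretrained("checkpoints/benchmark-5m")  # hypothetical
dataframe = pd.read_csv("iupac-gpt/data/bbbp.csv")

datamodule = CSVDataModule(dataframe=dataframe, tokenizer=tokenizer,
                           smiles_column="smiles", target_column=["p_np"],
                           splitter=CVSplitter(mode="scaffold"),  # Murcko-scaffold split
                           batch_size=16)
datamodule.setup()
batch = next(iter(datamodule.train_dataloader()))
print(batch["input_ids"].shape, batch["labels"].shape)
```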
iupac-gpt/iupac_gpt/iupac_dataset.py ADDED
@@ -0,0 +1,121 @@
1
+ import os
2
+ import sys
3
+ import time
4
+ import random
5
+ from itertools import chain
6
+ from collections import Counter
7
+ import numpy as np
8
+ import torch
9
+ from torch.nn.utils.rnn import pad_sequence
10
+ from transformers.data.data_collator import DataCollator
11
+ from multiprocessing import Pool
12
+ import mmap
13
+ from torch.utils.data import Dataset
14
+
15
+ class IUPACDataset(Dataset):
16
+ def __init__(self, dataset_dir='./',dataset_filename="iupacs_logp.txt", tokenizer=None,max_length=None,target_col=None,
17
+ dataset_size=None,iupac_name_col="iupac"):
18
+ self.dataset_dir = dataset_dir
19
+ self.tokenizer = tokenizer
20
+ self.target_col = target_col
21
+ self.max_length = max_length
22
+ self.dataset_size = dataset_size
23
+ self.dataset_filename = dataset_filename
24
+
25
+ # where the data is
26
+ self.dataset_fn = os.path.join(self.dataset_dir,self.dataset_filename)
27
+
28
+ # a bit of an odd way to read in a data file, but it lets
29
+ # us keep the data in csv format, and it's pretty fast
30
+ # (30s for 17G on my machine).
31
+ # we need to use mmap for data-parallel training with
32
+ # multiple processes so that the processes don't each keep
33
+ # a local copy of the dataset in host memory
34
+ line_offsets = []
35
+ # each element of data_mm is a character in the dataset file
36
+ self.data_mm = np.memmap(self.dataset_fn, dtype=np.uint8, mode="r")
37
+
38
+ # process chunksize bytes at a time
39
+ chunksize = int(1e9)
40
+ for i in range(0, len(self.data_mm), chunksize):
41
+ chunk = self.data_mm[i:i + chunksize]
42
+ # the index of each newline is the character before
43
+ # the beginning of the next line
44
+ newlines = np.nonzero(chunk == 0x0a)[0]
45
+ line_offsets.append(i + newlines + 1)
46
+ if self.dataset_size is not None and i > self.dataset_size:
47
+ # don't need to keep loading data
48
+ break
49
+ # line_offsets indicates the beginning of each line in self.dataset_fn
50
+ self.line_offsets = np.hstack(line_offsets)
51
+
52
+ if (self.dataset_size is not None
53
+ and self.dataset_size > self.line_offsets.shape[0]):
54
+ msg = "specified dataset_size {}, but the dataset only has {} items"
55
+ raise ValueError(msg.format(self.dataset_size,
56
+ self.line_offsets.shape[0]))
57
+
58
+ # extract headers
59
+ header_line = bytes(self.data_mm[0:self.line_offsets[0]])
60
+ headers = header_line.decode("utf8").strip().split("|")
61
+
62
+ # figure out which column IDs are of interest
63
+ try:
64
+ self.name_col_id = headers.index(iupac_name_col)
65
+ except ValueError as e:
66
+ raise RuntimeError("Expecting a column called '{}' "
67
+ "that contains IUPAC names".format(iupac_name_col))
68
+ self.target_col_id = None
69
+ if self.target_col is not None:
70
+ try:
71
+ self.target_col_id = headers.index(self.target_col)
72
+ except ValueError as e:
73
+ raise RuntimeError("User supplied target col " + target_col + \
74
+ "but column is not present in data file")
75
+
76
+ def __getitem__(self, idx):
77
+ # model_inputs is a dict with keys
78
+ # input_ids, target
79
+
80
+ if self.dataset_size is not None and idx > self.dataset_size:
81
+ msg = "provided index {} is larger than dataset size {}"
82
+ raise IndexError(msg.format(idx, self.dataset_size))
83
+
84
+ start = self.line_offsets[idx]
85
+ end = self.line_offsets[idx + 1]
86
+ line = bytes(self.data_mm[start:end])
87
+ line = line.decode("utf8").strip().split("|")
88
+ name = line[self.name_col_id]
89
+
90
+ # get the target value, if needed
91
+ target = None
92
+ if self.target_col_id is not None:
93
+ target = line[self.target_col_id]
94
+ if self.target_col == "Log P" and len(target) == 0:
95
+ target = 3.16 # average of training data
96
+ else:
97
+ target = float(target)
98
+
99
+ tokenized = self.tokenizer(name)  # after this, the tokenizer.eos_token_id has been added automatically
100
+ input_ids = torch.tensor(tokenized["input_ids"])
101
+
102
+ iupac_unk = torch.tensor([self.tokenizer._convert_token_to_id(self.tokenizer.unk_token)])
103
+ input_ids = torch.tensor(input_ids)
104
+ input_ids = torch.cat([iupac_unk,input_ids])
105
+
106
+ return_dict = {}
107
+ return_dict["input_ids"] = input_ids #np.array(tokenized["input_ids"])
108
+ return_dict["labels"] = input_ids
109
+ #return_dict["property"] = torch.tensor(np.array(target))
110
+
111
+ if self.max_length is not None:
112
+ return_dict["input_ids"] = return_dict["input_ids"][:self.max_length]
113
+ return_dict["labels"] = return_dict["labels"][:self.max_length]
114
+
115
+ return return_dict
116
+
117
+ def __len__(self):
118
+ if self.dataset_size is None:
119
+ return len(self.line_offsets) - 1
120
+ else:
121
+ return self.dataset_size
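`IUPACDataset` memory-maps a `|`-separated file whose header contains an `iupac` column, prepends the tokenizer's `unk_token` as a start-of-sequence marker, and copies `input_ids` into `labels` for causal language modeling. In this package the wiring is normally done through the `get_data_loader` helpers re-exported in `__init__.py`; the sketch below uses the dataset directly, with a hypothetical tokenizer object and an ad-hoc padding collate:

```python
# Sketch: `iupac_tokenizer` is a hypothetical HuggingFace-style tokenizer, and iupac.txt is
# assumed to be "|"-separated with an "iupac" header column, as IUPACDataset expects.
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader
from iupac_gpt.iupac_dataset import IUPACDataset

dataset = IUPACDataset(dataset_dir="iupac-gpt", dataset_filename="iupac.txt",
                       tokenizer=iupac_tokenizer, max_length=1280)

def collate(batch):
    # pad to the longest sequence in the batch; -100 labels are ignored by the GPT-2 LM loss
    input_ids = pad_sequence([x["input_ids"] for x in batch], batch_first=True, padding_value=0)
    labels = pad_sequence([x["labels"] for x in batch], batch_first=True, padding_value=-100)
    return {"input_ids": input_ids, "labels": labels}

loader = DataLoader(dataset, batch_size=8, shuffle=True, collate_fn=collate)
print(next(iter(loader))["input_ids"].shape)
```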
iupac-gpt/iupac_gpt/iupac_dataset_class.py ADDED
@@ -0,0 +1,128 @@
1
+ import os
2
+ import sys
3
+ import time
4
+ import random
5
+ from itertools import chain
6
+ from collections import Counter
7
+ import numpy as np
8
+ import torch
9
+ from torch.nn.utils.rnn import pad_sequence
10
+ from transformers.data.data_collator import DataCollator
11
+ from multiprocessing import Pool
12
+ import mmap
13
+ from torch.utils.data import Dataset
14
+
15
+ class IUPACDataset(Dataset):
16
+ def __init__(self, dataset_dir='./',dataset_filename="iupacs_logp.txt", tokenizer=None,max_length=None,target_col=None,
17
+ dataset_size=None,iupac_name_col="iupac"):
18
+ self.dataset_dir = dataset_dir
19
+ self.tokenizer = tokenizer
20
+ self.target_col = target_col
21
+ self.max_length = max_length
22
+ self.dataset_size = dataset_size
23
+ self.dataset_filename = dataset_filename
24
+
25
+ # where the data is
26
+ self.dataset_fn = os.path.join(self.dataset_dir,self.dataset_filename)
27
+
28
+ # a bit of an odd way to read in a data file, but it lets
29
+ # us keep the data in csv format, and it's pretty fast
30
+ # (30s for 17G on my machine).
31
+ # we need to use mmap for data-parallel training with
32
+ # multiple processes so that the processes don't each keep
33
+ # a local copy of the dataset in host memory
34
+ line_offsets = []
35
+ # each element of data_mm is a character in the dataset file
36
+ self.data_mm = np.memmap(self.dataset_fn, dtype=np.uint8, mode="r")
37
+
38
+ # process chunksize bytes at a time
39
+ chunksize = int(1e9)
40
+ for i in range(0, len(self.data_mm), chunksize):
41
+ chunk = self.data_mm[i:i + chunksize]
42
+ # the index of each newline is the character before
43
+ # the beginning of the next line
44
+ newlines = np.nonzero(chunk == 0x0a)[0]
45
+ line_offsets.append(i + newlines + 1)
46
+ if self.dataset_size is not None and i > self.dataset_size:
47
+ # don't need to keep loading data
48
+ break
49
+ # line_offsets indicates the beginning of each line in self.dataset_fn
50
+ self.line_offsets = np.hstack(line_offsets)
51
+
52
+ if (self.dataset_size is not None
53
+ and self.dataset_size > self.line_offsets.shape[0]):
54
+ msg = "specified dataset_size {}, but the dataset only has {} items"
55
+ raise ValueError(msg.format(self.dataset_size,
56
+ self.line_offsets.shape[0]))
57
+
58
+ # extract headers
59
+ header_line = bytes(self.data_mm[0:self.line_offsets[0]])
60
+ headers = header_line.decode("utf8").strip().split("|")
61
+
62
+ # figure out which column IDs are of interest
63
+ try:
64
+ self.name_col_id = headers.index(iupac_name_col)
65
+ except ValueError as e:
66
+ raise RuntimeError("Expecting a column called '{}' "
67
+ "that contains IUPAC names".format(iupac_name_col))
68
+ self.target_col_id = None
69
+ if self.target_col is not None:
70
+ try:
71
+ self.target_col_id = headers.index(self.target_col)
72
+ except ValueError as e:
73
+ raise RuntimeError("User supplied target col " + target_col + \
74
+ "but column is not present in data file")
75
+
76
+ def __getitem__(self, idx):
77
+ # model_inputs is a dict with keys
78
+ # input_ids, target
79
+
80
+ if self.dataset_size is not None and idx > self.dataset_size:
81
+ msg = "provided index {} is larger than dataset size {}"
82
+ raise IndexError(msg.format(idx, self.dataset_size))
83
+
84
+ start = self.line_offsets[idx]
85
+ end = self.line_offsets[idx + 1]
86
+ line = bytes(self.data_mm[start:end])
87
+ line = line.decode("utf8").strip().split("|")
88
+ name = line[self.name_col_id]
89
+
90
+ # get the target value, if needed
91
+ target = None
92
+ if self.target_col_id is not None:
93
+ target = line[self.target_col_id]
94
+ if self.target_col == "Log P" and len(target) == 0:
95
+ target = 3.16 # average of training data
96
+ else:
97
+ target = float(target)
98
+
99
+ if target>3.16:
100
+ target = 1
101
+ else:
102
+ target=0
103
+
104
+ tokenized = self.tokenizer(name)  # after this, the tokenizer.eos_token_id has been added automatically
105
+ input_ids = torch.tensor(tokenized["input_ids"])
106
+
107
+ iupac_unk = torch.tensor([self.tokenizer._convert_token_to_id(self.tokenizer.unk_token)])
108
+ input_ids = torch.tensor(input_ids)
109
+ input_ids = torch.cat([iupac_unk,input_ids])
110
+
111
+ attention_mask = torch.ones(input_ids.numel(), dtype=int)
112
+
113
+ return_dict = {}
114
+ return_dict["input_ids"] = input_ids
115
+ return_dict["labels"] = torch.tensor(np.array(target))
116
+ return_dict["attention_mask"] = attention_mask
117
+
118
+ if self.max_length is not None:
119
+ return_dict["input_ids"] = return_dict["input_ids"][:self.max_length]
120
+ return_dict["attention_mask"] = return_dict["attention_mask"][:self.max_length]
121
+
122
+ return return_dict
123
+
124
+ def __len__(self):
125
+ if self.dataset_size is None:
126
+ return len(self.line_offsets) - 1
127
+ else:
128
+ return self.dataset_size
iupac-gpt/iupac_gpt/iupac_dataset_pro.py ADDED
@@ -0,0 +1,124 @@
1
+ import os
2
+ import sys
3
+ import time
4
+ import random
5
+ from itertools import chain
6
+ from collections import Counter
7
+ import numpy as np
8
+ import torch
9
+ from torch.nn.utils.rnn import pad_sequence
10
+ from transformers.data.data_collator import DataCollator
11
+ from multiprocessing import Pool
12
+ import mmap
13
+ from torch.utils.data import Dataset
14
+
15
+ class IUPACDataset(Dataset):
16
+ def __init__(self, dataset_dir='./',dataset_filename="iupacs_logp.txt", tokenizer=None,max_length=None,target_col=None,
17
+ dataset_size=None,iupac_name_col="iupac"):
18
+ self.dataset_dir = dataset_dir
19
+ self.tokenizer = tokenizer
20
+ self.target_col = target_col
21
+ self.max_length = max_length
22
+ self.dataset_size = dataset_size
23
+ self.dataset_filename = dataset_filename
24
+
25
+ # where the data is
26
+ self.dataset_fn = os.path.join(self.dataset_dir,self.dataset_filename)
27
+
28
+ # a bit of an odd way to read in a data file, but it lets
29
+ # us keep the data in csv format, and it's pretty fast
30
+ # (30s for 17G on my machine).
31
+ # we need to use mmap for data-parallel training with
32
+ # multiple processes so that the processes don't each keep
33
+ # a local copy of the dataset in host memory
34
+ line_offsets = []
35
+ # each element of data_mm is a character in the dataset file
36
+ self.data_mm = np.memmap(self.dataset_fn, dtype=np.uint8, mode="r")
37
+
38
+ # process chunksize bytes at a time
39
+ chunksize = int(1e9)
40
+ for i in range(0, len(self.data_mm), chunksize):
41
+ chunk = self.data_mm[i:i + chunksize]
42
+ # the index of each newline is the character before
43
+ # the beginning of the next line
44
+ newlines = np.nonzero(chunk == 0x0a)[0]
45
+ line_offsets.append(i + newlines + 1)
46
+ if self.dataset_size is not None and i > self.dataset_size:
47
+ # don't need to keep loading data
48
+ break
49
+ # line_offsets indicates the beginning of each line in self.dataset_fn
50
+ self.line_offsets = np.hstack(line_offsets)
51
+
52
+ if (self.dataset_size is not None
53
+ and self.dataset_size > self.line_offsets.shape[0]):
54
+ msg = "specified dataset_size {}, but the dataset only has {} items"
55
+ raise ValueError(msg.format(self.dataset_size,
56
+ self.line_offsets.shape[0]))
57
+
58
+ # extract headers
59
+ header_line = bytes(self.data_mm[0:self.line_offsets[0]])
60
+ headers = header_line.decode("utf8").strip().split("|")
61
+
62
+ # figure out which column IDs are of interest
63
+ try:
64
+ self.name_col_id = headers.index(iupac_name_col)
65
+ except ValueError as e:
66
+ raise RuntimeError("Expecting a column called '{}' "
67
+ "that contains IUPAC names".format(iupac_name_col))
68
+ self.target_col_id = None
69
+ if self.target_col is not None:
70
+ try:
71
+ self.target_col_id = headers.index(self.target_col)
72
+ except ValueError as e:
73
+ raise RuntimeError("User supplied target col " + target_col + \
74
+ "but column is not present in data file")
75
+
76
+ def __getitem__(self, idx):
77
+ # model_inputs is a dict with keys
78
+ # input_ids, target
79
+
80
+ if self.dataset_size is not None and idx > self.dataset_size:
81
+ msg = "provided index {} is larger than dataset size {}"
82
+ raise IndexError(msg.format(idx, self.dataset_size))
83
+
84
+ start = self.line_offsets[idx]
85
+ end = self.line_offsets[idx + 1]
86
+ line = bytes(self.data_mm[start:end])
87
+ line = line.decode("utf8").strip().split("|")
88
+ name = line[self.name_col_id]
89
+
90
+ # get the target value, if needed
91
+ target = None
92
+ if self.target_col_id is not None:
93
+ target = line[self.target_col_id]
94
+ if self.target_col == "Log P" and len(target) == 0:
95
+ target = 3.16 # average of training data
96
+ else:
97
+ target = float(target)
98
+
99
+
100
+ tokenized = self.tokenizer(name)  # after this, the tokenizer.eos_token_id has been added automatically
101
+ input_ids = torch.tensor(tokenized["input_ids"])
102
+
103
+ iupac_unk = torch.tensor([self.tokenizer._convert_token_to_id(self.tokenizer.unk_token)])
104
+ input_ids = torch.tensor(input_ids)
105
+ input_ids = torch.cat([iupac_unk,input_ids])
106
+
107
+ attention_mask = torch.ones(input_ids.numel(), dtype=int)
108
+
109
+ return_dict = {}
110
+ return_dict["input_ids"] = input_ids
111
+ return_dict["labels"] = torch.tensor(np.array(target))
112
+ return_dict["attention_mask"] = attention_mask
113
+
114
+ if self.max_length is not None:
115
+ return_dict["input_ids"] = return_dict["input_ids"][:self.max_length]
116
+ return_dict["attention_mask"] = return_dict["attention_mask"][:self.max_length]
117
+
118
+ return return_dict
119
+
120
+ def __len__(self):
121
+ if self.dataset_size is None:
122
+ return len(self.line_offsets) - 1
123
+ else:
124
+ return self.dataset_size
iupac-gpt/iupac_gpt/iupac_spm.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:bb18836fd01a60e6cf61ad64e7e6556ac1f676d3ca39a16f375d54e8a8fb4e60
+ size 275487
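The `.model`/`.vocab` pair added here looks like a standard SentencePiece model over IUPAC-name fragments (the vocabulary below lists pieces such as `phenyl`, `carboxamide`, and `pyridin`). A small inspection sketch, assuming the `sentencepiece` package (not pinned in `environment.yml`) and ordinary SentencePiece file semantics:

```python
# Sketch: assumes iupac_spm.model is a regular SentencePiece model file.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="iupac-gpt/iupac_gpt/iupac_spm.model")
print(sp.get_piece_size())                                 # expected: 1391, matching the .vocab
print(sp.encode("2-acetyloxybenzoic acid", out_type=str))  # IUPAC name of aspirin
print(sp.encode("2-acetyloxybenzoic acid", out_type=int))
```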
iupac-gpt/iupac_gpt/iupac_spm.vocab ADDED
@@ -0,0 +1,1391 @@
1
+ <pad> 0
2
+ </s> 0
3
+ <unk> 0
4
+ 0 0
5
+ 1 0
6
+ 2 0
7
+ 3 0
8
+ 4 0
9
+ 5 0
10
+ 6 0
11
+ 7 0
12
+ 8 0
13
+ 9 0
14
+ 10 0
15
+ 11 0
16
+ 12 0
17
+ 13 0
18
+ 14 0
19
+ 15 0
20
+ 16 0
21
+ 17 0
22
+ 18 0
23
+ 19 0
24
+ 20 0
25
+ 21 0
26
+ 22 0
27
+ 23 0
28
+ 24 0
29
+ 25 0
30
+ 26 0
31
+ 27 0
32
+ 28 0
33
+ 29 0
34
+ 30 0
35
+ 31 0
36
+ 32 0
37
+ 33 0
38
+ 34 0
39
+ 35 0
40
+ 36 0
41
+ 37 0
42
+ 38 0
43
+ 39 0
44
+ 40 0
45
+ 41 0
46
+ 42 0
47
+ 43 0
48
+ 44 0
49
+ 45 0
50
+ 46 0
51
+ 47 0
52
+ 48 0
53
+ 49 0
54
+ 50 0
55
+ 51 0
56
+ 52 0
57
+ 53 0
58
+ 54 0
59
+ 55 0
60
+ 56 0
61
+ 57 0
62
+ 58 0
63
+ 59 0
64
+ 60 0
65
+ 61 0
66
+ 62 0
67
+ 63 0
68
+ 64 0
69
+ 65 0
70
+ 66 0
71
+ 67 0
72
+ 68 0
73
+ 69 0
74
+ 70 0
75
+ 71 0
76
+ 72 0
77
+ 73 0
78
+ 74 0
79
+ 75 0
80
+ 76 0
81
+ 77 0
82
+ 78 0
83
+ 79 0
84
+ 80 0
85
+ 81 0
86
+ 82 0
87
+ 83 0
88
+ 84 0
89
+ 85 0
90
+ 86 0
91
+ 87 0
92
+ 88 0
93
+ 89 0
94
+ 90 0
95
+ 91 0
96
+ 92 0
97
+ 93 0
98
+ 94 0
99
+ 95 0
100
+ 96 0
101
+ 97 0
102
+ 98 0
103
+ 99 0
104
+ ; 0
105
+ . 0
106
+ .0 0
107
+ ' 0
108
+ R 0
109
+ S 0
110
+ H 0
111
+ N 0
112
+ E 0
113
+ Z 0
114
+ aR 0
115
+ aS 0
116
+ bR 0
117
+ bS 0
118
+ cR 0
119
+ cS 0
120
+ dR 0
121
+ dS 0
122
+ aH 0
123
+ bH 0
124
+ cH 0
125
+ aE 0
126
+ aZ 0
127
+ a, 0
128
+ a- 0
129
+ b, 0
130
+ b- 0
131
+ c, 0
132
+ c- 0
133
+ d, 0
134
+ d- 0
135
+ a] 0
136
+ b] 0
137
+ c] 0
138
+ d] 0
139
+ e] 0
140
+ f] 0
141
+ g] 0
142
+ h] 0
143
+ i] 0
144
+ j] 0
145
+ k] 0
146
+ l] 0
147
+ m] 0
148
+ <high> 0
149
+ <med> 0
150
+ <low> 0
151
+ - 0
152
+ yl 0
153
+ , 0
154
+ ) 0
155
+ ( 0
156
+ ] 0
157
+ [ 0
158
+ meth 0
159
+ phenyl 0
160
+ di 0
161
+ an 0
162
+ eth 0
163
+ oxy 0
164
+ prop 0
165
+ e 0
166
+ amino 0
167
+ oxo 0
168
+ fluoro 0
169
+ cyclo 0
170
+ o 0
171
+ amide 0
172
+ tri 0
173
+ chloro 0
174
+ but 0
175
+ hydroxy 0
176
+ a 0
177
+ one 0
178
+ pyridin 0
179
+ hydro 0
180
+ benzo 0
181
+ acet 0
182
+ l 0
183
+ en 0
184
+ ol 0
185
+ amine 0
186
+ ylamin 0
187
+ oxa 0
188
+ oyl 0
189
+ carboxamide 0
190
+ benz 0
191
+ piperidin 0
192
+ thia 0
193
+ ate 0
194
+ sulf 0
195
+ bromo 0
196
+ ylidene 0
197
+ pyrimidin 0
198
+ tetra 0
199
+ ic_acid 0
200
+ penta 0
201
+ pyrrolidin 0
202
+ sulfonyl 0
203
+ hexa 0
204
+ hex 0
205
+ ane 0
206
+ pyrazol 0
207
+ phenoxy 0
208
+ carbonyl 0
209
+ thiophen 0
210
+ aza 0
211
+ piperazin 0
212
+ azo 0
213
+ carboxylate 0
214
+ imidazol 0
215
+ furan 0
216
+ nitro 0
217
+ carbam 0
218
+ anilino 0
219
+ pent 0
220
+ d 0
221
+ tert- 0
222
+ benzen 0
223
+ indol 0
224
+ sulfon 0
225
+ carboxylic_acid 0
226
+ diazo 0
227
+ az 0
228
+ ene 0
229
+ quinolin 0
230
+ naphthalen 0
231
+ morpholin 0
232
+ ium 0
233
+ cyano 0
234
+ bi 0
235
+ bis 0
236
+ hepta 0
237
+ pyrrol 0
238
+ spiro 0
239
+ r 0
240
+ ole 0
241
+ azin 0
242
+ hydrochloride 0
243
+ urea 0
244
+ yn 0
245
+ azido 0
246
+ carbamate 0
247
+ pyrrolo 0
248
+ it 0
249
+ imidazo 0
250
+ pyrazin 0
251
+ guanidin 0
252
+ thio 0
253
+ pyrazolo 0
254
+ iodo 0
255
+ imino 0
256
+ sulfam 0
257
+ carbon 0
258
+ olidin 0
259
+ epin 0
260
+ isoquinolin 0
261
+ deca 0
262
+ anilin 0
263
+ quinazolin 0
264
+ nitrile 0
265
+ hydrazin 0
266
+ epan 0
267
+ pyridazin 0
268
+ chromen 0
269
+ octa 0
270
+ octan 0
271
+ thieno 0
272
+ in 0
273
+ amido 0
274
+ hept 0
275
+ thiol 0
276
+ hydroiodide 0
277
+ imid 0
278
+ isoindol 0
279
+ nona 0
280
+ pyrido 0
281
+ inden 0
282
+ carbazol 0
283
+ ox 0
284
+ dodeca 0
285
+ etidin 0
286
+ oct 0
287
+ phenol 0
288
+ imidazolidin 0
289
+ sil 0
290
+ carboxy 0
291
+ imido 0
292
+ phosphor 0
293
+ purin 0
294
+ phospha 0
295
+ fluoren 0
296
+ carbox 0
297
+ indazol 0
298
+ undeca 0
299
+ furo 0
300
+ tetradeca 0
301
+ cyclopenta[a]phenanthren 0
302
+ form 0
303
+ quinoxalin 0
304
+ trideca 0
305
+ hexadeca 0
306
+ imine 0
307
+ sulfinyl 0
308
+ octadeca 0
309
+ carba 0
310
+ dec 0
311
+ adamant 0
312
+ chloride 0
313
+ sila 0
314
+ icos 0
315
+ ine 0
316
+ ide 0
317
+ naphthyridin 0
318
+ heptadeca 0
319
+ thione 0
320
+ anthracen 0
321
+ dodec 0
322
+ oxir 0
323
+ pyran 0
324
+ hydrogen 0
325
+ pentadeca 0
326
+ oxido 0
327
+ carbo 0
328
+ henicos 0
329
+ deuterio 0
330
+ docos 0
331
+ non 0
332
+ id 0
333
+ tert-butyl(dimethyl)silyl 0
334
+ carbamic_acid 0
335
+ pyrano 0
336
+ nonadeca 0
337
+ tris 0
338
+ but-2-eno 0
339
+ ic 0
340
+ at 0
341
+ phosphate 0
342
+ hydrazide 0
343
+ aceton 0
344
+ octadec 0
345
+ sulfo 0
346
+ thiomorpholin 0
347
+ pyrimido 0
348
+ oxamide 0
349
+ carbonimidoyl 0
350
+ oxet 0
351
+ inan 0
352
+ sodium 0
353
+ al 0
354
+ (2+) 0
355
+ oxide 0
356
+ phthalazin 0
357
+ benzal 0
358
+ carbohydrazide 0
359
+ bora 0
360
+ benzhydr 0
361
+ tetracos 0
362
+ bor 0
363
+ hexadec 0
364
+ ioda 0
365
+ azonia 0
366
+ isocyano 0
367
+ acridin 0
368
+ hydroxylamin 0
369
+ formamide 0
370
+ phenanthren 0
371
+ ul 0
372
+ indeno 0
373
+ xanthen 0
374
+ nitroso 0
375
+ tetradec 0
376
+ phosphin 0
377
+ olan 0
378
+ peroxy 0
379
+ phosphono 0
380
+ tetr 0
381
+ pyrazolidin 0
382
+ dicarbon 0
383
+ olate 0
384
+ tricos 0
385
+ hexacos 0
386
+ indolo 0
387
+ indolizin 0
388
+ phosphon 0
389
+ undec 0
390
+ chromeno 0
391
+ pentacos 0
392
+ pyrazino 0
393
+ thi 0
394
+ hydrate 0
395
+ bromide 0
396
+ uid 0
397
+ boronic_acid 0
398
+ trityl 0
399
+ cen 0
400
+ sulfate 0
401
+ isochromen 0
402
+ octacos 0
403
+ isocyanato 0
404
+ acetal 0
405
+ azide 0
406
+ dimethylacetamide 0
407
+ tetrakis 0
408
+ iridin 0
409
+ nonadec 0
410
+ naphtho 0
411
+ heptadec 0
412
+ pyren 0
413
+ heptacos 0
414
+ carbamimidamido 0
415
+ sulfinam 0
416
+ oxid 0
417
+ iodide 0
418
+ etheno 0
419
+ disulfon 0
420
+ potassium 0
421
+ chrysen 0
422
+ yne 0
423
+ phosphino 0
424
+ carboximidoyl 0
425
+ quinolizin 0
426
+ tert-butyl(diphenyl)silyl 0
427
+ formamid 0
428
+ thiochromen 0
429
+ porphyrin 0
430
+ dicyan 0
431
+ triacont 0
432
+ pteridin 0
433
+ (3+) 0
434
+ sulfin 0
435
+ ar 0
436
+ pentadec 0
437
+ io 0
438
+ phenothiazin 0
439
+ undecyl 0
440
+ oxal 0
441
+ phospho 0
442
+ borin 0
443
+ uide 0
444
+ uranium 0
445
+ picen 0
446
+ hydrobromide 0
447
+ cinnolin 0
448
+ isoindolo 0
449
+ phthal 0
450
+ phenac 0
451
+ phenanthridin 0
452
+ azino 0
453
+ tridec 0
454
+ zirconium 0
455
+ len 0
456
+ phenanthrolin 0
457
+ platinum 0
458
+ phenolate 0
459
+ sulfonato 0
460
+ oxybenzon 0
461
+ zinc 0
462
+ chlora 0
463
+ hydroperoxy 0
464
+ yttrium 0
465
+ pyrrolizin 0
466
+ carbothioyl 0
467
+ sel 0
468
+ iron 0
469
+ spirobi 0
470
+ copper 0
471
+ triphenylen 0
472
+ titanium 0
473
+ perox 0
474
+ nonacos 0
475
+ (1+) 0
476
+ tridecyl 0
477
+ lithium 0
478
+ tetrol 0
479
+ (4+) 0
480
+ carboxylato 0
481
+ thiopyran 0
482
+ pentacont 0
483
+ etan 0
484
+ iridium 0
485
+ thioxanthen 0
486
+ nickel 0
487
+ phenoxazin 0
488
+ hexatriacont 0
489
+ azulen 0
490
+ tetracont 0
491
+ tritriacont 0
492
+ azon 0
493
+ carbono 0
494
+ sulfino 0
495
+ dotriacont 0
496
+ stann 0
497
+ nitrate 0
498
+ broma 0
499
+ on 0
500
+ et 0
501
+ acetylen 0
502
+ fluoride 0
503
+ isothiocyanato 0
504
+ magnesium 0
505
+ cobalt 0
506
+ acenaphthylen 0
507
+ sulfamate 0
508
+ ruthenium 0
509
+ aldehyde 0
510
+ phosphite 0
511
+ nonafl 0
512
+ palladium 0
513
+ pentadecyl 0
514
+ purino 0
515
+ tetratriacont 0
516
+ epoxy 0
517
+ aluma 0
518
+ phenanthro 0
519
+ phenazin 0
520
+ fluoranthen 0
521
+ sulfinato 0
522
+ ocin 0
523
+ hentriacont 0
524
+ azanida 0
525
+ stanna 0
526
+ toluen 0
527
+ ylidyne 0
528
+ thiopyrano 0
529
+ perchlorate 0
530
+ calcium 0
531
+ mono 0
532
+ tungsten 0
533
+ sulfur 0
534
+ cyanamide 0
535
+ tricarbon 0
536
+ chlorid 0
537
+ dehydro 0
538
+ pyridazino 0
539
+ sulfido 0
540
+ irin 0
541
+ phosph 0
542
+ iran 0
543
+ thiocyanate 0
544
+ hypoiodite 0
545
+ ylium 0
546
+ imidazolo 0
547
+ octatriacont 0
548
+ dimethylurea 0
549
+ heptadecyl 0
550
+ tritio 0
551
+ hydrazono 0
552
+ selena 0
553
+ cyanide 0
554
+ dotetracont 0
555
+ isoquinolino 0
556
+ diazonium 0
557
+ pentatriacont 0
558
+ hydroxide 0
559
+ manganese 0
560
+ chromium 0
561
+ pentakis 0
562
+ hypofluorite 0
563
+ tin 0
564
+ sulfono 0
565
+ phosphoroso 0
566
+ vanadium 0
567
+ boranuida 0
568
+ ecin 0
569
+ hexakis 0
570
+ s-indacen 0
571
+ os 0
572
+ fluoreno 0
573
+ mercury 0
574
+ sulfamic_acid 0
575
+ thiochromeno 0
576
+ phenalen 0
577
+ rhodium 0
578
+ amid 0
579
+ sulfite 0
580
+ ocan 0
581
+ phosphonato 0
582
+ heptatriacont 0
583
+ nonatriacont 0
584
+ borono 0
585
+ silver 0
586
+ gold 0
587
+ isothiochromen 0
588
+ nitron 0
589
+ hafnium 0
590
+ hexacont 0
591
+ (2-) 0
592
+ hypochlorite 0
593
+ arsa 0
594
+ diphosphat 0
595
+ molybdenum 0
596
+ thallium 0
597
+ nonadecyl 0
598
+ fluora 0
599
+ nonatetracont 0
600
+ rhenium 0
601
+ tetracarbon 0
602
+ perylen 0
603
+ diphosphon 0
604
+ cyanate 0
605
+ oxygen 0
606
+ germ 0
607
+ nitramide 0
608
+ tell 0
609
+ aluminum 0
610
+ azuleno 0
611
+ quinolino 0
612
+ iod 0
613
+ actinium 0
614
+ terephthal 0
615
+ ecan 0
616
+ trithion 0
617
+ barium 0
618
+ hentetracont 0
619
+ dithion 0
620
+ phosphat 0
621
+ selenophen 0
622
+ xylen 0
623
+ germa 0
624
+ hen 0
625
+ perimidin 0
626
+ nitric_acid 0
627
+ rubidium 0
628
+ octatetracont 0
629
+ but-1-eno 0
630
+ nitramido 0
631
+ heptakis 0
632
+ thiocyanat 0
633
+ dibor 0
634
+ nitrous 0
635
+ hydrazon 0
636
+ thianthren 0
637
+ dili 0
638
+ hydride 0
639
+ oxonio 0
640
+ tetratetracont 0
641
+ isochromeno 0
642
+ dihydropter 0
643
+ indolizino 0
644
+ osmium 0
645
+ phosphonia 0
646
+ oxanthren 0
647
+ diazano 0
648
+ do 0
649
+ cyanato 0
650
+ diacetamid 0
651
+ oxam 0
652
+ silicate 0
653
+ cadmium 0
654
+ hydrofluoride 0
655
+ hexatetracont 0
656
+ boron 0
657
+ phosphindol 0
658
+ phenoxathiin 0
659
+ phosphonous_acid 0
660
+ octakis 0
661
+ bismuth 0
662
+ chromenylium 0
663
+ corrin 0
664
+ pyrylium 0
665
+ thion 0
666
+ cinnam 0
667
+ tritetracont 0
668
+ nitrite 0
669
+ gadolinium 0
670
+ diazonio 0
671
+ antimony 0
672
+ oxalo 0
673
+ onic_acid 0
674
+ biphenylen 0
675
+ sulfonio 0
676
+ cesium 0
677
+ oxonium 0
678
+ stiba 0
679
+ styren 0
680
+ heptacont 0
681
+ selenol 0
682
+ chloroform 0
683
+ diselen 0
684
+ onin 0
685
+ oxaldehyd 0
686
+ cerium 0
687
+ technetium 0
688
+ (1-) 0
689
+ lead 0
690
+ ite 0
691
+ acenaphthyleno 0
692
+ dicarboximid 0
693
+ oxonia 0
694
+ strontium 0
695
+ (5+) 0
696
+ iodid 0
697
+ lanthanum 0
698
+ rutherfordium 0
699
+ perchloric_acid 0
700
+ iren 0
701
+ tricosyl 0
702
+ hypobromite 0
703
+ europium 0
704
+ isocyanate 0
705
+ ido 0
706
+ iodosyl 0
707
+ nitrilium 0
708
+ neodymium 0
709
+ peroxide 0
710
+ pentatetracont 0
711
+ phenylen 0
712
+ tantalum 0
713
+ hect 0
714
+ buta-1,3-dieno 0
715
+ samarium 0
716
+ galla 0
717
+ methylal 0
718
+ fluorid 0
719
+ praseodymium 0
720
+ ytterbium 0
721
+ dimethoxyethane 0
722
+ scandium 0
723
+ seleno 0
724
+ dimethoxyethan 0
725
+ octacont 0
726
+ cub 0
727
+ gallium 0
728
+ diphosphate 0
729
+ pentacosyl 0
730
+ thalla 0
731
+ ous_acid 0
732
+ selenoate 0
733
+ arson 0
734
+ niobium 0
735
+ alumina 0
736
+ anisol 0
737
+ beryllium 0
738
+ thioph 0
739
+ heptatetracont 0
740
+ onan 0
741
+ tellura 0
742
+ quinoxalino 0
743
+ indiga 0
744
+ heptacosyl 0
745
+ isothiocyanate 0
746
+ inin 0
747
+ diphospho 0
748
+ thionia 0
749
+ selenido 0
750
+ nonacosyl 0
751
+ terbium 0
752
+ (6+) 0
753
+ indig 0
754
+ dysprosium 0
755
+ quinazolino 0
756
+ iodyl 0
757
+ indium 0
758
+ hexatriacontyl 0
759
+ thiopyr 0
760
+ triphosphon 0
761
+ thorium 0
762
+ carbohydrazonoyl 0
763
+ as-indacen 0
764
+ fluoroform 0
765
+ erbium 0
766
+ phosphindolo 0
767
+ lutetium 0
768
+ selenopheno 0
769
+ arsin 0
770
+ arsor 0
771
+ iodat 0
772
+ silanuida 0
773
+ plumba 0
774
+ plumb 0
775
+ borano 0
776
+ sulfonium 0
777
+ tellurophen 0
778
+ indazolo 0
779
+ nitroxyl 0
780
+ nitrogen 0
781
+ anthra 0
782
+ isophosphindol 0
783
+ disulfid 0
784
+ nonacont 0
785
+ selone 0
786
+ iodonio 0
787
+ onate 0
788
+ trili 0
789
+ iodine 0
790
+ seleninyl 0
791
+ phenoxaphosphinin 0
792
+ phen 0
793
+ thulium 0
794
+ chloryl 0
795
+ phosphinimyl 0
796
+ cyanic_acid 0
797
+ acridophosphin 0
798
+ tetrali 0
799
+ cumen 0
800
+ holmium 0
801
+ selenopyran 0
802
+ dibenzamid 0
803
+ nitrous_acid 0
804
+ phthalal 0
805
+ selenocyanate 0
806
+ argon 0
807
+ iodate 0
808
+ isothiochromeno 0
809
+ mercurio 0
810
+ sulfide 0
811
+ bromid 0
812
+ iodonia 0
813
+ disulfate 0
814
+ fluorine 0
815
+ aceanthrylen 0
816
+ coronen 0
817
+ phenoxid 0
818
+ hydrazonic 0
819
+ telluro 0
820
+ silicon 0
821
+ chloronio 0
822
+ hypochlorous_acid 0
823
+ dodecakis 0
824
+ hydroseleno 0
825
+ phosphinolin 0
826
+ inda 0
827
+ phenaleno 0
828
+ phenylene 0
829
+ arsenic 0
830
+ chlorosyl 0
831
+ perchloryl 0
832
+ chlorate 0
833
+ bism 0
834
+ onat 0
835
+ terephthalal 0
836
+ 7,8-dihydropter 0
837
+ silano 0
838
+ boranthren 0
839
+ fermium 0
840
+ phosphano 0
841
+ arsoroso 0
842
+ hydrido 0
843
+ alum 0
844
+ selenium 0
845
+ pol 0
846
+ nonakis 0
847
+ stibo 0
848
+ phospheno 0
849
+ astatine 0
850
+ phosphanida 0
851
+ phenophosphazinin 0
852
+ stibor 0
853
+ sulfenat 0
854
+ silanida 0
855
+ pyranthren 0
856
+ arsono 0
857
+ decakis 0
858
+ oxaldehyde 0
859
+ cyanid 0
860
+ neptunium 0
861
+ diphosphor 0
862
+ bromate 0
863
+ selenate 0
864
+ selenin 0
865
+ selenonyl 0
866
+ phenoselenazin 0
867
+ hypoiodous_acid 0
868
+ silanylia 0
869
+ ditellur 0
870
+ arso 0
871
+ helicen 0
872
+ americium 0
873
+ pyreno 0
874
+ selenoxanthen 0
875
+ amoyl 0
876
+ telluroate 0
877
+ selen 0
878
+ selenochromen 0
879
+ diyl 0
880
+ dithianon 0
881
+ ose 0
882
+ plutonium 0
883
+ silicic_acid 0
884
+ 5,6,7,8-tetrahydropter 0
885
+ xenon 0
886
+ sulfamide 0
887
+ bisma 0
888
+ germanium 0
889
+ triphosphate 0
890
+ triphospho 0
891
+ triselen 0
892
+ isocyanide 0
893
+ isophosphinolin 0
894
+ tetrasulfide 0
895
+ dict 0
896
+ bromine 0
897
+ curium 0
898
+ acephenanthrylen 0
899
+ promethium 0
900
+ phosphanthridin 0
901
+ gall 0
902
+ selenocyanat 0
903
+ stilben 0
904
+ disulfide 0
905
+ isochromenylium 0
906
+ tetrathion 0
907
+ thall 0
908
+ selenat 0
909
+ chlor 0
910
+ silanthren 0
911
+ (3-) 0
912
+ tetradecakis 0
913
+ xantheno 0
914
+ chromio 0
915
+ chlorite 0
916
+ californium 0
917
+ tetraphosphat 0
918
+ chlorine 0
919
+ iodoform 0
920
+ telluropyran 0
921
+ polona 0
922
+ lawrencium 0
923
+ naphthyridino 0
924
+ selenon 0
925
+ phenoxarsinin 0
926
+ as-indaceno 0
927
+ mercura 0
928
+ periodate 0
929
+ selenite 0
930
+ hypofluorous_acid 0
931
+ adip 0
932
+ bromyl 0
933
+ arsino 0
934
+ tungstenio 0
935
+ tellurochromen 0
936
+ stibin 0
937
+ trisulfide 0
938
+ isoselenochromen 0
939
+ zircona 0
940
+ hexali 0
941
+ tetraphosphate 0
942
+ onamide 0
943
+ chloronia 0
944
+ thiochromenylium 0
945
+ phosphorus 0
946
+ titana 0
947
+ dicyclohexylurea 0
948
+ phenarsazinin 0
949
+ (8+) 0
950
+ nitroform 0
951
+ molybdenio 0
952
+ undecakis 0
953
+ rubicen 0
954
+ diselenid 0
955
+ triphosphat 0
956
+ diboron 0
957
+ trisulfid 0
958
+ hexadecakis 0
959
+ pleiaden 0
960
+ ter 0
961
+ arsonous_acid 0
962
+ ars 0
963
+ permangan 0
964
+ methoxychlor 0
965
+ tellurinyl 0
966
+ triacetamid 0
967
+ isocyanatid 0
968
+ (7+) 0
969
+ phthalazino 0
970
+ chloric_acid 0
971
+ stibon 0
972
+ tellone 0
973
+ stib 0
974
+ protactinium 0
975
+ fluor 0
976
+ arsonato 0
977
+ einsteinium 0
978
+ tellur 0
979
+ molybda 0
980
+ telluroxanthen 0
981
+ water 0
982
+ pentali 0
983
+ vanadio 0
984
+ formazan 0
985
+ ovalen 0
986
+ brom 0
987
+ thioxantheno 0
988
+ selenomorpholin 0
989
+ arsonium 0
990
+ nobelium 0
991
+ cinnolino 0
992
+ nitrid 0
993
+ telluropyrano 0
994
+ neo 0
995
+ tellurate 0
996
+ bromic_acid 0
997
+ phosphinolino 0
998
+ iodite 0
999
+ arsindol 0
1000
+ phosphen 0
1001
+ tribenzamid 0
1002
+ tellurium 0
1003
+ oxyl 0
1004
+ icosakis 0
1005
+ tellurat 0
1006
+ krypton 0
1007
+ bromite 0
1008
+ tridecakis 0
1009
+ all 0
1010
+ isotellurochromen 0
1011
+ diarsor 0
1012
+ bromosyl 0
1013
+ helium 0
1014
+ disulfite 0
1015
+ deuteride 0
1016
+ carboselenoyl 0
1017
+ bromoform 0
1018
+ trinaphthylen 0
1019
+ octali 0
1020
+ furano 0
1021
+ selenino 0
1022
+ iodic_acid 0
1023
+ hydrotelluro 0
1024
+ boronia 0
1025
+ phosphinolizin 0
1026
+ prism 0
1027
+ periodic_acid 0
1028
+ orot 0
1029
+ pentadecakis 0
1030
+ polonium 0
1031
+ hexasulfide 0
1032
+ stibono 0
1033
+ selenanthren 0
1034
+ ozone 0
1035
+ phosphindolizin 0
1036
+ urana 0
1037
+ pyridino 0
1038
+ phenotellurazin 0
1039
+ meitnerium 0
1040
+ tetrasulfid 0
1041
+ selenonia 0
1042
+ hypobromous_acid 0
1043
+ selenopyrano 0
1044
+ chlorat 0
1045
+ trifluoromethanesulfonimid 0
1046
+ seaborgium 0
1047
+ azor 0
1048
+ azonous_acid 0
1049
+ selenoph 0
1050
+ periodyl 0
1051
+ perbromate 0
1052
+ oson 0
1053
+ berkelium 0
1054
+ tungsta 0
1055
+ ribo 0
1056
+ pentaphosphate 0
1057
+ hafna 0
1058
+ telluropheno 0
1059
+ tellurite 0
1060
+ nitronium 0
1061
+ mon 0
1062
+ astata 0
1063
+ isothiocyanatid 0
1064
+ dubnium 0
1065
+ isothiochromenylium 0
1066
+ tellurin 0
1067
+ sodio 0
1068
+ selenono 0
1069
+ selenochromeno 0
1070
+ nitrosyl 0
1071
+ mendelevium 0
1072
+ ous 0
1073
+ neon 0
1074
+ fluoronio 0
1075
+ azid 0
1076
+ then 0
1077
+ stannanylia 0
1078
+ potassio 0
1079
+ phosphanthren 0
1080
+ disilic 0
1081
+ chlorazin 0
1082
+ titanio 0
1083
+ bromat 0
1084
+ triacontakis 0
1085
+ pentasulfide 0
1086
+ nonadecakis 0
1087
+ rhenio 0
1088
+ platina 0
1089
+ phenoxatellurin 0
1090
+ pentazocine 0
1091
+ ferrio 0
1092
+ cos 0
1093
+ vanada 0
1094
+ triselenid 0
1095
+ telluronyl 0
1096
+ tellurocyanate 0
1097
+ pentazocin 0
1098
+ fulven 0
1099
+ distibor 0
1100
+ diphosphite 0
1101
+ radon 0
1102
+ pentathion 0
1103
+ nitrous_oxide 0
1104
+ ferra 0
1105
+ ditelluron 0
1106
+ bis(trifluoromethylsulfonyl)imid 0
1107
+ acridino 0
1108
+ telluron 0
1109
+ isophosphinolino 0
1110
+ diselenon 0
1111
+ diarson 0
1112
+ stibanuida 0
1113
+ germano 0
1114
+ xanthylium 0
1115
+ tert-butyl(dimethyl)silanyl 0
1116
+ radium 0
1117
+ osma 0
1118
+ chlorous_acid 0
1119
+ bromonio 0
1120
+ arsonia 0
1121
+ arsinolin 0
1122
+ amate 0
1123
+ urazol 0
1124
+ triphosphor 0
1125
+ nonali 0
1126
+ deutero 0
1127
+ nioba 0
1128
+ acridarsin 0
1129
+ yttrio 0
1130
+ tert-butyl-dimethylsilyl 0
1131
+ pyrimidino 0
1132
+ pteridino 0
1133
+ phenoxaselenin 0
1134
+ isocyanid 0
1135
+ irida 0
1136
+ heptadecakis 0
1137
+ bohrium 0
1138
+ pentacosakis 0
1139
+ octadecakis 0
1140
+ thianthreno 0
1141
+ telluroph 0
1142
+ t- 0
1143
+ isophosphindolo 0
1144
+ isoarsindol 0
1145
+ henicosakis 0
1146
+ (4-) 0
1147
+ ruthena 0
1148
+ heptali 0
1149
+ arsen 0
1150
+ telluranthren 0
1151
+ chryseno 0
1152
+ carbotelluroyl 0
1153
+ quinolizino 0
1154
+ nonacosakis 0
1155
+ francium 0
1156
+ ethion 0
1157
+ chroma 0
1158
+ arsanthridin 0
1159
+ arsanthren 0
1160
+ tricosakis 0
1161
+ tetraphosphor 0
1162
+ tetracosakis 0
1163
+ tellurocyanat 0
1164
+ stibonia 0
1165
+ stibonato 0
1166
+ phosphanuida 0
1167
+ phenoxathiino 0
1168
+ manganio 0
1169
+ eicosa 0
1170
+ cobaltio 0
1171
+ cera 0
1172
+ amic_acid 0
1173
+ stibino 0
1174
+ stannanuida 0
1175
+ samario 0
1176
+ s-indaceno 0
1177
+ praseodymio 0
1178
+ phenoxastibinin 0
1179
+ pallada 0
1180
+ neodymio 0
1181
+ isoselenocyanate 0
1182
+ germanuida 0
1183
+ diazoamino 0
1184
+ telluronia 0
1185
+ tantalio 0
1186
+ phenoxyl 0
1187
+ phenothiarsinin 0
1188
+ oxanthreno 0
1189
+ octacosakis 0
1190
+ mangana 0
1191
+ lanthanio 0
1192
+ isoarsinolin 0
1193
+ indan 0
1194
+ hexacosakis 0
1195
+ hassium 0
1196
+ arsinolizin 0
1197
+ alli 0
1198
+ thioxanth 0
1199
+ tert-butyl(diphenyl)silanyl 0
1200
+ stronta 0
1201
+ stannano 0
1202
+ rhodio 0
1203
+ rhoda 0
1204
+ praseodyma 0
1205
+ phenazino 0
1206
+ pentaphosphat 0
1207
+ nitric 0
1208
+ methoxyl 0
1209
+ magnesio 0
1210
+ dichrom 0
1211
+ chlorazine 0
1212
+ californa 0
1213
+ butoxyl 0
1214
+ bromous_acid 0
1215
+ azonic_acid 0
1216
+ arsinolino 0
1217
+ arsindolo 0
1218
+ arsindolizin 0
1219
+ allo 0
1220
+ actina 0
1221
+ uronic_acid 0
1222
+ thora 0
1223
+ telluromorpholin 0
1224
+ stibonium 0
1225
+ stibano 0
1226
+ rhena 0
1227
+ phosphinolizino 0
1228
+ phenothiazino 0
1229
+ perbromyl 0
1230
+ niobio 0
1231
+ nickelio 0
1232
+ isotellurochromeno 0
1233
+ isoselenocyanato 0
1234
+ iodous_acid 0
1235
+ iodous 0
1236
+ hydroselenonyl 0
1237
+ dysprosio 0
1238
+ cyclopenta[a]phenanthr 0
1239
+ cerio 0
1240
+ bara 0
1241
+ aurio 0
1242
+ arsanuida 0
1243
+ ytterbio 0
1244
+ uronate 0
1245
+ tol 0
1246
+ thulio 0
1247
+ tert-butyl-diphenylsilyl 0
1248
+ tellurono 0
1249
+ stannanida 0
1250
+ scandio 0
1251
+ propoxyl 0
1252
+ periodic 0
1253
+ perbromic_acid 0
1254
+ nitror 0
1255
+ lutetio 0
1256
+ isothiocyanic_acid 0
1257
+ iridio 0
1258
+ iodic 0
1259
+ hypobor 0
1260
+ hydroxyl 0
1261
+ hydroseleninyl 0
1262
+ holmio 0
1263
+ hexasulfid 0
1264
+ heptacosakis 0
1265
+ gadolinio 0
1266
+ europio 0
1267
+ ethoxyl 0
1268
+ erbio 0
1269
+ docosakis 0
1270
+ chlorous 0
1271
+ chloric 0
1272
+ arsinimyl 0
1273
+ argentio 0
1274
+ ▁ -0.24368
1275
+ c -3.77761
1276
+ m -3.81933
1277
+ t -4.1484
1278
+ p -4.28552
1279
+ n -4.34236
1280
+ u -4.43826
1281
+ s -4.52053
1282
+ i -4.6648
1283
+ is -4.8052
1284
+ g -4.85455
1285
+ x -5.02503
1286
+ y -5.19016
1287
+ h -5.25276
1288
+ b -5.25733
1289
+ v -5.50657
1290
+ th -5.56431
1291
+ f -5.60089
1292
+ ph -5.65809
1293
+ hy -5.71657
1294
+ ▁p -6.08895
1295
+ cy -6.12699
1296
+ yc -6.28409
1297
+ im -6.3188
1298
+ ti -6.4861
1299
+ ch -6.53742
1300
+ ut -6.55604
1301
+ cys -6.59438
1302
+ st -6.61931
1303
+ ▁h -6.69232
1304
+ pi -6.72852
1305
+ uc -6.85542
1306
+ us -6.89267
1307
+ ▁b -6.96641
1308
+ ▁g -6.99289
1309
+ ▁c -7.03458
1310
+ ys -7.04986
1311
+ ct -7.06609
1312
+ ▁hy -7.10659
1313
+ gu -7.12486
1314
+ sp -7.1249
1315
+ xy -7.2108
1316
+ ▁s -7.3108
1317
+ yp -7.394
1318
+ um -7.39798
1319
+ xim -7.47115
1320
+ thy -7.52489
1321
+ ps -7.53214
1322
+ fu -7.86517
1323
+ ▁cy -7.98841
1324
+ mph -7.99202
1325
+ ▁n -8.03554
1326
+ ni -8.04807
1327
+ ▁m -8.12601
1328
+ nth -8.18462
1329
+ cu -8.19705
1330
+ phth -8.20839
1331
+ ip -8.32472
1332
+ ▁f -8.36171
1333
+ ty -8.47003
1334
+ ▁cu -8.49492
1335
+ ym -8.59996
1336
+ ff -8.60659
1337
+ uf -8.65435
1338
+ fi -8.70783
1339
+ pt -8.74056
1340
+ tun -8.78867
1341
+ yt -8.80236
1342
+ ▁ch -8.81859
1343
+ ▁ps -9.02055
1344
+ ▁sty -9.02375
1345
+ ▁phyt -9.0593
1346
+ ub -9.1473
1347
+ mb -9.15357
1348
+ ▁fu -9.19661
1349
+ if -9.24761
1350
+ ci -9.28944
1351
+ ▁sym -9.29607
1352
+ ss -9.31017
1353
+ up -9.34393
1354
+ sty -9.34753
1355
+ ▁t -9.40241
1356
+ pp -9.46886
1357
+ mi -9.51896
1358
+ gn -9.58869
1359
+ ms -9.85318
1360
+ ▁pi -9.85785
1361
+ ist -9.89882
1362
+ tig -9.95137
1363
+ ▁thy -10.0245
1364
+ vii -10.0685
1365
+ hi -10.077
1366
+ sym -10.0864
1367
+ ▁sub -10.1129
1368
+ ptu -10.1771
1369
+ cti -10.2664
1370
+ ig -10.5468
1371
+ tu -10.5569
1372
+ ▁fuc -10.6338
1373
+ ▁sy -10.726
1374
+ ▁th -10.8515
1375
+ uv -10.9123
1376
+ si -10.9398
1377
+ ▁cys -11.0937
1378
+ bu -11.3456
1379
+ mu -11.3477
1380
+ vi -11.4565
1381
+ mp -11.4617
1382
+ ib -11.5026
1383
+ pu -11.5547
1384
+ ▁i -11.5794
1385
+ ▁bu -11.6761
1386
+ ▁gu -11.6864
1387
+ ▁mu -11.7005
1388
+ ▁st -11.7307
1389
+ un -11.844
1390
+ uct -11.8441
1391
+ ▁u -12.046
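Each line of the table above is a SentencePiece piece followed by its score: the entries with score 0 (the numerals and IUPAC fragments) carry the score SentencePiece assigns to control and user-defined symbols, while the trailing entries with negative scores are learned unigram pieces with log-probability scores that serve as a character-level fallback. A minimal sketch for inspecting the binary model directly (assuming the `sentencepiece` package and a local checkout; the tokenizer replaces spaces with underscores before encoding):

    import sentencepiece as spm

    sp = spm.SentencePieceProcessor(model_file="iupac_gpt/iupac_spm.model")
    print(sp.get_piece_size())                                  # 1391, matching the table above
    print(sp.encode("2-acetyloxybenzoic_acid", out_type=str))   # IUPAC name split into fragments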
iupac-gpt/iupac_gpt/iupac_tokenization.py ADDED
@@ -0,0 +1,131 @@
1
+ from transformers import (
2
+ AdamW,
3
+ DataCollatorWithPadding,
4
+ HfArgumentParser,
5
+ T5Config,
6
+ T5ForConditionalGeneration,
7
+ T5Tokenizer,
8
+ Trainer,
9
+ TrainingArguments,
10
+ )
11
+ from torch.utils.data import DataLoader
12
+ import os
13
+ import tempfile
14
+ import re
15
+ import pandas as pd
16
+ import numpy as np
17
+ from typing import Dict, Optional
18
+ from dataclasses import dataclass, field
19
+ import logging
20
+
21
+ import torch
22
+ from torch.nn.utils.rnn import pad_sequence
23
+ from torch.optim.lr_scheduler import LambdaLR
24
+ import os.path as pt
25
+ import torch.optim as optim
26
+ import torch.nn as nn
27
+ from tqdm import tqdm
28
+ from torch.autograd import Variable
29
+ from .iupac_dataset import IUPACDataset
30
+ import os
31
+ #os.environ["CUDA_VISIBLE_DEVICES"]="0"
32
+
33
+
34
+ class T5Collator:
35
+ def __init__(self, pad_token_id):
36
+ super().__init__()
37
+ self.pad_token_id = pad_token_id
38
+ def __call__(self, records):
39
+ # records is a list of dicts
40
+ batch = {}
41
+ padvals = {"input_ids": self.pad_token_id,'labels':-100}
42
+ for k in records[0]:
43
+ if k in padvals:
44
+ batch[k] = pad_sequence([torch.tensor(r[k]) for r in records],
45
+ batch_first=True,
46
+ padding_value=padvals[k])
47
+ else:
48
+ batch[k] = torch.FloatTensor([r[k] for r in records]) #torch.Tensor
49
+ return batch
50
+
51
+ class T5IUPACTokenizer(T5Tokenizer):
52
+ def prepare_for_tokenization(self, text, is_split_into_words=False,
53
+ **kwargs):
54
+ return re.sub(" ", "_", text), kwargs
55
+
56
+ def _decode(self, *args, **kwargs):
57
+ # replace "_" with " ", except for the _ in extra_id_#
58
+ text = super()._decode(*args, **kwargs)
59
+ text = re.sub("extra_id_", "extraAidA", text)
60
+ text = re.sub("_", " ", text)
61
+ text = re.sub("extraAidA", "extra_id_", text)
62
+ return text
63
+
64
+ def sentinels(self, sentinel_ids):
65
+ return self.vocab_size - sentinel_ids - 1
66
+
67
+ def sentinel_mask(self, ids):
68
+ return ((self.vocab_size - self._extra_ids <= ids) &
69
+ (ids < self.vocab_size))
70
+
71
+ def _tokenize(self, text, sample=False):
72
+ #pieces = super()._tokenize(text, sample=sample)
73
+ pieces = super()._tokenize(text)
74
+ # sentencepiece adds a non-printing token at the start. Remove it
75
+ return pieces[1:]
76
+
77
+ def prepare_input(data,device):
78
+ from collections.abc import Mapping
79
+ if isinstance(data, Mapping):
80
+ return type(data)({k: prepare_input(v,device) for k, v in data.items()})
81
+ elif isinstance(data, (tuple, list)):
82
+ return type(data)(prepare_input(v,device) for v in data)
83
+ elif isinstance(data, torch.Tensor):
84
+ kwargs = dict(device=device)
85
+ if data.dtype != torch.int64:
86
+ # NLP models inputs are int64 and those get adjusted to the right dtype of the
87
+ # embedding. Other models such as wav2vec2's inputs are already float and thus
88
+ # may need special handling to match the dtypes of the model
89
+ kwargs.update(dict(dtype=torch.int64))
90
+
91
+ return data.to(**kwargs)
92
+ return data
93
+
94
+ def get_data_loader(is_train=1):
95
+
96
+ full_path = '/home/jmwang/drugai/iupac-gpt/iupac_gpt/'
97
+
98
+ iupac_tokenizer = T5IUPACTokenizer(vocab_file=full_path+'iupac_spm.model')
99
+ iupac_vocab_size = iupac_tokenizer.vocab_size
100
+ print('iupac_vocab_size:',iupac_vocab_size)
101
+ if is_train:
102
+ torch.save(iupac_tokenizer, pt.join(full_path,"real_iupac_tokenizer.pt"))
103
+ print("training...",len(iupac_tokenizer))
104
+ else:
105
+ iupac_tokenizer = torch.load(pt.join(full_path,"real_iupac_tokenizer.pt"), map_location="cpu")
106
+ print('fine_tune...',len(iupac_tokenizer))
107
+
108
+ dataset_filename = 'data/pubchem_iupac_smile_gpt.csv'
109
+ target_col = "aLogP"
110
+ iupac_name_col = 'PUBCHEM_IUPAC_NAME' #canon_smiles
111
+ MAXLEN=1024
112
+ dataset_kwargs = {"dataset_dir":'/home/jmwang/drugai/iupac-gpt',"dataset_filename": dataset_filename,"tokenizer": iupac_tokenizer,"max_length": MAXLEN,"target_col": target_col,'dataset_size':None,"iupac_name_col":iupac_name_col}
113
+ train_dataset = IUPACDataset(**dataset_kwargs)
114
+ collator = T5Collator(iupac_tokenizer.pad_token_id)
115
+ train_dataloader = DataLoader(train_dataset,batch_size=64,collate_fn=collator,shuffle=True)
116
+
117
+ return train_dataloader,iupac_tokenizer
118
+
119
+ if __name__ == "__main__":
120
+
121
+ train_dataloader,iupac_tokenizer = get_data_loader(is_train=1)
122
+ pbar = tqdm(train_dataloader)
123
+ device = 'cpu'
124
+ for inputs in pbar:
125
+
126
+ src_label = Variable(inputs["labels"].to(device))
127
+ inputs = prepare_input(inputs,device)
128
+ src = Variable(inputs["input_ids"].to(device))
129
+ #self.tokenizer._convert_token_to_id
130
+
131
+ print(src[:,:].shape,src_label)
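The tokenizer's space handling is easy to check in isolation: `prepare_for_tokenization` maps spaces to underscores before SentencePiece runs, and `_decode` maps them back. A minimal round-trip sketch (the vocab path is a placeholder; the example string is the PubChem IUPAC name of aspirin):

    tok = T5IUPACTokenizer(vocab_file="iupac_gpt/iupac_spm.model")
    ids = tok("2-acetyloxybenzoic acid")["input_ids"]
    print(ids[-1] == tok.eos_token_id)                # True: </s> is appended automatically
    print(tok.decode(ids, skip_special_tokens=True))  # underscores restored to spaces on decode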
iupac-gpt/iupac_gpt/iupac_tokenization_class.py ADDED
@@ -0,0 +1,131 @@
1
+ from transformers import (
2
+ AdamW,
3
+ DataCollatorWithPadding,
4
+ HfArgumentParser,
5
+ T5Config,
6
+ T5ForConditionalGeneration,
7
+ T5Tokenizer,
8
+ Trainer,
9
+ TrainingArguments,
10
+ )
11
+ from torch.utils.data import DataLoader
12
+ import os
13
+ import tempfile
14
+ import re
15
+ import pandas as pd
16
+ import numpy as np
17
+ from typing import Dict, Optional
18
+ from dataclasses import dataclass, field
19
+ import logging
20
+
21
+ import torch
22
+ from torch.nn.utils.rnn import pad_sequence
23
+ from torch.optim.lr_scheduler import LambdaLR
24
+ import os.path as pt
25
+ import torch.optim as optim
26
+ import torch.nn as nn
27
+ from tqdm import tqdm
28
+ from torch.autograd import Variable
29
+ from .iupac_dataset_class import IUPACDataset
30
+ import os
31
+ #os.environ["CUDA_VISIBLE_DEVICES"]="0"
32
+
33
+
34
+ class T5Collator:
35
+ def __init__(self, pad_token_id):
36
+ super().__init__()
37
+ self.pad_token_id = pad_token_id
38
+ def __call__(self, records):
39
+ # records is a list of dicts
40
+ batch = {}
41
+ padvals = {"input_ids": self.pad_token_id,'attention_mask':0}
42
+ for k in records[0]:
43
+ if k in padvals:
44
+ batch[k] = pad_sequence([torch.tensor(r[k]) for r in records],
45
+ batch_first=True,
46
+ padding_value=padvals[k])
47
+ else:
48
+ batch[k] = torch.LongTensor([r[k] for r in records]) #torch.Tensor LongTensor FloatTensor
49
+ return batch
50
+
51
+ class T5IUPACTokenizer(T5Tokenizer):
52
+ def prepare_for_tokenization(self, text, is_split_into_words=False,
53
+ **kwargs):
54
+ return re.sub(" ", "_", text), kwargs
55
+
56
+ def _decode(self, *args, **kwargs):
57
+ # replace "_" with " ", except for the _ in extra_id_#
58
+ text = super()._decode(*args, **kwargs)
59
+ text = re.sub("extra_id_", "extraAidA", text)
60
+ text = re.sub("_", " ", text)
61
+ text = re.sub("extraAidA", "extra_id_", text)
62
+ return text
63
+
64
+ def sentinels(self, sentinel_ids):
65
+ return self.vocab_size - sentinel_ids - 1
66
+
67
+ def sentinel_mask(self, ids):
68
+ return ((self.vocab_size - self._extra_ids <= ids) &
69
+ (ids < self.vocab_size))
70
+
71
+ def _tokenize(self, text, sample=False):
72
+ #pieces = super()._tokenize(text, sample=sample)
73
+ pieces = super()._tokenize(text)
74
+ # sentencepiece adds a non-printing token at the start. Remove it
75
+ return pieces[1:]
76
+
77
+ def prepare_input_class(data,device):
78
+ from collections.abc import Mapping
79
+ if isinstance(data, Mapping):
80
+ return type(data)({k: prepare_input_class(v,device) for k, v in data.items()})
81
+ elif isinstance(data, (tuple, list)):
82
+ return type(data)(prepare_input_class(v,device) for v in data)
83
+ elif isinstance(data, torch.Tensor):
84
+ kwargs = dict(device=device)
85
+ if data.dtype != torch.int64:
86
+ # NLP models inputs are int64 and those get adjusted to the right dtype of the
87
+ # embedding. Other models such as wav2vec2's inputs are already float and thus
88
+ # may need special handling to match the dtypes of the model
89
+ kwargs.update(dict(dtype=torch.int64))
90
+
91
+ return data.to(**kwargs)
92
+ return data
93
+
94
+ def get_data_loader_class(is_train=1):
95
+
96
+ full_path = '/root/autodl-tmp/wjm/iupac-gpt/iupac_gpt/'
97
+
98
+ iupac_tokenizer = T5IUPACTokenizer(vocab_file=full_path+'iupac_spm.model')
99
+ iupac_vocab_size = iupac_tokenizer.vocab_size
100
+ print('iupac_vocab_size:',iupac_vocab_size)
101
+ if is_train:
102
+ torch.save(iupac_tokenizer, pt.join(full_path,"real_iupac_tokenizer.pt"))
103
+ print("training...",len(iupac_tokenizer))
104
+ else:
105
+ iupac_tokenizer = torch.load(pt.join(full_path,"real_iupac_tokenizer.pt"), map_location="cpu")
106
+ print('fine_tune...',len(iupac_tokenizer))
107
+
108
+ dataset_filename = 'iupacs_logp.csv' #'./pubchem_iupac_smile_gpt.csv'
109
+ target_col = "LogP" #"aLogP"
110
+ iupac_name_col = 'iupac' #'PUBCHEM_IUPAC_NAME'
111
+ MAXLEN=1024
112
+ dataset_kwargs = {"dataset_dir":full_path,"dataset_filename": dataset_filename,"tokenizer": iupac_tokenizer,"max_length": MAXLEN,"target_col": target_col,'dataset_size':None,"iupac_name_col":iupac_name_col}
113
+ train_dataset = IUPACDataset(**dataset_kwargs)
114
+ collator = T5Collator(iupac_tokenizer.pad_token_id)
115
+ train_dataloader = DataLoader(train_dataset,batch_size=64,collate_fn=collator,shuffle=True)
116
+
117
+ return train_dataloader,iupac_tokenizer
118
+
119
+ if __name__ == "__main__":
120
+
121
+ train_dataloader,iupac_tokenizer = get_data_loader_class(is_train=1)
122
+ pbar = tqdm(train_dataloader)
123
+ device = 'cpu'
124
+ for inputs in pbar:
125
+
126
+ src_label = Variable(inputs["labels"].to(device))
127
+ inputs = prepare_input_class(inputs,device)
128
+ src = Variable(inputs["input_ids"].to(device))
129
+ #self.tokenizer._convert_token_to_id
130
+
131
+ print(src[:,:].shape,src_label)
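Relative to iupac_tokenization.py above, only the dataset module and the collator change: attention_mask is padded with 0 and the labels are collated into a LongTensor, i.e. integer class targets rather than float regression targets. A tiny padding sketch with made-up token ids:

    collator = T5Collator(pad_token_id=0)
    batch = collator([
        {"input_ids": [2, 158, 163, 1], "attention_mask": [1, 1, 1, 1], "labels": 1},
        {"input_ids": [2, 159, 1],      "attention_mask": [1, 1, 1],    "labels": 0},
    ])
    # batch["input_ids"]      -> shape (2, 4); the shorter row is right-padded with 0 (<pad>)
    # batch["attention_mask"] -> padded with 0 in the same positions
    # batch["labels"]         -> tensor([1, 0]) of dtype int64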
iupac-gpt/iupac_gpt/iupac_tokenization_iupac.py ADDED
@@ -0,0 +1,134 @@
1
+ from transformers import (
2
+ AdamW,
3
+ DataCollatorWithPadding,
4
+ HfArgumentParser,
5
+ T5Config,
6
+ T5ForConditionalGeneration,
7
+ T5Tokenizer,
8
+ Trainer,
9
+ TrainingArguments,
10
+ )
11
+ from torch.utils.data import DataLoader
12
+ import os
13
+ import tempfile
14
+ import re
15
+ import pandas as pd
16
+ import numpy as np
17
+ from typing import Dict, Optional
18
+ from dataclasses import dataclass, field
19
+ import logging
20
+
21
+ import torch
22
+ from torch.nn.utils.rnn import pad_sequence
23
+ from torch.optim.lr_scheduler import LambdaLR
24
+ import os.path as pt
25
+ import torch.optim as optim
26
+ import torch.nn as nn
27
+ from tqdm import tqdm
28
+ from torch.autograd import Variable
29
+ from .iupac_dataset import IUPACDataset
30
+ import os
31
+ #os.environ["CUDA_VISIBLE_DEVICES"]="0"
32
+
33
+
34
+ class T5Collator:
35
+ def __init__(self, pad_token_id):
36
+ super().__init__()
37
+ self.pad_token_id = pad_token_id
38
+ def __call__(self, records):
39
+ # records is a list of dicts
40
+ batch = {}
41
+ padvals = {"input_ids": self.pad_token_id,'attention_mask':0,'labels':-100}
42
+ for k in records[0]:
43
+ if k in padvals:
44
+ batch[k] = pad_sequence([torch.tensor(r[k]) for r in records],
45
+ batch_first=True,
46
+ padding_value=padvals[k])
47
+ else:
48
+ batch[k] = torch.FloatTensor([r[k] for r in records]) #torch.Tensor
49
+ return batch
50
+
51
+ class T5IUPACTokenizer(T5Tokenizer):
52
+ def prepare_for_tokenization(self, text, is_split_into_words=False,
53
+ **kwargs):
54
+ return re.sub(" ", "_", text), kwargs
55
+
56
+ def _decode(self, *args, **kwargs):
57
+ # replace "_" with " ", except for the _ in extra_id_#
58
+ text = super()._decode(*args, **kwargs)
59
+ text = re.sub("extra_id_", "extraAidA", text)
60
+ text = re.sub("_", " ", text)
61
+ text = re.sub("extraAidA", "extra_id_", text)
62
+ return text
63
+
64
+ def sentinels(self, sentinel_ids):
65
+ return self.vocab_size - sentinel_ids - 1
66
+
67
+ def sentinel_mask(self, ids):
68
+ return ((self.vocab_size - self._extra_ids <= ids) &
69
+ (ids < self.vocab_size))
70
+
71
+ def _tokenize(self, text, sample=False):
72
+ #pieces = super()._tokenize(text, sample=sample)
73
+ pieces = super()._tokenize(text)
74
+ # sentencepiece adds a non-printing token at the start. Remove it
75
+ return pieces[1:]
76
+
77
+ def prepare_input(data,device):
78
+ from collections.abc import Mapping
79
+ if isinstance(data, Mapping):
80
+ return type(data)({k: prepare_input(v,device) for k, v in data.items()})
81
+ elif isinstance(data, (tuple, list)):
82
+ return type(data)(prepare_input(v,device) for v in data)
83
+ elif isinstance(data, torch.Tensor):
84
+ kwargs = dict(device=device)
85
+ if data.dtype != torch.int64:
86
+ # NLP models inputs are int64 and those get adjusted to the right dtype of the
87
+ # embedding. Other models such as wav2vec2's inputs are already float and thus
88
+ # may need special handling to match the dtypes of the model
89
+ kwargs.update(dict(dtype=torch.int64))
90
+
91
+ return data.to(**kwargs)
92
+ return data
93
+
94
+ def get_data_loader(is_train=1,dataset_filename = './pubchem_iupac_smile_gpt.csv'):
95
+
96
+ full_path = '/home/jmwang/drugai/iupac-gpt/iupac_gpt/'
97
+
98
+ iupac_tokenizer = T5IUPACTokenizer(vocab_file=full_path+'iupac_spm.model')
99
+ iupac_vocab_size = iupac_tokenizer.vocab_size
100
+ print('iupac_vocab_size:',iupac_vocab_size)
101
+ if is_train:
102
+ torch.save(iupac_tokenizer, pt.join(full_path,"real_iupac_tokenizer.pt"))
103
+ print("training...",len(iupac_tokenizer))
104
+ else:
105
+ iupac_tokenizer = torch.load(pt.join(full_path,"real_iupac_tokenizer.pt"), map_location="cpu")
106
+ print('fine_tune...',len(iupac_tokenizer))
107
+
108
+ target_col = "aLogP"
109
+ iupac_name_col = 'PUBCHEM_IUPAC_NAME'
110
+ MAXLEN=1024
111
+ dataset_kwargs = {"dataset_dir":full_path,"dataset_filename": dataset_filename,"tokenizer": iupac_tokenizer,"max_length": MAXLEN,"target_col": target_col,'dataset_size':None,"iupac_name_col":iupac_name_col}
112
+ train_dataset = IUPACDataset(**dataset_kwargs)
113
+
114
+ #for i in train_dataset:
115
+ # train_dataset[i]=train_dataset[i].to(device)
116
+
117
+ collator = T5Collator(iupac_tokenizer.pad_token_id)
118
+ train_dataloader = DataLoader(train_dataset,batch_size=64,collate_fn=collator,shuffle=True)
119
+
120
+ return train_dataloader,iupac_tokenizer
121
+
122
+ if __name__ == "__main__":
123
+
124
+ train_dataloader,iupac_tokenizer = get_data_loader(is_train=1)
125
+ pbar = tqdm(train_dataloader)
126
+ device = 'cpu'
127
+ for inputs in pbar:
128
+
129
+ src_label = Variable(inputs["labels"].to(device))
130
+ inputs = prepare_input(inputs,device)
131
+ src = Variable(inputs["input_ids"].to(device))
132
+ #self.tokenizer._convert_token_to_id
133
+
134
+ print(src[:,:].shape,src_label)
iupac-gpt/iupac_gpt/iupac_tokenization_pro.py ADDED
@@ -0,0 +1,131 @@
1
+ from transformers import (
2
+ AdamW,
3
+ DataCollatorWithPadding,
4
+ HfArgumentParser,
5
+ T5Config,
6
+ T5ForConditionalGeneration,
7
+ T5Tokenizer,
8
+ Trainer,
9
+ TrainingArguments,
10
+ )
11
+ from torch.utils.data import DataLoader
12
+ import os
13
+ import tempfile
14
+ import re
15
+ import pandas as pd
16
+ import numpy as np
17
+ from typing import Dict, Optional
18
+ from dataclasses import dataclass, field
19
+ import logging
20
+
21
+ import torch
22
+ from torch.nn.utils.rnn import pad_sequence
23
+ from torch.optim.lr_scheduler import LambdaLR
24
+ import os.path as pt
25
+ import torch.optim as optim
26
+ import torch.nn as nn
27
+ from tqdm import tqdm
28
+ from torch.autograd import Variable
29
+ from .iupac_dataset_pro import IUPACDataset
30
+ import os
31
+ #os.environ["CUDA_VISIBLE_DEVICES"]="0"
32
+
33
+
34
+ class T5Collator:
35
+ def __init__(self, pad_token_id):
36
+ super().__init__()
37
+ self.pad_token_id = pad_token_id
38
+ def __call__(self, records):
39
+ # records is a list of dicts
40
+ batch = {}
41
+ padvals = {"input_ids": self.pad_token_id,'attention_mask':0}
42
+ for k in records[0]:
43
+ if k in padvals:
44
+ batch[k] = pad_sequence([torch.tensor(r[k]) for r in records],
45
+ batch_first=True,
46
+ padding_value=padvals[k])
47
+ else:
48
+ batch[k] = torch.FloatTensor([r[k] for r in records]) #torch.Tensor LongTensor FloatTensor
49
+ return batch
50
+
51
+ class T5IUPACTokenizer(T5Tokenizer):
52
+ def prepare_for_tokenization(self, text, is_split_into_words=False,
53
+ **kwargs):
54
+ return re.sub(" ", "_", text), kwargs
55
+
56
+ def _decode(self, *args, **kwargs):
57
+ # replace "_" with " ", except for the _ in extra_id_#
58
+ text = super()._decode(*args, **kwargs)
59
+ text = re.sub("extra_id_", "extraAidA", text)
60
+ text = re.sub("_", " ", text)
61
+ text = re.sub("extraAidA", "extra_id_", text)
62
+ return text
63
+
64
+ def sentinels(self, sentinel_ids):
65
+ return self.vocab_size - sentinel_ids - 1
66
+
67
+ def sentinel_mask(self, ids):
68
+ return ((self.vocab_size - self._extra_ids <= ids) &
69
+ (ids < self.vocab_size))
70
+
71
+ def _tokenize(self, text, sample=False):
72
+ #pieces = super()._tokenize(text, sample=sample)
73
+ pieces = super()._tokenize(text)
74
+ # sentencepiece adds a non-printing token at the start. Remove it
75
+ return pieces[1:]
76
+
77
+ def prepare_input_pro(data,device):
78
+ from collections.abc import Mapping
79
+ if isinstance(data, Mapping):
80
+ return type(data)({k: prepare_input_pro(v,device) for k, v in data.items()})
81
+ elif isinstance(data, (tuple, list)):
82
+ return type(data)(prepare_input_pro(v,device) for v in data)
83
+ elif isinstance(data, torch.Tensor):
84
+ kwargs = dict(device=device)
85
+ if data.dtype != torch.int64:
86
+ # NLP models inputs are int64 and those get adjusted to the right dtype of the
87
+ # embedding. Other models such as wav2vec2's inputs are already float and thus
88
+ # may need special handling to match the dtypes of the model
89
+ kwargs.update(dict(dtype=torch.int64))
90
+
91
+ return data.to(**kwargs)
92
+ return data
93
+
94
+ def get_data_loader_pro(is_train=1):
95
+
96
+ full_path = '/root/autodl-tmp/wjm/iupac-gpt/iupac_gpt/'
97
+
98
+ iupac_tokenizer = T5IUPACTokenizer(vocab_file=full_path+'iupac_spm.model')
99
+ iupac_vocab_size = iupac_tokenizer.vocab_size
100
+ print('iupac_vocab_size:',iupac_vocab_size)
101
+ if is_train:
102
+ torch.save(iupac_tokenizer, pt.join(full_path,"real_iupac_tokenizer.pt"))
103
+ print("training...",len(iupac_tokenizer))
104
+ else:
105
+ iupac_tokenizer = torch.load(pt.join(full_path,"real_iupac_tokenizer.pt"), map_location="cpu")
106
+ print('fine_tune...',len(iupac_tokenizer))
107
+
108
+ dataset_filename = 'iupacs_logp.csv'
109
+ target_col = "LogP"
110
+ iupac_name_col = 'iupac'
111
+ MAXLEN=1024
112
+ dataset_kwargs = {"dataset_dir":full_path,"dataset_filename": dataset_filename,"tokenizer": iupac_tokenizer,"max_length": MAXLEN,"target_col": target_col,'dataset_size':None,"iupac_name_col":iupac_name_col}
113
+ train_dataset = IUPACDataset(**dataset_kwargs)
114
+ collator = T5Collator(iupac_tokenizer.pad_token_id)
115
+ train_dataloader = DataLoader(train_dataset,batch_size=64,collate_fn=collator,shuffle=True)
116
+
117
+ return train_dataloader,iupac_tokenizer
118
+
119
+ if __name__ == "__main__":
120
+
121
+ train_dataloader,iupac_tokenizer = get_data_loader_pro(is_train=1)
122
+ pbar = tqdm(train_dataloader)
123
+ device = 'cpu'
124
+ for inputs in pbar:
125
+
126
+ src_label = Variable(inputs["labels"].to(device))
127
+ inputs = prepare_input_pro(inputs,device)
128
+ src = Variable(inputs["input_ids"].to(device))
129
+ #self.tokenizer._convert_token_to_id
130
+
131
+ print(src[:,:].shape,src_label)
iupac-gpt/iupac_gpt/iupacs_logp.csv ADDED
The diff for this file is too large to render. See raw diff
 
iupac-gpt/iupac_gpt/language_modeling.py ADDED
@@ -0,0 +1,68 @@
1
+ """Pytorch-lightning module for causal language modeling.
2
+ """
3
+
4
+ __all__ = ("GPT2LitModel",)
5
+
6
+ import pytorch_lightning as pl
7
+ import torch
8
+
9
+
10
+ class GPT2LitModel(pl.LightningModule):
11
+ """Lightning module for autoregressive (causal) transformer language modeling.
12
+ Successfully tested on HuggingFace `GPT2LMHeadModel`.
13
+ """
14
+
15
+ def __init__(self, transformer, batch_size: int, learning_rate: float,
16
+ final_learning_rate: float, weight_decay: float, adam_eps: float,
17
+ adam_betas: tuple, scheduler_T_max: int,
18
+ save_model_every: int = 10_000, checkpoint: str = ""):
19
+ super().__init__()
20
+ self.save_hyperparameters(ignore=("transformer", "save_model_every",
21
+ "checkpoint"))
22
+ self.transformer = transformer
23
+ self.save_model_every = save_model_every
24
+ self.checkpoint = checkpoint or "./gpt2litmodel-logs"
25
+
26
+ def forward(self, *args, **kwargs):
27
+ return self.transformer(*args, **kwargs)
28
+
29
+ def training_step(self, batch, batch_idx):
30
+ outputs = self(**batch)
31
+
32
+ if self.save_model_every > 0 and batch_idx % self.save_model_every == 0:
33
+ self.transformer.save_pretrained(self.checkpoint)
34
+
35
+ return {'loss': outputs['loss']}
36
+
37
+ def training_epoch_end(self, outputs):
38
+ if self.save_model_every > 0:
39
+ self.transformer.save_pretrained(self.checkpoint)
40
+
41
+ losses = [step_output["loss"] for step_output in outputs]
42
+ mean_loss = torch.tensor(losses).mean()
43
+ ppl = torch.exp(mean_loss)
44
+
45
+ self.log("ppl", ppl, on_step=False, on_epoch=True, prog_bar=True)
46
+
47
+ def configure_optimizers(self):
48
+ parameters = list(self.named_parameters())  # materialize: the generator is iterated twice below
49
+ no_decay = ["bias", "LayerNorm.weight"]
50
+ grouped_parameters = [
51
+ {"params": [p for n, p in parameters
52
+ if not any(nd in n for nd in no_decay)],
53
+ "weight_decay": self.hparams.weight_decay},
54
+ {"params": [p for n, p in parameters
55
+ if any(nd in n for nd in no_decay)],
56
+ "weight_decay": 0.0}]
57
+ optimizer = torch.optim.Adam(
58
+ grouped_parameters, lr=self.hparams.learning_rate,
59
+ weight_decay=self.hparams.weight_decay,
60
+ eps=self.hparams.adam_eps, betas=self.hparams.adam_betas)
61
+
62
+ lr_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
63
+ optimizer, self.hparams.scheduler_T_max,
64
+ eta_min=self.hparams.final_learning_rate)
65
+
66
+ return {'optimizer': optimizer,
67
+ 'lr_scheduler': {'scheduler': lr_scheduler,
68
+ 'interval': 'step', 'frequency': 1}}
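A minimal sketch of wiring this module to a HuggingFace GPT-2 backbone; the hyperparameter values are the ones used in notebooks/iupac_language-modeling.py, and `train_dataloader` is assumed to be a DataLoader yielding input_ids/attention_mask/labels batches such as the ones built by the collators above:

    import pytorch_lightning as pl
    from transformers import GPT2Config, GPT2LMHeadModel

    config = GPT2Config(vocab_size=1491, n_layer=8, n_head=8, n_embd=256,
                        bos_token_id=2, eos_token_id=1)
    lit_model = GPT2LitModel(GPT2LMHeadModel(config),
                             batch_size=64, learning_rate=5e-4, final_learning_rate=5e-8,
                             weight_decay=0.0, adam_eps=1e-8, adam_betas=(0.9, 0.999),
                             scheduler_T_max=1_000, checkpoint="./gpt2litmodel-logs")
    trainer = pl.Trainer(gpus=[0], max_epochs=10)
    # trainer.fit(lit_model, train_dataloader)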
iupac-gpt/iupac_gpt/pubchem_iupac_smile_gpt.csv ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b052dd26a26107e9c86a2b155a693669fb1f4fbf498762abe2d19fbaa6867567
3
+ size 2825708735
iupac-gpt/iupac_gpt/real_iupac_tokenizer.pt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1696e3f3060bcce33275387e4eb4e175f4c64a015962ac4f3c5f49f25ed6f335
3
+ size 3529
iupac-gpt/iupac_gpt/tokenization.py ADDED
@@ -0,0 +1,193 @@
1
+ """SMILES-based tokenization utilities.
2
+ """
3
+
4
+ __all__ = ("PAD_TOKEN", "BOS_TOKEN", "EOS_TOKEN", "UNK_TOKEN", "SUFFIX",
5
+ "SPECIAL_TOKENS", "PAD_TOKEN_ID", "BOS_TOKEN_ID", "EOS_TOKEN_ID",
6
+ "UNK_TOKEN_ID", "SMILESBPETokenizer", "SMILESAlphabet")
7
+
8
+ from collections.abc import Collection, Iterator
9
+ from dataclasses import dataclass
10
+ from itertools import chain
11
+ from typing import Any, Dict, FrozenSet, List, Optional, Set, Tuple, Union
12
+ from tokenizers import AddedToken, Tokenizer
13
+ from tokenizers import decoders, models, normalizers, processors, trainers
14
+ from tokenizers.implementations import BaseTokenizer
15
+ from transformers import PreTrainedTokenizerFast
16
+
17
+
18
+ SUFFIX, PAD_TOKEN, BOS_TOKEN, EOS_TOKEN, UNK_TOKEN = "", "<pad>", "<s>", "</s>", "<unk>"
19
+ SPECIAL_TOKENS = [PAD_TOKEN, BOS_TOKEN, EOS_TOKEN, UNK_TOKEN]
20
+ PAD_TOKEN_ID, BOS_TOKEN_ID, EOS_TOKEN_ID, UNK_TOKEN_ID = range(4)
21
+
22
+
23
+ class SMILESBPETokenizer(BaseTokenizer):
24
+ """Tokenizes SMILES strings and applies BPE.
25
+
26
+ Args:
27
+ vocab (`str` or `dict`, optional, defaults to `None`):
28
+ Token vocabulary.
29
+ merges (`str` or `dict` or `tuple`, optional, defaults to `None`):
30
+ BPE merges.
31
+ unk_token (`str` or `tokenizers.AddedToken`, optional, defaults to "<unk>")
32
+ suffix (`str`, defaults to "")
33
+ dropout (`float`, defaults to `None`)
34
+
35
+ Examples:
36
+ >>> tokenizer = SMILESBPETokenizer()
37
+ >>> tokenizer.train("path-to-smiles-strings-file")
38
+ Tokenization logs...
39
+ >>> tokenizer.save_model("checkpoints-path")
40
+ >>> same_tokenizer = SMILESBPETokenizer.from_file("checkpoints-path/vocab.json",
41
+ ... "checkpoints-path/merges.txt")
42
+ """
43
+
44
+ def __init__(
45
+ self,
46
+ vocab: Optional[Union[str, Dict[str, int]]] = None,
47
+ merges: Optional[Union[str, Dict[Tuple[int, int], Tuple[int, int]]]] = None,
48
+ unk_token: Union[str, AddedToken] = "<unk>",
49
+ suffix: str = SUFFIX,
50
+ dropout: Optional[float] = None,
51
+ ) -> None:
52
+ unk_token_str = str(unk_token)
53
+
54
+ tokenizer = Tokenizer(models.BPE(vocab, merges, dropout=dropout,
55
+ unk_token=unk_token_str,
56
+ end_of_word_suffix=suffix))
57
+
58
+ if tokenizer.token_to_id(unk_token_str) is not None:
59
+ tokenizer.add_special_tokens([unk_token_str])
60
+
61
+ tokenizer.normalizer = normalizers.Strip(left=False, right=True)
62
+ tokenizer.decoder = decoders.Metaspace(add_prefix_space=True)
63
+ tokenizer.post_processor = processors.TemplateProcessing(
64
+ single=f"{BOS_TOKEN} $A {EOS_TOKEN}",
65
+ special_tokens=[(BOS_TOKEN, BOS_TOKEN_ID), (EOS_TOKEN, EOS_TOKEN_ID)])
66
+
67
+ parameters = {"model": "BPE", "unk_token": unk_token, "suffix": suffix,
68
+ "dropout": dropout}
69
+
70
+ super().__init__(tokenizer, parameters)
71
+
72
+ @classmethod
73
+ def from_file(cls, vocab_filename: str, merges_filename: str, **kwargs) \
74
+ -> "SMILESBPETokenizer":
75
+ vocab, merges = models.BPE.read_file(vocab_filename, merges_filename)
76
+ return cls(vocab, merges, **kwargs)
77
+
78
+ def train(
79
+ self,
80
+ files: Union[str, List[str]],
81
+ vocab_size: int = 1_000,
82
+ min_frequency: int = 2,
83
+ special_tokens: List[Union[str, AddedToken]] = None,
84
+ limit_alphabet: int = 200,
85
+ initial_alphabet: List[str] = None,
86
+ suffix: Optional[str] = SUFFIX,
87
+ show_progress: bool = True,
88
+ ) -> None:
89
+ special_tokens = special_tokens or SPECIAL_TOKENS
90
+ initial_alphabet = initial_alphabet or []
91
+
92
+ trainer = trainers.BpeTrainer(vocab_size=vocab_size,
93
+ min_frequency=min_frequency,
94
+ special_tokens=special_tokens,
95
+ limit_alphabet=limit_alphabet,
96
+ initial_alphabet=initial_alphabet,
97
+ end_of_word_suffix=suffix,
98
+ show_progress=show_progress)
99
+ if isinstance(files, str):
100
+ files = [files]
101
+ self._tokenizer.train(files, trainer=trainer)
102
+
103
+ def train_from_iterator(
104
+ self,
105
+ iterator: Iterator,
106
+ vocab_size: int = 1_000,
107
+ min_frequency: int = 2,
108
+ special_tokens: List[Union[str, AddedToken]] = None,
109
+ limit_alphabet: int = 200,
110
+ initial_alphabet: List[str] = None,
111
+ suffix: Optional[str] = SUFFIX,
112
+ show_progress: bool = True,
113
+ ) -> None:
114
+ special_tokens = special_tokens or SPECIAL_TOKENS
115
+ initial_alphabet = initial_alphabet or []
116
+
117
+ trainer = trainers.BpeTrainer(vocab_size=vocab_size,
118
+ min_frequency=min_frequency,
119
+ special_tokens=special_tokens,
120
+ limit_alphabet=limit_alphabet,
121
+ initial_alphabet=initial_alphabet,
122
+ end_of_word_suffix=suffix,
123
+ show_progress=show_progress)
124
+ self._tokenizer.train_from_iterator(iterator, trainer=trainer)
125
+
126
+ @staticmethod
127
+ def get_hf_tokenizer(
128
+ tokenizer_file: str,
129
+ special_tokens: Optional[Dict[str, str]] = None,
130
+ model_max_length: int = 512,
131
+ *init_inputs, **kwargs
132
+ ) -> PreTrainedTokenizerFast:
133
+ """Gets HuggingFace tokenizer from the pretrained `tokenizer_file`. Optionally,
134
+ appends `special_tokens` to vocabulary and sets `model_max_length`.
135
+ """
136
+ tokenizer = PreTrainedTokenizerFast(tokenizer_file=tokenizer_file,
137
+ *init_inputs, **kwargs)
138
+ special_tokens = special_tokens or dict(zip(
139
+ ["pad_token", "bos_token", "eos_token", "unk_token"],
140
+ SPECIAL_TOKENS))
141
+ tokenizer.add_special_tokens(special_tokens)
142
+ tokenizer.model_max_length = model_max_length
143
+ return tokenizer
144
+
145
+
146
+ @dataclass(init=True, eq=False, repr=True, frozen=True)
147
+ class SMILESAlphabet(Collection):
148
+ atoms: FrozenSet[str] = frozenset([
149
+ 'Ac', 'Ag', 'Al', 'Am', 'Ar', 'As', 'At', 'Au', 'B', 'Ba', 'Be', 'Bh',
150
+ 'Bi', 'Bk', 'Br', 'C', 'Ca', 'Cd', 'Ce', 'Cf', 'Cl', 'Cm', 'Co', 'Cr',
151
+ 'Cs', 'Cu', 'Db', 'Dy', 'Er', 'Es', 'Eu', 'F', 'Fe', 'Fm', 'Fr', 'Ga',
152
+ 'Gd', 'Ge', 'H', 'He', 'Hf', 'Hg', 'Ho', 'Hs', 'I', 'In', 'Ir', 'K',
153
+ 'Kr', 'La', 'Li', 'Lr', 'Lu', 'Md', 'Mg', 'Mn', 'Mo', 'Mt', 'N', 'Na',
154
+ 'Nb', 'Nd', 'Ne', 'Ni', 'No', 'Np', 'O', 'Os', 'P', 'Pa', 'Pb', 'Pd',
155
+ 'Pm', 'Po', 'Pr', 'Pt', 'Pu', 'Ra', 'Rb', 'Re', 'Rf', 'Rh', 'Rn',
156
+ 'Ru', 'S', 'Sb', 'Sc', 'Se', 'Sg', 'Si', 'Sm', 'Sn', 'Sr', 'Ta', 'Tb',
157
+ 'Tc', 'Te', 'Th', 'Ti', 'Tl', 'Tm', 'U', 'V', 'W', 'Xe', 'Y', 'Yb',
158
+ 'Zn', 'Zr'
159
+ ])
160
+
161
+ # Bonds, charges, etc.
162
+ non_atoms: FrozenSet[str] = frozenset([
163
+ '-', '=', '#', ':', '(', ')', '.', '[', ']', '+', '-', '\\', '/', '*',
164
+ '1', '2', '3', '4', '5', '6', '7', '8', '9', '0',
165
+ '@', 'AL', 'TH', 'SP', 'TB', 'OH',
166
+ ])
167
+
168
+ additional: FrozenSet[str] = frozenset()
169
+
170
+ def __contains__(self, item: Any) -> bool:
171
+ return item in self.atoms or item in self.non_atoms
172
+
173
+ def __iter__(self):
174
+ return (token for token in chain(self.atoms, self.non_atoms))
175
+
176
+ def __len__(self) -> int:
177
+ return len(self.atoms) + len(self.non_atoms) + len(self.additional)
178
+
179
+ def get_alphabet(self) -> Set[str]:
180
+ alphabet = set()
181
+ for token in self.atoms:
182
+ if len(token) > 1:
183
+ alphabet.update(list(token))
184
+ alphabet.add(token[0].lower())
185
+ else:
186
+ alphabet.add(token)
187
+ alphabet.add(token.lower())
188
+ for token in chain(self.non_atoms, self.additional):
189
+ if len(token) > 1:
190
+ alphabet.update(list(token))
191
+ else:
192
+ alphabet.add(token)
193
+ return alphabet
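A minimal end-to-end sketch of this module (file paths are placeholders; vocab_size, min_frequency and model_max_length mirror the ChemBERTa-style hyperparameters quoted in the notebooks):

    alphabet = SMILESAlphabet()
    tokenizer = SMILESBPETokenizer(dropout=None)
    tokenizer.train("pubchem_smiles.txt", vocab_size=1_000, min_frequency=2,
                    initial_alphabet=list(alphabet.get_alphabet()))
    tokenizer.save("checkpoints/tokenizer.json")

    # Wrap for HuggingFace models: registers <pad>/<s>/</s>/<unk> and sets model_max_length.
    hf_tokenizer = SMILESBPETokenizer.get_hf_tokenizer("checkpoints/tokenizer.json",
                                                       model_max_length=512)
    print(hf_tokenizer("CC(=O)Oc1ccccc1C(=O)O")["input_ids"])  # aspirin SMILES, wrapped in <s> ... </s>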
iupac-gpt/nohup.out ADDED
The diff for this file is too large to render. See raw diff
 
iupac-gpt/notebooks/.ipynb_checkpoints/language-modeling-checkpoint.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
iupac-gpt/notebooks/iupac_head_view.html ADDED
The diff for this file is too large to render. See raw diff
 
iupac-gpt/notebooks/iupac_language-modeling.py ADDED
@@ -0,0 +1,236 @@
1
+ #!/usr/bin/env python
2
+ # coding: utf-8
3
+
4
+ # # Generative Pre-Training from Molecules
5
+
6
+ import os
7
+ #os.environ["CUDA_VISIBLE_DEVICES"] = ['1',"2"]
8
+ from pprint import pprint
9
+ import sys
10
+ sys.path.append('/root/autodl-tmp/wjm/iupac-gpt')
11
+ from tqdm import tqdm
12
+ try:
13
+ import iupac_gpt as gpt
14
+ except ImportError:
15
+ import sys
16
+ sys.path.extend([".."]) # Parent directory stores `smiles_gpt` package.
17
+ import iupac_gpt as gpt
18
+ import torch
19
+
20
+ # For demonstration purposes, we use only 10K subset of PubChem data made available by
21
+ # [ChemBERTa](https://arxiv.org/abs/2010.09885) developers. The original model was pretrained
22
+ # on the first 5M compounds with the following hyperparameters:
23
+ # ```python
24
+ # hyperparams = {"batch_size": 128, "max_epochs": 2, "max_length": 512,
25
+ # "learning_rate": 5e-4, "weight_decay": 0.0,
26
+ # "adam_eps": 1e-8, "adam_betas": (0.9, 0.999),
27
+ # "scheduler_T_max": 150_000, "final_learning_rate": 5e-8,
28
+ # "vocab_size": 1_000, "min_frequency": 2, "top_p": 0.96,
29
+ # "n_layer": 4, "n_head": 8, "n_embd": 512}
30
+ # ```
31
+ # Tokenizer, model, optimizer, scheduler, and trainer hyperparameters.
32
+ hyperparams = {"batch_size": 64, "max_epochs": 10, "max_length": 1280,
33
+ "learning_rate": 5e-4, "weight_decay": 0.0,
34
+ "adam_eps": 1e-8, "adam_betas": (0.9, 0.999),
35
+ "scheduler_T_max": 1_000, "final_learning_rate": 5e-8,
36
+ "vocab_size": 1491, "min_frequency": 2, "top_p": 0.96,
37
+ "n_layer": 8, "n_head": 8, "n_embd": 256}
38
+
39
+ gpus = [0] # Specify either a list of GPU devices or an integer (0 for no GPU).
40
+ num_workers = 24 # Number of dataloader worker processes.
41
+ # ## Tokenization
42
+ #
43
+ # `smiles_gpt.SMILESBPETokenizer` first splits SMILES strings into characters, runs
44
+ # byte-pair encoding, and augments the resulting list with `"<s>"` (beginning-of-SMILES) and
45
+ # `"</s>"` (end-of-SMILES) special tokens. `smiles_gpt.SMILESAlphabet` stores 72 possible
46
+ # characters as an initial vocabulary.
47
+ device = 'gpu'
48
+ train_dataloader,iupac_tokenizer = gpt.get_data_loader(is_train=1,dataset_filename = './pubchem_iupac_smile_gpt.csv')
49
+ pbar = tqdm(train_dataloader) #train_dataloader.cuda()
50
+
51
+
52
+ '''
53
+ for inputs in pbar:
54
+ src_label = Variable(inputs["labels"].to(device))
55
+ inputs = prepare_input(inputs,device)
56
+ src = Variable(inputs["input_ids"].to(device))
57
+ #self.tokenizer._convert_token_to_id
58
+
59
+ print(src[:,:].shape,src_label)
60
+ '''
61
+ tokenizer = iupac_tokenizer
62
+ #start mark <unk> 2, end mark </s> 1, pad <pad> 0
63
+
64
+ iupac_string = "2-amino-9-[4-hydroxy-3-(hydroxymethyl)-2-methylidenecyclopentyl]-1H-purin-6-one"
65
+ iupac_encoded = tokenizer(iupac_string)
66
+ iupac_encoded['input_ids'] = [2]+iupac_encoded['input_ids']
67
+
68
+ iupac_merges = [tokenizer.decode(i) for i in iupac_encoded['input_ids']]
69
+ #iupac_encoded['attention_mask']
70
+
71
+ print(iupac_encoded['input_ids'])
72
+ print(iupac_merges)
73
+
74
+ print(tokenizer.unk_token_id,tokenizer.eos_token_id,tokenizer.unk_token,tokenizer.eos_token,tokenizer.vocab_size) #2 1 1491
75
+ # ## Data Module
76
+ batch = next(iter(pbar))
77
+
78
+
79
+ # ## GPT-2 Model
80
+ #
81
+ # Now we load HuggingFace
82
+ # [`GPT2LMHeadModel`](https://huggingface.co/transformers/model_doc/gpt2.html#gpt2lmheadmodel)
83
+ # with the configuration composed of previously
84
+ # defined model hyperparameters. The model processes mini-batch of input ids and labels, then
85
+ # returns predictions and cross-entropy loss between labels and predictions.
86
+
87
+ from transformers import GPT2Config, GPT2LMHeadModel
88
+
89
+ config = GPT2Config(vocab_size=tokenizer.vocab_size,
90
+ bos_token_id=tokenizer.unk_token_id,
91
+ eos_token_id=tokenizer.eos_token_id,
92
+ n_layer=hyperparams["n_layer"],
93
+ n_head=hyperparams["n_head"],
94
+ n_embd=hyperparams["n_embd"],
95
+ n_positions=hyperparams["max_length"],
96
+ n_ctx=hyperparams["max_length"])
97
+ model = GPT2LMHeadModel(config)
98
+
99
+ #model= torch.nn.DataParallel(model.cuda(),device_ids=gpus,output_device=gpus[0])
100
+
101
+ outputs = model(**batch)
102
+ print(outputs.keys())
103
+
104
+ #['loss', 'logits', 'past_key_values']
105
+ # ## Trainer
106
+ #
107
+ # GPT-2 is trained with autoregressive language modeling objective:
108
+ # $$
109
+ # P(\boldsymbol{s}) = P(s_1) \cdot P(s_2 | s_1) \cdots P(s_T | s_1, \ldots, s_{T-1}) =
110
+ # \prod_{t=1}^{T} P(s_t | s_{j < t}),
111
+ # $$
112
+ # where $\boldsymbol{s}$ is a tokenized (encoded) SMILES string, $s_t$ is a token from pretrained
113
+ # vocabulary $\mathcal{V}$.
114
+ #
115
+ # We use `pytorch_lightning.Trainer` to train GPT-2. Since `Trainer` requires lightning modules,
116
+ # we import our
117
+ # [`smiles_gpt.GPT2LitModel`](https://github.com/sanjaradylov/smiles-gpt/blob/master/smiles_gpt/language_modeling.py#L10)
118
+ # wrapper that implements training phases for
119
+ # `GPT2LMHeadModel`, configures an `Adam` optimizer with `CosineAnnealingLR` scheduler, and
120
+ # logs average perplexity every epoch.
121
+
122
+ # In[8]:
123
+
124
+
125
+ from pytorch_lightning import Trainer
126
+ from pytorch_lightning.callbacks.early_stopping import EarlyStopping
127
+
128
+ checkpoint = "./checkpoints/iupac"
129
+
130
+
131
+ '''
132
+ trainer = Trainer(
133
+ gpus=gpus,
134
+ max_epochs=hyperparams["max_epochs"],
135
+ callbacks=[EarlyStopping("ppl", 0.1, 3)], #[EarlyStopping("ppl", 0.2, 2)]
136
+ auto_lr_find=False, # Set to True to search for optimal learning rate.
137
+ auto_scale_batch_size=False, # Set to True to scale batch size
138
+ # accelerator="dp" # Uncomment for GPU training.
139
+ accelerator="gpu", #devices=4,
140
+ strategy="ddp"
141
+ )
142
+ lit_model = gpt.GPT2LitModel(
143
+ model,
144
+ batch_size=hyperparams["batch_size"],
145
+ learning_rate=hyperparams["learning_rate"],
146
+ final_learning_rate=hyperparams["final_learning_rate"],
147
+ weight_decay=hyperparams["weight_decay"],
148
+ adam_eps=hyperparams["adam_eps"],
149
+ adam_betas=hyperparams["adam_betas"],
150
+ scheduler_T_max=hyperparams["scheduler_T_max"],
151
+ save_model_every=1, checkpoint=checkpoint)
152
+ trainer.fit(lit_model, train_dataloader)
153
+
154
+
155
+ #model.module.save_pretrained('./pretrained')
156
+ model.save_pretrained('./pretrained')
157
+
158
+ '''
159
+
160
+
161
+ # ## Interpretability
162
+ #
163
+ # [BertViz](https://github.com/jessevig/bertviz) inspects attention heads of transformers
164
+ # capturing specific patterns in data. Each head can be representative of some syntactic
165
+ # or short-/long-term relationships between tokens.
166
+
167
+ # In[9]:
168
+
169
+
170
+ import torch
171
+ from bertviz import head_view
172
+
173
+ input_ids_list = iupac_encoded['input_ids']
174
+ model = GPT2LMHeadModel.from_pretrained(checkpoint, output_attentions=True)
175
+ attention = model(torch.LongTensor(input_ids_list))[-1]
176
+ tokens = [tokenizer.decode(i) for i in input_ids_list]
177
+ print(input_ids_list,attention,tokens)
178
+ # Don't worry if a snippet is not displayed---just rerun this cell.
179
+ head_view(attention, tokens)
180
+
181
+
182
+
183
+ from bertviz import model_view
184
+
185
+ # Don't worry if a snippet is not displayed---just rerun this cell.
186
+ model_view(attention, tokens)
187
+
188
+
189
+ # ## Sampling
190
+ #
191
+ # Finally, we generate novel SMILES strings with top-$p$ sampling$-$i.e., sampling from the
192
+ # smallest vocabulary subset $\mathcal{V}^{(p)} \subset \mathcal{V}$ s.t. it takes up the most
193
+ # probable tokens whose cumulative probability mass exceeds $p$, $0 < p < 1$. Model
194
+ # terminates the procedure upon encountering `"</s>"` or reaching maximum number
195
+ # `hyperparams["max_length"]`. Special tokens are eventually removed.
196
+
197
+
198
+
199
+ import tqdm
200
+
201
+ model.eval() # Set the base model to evaluation mode.
202
+
203
+ generated_smiles_list = []
204
+ n_generated = 50000
205
+
206
+ for _ in tqdm.tqdm(range(n_generated)):
207
+ # Generate from "<unk>" so that the next token is arbitrary.
208
+ smiles_start = torch.LongTensor([[tokenizer.unk_token_id]])
209
+ # Get generated token IDs.
210
+ generated_ids = model.generate(smiles_start,
211
+ max_length=hyperparams["max_length"],
212
+ do_sample=True,top_p=hyperparams["top_p"],
213
+ repetition_penalty=1.2,
214
+ pad_token_id=tokenizer.eos_token_id)
215
+ # Decode the IDs into tokens and remove "<s>" and "</s>".
216
+ generated_smiles = tokenizer.decode(generated_ids[0],
217
+ skip_special_tokens=True)
218
+ generated_smiles_list.append(generated_smiles)
219
+
220
+ print(generated_smiles_list[:10])
221
+
222
+
223
+ import numpy as np
224
+ import pandas as pd
225
+
226
+ df2 = pd.DataFrame(generated_smiles_list, columns=['iupac'])
227
+
228
+ df2.to_csv("iupacGPT2-gen50K.csv",index=None,sep="|")
229
+
230
+
231
+
232
+
233
+
234
+
235
+
236
+
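
The Sampling comments in `iupac_language-modeling.py` describe top-p (nucleus) sampling only in prose, so the following self-contained sketch spells out the subset-selection rule on a made-up next-token distribution. It illustrates the rule itself, not the `model.generate(..., do_sample=True, top_p=...)` implementation the script actually calls; the probabilities and the `top_p` value here are arbitrary.

```python
# Toy illustration of the top-p (nucleus) selection rule described above.
import torch

# Hypothetical next-token distribution P(s_t | s_<t) over a 6-token vocabulary.
probs = torch.tensor([0.40, 0.25, 0.15, 0.10, 0.05, 0.05])
top_p = 0.90  # the scripts use hyperparams["top_p"] = 0.96

sorted_probs, sorted_idx = torch.sort(probs, descending=True)
cumulative = torch.cumsum(sorted_probs, dim=0)
# Keep a token if the probability mass accumulated *before* it is still below top_p;
# this yields the smallest prefix whose total mass reaches or exceeds top_p.
keep = (cumulative - sorted_probs) < top_p
nucleus_idx = sorted_idx[keep]
nucleus_probs = sorted_probs[keep] / sorted_probs[keep].sum()  # renormalize inside the nucleus

next_token = nucleus_idx[torch.multinomial(nucleus_probs, 1)]
print(nucleus_idx.tolist(), next_token.item())
```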
iupac-gpt/notebooks/iupac_language-modeling_retrain.py ADDED
@@ -0,0 +1,224 @@
+ #!/usr/bin/env python
+ # coding: utf-8
+
+ # # Generative Pre-Training from Molecules
+
+ import os
+ # os.environ["CUDA_VISIBLE_DEVICES"] = ['1', "2"]
+ from pprint import pprint
+ import sys
+ sys.path.append('/root/autodl-tmp/wjm/iupac-gpt')
+ from tqdm import tqdm
+ try:
+     import iupac_gpt as gpt
+ except ImportError:
+     import sys
+     sys.path.extend([".."])  # Parent directory stores the `iupac_gpt` package.
+     import iupac_gpt as gpt
+ import torch
+
+ # For demonstration purposes, we use only a 10K subset of the PubChem data made available by
+ # the [ChemBERTa](https://arxiv.org/abs/2010.09885) developers. The original model was pretrained
+ # on the first 5M compounds with the following hyperparameters:
+ # ```python
+ # hyperparams = {"batch_size": 128, "max_epochs": 2, "max_length": 512,
+ #                "learning_rate": 5e-4, "weight_decay": 0.0,
+ #                "adam_eps": 1e-8, "adam_betas": (0.9, 0.999),
+ #                "scheduler_T_max": 150_000, "final_learning_rate": 5e-8,
+ #                "vocab_size": 1_000, "min_frequency": 2, "top_p": 0.96,
+ #                "n_layer": 4, "n_head": 8, "n_embd": 512}
+ # ```
+ # Tokenizer, model, optimizer, scheduler, and trainer hyperparameters.
+ hyperparams = {"batch_size": 128, "max_epochs": 10, "max_length": 1280,
+                "learning_rate": 5e-4, "weight_decay": 0.0,
+                "adam_eps": 1e-8, "adam_betas": (0.9, 0.999),
+                "scheduler_T_max": 1_000, "final_learning_rate": 5e-8,
+                "vocab_size": 1491, "min_frequency": 2, "top_p": 0.96,
+                "n_layer": 8, "n_head": 8, "n_embd": 256}
+
+ gpus = [0]        # Specify either a list of GPU devices or an integer (0 for no GPU).
+ num_workers = 16  # Number of dataloader worker processes.
+ # ## Tokenization
+ #
+ # `smiles_gpt.SMILESBPETokenizer` first splits SMILES strings into characters, runs
+ # byte-pair encoding, and augments the resulting list with `"<s>"` (beginning-of-SMILES) and
+ # `"</s>"` (end-of-SMILES) special tokens. `smiles_gpt.SMILESAlphabet` stores 72 possible
+ # characters as an initial vocabulary.
+ device = 'gpu'
+ train_dataloader, iupac_tokenizer = gpt.get_data_loader(is_train=1, dataset_filename='./pubchem_iupac_smile_gpt.csv')
+ pbar = tqdm(train_dataloader)  # train_dataloader.cuda()
+
+
+ '''
+ for inputs in pbar:
+     src_label = Variable(inputs["labels"].to(device))
+     inputs = prepare_input(inputs, device)
+     src = Variable(inputs["input_ids"].to(device))
+     # self.tokenizer._convert_token_to_id
+
+     print(src[:, :].shape, src_label)
+ '''
+ tokenizer = iupac_tokenizer
+ # start mark <unk> = 2, end mark </s> = 1, pad <pad> = 0
+
+ iupac_string = "2-amino-9-[4-hydroxy-3-(hydroxymethyl)-2-methylidenecyclopentyl]-1H-purin-6-one"
+ iupac_encoded = tokenizer(iupac_string)
+ iupac_encoded['input_ids'] = [2] + iupac_encoded['input_ids']
+
+ iupac_merges = [tokenizer.decode(i) for i in iupac_encoded['input_ids']]
+ # iupac_encoded['attention_mask']
+
+ print(iupac_encoded['input_ids'])
+ print(iupac_merges)
+
+ print(tokenizer.unk_token_id, tokenizer.eos_token_id, tokenizer.unk_token, tokenizer.eos_token, tokenizer.vocab_size)  # 2 1 1491
+ # ## Data Module
+ # batch = next(iter(pbar))
+
+
+ # ## GPT-2 Model
+ #
+ # Now we load the HuggingFace
+ # [`GPT2LMHeadModel`](https://huggingface.co/transformers/model_doc/gpt2.html#gpt2lmheadmodel)
+ # with a configuration composed of the previously
+ # defined model hyperparameters. The model processes a mini-batch of input IDs and labels, then
+ # returns predictions and the cross-entropy loss between labels and predictions.
+
+ from transformers import GPT2Config, GPT2LMHeadModel
+
+ config = GPT2Config(vocab_size=tokenizer.vocab_size,
+                     bos_token_id=tokenizer.unk_token_id,
+                     eos_token_id=tokenizer.eos_token_id,
+                     n_layer=hyperparams["n_layer"],
+                     n_head=hyperparams["n_head"],
+                     n_embd=hyperparams["n_embd"],
+                     n_positions=hyperparams["max_length"],
+                     n_ctx=hyperparams["max_length"])
+ # model = GPT2LMHeadModel(config)
+
+ # model = torch.nn.DataParallel(model.cuda(), device_ids=gpus, output_device=gpus[0])
+
+ # outputs = model(**batch)
+ # print(outputs.keys())
+
+ # ['loss', 'logits', 'past_key_values']
+ # ## Trainer
+ #
+ # GPT-2 is trained with the autoregressive language modeling objective:
+ # $$
+ # P(\boldsymbol{s}) = P(s_1) \cdot P(s_2 | s_1) \cdots P(s_T | s_1, \ldots, s_{T-1}) =
+ # \prod_{t=1}^{T} P(s_t | s_{j < t}),
+ # $$
+ # where $\boldsymbol{s}$ is a tokenized (encoded) string and $s_t$ is a token from the pretrained
+ # vocabulary $\mathcal{V}$.
+ #
+ # We use `pytorch_lightning.Trainer` to train GPT-2. Since `Trainer` requires lightning modules,
+ # we import our
+ # [`smiles_gpt.GPT2LitModel`](https://github.com/sanjaradylov/smiles-gpt/blob/master/smiles_gpt/language_modeling.py#L10)
+ # wrapper that implements training phases for
+ # `GPT2LMHeadModel`, configures an `Adam` optimizer with a `CosineAnnealingLR` scheduler, and
+ # logs average perplexity every epoch.
+ checkpoint = "../checkpoints/iupac"
+
+ model = GPT2LMHeadModel.from_pretrained('./pretrained', local_files_only=True)
+
+
+ from pytorch_lightning import Trainer
+ from pytorch_lightning.callbacks.early_stopping import EarlyStopping
+
+
+
+ trainer = Trainer(
+     gpus=gpus,
+     max_epochs=hyperparams["max_epochs"],
+     callbacks=[EarlyStopping("ppl", 0.1, 3)],  # [EarlyStopping("ppl", 0.2, 2)]
+     auto_lr_find=False,  # Set to True to search for the optimal learning rate.
+     auto_scale_batch_size=False,  # Set to True to scale the batch size.
+     # accelerator="dp"  # Uncomment for GPU training.
+     accelerator="gpu",  # devices=4,
+     strategy="ddp"
+ )
+ lit_model = gpt.GPT2LitModel(
+     model,
+     batch_size=hyperparams["batch_size"],
+     learning_rate=hyperparams["learning_rate"],
+     final_learning_rate=hyperparams["final_learning_rate"],
+     weight_decay=hyperparams["weight_decay"],
+     adam_eps=hyperparams["adam_eps"],
+     adam_betas=hyperparams["adam_betas"],
+     scheduler_T_max=hyperparams["scheduler_T_max"],
+     save_model_every=1, checkpoint=checkpoint)
+ trainer.fit(lit_model, train_dataloader)
+
+
+ # model.module.save_pretrained('./pretrained')
+ model.save_pretrained('./pretrained')
+
+ # ## Interpretability
+ #
+ # [BertViz](https://github.com/jessevig/bertviz) inspects attention heads of transformers,
+ # capturing specific patterns in the data. Each head can be representative of some syntactic
+ # or short-/long-term relationships between tokens.
+
+ # In[9]:
+
+
+ import torch
+ from bertviz import head_view
+
+ input_ids_list = iupac_encoded['input_ids']
+ model = GPT2LMHeadModel.from_pretrained(checkpoint, output_attentions=True)
+ attention = model(torch.LongTensor(input_ids_list))[-1]
+ tokens = [tokenizer.decode(i) for i in input_ids_list]
+ print(input_ids_list, attention, tokens)
+ # Don't worry if a snippet is not displayed---just rerun this cell.
+ head_view(attention, tokens)
+
+
+
+ from bertviz import model_view
+
+ # Don't worry if a snippet is not displayed---just rerun this cell.
+ model_view(attention, tokens)
+
+
+ # ## Sampling
+ #
+ # Finally, we generate novel IUPAC names with top-$p$ sampling, i.e., sampling from the
+ # smallest vocabulary subset $\mathcal{V}^{(p)} \subset \mathcal{V}$ s.t. it takes up the most
+ # probable tokens whose cumulative probability mass exceeds $p$, $0 < p < 1$. The model
+ # terminates the procedure upon encountering `"</s>"` or reaching the maximum length
+ # `hyperparams["max_length"]`. Special tokens are eventually removed.
+
+
+
+ import tqdm
+
+ model.eval()  # Set the base model to evaluation mode.
+
+ generated_smiles_list = []
+ n_generated = 50000
+
+ for _ in tqdm.tqdm(range(n_generated)):
+     # Generate from "<unk>" so that the next token is arbitrary.
+     smiles_start = torch.LongTensor([[tokenizer.unk_token_id]])
+     # Get generated token IDs.
+     generated_ids = model.generate(smiles_start,
+                                    max_length=hyperparams["max_length"],
+                                    do_sample=True, top_p=hyperparams["top_p"],
+                                    repetition_penalty=1.2,
+                                    pad_token_id=tokenizer.eos_token_id)
+     # Decode the IDs into tokens and remove "<s>" and "</s>".
+     generated_smiles = tokenizer.decode(generated_ids[0],
+                                         skip_special_tokens=True)
+     generated_smiles_list.append(generated_smiles)
+
+ print(generated_smiles_list[:10])
+
+
+ import numpy as np
+ import pandas as pd
+
+ df2 = pd.DataFrame(generated_smiles_list, columns=['iupac'])
+
+ df2.to_csv("iupacGPT2-gen50K.csv", index=None, mode='a')
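
The Trainer comments in the retraining script state the autoregressive objective as a product of conditional probabilities and note that the `GPT2LitModel` wrapper logs average perplexity. The self-contained sketch below, using an arbitrary toy configuration rather than the repository's checkpoints, shows how the loss returned by `GPT2LMHeadModel` corresponds to that factorization (cross-entropy over shifted labels) and how perplexity follows as its exponential.

```python
# Generic illustration of the causal LM loss and perplexity; not code from this repo.
import torch
import torch.nn.functional as F
from transformers import GPT2Config, GPT2LMHeadModel

# Tiny, arbitrary configuration purely for demonstration.
config = GPT2Config(vocab_size=100, n_layer=2, n_head=2, n_embd=64, n_positions=32)
model = GPT2LMHeadModel(config).eval()

input_ids = torch.randint(0, 100, (1, 16))
with torch.no_grad():
    out = model(input_ids, labels=input_ids)  # the model shifts the labels internally

# Reproduce the same loss manually: predict token t from tokens < t.
shift_logits = out.logits[:, :-1, :]
shift_labels = input_ids[:, 1:]
manual_loss = F.cross_entropy(shift_logits.reshape(-1, config.vocab_size),
                              shift_labels.reshape(-1))

print(out.loss.item(), manual_loss.item())  # agree up to floating-point error
print(torch.exp(out.loss).item())           # perplexity = exp(mean negative log-likelihood)
```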
iupac-gpt/notebooks/iupac_language-modeling_train.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
iupac-gpt/notebooks/iupac_language-modeling_train.py ADDED
@@ -0,0 +1,231 @@
+ #!/usr/bin/env python
+ # coding: utf-8
+
+ # # Generative Pre-Training from Molecules
+
+ import os
+ # os.environ["CUDA_VISIBLE_DEVICES"] = ['1', "2"]
+ from pprint import pprint
+ import sys
+ sys.path.append('/root/autodl-tmp/wjm/iupac-gpt')
+ from tqdm import tqdm
+ try:
+     import iupac_gpt as gpt
+ except ImportError:
+     import sys
+     sys.path.extend([".."])  # Parent directory stores the `iupac_gpt` package.
+     import iupac_gpt as gpt
+ import torch
+
+ # For demonstration purposes, we use only a 10K subset of the PubChem data made available by
+ # the [ChemBERTa](https://arxiv.org/abs/2010.09885) developers. The original model was pretrained
+ # on the first 5M compounds with the following hyperparameters:
+ # ```python
+ # hyperparams = {"batch_size": 128, "max_epochs": 2, "max_length": 512,
+ #                "learning_rate": 5e-4, "weight_decay": 0.0,
+ #                "adam_eps": 1e-8, "adam_betas": (0.9, 0.999),
+ #                "scheduler_T_max": 150_000, "final_learning_rate": 5e-8,
+ #                "vocab_size": 1_000, "min_frequency": 2, "top_p": 0.96,
+ #                "n_layer": 4, "n_head": 8, "n_embd": 512}
+ # ```
+ # Tokenizer, model, optimizer, scheduler, and trainer hyperparameters.
+ hyperparams = {"batch_size": 128, "max_epochs": 10, "max_length": 1280,
+                "learning_rate": 5e-4, "weight_decay": 0.0,
+                "adam_eps": 1e-8, "adam_betas": (0.9, 0.999),
+                "scheduler_T_max": 1_000, "final_learning_rate": 5e-8,
+                "vocab_size": 1491, "min_frequency": 2, "top_p": 0.96,
+                "n_layer": 8, "n_head": 8, "n_embd": 256}
+
+ gpus = [0, 1, 2]  # Specify either a list of GPU devices or an integer (0 for no GPU).
+ num_workers = 32  # Number of dataloader worker processes.
+ # ## Tokenization
+ #
+ # `smiles_gpt.SMILESBPETokenizer` first splits SMILES strings into characters, runs
+ # byte-pair encoding, and augments the resulting list with `"<s>"` (beginning-of-SMILES) and
+ # `"</s>"` (end-of-SMILES) special tokens. `smiles_gpt.SMILESAlphabet` stores 72 possible
+ # characters as an initial vocabulary.
+ device = 'gpu'
+ train_dataloader, iupac_tokenizer = gpt.get_data_loader(is_train=1, dataset_filename='./pubchem_iupac_smile_gpt_1bw.csv')
+ pbar = tqdm(train_dataloader)  # train_dataloader.cuda()
+
+
+ '''
+ for inputs in pbar:
+     src_label = Variable(inputs["labels"].to(device))
+     inputs = prepare_input(inputs, device)
+     src = Variable(inputs["input_ids"].to(device))
+     # self.tokenizer._convert_token_to_id
+
+     print(src[:, :].shape, src_label)
+ '''
+ tokenizer = iupac_tokenizer
+ # start mark <unk> = 2, end mark </s> = 1, pad <pad> = 0
+
+ iupac_string = "2-amino-9-[4-hydroxy-3-(hydroxymethyl)-2-methylidenecyclopentyl]-1H-purin-6-one"
+ iupac_encoded = tokenizer(iupac_string)
+ iupac_encoded['input_ids'] = [2] + iupac_encoded['input_ids']
+
+ iupac_merges = [tokenizer.decode(i) for i in iupac_encoded['input_ids']]
+ # iupac_encoded['attention_mask']
+
+ print(iupac_encoded['input_ids'])
+ print(iupac_merges)
+
+ print(tokenizer.unk_token_id, tokenizer.eos_token_id, tokenizer.unk_token, tokenizer.eos_token, tokenizer.vocab_size)  # 2 1 1491
+ # ## Data Module
+ batch = next(iter(pbar))
+
+
+ # ## GPT-2 Model
+ #
+ # Now we load the HuggingFace
+ # [`GPT2LMHeadModel`](https://huggingface.co/transformers/model_doc/gpt2.html#gpt2lmheadmodel)
+ # with a configuration composed of the previously
+ # defined model hyperparameters. The model processes a mini-batch of input IDs and labels, then
+ # returns predictions and the cross-entropy loss between labels and predictions.
+
+ from transformers import GPT2Config, GPT2LMHeadModel
+
+ config = GPT2Config(vocab_size=tokenizer.vocab_size,
+                     bos_token_id=tokenizer.unk_token_id,
+                     eos_token_id=tokenizer.eos_token_id,
+                     n_layer=hyperparams["n_layer"],
+                     n_head=hyperparams["n_head"],
+                     n_embd=hyperparams["n_embd"],
+                     n_positions=hyperparams["max_length"],
+                     n_ctx=hyperparams["max_length"])
+ model = GPT2LMHeadModel(config)
+
+ # model = torch.nn.DataParallel(model.cuda(), device_ids=gpus, output_device=gpus[0])
+
+ outputs = model(**batch)
+ print(outputs.keys())
+
+ # ['loss', 'logits', 'past_key_values']
+ # ## Trainer
+ #
+ # GPT-2 is trained with the autoregressive language modeling objective:
+ # $$
+ # P(\boldsymbol{s}) = P(s_1) \cdot P(s_2 | s_1) \cdots P(s_T | s_1, \ldots, s_{T-1}) =
+ # \prod_{t=1}^{T} P(s_t | s_{j < t}),
+ # $$
+ # where $\boldsymbol{s}$ is a tokenized (encoded) string and $s_t$ is a token from the pretrained
+ # vocabulary $\mathcal{V}$.
+ #
+ # We use `pytorch_lightning.Trainer` to train GPT-2. Since `Trainer` requires lightning modules,
+ # we import our
+ # [`smiles_gpt.GPT2LitModel`](https://github.com/sanjaradylov/smiles-gpt/blob/master/smiles_gpt/language_modeling.py#L10)
+ # wrapper that implements training phases for
+ # `GPT2LMHeadModel`, configures an `Adam` optimizer with a `CosineAnnealingLR` scheduler, and
+ # logs average perplexity every epoch.
+
+ # In[8]:
+
+
+ from pytorch_lightning import Trainer
+ from pytorch_lightning.callbacks.early_stopping import EarlyStopping
+
+ checkpoint = "../checkpoints/iupac"
+
+ trainer = Trainer(
+     gpus=gpus,
+     max_epochs=hyperparams["max_epochs"],
+     callbacks=[EarlyStopping("ppl", 0.1, 3)],  # [EarlyStopping("ppl", 0.2, 2)]
+     auto_lr_find=False,  # Set to True to search for the optimal learning rate.
+     auto_scale_batch_size=False,  # Set to True to scale the batch size.
+     # accelerator="dp"  # Uncomment for GPU training.
+     accelerator="gpu",  # devices=4,
+     strategy="ddp"
+ )
+ lit_model = gpt.GPT2LitModel(
+     model,
+     batch_size=hyperparams["batch_size"],
+     learning_rate=hyperparams["learning_rate"],
+     final_learning_rate=hyperparams["final_learning_rate"],
+     weight_decay=hyperparams["weight_decay"],
+     adam_eps=hyperparams["adam_eps"],
+     adam_betas=hyperparams["adam_betas"],
+     scheduler_T_max=hyperparams["scheduler_T_max"],
+     save_model_every=1, checkpoint=checkpoint)
+ trainer.fit(lit_model, train_dataloader)
+
+
+ # model.module.save_pretrained('./pretrained')
+ model.save_pretrained('./pretrained')
+
+ # ## Interpretability
+ #
+ # [BertViz](https://github.com/jessevig/bertviz) inspects attention heads of transformers,
+ # capturing specific patterns in the data. Each head can be representative of some syntactic
+ # or short-/long-term relationships between tokens.
+
+ # In[9]:
+
+
+ import torch
+ from bertviz import head_view
+
+ input_ids_list = iupac_encoded['input_ids']
+ model = GPT2LMHeadModel.from_pretrained(checkpoint, output_attentions=True)
+ attention = model(torch.LongTensor(input_ids_list))[-1]
+ tokens = [tokenizer.decode(i) for i in input_ids_list]
+ print(input_ids_list, attention, tokens)
+ # Don't worry if a snippet is not displayed---just rerun this cell.
+ head_view(attention, tokens)
+
+
+
+ from bertviz import model_view
+
+ # Don't worry if a snippet is not displayed---just rerun this cell.
+ model_view(attention, tokens)
+
+
+ # ## Sampling
+ #
+ # Finally, we generate novel IUPAC names with top-$p$ sampling, i.e., sampling from the
+ # smallest vocabulary subset $\mathcal{V}^{(p)} \subset \mathcal{V}$ s.t. it takes up the most
+ # probable tokens whose cumulative probability mass exceeds $p$, $0 < p < 1$. The model
+ # terminates the procedure upon encountering `"</s>"` or reaching the maximum length
+ # `hyperparams["max_length"]`. Special tokens are eventually removed.
+
+
+
+ import tqdm
+
+ model.eval()  # Set the base model to evaluation mode.
+
+ generated_smiles_list = []
+ n_generated = 30000
+
+ for _ in tqdm.tqdm(range(n_generated)):
+     # Generate from "<unk>" so that the next token is arbitrary.
+     smiles_start = torch.LongTensor([[tokenizer.unk_token_id]])
+     # Get generated token IDs.
+     generated_ids = model.generate(smiles_start,
+                                    max_length=hyperparams["max_length"],
+                                    do_sample=True, top_p=hyperparams["top_p"],
+                                    repetition_penalty=1.2,
+                                    pad_token_id=tokenizer.eos_token_id)
+     # Decode the IDs into tokens and remove "<s>" and "</s>".
+     generated_smiles = tokenizer.decode(generated_ids[0],
+                                         skip_special_tokens=True)
+     generated_smiles_list.append(generated_smiles)
+
+ print(generated_smiles_list[:10])
+
+
+ import numpy as np
+ import pandas as pd
+
+ df2 = pd.DataFrame(generated_smiles_list, columns=['iupac'])
+
+ df2.to_csv("iupacGPT2-gen30K.csv", index=None, mode='a')
+
+
+
+
+
+
+
+
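
A note on the generated CSVs: `iupac_language-modeling.py` writes `iupacGPT2-gen50K.csv` with `sep="|"`, while the retrain and train scripts append with pandas' default comma separator, and because `mode='a'` re-writes the header on every run, appended files may contain duplicate header rows. A minimal reading sketch, assuming the files exist in the working directory and match the writer's separator:

```python
# Minimal sketch for loading the generated IUPAC names written by the scripts above.
import pandas as pd

names_pipe = pd.read_csv("iupacGPT2-gen50K.csv", sep="|")  # written with sep="|"
names_comma = pd.read_csv("iupacGPT2-gen30K.csv")          # written with the default separator

print(len(names_pipe), len(names_comma))
print(names_comma["iupac"].head().tolist())
```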