Lit2Vec Subfield Classifier (MLP)

Multi-label classifier for chemistry subfields using dense text embeddings.

Repo: https://huggingface.co/Bocklitz-Lab/lit2vec-subfield-classifier-model
Dataset: https://huggingface.co/datasets/Bocklitz-Lab/lit2vec-subfield-classifier-dataset


Model Summary

This model is a Keras MLP for multi-label classification of chemistry-related scientific texts. It consumes a dense embedding vector and predicts one or more subfields (e.g., Catalysis, Energy Chemistry, Materials Science).

  • Input: dense embedding vector (the embedding column from the dataset)
  • Output: 18 sigmoid probabilities (one per subfield)
  • Task: multi-label text classification (thresholded at 0.5 by default)

Intended Use & Limitations

Intended use

  • Subfield tagging for chemistry abstracts/summaries
  • Metadata enrichment for literature databases
  • Retrieval, filtering, and analytics

Limitations

  • Trained only on chemistry texts → may not generalize to other domains
  • Requires the same embedding space as the dataset encoder (raw text is not accepted directly)
  • Long-tail subfields (few examples) may have lower F1

Labels

| ID | Subfield |
|----|----------|
| 0 | Catalysis |
| 1 | Organic Chemistry |
| 2 | Polymer Chemistry |
| 3 | Inorganic Chemistry |
| 4 | Materials Science |
| 5 | Analytical Chemistry |
| 6 | Physical Chemistry |
| 7 | Biochemistry |
| 8 | Environmental Chemistry |
| 9 | Energy Chemistry |
| 10 | Medicinal Chemistry |
| 11 | Chemical Engineering |
| 12 | Supramolecular Chemistry |
| 13 | Radiochemistry & Nuclear Chemistry |
| 14 | Forensic & Legal Chemistry |
| 15 | Food Chemistry |
| 16 | Chemical Education |
| 17 | Others |

The repo includes label_mapping.json.
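
The quick-start code further down assumes label_mapping.json contains an "index_to_label" dictionary (JSON keys are strings). A minimal sketch of reading it, assuming that layout:

import json

# Assumed layout of label_mapping.json:
# {"index_to_label": {"0": "Catalysis", "1": "Organic Chemistry", ...},
#  "label_to_index": {"Catalysis": 0, "Organic Chemistry": 1, ...}}
with open("label_mapping.json", encoding="utf-8") as f:
    mapping = json.load(f)

index_to_label = {int(k): v for k, v in mapping["index_to_label"].items()}
print(index_to_label[0])   # -> "Catalysis"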


Training Details

  • Framework: TensorFlow/Keras

  • Architecture (a minimal Keras sketch appears after this list):

    • Input → Dense(256, ReLU) → (BatchNorm) → Dropout(0.3)
    • Dense(256, ReLU) → (BatchNorm) → Dropout(0.3)
    • Output: Dense(18, sigmoid)
  • Loss: Weighted Binary Cross-Entropy (per-class weights from train frequency)

  • Optimizer: Adam (learning rate reduced via ReduceLROnPlateau)

  • Callbacks: EarlyStopping (restore best), ReduceLROnPlateau, optional W&B logging

  • Validation: 5-fold CV on train+val; final training on official splits

  • Best epoch (val): 11 (from W&B)
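
For orientation, a minimal Keras sketch of the architecture described above (this is not the original training script; EMBED_DIM is an assumption and must match the dimensionality of the dataset embeddings, e.g. 1024 for intfloat/e5-large-v2):

from tensorflow import keras
from tensorflow.keras import layers

EMBED_DIM = 1024   # assumed; must equal the dimensionality of the input embeddings
NUM_LABELS = 18

mlp = keras.Sequential([
    keras.Input(shape=(EMBED_DIM,)),
    layers.Dense(256, activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.3),
    layers.Dense(256, activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.3),
    layers.Dense(NUM_LABELS, activation="sigmoid"),
])
mlp.compile(optimizer=keras.optimizers.Adam(),
            loss="binary_crossentropy",   # the released model used a class-weighted variant
            metrics=[keras.metrics.BinaryAccuracy()])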


Evaluation

Validation (final run):

  • PR-AUC: 0.8688
  • ROC-AUC: 0.9725
  • Binary Accuracy: 0.9597

Test (held-out split, threshold = 0.5):

  • Micro F1: 0.81
  • Macro F1: 0.75
  • Weighted F1: 0.80
  • Samples F1: 0.80

Per-label (F1, support):

| Subfield | F1 | Support |
|----------|----|---------|
| Catalysis | 0.80 | 197 |
| Organic Chemistry | 0.70 | 245 |
| Polymer Chemistry | 0.72 | 120 |
| Inorganic Chemistry | 0.71 | 203 |
| Materials Science | 0.80 | 917 |
| Analytical Chemistry | 0.71 | 633 |
| Physical Chemistry | 0.63 | 240 |
| Biochemistry | 0.92 | 2106 |
| Environmental Chemistry | 0.79 | 508 |
| Energy Chemistry | 0.79 | 166 |
| Medicinal Chemistry | 0.82 | 1343 |
| Chemical Engineering | 0.53 | 413 |
| Supramolecular Chemistry | 0.68 | 34 |
| Radiochemistry & Nuclear Chemistry | 0.65 | 20 |
| Forensic & Legal Chemistry | 0.70 | 16 |
| Food Chemistry | 0.83 | 282 |
| Chemical Education | 0.85 | 20 |
| Others | 0.83 | 19 |
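
For reproducibility, a minimal sketch of recomputing these figures with scikit-learn, assuming y_true is an (N, 18) binary indicator matrix, probs the corresponding model outputs, and index_to_label as loaded from label_mapping.json:

import numpy as np
from sklearn.metrics import classification_report, f1_score

y_pred = (probs > 0.5).astype(int)   # same 0.5 decision threshold as above

print("Micro F1:   ", f1_score(y_true, y_pred, average="micro"))
print("Macro F1:   ", f1_score(y_true, y_pred, average="macro"))
print("Weighted F1:", f1_score(y_true, y_pred, average="weighted"))
print("Samples F1: ", f1_score(y_true, y_pred, average="samples"))

# Per-label F1 and support
label_names = [index_to_label[i] for i in range(18)]
print(classification_report(y_true, y_pred, target_names=label_names, zero_division=0))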

Notes

  • Strong performance on frequent classes like Biochemistry, Medicinal Chemistry, Food Chemistry.
  • Lower F1 on long-tail or heterogeneous labels like Chemical Engineering and Physical Chemistry.

If included in the repo, the plot f1_vs_freq.png shows F1 vs. training label frequency.
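
To regenerate a comparable figure yourself, a minimal sketch, assuming f1_per_label and train_counts are length-18 arrays (per-label test F1 and training-set label counts) and index_to_label is loaded as above:

import matplotlib.pyplot as plt

label_names = [index_to_label[i] for i in range(18)]
plt.figure(figsize=(7, 5))
plt.scatter(train_counts, f1_per_label)
for x, y, name in zip(train_counts, f1_per_label, label_names):
    plt.annotate(name, (x, y), fontsize=7)
plt.xscale("log")
plt.xlabel("Training label frequency (log scale)")
plt.ylabel("Test F1")
plt.title("F1 vs. training label frequency")
plt.tight_layout()
plt.savefig("f1_vs_freq.png", dpi=200)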


Usage

The model expects inputs in the same embedding space as the dataset’s embedding column. To apply it to new texts, you must compute embeddings with the same encoder that was used to create the dataset.

Quick start (load from Hub, inference)

# Requirements: huggingface_hub, tensorflow, sentence-transformers, numpy  (or: pip install -r requirements.txt)
import json
import numpy as np
from typing import List, Tuple
from huggingface_hub import hf_hub_download
from tensorflow import keras
from sentence_transformers import SentenceTransformer

REPO_ID = "Bocklitz-Lab/lit2vec-subfield-classifier-model"
EMBED_MODEL = "intfloat/e5-large-v2"   # must match what you used to train!
TEXT_PREFIX = {"abstract": "abstract: ", "summary": "summary: "}  # keep consistent with your pipeline
THRESHOLD = 0.5  # decision threshold for multilabel

# ----- Load model + label mapping -----
model_path = hf_hub_download(REPO_ID, filename="mlp_model.h5")
label_map_path = hf_hub_download(REPO_ID, filename="label_mapping.json")

with open(label_map_path, "r", encoding="utf-8") as f:
    mapping = json.load(f)
index_to_label = {int(k): v for k, v in mapping["index_to_label"].items()}

model = keras.models.load_model(model_path, compile=False)  # inference only
encoder = SentenceTransformer(EMBED_MODEL)

def encode_text(text: str, text_type: str = "summary") -> np.ndarray:
    """
    Encode text into a normalized embedding compatible with the classifier.
    text_type: "summary" or "abstract" (affects prefix)
    """
    prefix = TEXT_PREFIX.get(text_type, "")
    emb = encoder.encode([prefix + text], normalize_embeddings=True)  # shape: (1, D)
    return emb.astype("float32")

def predict_labels_from_text(text: str, text_type: str = "summary", threshold: float = THRESHOLD
                            ) -> Tuple[List[int], List[str], np.ndarray]:
    """
    Returns (predicted_ids, predicted_labels, probabilities)
    """
    x = encode_text(text, text_type=text_type)       # (1, D)
    probs = model.predict(x, verbose=0)[0]           # (18,)
    pred_ids = [i for i, p in enumerate(probs) if p > threshold]
    pred_labels = [index_to_label[i] for i in pred_ids]
    return pred_ids, pred_labels, probs

# ----- Example -----
if __name__ == "__main__":
    sample_text = (
        "The adsorption capacity of Helix aspera shell for Pb2+, Zn2+ and Ni2+ has been studied. This shell has the potential of adsorbing Pb2+, Zn2+ and Ni2+ from aqueous solution. The adsorption potentials of Helix aspera shell is largely influenced by the ionic character of the ions and occurred according to the order Pb2+ > Ni2+ > Zn2+. The adsorption of Pb(II), Zn(II) and Ni(II) ions from aqueous solutions by Helix aspera shell is thermodynamically feasible and is consistent with the models of Langmuir and Freundlich adsorption isotherms. From the results of the study, the shell of Helix aspera is recommended for use in the removal of Pb2+, Zn2+ and Ni2+ from aqueous solution."
    )
    ids, labels, probs = predict_labels_from_text(sample_text, text_type="abstract", threshold=0.5)
    print("Predicted IDs:", ids)
    print("Predicted Labels:", labels)
    print("Top scores:", sorted(((index_to_label[i], float(p)) for i, p in enumerate(probs)),
                               key=lambda x: x[1], reverse=True)[:5])

Batch inference

X = np.load("embeddings_batch.npy").astype("float32")  # shape (N, D)
probs = model.predict(X, verbose=0)  # shape (N, 18)
labels_per_row = [[index_to_label[i] for i, p in enumerate(row) if p > 0.5] for row in probs]

Tip: Loading with compile=False (as in the quick start) is all you need for inference; no recompilation is required. If you want to fine-tune the model:

  • Recompile it with an optimizer and loss before calling model.fit.
  • To respect class imbalance, reintroduce the class-weighted binary cross-entropy used in the training script (see the sketch below).
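
The exact weighted loss lives in the training script; the following is only a minimal sketch, assuming class_weights is a length-18 array of per-class positive weights derived from training-label frequencies and model is the Keras model loaded in the quick start:

import numpy as np
import tensorflow as tf
from tensorflow import keras

def weighted_bce(class_weights):
    # Binary cross-entropy where the positive term of class j is scaled by class_weights[j]
    w = tf.constant(class_weights, dtype=tf.float32)          # shape (18,)
    def loss(y_true, y_pred):
        y_true = tf.cast(y_true, tf.float32)
        eps = keras.backend.epsilon()
        y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
        per_label = -(w * y_true * tf.math.log(y_pred)
                      + (1.0 - y_true) * tf.math.log(1.0 - y_pred))
        return tf.reduce_mean(per_label, axis=-1)
    return loss

class_weights = np.ones(18, dtype="float32")  # placeholder; use the weights from the training script
model.compile(optimizer=keras.optimizers.Adam(1e-4),
              loss=weighted_bce(class_weights),
              metrics=[keras.metrics.BinaryAccuracy()])
# model.fit(X_train, Y_train, validation_data=(X_val, Y_val), ...)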

Files in this repository

  • mlp_model.h5 – saved Keras model (architecture + weights)
  • label_mapping.json – name ↔ id mapping
  • training_history.json – training curves (optional)
  • f1_vs_freq.png – F1 vs frequency plot (optional)
  • README.md – this model card

Dataset

Lit2Vec Subfield Classifier Dataset


Model Index

model-index:
- name: Lit2Vec Subfield Classifier (MLP)
  results:
  - task:
      type: text-classification
      name: Multi-label text classification
    dataset:
      name: Lit2Vec Subfield Classifier Dataset
      type: Bocklitz-Lab/lit2vec-subfield-classifier-dataset
      split: test
    metrics:
    - type: micro_f1
      value: 0.81
    - type: macro_f1
      value: 0.75
    - type: weighted_f1
      value: 0.80
    - type: pr_auc
      value: 0.8688
    - type: roc_auc
      value: 0.9725

License

  • Model: CC BY 4.0
  • Dataset: CC BY 4.0

Citations

Dataset

@dataset{lit2vec_classifier_2025,
  author       = {Mahmoud Amiri and Thomas Bocklitz},
  title        = {Lit2Vec Subfield Classifier Dataset},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/datasets/Bocklitz-Lab/lit2vec-subfield-classifier-dataset}},
  note         = {Submitted to Nature Scientific Data}
}

Model

@misc{lit2vec_mlp_classifier_2025,
  title        = {Lit2Vec Subfield Classifier Model},
  author       = {Mahmoud Amiri and Thomas Bocklitz},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Bocklitz-Lab/lit2vec-subfield-classifier-model}}
}