HIGHT: Hierarchical Graph Tokenization for Graph-Language Alignment


This repo contains the model checkpoints of our ICML 2025 paper Hierarchical Graph Tokenization for Molecule-Language Alignment, which was also presented at the ICML 2024 Workshop on Foundation Models in the Wild. 😆😆😆

File Structures

The pretrained Hierarchical VQ-VAE model is stored in hivqvae.pth. The checkpoints of the graph-language models based on llama2-7b-chat and vicuna-v1-3-7b are contained in /llama2 and /vicuna, respectively. Inside each directory, the remaining checkpoints are organized as follows (using vicuna as an example); a minimal download/loading sketch follows the list:

  • llava-hvqvae2-vicuna-v1-3-7b-pretrain: model after stage 1 pretraining;
  • graph-text-molgen: models finetuned using Mol-Instruction data under different tasks, e.g., forward reaction prediction;
  • molcap-llava-hvqvae2-vicuna-v1-3-7b-finetune_lora-50ep: model finetuned on the ChEBI-20 dataset for molecular captioning;
  • MoleculeNet-llava-hvqvae2-vicuna-v1-3-7b-finetune_lora-large*: models finetuned on different classification-based molecular property prediction tasks.
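
The snippet below is a minimal sketch of how the files above could be fetched and inspected. It assumes the checkpoints are hosted in this Hugging Face repository; the repo id shown is a placeholder and must be replaced with the actual one.

import torch
from huggingface_hub import snapshot_download

# Placeholder repo id -- replace with the actual Hugging Face repo id of this model card.
local_dir = snapshot_download(repo_id="<org>/HIGHT-checkpoints")

# hivqvae.pth stores the pretrained hierarchical VQ-VAE tokenizer weights.
vqvae_ckpt = torch.load(f"{local_dir}/hivqvae.pth", map_location="cpu")
print(type(vqvae_ckpt))  # typically a state_dict or a dict wrapping one

# The graph-language checkpoints live under /llama2 and /vicuna, e.g.
# f"{local_dir}/vicuna/molcap-llava-hvqvae2-vicuna-v1-3-7b-finetune_lora-50ep".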

Citation

If you find our model, paper and repo useful, please cite our paper:

@inproceedings{chen2025hierarchical,
title={Hierarchical Graph Tokenization for Molecule-Language Alignment},
author={Yongqiang Chen and Quanming Yao and Juzheng Zhang and James Cheng and Yatao Bian},
booktitle={Forty-second International Conference on Machine Learning},
year={2025},
url={https://openreview.net/forum?id=wpbNczwAwV}
}