Recently, there has been a surge of interest in extending the success of large language models (LLMs) to the graph modality, such as social networks and molecules. Since LLMs are predominantly trained on 1D text data, most existing approaches adopt a graph neural network to represent a graph as a series of node tokens and feed these tokens to the LLM for graph-language alignment. Despite some successes, existing approaches overlook the hierarchical structures inherent in graph data. In molecular graphs especially, the high-order structural information carries rich semantics of molecular functional groups, which encode crucial biochemical functionalities of the molecules. We establish a simple benchmark showing that neglecting hierarchical information in graph tokenization leads to subpar graph-language alignment and severe hallucination in generated outputs. To address this problem, we propose a novel strategy called HIerarchical GrapH Tokenization (HIGHT). HIGHT employs a hierarchical graph tokenizer that extracts and encodes informative tokens at the node, motif, and graph levels to improve the graph perception of LLMs. HIGHT also adopts an augmented graph-language supervised fine-tuning dataset, enriched with hierarchical graph information, to further enhance graph-language alignment. Extensive experiments on 7 molecule-centric benchmarks confirm the effectiveness of HIGHT in reducing hallucination by 40%, as well as in significantly improving various molecule-language downstream tasks.
Figure 1. Illustration of the HIGHT framework.
With only node-level tokens (i.e., discrete atom embeddings), an LLM has to identify and connect the nodes within a specific functional group in order to align the useful molecular substructures of a molecule with the corresponding language descriptions. Yet, due to the arbitrary ordering of atoms and the position biases of LLMs, it is hard for the model to distinguish individual functional groups, which in turn leads to hallucination and subpar alignment.
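To make the contrast concrete, here is a toy sketch (not the actual HIGHT implementation) of what hierarchical tokenization produces: given atoms that are already grouped into motifs, it emits tokens at the node, motif, and graph levels, so a functional group surfaces as a single explicit token rather than a scattered set of atom tokens. All token formats and names below are illustrative assumptions.

```python
def hierarchical_tokens(atoms, motifs):
    """Sketch of three-level tokenization.

    atoms:  list of atom symbols, indexed by position in the graph.
    motifs: dict mapping a motif (functional group) name to the
            indices of the atoms it covers.
    """
    # Node level: one token per atom (analogous to node embeddings).
    tokens = [f"<atom:{sym}:{i}>" for i, sym in enumerate(atoms)]
    # Motif level: one token per functional group, naming its atoms.
    for name, idxs in motifs.items():
        tokens.append(f"<motif:{name}:{'-'.join(map(str, idxs))}>")
    # Graph level: a single token summarizing the whole molecule.
    tokens.append("<graph>")
    return tokens

# Ethanol (SMILES "CCO"): the C-O end forms a hydroxyl motif.
toks = hierarchical_tokens(["C", "C", "O"], {"hydroxyl": [1, 2]})
```

With node-level tokens alone, the model would have to infer the hydroxyl from `<atom:C:1>` and `<atom:O:2>` on its own; the motif token makes that structure explicit.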
Figure 2. Illustration of motif hallucination caused by tokenization.
To understand the issue of node-centric tokenization more clearly, we construct a simple benchmark called MotifHallu, which measures the hallucination of common functional groups by LLMs for graphs (LGLMs). Specifically, we consider the 38 common functional groups in RDKit and construct 23,924 query-answer pairs that ask LLMs about the existence of these functional groups.
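A minimal sketch of how such existence-query pairs can be built with RDKit substructure matching. The SMARTS patterns and question template below are illustrative assumptions, not the paper's exact catalog of 38 groups:

```python
from rdkit import Chem

# Illustrative SMARTS patterns; MotifHallu itself uses the 38 common
# functional groups shipped with RDKit.
FUNCTIONAL_GROUPS = {
    "hydroxyl": "[OX2H]",
    "carboxyl": "C(=O)[OX2H1]",
    "primary amine": "[NX3;H2]",
}

def build_qa_pairs(smiles):
    """Yield (question, answer) pairs asking about group existence."""
    mol = Chem.MolFromSmiles(smiles)
    for name, smarts in FUNCTIONAL_GROUPS.items():
        pattern = Chem.MolFromSmarts(smarts)
        present = mol.HasSubstructMatch(pattern)
        question = f"Is there a {name} group in the molecule {smiles}?"
        yield question, "Yes" if present else "No"

# Acetic acid contains a carboxyl (and its hydroxyl), but no amine.
answers = {q: a for q, a in build_qa_pairs("CC(=O)O")}
```

An LGLM is then queried with each question, and its yes/no answers are compared against these RDKit-derived labels.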
Figure 3. Examples of MotifHallu benchmark.
Due to their limited motif perception and suboptimal alignment, existing LGLMs exhibit a severe hallucination ratio. In contrast, HIGHT significantly alleviates the hallucination.
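One simple way to score this, sketched below under the assumption that each benchmark example carries a model prediction and a ground-truth label (the paper's exact protocol may differ), is the fraction of existence queries the model answers incorrectly:

```python
def hallucination_ratio(examples):
    """Fraction of yes/no existence queries answered incorrectly.

    examples: iterable of dicts with 'pred' and 'label', each one of
    the strings 'Yes' or 'No'. Field names are illustrative.
    """
    examples = list(examples)
    wrong = sum(ex["pred"] != ex["label"] for ex in examples)
    return wrong / len(examples)

ratio = hallucination_ratio([
    {"pred": "Yes", "label": "Yes"},
    {"pred": "Yes", "label": "No"},   # claims a group that is absent
    {"pred": "No",  "label": "No"},
    {"pred": "Yes", "label": "No"},   # claims a group that is absent
])  # → 0.5
```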
Figure 4. Hallucination results.
We also compare the performance of HIGHT with other LGLMs across 6 downstream tasks, including property prediction, molecular description, and chemical reaction prediction.
Figure 5. ROC-AUC Results of molecular property prediction tasks (classification) on MoleculeNet.
Figure 6. Results of molecular property prediction tasks (regression) on QM9.
Figure 7. Results of molecular description generation task on the test split of ChEBI-20.
Figure 8. Results of chemical reaction tasks from Mol-Instructions.
Please check our paper for more details of this work. If you have any questions, feel free to contact us.
If you find our paper and repo useful, please consider citing:
@article{higraphllm2024,
  title={HIGHT: Hierarchical Graph Tokenization for Graph-Language Alignment},
  author={Yongqiang Chen and Quanming Yao and Juzheng Zhang and James Cheng and Yatao Bian},
  journal={arXiv preprint},
  volume={arXiv:2406.14021},
  year={2024}
}