HIGHT: Hierarchical Graph Tokenization for Graph-Language Alignment

1Tencent AI Lab, 2The Chinese University of Hong Kong, 3Tsinghua University
( *Intern at Tencent AI Lab)

Abstract

Recently there has been a surge of interest in extending the success of large language models (LLMs) to the graph modality, such as social networks and molecules. As LLMs are predominantly trained with 1D text data, most existing approaches adopt a graph neural network to represent a graph as a series of node tokens and feed these tokens to LLMs for graph-language alignment. Despite some success, existing approaches have overlooked the hierarchical structures that are inherent in graph data. In particular, in molecular graphs, the high-order structural information contains rich semantics of molecular functional groups, which encode crucial biochemical functionalities of the molecules. We establish a simple benchmark showing that neglecting the hierarchical information in graph tokenization leads to subpar graph-language alignment and severe hallucination in generated outputs. To address this problem, we propose a novel strategy called HIerarchical GrapH Tokenization (HIGHT). HIGHT employs a hierarchical graph tokenizer that extracts and encodes informative tokens at the node, motif, and graph levels to improve the graph perception of LLMs. HIGHT also adopts an augmented graph-language supervised fine-tuning dataset, enriched with the hierarchical graph information, to further enhance the graph-language alignment. Extensive experiments on 7 molecule-centric benchmarks confirm the effectiveness of HIGHT in reducing hallucination by 40%, as well as its significant improvements on various molecule-language downstream tasks.

HIGHT Framework


Figure 1. Illustration of the HIGHT framework.

Given a molecule (e.g., PubChem CID 3, 5,6-dihydroxycyclohexa-1,3-diene-1-carboxylic acid), HIGHT detects its motifs and incorporates a "supernode" for each motif (the whole graph is also treated as a "super motif"). HIGHT then tokenizes the molecule into both node-level (i.e., atom) and motif-level (i.e., functional group) tokens, as sketched below. This hierarchical view enables LLMs to better align the molecular structure with the language description of the molecule.
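To make the idea concrete, here is a minimal sketch of hierarchical tokenization. It uses RDKit's BRICS fragmentation as the motif detector and aspirin as the input molecule; both choices are illustrative assumptions, and the paper's exact fragmentation and featurization may differ.

    from rdkit import Chem
    from rdkit.Chem import BRICS

    def hierarchical_tokens(smiles: str):
        mol = Chem.MolFromSmiles(smiles)
        # Node level: one token per atom.
        atom_tokens = [atom.GetSymbol() for atom in mol.GetAtoms()]

        # Motif level: break the BRICS bonds and treat each connected
        # fragment as a motif (a rough stand-in for functional groups).
        bonds = [mol.GetBondBetweenAtoms(i, j).GetIdx()
                 for (i, j), _ in BRICS.FindBRICSBonds(mol)]
        frags = (Chem.GetMolFrags(Chem.FragmentOnBonds(mol, bonds, addDummies=False))
                 if bonds else [tuple(range(mol.GetNumAtoms()))])

        # Hierarchy: one supernode per motif, connected to its member atoms,
        # plus a single graph-level "super motif" node on top.
        motif_edges = [(len(atom_tokens) + m, a)
                       for m, atoms in enumerate(frags) for a in atoms]
        graph_node = len(atom_tokens) + len(frags)
        return atom_tokens, frags, motif_edges, graph_node

    # Aspirin as a toy example (any SMILES works).
    print(hierarchical_tokens("CC(=O)Oc1ccccc1C(=O)O"))

The supernodes and the graph-level node are what give the LLM motif- and graph-level tokens in addition to the plain atom tokens.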

Motif Hallucination

With only node-level tokens (i.e., discrete atom embeddings), LLMs have to identify and connect the nodes within a specific functional group in order to align the relevant molecular substructures with the corresponding language descriptions. Yet, due to the arbitrary ordering of atoms and the position biases of LLMs, it is hard to distinguish individual functional groups, which in turn leads to hallucination and subpar alignment.
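As a tiny illustration of the atom-ordering problem (a hypothetical example, not from the paper): the same molecule renumbered two ways yields two different node-token sequences, so token positions alone carry no stable information about where a functional group sits.

    from rdkit import Chem

    mol = Chem.MolFromSmiles("CC(=O)O")          # acetic acid
    perm = list(range(mol.GetNumAtoms()))[::-1]  # reverse the atom numbering
    mol2 = Chem.RenumberAtoms(mol, perm)

    print([a.GetSymbol() for a in mol.GetAtoms()])   # ['C', 'C', 'O', 'O']
    print([a.GetSymbol() for a in mol2.GetAtoms()])  # ['O', 'O', 'C', 'C']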


Figure 2. Illustration of motif hallucination caused by tokenization.

To understand the issue of node-centric tokenization more clearly, we construct a simple benchmark called MotifHallu, which measures how often large graph-language models (LGLMs) hallucinate common functional groups. Specifically, we take the 38 common functional groups defined in RDKit and construct 23,924 query-answer pairs that ask LLMs about the existence of each functional group in a molecule.
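A minimal sketch of how such query-answer pairs can be generated, assuming RDKit's bundled functional-group list (FunctionalGroups.txt) and an illustrative prompt template; the benchmark's exact wording and filtering are described in the paper.

    import os
    from rdkit import Chem, RDConfig
    from rdkit.Chem import FragmentCatalog

    # RDKit ships a list of common functional groups as SMARTS patterns.
    fg_file = os.path.join(RDConfig.RDDataDir, "FunctionalGroups.txt")
    params = FragmentCatalog.FragCatParams(1, 6, fg_file)

    def qa_pairs(smiles: str):
        mol = Chem.MolFromSmiles(smiles)
        pairs = []
        for i in range(params.GetNumFuncGroups()):
            fg = params.GetFuncGroup(i)       # query mol with a name property
            name = fg.GetProp("_Name")        # e.g. '-NO2'
            present = mol.HasSubstructMatch(fg)
            pairs.append((f"Is there a {name} group in the molecule?",
                          "Yes" if present else "No"))
        return pairs

    for q, a in qa_pairs("CC(=O)Oc1ccccc1C(=O)O")[:5]:  # aspirin as an example
        print(q, "->", a)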


Figure 3. Examples of MotifHallu benchmark.

Due to their limited motif perception and suboptimal alignment, existing LGLMs exhibit severe hallucination on MotifHallu. In contrast, HIGHT significantly alleviates the hallucination.


Figure 4. Hallucination results.

Results on Downstream Tasks

We also compare HIGHT with other LGLMs across 6 downstream tasks, including property prediction, molecular description, and chemical reaction prediction.


Figure 5. ROC-AUC Results of molecular property prediction tasks (classification) on MoleculeNet.


Figure 6. Results of molecular property prediction tasks (regression) on QM9.


Figure 7. Results of molecular description generation task on the test split of ChEBI-20.


Figure 8. Results of chemical reaction tasks from Mol-Instructions.

Contact

Please check our paper for more details of this research work. If you have any questions, feel free to contact us.

If you find our paper and repo useful, please consider citing:

    @article{higraphllm2024,
      title   = {HIGHT: Hierarchical Graph Tokenization for Graph-Language Alignment},
      author  = {Yongqiang Chen and Quanming Yao and Juzheng Zhang and James Cheng and Yatao Bian},
      journal = {arXiv preprint},
      volume  = {arXiv:2406.14021},
      year    = {2024}
    }