HiBench: Benchmarking LLMs Capability on Hierarchical Structure Reasoning

Authors

Zhuohang Jiang, Pangjing Wu, Ziran Liang, Peter Q. Chen, Xu Yuan, Ye Jia, Jiancheng Tu, Chen Li, Peter H. F. Ng, Qing Li

Published in

KDD 2025 Datasets and Benchmarks Track (2025)

Keywords

Large language models, Benchmark, Dataset

Figure: Performance of each family's state-of-the-art models over five hierarchical reasoning capability dimensions.

Abstract

Structure reasoning is a fundamental capability of large language models (LLMs), enabling them to reason about structured commonsense and answer multi-hop questions. However, existing benchmarks for structure reasoning mainly focus on horizontal and coordinate structures (e.g., graphs), overlooking the hierarchical relationships within them. Hierarchical structure reasoning is crucial for human cognition, particularly in memory organization and problem-solving. It also plays a key role in various real-world tasks, such as information extraction and decision-making. To address this gap, we propose HiBench, the first framework spanning from initial structure generation to final proficiency assessment, designed to benchmark the hierarchical reasoning capabilities of LLMs systematically. HiBench encompasses six representative scenarios, covering both fundamental and practical aspects, and consists of 30 tasks with varying hierarchical complexity, totaling 39,519 queries. To evaluate LLMs comprehensively, we develop five capability dimensions that depict different facets of hierarchical structure understanding. Through extensive evaluation of 20 LLMs from 10 model families, we reveal key insights into their capabilities and limitations: 1) existing LLMs show proficiency in basic hierarchical reasoning tasks; 2) they still struggle with more complex structures and implicit hierarchical representations, especially in structural modification and textual reasoning. Based on these findings, we create a small yet well-designed instruction dataset, which enhances LLMs' performance on HiBench by an average of 88.84% (Llama-3.1-8B) and 31.38% (Qwen2.5-7B) across all tasks. The HiBench dataset and toolkit are available.

Why Hierarchical Reasoning Matters

Most structure reasoning benchmarks for LLMs focus on flat relational structures — graphs, tables, knowledge base triples. But a large fraction of real-world knowledge is hierarchical: taxonomies, org charts, file systems, biological classifications, dependency trees, document outlines. Hierarchies have properties that flat graphs don't — transitivity (if A contains B and B contains C, then A contains C), inheritance (properties propagate down branches), and level-relative semantics (a "chapter" means something different at depth 1 vs. depth 4). Reasoning about hierarchies is a distinct cognitive capability from reasoning about graphs, and existing benchmarks weren't measuring it. HiBench fills this gap.
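The transitivity and level properties described above can be made concrete with a minimal sketch. It assumes a toy hierarchy stored as a child-to-parent map; the node names and helper functions are illustrative and are not taken from HiBench.

```python
# Toy hierarchy stored as a child -> parent map (names are illustrative only).
PARENT = {
    "root": None,
    "docs": "root",
    "reports": "docs",
    "q1.txt": "reports",
    "src": "root",
}

def ancestors(node, parent=PARENT):
    """Return the chain of ancestors from the node's parent up to the root."""
    chain = []
    current = parent[node]
    while current is not None:
        chain.append(current)
        current = parent[current]
    return chain

def contains(a, b, parent=PARENT):
    """True if a transitively contains b, i.e. a is an ancestor of b."""
    return a in ancestors(b, parent)

def depth(node, parent=PARENT):
    """Number of edges between the node and the root."""
    return len(ancestors(node, parent))

print(contains("docs", "reports"))  # direct containment
print(contains("root", "q1.txt"))   # holds only through transitivity
print(depth("q1.txt"))              # level of the node in the hierarchy
```

Only direct parent links are stored; containment across several levels is derived by walking upward, which is the step a model has to perform implicitly when it reasons over a hierarchy given as text.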

The Benchmark Design

HiBench is structured around six scenarios that span the ways hierarchies appear in practice:

Structure generation and completion: given a partial hierarchy, can the model fill in missing nodes at the correct level?
Structure modification: given a hierarchy and an edit operation (move node X under node Y), can the model predict the consequences?
Path reasoning: given two nodes, can the model identify the hierarchical relationship (ancestor, descendant, sibling, unrelated)?
Implicit hierarchy extraction: given unstructured text (e.g., a passage describing a company's departments), can the model reconstruct the hierarchy?
Constraint satisfaction: given hierarchical constraints (e.g., "A reports to B, B reports to C, C cannot report to A"), can the model detect violations?
Multi-hop inference: can the model combine hierarchical facts to answer questions that require traversing multiple levels?
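Two of the scenarios listed above, path reasoning and constraint satisfaction, have simple reference answers when the hierarchy is given explicitly. The sketch below shows one way to compute them, assuming a child-to-parent map; the function names and the example org chart are hypothetical and are not the benchmark's own code.

```python
def ancestors(node, parent):
    """Ancestors of node, from its parent up to the root."""
    chain = []
    current = parent.get(node)
    while current is not None:
        chain.append(current)
        current = parent.get(current)
    return chain

def relationship(a, b, parent):
    """Classify how a relates to b: ancestor, descendant, sibling, or unrelated.

    'unrelated' here means none of the other three labels applies.
    """
    if a in ancestors(b, parent):
        return "ancestor"
    if b in ancestors(a, parent):
        return "descendant"
    if a != b and parent.get(a) is not None and parent.get(a) == parent.get(b):
        return "sibling"
    return "unrelated"

def has_reporting_cycle(reports_to):
    """Detect a cycle in an 'X reports to Y' map where each X has at most one Y."""
    for start in reports_to:
        seen = {start}
        current = reports_to.get(start)
        while current is not None:
            if current in seen:
                return True
            seen.add(current)
            current = reports_to.get(current)
    return False

org = {"CEO": None, "COO": "CEO", "CTO": "CEO", "Engineer": "CTO"}
print(relationship("CEO", "Engineer", org))
print(relationship("COO", "CTO", org))
print(has_reporting_cycle({"A": "B", "B": "C"}))
print(has_reporting_cycle({"A": "B", "B": "C", "C": "A"}))
```

The last call mirrors the constraint example in the list: if A reports to B and B reports to C, then C reporting to A closes a loop and violates the hierarchy.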

Across these scenarios, the benchmark comprises 30 tasks totaling 39,519 queries — a scale that supports statistical analysis of model differences rather than anecdotal comparison.
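To give a rough sense of what this scale buys, the sketch below computes the binomial standard error of an accuracy estimate. It assumes queries are independent and, purely for illustration, evenly split across the 30 tasks (about 1,300 per task); the source does not state the actual per-task counts.

```python
import math

TOTAL_QUERIES = 39_519  # from the benchmark description
NUM_TASKS = 30          # from the benchmark description

def accuracy_standard_error(accuracy, n_queries):
    """Binomial standard error of an accuracy measured on n independent queries."""
    return math.sqrt(accuracy * (1 - accuracy) / n_queries)

# Assumption: an even split across tasks, used only to get an order of magnitude.
queries_per_task = TOTAL_QUERIES / NUM_TASKS

# Accuracy 0.5 is the worst case for the standard error.
se_task = accuracy_standard_error(0.5, queries_per_task)
se_all = accuracy_standard_error(0.5, TOTAL_QUERIES)

print(f"approx. 95% half-width per task: {1.96 * se_task:.3f}")
print(f"approx. 95% half-width overall:  {1.96 * se_all:.3f}")
```

Under these assumptions, per-task accuracy differences of a few percentage points between models are distinguishable from sampling noise, while much smaller differences are not.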

The five capability dimensions used for evaluation are a significant methodological contribution in themselves. Rather than reporting a single "hierarchical reasoning score," the benchmark disaggregates performance into: structure recognition, structure generation, structure modification, implicit structure understanding, and multi-hop hierarchical inference. This disaggregation reveals that model rankings are not stable across dimensions — a model that excels at recognition may be near-random at modification, and vice versa.
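The effect of disaggregation can be illustrated with a small sketch. The records below are invented for illustration (they are not HiBench results), and the model names are placeholders; the point is only that two models with the same overall accuracy can rank differently on each dimension.

```python
from collections import defaultdict

# Hypothetical evaluation records: (model, dimension, answered_correctly).
RECORDS = [
    ("model_a", "structure recognition", True),
    ("model_a", "structure recognition", True),
    ("model_a", "structure modification", False),
    ("model_a", "structure modification", False),
    ("model_b", "structure recognition", True),
    ("model_b", "structure recognition", False),
    ("model_b", "structure modification", True),
    ("model_b", "structure modification", False),
]

def accuracy_by_dimension(records):
    """Return {dimension: {model: accuracy}} from per-query records."""
    counts = defaultdict(lambda: [0, 0])  # (model, dimension) -> [correct, total]
    for model, dimension, correct in records:
        counts[(model, dimension)][0] += int(correct)
        counts[(model, dimension)][1] += 1
    table = defaultdict(dict)
    for (model, dimension), (n_correct, n_total) in counts.items():
        table[dimension][model] = n_correct / n_total
    return {dim: dict(scores) for dim, scores in table.items()}

def ranking(scores):
    """Models sorted from highest to lowest accuracy."""
    return sorted(scores, key=scores.get, reverse=True)

for dimension, scores in accuracy_by_dimension(RECORDS).items():
    print(dimension, ranking(scores), scores)
```

In this toy data both models answer half of all queries correctly, so a single aggregate score would call them equal, yet the per-dimension rankings are reversed.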

What the Evaluation Revealed

Twenty LLMs from ten model families were evaluated, spanning a range of scales and architectures. Three main findings:

Basic hierarchical reasoning is largely solved. On structure recognition tasks — identifying parent-child relationships, detecting valid vs. invalid trees — most models performed well above random baseline, and the best models approached ceiling.
Structural modification and implicit hierarchy extraction remain hard. When asked to predict the result of restructuring a hierarchy (e.g., "move the Marketing department under the COO; what is the new path from the SEO specialist to the CEO?"), performance dropped sharply, even for strong models. Similarly, extracting a hierarchy from free text — a task humans find natural — produced error rates that make these models unreliable for practical applications like automatic document structuring.
Scale helps, but unevenly. Larger models performed better across the board, but the gap between small and large models was much wider on modification and implicit extraction than on recognition. This suggests that hierarchical reasoning is not a single capability that scales uniformly — some sub-capabilities may require architectural innovations beyond scaling existing transformers.
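The restructuring question quoted in the second finding (move Marketing under the COO, then ask for the path from the SEO specialist to the CEO) has a mechanical answer once the hierarchy is explicit. The sketch below assumes a hypothetical starting org chart in which Marketing sits under a CMO; the source does not specify the initial structure.

```python
def path_to_root(node, parent):
    """List of nodes from the given node up to the root, inclusive."""
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path

def move_subtree(node, new_parent, parent):
    """Return a new child -> parent map with node re-attached under new_parent.

    Descendants of node move with it because only node's own link changes.
    """
    if node in path_to_root(new_parent, parent):
        raise ValueError("cannot move a node under itself or its own descendant")
    updated = dict(parent)
    updated[node] = new_parent
    return updated

# Hypothetical starting org chart (an assumption made for this example).
org = {
    "CEO": None,
    "CMO": "CEO",
    "COO": "CEO",
    "Marketing": "CMO",
    "SEO specialist": "Marketing",
}

print(path_to_root("SEO specialist", org))
reorganized = move_subtree("Marketing", "COO", org)
print(path_to_root("SEO specialist", reorganized))
```

A single changed link alters the answer for every node in the moved subtree, so the model has to update its picture of the whole structure before answering.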

The Instruction Dataset and What It Proves

The authors created a small, curated instruction dataset targeting the specific failure modes identified in the evaluation. Fine-tuning Llama-3.1-8B on this dataset improved performance by an average of 88.84% across all HiBench tasks; fine-tuning Qwen2.5-7B improved it by an average of 31.38%. These are substantial gains from a small dataset, which suggests two things: (1) the benchmark is measuring a real capability gap, not an artifact of prompt formatting or evaluation methodology, and (2) the gap is addressable through targeted training rather than requiring fundamentally different model architectures.

The asymmetry in improvement (88.84% for Llama-3.1-8B vs. 31.38% for Qwen2.5-7B) is itself informative. It suggests that different model families have different latent capabilities for hierarchical reasoning, and that pre-training data composition, which is not fully disclosed for either model, likely plays a significant role in determining baseline performance.
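One caveat when reading this asymmetry: the source reports percentage gains without the baseline accuracies they were measured against. If the percentages are relative improvements (an assumption), the same absolute gain yields a larger percentage on a lower baseline. The numbers in the sketch below are hypothetical and are not taken from the paper.

```python
def relative_gain_percent(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100

# Hypothetical accuracies: both models gain the same 0.15 in absolute terms.
low_baseline = (0.20, 0.35)
high_baseline = (0.50, 0.65)

print(round(relative_gain_percent(*low_baseline), 1))
print(round(relative_gain_percent(*high_baseline), 1))
```

Under that reading, a weaker starting point would account for part of a larger percentage gain, alongside any real difference in latent capability.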

Boundaries

HiBench measures hierarchical reasoning in English text. How well these results generalize to other languages, particularly those with different hierarchical encoding conventions (e.g., classifier languages, languages with different honorific systems that encode social hierarchy grammatically), is untested. The benchmark also focuses on explicit, formal hierarchies; the much messier problem of reasoning about conflicting hierarchies (e.g., a person who reports to two managers in a matrix organization) is outside scope.

The instruction dataset is small and targeted — it's a proof of concept that the gap is trainable, not a production-ready fine-tuning recipe. And as with all LLM benchmarks, there's a shelf-life concern: models released after the benchmark was created may have been inadvertently trained on HiBench-like data, inflating scores. The authors address this by making the benchmark available through a toolkit that can generate new queries from the same templates, but the template structure itself could become part of future training distributions.
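Template-based regeneration of queries can be sketched as follows. This is not the HiBench toolkit's API; the template text, function names, and tree sampler are assumptions chosen to show the idea that a new seed yields a fresh structure and a fresh question with a known answer.

```python
import random

# Hypothetical question template (not taken from HiBench).
TEMPLATE = "In a tree with edges {edges}, what is the parent of node {node}?"

def random_tree(n_nodes, seed):
    """Build a random rooted tree as a child -> parent map; node 0 is the root."""
    rng = random.Random(seed)
    parent = {0: None}
    for node in range(1, n_nodes):
        parent[node] = rng.randrange(node)  # attach to any earlier node
    return parent

def make_query(n_nodes, seed):
    """Generate one question/answer pair from the template for a given seed."""
    rng = random.Random(seed)
    parent = random_tree(n_nodes, seed)
    node = rng.randrange(1, n_nodes)  # any non-root node
    edges = ", ".join(f"{p}->{c}" for c, p in parent.items() if p is not None)
    return {
        "question": TEMPLATE.format(edges=edges, node=node),
        "answer": parent[node],
    }

for seed in range(3):
    print(make_query(6, seed))
```

Regeneration changes the concrete trees and answers but not the wording of the template, which is why the template structure itself can still leak into future training data.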