Sitemap
A list of all the posts and pages found on the site. For you robots out there, there is an XML version available for digesting as well.
Pages
Posts
Future Blog Post
This post will show up by default. To disable scheduling of future posts, edit _config.yml and set future: false.
Blog Post number 4
This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.
Blog Post number 3
This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.
Blog Post number 2
This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.
Blog Post number 1
This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.
Publications
The Curse of CoT: On the Limitations of Chain-of-Thought in In-Context Learning
Published in arXiv preprint, 2025
Chain-of-Thought (CoT) prompting has been widely recognized for its ability to enhance reasoning capabilities in large language models (LLMs) through the generation of explicit explanatory rationales. However, our study reveals a surprising contradiction to this prevailing perspective. Through extensive experiments involving 16 state-of-the-art LLMs and nine diverse pattern-based in-context learning (ICL) datasets, we demonstrate that CoT and its reasoning variants consistently underperform direct answering across varying model scales and benchmark complexities. To systematically investigate this unexpected phenomenon, we designed extensive experiments to validate several hypothetical explanations. Our analysis uncovers a fundamental explicit-implicit duality driving CoT’s performance in pattern-based ICL: while explicit reasoning falters due to LLMs’ struggles to infer underlying patterns from demonstrations, implicit reasoning, although disrupted by the increased contextual distance of CoT rationales, often compensates, delivering correct answers despite flawed rationales. This duality explains CoT’s relative underperformance, as noise from weak explicit inference undermines the process, even as implicit mechanisms partially salvage outcomes. Notably, even long-CoT reasoning models, which excel in abstract and symbolic reasoning, fail to fully overcome these limitations despite higher computational costs. Our findings challenge existing assumptions regarding the universal efficacy of CoT, yielding novel insights into its limitations and guiding future research toward more nuanced and effective reasoning methodologies for LLMs.
INFERENCEDYNAMICS: Efficient Routing Across LLMs through Structured Capability and Knowledge Profiling
Published in ACL 2026, 2025
This paper presents InferenceDynamics, an efficient routing approach across LLMs through structured capability and knowledge profiling. Accepted by ACL 2026.
Controllable Logical Hypothesis Generation for Abductive Reasoning in Knowledge Graphs
Published in ICLR 2026, 2025
This paper presents controllable logical hypothesis generation for abductive reasoning in knowledge graphs. Accepted by ICLR 2026.
AutoSchemaKG: Autonomous Knowledge Graph Construction through Dynamic Schema Induction from Web-Scale Corpora
Published in ACL 2026, 2025
This paper presents AutoSchemaKG, an autonomous knowledge graph construction approach through dynamic schema induction from web-scale corpora. Accepted by ACL 2026.
Cognitive Kernel-Pro: A Framework for Deep Research Agents and Agent Foundation Models Training
Published in arXiv preprint, 2025
This paper presents Cognitive Kernel-Pro, a framework for deep research agents and for training agent foundation models.
Rethinking Prospect Theory for LLMs: Revealing the Instability of Decision-Making under Epistemic Uncertainty
Published in arXiv preprint, 2025
Prospect Theory (PT) models human decision-making behaviour under uncertainty, among which linguistic uncertainty is commonly adopted in real-world scenarios. Although recent studies have developed some frameworks to test PT parameters for Large Language Models (LLMs), few have considered the fitness of PT itself on LLMs. Moreover, whether PT is robust under linguistic uncertainty perturbations, especially epistemic markers (e.g. “likely”), remains highly under-explored. To address these gaps, we design a three-stage workflow based on a classic behavioural economics experimental setup. Our findings suggest that modelling LLMs’ decision-making with PT is not consistently reliable across models, and applying Prospect Theory to LLMs is likely not robust to epistemic uncertainty.
Structuring the Unstructured: A Systematic Review of Text-to-Structure Generation for Agentic AI with a Universal Evaluation Framework
Published in arXiv preprint, 2025
This paper provides a systematic review of text-to-structure generation for agentic AI with a universal evaluation framework.
The Cognitive Bandwidth Bottleneck: Shifting Long-Horizon Agent from Planning with Actions to Planning with Schemas
Published in arXiv preprint, 2025
This paper explores shifting long-horizon agents from planning with actions to planning with schemas.
DixitWorld: Evaluating Multimodal Abductive Reasoning in Vision-Language Models with Multi-Agent Dixit Gameplay
Published in ACL 2026, 2025
Multimodal abductive reasoning, the generation and selection of explanatory hypotheses from partial observations, is a cornerstone of intelligence. Current evaluations of this ability in vision-language models (VLMs) are largely confined to static, single-agent tasks. Inspired by Dixit, we introduce DixitWorld, a comprehensive evaluation suite designed to deconstruct this challenge. DixitWorld features two core components: DixitArena, a dynamic, multi-agent environment that evaluates both hypothesis generation and hypothesis selection under imperfect information; and DixitBench, a static QA benchmark that isolates the listener’s task for efficient, controlled evaluation.
AutoGraph-R1: End-to-End Reinforcement Learning for Knowledge Graph Construction
Published in ACL 2026, 2025
This paper presents AutoGraph-R1, an end-to-end reinforcement learning approach for knowledge graph construction. Accepted by ACL 2026.
CritiCal: Can Critique Help LLM Uncertainty or Confidence Calibration?
Published in arXiv preprint, 2025
This paper explores whether critique can help LLM uncertainty or confidence calibration.
NAACL: Noise-AwAre Verbal Confidence Calibration for Robust LLMs in RAG Systems
Published in arXiv preprint, 2026
Accurately assessing model confidence is essential for deploying large language models (LLMs) in mission-critical factual domains. While retrieval-augmented generation (RAG) is widely adopted to improve grounding, confidence calibration in RAG settings remains poorly understood. We conduct a systematic study across four benchmarks, revealing that LLMs exhibit poor calibration performance due to noisy retrieved contexts. Specifically, contradictory or irrelevant evidence tends to inflate the model’s false certainty, leading to severe overconfidence. To address this, we propose NAACL Rules (Noise-AwAre Confidence CaLibration Rules) to provide a principled foundation for resolving overconfidence under noise. We further design NAACL, a noise-aware calibration framework that synthesizes supervision from about 2K HotpotQA examples guided by these rules. By performing supervised fine-tuning (SFT) with this data, NAACL equips models with intrinsic noise awareness without relying on stronger teacher models. Empirical results show that NAACL yields substantial gains, improving ECE scores by 10.9% in-domain and 8.0% out-of-domain.
SciResearcher: Scaling Deep Research Agents for Frontier Scientific Reasoning
Published in arXiv preprint, 2026
Frontier scientific reasoning is rapidly emerging as a key foundation for advancing AI agents in automated scientific discovery. Deep research agents offer a promising approach to this challenge. These models develop robust problem-solving capabilities through post-training on information-seeking tasks, which are typically curated via knowledge graph construction or iterative web browsing. However, these strategies face inherent limitations in frontier science, where domain-specific knowledge is scattered across sparse and heterogeneous academic sources, and problem solving requires sophisticated computation and reasoning far beyond factual recall. To bridge this gap, we introduce SciResearcher, a fully automated agentic framework for frontier-science data construction. SciResearcher synthesizes diverse conceptual and computational tasks grounded in academic evidence, while eliciting information acquisition, tool-integrated reasoning, and long-horizon capabilities. Leveraging the curated data for supervised fine-tuning and agentic reinforcement learning, we develop SciResearcher-8B, an agent foundation model that achieves 19.46% on the HLE-Bio/Chem-Gold benchmark, establishing a new state of the art at its parameter scale and surpassing several larger proprietary agents. It further achieves 13-15% absolute gains on SuperGPQA-Hard-Biology and TRQA-Literature benchmarks. Overall, SciResearcher introduces a new paradigm for automated data construction for frontier scientific reasoning and offers a scalable path toward future scientific agents.
Teaching
Teaching experience 1
Undergraduate course, University 1, Department, 2014
This is a description of a teaching experience. You can use markdown like any other post.
Teaching experience 2
Workshop, University 1, Department, 2015
This is a description of a teaching experience. You can use markdown like any other post.
