Computation and Language 68
☆ SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning
Bo Liu, Leon Guertler, Simon Yu, Zichen Liu, Penghui Qi, Daniel Balcells, Mickel Liu, Cheston Tan, Weiyan Shi, Min Lin, Wee Sun Lee, Natasha Jaques
Recent advances in reinforcement learning have shown that language models can
develop sophisticated reasoning through training on tasks with verifiable
rewards, but these approaches depend on human-curated problem-answer pairs and
domain-specific reward engineering. We introduce SPIRAL, a self-play framework
where models learn by playing multi-turn, zero-sum games against continuously
improving versions of themselves, eliminating the need for human supervision.
Through self-play, SPIRAL generates an infinite curriculum of progressively
challenging problems as models must constantly adapt to stronger opponents. To
enable this self-play training at scale, we implement a fully online,
multi-turn, multi-agent reinforcement learning system for LLMs and propose
role-conditioned advantage estimation (RAE) to stabilize multi-agent training.
With SPIRAL, self-play on zero-sum games produces reasoning capabilities that
transfer broadly. Training Qwen3-4B-Base on Kuhn Poker alone achieves 8.6%
improvement on math and 8.4% on general reasoning, outperforming SFT on 25,000
expert game trajectories. Analysis reveals that this transfer occurs through
three cognitive patterns: systematic decomposition, expected value calculation,
and case-by-case analysis. Multi-game training (TicTacToe, Kuhn Poker, Simple
Negotiation) further enhances performance as each game develops distinct
reasoning strengths. Applying SPIRAL to a strong reasoning model
(DeepSeek-R1-Distill-Qwen-7B) can still lead to a 2.0% average improvement. These
results demonstrate that zero-sum games naturally develop transferable
reasoning capabilities, highlighting a promising direction for autonomous
reasoning development.
comment: Work in Progress
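The abstract above names role-conditioned advantage estimation (RAE) without detailing it; as a loose illustration of the general idea, the sketch below keeps a separate running-mean baseline per game role and subtracts it from that role's episode return. Class and variable names are hypothetical and this is not the authors' implementation.

```python
from collections import defaultdict

class RoleConditionedAdvantage:
    """Illustrative per-role baseline for zero-sum self-play returns.

    Keeps a running mean of returns for each role (e.g., "player_0" vs
    "player_1") and uses it as the baseline when turning an episode
    return into an advantage estimate.
    """

    def __init__(self, decay: float = 0.99):
        self.decay = decay
        self.baselines = defaultdict(float)
        self.seen = defaultdict(bool)

    def update_and_estimate(self, role: str, episode_return: float) -> float:
        if not self.seen[role]:
            self.baselines[role] = episode_return
            self.seen[role] = True
        else:
            self.baselines[role] = (
                self.decay * self.baselines[role]
                + (1.0 - self.decay) * episode_return
            )
        # Advantage = return relative to what this role usually achieves.
        return episode_return - self.baselines[role]

# Toy usage: zero-sum returns for the two roles over a few self-play games.
rae = RoleConditionedAdvantage()
for ret in [1.0, -1.0, 1.0]:
    print(rae.update_and_estimate("player_0", ret),
          rae.update_and_estimate("player_1", -ret))
```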
☆ Computational Detection of Intertextual Parallels in Biblical Hebrew: A Benchmark Study Using Transformer-Based Language Models
Identifying parallel passages in biblical Hebrew is foundational in biblical
scholarship for uncovering intertextual relationships. Traditional methods rely
on manual comparison, which is labor-intensive and prone to human error. This
study evaluates the potential of pre-trained transformer-based language models,
including E5, AlephBERT, MPNet, and LaBSE, for detecting textual parallels in
the Hebrew Bible. Focusing on known parallels between the books of Samuel/Kings
and Chronicles, I assessed each model's capability to generate word embeddings
that distinguish parallel from non-parallel passages. Utilizing cosine similarity
and Wasserstein distance measures, I found that E5 and AlephBERT show
significant promise, with E5 excelling in parallel detection and AlephBERT
demonstrating stronger non-parallel differentiation. These findings indicate
that pre-trained models can enhance the efficiency and accuracy of detecting
intertextual parallels in ancient texts, suggesting broader applications for
ancient language studies.
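For readers unfamiliar with the two measures named above, the sketch below scores pairs of passage embeddings with cosine similarity and compares the similarity distributions of parallel versus non-parallel pairs using the Wasserstein distance. The random vectors are placeholders for real model embeddings (e.g., from E5 or AlephBERT); this is an assumption-laden illustration, not the study's pipeline.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
dim = 768

def fake_pair(correlated: bool):
    """Stand-in for embeddings of a verse pair (correlated = parallel)."""
    a = rng.normal(size=dim)
    b = a + rng.normal(scale=0.3, size=dim) if correlated else rng.normal(size=dim)
    return a, b

parallel_sims = np.array([cosine_sim(*fake_pair(True)) for _ in range(200)])
nonparallel_sims = np.array([cosine_sim(*fake_pair(False)) for _ in range(200)])

# The Wasserstein distance between the two similarity distributions indicates
# how well the embedding space separates parallels from non-parallels.
print("mean sim (parallel):    ", round(float(parallel_sims.mean()), 3))
print("mean sim (non-parallel):", round(float(nonparallel_sims.mean()), 3))
print("Wasserstein distance:   ",
      round(wasserstein_distance(parallel_sims, nonparallel_sims), 3))
```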
☆ On the Predictive Power of Representation Dispersion in Language Models
We show that a language model's ability to predict text is tightly linked to
the breadth of its embedding space: models that spread their contextual
representations more widely tend to achieve lower perplexity. Concretely, we
find that representation dispersion - the average pairwise cosine distance
among hidden vectors - strongly and negatively correlates with perplexity
across diverse model families (LLaMA, Qwen, and others) and domains (Wikipedia,
news, scientific abstracts). Beyond illustrating this link, we show how
dispersion can be leveraged for a range of practical tasks without requiring
labeled data. First, measuring dispersion on unlabeled text allows us to
predict downstream accuracy in new domains, offering a data-efficient tool for
model selection. Next, we find that identifying layers with higher dispersion
pinpoints the best representations for retrieval-based methods such as kNN-LM,
bypassing exhaustive layer-by-layer searches. Finally, we integrate a simple
push-away objective into training, which increases dispersion in both
single-domain and cross-domain scenarios and directly improves perplexity in
each.
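The dispersion statistic defined above (average pairwise cosine distance among hidden vectors) is simple to compute; the sketch below assumes a matrix of contextual hidden states already extracted from one layer of a model, with random vectors as stand-ins.

```python
import numpy as np

def representation_dispersion(hidden: np.ndarray) -> float:
    """Average pairwise cosine distance among hidden vectors.

    hidden: array of shape (n_tokens, hidden_dim), e.g. contextual
    representations collected from one layer over a text sample.
    """
    normed = hidden / np.linalg.norm(hidden, axis=1, keepdims=True)
    sims = normed @ normed.T                 # pairwise cosine similarities
    n = sims.shape[0]
    off_diag = sims[~np.eye(n, dtype=bool)]  # drop self-similarities
    return float(np.mean(1.0 - off_diag))    # cosine distance = 1 - similarity

# Toy usage with random vectors standing in for real hidden states.
rng = np.random.default_rng(0)
print(representation_dispersion(rng.normal(size=(128, 64))))
```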
☆ MotionGPT3: Human Motion as a Second Modality
Though recent advances in multimodal models have demonstrated strong
capabilities and opportunities in unified understanding and generation, the
development of unified motion-language models remains underexplored. To enable
such models with high-fidelity human motion, two core challenges must be
addressed. The first is the reconstruction gap between the continuous motion
modality and its discrete representation under autoregressive modeling, and the
second is the degradation of language intelligence during unified training.
Inspired by the mixture of experts, we propose MotionGPT3, a bimodal
motion-language model that treats human motion as a second modality, decoupling
motion modeling via separate model parameters and enabling both effective
cross-modal interaction and efficient multimodal scaling training. To preserve
language intelligence, the text branch retains the original structure and
parameters of the pretrained language model, while a new motion branch is
integrated via a shared attention mechanism, enabling bidirectional information
flow between two modalities. We first employ a motion Variational Autoencoder
(VAE) to encode raw human motion into latent representations. Based on this
continuous latent space, the motion branch predicts motion latents directly
from intermediate hidden states using a diffusion head, bypassing discrete
tokenization. Extensive experiments show that our approach achieves competitive
performance on both motion understanding and generation tasks while preserving
strong language capabilities, establishing a unified bimodal motion diffusion
framework within an autoregressive paradigm.
comment: 21 pages, 8 figures
☆ STACK: Adversarial Attacks on LLM Safeguard Pipelines
Ian R. McKenzie, Oskar J. Hollinsworth, Tom Tseng, Xander Davies, Stephen Casper, Aaron D. Tucker, Robert Kirk, Adam Gleave
Frontier AI developers are relying on layers of safeguards to protect against
catastrophic misuse of AI systems. Anthropic guards their latest Claude 4 Opus
model using one such defense pipeline, and other frontier developers including
Google DeepMind and OpenAI pledge to soon deploy similar defenses. However, the
security of such pipelines is unclear, with limited prior work evaluating or
attacking these pipelines. We address this gap by developing and red-teaming an
open-source defense pipeline. First, we find that a novel few-shot-prompted
input and output classifier outperforms state-of-the-art open-weight safeguard
model ShieldGemma across three attacks and two datasets, reducing the attack
success rate (ASR) to 0% on the catastrophic misuse dataset ClearHarm. Second,
we introduce a STaged AttaCK (STACK) procedure that achieves 71% ASR on
ClearHarm in a black-box attack against the few-shot-prompted classifier
pipeline. Finally, we also evaluate STACK in a transfer setting, achieving 33%
ASR, providing initial evidence that it is feasible to design attacks with no
access to the target pipeline. We conclude by suggesting specific mitigations
that developers could use to thwart staged attacks.
☆ Logit-Gap Steering: Efficient Short-Suffix Jailbreaks for Aligned Large Language Models
We introduce logit-gap steering, a fast jailbreak framework that casts the
refusal-affirmation gap of RLHF-aligned language models as a single pass over
the vocabulary. A forward-computable score blends gap reduction with
lightweight proxies for KL penalty and reward shift, allowing a "sort-sum-stop"
sweep to complete in under a second and return a short suffix--two orders of
magnitude fewer model calls than beam or gradient attacks. The same suffix
generalises to unseen prompts and scales from 0.5B to 70B checkpoints,
lifting one-shot attack success from baseline levels to 80-100% while
preserving topical coherence. Beyond efficiency, these suffixes expose
sentence-boundary reward cliffs and other alignment artefacts, offering a
lightweight probe into how safety tuning reshapes internal representations.
☆ Ella: Embodied Social Agents with Lifelong Memory
We introduce Ella, an embodied social agent capable of lifelong learning
within a community in a 3D open world, where agents accumulate experiences and
acquire knowledge through everyday visual observations and social interactions.
At the core of Ella's capabilities is a structured, long-term multimodal memory
system that stores, updates, and retrieves information effectively. It consists
of a name-centric semantic memory for organizing acquired knowledge and a
spatiotemporal episodic memory for capturing multimodal experiences. By
integrating this lifelong memory system with foundation models, Ella retrieves
relevant information for decision-making, plans daily activities, builds social
relationships, and evolves autonomously while coexisting with other intelligent
beings in the open world. We conduct capability-oriented evaluations in a
dynamic 3D open world where 15 agents engage in social activities for days and
are assessed with a suite of unseen controlled evaluations. Experimental
results show that Ella can influence, lead, and cooperate with other agents
well to achieve goals, showcasing its ability to learn effectively through
observation and social interaction. Our findings highlight the transformative
potential of combining structured memory systems with foundation models for
advancing embodied intelligence. More videos can be found at
https://umass-embodied-agi.github.io/Ella/.
☆ EXPERT: An Explainable Image Captioning Evaluation Metric with Structured Explanations ACL 2025
Recent advances in large language models and vision-language models have led
to growing interest in explainable evaluation metrics for image captioning.
However, these metrics generate explanations without standardized criteria, and
the overall quality of the generated explanations remains unverified. In this
paper, we propose EXPERT, a reference-free evaluation metric that provides
structured explanations based on three fundamental criteria: fluency,
relevance, and descriptiveness. By constructing large-scale datasets of
high-quality structured explanations, we develop a two-stage evaluation
template to effectively supervise a vision-language model for both scoring and
explanation generation. EXPERT achieves state-of-the-art results on benchmark
datasets while providing significantly higher-quality explanations than
existing metrics, as validated through comprehensive human evaluation. Our code
and datasets are available at https://github.com/hjkim811/EXPERT.
comment: Accepted at ACL 2025 Findings
☆ Large Language Models Don't Make Sense of Word Problems. A Scoping Review from a Mathematics Education Perspective
The progress of Large Language Models (LLMs) like ChatGPT raises the question
of how they can be integrated into education. One hope is that they can support
mathematics learning, including word-problem solving. Since LLMs can handle
textual input with ease, they appear well-suited for solving mathematical word
problems. Yet their real competence, whether they can make sense of the
real-world context, and the implications for classrooms remain unclear. We
conducted a scoping review from a mathematics-education perspective, including
three parts: a technical overview, a systematic review of word problems used in
research, and a state-of-the-art empirical evaluation of LLMs on mathematical
word problems. First, in the technical overview, we contrast the
conceptualization of word problems and their solution processes between LLMs
and students. In computer-science research this is typically labeled
mathematical reasoning, a term that does not align with usage in mathematics
education. Second, our literature review of 213 studies shows that the most
popular word-problem corpora are dominated by s-problems, which do not require
consideration of the realities of their real-world context. Finally, our
evaluation of GPT-3.5-turbo, GPT-4o-mini, GPT-4.1, and o3 on 287 word problems
shows that most recent LLMs solve these s-problems with near-perfect accuracy,
including a perfect score on 20 problems from PISA. LLMs still showed
weaknesses in tackling problems where the real-world context is problematic or
nonsensical. In sum, we argue based on all three aspects that LLMs have
mastered a superficial solution process but do not make sense of word problems,
which potentially limits their value as instructional tools in mathematics
classrooms.
☆ Auto-TA: Towards Scalable Automated Thematic Analysis (TA) via Multi-Agent Large Language Models with Reinforcement Learning ACL 2025
Congenital heart disease (CHD) presents complex, lifelong challenges often
underrepresented in traditional clinical metrics. While unstructured narratives
offer rich insights into patient and caregiver experiences, manual thematic
analysis (TA) remains labor-intensive and unscalable. We propose a fully
automated large language model (LLM) pipeline that performs end-to-end TA on
clinical narratives, which eliminates the need for manual coding or full
transcript review. Our system employs a novel multi-agent framework, where
specialized LLM agents assume roles to enhance theme quality and alignment with
human analysis. To further improve thematic relevance, we optionally integrate
reinforcement learning from human feedback (RLHF). This supports scalable,
patient-centered analysis of large qualitative datasets and allows LLMs to be
fine-tuned for specific clinical contexts.
comment: Presented at ACL 2025 SRW
☆ Machine Understanding of Scientific Language
Scientific information expresses human understanding of nature. This
knowledge is largely disseminated in different forms of text, including
scientific papers, news articles, and discourse among people on social media.
While important for accelerating our pursuit of knowledge, not all scientific
text is faithful to the underlying science. As the volume of this text has
burgeoned online in recent years, it has become a problem of societal
importance to be able to identify the faithfulness of a given piece of
scientific text automatically. This thesis is concerned with the cultivation of
datasets, methods, and tools for machine understanding of scientific language,
in order to analyze and understand science communication at scale. To arrive at
this, I present several contributions in three areas of natural language
processing and machine learning: automatic fact checking, learning with limited
data, and scientific text processing. These contributions include new methods
and resources for identifying check-worthy claims, adversarial claim
generation, multi-source domain adaptation, learning from crowd-sourced labels,
cite-worthiness detection, zero-shot scientific fact checking, detecting
exaggerated scientific claims, and modeling degrees of information change in
science communication. Critically, I demonstrate how the research outputs of
this thesis are useful for effectively learning from limited amounts of
scientific text in order to identify misinformative scientific statements and
generate new insights into the science communication process.
comment: PhD Thesis, 210 pages
☆ TaP: A Taxonomy-Guided Framework for Automated and Scalable Preference Data Generation
Renren Jin, Tianhao Shen, Xinwei Wu, Dan Shi, Haoran Sun, Wuwei Huang, Quandong Wang, Wei Liu, Jian Luan, Bin Wang, Deyi Xiong
Conducting supervised fine-tuning and preference fine-tuning on large
language models (LLMs) requires high-quality datasets to improve their ability
to follow instructions and align with human preferences and values. However,
constructing such datasets is resource-intensive, and most available datasets
for supervised and preference fine-tuning are in English. To address these
challenges, we propose the Taxonomy-Guided Preference Data Generation (TaP)
framework, which
facilitates automated and scalable construction of preference datasets across
various languages. TaP is grounded in a structured taxonomy that allows
fine-grained control over dataset composition, thereby ensuring both diversity
and comprehensive coverage. We employ TaP-generated datasets to perform
supervised and preference fine-tuning on various LLMs. Experimental results
demonstrate that LLMs trained on TaP-generated datasets outperform those
trained on existing open-source datasets. Remarkably, LLMs trained on
TaP-generated datasets surpass the performance of those trained on an
open-source dataset that is 180 times larger.
comment: 33 pages, 15 tables, 11 figures
☆ LLM Agents Are the Antidote to Walled Gardens
While the Internet's core infrastructure was designed to be open and
universal, today's application layer is dominated by closed, proprietary
platforms. Open and interoperable APIs require significant investment, and
market leaders have little incentive to enable data exchange that could erode
their user lock-in. We argue that LLM-based agents fundamentally disrupt this
status quo. Agents can automatically translate between data formats and
interact with interfaces designed for humans: this makes interoperability
dramatically cheaper and effectively unavoidable. We name this shift universal
interoperability: the ability for any two digital services to exchange data
seamlessly using AI-mediated adapters. Universal interoperability undermines
monopolistic behaviours and promotes data portability. However, it can also
lead to new security risks and technical debt. Our position is that the ML
community should embrace this development while building the appropriate
frameworks to mitigate the downsides. By acting now, we can harness AI to
restore user freedom and competitive markets without sacrificing security.
☆ Unveiling Decision-Making in LLMs for Text Classification: Extraction of influential and interpretable concepts with Sparse Autoencoders
Sparse Autoencoders (SAEs) have been successfully used to probe Large
Language Models (LLMs) and extract interpretable concepts from their internal
representations. These concepts are linear combinations of neuron activations
that correspond to human-interpretable features. In this paper, we investigate
the effectiveness of SAE-based explainability approaches for sentence
classification, a domain where such methods have not been extensively explored.
We present a novel SAE-based architecture tailored for text classification,
leveraging a specialized classifier head and incorporating an activation rate
sparsity loss. We benchmark this architecture against established methods such
as ConceptShap, Independent Component Analysis, and other SAE-based concept
extraction techniques. Our evaluation covers two classification benchmarks and
four fine-tuned LLMs from the Pythia family. We further enrich our analysis
with two novel metrics for measuring the precision of concept-based
explanations, using an external sentence encoder. Our empirical results show
that our architecture improves both the causality and interpretability of the
extracted features.
☆ Graft: Integrating the Domain Knowledge via Efficient Parameter Synergy for MLLMs
Yang Dai, Jianxiang An, Tianwei Lin, Hongyang He, Hongzhe Huang, Wenqiao Zhang, Zheqi Lv, Siliang Tang, Yueting Zhuang
Multimodal Large Language Models (MLLMs) have achieved success across various
domains. However, their applicability tends to degrade when confronted with
different types of data inputs, especially for MLLMs that have been fine-tuned
for specific tasks. Despite its importance, the study of knowledge sharing
among domain-specific MLLMs--such as those trained for mathematics or
code--remains largely underexplored. To address the fragmentation of knowledge
across domain-specialized MLLMs, we propose a unified parameter integration
framework that enables modular composition of expert capabilities. Our method
is grounded in a novel Compatibility-Aware Parameter Splicing (CAPS) strategy,
which leverages both local functional attribution and global
information-theoretic signals to guide selective parameter fusion. By extending
this mechanism to the low-rank adaptation layer granularity, we ensure
efficient integration with minimal inference overhead. Furthermore, we
introduce a domain compatibility scoring mechanism that quantifies inter-expert
alignment at the activation level and correlates with downstream task utility.
This principled fusion protocol allows the final model to synergize
heterogeneous expertise while preserving structural modularity. Extensive
evaluations across diverse multimodal benchmarks validate the effectiveness of
our framework, offering a scalable path toward compositional, domain-adaptive
MLLMs.
☆ Leveraging the Potential of Prompt Engineering for Hate Speech Detection in Low-Resource Languages
The rapid expansion of social media leads to a marked increase in hate
speech, which threatens personal lives and results in numerous hate crimes.
Detecting hate speech presents several challenges: diverse dialects, frequent
code-mixing, and the prevalence of misspelled words in user-generated content
on social media platforms. Recent progress in hate speech detection is
typically concentrated on high-resource languages. However, low-resource
languages still face significant challenges due to the lack of large-scale,
high-quality datasets. This paper investigates how we can overcome this
limitation via prompt engineering on large language models (LLMs), focusing on
the low-resource Bengali language. We investigate six prompting strategies -
zero-shot prompting, refusal suppression, flattering the classifier, multi-shot
prompting, role prompting, and finally our innovative metaphor prompting to
detect hate speech effectively in low-resource languages. We pioneer metaphor
prompting to circumvent the built-in safety mechanisms of LLMs, which marks a
significant departure from existing jailbreaking methods. We
investigate all six different prompting strategies on the Llama2-7B model and
compare the results extensively with three pre-trained word embeddings - GloVe,
Word2Vec, and FastText for three different deep learning models - multilayer
perceptron (MLP), convolutional neural network (CNN), and bidirectional gated
recurrent unit (BiGRU). To prove the effectiveness of our metaphor prompting in
the low-resource Bengali language, we also evaluate it in another low-resource
language - Hindi, and two high-resource languages - English and German. The
performance of all prompting techniques is evaluated using the F1 score and the
environmental impact factor (IF), which measures CO2 emissions, electricity
usage, and computational time.
☆ IMPACT: Inflectional Morphology Probes Across Complex Typologies
Large Language Models (LLMs) have shown significant progress on various
multilingual benchmarks and are increasingly used to generate and evaluate text
in non-English languages. However, while they may produce fluent outputs, it
remains unclear to what extent these models truly grasp the underlying
linguistic complexity of those languages, particularly in morphology. To
investigate this, we introduce IMPACT, a synthetically generated evaluation
framework focused on inflectional morphology, which we publicly release,
designed to evaluate LLM performance across five morphologically rich
languages: Arabic, Russian, Finnish, Turkish, and Hebrew. IMPACT includes
unit-test-style cases covering both shared and language-specific phenomena,
from basic verb inflections (e.g., tense, number, gender) to unique features
like Arabic's reverse gender agreement and vowel harmony in Finnish and
Turkish. We assess eight multilingual LLMs that, despite strong English
performance, struggle with other languages and uncommon morphological patterns,
especially when judging ungrammatical examples. We also show that Chain of
Thought and Thinking Models can degrade performance. Our work exposes gaps in
LLMs' handling of linguistic complexity, pointing to clear room for
improvement. To support further research, we publicly release the IMPACT
framework.
☆ The Trilemma of Truth in Large Language Models
We often attribute human characteristics to large language models (LLMs) and
claim that they "know" certain things. LLMs have an internal probabilistic
knowledge that represents information retained during training. How can we
assess the veracity of this knowledge? We examine two common methods for
probing the veracity of LLMs and discover several assumptions that are flawed.
To address these flawed assumptions, we introduce sAwMIL (short for Sparse
Aware Multiple-Instance Learning), a probing method that utilizes the internal
activations of LLMs to separate statements into true, false, and neither.
sAwMIL is based on multiple-instance learning and conformal prediction. We
evaluate sAwMIL on 5 validity criteria across 16 open-source LLMs, including
both default and chat-based variants, as well as on 3 new datasets. Among the
insights we provide are: (1) the veracity signal is often concentrated in the
third quarter of an LLM's depth; (2) truth and falsehood signals are not always
symmetric; (3) linear probes perform better on chat models than on default
models; (4) nonlinear probes may be required to capture veracity signals for
some LLMs with reinforcement learning from human feedback or knowledge
distillation; and (5) LLMs capture a third type of signal that is distinct from
true and false and is neither true nor false. These findings provide a reliable
method for verifying what LLMs "know" and how certain they are of their
probabilistic internal knowledge.
☆ Advancing Multi-Step Mathematical Reasoning in Large Language Models through Multi-Layered Self-Reflection with Auto-Prompting ECML
Recent advancements in Large Language Models (LLMs) have significantly
improved their problem-solving capabilities. However, these models still
struggle when faced with complex multi-step reasoning tasks. In this paper, we
propose the Multi-Layered Self-Reflection with Auto-Prompting (MAPS) framework,
a novel approach designed to enhance multi-step mathematical reasoning in LLMs
by integrating techniques such as Chain of Thought (CoT), Self-Reflection, and
Auto-Prompting. Unlike traditional static prompting methods, MAPS employs an
iterative refinement process. Initially, the model generates a solution using
CoT prompting. When errors are detected, an adaptive self-reflection mechanism
identifies and analyzes them, generating tailored prompts to guide corrections.
These dynamically adjusted prompts enable the model to iteratively refine its
reasoning. Experiments on four well-established benchmarks across multiple LLMs
show that MAPS significantly outperforms standard CoT and achieves competitive
results with reasoning-optimized models. In addition, MAPS enables
general-purpose LLMs to reach performance levels comparable to specialized
reasoning models. While deeper reflection layers improve accuracy, they also
increase token usage and costs. To balance this trade-off, MAPS strategically
limits reflection depth, ensuring an optimal balance between cost and reasoning
performance.
comment: Accepted for publication in: European Conference on Machine Learning
and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD
2025). Research Track
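The iterative refinement described in the MAPS abstract above reduces to a short control-flow loop: answer with CoT, reflect when an error is detected, and fold the critique into the next prompt, up to a bounded reflection depth. The sketch below is a generic interpretation with stubbed `call_llm` and `is_correct` functions (both hypothetical), not the paper's prompts or framework code.

```python
from typing import Callable

def reflective_solve(
    question: str,
    call_llm: Callable[[str], str],
    is_correct: Callable[[str], bool],
    max_reflections: int = 3,
) -> str:
    """Illustrative CoT + self-reflection + auto-prompting loop."""
    answer = call_llm(f"Solve step by step:\n{question}")
    for _ in range(max_reflections):          # bounded reflection depth
        if is_correct(answer):
            break
        critique = call_llm(
            "The answer below may contain errors. Identify them and suggest "
            f"a corrective prompt.\nQuestion: {question}\nAnswer: {answer}"
        )
        # Auto-prompting: the critique is folded into the next attempt.
        answer = call_llm(
            "Solve step by step, taking this feedback into account:\n"
            f"{critique}\nQuestion: {question}"
        )
    return answer

# Toy stubs so the sketch runs without any model or API access.
canned = iter(["answer: 41", "off by one", "answer: 42"])
print(reflective_solve("What is 6 * 7?",
                       call_llm=lambda prompt: next(canned),
                       is_correct=lambda a: a.endswith("42")))
```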
☆ Garbage In, Reasoning Out? Why Benchmark Scores are Unreliable and What to Do About It
We conduct a systematic audit of three widely used reasoning benchmarks,
SocialIQa, FauxPas-EAI, and ToMi, and uncover pervasive flaws in both benchmark
items and evaluation methodology. Using five LLMs (GPT-{3, 3.5, 4, o1}, and
LLaMA 3.1) as diagnostic tools, we identify structural, semantic, and pragmatic
issues in benchmark design (e.g., duplicated items, ambiguous wording, and
implausible answers), as well as scoring procedures that prioritize output form
over reasoning process. Through systematic human annotation and re-evaluation
on cleaned benchmark subsets, we find that changes in model scores are often
driven by erratic surface wording variations rather than by improved reasoning.
In fact, further analyses show that model performance is highly sensitive to
minor input variations such as context availability and phrasing, revealing
that high scores may reflect alignment with format-specific cues rather than
consistent inference based on the input. These findings challenge the validity
of current benchmark-based claims about reasoning in LLMs, and highlight the
need for evaluation protocols that assess reasoning as a process of drawing
inference from available information, rather than as static output selection.
We release audited data and evaluation tools to support more interpretable and
diagnostic assessments of model reasoning.
☆ Use Sparse Autoencoders to Discover Unknown Concepts, Not to Act on Known Concepts
While sparse autoencoders (SAEs) have generated significant excitement, a
series of negative results have added to skepticism about their usefulness.
Here, we establish a conceptual distinction that reconciles competing
narratives surrounding SAEs. We argue that while SAEs may be less effective for
acting on known concepts, SAEs are powerful tools for discovering unknown
concepts. This distinction cleanly separates existing negative and positive
results, and suggests several classes of SAE applications. Specifically, we
outline use cases for SAEs in (i) ML interpretability, explainability,
fairness, auditing, and safety, and (ii) social and health sciences.
☆ Do Thinking Tokens Help or Trap? Towards More Efficient Large Reasoning Model
Large Reasoning Models (LRMs) excel at solving complex problems but face an
overthinking dilemma. When handling simple tasks, they often produce verbose
responses overloaded with thinking tokens (e.g., wait, however). These tokens
trigger unnecessary high-level reasoning behaviors like reflection and
backtracking, reducing efficiency. In this work, our pilot study reveals that
these thinking-token-induced behaviors are not essential for effective
problem-solving and may even hinder correct reasoning within constrained token
budgets. We identify this phenomenon as the thinking trap. To mitigate this
issue, we propose Dual Policy Preference Optimization (DuP-PO), a novel
algorithm featuring: (1) A rollout sampling strategy that guarantees balanced
exposure to responses with and without thinking tokens; (2) A fine-grained
advantage control technique to dynamically regulate the prediction of target
tokens; (3) A policy shaping method ensuring stable gradient contributions from
thinking tokens. Experimental results on five popular math reasoning benchmarks
show that DuP-PO performs well on popular LRMs, significantly improving their
token efficiency during reasoning while achieving performance superior to that
of the base model.
comment: 13 pages, 5 figures
☆ Positional Bias in Binary Question Answering: How Uncertainty Shapes Model Preferences
Positional bias in binary question answering occurs when a model
systematically favors one choice over another based solely on the ordering of
presented options. In this study, we quantify and analyze positional bias
across five large language models under varying degrees of answer uncertainty.
We re-adapted the SQuAD-it dataset by adding an extra incorrect answer option
and then created multiple versions with progressively less context and more
out-of-context answers, yielding datasets that range from low to high
uncertainty. Additionally, we evaluate two naturally higher-uncertainty
benchmarks: (1) WebGPT - question pairs with unequal human-assigned quality
scores, and (2) Winning Arguments - where models predict the more persuasive
argument in Reddit's r/ChangeMyView exchanges. Across each dataset, the order
of the "correct" (or higher-quality/persuasive) option is systematically
flipped (first placed in position 1, then in position 2) to compute both
Preference Fairness and Position Consistency. We observe that positional bias
is nearly absent under low-uncertainty conditions, but grows exponentially when
it becomes harder to decide which option is correct.
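As a rough illustration of the flipped-order protocol and the two metrics named above, the sketch below computes a position-consistency rate and a first-position preference rate from model choices on order-swapped item pairs. The exact metric definitions in the paper may differ; this is only a toy computation.

```python
from typing import Sequence

def positional_bias_metrics(
    picks_correct_first: Sequence[int],   # model pick (1 or 2) when correct option is position 1
    picks_correct_second: Sequence[int],  # model pick (1 or 2) on the flipped version of the same item
) -> dict:
    """Toy position-consistency / position-preference computation over flipped pairs."""
    n = len(picks_correct_first)
    consistent = sum(
        (a == 1 and b == 2) or (a == 2 and b == 1)   # same underlying option chosen both times
        for a, b in zip(picks_correct_first, picks_correct_second)
    )
    first_pos_rate = (
        sum(c == 1 for c in picks_correct_first)
        + sum(c == 1 for c in picks_correct_second)
    ) / (2 * n)
    return {
        "position_consistency": consistent / n,
        "first_position_preference": first_pos_rate,  # 0.5 means no positional preference
    }

# Toy usage: a model that always answers "option 1" is maximally position-biased.
print(positional_bias_metrics([1, 1, 1, 1], [1, 1, 1, 1]))
```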
☆ AutoEvoEval: An Automated Framework for Evolving Close-Ended LLM Evaluation Data
Large language models (LLMs) have shown remarkable performance on various
tasks, but existing evaluation benchmarks are often static and insufficient to
fully assess their robustness and generalization in realistic scenarios. Prior
work using evolutionary or adversarial data augmentation has improved
evaluation diversity but lacks systematic control over perturbation types and
multi-step complexity, limiting comprehensive robustness analysis. To address
these gaps, we propose AutoEvoEval, an evolution-based evaluation framework for
close-ended tasks such as multi-choice question answering. AutoEvoEval
introduces 22 interpretable atomic evolution operations and supports
multi-round compositions, enabling controlled generation of diverse,
challenging, and realistic test samples. We conduct extensive experiments
addressing four research questions on a broad set of open- and closed-source
LLMs. Our results show that atomic operations cause an average accuracy drop of
7.283%, with structure-disrupting or misleading semantic edits causing the
largest declines. Model sensitivities vary significantly for the same
perturbation, and combining multiple evolution steps amplifies adversarial
effects by up to 52.932%. These findings suggest current benchmarks may
overestimate true model generalization and emphasize the need for
evolution-aware robustness evaluation. Code and resources are available at:
https://github.com/SYSUSELab/AutoEvoEval.
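To make the notion of atomic evolution operations concrete, the sketch below shows two illustrative perturbations of a multiple-choice item (option shuffling and distractor insertion) and a two-step composition. The paper's 22 operations are not reproduced here; the names and data structure are hypothetical.

```python
import random
from dataclasses import dataclass

@dataclass
class MCQItem:
    question: str
    options: list
    answer: str            # text of the correct option, tracked across perturbations

def shuffle_options(item: MCQItem, seed: int = 0) -> MCQItem:
    """Atomic operation: permute answer options (the answer text stays valid)."""
    opts = item.options[:]
    random.Random(seed).shuffle(opts)
    return MCQItem(item.question, opts, item.answer)

def add_distractor(item: MCQItem, distractor: str) -> MCQItem:
    """Atomic operation: insert an extra plausible-but-wrong option."""
    return MCQItem(item.question, item.options + [distractor], item.answer)

# Multi-round composition: chain atomic operations to raise difficulty.
item = MCQItem("2 + 2 = ?", ["3", "4", "5"], answer="4")
print(add_distractor(shuffle_options(item), "22"))
```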
☆ Towards an Automated Multimodal Approach for Video Summarization: Building a Bridge Between Text, Audio and Facial Cue-Based Summarization
The increasing volume of video content in educational, professional, and
social domains necessitates effective summarization techniques that go beyond
traditional unimodal approaches. This paper proposes a behaviour-aware
multimodal video summarization framework that integrates textual, audio, and
visual cues to generate timestamp-aligned summaries. By extracting prosodic
features, textual cues and visual indicators, the framework identifies
semantically and emotionally important moments. A key contribution is the
identification of bonus words, which are terms emphasized across multiple
modalities and used to improve the semantic relevance and expressive clarity of
the summaries. The approach is evaluated against pseudo-ground truth (pGT)
summaries generated using an LLM-based extractive method. Experimental results
demonstrate significant improvements over traditional extractive methods, such
as the Edmundson method, in both text- and video-based evaluation metrics.
Text-based metrics show ROUGE-1 increasing from 0.4769 to 0.7929 and BERTScore
from 0.9152 to 0.9536, while in video-based evaluation, our proposed framework
improves F1-Score by almost 23%. The findings underscore the potential of
multimodal integration in producing comprehensive and behaviourally informed
video summaries.
comment: Accepted to HHAI WS 2025: Workshops at the Fourth International
Conference on Hybrid Human-Artificial Intelligence (HHAI)
☆ Attestable Audits: Verifiable AI Safety Benchmarks Using Trusted Execution Environments ICML 2024
Benchmarks are important measures to evaluate safety and compliance of AI
models at scale. However, they typically do not offer verifiable results and
lack confidentiality for model IP and benchmark datasets. We propose Attestable
Audits, which run inside Trusted Execution Environments and enable users to
verify interaction with a compliant AI model. Our work protects sensitive data
even when model provider and auditor do not trust each other. This addresses
verification challenges raised in recent AI governance frameworks. We build a
prototype demonstrating feasibility on typical audit benchmarks against
Llama-3.1.
comment: ICML 2024 Workshop TAIG
☆ Efficient Interleaved Speech Modeling through Knowledge Distillation
Current speech language models exceed the size and latency constraints of
many deployment environments. We build compact, expressive speech generation
models through layer-aligned distillation, matching hidden states, attention
maps, and softened logits to compress large multimodal transformers by 3x with
minimal loss in performance. We introduce TinyWave, a family of 2B-parameter
models for speech-to-speech and interleaved speech-text generation, trained on
50,000 hours of public audio. TinyWave supports (i) speech-only generation
using phonetic or expressive tokens and (ii) mixed speech-text continuations.
Evaluation on Libri-Light shows TinyWave within 1.4 normalized perplexity
points of its teacher. Accuracy on spoken StoryCloze and SALMon reaches 93-97%
of the teacher's performance, outperforming size-matched baselines. These
models are optimized for deployment on commodity hardware, enabling
applications in real-time conversational agents, assistive technologies, and
low-resource environments. We release models, training code, and evaluation
scripts to support reproducible research on compact, expressive speech
generation.
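The layer-aligned distillation objective described above (matching hidden states, attention maps, and softened logits) is often written as a weighted sum of per-layer losses plus a temperature-scaled KL term. The sketch below is a generic PyTorch version under that assumption, with matched student/teacher dimensions for simplicity; it is not TinyWave's training code, and in practice a learned projection is usually added when dimensions differ.

```python
import torch
import torch.nn.functional as F

def layer_aligned_distill_loss(
    student_hidden, teacher_hidden,     # lists of (B, T, D) tensors, one per aligned layer
    student_attn, teacher_attn,         # lists of (B, H, T, T) attention maps
    student_logits, teacher_logits,     # (B, T, V) output logits
    temperature: float = 2.0,
    w_hidden: float = 1.0, w_attn: float = 1.0, w_logit: float = 1.0,
) -> torch.Tensor:
    """Generic hidden-state + attention + soft-logit distillation loss."""
    hidden_loss = sum(F.mse_loss(s, t) for s, t in zip(student_hidden, teacher_hidden))
    attn_loss = sum(F.mse_loss(s, t) for s, t in zip(student_attn, teacher_attn))
    # KL between softened teacher and student distributions (standard KD term).
    logit_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return w_hidden * hidden_loss + w_attn * attn_loss + w_logit * logit_loss

# Toy shapes: 2 aligned layers, batch 2, seq 4, dim 8, 2 heads, vocab 16.
sh = [torch.randn(2, 4, 8) for _ in range(2)]; th = [torch.randn(2, 4, 8) for _ in range(2)]
sa = [torch.rand(2, 2, 4, 4) for _ in range(2)]; ta = [torch.rand(2, 2, 4, 4) for _ in range(2)]
print(layer_aligned_distill_loss(sh, th, sa, ta, torch.randn(2, 4, 16), torch.randn(2, 4, 16)))
```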
☆ L0: Reinforcement Learning to Become General Agents
Junjie Zhang, Jingyi Xi, Zhuoyang Song, Junyu Lu, Yuhua Ke, Ting Sun, Yukun Yang, Jiaxing Zhang, Songxin Zhang, Zejian Xie
Training large language models (LLMs) to act as autonomous agents for
multi-turn, long-horizon tasks still poses significant challenges in scalability
and training efficiency. To address this, we introduce L-Zero (L0), a scalable,
end-to-end training pipeline for general-purpose agents. Featuring a low-cost,
extensible, and sandboxed concurrent agent worker pool, L0 lowers the barrier
for applying reinforcement learning in complex environments. We also introduce
NB-Agent, the agent scaffold within L0, which operates in a "code-as-action"
fashion via a Read-Eval-Print-Loop (REPL). We evaluate L0 on factuality
question-answering benchmarks. Our experiments demonstrate that a base model
can develop robust problem-solving skills using solely Reinforcement Learning
with Verifiable Rewards (RLVR). On the Qwen2.5-7B-Instruct model, our method
boosts accuracy on SimpleQA from 30% to 80% and on HotpotQA from 22% to 41%.
We have open-sourced the entire L0 system, including our L0 series models,
the NB-Agent, a complete training pipeline, and the corresponding training
recipes on (https://github.com/cmriat/l0).
☆ Zero-Shot Contextual Embeddings via Offline Synthetic Corpus Generation
Context-aware embedding methods boost retrieval accuracy by conditioning on
corpus statistics (e.g., term co-occurrence and topical patterns) extracted
from neighboring documents. However, this context-aware approach requires
access to the target corpus or requires domain-specific finetuning, posing
practical barriers in privacy-sensitive or resource-constrained settings. We
present ZEST, a zero-shot contextual adaptation framework that replaces real
corpus access with a one-time offline synthesis of a compact proxy. Given only
a handful of exemplar documents representative of the general target domain, we
use a multi-step hierarchical procedure to generate a synthetic context corpus
of several hundred documents that aims to emulate key domain-specific
distributions. At inference, the frozen context-aware encoder uses this proxy
corpus -- without any finetuning or target corpus access -- to produce
domain-adapted embeddings. Across the MTEB benchmark, ZEST's zero-shot
synthetic context adaptation using only five example documents performs within
0.5% of models leveraging full target corpus access -- demonstrating remarkable
efficacy without any retraining. ZEST thus provides a practical method for
deploying high-performance, adaptable embeddings in constrained environments.
☆ Robustness of Misinformation Classification Systems to Adversarial Examples Through BeamAttack
We extend BeamAttack, an adversarial attack algorithm designed to evaluate
the robustness of text classification systems through word-level modifications
guided by beam search. Our extensions include support for word deletions and
the option to skip substitutions, enabling the discovery of minimal
modifications that alter model predictions. We also integrate LIME to better
prioritize word replacements. Evaluated across multiple datasets and victim
models (BiLSTM, BERT, and adversarially trained RoBERTa) within the BODEGA
framework, our approach achieves over a 99% attack success rate while
preserving the semantic and lexical similarity of the original texts. Through
both quantitative and qualitative analysis, we highlight BeamAttack's
effectiveness and its limitations. Our implementation is available at
https://github.com/LucK1Y/BeamAttack
comment: 12 pages main text, 27 pages total including references and
appendices. 13 figures, 10 tables. Accepted for publication in the LNCS
proceedings of CLEF 2025 (Best-of-Labs track)
☆ Evaluating the Simulation of Human Personality-Driven Susceptibility to Misinformation with LLMs
Large language models (LLMs) make it possible to generate synthetic
behavioural data at scale, offering an ethical and low-cost alternative to
human experiments. Whether such data can faithfully capture psychological
differences driven by personality traits, however, remains an open question. We
evaluate the capacity of LLM agents, conditioned on Big-Five profiles, to
reproduce personality-based variation in susceptibility to misinformation,
focusing on news discernment, the ability to judge true headlines as true and
false headlines as false. Leveraging published datasets in which human
participants with known personality profiles rated headline accuracy, we create
matching LLM agents and compare their responses to the original human patterns.
Certain trait-misinformation associations, notably those involving
Agreeableness and Conscientiousness, are reliably replicated, whereas others
diverge, revealing systematic biases in how LLMs internalize and express
personality. The results underscore both the promise and the limits of
personality-aligned LLMs for behavioral simulation, and offer new insight into
modeling cognitive diversity in artificial agents.
comment: pre-print version - paper currently under submission
☆ Semantic-guided Diverse Decoding for Large Language Model
Weijie Shi, Yue Cui, Yaguang Wu, Jingzhi Fang, Shibo Zhang, Mengze Li, Sirui Han, Jia Zhu, Jiajie Xu, Xiaofang Zhou
Diverse decoding of large language models is crucial for applications
requiring multiple semantically distinct responses, yet existing methods
primarily achieve lexical rather than semantic diversity. This limitation
significantly constrains Best-of-N strategies, group-based reinforcement
learning, and data synthesis. While temperature sampling and diverse beam
search modify token distributions or apply n-gram penalties, they fail to
ensure meaningful semantic differentiation. We introduce Semantic-guided
Diverse Decoding (SemDiD), which operates directly in embedding space and
balances quality with diversity through three complementary mechanisms: orthogonal
directional guidance, dynamic inter-group repulsion, and position-debiased
probability assessment. SemDiD harmonizes these competing objectives using
adaptive gain functions and constraint optimization, ensuring both quality
thresholds and maximal semantic differentiation. Experiments show SemDiD
consistently outperforms existing methods, improving Best-of-N coverage by
1.4-5.2% across diverse tasks and accelerating RLHF training convergence by 15%
while increasing accuracy by up to 2.1%.
☆ Reachability in symmetric VASS
We investigate the reachability problem in symmetric vector addition systems
with states (VASS), where transitions are invariant under a group of
permutations of coordinates. One extremal case, the trivial groups, yields
general VASS. In another extremal case, the symmetric groups, we show that the
reachability problem can be solved in PSPACE, regardless of the dimension of
input VASS (to be contrasted with Ackermannian complexity in general VASS). We
also consider other groups, in particular alternating and cyclic ones.
Furthermore, motivated by the open status of the reachability problem in data
VASS, we estimate the gain in complexity when the group arises as a combination
of the trivial and symmetric groups.
☆ MMReason: An Open-Ended Multi-Modal Multi-Step Reasoning Benchmark for MLLMs Toward AGI
Huanjin Yao, Jiaxing Huang, Yawen Qiu, Michael K. Chen, Wenzheng Liu, Wei Zhang, Wenjie Zeng, Xikun Zhang, Jingyi Zhang, Yuxin Song, Wenhao Wu, Dacheng Tao
Reasoning plays a crucial role in advancing Multimodal Large Language Models
(MLLMs) toward Artificial General Intelligence. However, existing MLLM
benchmarks often fall short in precisely and comprehensively evaluating
long-chain reasoning abilities from three key aspects: (1) lack of difficulty
and diversity, (2) susceptibility to guessability and memorization, (3)
inadequate assessment of intermediate reasoning steps. To fill this gap, we
introduce MMReason, a new benchmark designed to precisely and comprehensively
evaluate MLLM long-chain reasoning capability with diverse, open-ended,
challenging questions. First, we curate challenging questions requiring
multi-step reasoning from various fields (i.e., 6 disciplines) and multiple
difficulty levels (i.e., from pre-university to university, and from
foundational to competition tiers). Second, these questions are reformulated
into an open-ended format and filtered using a multi-model voting technique to
eliminate shortcut cases related to guessing and memorization, ensuring robust
reasoning evaluations. Third, we annotate the questions with detailed
step-by-step solutions, and design a reference-based ternary scoring mechanism
to reliably assess intermediate reasoning steps. With MMReason, we benchmark
popular leading MLLMs and provide an in-depth analysis of their reasoning
capabilities. We hope MMReason will serve as a valuable resource for advancing
MLLM reasoning research. Code will be available at
https://github.com/HJYao00/MMReason.
comment: Technical report
☆ On Recipe Memorization and Creativity in Large Language Models: Is Your Model a Creative Cook, a Bad Cook, or Merely a Plagiator?
This work-in-progress investigates the memorization, creativity, and nonsense
found in cooking recipes generated from Large Language Models (LLMs).
Specifically, we aim (i) to analyze memorization, creativity, and nonsense in
LLMs using a small, high-quality set of human judgments and (ii) to evaluate
potential approaches to automate such a human annotation in order to scale our
study to hundreds of recipes. To achieve (i), we conduct a detailed human
annotation on 20 preselected recipes generated by LLM (Mixtral), extracting
each recipe's ingredients and step-by-step actions to assess which elements are
memorized--i.e., directly traceable to online sources possibly seen during
training--and which arise from genuine creative synthesis or outright nonsense.
We find that Mixtral consistently reuses ingredients that can be found in
online documents, potentially seen during model training, suggesting strong
reliance on memorized content. To achieve aim (ii) and scale our analysis
beyond small sample sizes and single LLM validation, we design an
"LLM-as-judge" pipeline that automates recipe generation, nonsense detection,
parsing ingredients and recipe steps, and their annotation. For instance,
comparing its output against human annotations, the best ingredient extractor
and annotator is Llama 3.1+Gemma 2 9B, achieving up to 78% accuracy on
ingredient matching. This automated framework enables large-scale
quantification of memorization, creativity, and nonsense in generated recipes,
providing rigorous evidence of the models' creative capacities.
comment: 13 pages, 5 figures
☆ NEU-ESC: A Comprehensive Vietnamese dataset for Educational Sentiment analysis and topic Classification toward multitask learning
In the field of education, understanding students' opinions through their
comments is crucial, especially in the Vietnamese language, where resources
remain limited. Existing educational datasets often lack domain relevance and
student slang. To address these gaps, we introduce NEU-ESC, a new Vietnamese
dataset for Educational Sentiment Classification and Topic Classification,
curated from university forums, which offers more samples, richer class
diversity, longer texts, and broader vocabulary. In addition, we explore
multitask learning with encoder-only language models (BERT), showing that it
achieves up to 83.7% and 79.8% accuracy on the sentiment and topic
classification tasks, respectively. We also benchmark our dataset and
model with other datasets and models, including Large Language Models, and
discuss these benchmarks. The dataset is publicly available at:
https://huggingface.co/datasets/hung20gg/NEU-ESC.
☆ Assessing GPTZero's Accuracy in Identifying AI vs. Human-Written Essays
As the use of AI tools by students has become more prevalent, instructors
have started using AI detection tools like GPTZero and QuillBot to detect AI
written text. However, the reliability of these detectors remains uncertain. In
our study, we focused mostly on the success rate of GPTZero, the most-used AI
detector, in identifying AI-generated texts based on different lengths of
randomly submitted essays: short (40-100 word count), medium (100-350 word
count), and long (350-800 word count). We gathered a data set consisting of
twenty-eight AI-generated papers and fifty human-written papers. With this
randomized essay data, papers were individually plugged into GPTZero and
measured for percentage of AI generation and confidence. A vast majority of the
AI-generated papers were detected accurately (with estimated AI-generation
percentages ranging from 91% to 100%), while results for the human-written
essays fluctuated; there were a handful
of false positives. These findings suggest that although GPTZero is effective
at detecting purely AI-generated content, its reliability in distinguishing
human-authored texts is limited. Educators should therefore exercise caution
when relying solely on AI detection tools.
☆ Reinforcement Fine-Tuning Enables MLLMs Learning Novel Tasks Stably
Zhihao Zhang, Qiaole Dong, Qi Zhang, Jun Zhao, Enyu Zhou, Zhiheng Xi, Senjie Jin, Xiaoran Fan, Yuhao Zhou, Yanwei Fu, Tao Ji, Tao Gui, Xuanjing Huang
Post-training algorithms such as Supervised Fine-Tuning (SFT) and
Reinforcement Fine-Tuning (RFT) are widely used to adapt multimodal large
language models to downstream tasks. While these methods are effective for task
adaptation, their impact on prior knowledge remains unclear. In this paper, we introduce jigsaw
puzzles as a novel task absent from existing pretraining corpora and
systematically study the behavior of SFT and RFT on an open-source multimodal
model, Qwen2.5-VL. Our experiments reveal a sharp trade-off: SFT enables rapid
task acquisition but leads to catastrophic forgetting, whereas RFT learns more
slowly on novel tasks but maintains prior knowledge. We analyze this phenomenon
through the lens of learning dynamics, showing that RFT reinforces correct
samples that are naturally aligned with the base model's probability landscape,
mitigating interference with prior knowledge. Moreover, supervised training on
correct RFT-simulated rollouts allows SFT to preserve knowledge while rapidly
learning new tasks. These findings suggest that data distribution, rather than
algorithmic differences, plays a central role in forgetting, and highlight
RFT's potential for stable continual learning in multimodal large language
models.
comment: 18 pages (Preprint. Work in progress)
☆ Thought-Augmented Planning for LLM-Powered Interactive Recommender Agent
Interactive recommendation is a typical information-seeking task that allows
users to interactively express their needs through natural language and obtain
personalized recommendations. Large language model-powered (LLM-powered) agents
have become a new paradigm in interactive recommendations, effectively
capturing users' real-time needs and enhancing personalized experiences.
However, due to limited planning and generalization capabilities, existing
formulations of LLM-powered interactive recommender agents struggle to
effectively address diverse and complex user intents, such as intuitive,
unrefined, or occasionally ambiguous requests. To tackle this challenge, we
propose a novel thought-augmented interactive recommender agent system (TAIRA)
that addresses complex user intents through distilled thought patterns.
Specifically, TAIRA is designed as an LLM-powered multi-agent system featuring
a manager agent that orchestrates recommendation tasks by decomposing user
needs and planning subtasks, with its planning capacity strengthened through
Thought Pattern Distillation (TPD), a thought-augmentation method that extracts
high-level thoughts from the agent's and human experts' experiences. Moreover,
we designed a set of user simulation schemes to generate personalized queries
of different difficulties and evaluate the recommendations based on specific
datasets. Through comprehensive experiments conducted across multiple datasets,
TAIRA exhibits significantly enhanced performance compared to existing methods.
Notably, TAIRA shows a greater advantage on more challenging tasks while
generalizing effectively on novel tasks, further validating its superiority in
managing complex user intents within interactive recommendation systems. The
code is publicly available at: https://github.com/Alcein/TAIRA.
☆ What to Keep and What to Drop: Adaptive Table Filtering Framework
Large language models (LLMs) for table-based reasoning often struggle with
large tables due to input length limits. We propose ATF (Adaptive Table
Filtering Framework), a modular and question-aware filtering pipeline that
prunes uninformative columns and rows using LLM-generated column descriptions,
clustering, and sparse-dense alignment scores. ATF integrates seamlessly with
existing models (e.g., TAPAS, TAPEX) without retraining. Experiments show that
ATF reduces table cells by ~70%, boosting performance on out-of-domain TableQA
tasks while causing slight performance drops on Table Fact Verification, where
full-table context is more critical. These results highlight ATF's ability to
adaptively balance informativeness and minimalism across tasks.
comment: 26 pages, 9 figures
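As a rough illustration of question-aware column pruning in the spirit of the framework described above, the sketch below keeps the table columns whose description embeddings align best with the question embedding under a toy cosine score. ATF's actual scoring, clustering, and row filtering are more involved; the function names and random embeddings here are placeholders.

```python
import numpy as np

def keep_top_columns(
    question_vec: np.ndarray,
    column_vecs: dict,          # column name -> embedding of its (LLM-generated) description
    keep_ratio: float = 0.3,
) -> list:
    """Toy question-aware column filter: rank columns by cosine similarity."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    scores = {name: cos(question_vec, vec) for name, vec in column_vecs.items()}
    k = max(1, int(len(scores) * keep_ratio))
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy usage with random vectors standing in for real description encodings.
rng = np.random.default_rng(0)
cols = {name: rng.normal(size=32)
        for name in ["team", "year", "stadium", "coach", "capacity"]}
print(keep_top_columns(rng.normal(size=32), cols, keep_ratio=0.4))
```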
♻ ☆ Knowing You Don't Know: Learning When to Continue Search in Multi-round RAG through Self-Practicing SIGIR 2025
Retrieval Augmented Generation (RAG) has shown strong capability in enhancing
language models' knowledge and reducing generative AI hallucinations, driving
its widespread use. However, complex tasks requiring multi-round retrieval
remain challenging, and early attempts tend to be overly optimistic without a
good sense of self-skepticism. Current multi-round RAG systems may continue
searching even when enough information has already been retrieved, or they may
provide incorrect answers without having sufficient information or knowledge.
Existing solutions either require large amounts of expensive human-labeled
process supervision data or lead to subpar performance. This paper aims to
address these limitations by introducing a new framework, SIM-RAG, to
explicitly enhance RAG systems' self-awareness and multi-round retrieval
capabilities. To train SIM-RAG, we first let a RAG system self-practice
multi-round retrieval, augmenting existing question-answer pairs with
intermediate inner monologue reasoning steps to generate synthetic training
data. For each pair, the system may explore multiple retrieval paths, which are
labeled as successful if they reach the correct answer and unsuccessful
otherwise. Using this data, we train a lightweight information sufficiency
Critic. At inference time, the Critic evaluates whether the RAG system has
retrieved sufficient information at each round, guiding retrieval decisions and
improving system-level self-awareness through in-context reinforcement
learning. Experiments across multiple prominent RAG benchmarks show that
SIM-RAG is an effective multi-round RAG solution. Furthermore, this framework
is system-efficient, adding a lightweight component to RAG without requiring
modifications to existing LLMs or search engines, and data-efficient,
eliminating the need for costly human-annotated mid-step retrieval process
supervision data.
comment: Proceedings of the 48th International ACM SIGIR 2025
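A minimal sketch of the Critic-gated multi-round retrieval loop described for SIM-RAG, assuming hypothetical retrieve, generate_answer, and critic_is_sufficient callables standing in for the retriever, the base LLM, and the trained sufficiency Critic.

    # Illustrative multi-round RAG loop gated by a sufficiency Critic.
    def multi_round_rag(question, retrieve, generate_answer,
                        critic_is_sufficient, max_rounds=5):
        context = []
        for _ in range(max_rounds):
            # Ask the Critic whether the evidence gathered so far suffices.
            if critic_is_sufficient(question, context):
                break
            # Otherwise issue another retrieval round and extend the context.
            query = question if not context else f"{question} given: {context[-1]}"
            context.extend(retrieve(query, top_k=3))
        return generate_answer(question, context)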
♻ ☆ SEUF: Is Unlearning One Expert Enough for Mixture-of-Experts LLMs? ACL'25
Recent advancements in LLMs unlearning have shown remarkable success in
removing unwanted data-model influences while preserving the model's utility
for legitimate knowledge. Despite these strides, sparse Mixture-of-Experts
(MoE) LLMs--a key subset of the LLM family--have remained unexplored in the
context of unlearning. As MoE LLMs are celebrated for their exceptional
performance, we ask: How can unlearning be performed effectively and efficiently
on MoE LLMs? Our pilot study shows that the dynamic routing nature of MoE LLMs
introduces unique challenges, leading to excessive forgetting, uncontrolled
knowledge erasure and substantial utility drops when existing unlearning
methods are applied. To address this, we propose a novel Selected-Expert
Unlearning Framework (SEUF). Through expert attribution, unlearning is
concentrated on the most actively engaged experts for the specified knowledge.
Concurrently, an anchor loss is applied to the router to stabilize the active
state of this targeted expert, ensuring focused and controlled unlearning. SEUF
is compatible with various standard unlearning algorithms. Extensive
experiments demonstrate that SEUF improves forget quality by up to 5% and
model utility by up to 35% on MoE LLMs across various benchmarks and LLM
architectures (compared to standard unlearning algorithms), while only
unlearning 0.06% of the model parameters.
comment: Accepted to ACL'25
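A hedged sketch of the expert-attribution step described above: count which expert the router engages most often on the forget set, restrict unlearning updates to that expert's parameters, and anchor the router toward it. Function names, the "experts.{id}." parameter naming, and the anchor-loss form are illustrative assumptions, not SEUF's exact formulation.

    from collections import Counter

    def attribute_expert(forget_batches, router_top1):
        # router_top1 is a hypothetical function returning the top-1 expert
        # index the router selects for a token representation.
        counts = Counter()
        for batch in forget_batches:
            for token_repr in batch:
                counts[router_top1(token_repr)] += 1
        return counts.most_common(1)[0][0]      # most actively engaged expert

    def trainable_param_names(model_param_names, expert_id):
        # Only the selected expert's parameters receive unlearning gradients;
        # other experts and shared layers stay frozen.
        tag = f"experts.{expert_id}."
        return [name for name in model_param_names if tag in name]

    def router_anchor_loss(router_probs, expert_id):
        # Illustrative anchor term (expects a torch.Tensor of routing
        # probabilities): keep the targeted expert active on forget tokens.
        return -router_probs[:, expert_id].log().mean()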
♻ ☆ KMI: A Dataset of Korean Motivational Interviewing Dialogues for Psychotherapy NAACL 2025
The increasing demand for mental health services has led to the rise of
AI-driven mental health chatbots, though challenges related to privacy, data
collection, and expertise persist. Motivational Interviewing (MI) is gaining
attention as a theoretical basis for boosting expertise in the development of
these chatbots. However, existing datasets are showing limitations for training
chatbots, leading to a substantial demand for publicly available resources in
the field of MI and psychotherapy. These challenges are even more pronounced in
non-English languages, where they receive less attention. In this paper, we
propose a novel framework that simulates MI sessions enriched with the
expertise of professional therapists. We train an MI forecaster model that
mimics the behavioral choices of professional therapists and employ Large
Language Models (LLMs) to generate utterances through prompt engineering. Then,
we present KMI, the first synthetic dataset theoretically grounded in MI,
containing 1,000 high-quality Korean Motivational Interviewing dialogues.
Through an extensive expert evaluation of the generated dataset and the
dialogue model trained on it, we demonstrate the quality, expertise, and
practicality of KMI. We also introduce novel metrics derived from MI theory in
order to evaluate dialogues from the perspective of MI.
comment: Accepted at NAACL 2025 Main Conference
♻ ☆ Position: Machine Learning Conferences Should Establish a "Refutations and Critiques" Track
Rylan Schaeffer, Joshua Kazdan, Yegor Denisov-Blanch, Brando Miranda, Matthias Gerstgrasser, Susan Zhang, Andreas Haupt, Isha Gupta, Elyas Obbad, Jesse Dodge, Jessica Zosa Forde, Koustuv Sinha, Francesco Orabona, Sanmi Koyejo, David Donoho
Science progresses by iteratively advancing and correcting humanity's
understanding of the world. In machine learning (ML) research, rapid
advancements have led to an explosion of publications, but have also led to
misleading, incorrect, flawed or perhaps even fraudulent studies being accepted
and sometimes highlighted at ML conferences due to the fallibility of peer
review. While such mistakes are understandable, ML conferences do not offer
robust processes to help the field systematically correct when such errors are
made. This position paper argues that ML conferences should establish a
dedicated "Refutations and Critiques" (R&C) Track. This R&C Track would provide
a high-profile, reputable platform to support vital research that critically
challenges prior research, thereby fostering a dynamic self-correcting research
ecosystem. We discuss key considerations including track design, review
principles, potential pitfalls, and provide an illustrative example submission
concerning a recent ICLR 2025 Oral. We conclude that ML conferences should
create official, reputable mechanisms to help ML research self-correct.
♻ ☆ Riddle Me This! Stealthy Membership Inference for Retrieval-Augmented Generation CCS 2025
Retrieval-Augmented Generation (RAG) enables Large Language Models (LLMs) to
generate grounded responses by leveraging external knowledge databases without
altering model parameters. Although the absence of weight tuning prevents
leakage via model parameters, it introduces the risk of inference adversaries
exploiting retrieved documents in the model's context. Existing methods for
membership inference and data extraction often rely on jailbreaking or
carefully crafted unnatural queries, which can be easily detected or thwarted
with query rewriting techniques common in RAG systems. In this work, we present
Interrogation Attack (IA), a membership inference technique targeting documents
in the RAG datastore. By crafting natural-text queries that are answerable only
with the target document's presence, our approach demonstrates successful
inference with just 30 queries while remaining stealthy; straightforward
detectors identify adversarial prompts from existing methods up to ~76x more
frequently than those generated by our attack. We observe a 2x improvement in
TPR@1%FPR over prior inference attacks across diverse RAG configurations, all
while costing less than $0.02 per document inference.
comment: This is the full version (27 pages) of the paper 'Riddle Me This!
Stealthy Membership Inference for Retrieval-Augmented Generation' published
at CCS 2025
♻ ☆ LibVulnWatch: A Deep Assessment Agent System and Leaderboard for Uncovering Hidden Vulnerabilities in Open-Source AI Libraries ACL 2025
Zekun Wu, Seonglae Cho, Umar Mohammed, Cristian Munoz, Kleyton Costa, Xin Guan, Theo King, Ze Wang, Emre Kazim, Adriano Koshiyama
Open-source AI libraries are foundational to modern AI systems, yet they
present significant, underexamined risks spanning security, licensing,
maintenance, supply chain integrity, and regulatory compliance. We introduce
LibVulnWatch, a system that leverages recent advances in large language models
and agentic workflows to perform deep, evidence-based evaluations of these
libraries. Built on a graph-based orchestration of specialized agents, the
framework extracts, verifies, and quantifies risk using information from
repositories, documentation, and vulnerability databases. LibVulnWatch produces
reproducible, governance-aligned scores across five critical domains,
publishing results to a public leaderboard for ongoing ecosystem monitoring.
Applied to 20 widely used libraries, including ML frameworks, LLM inference
engines, and agent orchestration tools, our approach covers up to 88% of
OpenSSF Scorecard checks while surfacing up to 19 additional risks per library,
such as critical RCE vulnerabilities, missing SBOMs, and regulatory gaps. By
integrating advanced language technologies with the practical demands of
software risk assessment, this work demonstrates a scalable, transparent
mechanism for continuous supply chain evaluation and informed library
selection.
comment: ACL 2025 Student Research Workshop and ICML 2025 TAIG Workshop
♻ ☆ TTRL: Test-Time Reinforcement Learning
Yuxin Zuo, Kaiyan Zhang, Li Sheng, Shang Qu, Ganqu Cui, Xuekai Zhu, Haozhan Li, Yuchen Zhang, Xinwei Long, Ermo Hua, Biqing Qi, Youbang Sun, Zhiyuan Ma, Lifan Yuan, Ning Ding, Bowen Zhou
This paper investigates Reinforcement Learning (RL) on data without explicit
labels for reasoning tasks in Large Language Models (LLMs). The core challenge
of the problem is reward estimation during inference while not having access to
ground-truth information. While this setting appears elusive, we find that
common practices in Test-Time Scaling (TTS), such as majority voting, yield
surprisingly effective rewards suitable for driving RL training. In this work,
we introduce Test-Time Reinforcement Learning (TTRL), a novel method for
training LLMs using RL on unlabeled data. TTRL enables self-evolution of LLMs
by utilizing the priors in the pre-trained models. Our experiments demonstrate
that TTRL consistently improves performance across a variety of tasks and
models. Notably, TTRL boosts the pass@1 performance of Qwen-2.5-Math-7B by
approximately 211% on the AIME 2024 with only unlabeled test data. Furthermore,
although TTRL is supervised only by the maj@n metric, it has demonstrated
performance that consistently surpasses the maj@n upper bound of the initial
model and approaches the performance of models trained directly on test data with
ground-truth labels. Our experimental findings validate the general
effectiveness of TTRL across various tasks and highlight TTRL's potential for
broader tasks and domains. GitHub: https://github.com/PRIME-RL/TTRL
♻ ☆ Empirical evidence of Large Language Model's influence on human spoken communication
Hiromu Yakura, Ezequiel Lopez-Lopez, Levin Brinkmann, Ignacio Serna, Prateek Gupta, Ivan Soraperra, Iyad Rahwan
From the invention of writing and the printing press, to television and
social media, human history is punctuated by major innovations in communication
technology, which fundamentally altered how ideas spread and reshaped our
culture. Recent chatbots powered by generative artificial intelligence
constitute a novel medium that encodes cultural patterns in their neural
representations and disseminates them in conversations with hundreds of
millions of people. Understanding whether these patterns transmit into human
language, and ultimately shape human culture, is a fundamental question. While
fully quantifying the causal impact of a chatbot like ChatGPT on human culture
is very challenging, a lexicographic shift in human spoken communication may
offer an early indicator of such a broad phenomenon. Here, we apply econometric
causal inference techniques to 740,249 hours of human discourse from 360,445
YouTube academic talks and 771,591 conversational podcast episodes across
multiple disciplines. We detect a measurable and abrupt increase in the use of
words preferentially generated by ChatGPT, such as delve, comprehend, boast,
swift, and meticulous, after its release. These findings suggest a scenario
where machines, originally trained on human data and subsequently exhibiting
their own cultural traits, can, in turn, measurably reshape human culture. This
marks the beginning of a closed cultural feedback loop in which cultural traits
circulate bidirectionally between humans and machines. Our results motivate
further research into the evolution of human-machine culture, and raise
concerns over the erosion of linguistic and cultural diversity, and the risks
of scalable manipulation.
♻ ☆ GeometryZero: Improving Geometry Solving for LLM with Group Contrastive Policy Optimization
Recent advances in large language models (LLMs) have demonstrated remarkable
capabilities across diverse domains, particularly in mathematical reasoning,
among which geometry problem solving remains a challenging area where auxiliary
construction plays an essential role. Existing approaches either achieve
suboptimal performance or rely on massive LLMs (e.g., GPT-4o), incurring
substantial computational costs. We posit that reinforcement learning with
verifiable reward (e.g., GRPO) offers a promising direction for training
smaller models that effectively combine auxiliary construction with robust
geometric reasoning. However, directly applying GRPO to geometric reasoning
presents fundamental limitations due to its dependence on unconditional
rewards, which leads to indiscriminate and counterproductive auxiliary
constructions. To address these challenges, we propose Group Contrastive Policy
Optimization (GCPO), a novel reinforcement learning framework featuring two key
innovations: (1) Group Contrastive Masking, which adaptively provides positive
or negative reward signals for auxiliary construction based on contextual
utility, and (2) a length reward that promotes longer reasoning chains.
Building on GCPO, we develop GeometryZero, a family of affordable-size
geometric reasoning models that judiciously determine when to employ auxiliary
construction. Our extensive empirical evaluation across popular geometric
benchmarks (Geometry3K, MathVista) demonstrates that GeometryZero models
consistently outperform baselines (e.g. GRPO), achieving an average improvement
of 4.29% across all benchmarks.
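A hedged sketch of the group-contrastive intuition above: within a group of rollouts for the same problem, reward auxiliary construction positively only when the rollouts that use it outperform those that do not, and add a small length bonus. The exact reward shaping in GCPO may differ; the field names and coefficients are illustrative assumptions.

    def gcpo_style_rewards(rollouts, bonus=0.1, length_coef=1e-4):
        # rollouts: list of dicts with keys "correct" (bool),
        # "uses_construction" (bool), and "num_tokens" (int).
        with_aux = [r["correct"] for r in rollouts if r["uses_construction"]]
        without_aux = [r["correct"] for r in rollouts if not r["uses_construction"]]
        mean = lambda xs: sum(xs) / len(xs) if xs else 0.0
        aux_helps = mean(with_aux) > mean(without_aux)

        rewards = []
        for r in rollouts:
            reward = 1.0 if r["correct"] else 0.0
            if r["uses_construction"]:
                reward += bonus if aux_helps else -bonus   # contrastive sign
            reward += length_coef * r["num_tokens"]        # length reward
            rewards.append(reward)
        return rewards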
♻ ☆ Explainable Sentiment Analysis with DeepSeek-R1: Performance, Efficiency, and Few-Shot Learning
Large language models (LLMs) have transformed sentiment analysis, yet
balancing accuracy, efficiency, and explainability remains a critical
challenge. This study presents the first comprehensive evaluation of
DeepSeek-R1--an open-source reasoning model--against OpenAI's GPT-4o and
GPT-4o-mini. We test the full 671B model and its distilled variants,
systematically documenting few-shot learning curves. Our experiments show
DeepSeek-R1 achieves a 91.39% F1 score on 5-class sentiment and 99.31%
accuracy on binary tasks with just 5 shots, an eightfold improvement in
few-shot efficiency over GPT-4o. Architecture-specific distillation effects
emerge, where a 32B Qwen2.5-based model outperforms the 70B Llama-based variant
by 6.69 percentage points. While its reasoning process reduces throughput,
DeepSeek-R1 offers superior explainability via transparent, step-by-step
traces, establishing it as a powerful, interpretable open-source alternative.
comment: 10 pages, 2 figures, 6 tables, revised and re-submitted to an IEEE
journal
♻ ☆ Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph ACL 2025
Roman Vashurin, Ekaterina Fadeeva, Artem Vazhentsev, Lyudmila Rvanova, Akim Tsvigun, Daniil Vasilev, Rui Xing, Abdelrahman Boda Sadallah, Kirill Grishchenkov, Sergey Petrakov, Alexander Panchenko, Timothy Baldwin, Preslav Nakov, Maxim Panov, Artem Shelmanov
The rapid proliferation of large language models (LLMs) has stimulated
researchers to seek effective and efficient approaches to deal with LLM
hallucinations and low-quality outputs. Uncertainty quantification (UQ) is a
key element of machine learning applications in dealing with such challenges.
However, research to date on UQ for LLMs has been fragmented in terms of
techniques and evaluation methodologies. In this work, we address this issue by
introducing a novel benchmark that implements a collection of state-of-the-art
UQ baselines and offers an environment for controllable and consistent
evaluation of novel UQ techniques over various text generation tasks. Our
benchmark also supports the assessment of confidence normalization methods in
terms of their ability to provide interpretable scores. Using our benchmark, we
conduct a large-scale empirical investigation of UQ and normalization
techniques across eleven tasks, identifying the most effective approaches.
Code: https://github.com/IINemo/lm-polygraph Benchmark:
https://huggingface.co/LM-Polygraph
comment: Published at TACL 2025, presented at ACL 2025. Roman Vashurin,
Ekaterina Fadeeva, Artem Vazhentsev contributed equally
♻ ☆ Computational Analysis of Character Development in Holocaust Testimonies
This work presents a computational approach to analyze character development
along the narrative timeline. The analysis characterizes the inner and outer
changes the protagonist undergoes within a narrative, and the interplay between
them. We consider transcripts of Holocaust survivor testimonies as a test case,
each telling the story of an individual in first-person terms. We focus on the
survivor's religious trajectory, examining the evolution of their disposition
toward religious belief and practice along the testimony. Clustering the
resulting trajectories in the dataset, we identify common sequences in the
data. Our findings highlight multiple common structures of religiosity across
the narratives: in terms of belief, most present a constant disposition, while
for practice, most present an oscillating structure, serving as valuable
material for historical and sociological research. This work demonstrates the
potential of natural language processing techniques for analyzing character
evolution through thematic trajectories in narratives.
♻ ☆ CSC-SQL: Corrective Self-Consistency in Text-to-SQL via Reinforcement Learning
Large language models (LLMs) have demonstrated strong capabilities in
translating natural language questions about relational databases into SQL
queries. In particular, test-time scaling techniques such as Self-Consistency
and Self-Correction can enhance SQL generation accuracy by increasing
computational effort during inference. However, these methods have notable
limitations: Self-Consistency may select suboptimal outputs despite majority
votes, while Self-Correction typically addresses only syntactic errors. To
leverage the strengths of both approaches, we propose CSC-SQL, a novel method
that integrates Self-Consistency and Self-Correction. CSC-SQL selects the two
most frequently occurring outputs from parallel sampling and feeds them into a
merge revision model for correction. Additionally, we employ the Group Relative
Policy Optimization (GRPO) algorithm to fine-tune both the SQL generation and
revision models via reinforcement learning, significantly enhancing output
quality. Experimental results confirm the effectiveness and generalizability of
CSC-SQL. On the BIRD private test set, our 7B model achieves 71.72% execution
accuracy, while the 32B model achieves 73.67%. The code has been open sourced
at https://github.com/CycloneBoy/csc_sql.
comment: 25 pages, 5 figures
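A minimal sketch of the selection-plus-correction step described for CSC-SQL: take the two most frequent SQL candidates from parallel samples and pass them to a revision model. Here revise is a hypothetical call to the merge revision model, and grouping by normalized SQL text stands in for whatever grouping (e.g., by execution result) the paper uses.

    from collections import Counter

    def csc_style_select(sql_samples, question, schema, revise):
        counts = Counter(sql.strip().rstrip(";").lower() for sql in sql_samples)
        top_two = [sql for sql, _ in counts.most_common(2)]
        prompt = (
            f"Question: {question}\nSchema: {schema}\n"
            f"Candidate 1: {top_two[0]}\n"
            f"Candidate 2: {top_two[1] if len(top_two) > 1 else top_two[0]}\n"
            "Return the corrected final SQL."
        )
        return revise(prompt)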
♻ ☆ Sparsing Law: Towards Large Language Models with Greater Activation Sparsity
Yuqi Luo, Chenyang Song, Xu Han, Yingfa Chen, Chaojun Xiao, Xiaojun Meng, Liqun Deng, Jiansheng Wei, Zhiyuan Liu, Maosong Sun
Activation sparsity denotes the existence of substantial weakly-contributed
elements within activation outputs that can be eliminated, benefiting many
important applications concerned with large language models (LLMs). Although
promoting greater activation sparsity within LLMs deserves deep studies,
existing works lack comprehensive and quantitative research on the correlation
between activation sparsity and potentially influential factors. In this paper,
we present a comprehensive study on the quantitative scaling properties and
influential factors of the activation sparsity within decoder-only
Transformer-based LLMs. Specifically, we propose PPL-$p\%$ sparsity, a precise
and performance-aware activation sparsity metric that is applicable to any
activation function. Through extensive experiments, we find several important
phenomena. Firstly, different activation functions exhibit comparable
performance but opposite training-time sparsity trends. The activation ratio
(i.e., $1-\mathrm{sparsity\ ratio}$) evolves as a convergent increasing
power-law and decreasing logspace power-law with the amount of training data
for SiLU-activated and ReLU-activated LLMs, respectively. These demonstrate
that ReLU is more efficient as the activation function than SiLU and can
leverage more training data to improve activation sparsity. Secondly, the
activation ratio linearly increases with the width-depth ratio below a certain
bottleneck point, indicating the potential advantage of a deeper architecture
at a fixed parameter scale. Finally, at similar width-depth ratios, we
surprisingly find that the limit value of activation sparsity varies weakly
with the parameter scale, i.e., the activation patterns within LLMs are
insensitive to the parameter scale. These empirical laws towards LLMs with
greater activation sparsity have important implications for making LLMs more
efficient and interpretable.
comment: 23 pages, 13 figures, 6 tables
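One plausible reading of the PPL-p% metric named above, offered only as a sketch: the largest activation sparsity whose perplexity stays within p percent of the dense model's perplexity, found by binary search under a monotonicity assumption. eval_ppl_at_sparsity is a hypothetical evaluator that zeroes the weakest activations at a given sparsity ratio; the paper's exact definition may differ.

    def ppl_p_sparsity(eval_ppl_at_sparsity, dense_ppl, p=1.0,
                       lo=0.0, hi=1.0, tol=1e-3):
        budget = dense_ppl * (1.0 + p / 100.0)
        # Binary search, assuming perplexity rises monotonically with sparsity.
        while hi - lo > tol:
            mid = (lo + hi) / 2.0
            if eval_ppl_at_sparsity(mid) <= budget:
                lo = mid     # still within the allowed perplexity budget
            else:
                hi = mid
        return lo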
♻ ☆ Robust LLM Unlearning with MUDMAN: Meta-Unlearning with Disruption Masking And Normalization
Language models can retain dangerous knowledge and skills even after
extensive safety fine-tuning, posing both misuse and misalignment risks. Recent
studies show that even specialized unlearning methods can be easily reversed.
To address this, we systematically evaluate many existing and novel components
of unlearning methods and identify ones crucial for irreversible unlearning.
We introduce Disruption Masking, a technique that permits weight updates only
where the unlearning gradient and the retaining gradient share the same sign,
ensuring that all updates are non-disruptive.
Additionally, we identify the need for normalizing the unlearning gradients,
and also confirm the usefulness of meta-learning. We combine these insights
into MUDMAN (Meta-Unlearning with Disruption Masking and Normalization) and
validate its effectiveness at preventing the recovery of dangerous
capabilities. MUDMAN outperforms the prior TAR method by 40%, setting a new
state-of-the-art for robust unlearning.
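Disruption Masking in a few lines, as a minimal sketch: apply the unlearning step only on coordinates where the unlearning and retaining gradients share the same sign. The sign convention of the applied step is illustrative, and the full MUDMAN recipe (gradient normalization, meta-learning) is omitted.

    import torch

    def disruption_masked_update(param, unlearn_grad, retain_grad, lr=1e-4):
        # Keep only coordinates where the two gradients agree in sign, so the
        # unlearning step does not fight the retaining objective.
        mask = (torch.sign(unlearn_grad) == torch.sign(retain_grad)).float()
        param.data.add_(lr * unlearn_grad * mask)   # step direction is illustrative
        return param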
♻ ☆ Evaluating K-Fold Cross Validation for Transformer Based Symbolic Regression Models
Symbolic Regression remains an NP-Hard problem, with extensive research
focusing on AI models for this task. Transformer models have shown promise in
Symbolic Regression, but performance suffers with smaller datasets. We propose
applying k-fold cross-validation to a transformer-based symbolic regression
model trained on a significantly reduced dataset (15,000 data points, down from
500,000). This technique partitions the training data into multiple subsets
(folds), iteratively training on some while validating on others. Our aim is to
provide an estimate of model generalization and mitigate overfitting issues
associated with smaller datasets. Results show that this process improves the
model's output consistency and generalization, yielding a 53.31% relative
improvement in validation loss and potentially enabling more efficient and
accessible symbolic regression in resource-constrained environments.
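A minimal k-fold cross-validation loop of the kind described above, using scikit-learn's KFold for splitting; train_model and validation_loss are hypothetical hooks around the transformer-based symbolic regression model.

    import numpy as np
    from sklearn.model_selection import KFold

    def kfold_validation_losses(dataset, train_model, validation_loss,
                                k=5, seed=0):
        dataset = np.asarray(dataset, dtype=object)
        losses = []
        for train_idx, val_idx in KFold(n_splits=k, shuffle=True,
                                        random_state=seed).split(dataset):
            model = train_model(dataset[train_idx])          # fit on k-1 folds
            losses.append(validation_loss(model, dataset[val_idx]))
        return float(np.mean(losses)), losses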
♻ ☆ KAG-Thinker: Interactive Thinking and Deep Reasoning in LLMs via Knowledge-Augmented Generation
Dalong Zhang, Jun Xu, Jun Zhou, Lei Liang, Lin Yuan, Ling Zhong, Mengshu Sun, Peilong Zhao, QiWei Wang, Xiaorui Wang, Xinkai Du, YangYang Hou, Yu Ao, ZhaoYang Wang, Zhengke Gui, ZhiYing Yi, Zhongpu Bo, Haofen Wang, Huajun Chen
In this paper, we introduce KAG-Thinker, which upgrades KAG to a multi-turn
interactive thinking and deep reasoning framework powered by a dedicated
parameter-light large language model (LLM). Our approach constructs a
structured thinking process for solving complex problems, enhancing the
logical coherence and contextual consistency of the reasoning process in
question-answering (Q&A) tasks on domain-specific knowledge bases (KBs) within
LLMs. Following the \textbf{Logical Form} guided retrieval and reasoning
technology route of KAG, this framework first decomposes complex questions into
independently solvable sub-problems (which are also referred to as logical
forms) through \textbf{breadth decomposition}. Each such logical form is
represented in two equivalent forms-natural language and logical function-and
subsequently classified as either a Knowledge Retrieval or Reasoning Analysis
task. Dependencies and parameter passing between these tasks are explicitly
modeled via logical function interfaces. In the solving process, the Retrieval
function retrieves one-hop structured and unstructured information about a
specified knowledge unit, while the Math and Deduce functions perform
reasoning analysis tasks. Secondly, it is worth
noting that, in the Knowledge Retrieval sub-problem tasks, LLMs and external
knowledge sources are regarded as equivalent KBs. We use the \textbf{knowledge
boundary} module to determine the optimal source using self-regulatory
mechanisms such as confidence calibration and reflective reasoning, and use the
\textbf{depth solving} module to enhance the comprehensiveness of knowledge
acquisition...
♻ ☆ FedEx-LoRA: Exact Aggregation for Federated and Efficient Fine-Tuning of Foundation Models ACL 2025
Low-Rank Adaptation (LoRA) is a popular technique for efficient fine-tuning
of foundation models. However, applying LoRA in federated learning
environments, where data is distributed across multiple clients, presents
unique challenges. Existing methods rely on traditional federated averaging of
LoRA adapters, resulting in inexact updates. To address this, we propose
Federated Exact LoRA, or FedEx-LoRA, which adds a residual error term to the
pretrained frozen weight matrix. Our approach achieves exact updates with
minimal computational and communication overhead, preserving LoRA's efficiency.
We evaluate the method on various models across arithmetic reasoning,
commonsense reasoning, natural language understanding and natural language
generation tasks, showing consistent performance gains over state-of-the-art
methods across multiple settings. Through extensive analysis, we quantify that
the deviations in updates from the ideal solution are significant, highlighting
the need for exact aggregation. Our method's simplicity, efficiency, and broad
applicability position it as a promising solution for accurate and effective
federated fine-tuning of foundation models. Our code is publicly available at
https://github.com/RaghavSinghal10/fedex-lora.
comment: ACL 2025 - Oral. Raghav Singhal and Kaustubh Ponkshe contributed
equally to this work
♻ ☆ From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data
Audio-aware large language models (ALLMs) have recently made great strides in
understanding and processing audio inputs. These models are typically adapted
from text-based large language models (LLMs) through additional training on
audio-related tasks. However, this adaptation process presents two major
limitations. First, ALLMs often suffer from catastrophic forgetting, where
crucial textual capabilities like instruction-following are lost after training
on audio data. In some cases, models may even hallucinate sounds that are not
present in the input audio, raising concerns about reliability. Second,
achieving cross-modal alignment between audio and language typically relies on
large collections of task-specific question-answer pairs for instruction
tuning, making it resource-intensive. To address these issues, previous works
have leveraged the backbone LLMs to synthesize general-purpose, caption-style
alignment data. In this paper, we propose a data generation framework that
produces contrastive-like training data, designed to enhance ALLMs' ability to
differentiate between present and absent sounds. We further extend our approach
to multi-audio scenarios, enabling the model to either explain differences
between audio inputs or produce unified captions that describe all inputs,
thereby enhancing audio-language alignment. We refer to the entire ALLM
training framework as bootstrapping audio-language alignment via synthetic data
generation from backbone LLMs (BALSa). Experimental results indicate that our
method effectively mitigates audio hallucinations while reliably maintaining
strong performance on audio understanding and reasoning benchmarks, as well as
instruction-following skills. Moreover, incorporating multi-audio training
further enhances the model's comprehension and reasoning capabilities. Overall,
BALSa offers an efficient and scalable approach to developing ALLMs.
comment: Submitted to IEEE/ACM Transactions on Audio, Speech, and Language
Processing. Project Website: https://kuan2jiu99.github.io/Balsa
♻ ☆ FlexRAG: A Flexible and Comprehensive Framework for Retrieval-Augmented Generation ACL 2025
Retrieval-Augmented Generation (RAG) plays a pivotal role in modern large
language model applications, with numerous existing frameworks offering a wide
range of functionalities to facilitate the development of RAG systems. However,
we have identified several persistent challenges in these frameworks, including
difficulties in algorithm reproduction and sharing, lack of new techniques, and
high system overhead. To address these limitations, we introduce
\textbf{FlexRAG}, an open-source framework specifically designed for research
and prototyping. FlexRAG supports text-based, multimodal, and network-based
RAG, providing comprehensive lifecycle support alongside efficient asynchronous
processing and persistent caching capabilities. By offering a robust and
flexible solution, FlexRAG enables researchers to rapidly develop, deploy, and
share advanced RAG systems. Our toolkit and resources are available at
\href{https://github.com/ictnlp/FlexRAG}{https://github.com/ictnlp/FlexRAG}.
comment: Accepted by ACL 2025 Demo
♻ ☆ A Comprehensive Evaluation of Semantic Relation Knowledge of Pretrained Language Models and Humans
Recently, much work has concerned itself with the enigma of what exactly PLMs
(pretrained language models) learn about different aspects of language, and how
they learn it. One stream of this type of research investigates the knowledge
that PLMs have about semantic relations. However, many aspects of semantic
relations were left unexplored. Only one relation was considered, namely
hypernymy. Furthermore, previous work did not measure humans' performance on
the same task as that solved by the PLMs. This means that at this point in
time, there is only an incomplete view of models' semantic relation knowledge.
To address this gap, we introduce a comprehensive evaluation framework covering
five relations beyond hypernymy, namely hyponymy, holonymy, meronymy, antonymy,
and synonymy. We use six metrics (two newly introduced here) for recently
untreated aspects of semantic relation knowledge, namely soundness,
completeness, symmetry, asymmetry, prototypicality, and distinguishability, and
fairly compare humans and models on the same task. Our extensive experiments
involve 16 PLMs, eight masked and eight causal language models. Up to now, only
masked language models had been tested, although causal and masked language
models treat context differently. Our results reveal a significant knowledge
gap between humans and models for almost all semantic relations. Antonymy is
the outlier relation where all models perform reasonably well. In general,
masked language models perform significantly better than causal language
models. Nonetheless, both masked and causal language models are likely to
confuse non-antonymy relations with antonymy.
comment: This manuscript is currently under review at Language Resources and
Evaluation
♻ ☆ Gumiho: A Hybrid Architecture to Prioritize Early Tokens in Speculative Decoding ICML 2025
Speculative decoding (SPD) aims to accelerate the auto-regressive token
generation process of a target Large Language Model (LLM). Some approaches
employ a draft model with multiple heads to predict a sequence of future
tokens, where each head handles a token in the sequence. The target LLM
verifies the predicted sequence and accepts aligned tokens, enabling efficient
multi-token generation. However, existing methods assume that all tokens within
a sequence are equally important, employing identical head structures and
relying on a single-generation paradigm, either serial or parallel. To this
end, we theoretically demonstrate that initial tokens in the draft sequence are
more important than later ones. Building on this insight, we propose Gumiho, a
hybrid model combining serial and parallel heads. Specifically, given the
critical importance of early tokens, we employ a sophisticated Transformer
architecture for the early draft heads in a serial configuration to improve
accuracy. For later tokens, we utilize multiple lightweight MLP heads operating
in parallel to enhance efficiency. By allocating more advanced model structures
and longer running times to the early heads, Gumiho achieves improved overall
performance. The experimental results demonstrate that our method outperforms
existing approaches, fully validating its effectiveness.
comment: Accepted to the 42nd International Conference on Machine Learning
(ICML 2025). Code: https://github.com/AMD-AIG-AIMA/Gumiho
♻ ☆ LLM Braces: Straightening Out LLM Predictions with Relevant Sub-Updates ACL 2025
Recent findings reveal that much of the knowledge in a Transformer-based
Large Language Model (LLM) is encoded in its feed-forward (FFN) layers, where
each FFN layer can be interpreted as the summation of sub-updates, each
corresponding to a weighted column vector from the FFN's value parameter matrix
that often encodes human-interpretable concepts. In light of this, we
hypothesize that model performance and behaviors can be further enhanced and
controlled by modulating the contributions of these sub-updates based on their
relevance to the input or target output style, and propose LLMBRACES, a novel
and efficient method that computes relevance scores associated with value
vectors in FFN layers and leverages these scores to dynamically adjust the
contribution of sub-updates. By optimizing sub-update contributions, LLMBRACES
refines the prediction process, leading to more accurate and reliable outputs,
much like a 'brace' providing support and stability. Moreover, LLMBRACES can be
extended to support conditional control over generation characteristics, such
as sentiment, thereby offering fine-grained steering of LLM outputs. Extensive
experiments on various LLMs-including Qwen2.5-1.5B, Llama2-7B, and
Llama3-8B-demonstrate that LLMBRACES outperforms baseline approaches in both
fine-tuning and zero-shot settings while requiring significantly fewer tunable
parameters, up to 75% fewer compared to LoRA. Furthermore, LLMBRACES excels in
sentiment-controlled generation and toxicity reduction, highlighting its
potential for flexible, controlled text generation across applications.
comment: ACL 2025, 16 pages, 2 figures
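A toy illustration of the sub-update view above: the FFN output is a sum of activation-weighted value vectors (columns of the second FFN matrix), so scaling each sub-update by a per-vector relevance score changes its contribution. The relevance scores here are random placeholders for the learned scores in LLMBRACES.

    import torch

    def modulated_ffn_output(activations, value_matrix, relevance):
        # activations:  (d_ff,)       hidden activations after the FFN's first layer
        # value_matrix: (d_model, d_ff)  each column is one sub-update vector
        # relevance:    (d_ff,)       per-sub-update relevance scores in [0, 1]
        scaled = activations * relevance          # modulate each sub-update
        return value_matrix @ scaled              # sum of scaled sub-updates

    # Example with random tensors:
    d_model, d_ff = 8, 32
    out = modulated_ffn_output(torch.randn(d_ff), torch.randn(d_model, d_ff),
                               torch.rand(d_ff))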
♻ ☆ FinEval-KR: A Financial Domain Evaluation Framework for Large Language Models' Knowledge and Reasoning
Shaoyu Dou, Yutian Shen, Mofan Chen, Zixuan Wang, Jiajie Xu, Qi Guo, Kailai Shao, Chao Chen, Haixiang Hu, Haibo Shi, Min Min, Liwen Zhang
Large Language Models (LLMs) demonstrate significant potential but face
challenges in complex financial reasoning tasks requiring both domain knowledge
and sophisticated reasoning. Current evaluation benchmarks often fall short by
not decoupling these capability indicators from single-task performance and by
lacking root-cause analysis of task failures. To address this, we introduce
FinEval-KR, a novel evaluation framework for decoupling and quantifying LLMs'
knowledge and reasoning abilities independently, proposing distinct knowledge
score and reasoning score metrics. Inspired by cognitive science, we further
propose a cognitive score based on Bloom's taxonomy to analyze capabilities in
reasoning tasks across different cognitive levels. We also release a new
open-source Chinese financial reasoning dataset covering 22 subfields to
support reproducible research and further advancements in financial reasoning.
Our experimental results reveal that LLM reasoning ability and higher-order
cognitive ability are the core factors influencing reasoning accuracy. We also
specifically find that even top models still face a bottleneck with knowledge
application. Furthermore, our analysis shows that specialized financial LLMs
generally lag behind the top general large models across multiple metrics.
comment: The statistics included in the paper are incomplete (e.g., Tables 2
and 5 report only the results of a single run), which may lead readers to
misunderstand
♻ ☆ CTISum: A New Benchmark Dataset For Cyber Threat Intelligence Summarization
Cyber Threat Intelligence (CTI) summarization involves generating concise and
accurate highlights from web intelligence data, which is critical for providing
decision-makers with actionable insights to swiftly detect and respond to cyber
threats in the cybersecurity domain. Despite that, the development of efficient
techniques for summarizing CTI reports, comprising facts, analytical insights,
attack processes, and more, has been hindered by the lack of suitable datasets.
To address this gap, we introduce CTISum, a new benchmark dataset designed for
the CTI summarization task. Recognizing the significance of understanding
attack processes, we also propose a novel fine-grained subtask: attack process
summarization, which aims to help defenders assess risks, identify security
gaps, and uncover vulnerabilities. Specifically, a multi-stage annotation
pipeline is designed to collect and annotate CTI data from diverse web sources,
alongside a comprehensive benchmarking of CTISum using extractive, abstractive,
and LLM-based summarization methods. Experimental results reveal
that current state-of-the-art models face significant challenges when applied
to CTISum, highlighting that automatic summarization of CTI reports remains an
open research problem. The code and example dataset can be made publicly
available at https://github.com/pengwei-iie/CTISum.
♻ ☆ Parenting: Optimizing Knowledge Selection of Retrieval-Augmented Language Models with Parameter Decoupling and Tailored Tuning ACL 2025
Yongxin Xu, Ruizhe Zhang, Xinke Jiang, Yujie Feng, Yuzhen Xiao, Xinyu Ma, Runchuan Zhu, Xu Chu, Junfeng Zhao, Yasha Wang
Retrieval-Augmented Generation (RAG) offers an effective solution to the
issues faced by Large Language Models (LLMs) in hallucination generation and
knowledge obsolescence by incorporating externally retrieved knowledge.
However, existing methods lack effective control mechanisms for integrating
internal and external knowledge. Inspired by human cognitive processes, we
propose Parenting, a novel framework that decouples, identifies, and
purposefully optimizes parameter subspaces related to adherence and robustness.
Specifically, Parenting utilizes a key parameter mining method that combines
forward and backward propagation signals to localize subspaces representing
different capabilities. Then, Parenting employs a type-tailored tuning
strategy, applying specific and appropriate optimizations to different
subspaces, aiming to achieve a balanced enhancement of both adherence and
robustness. Extensive experiments on various datasets and models validate the
effectiveness and generalizability of our method.
comment: Accepted to ACL 2025 Main Conference
♻ ☆ Bridge: A Unified Framework to Knowledge Graph Completion via Language Models and Knowledge Representation
Knowledge graph completion (KGC) is a task of inferring missing triples based
on existing Knowledge Graphs (KGs). Both structural and semantic information
are vital for successful KGC. However, existing methods only use either the
structural knowledge from the KG embeddings or the semantic information from
pre-trained language models (PLMs), leading to suboptimal model performance.
Moreover, since PLMs are not trained on KGs, directly using PLMs to encode
triples may be inappropriate. To overcome these limitations, we propose a novel
framework called Bridge, which jointly encodes structural and semantic
information of KGs. Specifically, we strategically encode entities and
relations separately by PLMs to better utilize the semantic knowledge of PLMs
and enable structured representation learning via a structural learning
principle. Furthermore, to bridge the gap between KGs and PLMs, we employ a
self-supervised representation learning method called BYOL to fine-tune PLMs
with two different views of a triple. Unlike BYOL, which uses augmentation
methods to create two semantically similar views of the same image, potentially
altering the semantic information, we strategically separate the triple into
two parts to create different views, thus avoiding semantic alteration.
Experiments demonstrate that Bridge outperforms the SOTA models on three
benchmark datasets.
♻ ☆ Mechanistic Interpretability of Emotion Inference in Large Language Models ACL 2025
Large language models (LLMs) show promising capabilities in predicting human
emotions from text. However, the mechanisms through which these models process
emotional stimuli remain largely unexplored. Our study addresses this gap by
investigating how autoregressive LLMs infer emotions, showing that emotion
representations are functionally localized to specific regions in the model.
Our evaluation includes diverse model families and sizes and is supported by
robustness checks. We then show that the identified representations are
psychologically plausible by drawing on cognitive appraisal theory, a
well-established psychological framework positing that emotions emerge from
evaluations (appraisals) of environmental stimuli. By causally intervening on
construed appraisal concepts, we steer the generation and show that the outputs
align with theoretical and intuitive expectations. This work highlights a novel
way to causally intervene and precisely shape emotional text generation,
potentially benefiting safety and alignment in sensitive affective domains.
comment: ACL 2025 camera-ready version. First two authors contributed equally