Robotics 51
☆ HCRMP: A LLM-Hinted Contextual Reinforcement Learning Framework for Autonomous Driving
Integrating Large Language Models (LLMs) with Reinforcement Learning (RL) can
enhance autonomous driving (AD) performance in complex scenarios. However,
current LLM-Dominated RL methods over-rely on LLM outputs, which are prone to
hallucinations. Evaluations show that a state-of-the-art LLM achieves a
non-hallucination rate of only about 57.95% on essential driving-related
tasks. Thus, in these methods, hallucinations from the LLM can
directly jeopardize the performance of driving policies. This paper argues that
maintaining relative independence between the LLM and the RL is vital for
solving the hallucination problem. Consequently, this paper proposes a novel
LLM-Hinted RL paradigm: the LLM generates semantic hints for state augmentation
and policy optimization to assist the RL agent in motion planning, while the RL
agent counteracts potentially erroneous semantic indications through policy
learning to achieve excellent driving performance. Based on this paradigm, we
propose the HCRMP (LLM-Hinted Contextual Reinforcement Learning Motion Planner)
architecture, which comprises three modules: an Augmented Semantic
Representation Module that extends the state space; a Contextual Stability
Anchor Module that improves the reliability of multi-critic weight hints using
information from a knowledge base; and a Semantic Cache Module that seamlessly
integrates low-frequency LLM guidance with high-frequency RL control. Extensive
experiments in CARLA validate HCRMP's strong
overall driving performance. HCRMP achieves a task success rate of up to 80.3%
under diverse driving conditions with different traffic densities. Under
safety-critical driving conditions, HCRMP significantly reduces the collision
rate by 11.4%, which effectively improves the driving performance in complex
scenarios.
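To make the hint/policy separation concrete, the following minimal Python
sketch shows one way an LLM hint could augment an RL observation without
dominating it; the state and hint dimensions and the soft-score encoding are
illustrative assumptions, not HCRMP's actual interface.

    import numpy as np

    def augment_state(raw_state: np.ndarray, hint_embedding: np.ndarray) -> np.ndarray:
        # The agent observes both the raw state and the (possibly hallucinated)
        # LLM hint; policy learning can then discount unreliable hints instead
        # of acting on them directly.
        return np.concatenate([raw_state, hint_embedding])

    state = np.random.randn(10)            # hypothetical ego/traffic state
    hint = np.array([0.8, 0.1, 0.0, 0.1])  # hypothetical soft maneuver-hint scores
    obs = augment_state(state, hint)
    assert obs.shape == (14,)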
☆ Improving planning and MBRL with temporally-extended actions
Continuous time systems are often modeled using discrete time dynamics but
this requires a small simulation step to maintain accuracy. In turn, this
requires a large planning horizon which leads to computationally demanding
planning problems and reduced performance. Previous work in model-free
reinforcement learning has partially addressed this issue using action repeats,
where a policy is learned to determine a discrete action duration. Instead, we
propose to control the continuous decision timescale directly by using
temporally-extended actions and letting the planner treat the duration of the
action as an additional optimization variable along with the standard action
variables. This additional structure has multiple advantages. It speeds up
simulation time of trajectories and, importantly, it allows for deep horizon
search in terms of primitive actions while using a shallow search depth in the
planner. In addition, in the model based reinforcement learning (MBRL) setting,
it reduces compounding errors from model learning and improves training time
for models. We show that this idea is effective and that the range for action
durations can be automatically selected using a multi-armed bandit formulation
and integrated into the MBRL framework. An extensive experimental evaluation
both in planning and in MBRL, shows that our approach yields faster planning,
better solutions, and that it enables solutions to problems that are not solved
in the standard formulation.
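As a rough illustration of treating duration as an optimization variable, here
is a minimal cross-entropy-method planner over (action, duration) pairs; the
toy dynamics, duration bounds, and CEM hyperparameters are assumptions, not
the paper's setup.

    import numpy as np

    def rollout(dynamics, reward, s, plan):
        # Integrate each action held for its chosen duration; a short plan in
        # (action, duration) space covers a deep horizon of primitive steps.
        total = 0.0
        for a, d in plan:
            s = dynamics(s, a, d)
            total += reward(s) * d
        return total

    def cem_plan(dynamics, reward, s0, horizon=5, iters=20, pop=256, n_elite=25):
        mu, sigma = np.zeros((horizon, 2)), np.ones((horizon, 2))
        for _ in range(iters):
            cand = mu + sigma * np.random.randn(pop, horizon, 2)
            cand[..., 1] = np.clip(cand[..., 1], 0.05, 1.0)  # duration bounds (s)
            scores = np.array([rollout(dynamics, reward, s0, c) for c in cand])
            elite = cand[np.argsort(scores)[-n_elite:]]
            mu, sigma = elite.mean(0), elite.std(0) + 1e-6
        return mu[0]  # first (action, duration) to execute

    dyn = lambda s, a, d: s + d * np.array([np.cos(a), np.sin(a)])  # toy unicycle
    rew = lambda s: -float(np.linalg.norm(s - np.array([1.0, 1.0])))
    print(cem_plan(dyn, rew, np.zeros(2)))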
☆ UAV-Flow Colosseo: A Real-World Benchmark for Flying-on-a-Word UAV Imitation Learning
Unmanned Aerial Vehicles (UAVs) are evolving into language-interactive
platforms, enabling more intuitive forms of human-drone interaction. While
prior works have primarily focused on high-level planning and long-horizon
navigation, we shift attention to language-guided fine-grained trajectory
control, where UAVs execute short-range, reactive flight behaviors in response
to language instructions. We formalize this problem as the Flying-on-a-Word
(Flow) task and introduce UAV imitation learning as an effective approach. In
this framework, UAVs learn fine-grained control policies by mimicking expert
pilot trajectories paired with atomic language instructions. To support this
paradigm, we present UAV-Flow, the first real-world benchmark for
language-conditioned, fine-grained UAV control. It includes a task formulation,
a large-scale dataset collected in diverse environments, a deployable control
framework, and a simulation suite for systematic evaluation. Our design enables
UAVs to closely imitate the precise, expert-level flight trajectories of human
pilots and supports direct deployment without a sim-to-real gap. We conduct
extensive experiments on UAV-Flow, benchmarking VLN and VLA paradigms. Results
show that VLA models are superior to VLN baselines and highlight the critical
role of spatial grounding in the fine-grained Flow setting.
☆ From Grounding to Manipulation: Case Studies of Foundation Model Integration in Embodied Robotic Systems
Foundation models (FMs) are increasingly used to bridge language and action
in embodied agents, yet the operational characteristics of different FM
integration strategies remain under-explored -- particularly for complex
instruction following and versatile action generation in changing environments.
This paper examines three paradigms for building robotic systems: end-to-end
vision-language-action (VLA) models that implicitly integrate perception and
planning, and modular pipelines incorporating either vision-language models
(VLMs) or multimodal large language models (LLMs). We evaluate these paradigms
through two focused case studies: a complex instruction grounding task
assessing fine-grained instruction understanding and cross-modal
disambiguation, and an object manipulation task targeting skill transfer via
VLA finetuning. Our experiments in zero-shot and few-shot settings reveal
trade-offs in generalization and data efficiency. By exploring performance
limits, we distill design implications for developing language-driven physical
agents and outline emerging challenges and opportunities for FM-powered
robotics in real-world conditions.
comment: 17 pages, 13 figures
☆ SwarmDiff: Swarm Robotic Trajectory Planning in Cluttered Environments via Diffusion Transformer
Swarm robotic trajectory planning faces challenges in computational
efficiency, scalability, and safety, particularly in complex, obstacle-dense
environments. To address these issues, we propose SwarmDiff, a hierarchical and
scalable generative framework for swarm robots. We model the swarm's
macroscopic state using Probability Density Functions (PDFs) and leverage
conditional diffusion models to generate risk-aware macroscopic trajectory
distributions, which then guide the generation of individual robot trajectories
at the microscopic level. To ensure a balance between the swarm's optimal
transportation and risk awareness, we integrate Wasserstein metrics and
Conditional Value at Risk (CVaR). Additionally, we introduce a Diffusion
Transformer (DiT) to improve sampling efficiency and generation quality by
capturing long-range dependencies. Extensive simulations and real-world
experiments demonstrate that SwarmDiff outperforms existing methods in
computational efficiency, trajectory validity, and scalability, making it a
reliable solution for swarm robotic trajectory planning.
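For readers unfamiliar with the risk term, the snippet below computes CVaR
over sampled collision costs, the quantity a risk-aware trajectory generator
would penalize; the exponential toy costs and the alpha level are illustrative,
not SwarmDiff's actual pipeline.

    import numpy as np

    def cvar(samples: np.ndarray, alpha: float = 0.95) -> float:
        # Conditional Value at Risk: the mean of the worst (1 - alpha) tail,
        # so rare-but-severe collision costs dominate the score.
        var = np.quantile(samples, alpha)  # Value at Risk threshold
        return float(samples[samples >= var].mean())

    risks = np.random.exponential(scale=0.1, size=10_000)  # toy collision costs
    print(f"VaR_0.95 = {np.quantile(risks, 0.95):.3f}, CVaR_0.95 = {cvar(risks):.3f}")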
☆ Exploring the Limits of Vision-Language-Action Manipulations in Cross-task Generalization
Jiaming Zhou, Ke Ye, Jiayi Liu, Teli Ma, Zifang Wang, Ronghe Qiu, Kun-Yu Lin, Zhilin Zhao, Junwei Liang
The generalization capabilities of vision-language-action (VLA) models to
unseen tasks are crucial to achieving general-purpose robotic manipulation in
open-world settings. However, the cross-task generalization capabilities of
existing VLA models remain significantly underexplored. To address this gap, we
introduce AGNOSTOS, a novel simulation benchmark designed to rigorously
evaluate cross-task zero-shot generalization in manipulation. AGNOSTOS
comprises 23 unseen manipulation tasks for testing, distinct from common
training task distributions, and incorporates two levels of generalization
difficulty to assess robustness. Our systematic evaluation reveals that current
VLA models, despite being trained on diverse datasets, struggle to generalize
effectively to these unseen tasks. To overcome this limitation, we propose
Cross-Task In-Context Manipulation (X-ICM), a method that conditions large
language models (LLMs) on in-context demonstrations from seen tasks to predict
action sequences for unseen tasks. Additionally, we introduce a dynamics-guided
sample selection strategy that identifies relevant demonstrations by capturing
cross-task dynamics. On AGNOSTOS, X-ICM significantly improves cross-task
zero-shot generalization performance over leading VLAs. We believe AGNOSTOS and
X-ICM will serve as valuable tools for advancing general-purpose robotic
manipulation.
comment: Project Page: https://jiaming-zhou.github.io/AGNOSTOS
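A minimal sketch of the dynamics-guided selection idea: score seen-task
demonstrations by cosine similarity of dynamics embeddings and pack the top
matches into an in-context prompt. The function names, embedding source, and
prompt format are hypothetical, not X-ICM's published interface.

    import numpy as np

    def select_demos(query_dyn, demo_dyns, demos, k=3):
        # Rank seen-task demos by cosine similarity between their dynamics
        # embeddings and the unseen task's embedding; keep the top k.
        q = query_dyn / np.linalg.norm(query_dyn)
        d = demo_dyns / np.linalg.norm(demo_dyns, axis=1, keepdims=True)
        top = np.argsort(d @ q)[-k:][::-1]
        return [demos[i] for i in top]

    def build_prompt(instruction, selected):
        ctx = "\n\n".join(f"Task: {t}\nActions: {a}" for t, a in selected)
        return f"{ctx}\n\nTask: {instruction}\nActions:"

    demos = [("stack the red block", "[grasp, lift, place]"),
             ("open the drawer", "[grasp, pull]"),
             ("push the cup left", "[approach, push]")]
    embeds = np.random.randn(3, 32)
    print(build_prompt("push the bowl left",
                       select_demos(np.random.randn(32), embeds, demos, k=2)))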
☆ FLARE: Robot Learning with Implicit World Modeling
Ruijie Zheng, Jing Wang, Scott Reed, Johan Bjorck, Yu Fang, Fengyuan Hu, Joel Jang, Kaushil Kundalia, Zongyu Lin, Loic Magne, Avnish Narayan, You Liang Tan, Guanzhi Wang, Qi Wang, Jiannan Xiang, Yinzhen Xu, Seonghyeon Ye, Jan Kautz, Furong Huang, Yuke Zhu, Linxi Fan
We introduce $\textbf{F}$uture $\textbf{LA}$tent $\textbf{RE}$presentation
Alignment ($\textbf{FLARE}$), a novel framework that integrates predictive
latent world modeling into robot policy learning. By aligning features from a
diffusion transformer with latent embeddings of future observations,
$\textbf{FLARE}$ enables a diffusion transformer policy to anticipate latent
representations of future observations, allowing it to reason about long-term
consequences while generating actions. Remarkably lightweight, $\textbf{FLARE}$
requires only minimal architectural modifications -- adding a few tokens to
standard vision-language-action (VLA) models -- yet delivers substantial
performance gains. Across two challenging multitask simulation imitation
learning benchmarks spanning single-arm and humanoid tabletop manipulation,
$\textbf{FLARE}$ achieves state-of-the-art performance, outperforming prior
policy learning baselines by up to 26%. Moreover, $\textbf{FLARE}$ unlocks the
ability to co-train with human egocentric video demonstrations without action
labels, significantly boosting policy generalization to a novel object with
unseen geometry from just a single robot demonstration. Our results
establish $\textbf{FLARE}$ as a general and scalable approach for combining
implicit world modeling with high-frequency robotic control.
comment: Project Webpage / Blogpost:
https://research.nvidia.com/labs/gear/flare
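The alignment objective can be pictured as below: a negative-cosine loss
between the policy's added future tokens and stop-gradient latent embeddings
of the observed future frame. This is a generic latent-alignment sketch under
assumed shapes; FLARE's exact loss and token wiring may differ.

    import torch
    import torch.nn.functional as F

    def alignment_loss(future_tokens: torch.Tensor,
                       future_obs_embed: torch.Tensor) -> torch.Tensor:
        # Align the policy's extra "future" tokens with latent embeddings of
        # the actually observed future frame; targets get no gradient.
        f = F.normalize(future_tokens, dim=-1)
        g = F.normalize(future_obs_embed.detach(), dim=-1)
        return -(f * g).sum(-1).mean()

    # toy shapes: batch of 8, 4 future tokens, 256-D latent
    loss = alignment_loss(torch.randn(8, 4, 256), torch.randn(8, 4, 256))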
☆ World Models as Reference Trajectories for Rapid Motor Adaptation
Deploying learned control policies in real-world environments poses a
fundamental challenge. When system dynamics change unexpectedly, performance
degrades until models are retrained on new data. We introduce Reflexive World
Models (RWM), a dual control framework that uses world model predictions as
implicit reference trajectories for rapid adaptation. Our method separates the
control problem into long-term reward maximization through reinforcement
learning and robust motor execution through rapid latent control. This dual
architecture achieves significantly faster adaptation with low online
computational cost compared to model-based RL baselines, while maintaining
near-optimal performance. The approach combines the benefits of flexible policy
learning through reinforcement learning with rapid error correction
capabilities, providing a principled approach to maintaining performance in
high-dimensional continuous control tasks under varying dynamics.
☆ Robo-DM: Data Management For Large Robot Datasets ICRA 2025
Kaiyuan Chen, Letian Fu, David Huang, Yanxiang Zhang, Lawrence Yunliang Chen, Huang Huang, Kush Hari, Ashwin Balakrishna, Ted Xiao, Pannag R Sanketi, John Kubiatowicz, Ken Goldberg
Recent results suggest that very large datasets of teleoperated robot
demonstrations can be used to train transformer-based models that have the
potential to generalize to new scenes, robots, and tasks. However, curating,
distributing, and loading large datasets of robot trajectories, which typically
consist of video, textual, and numerical modalities - including streams from
multiple cameras - remains challenging. We propose Robo-DM, an efficient
open-source cloud-based data management toolkit for collecting, sharing, and
learning with robot data. With Robo-DM, robot datasets are stored in a
self-contained format with Extensible Binary Meta Language (EBML). Robo-DM can
significantly reduce the size of robot trajectory data, transfer costs, and
data load time during training. Compared to the RLDS format used in OXE
datasets, Robo-DM's compression saves space by up to 70x (lossy) and 3.5x
(lossless). Robo-DM also accelerates data retrieval by load-balancing video
decoding with memory-mapped decoding caches. Compared to LeRobot, a framework
that also uses lossy video compression, Robo-DM is up to 50x faster when
decoding sequentially. We physically evaluate a model trained with Robo-DM's
lossy compression on a pick-and-place task using an In-Context Robot
Transformer. With 75x compression of the original dataset, Robo-DM suffers no
reduction in downstream task accuracy.
comment: Best paper finalist of IEEE ICRA 2025
☆ Robo2VLM: Visual Question Answering from Large-Scale In-the-Wild Robot Manipulation Datasets
Vision-Language Models (VLMs) acquire real-world knowledge and general
reasoning ability through Internet-scale image-text corpora. They can augment
robotic systems with scene understanding and task planning, and assist
visuomotor policies that are trained on robot trajectory data. We explore the
reverse paradigm - using rich, real, multi-modal robot trajectory data to
enhance and evaluate VLMs. In this paper, we present Robo2VLM, a Visual
Question Answering (VQA) dataset generation framework for VLMs. Given a human
tele-operated robot trajectory, Robo2VLM derives ground-truth from non-visual
and non-descriptive sensory modalities, such as end-effector pose, gripper
aperture, and force sensing. Based on these modalities, it segments the robot
trajectory into a sequence of manipulation phases. At each phase, Robo2VLM uses
scene and interaction understanding to identify 3D properties of the robot,
task goal, and the target object. The properties are used to generate
representative VQA queries - images with textual multiple-choice questions -
based on spatial, goal-conditioned, and interaction reasoning question
templates. We curate Robo2VLM-1, a large-scale in-the-wild dataset with 684,710
questions covering 463 distinct scenes and 3,396 robotic manipulation tasks
from 176k real robot trajectories. Results suggest that Robo2VLM-1 can
benchmark and improve VLM capabilities in spatial and interaction reasoning.
☆ Coloring Between the Lines: Personalization in the Null Space of Planning Constraints
Generalist robots must personalize in-the-wild to meet the diverse needs and
preferences of long-term users. How can we enable flexible personalization
without sacrificing safety or competency? This paper proposes Coloring Between
the Lines (CBTL), a method for personalization that exploits the null space of
constraint satisfaction problems (CSPs) used in robot planning. CBTL begins
with a CSP generator that ensures safe and competent behavior, then
incrementally personalizes behavior by learning parameterized constraints from
online interaction. By quantifying uncertainty and leveraging the
compositionality of planning constraints, CBTL achieves sample-efficient
adaptation without environment resets. We evaluate CBTL in (1) three diverse
simulation environments; (2) a web-based user study; and (3) a real-robot
assisted feeding system, finding that CBTL consistently achieves more effective
personalization with fewer interactions than baselines. Our results demonstrate
that CBTL provides a unified and practical approach for continual, flexible,
active, and safe robot personalization. Website:
https://emprise.cs.cornell.edu/cbtl/
☆ Synthetic Enclosed Echoes: A New Dataset to Mitigate the Gap Between Simulated and Real-World Sonar Data
This paper introduces Synthetic Enclosed Echoes (SEE), a novel dataset
designed to enhance robot perception and 3D reconstruction capabilities in
underwater environments. SEE comprises high-fidelity synthetic sonar data,
complemented by a smaller subset of real-world sonar data. To facilitate
flexible data acquisition, a simulated environment has been developed, enabling
the generation of additional data through modifications such as the inclusion
of new structures or imaging sonar configurations. This hybrid approach
leverages the advantages of synthetic data, including readily available ground
truth and the ability to generate diverse datasets, while bridging the
simulation-to-reality gap with real-world data acquired in a similar
environment. The SEE dataset enables comprehensive evaluation of acoustic
data-based methods, including mathematics-based sonar approaches and deep learning
algorithms. These techniques were employed to validate the dataset, confirming
its suitability for underwater 3D reconstruction. Furthermore, this paper
proposes a novel modification to a state-of-the-art algorithm, demonstrating
improved performance compared to existing methods. The SEE dataset enables the
evaluation of acoustic data-based methods in realistic scenarios, thereby
improving their feasibility for real-world underwater applications.
comment: This work has been submitted to the IEEE for possible publication
☆ Guided Policy Optimization under Partial Observability
Reinforcement Learning (RL) in partially observable environments poses
significant challenges due to the complexity of learning under uncertainty.
While additional information, such as that available in simulations, can
enhance training, effectively leveraging it remains an open problem. To address
this, we introduce Guided Policy Optimization (GPO), a framework that co-trains
a guider and a learner. The guider takes advantage of privileged information
while ensuring alignment with the learner's policy that is primarily trained
via imitation learning. We theoretically demonstrate that this learning scheme
achieves optimality comparable to direct RL, thereby overcoming key limitations
inherent in existing approaches. Empirical evaluations show strong performance
of GPO across various tasks, including continuous control with partial
observability and noise, and memory-based challenges, significantly
outperforming existing methods.
comment: 24 pages, 13 figures
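A sketch of the co-training step may help: the guider (privileged
observations) optimizes its RL objective plus a penalty that keeps it close to
the learner, while the learner (partial observations) imitates the guider. The
discrete-action KL wiring and the beta coefficient are illustrative
assumptions, not the paper's exact objective.

    import torch
    import torch.nn.functional as F

    def gpo_losses(guider_logits, learner_logits, guider_rl_loss, beta=1.0):
        g_logp = F.log_softmax(guider_logits, dim=-1)
        l_logp = F.log_softmax(learner_logits, dim=-1)
        # KL(learner || guider): penalize the guider for drifting away from
        # what the partially observing learner can actually reproduce.
        drift = F.kl_div(g_logp, l_logp.detach().exp(), reduction="batchmean")
        guider_loss = guider_rl_loss + beta * drift
        # The learner imitates the (frozen-for-this-term) guider.
        learner_loss = F.kl_div(l_logp, g_logp.detach().exp(), reduction="batchmean")
        return guider_loss, learner_loss

    gl, ll = gpo_losses(torch.randn(32, 6), torch.randn(32, 6), torch.tensor(0.5))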
☆ Evaluation of Mobile Environment for Vehicular Visible Light Communication Using Multiple LEDs and Event Cameras
In the fields of Advanced Driver Assistance Systems (ADAS) and Autonomous
Driving (AD), sensors that serve as the "eyes" for sensing the vehicle's
surrounding environment are essential. Traditionally, image sensors and LiDAR
have played this role. However, a new type of vision sensor, event cameras, has
recently attracted attention. Event cameras respond to changes in the
surrounding environment (e.g., motion), exhibit strong robustness against
motion blur, and perform well in high dynamic range environments, which are
desirable in robotics applications. Furthermore, the asynchronous and
low-latency principles of data acquisition make event cameras suitable for
optical communication. By adding communication functionality to event cameras,
it becomes possible to utilize I2V communication to immediately share
information about forward collisions, sudden braking, and road conditions,
thereby contributing to hazard avoidance. Additionally, receiving information
such as signal timing and traffic volume enables speed adjustment and optimal
route selection, facilitating more efficient driving. In this study, we
construct a vehicle visible light communication system where event cameras are
receivers, and multiple LEDs are transmitters. In driving scenes, the system
tracks the transmitter positions and separates densely packed LED light sources
using pilot sequences based on Walsh-Hadamard codes. As a result, outdoor
vehicle experiments demonstrate error-free communication under conditions where
the transmitter-receiver distance was within 40 meters and the vehicle's
driving speed was 30 km/h (8.3 m/s).
comment: 6 pages, IEEE IV 2025 conference paper
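The pilot-sequence idea rests on the mutual orthogonality of Walsh-Hadamard
rows, which the toy model below demonstrates: each LED transmits its own code,
the receiver observes a superposition, and matched filtering separates the
sources. Real event-camera measurements are asynchronous; the synchronous
vector model here is a simplification.

    import numpy as np

    def walsh_hadamard(n: int) -> np.ndarray:
        # Sylvester construction of an n x n Hadamard matrix (n a power of 2).
        H = np.array([[1]])
        while H.shape[0] < n:
            H = np.block([[H, H], [H, -H]])
        return H

    H = walsh_hadamard(8)
    gains = np.array([0.9, 0.0, 0.3, 0.0, 0.0, 0.5, 0.0, 0.0])  # per-LED intensity
    rx = gains @ H + 0.05 * np.random.randn(8)  # superposed received signal
    est = rx @ H.T / 8                          # matched filtering (H @ H.T = 8 I)
    print(np.round(est, 2))                     # approximately recovers `gains`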
☆ RAZER: Robust Accelerated Zero-Shot 3D Open-Vocabulary Panoptic Reconstruction with Spatio-Temporal Aggregation
Mapping and understanding complex 3D environments is fundamental to how
autonomous systems perceive and interact with the physical world, requiring
both precise geometric reconstruction and rich semantic comprehension. While
existing 3D semantic mapping systems excel at reconstructing and identifying
predefined object instances, they lack the flexibility to efficiently build
open-vocabulary semantic maps during online operation. Although recent
vision-language models have enabled open-vocabulary object recognition in 2D
images, they have not yet bridged the gap to 3D spatial understanding. The
critical challenge lies in developing a training-free unified system that can
simultaneously construct accurate 3D maps while maintaining semantic
consistency and supporting natural language interactions in real time. In this
paper, we develop a zero-shot framework that seamlessly integrates
GPU-accelerated geometric reconstruction with open-vocabulary vision-language
models through online instance-level semantic embedding fusion, guided by
hierarchical object association with spatial indexing. Our training-free system
achieves superior performance through incremental processing and unified
geometric-semantic updates, while robustly handling 2D segmentation
inconsistencies. The proposed general-purpose 3D scene understanding framework
can be used for various tasks including zero-shot 3D instance retrieval,
segmentation, and object detection to reason about previously unseen objects
and interpret natural language queries. The project page is available at
https://razer-3d.github.io.
☆ Saliency-Aware Quantized Imitation Learning for Efficient Robotic Control
Seongmin Park, Hyungmin Kim, Sangwoo kim, Wonseok Jeon, Juyoung Yang, Byeongwook Jeon, Yoonseon Oh, Jungwook Choi
Deep neural network (DNN)-based policy models, such as vision-language-action
(VLA) models, excel at automating complex decision-making from multi-modal
inputs. However, scaling these models greatly increases computational overhead,
complicating deployment in resource-constrained settings like robot
manipulation and autonomous driving. To address this, we propose Saliency-Aware
Quantized Imitation Learning (SQIL), which combines quantization-aware training
with a selective loss-weighting strategy for mission-critical states. By
identifying these states via saliency scores and emphasizing them in the
training loss, SQIL preserves decision fidelity under low-bit precision. We
validate SQIL's generalization capability across extensive simulation
benchmarks with environment variations, real-world tasks, and cross-domain
tasks (self-driving, physics simulation), consistently recovering
full-precision performance. Notably, a 4-bit weight-quantized VLA model for
robotic manipulation achieves up to 2.5x speedup and 2.5x energy savings on an
edge GPU with minimal accuracy loss. These results underline SQIL's potential
for efficiently deploying large IL-based policy models on resource-limited
devices.
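One way to picture the selective loss weighting: scale the per-sample
imitation error by a saliency score so mission-critical states dominate the
quantization-aware training signal. The weighting form and gamma are
illustrative assumptions, not SQIL's published formula.

    import torch

    def saliency_weighted_bc_loss(pred_actions, expert_actions, saliency, gamma=2.0):
        # `saliency` holds per-state importance scores in [0, 1]; up-weighting
        # critical states preserves decision fidelity under low-bit precision.
        per_sample = ((pred_actions - expert_actions) ** 2).mean(dim=-1)
        weights = 1.0 + gamma * saliency
        return (weights * per_sample).mean()

    loss = saliency_weighted_bc_loss(torch.randn(64, 7), torch.randn(64, 7),
                                     torch.rand(64))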
☆ AgentThink: A Unified Framework for Tool-Augmented Chain-of-Thought Reasoning in Vision-Language Models for Autonomous Driving
Kangan Qian, Sicong Jiang, Yang Zhong, Ziang Luo, Zilin Huang, Tianze Zhu, Kun Jiang, Mengmeng Yang, Zheng Fu, Jinyu Miao, Yining Shi, He Zhe Lim, Li Liu, Tianbao Zhou, Hongyi Wang, Huang Yu, Yifei Hu, Guang Li, Guang Chen, Hao Ye, Lijun Sun, Diange Yang
Vision-Language Models (VLMs) show promise for autonomous driving, yet their
struggle with hallucinations, inefficient reasoning, and limited real-world
validation hinders accurate perception and robust step-by-step reasoning. To
overcome this, we introduce \textbf{AgentThink}, a pioneering unified framework
that, for the first time, integrates Chain-of-Thought (CoT) reasoning with
dynamic, agent-style tool invocation for autonomous driving tasks. AgentThink's
core innovations include: \textbf{(i) Structured Data Generation}, by
establishing an autonomous driving tool library to automatically construct
structured, self-verified reasoning data explicitly incorporating tool usage
for diverse driving scenarios; \textbf{(ii) A Two-stage Training Pipeline},
employing Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization
(GRPO) to equip VLMs with the capability for autonomous tool invocation; and
\textbf{(iii) Agent-style Tool-Usage Evaluation}, introducing a novel
multi-tool assessment protocol to rigorously evaluate the model's tool
invocation and utilization. Experiments on the DriveLMM-o1 benchmark
demonstrate AgentThink significantly boosts overall reasoning scores by
\textbf{53.91\%} and enhances answer accuracy by \textbf{33.54\%}, while
markedly improving reasoning quality and consistency. Furthermore, ablation
studies and robust zero-shot/few-shot generalization experiments across various
benchmarks underscore its powerful capabilities. These findings highlight a
promising trajectory for developing trustworthy and tool-aware autonomous
driving models.
comment: 18 pages, 8 figures
☆ R3GS: Gaussian Splatting for Robust Reconstruction and Relocalization in Unconstrained Image Collections
We propose R3GS, a robust reconstruction and relocalization framework
tailored for unconstrained datasets. Our method uses a hybrid representation
during training. Each anchor combines a global feature from a convolutional
neural network (CNN) with a local feature encoded by the multiresolution hash
grids [2]. Subsequently, several shallow multi-layer perceptrons (MLPs) predict
the attributes of each Gaussian, including color, opacity, and covariance. To
mitigate the adverse effects of transient objects on the reconstruction
process, we fine-tune a lightweight human detection network. Once fine-tuned,
this network generates a visibility map that efficiently generalizes to other
transient objects (such as posters, banners, and cars) with minimal need for
further adaptation. Additionally, to address the challenges posed by sky
regions in outdoor scenes, we propose an effective sky-handling technique that
incorporates a depth prior as a constraint. This allows the infinitely distant
sky to be represented on the surface of a large-radius sky sphere,
significantly reducing floaters caused by errors in sky reconstruction.
Furthermore, we introduce a novel relocalization method that remains robust to
changes in lighting conditions while estimating the camera pose of a given
image within the reconstructed 3DGS scene. As a result, R3GS significantly
enhances rendering fidelity, improves both training and rendering efficiency,
and reduces storage requirements. Our method achieves state-of-the-art
performance compared to baseline methods on in-the-wild datasets. The code will
be made open-source following the acceptance of the paper.
comment: 7 pages, 4 figures
☆ Learning-based Autonomous Oversteer Control and Collision Avoidance
Oversteer, wherein a vehicle's rear tires lose traction and induce
unintentional excessive yaw, poses critical safety challenges. Failing to
control oversteer often leads to severe traffic accidents. Although recent
autonomous driving efforts have attempted to handle oversteer through
stabilizing maneuvers, the majority rely on expert-defined trajectories or
assume obstacle-free environments, limiting real-world applicability. This
paper introduces a novel end-to-end (E2E) autonomous driving approach that
tackles oversteer control and collision avoidance simultaneously. Existing E2E
techniques, including Imitation Learning (IL), Reinforcement Learning (RL), and
Hybrid Learning (HL), generally require near-optimal demonstrations or
extensive experience. Yet even skilled human drivers struggle to provide
perfect demonstrations under oversteer, and high transition variance hinders
accumulating sufficient data. Hence, we present Q-Compared Soft Actor-Critic
(QC-SAC), a new HL algorithm that effectively learns from suboptimal
demonstration data and adapts rapidly to new conditions. To evaluate QC-SAC, we
introduce a benchmark inspired by real-world driver training: a vehicle
encounters sudden oversteer on a slippery surface and must avoid randomly
placed obstacles ahead. Experimental results show QC-SAC attains near-optimal
driving policies, significantly surpassing state-of-the-art IL, RL, and HL
baselines. Our method demonstrates the world's first safe autonomous oversteer
control with obstacle avoidance.
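A sketch of the "Q-compared" idea: imitate a demonstration action only where
the critic currently rates it above the policy's own action, so suboptimal
demonstrations are filtered rather than copied blindly. The gating rule and
wiring below are illustrative; QC-SAC's exact formulation may differ.

    import torch

    def q_compared_bc_weight(q_net, states, demo_actions, policy_actions):
        # Returns a 0/1 mask per transition: 1 means the demo action currently
        # looks better than the policy's, so its imitation term is kept.
        with torch.no_grad():
            q_demo = q_net(states, demo_actions)
            q_pi = q_net(states, policy_actions)
        return (q_demo > q_pi).float()

    # Hypothetical wiring inside an actor update:
    # w = q_compared_bc_weight(critic, s, a_demo, actor(s))
    # actor_loss = sac_actor_loss(s) + (w * ((actor(s) - a_demo) ** 2).sum(-1)).mean()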
☆ GCNT: Graph-Based Transformer Policies for Morphology-Agnostic Reinforcement Learning
Training a universal controller for robots with different morphologies is a
promising research trend, since it can significantly enhance the robustness and
resilience of the robotic system. However, diverse morphologies can yield
different dimensions of state space and action space, making it difficult to
comply with traditional policy networks. Existing methods address this issue by
modularizing the robot configuration but do not adequately extract and
utilize the overall morphological information, which has been proven crucial
for training a universal controller. To this end, we propose GCNT, a
morphology-agnostic policy network based on improved Graph Convolutional
Network (GCN) and Transformer. It exploits the fact that GCN and Transformer
can handle an arbitrary number of modules to achieve compatibility with diverse
morphologies. Our key insight is that the GCN is able to efficiently extract
morphology information of robots, while Transformer ensures that it is fully
utilized by allowing each node of the robot to communicate this information
directly. Experimental results show that our method can generate resilient
locomotion behaviors for robots with different configurations, including
zero-shot generalization to robot morphologies not seen during training. In
particular, GCNT achieved the best performance on 8 tasks in the 2 standard
benchmarks.
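The GCN-then-Transformer composition can be sketched in a few lines: one graph
convolution pass extracts local morphology structure, then self-attention lets
every joint communicate globally, and a shared head emits one action per node
so the same network serves any number of joints. The dimensions and layer
counts are illustrative, not GCNT's actual architecture.

    import torch
    import torch.nn as nn

    class GCNTBlock(nn.Module):
        def __init__(self, feat_dim=64, heads=4):
            super().__init__()
            self.gcn = nn.Linear(feat_dim, feat_dim)
            enc = nn.TransformerEncoderLayer(feat_dim, heads, batch_first=True)
            self.tf = nn.TransformerEncoder(enc, num_layers=2)
            self.head = nn.Linear(feat_dim, 1)  # one torque per node

        def forward(self, x, adj):
            # x: (B, N, F) node features; adj: (B, N, N) normalized adjacency
            x = torch.relu(adj @ self.gcn(x))  # GCN message passing
            x = self.tf(x)                      # global node-to-node communication
            return self.head(x).squeeze(-1)     # (B, N) actions

    # Works for any node count N, so one policy serves many morphologies.
    net = GCNTBlock()
    actions = net(torch.randn(2, 9, 64), torch.eye(9).repeat(2, 1, 1))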
☆ EndoVLA: Dual-Phase Vision-Language-Action Model for Autonomous Tracking in Endoscopy
Chi Kit Ng, Long Bai, Guankun Wang, Yupeng Wang, Huxin Gao, Kun Yuan, Chenhan Jin, Tieyong Zeng, Hongliang Ren
In endoscopic procedures, autonomous tracking of abnormal regions and
following circumferential cutting markers can significantly reduce the
cognitive burden on endoscopists. However, conventional model-based pipelines
are fragile because each component (e.g., detection, motion planning) requires
manual tuning and struggles to incorporate high-level endoscopic intent,
leading to poor generalization across diverse scenes. Vision-Language-Action
(VLA) models, which integrate visual perception, language grounding, and motion
planning within an end-to-end framework, offer a promising alternative by
semantically adapting to surgeon prompts without manual recalibration. Despite
their potential, applying VLA models to robotic endoscopy presents unique
challenges due to the complex and dynamic anatomical environments of the
gastrointestinal (GI) tract. To address this, we introduce EndoVLA, designed
specifically for continuum robots in GI interventions. Given endoscopic images
and surgeon-issued tracking prompts, EndoVLA performs three core tasks: (1)
polyp tracking, (2) delineation and following of abnormal mucosal regions, and
(3) adherence to circular markers during circumferential cutting. To tackle
data scarcity and domain shifts, we propose a dual-phase strategy comprising
supervised fine-tuning on our EndoVLA-Motion dataset and reinforcement
fine-tuning with task-aware rewards. Our approach significantly improves
tracking performance in endoscopy and enables zero-shot generalization in
diverse scenes and complex sequential tasks.
☆ Cascaded Diffusion Models for Neural Motion Planning ICRA'25
Robots in the real world need to perceive and move to goals in complex
environments without collisions. Avoiding collisions is especially difficult
when relying on sensor perception and when goals are among clutter. Diffusion
policies and other generative models have shown strong performance in solving
local planning problems, but often struggle at avoiding all of the subtle
constraint violations that characterize truly challenging global motion
planning problems. In this work, we propose an approach for learning global
motion planning using diffusion policies, allowing the robot to generate full
trajectories through complex scenes and reasoning about multiple obstacles
along the path. Our approach uses cascaded hierarchical models which unify
global prediction and local refinement together with online plan repair to
ensure the trajectories are collision free. Our method outperforms (by ~5%) a
wide variety of baselines on challenging tasks in multiple domains including
navigation and manipulation.
comment: ICRA'25
☆ Object-Focus Actor for Data-efficient Robot Generalization Dexterous Manipulation
Yihang Li, Tianle Zhang, Xuelong Wei, Jiayi Li, Lin Zhao, Dongchi Huang, Zhirui Fang, Minhua Zheng, Wenjun Dai, Xiaodong He
Robot manipulation learning from human demonstrations offers a rapid means to
acquire skills but often lacks generalization across diverse scenes and object
placements. This limitation hinders real-world applications, particularly in
complex tasks requiring dexterous manipulation. The Vision-Language-Action
(VLA) paradigm leverages large-scale data to enhance generalization; however,
due to data scarcity, its performance remains limited. In this work, we introduce
Object-Focus Actor (OFA), a novel, data-efficient approach for generalized
dexterous manipulation. OFA exploits the consistent end trajectories observed
in dexterous manipulation tasks, allowing for efficient policy training. Our
method employs a hierarchical pipeline: object perception and pose estimation,
pre-manipulation pose arrival and OFA policy execution. This process ensures
that the manipulation is focused and efficient, even across varied backgrounds
and positional layouts. Comprehensive real-world experiments across seven tasks
demonstrate that OFA significantly outperforms baseline methods in both
positional and background generalization tests. Notably, OFA achieves robust
performance with only 10 demonstrations, highlighting its data efficiency.
☆ Learning-based Airflow Inertial Odometry for MAVs using Thermal Anemometers in a GPS and vision denied environment
This work demonstrates an airflow-inertial odometry system with multi-sensor
data fusion, including a thermal anemometer, IMU, ESC, and barometer. The task
is challenging because low-cost IMUs and barometers have significant bias, and
anemometer measurements are highly susceptible to interference from spinning
propellers and ground effects. We employ a GRU-based deep neural network to
estimate relative airspeed from noisy and disturbed anemometer measurements,
and an observer with a bias model to fuse the sensor data and thus estimate
the state of the aerial vehicle. A complete flight sequence, including takeoff
and landing on the ground, shows that the approach is able to decouple the
downwash-induced wind speed caused by propellers and the ground effect, and to
accurately estimate the flight speed in a wind-free indoor environment. IMU
and barometer biases are effectively estimated, reducing position integration
drift to only 5.7 m over a 203 s manual random flight. The source code is
available at https://github.com/SyRoCo-ISIR/Flight-Speed-Estimation-Airflow.
☆ Histo-Planner: A Real-time Local Planner for MAVs Teleoperation based on Histogram of Obstacle Distribution
This paper concerns real-time obstacle avoidance for micro aerial vehicles
(MAVs). Motivated by teleoperation applications in cluttered environments with
limited computational power, we propose a local planner that does not require
the knowledge or construction of a global map of the obstacles. The proposed
solution consists of a real-time trajectory planning algorithm that relies on
the histogram of obstacle distribution and a planner manager that triggers
different planning modes depending on the obstacle locations around the MAV. The
proposed solution is validated, for a teleoperation application, with both
simulations and indoor experiments. Benchmark comparisons based on a designed
simulation platform are also provided.
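To illustrate the map-free histogram idea, here is a VFH-style angular
histogram built from a local point cloud, plus a simple free-direction
selector; the bin count, weighting, and threshold are assumptions, and the
paper's planner-manager logic is richer than this stand-in.

    import numpy as np

    def polar_histogram(points, bins=36, max_range=5.0):
        # Weighted angular density of nearby obstacles (x, y in body frame);
        # nearer points count more, and no global map is needed.
        d = np.linalg.norm(points, axis=1)
        keep = d < max_range
        ang = np.arctan2(points[keep, 1], points[keep, 0])
        weights = (max_range - d[keep]) / max_range
        hist, _ = np.histogram(ang, bins=bins, range=(-np.pi, np.pi),
                               weights=weights)
        return hist

    def pick_heading(hist, goal_bin, threshold=0.5):
        # Choose the free bin closest to the goal direction (ignoring
        # wrap-around for brevity); None would trigger a stop/escape mode.
        free = np.where(hist < threshold)[0]
        if free.size == 0:
            return None
        return int(free[np.argmin(np.abs(free - goal_bin))])

    pts = np.random.uniform(-5, 5, size=(500, 2))
    print(pick_heading(polar_histogram(pts), goal_bin=18))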
☆ Fault-Tolerant Multi-Robot Coordination with Limited Sensing within Confined Environments
As robots are increasingly deployed to collaborate on tasks within shared
workspaces and resources, the failure of an individual robot can critically
affect the group's performance. This issue is particularly challenging when
robots lack global information or direct communication, relying instead on
social interaction for coordination and to complete their tasks. In this study,
we propose a novel fault-tolerance technique leveraging physical contact
interactions in multi-robot systems, specifically under conditions of limited
sensing and spatial confinement. We introduce the "Active Contact Response"
(ACR) method, where each robot modulates its behavior based on the likelihood
of encountering an inoperative (faulty) robot. Active robots are capable of
collectively repositioning stationary and faulty peers to reduce obstructions
and maintain optimal group functionality. We implement our algorithm in a team
of autonomous robots, equipped with contact-sensing and collision-tolerance
capabilities, tasked with collectively excavating cohesive model pellets.
Experimental results indicate that the ACR method significantly improves the
system's recovery time from robot failures, enabling continued collective
excavation with minimal performance degradation. Thus, this work demonstrates
the potential of leveraging local, social, and physical interactions to enhance
fault tolerance and coordination in multi-robot systems operating in
constrained and extreme environments.
comment: 15 pages, 4 figures. Accepted to DARS 2024 (Distributed Autonomous
Robotic Systems), to appear in Springer Proceedings in Advanced Robotics
☆ Toward Task Capable Active Matter: Learning to Avoid Clogging in Confined Collectives via Collisions
Kehinde O. Aina, Ram Avinery, Hui-Shun Kuan, Meredith D. Betterton, Michael A. D. Goodisman, Daniel I. Goldman
Social organisms which construct nests consisting of tunnels and chambers
necessarily navigate confined and crowded conditions. Unlike low-density
collectives like bird flocks and insect swarms, in which hydrodynamic and
statistical phenomena dominate, the physics of glasses and supercooled fluids
is important to understand clogging behaviors in high-density collectives. Our
previous work revealed that fire ants flowing in confined tunnels utilize
diverse behaviors like unequal workload distributions, spontaneous direction
reversals, and limited interaction times to mitigate clogging and jamming and
thus maintain functional flow; implementation of similar rules in a small
robophysical swarm led to high performance through spontaneous dissolution of
clogs and clusters. However, how the insects learn such behaviors, and how we
can develop "task capable" active matter in such regimes, remains a challenge
in part because interaction dynamics are dominated by local, time-consuming
collisions and no single agent can guide the entire collective. Here, we
hypothesized that effective flow and clog mitigation could emerge purely
through local learning. We tasked small groups of robots with pellet excavation
in a narrow tunnel, allowing them to modify reversal probabilities over time.
Initially, robots had equal probabilities and clogs were common. Reversals
improved flow. When reversal probabilities adapted via collisions and noisy
tunnel length estimates, workload inequality and performance improved. Our
robophysical study of an excavating swarm shows that, despite the seeming
complexity and difficulty of the task, simple learning rules can mitigate or
leverage unavoidable features in task-capable dense active matter, leading to
hypotheses for dense biological and robotic swarms.
comment: 13 pages, 9 figures. Published in Frontiers in Physics, Social
Physics section. Includes experimental and simulation analysis of multi-robot
excavation using decentralized learning
★ Shape-Adaptive Planning and Control for a Deformable Quadrotor
Drones have become essential in various applications, but conventional
quadrotors face limitations in confined spaces and complex tasks. Deformable
drones, which can adapt their shape in real-time, offer a promising solution to
overcome these challenges, while also enhancing maneuverability and enabling
novel tasks like object grasping. This paper presents a novel approach to
autonomous motion planning and control for deformable quadrotors. We introduce
a shape-adaptive trajectory planner that incorporates deformation dynamics into
path generation, using a scalable kinodynamic A* search to handle deformation
parameters in complex environments. The backend spatio-temporal optimization is
capable of generating optimally smooth trajectories that incorporate shape
deformation. Additionally, we propose an enhanced control strategy that
compensates for external forces and torque disturbances, achieving a 37.3\%
reduction in trajectory tracking error compared to our previous work. Our
approach is validated through simulations and real-world experiments,
demonstrating its effectiveness in narrow-gap traversal and multi-modal
deformable tasks.
☆ UniSTPA: A Safety Analysis Framework for End-to-End Autonomous Driving
As autonomous driving technology continues to advance, end-to-end models have
attracted considerable attention owing to their superior generalisation
capability. Nevertheless, such learning-based systems entail numerous safety
risks throughout development and on-road deployment, and existing
safety-analysis methods struggle to identify these risks comprehensively. To
address this gap, we propose the Unified System Theoretic Process Analysis
(UniSTPA) framework, which extends the scope of STPA from the operational phase
to the entire lifecycle of an end-to-end autonomous driving system, including
information gathering, data preparation, closed loop training, verification,
and deployment. UniSTPA performs hazard analysis not only at the component
level but also within the model's internal layers, thereby enabling
fine-grained assessment of inter- and intra-module interactions. Using a highway
Navigate on Autopilot function as a case study, UniSTPA uncovers multi-stage
hazards overlooked by conventional approaches, including scene design defects,
sensor fusion biases, and internal model flaws, and through multi-level causal
analysis traces these hazards to deeper issues such as data quality, network
architecture, and optimisation objectives. The analysis results are used to
construct a safety monitoring and safety response mechanism that supports
continuous improvement from hazard identification to system optimisation. The
proposed framework thus offers both theoretical and practical guidance for the
safe development and deployment of end-to-end autonomous driving systems.
☆ AnyBody: A Benchmark Suite for Cross-Embodiment Manipulation
Generalizing control policies to novel embodiments remains a fundamental
challenge in enabling scalable and transferable learning in robotics. While
prior works have explored this in locomotion, a systematic study in the context
of manipulation tasks remains limited, partly due to the lack of standardized
benchmarks. In this paper, we introduce a benchmark for learning
cross-embodiment manipulation, focusing on two foundational tasks, reach and
push, across a diverse range of morphologies. The benchmark is designed to test
generalization along three axes: interpolation (testing performance within a
robot category that shares the same link structure), extrapolation (testing on
a robot with a different link structure), and composition (testing on
combinations of link structures). On the benchmark, we evaluate the ability of
different RL policies to learn from multiple morphologies and to generalize to
novel ones. Our study aims to answer whether morphology-aware training can
outperform single-embodiment baselines, whether zero-shot generalization to
unseen morphologies is feasible, and how consistently these patterns hold
across different generalization regimes. The results highlight the current
limitations of multi-embodiment learning and provide insights into how
architectural and training design choices influence policy generalization.
☆ Toward Informed AV Decision-Making: Computational Model of Well-being and Trust in Mobility
For future human-autonomous vehicle (AV) interactions to be effective and
smooth, human-aware systems that analyze and align human needs with automation
decisions are essential. Achieving this requires systems that account for human
cognitive states. We present a novel computational model in the form of a
Dynamic Bayesian Network (DBN) that infers the cognitive states of both AV
users and other road users, integrating this information into the AV's
decision-making process. Specifically, our model captures the well-being of
both an AV user and an interacting road user as cognitive states alongside
trust. Our DBN infers beliefs over the AV user's evolving well-being,
trust, and intention states, as well as the possible well-being of other road
users, based on observed interaction experiences. Using data collected from an
interaction study, we refine the model parameters and empirically assess its
performance. Finally, we extend our model into a causal inference model (CIM)
framework for AV decision-making, enabling the AV to enhance user well-being
and trust while balancing these factors with its own operational costs and the
well-being of interacting road users. Our evaluation demonstrates the model's
effectiveness in accurately predicting users' states and guiding informed,
human-centered AV decisions.
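The belief update in one DBN slice reduces to a discrete Bayes filter, as the
sketch below shows for a single latent state (user trust in {low, med, high}).
The transition and observation matrices are illustrative placeholders, not the
paper's learned parameters.

    import numpy as np

    T = np.array([[0.8, 0.2, 0.0],   # P(trust_t | trust_{t-1}); row = previous state
                  [0.1, 0.8, 0.1],
                  [0.0, 0.2, 0.8]])
    O = np.array([[0.7, 0.2, 0.1],   # P(cue | trust); row = trust, column = cue
                  [0.2, 0.6, 0.2],
                  [0.1, 0.2, 0.7]])

    def dbn_step(belief, obs):
        # Predict with the transition model, then condition on the observed cue.
        predicted = T.T @ belief
        updated = predicted * O[:, obs]
        return updated / updated.sum()

    b = np.full(3, 1 / 3)
    for o in [2, 2, 1, 0]:           # a short observation sequence
        b = dbn_step(b, o)
    print(np.round(b, 3))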
♻ ☆ RoboCrowd: Scaling Robot Data Collection through Crowdsourcing ICRA
Suvir Mirchandani, David D. Yuan, Kaylee Burns, Md Sazzad Islam, Tony Z. Zhao, Chelsea Finn, Dorsa Sadigh
In recent years, imitation learning from large-scale human demonstrations has
emerged as a promising paradigm for training robot policies. However, the
burden of collecting large quantities of human demonstrations is significant in
terms of collection time and the need for access to expert operators. We
introduce a new data collection paradigm, RoboCrowd, which distributes the
workload by utilizing crowdsourcing principles and incentive design. RoboCrowd
helps enable scalable data collection and facilitates more efficient learning
of robot policies. We build RoboCrowd on top of ALOHA (Zhao et al. 2023) -- a
bimanual platform that supports data collection via puppeteering -- to explore
the design space for crowdsourcing in-person demonstrations in a public
environment. We propose three classes of incentive mechanisms to appeal to
users' varying sources of motivation for interacting with the system: material
rewards, intrinsic interest, and social comparison. We instantiate these
incentives through tasks that include physical rewards, engaging or challenging
manipulations, as well as gamification elements such as a leaderboard. We
conduct a large-scale, two-week field experiment in which the platform is
situated in a university cafe. We observe significant engagement with the
system -- over 200 individuals independently volunteered to provide a total of
over 800 interaction episodes. Our findings validate the proposed incentives as
mechanisms for shaping users' data quantity and quality. Further, we
demonstrate that the crowdsourced data can serve as useful pre-training data
for policies fine-tuned on expert demonstrations -- boosting performance up to
20% compared to when this data is not available. These results suggest the
potential for RoboCrowd to reduce the burden of robot data collection by
carefully implementing crowdsourcing and incentive design principles.
comment: 21 pages, 25 figures. International Conference on Robotics and
Automation (ICRA) 2025
♻ ★ Multi-Robot System for Cooperative Exploration in Unknown Environments: A Survey
Chuqi Wang, Chao Yu, Xin Xu, Yuman Gao, Xinyi Yang, Wenhao Tang, Shu'ang Yu, Yinuo Chen, Feng Gao, ZhuoZhu Jian, Xinlei Chen, Fei Gao, Boyu Zhou, Yu Wang
With the real need of field exploration in large-scale and extreme outdoor
environments, cooperative exploration tasks have garnered increasing attention.
This paper presents a comprehensive review of multi-robot cooperative
exploration systems. First, we review the evolution of robotic exploration and
introduce a modular research framework tailored for multi-robot cooperative
exploration. Based on this framework, we systematically categorize and
summarize key system components. As a foundational module for multi-robot
exploration, the localization and mapping module is introduced first, focusing
on global and relative pose estimation as well as multi-robot map
merging techniques. The cooperative motion module is further divided into
learning-based approaches and multi-stage planning, with the latter
encompassing target generation, task allocation, and motion planning
strategies. Given the communication constraints of real-world environments, we
also analyze the communication module, emphasizing how robots exchange
information within local communication ranges and under limited transmission
capabilities. In addition, we describe the real-world application of
multi-robot cooperative exploration systems in the DARPA SubT Challenge.
Finally, we discuss
the challenges and future research directions for multi-robot cooperative
exploration in light of real-world trends. This review aims to serve as a
valuable reference for researchers and practitioners in the field.
♻ ☆ Effective Sampling for Robot Motion Planning Through the Lens of Lattices
Sampling-based methods for motion planning, which capture the structure of
the robot's free space via (typically random) sampling, have gained popularity
due to their scalability, simplicity, and for offering global guarantees, such
as probabilistic completeness and asymptotic optimality. Unfortunately, the
practicality of those guarantees remains limited as they do not provide
insights into the behavior of motion planners for a finite number of samples
(i.e., a finite running time). In this work, we harness lattice theory and the
concept of $(\delta,\epsilon)$-completeness by Tsao et al. (2020) to construct
deterministic sample sets that endow their planners with strong finite-time
guarantees while minimizing running time. In particular, we introduce a
highly-efficient deterministic sampling approach based on the $A_d^*$ lattice,
which is the best-known geometric covering in dimensions $\leq 21$. Using our
new sampling approach, we obtain at least an order-of-magnitude speedup over
existing deterministic and uniform random sampling methods for complex
motion-planning problems. Overall, our work provides deep mathematical insights
while advancing the practical applicability of sampling-based motion planning.
comment: To appear in Robotics: Science and Systems, 2025
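A small concrete instance of deterministic lattice sampling: in three
dimensions, $A_3^*$ coincides with the body-centered cubic lattice, so a
sample set over the unit cube can be built from a grid plus its half-spacing
translate. The resolution parameter is an assumption; the paper's construction
covers dimensions up to 21 and tunes the lattice scale to the desired
$(\delta,\epsilon)$ guarantee.

    import numpy as np

    def bcc_samples(resolution: int) -> np.ndarray:
        # Grid points at spacing 1/resolution, plus the body-centered translate
        # offset by half a cell in every axis, clipped to [0, 1)^3.
        g = np.linspace(0, 1, resolution, endpoint=False)
        corner = np.stack(np.meshgrid(g, g, g), -1).reshape(-1, 3)
        center = corner + 0.5 / resolution
        return np.vstack([corner, center[(center < 1).all(1)]])

    pts = bcc_samples(8)
    print(pts.shape)  # roughly twice the points of a plain grid at this resolution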
♻ ☆ M3TR: A Generalist Model for Real-World HD Map Completion
Autonomous vehicles rely on HD maps for their operation, but offline HD maps
eventually become outdated. For this reason, online HD map construction methods
use live sensor data to infer map information instead. Research on real map
changes shows that oftentimes entire parts of an HD map remain unchanged and
can be used as a prior. We therefore introduce M3TR (Multi-Masking Map
Transformer), a generalist approach for HD map completion both with and without
offline HD map priors. As a necessary foundation, we address shortcomings in
ground truth labels for Argoverse 2 and nuScenes and propose the first
comprehensive benchmark for HD map completion. Unlike existing models that
specialize in a single kind of map change, which is unrealistic for deployment,
our Generalist model handles all kinds of changes, matching the effectiveness
of Expert models. With our map-masking augmentation regime, we can even
achieve a +1.4 mAP improvement without a prior. Finally, by fully utilizing
prior HD map elements and optimizing query designs, M3TR outperforms existing
methods by +4.3 mAP while being the first real-world deployable model for
offline HD map priors. Code is available at https://github.com/immel-f/m3tr
♻ ☆ PlaySlot: Learning Inverse Latent Dynamics for Controllable Object-Centric Video Prediction and Planning ICML 2025
Predicting future scene representations is a crucial task for enabling robots
to understand and interact with the environment. However, most existing methods
rely on videos and simulations with precise action annotations, limiting their
ability to leverage the large amount of available unlabeled video data. To
address this challenge, we propose PlaySlot, an object-centric video prediction
model that infers object representations and latent actions from unlabeled
video sequences. It then uses these representations to forecast future object
states and video frames. PlaySlot allows the generation of multiple possible
futures conditioned on latent actions, which can be inferred from video
dynamics, provided by a user, or generated by a learned action policy, thus
enabling versatile and interpretable world modeling. Our results show that
PlaySlot outperforms both stochastic and object-centric baselines for video
prediction across different environments. Furthermore, we show that our
inferred latent actions can be used to learn robot behaviors sample-efficiently
from unlabeled video demonstrations. Videos and code are available on
https://play-slot.github.io/PlaySlot/.
comment: ICML 2025
♻ ☆ OpenFly: A Comprehensive Platform for Aerial Vision-Language Navigation
Yunpeng Gao, Chenhui Li, Zhongrui You, Junli Liu, Zhen Li, Pengan Chen, Qizhi Chen, Zhonghan Tang, Liansheng Wang, Penghui Yang, Yiwen Tang, Yuhang Tang, Shuai Liang, Songyi Zhu, Ziqin Xiong, Yifei Su, Xinyi Ye, Jianan Li, Yan Ding, Dong Wang, Zhigang Wang, Bin Zhao, Xuelong Li
Vision-Language Navigation (VLN) aims to guide agents by leveraging language
instructions and visual cues, playing a pivotal role in embodied AI. Indoor VLN
has been extensively studied, whereas outdoor aerial VLN remains underexplored.
A likely reason is that outdoor aerial views encompass vast areas, making
data collection more challenging, which results in a lack of benchmarks. To
address this problem, we propose OpenFly, a platform comprising various
rendering engines, a versatile toolchain, and a large-scale benchmark for
aerial VLN. Firstly, we integrate diverse rendering engines and advanced
techniques for environment simulation, including Unreal Engine, GTA V, Google
Earth, and 3D Gaussian Splatting (3D GS). Particularly, 3D GS supports
real-to-sim rendering, further enhancing the realism of our environments.
Secondly, we develop a highly automated toolchain for aerial VLN data
collection, streamlining point cloud acquisition, scene semantic segmentation,
flight trajectory creation, and instruction generation. Thirdly, based on the
toolchain, we construct a large-scale aerial VLN dataset with 100k
trajectories, covering diverse heights and lengths across 18 scenes. Moreover,
we propose OpenFly-Agent, a keyframe-aware VLN model emphasizing key
observations during flight. For benchmarking, extensive experiments and
analyses are conducted, evaluating several recent VLN methods and showcasing
the superiority of our OpenFly platform and agent. The toolchain, dataset, and
codes will be open-sourced.
♻ ☆ DLO-Splatting: Tracking Deformable Linear Objects Using 3D Gaussian Splatting ICRA2025
Holly Dinkel, Marcel Büsching, Alberta Longhini, Brian Coltin, Trey Smith, Danica Kragic, Mårten Björkman, Timothy Bretl
This work presents DLO-Splatting, an algorithm for estimating the 3D shape of
Deformable Linear Objects (DLOs) from multi-view RGB images and gripper state
information through prediction-update filtering. The DLO-Splatting algorithm
uses a position-based dynamics model with shape smoothness and rigidity
dampening corrections to predict the object shape. Optimization with a 3D
Gaussian Splatting-based rendering loss iteratively renders and refines the
prediction to align it with the visual observations in the update step. Initial
experiments demonstrate promising results in a knot-tying scenario, which is
challenging for existing vision-only methods.
comment: 5 pages, 2 figures, presented at the 2025 5th Workshop: Reflections
on Representations and Manipulating Deformable Objects at the IEEE
International Conference on Robotics and Automation. RMDO workshop
(https://deformable-workshop.github.io/icra2025/). Video
(https://www.youtube.com/watch?v=CG4WDWumGXA). Poster
(https://hollydinkel.github.io/assets/pdf/ICRA2025RMDO_poster.pdf)
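A minimal prediction-update loop in the spirit of DLO-Splatting might look as
follows; the smoothing dynamics and the iterative update are toy stand-ins for
the position-based dynamics model and the 3D Gaussian Splatting rendering
loss, not the paper's actual algorithm:

```python
import numpy as np

# Toy predict-update loop for DLO shape tracking. The smoothing dynamics and
# the iterative update are stand-ins for the position-based dynamics model
# and the 3D Gaussian Splatting rendering loss described above.
def predict(nodes, gripper_delta, smoothness=0.1):
    nodes = nodes + gripper_delta                  # motion from gripper state
    # Shape-smoothness correction: pull interior nodes toward their neighbors.
    midpoints = 0.5 * (nodes[:-2] + nodes[2:])
    nodes[1:-1] += smoothness * (midpoints - nodes[1:-1])
    return nodes

def update(nodes, observed, step=0.05, iters=10):
    # Stand-in for render-and-refine: gradient steps toward the observations.
    for _ in range(iters):
        nodes = nodes + step * (observed - nodes)
    return nodes

nodes = np.linspace([0.0, 0.0, 0.0], [1.0, 0.0, 0.0], 20)   # DLO as 20 3D nodes
observed = nodes + np.random.normal(0.0, 0.01, nodes.shape) # from multi-view RGB
nodes = predict(nodes, gripper_delta=np.array([0.01, 0.0, 0.0]))
nodes = update(nodes, observed)
```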
♻ ☆ ABPT: Amended Backpropagation through Time with Partially Differentiable Rewards
Quadrotor control policies can be trained with high performance using the
exact gradients of the rewards to directly optimize policy parameters via
backpropagation-through-time (BPTT). However, designing a fully differentiable
reward architecture is often challenging. Partially differentiable rewards will
result in biased gradient propagation that degrades training performance. To
overcome this limitation, we propose Amended Backpropagation-through-Time
(ABPT), a novel approach that mitigates gradient bias while preserving the
training efficiency of BPTT. ABPT combines 0-step and N-step returns,
effectively reducing the bias by leveraging value gradients from the learned
Q-value function. Additionally, it adopts entropy regularization and state
initialization mechanisms to encourage exploration during training. We evaluate
ABPT on four representative quadrotor flight tasks in both the real world and
simulation. Experimental results demonstrate that ABPT converges significantly
faster and achieves higher final rewards than existing learning algorithms,
particularly in tasks involving partially differentiable rewards. The code will
be released at http://github.com/Fanxing-LI/ABPT.
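The core amendment, blending a 0-step return (a critic value that is fully
differentiable) with an N-step return backpropagated through time, can be
sketched as follows; the fixed mixing weight alpha and the toy values are
assumptions for illustration, not the exact ABPT formulation:

```python
import torch

# Blend a 0-step return (a differentiable critic value) with an N-step return
# backpropagated through time. The fixed mixing weight alpha and the toy
# values are assumptions for illustration, not the exact ABPT formulation.
def amended_return(rewards, q0, qN, gamma=0.99, alpha=0.5):
    n_step = sum(gamma**t * r for t, r in enumerate(rewards))
    n_step = n_step + gamma**len(rewards) * qN
    return alpha * q0 + (1.0 - alpha) * n_step

q0 = torch.tensor(1.2, requires_grad=True)    # critic value at t = 0
qN = torch.tensor(0.8, requires_grad=True)    # critic value at t = N
rewards = [torch.tensor(r) for r in (0.1, 0.0, 0.3)]
loss = -amended_return(rewards, q0, qN)
loss.backward()                   # gradients flow through both return terms
print(q0.grad, qN.grad)
```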
♻ ☆ Adaptive Diffusion Constrained Sampling for Bimanual Robot Manipulation
Coordinated multi-arm manipulation requires satisfying multiple simultaneous
geometric constraints across high-dimensional configuration spaces, which poses
a significant challenge for traditional planning and control methods. In this
work, we propose Adaptive Diffusion Constrained Sampling (ADCS), a generative
framework that flexibly integrates both equality (e.g., relative and absolute
pose constraints) and structured inequality constraints (e.g., proximity to
object surfaces) into an energy-based diffusion model. Equality constraints are
modeled using dedicated energy networks trained on pose differences in Lie
algebra space, while inequality constraints are represented via Signed Distance
Functions (SDFs) and encoded into learned constraint embeddings, allowing the
model to reason about complex spatial regions. A key innovation of our method
is a Transformer-based architecture that learns to weight constraint-specific
energy functions at inference time, enabling flexible and context-aware
constraint integration. Moreover, we adopt a two-phase sampling strategy that
improves precision and sample diversity by combining Langevin dynamics with
resampling and density-aware re-weighting. Experimental results on dual-arm
manipulation tasks show that ADCS significantly improves sample diversity and
generalization across settings demanding precise coordination and adaptive
constraint handling.
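The inference-time combination of weighted constraint energies with Langevin
dynamics can be illustrated with a toy sketch; the two quadratic energies and
the fixed weights below stand in for the learned energy networks and the
Transformer-predicted weights:

```python
import torch

# Langevin-dynamics sampling over a weighted sum of constraint energies. The
# two hand-written energies and fixed weights are stand-ins for ADCS's learned
# energy networks and Transformer-predicted weights.
def equality_energy(x):           # e.g., a relative-pose constraint
    return ((x - 1.0) ** 2).sum(-1)

def inequality_energy(x):         # e.g., a hinge on an SDF proximity bound
    return torch.relu(x.sum(-1) - 3.0) ** 2

def langevin_sample(weights, steps=200, step_size=0.01, n=64, dim=2):
    x = torch.randn(n, dim)
    for _ in range(steps):
        x.requires_grad_(True)
        energy = weights[0] * equality_energy(x) + weights[1] * inequality_energy(x)
        grad, = torch.autograd.grad(energy.sum(), x)
        x = (x - step_size * grad
             + (2 * step_size) ** 0.5 * torch.randn_like(x)).detach()
    return x

samples = langevin_sample(weights=(1.0, 0.5))
print(samples.mean(0))    # concentrates near the constraint-satisfying region
```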
♻ ☆ From Words to Collisions: LLM-Guided Evaluation and Adversarial Generation of Safety-Critical Driving Scenarios
Ensuring the safety of autonomous vehicles requires virtual scenario-based
testing, which depends on the robust evaluation and generation of
safety-critical scenarios. So far, researchers have used scenario-based testing
frameworks that rely heavily on handcrafted scenarios and safety metrics. To
reduce the effort of human interpretation and overcome the limited scalability
of these approaches, we combine Large Language Models (LLMs) with structured
scenario parsing and prompt engineering to automatically evaluate and generate
safety-critical driving scenarios. We introduce Cartesian and Ego-centric
prompt strategies for scenario evaluation, and an adversarial generation module
that modifies trajectories of risk-inducing vehicles (ego-attackers) to create
critical scenarios. We validate our approach using a 2D simulation framework
and multiple pre-trained LLMs. The results show that the evaluation module
effectively detects collision scenarios and infers scenario safety. Meanwhile,
the new generation module identifies high-risk agents and synthesizes
realistic, safety-critical scenarios. We conclude that an LLM equipped with
domain-informed prompting techniques can effectively evaluate and generate
safety-critical driving scenarios, reducing dependence on handcrafted metrics.
We release our open-source code and scenarios at:
https://github.com/TUM-AVS/From-Words-to-Collisions.
comment: New version of the paper
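As a loose illustration of an ego-centric prompt strategy, the helper below
renders a scene relative to the ego vehicle and asks for a safety verdict;
the template and field names are hypothetical, not the paper's actual prompts
or parser:

```python
# Hypothetical ego-centric prompt builder; the template and field names are
# illustrative assumptions, not the paper's prompt engineering.
def ego_centric_prompt(ego, others):
    lines = [f"Ego vehicle speed: {ego['speed']:.1f} m/s."]
    for name, o in others.items():
        lines.append(f"{name}: {o['distance']:.1f} m away at bearing "
                     f"{o['bearing']:.0f} deg, closing at {o['closing_speed']:.1f} m/s.")
    lines.append("Is this scenario safety-critical? Answer 'safe' or 'critical' "
                 "and name the highest-risk agent.")
    return "\n".join(lines)

prompt = ego_centric_prompt(
    ego={"speed": 13.9},
    others={"attacker_0": {"distance": 8.2, "bearing": 15, "closing_speed": 4.1}},
)
print(prompt)    # send to a pre-trained LLM for scenario evaluation
```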
♻ ☆ DualLQR: Efficient Grasping of Oscillating Apples using Task Parameterized Learning from Demonstration
Learning from Demonstration offers great potential for robots to learn to
perform agricultural tasks, specifically selective harvesting. One of the
challenges is that the target fruit can be oscillating while approaching.
Grasping oscillating targets has two requirements: 1) close tracking of the
target during the final approach for damage-free grasping, and 2) the complete
path should be as short as possible for improved efficiency. We propose a new
method called DualLQR. In this method, we use a finite horizon Linear Quadratic
Regulator (LQR) on a moving target, without needing to refit the LQR. To make
this possible, we use a dual-LQR set-up, with one LQR running in each of two
separate reference frames. Through extensive simulation testing, it was found
that the state-of-the-art method barely meets the required final accuracy without
oscillations and drops below the required accuracy with an oscillating target.
DualLQR, on the other hand, was found to be able to meet the required final
accuracy even with high oscillations, while travelling the least distance.
Further testing on a real-world apple grasping task showed that DualLQR was
able to successfully grasp oscillating apples, with a success rate of 99%.
comment: Accepted to IAS-19 url: https://openreview.net/forum?id=vipwIaog1u
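The building block run in each reference frame, a finite-horizon discrete-time
LQR solved by the backward Riccati recursion, can be sketched as follows; the
double-integrator model and cost weights are illustrative assumptions, not the
paper's setup:

```python
import numpy as np

# Finite-horizon discrete-time LQR via the backward Riccati recursion, the
# building block run in each of the two reference frames. The double
# integrator and cost weights are illustrative, not the paper's setup.
def finite_horizon_lqr(A, B, Q, R, Qf, horizon):
    P, gains = Qf, []
    for _ in range(horizon):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return gains[::-1]            # feedback gains indexed by time step

dt = 0.05
A = np.array([[1.0, dt], [0.0, 1.0]])   # 1-D double integrator (pos, vel)
B = np.array([[0.0], [dt]])
gains = finite_horizon_lqr(A, B, Q=np.eye(2), R=0.1 * np.eye(1),
                           Qf=10.0 * np.eye(2), horizon=40)
x = np.array([0.3, 0.0])          # error in the (possibly moving) target frame
u = -gains[0] @ x                 # feedback control toward the target
print(u)
```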
♻ ☆ Occupancy-SLAM: An Efficient and Robust Algorithm for Simultaneously Optimizing Robot Poses and Occupancy Map
Joint optimization of poses and features has been extensively studied and
demonstrated to yield more accurate results in feature-based SLAM problems.
However, research on jointly optimizing poses and non-feature-based maps
remains limited. Occupancy maps are widely used non-feature-based environment
representations because they effectively classify spaces into obstacles, free
areas, and unknown regions, providing robots with spatial information for
various tasks. In this paper, we propose Occupancy-SLAM, a novel
optimization-based SLAM method that enables the joint optimization of the robot
trajectory and the occupancy map through a parameterized map representation.
The key novelty lies in optimizing both robot poses and occupancy values at
different cell vertices simultaneously, a significant departure from existing
methods where the robot poses need to be optimized first before the map can be
estimated. Evaluations using simulations and practical 2D laser datasets
demonstrate that the proposed approach can robustly obtain more accurate robot
trajectories and occupancy maps than state-of-the-art techniques with
comparable computational time. Preliminary results in the 3D case further
confirm the potential of the proposed method in practical 3D applications,
achieving more accurate results than existing methods.
comment: Accepted for publication in the IEEE Transactions on Robotics (T-RO),
2025
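The parameterized-map idea can be illustrated with a toy residual: a laser
endpoint, transformed by the robot pose, should land on high occupancy, and
bilinear interpolation over cell-vertex values makes the map differentiable
with respect to both the pose and the vertices. The sketch below is purely
illustrative, not the paper's full formulation:

```python
import numpy as np

# Toy residual for joint pose/occupancy optimization. Bilinear interpolation
# over cell-vertex occupancy values makes the map differentiable w.r.t. both
# the pose and the vertex values. Illustrative only.
def bilinear(grid, p):
    i, j = int(p[0]), int(p[1])
    a, b = p[0] - i, p[1] - j
    return ((1 - a) * (1 - b) * grid[i, j] + a * (1 - b) * grid[i + 1, j]
            + (1 - a) * b * grid[i, j + 1] + a * b * grid[i + 1, j + 1])

def endpoint_residual(pose, endpoint, grid, occupied=1.0):
    x, y, theta = pose
    c, s = np.cos(theta), np.sin(theta)
    world = np.array([x + c * endpoint[0] - s * endpoint[1],
                      y + s * endpoint[0] + c * endpoint[1]])
    return occupied - bilinear(grid, world)   # drive interpolated occupancy to 1

grid = np.zeros((10, 10)); grid[5:, :] = 1.0  # toy occupancy at cell vertices
r = endpoint_residual(pose=(2.0, 2.0, 0.0), endpoint=(3.2, 0.4), grid=grid)
print(r)   # stack such residuals over all scans; solve for poses and vertices
```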
♻ ☆ Deep Policy Gradient Methods Without Batch Updates, Target Networks, or Replay Buffers
Gautham Vasan, Mohamed Elsayed, Alireza Azimi, Jiamin He, Fahim Shariar, Colin Bellinger, Martha White, A. Rupam Mahmood
Modern deep policy gradient methods achieve effective performance on
simulated robotic tasks, but they all require large replay buffers or expensive
batch updates, or both, making them impractical for real systems with
resource-limited computers. We show that these methods fail catastrophically
when limited to small replay buffers or during incremental learning, where
updates only use the most recent sample without batch updates or a replay
buffer. We propose a novel incremental deep policy gradient method -- Action
Value Gradient (AVG) and a set of normalization and scaling techniques to
address the challenges of instability in incremental learning. On robotic
simulation benchmarks, we show that AVG is the only incremental method that
learns effectively, often achieving final performance comparable to batch
policy gradient methods. This advancement enabled us to demonstrate, for the
first time, effective deep reinforcement learning on real robots using only
incremental updates, on both a robotic manipulator and a mobile robot.
comment: In The Thirty-eighth Annual Conference on Neural Information
Processing Systems. Source code at https://github.com/gauthamvasan/avg and
companion video at https://youtu.be/cwwuN6Hyew0
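The incremental setting, one update per transition with no replay buffer,
batch, or target network, can be sketched as follows; the tiny networks and
TD-style losses are illustrative assumptions, not the exact AVG update rule:

```python
import torch
import torch.nn as nn

# One update per transition: no replay buffer, no batch, no target network.
# The tiny networks and TD-style losses sketch the incremental setting; they
# are not the exact AVG update rule.
obs_dim, act_dim = 4, 2
actor = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, act_dim))
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.SGD(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)

def incremental_step(s, a, r, s2, gamma=0.99):
    with torch.no_grad():                       # bootstrap target, single sample
        target = r + gamma * critic(torch.cat([s2, actor(s2)]))
    td = critic(torch.cat([s, a])) - target     # critic: one-sample TD error
    actor_loss = -critic(torch.cat([s, actor(s)]))  # actor: ascend Q(s, pi(s))
    opt.zero_grad()
    (td.pow(2) + actor_loss).sum().backward()
    opt.step()

s, s2 = torch.randn(obs_dim), torch.randn(obs_dim)
incremental_step(s, torch.randn(act_dim), torch.tensor([0.5]), s2)
```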
♻ ☆ Sketch Interface for Teleoperation of Mobile Manipulator to Enable Intuitive and Intended Operation: A Proof of Concept
Recent advancements in robotics have underscored the need for effective
collaboration between humans and robots. Traditional interfaces often struggle
to balance robot autonomy with human oversight, limiting their practical
application in complex tasks like mobile manipulation. This study aims to
develop an intuitive interface that enables a mobile manipulator to
autonomously interpret user-provided sketches, enhancing user experience while
minimizing burden. We implemented a web-based application utilizing machine
learning algorithms to process sketches, making the interface accessible on
mobile devices for use anytime, anywhere, by anyone. In the first validation,
we examined natural sketches drawn by users for 27 selected manipulation and
navigation tasks, gaining insights into trends related to sketch instructions.
The second validation involved comparative experiments with five grasping
tasks, showing that the sketch interface reduces workload and enhances
intuitiveness compared to conventional axis control interfaces. These findings
suggest that the proposed sketch interface improves the efficiency of mobile
manipulators and opens new avenues for integrating intuitive human-robot
collaboration in various applications.
comment: This paper has been accepted to the 20th edition of the IEEE/ACM
International Conference on Human-Robot Interaction (HRI'25), which will be
held in Melbourne, Australia on March 4-6, 2025. Project page:
https://toyotafrc.github.io/SketchInterfacePoC-Proj/
♻ ☆ Design of a 3-DOF Hopping Robot with an Optimized Gearbox: An Intermediate Platform Toward Bipedal Robots
This paper presents a 3-DOF hopping robot with a human-like lower-limb joint
configuration and a flat foot, capable of performing dynamic and repetitive
jumping motions. To achieve both high torque output and a large hollow shaft
diameter for efficient cable routing, a compact 3K compound planetary gearbox
was designed using mixed-integer nonlinear programming for gear tooth
optimization. To meet performance requirements within the constrained joint
geometry, all major components, including the actuator, motor driver, and
communication interface, were custom-designed. The robot weighs 12.45 kg,
including a dummy mass, and measures 840 mm in length when the knee joint is
fully extended. A reinforcement learning-based controller was employed, and the
robot's performance was validated through hardware experiments, demonstrating
stable and repetitive hopping motions in response to user inputs. These
experimental results indicate that the platform serves as a solid foundation
for future bipedal robot development.
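As a much-simplified illustration of mixed-integer gear-tooth search, the
sketch below brute-forces a single-stage planetary (rather than the paper's 3K
compound design, and brute force rather than a proper MINLP solver); the
constraints and bounds are illustrative assumptions:

```python
import itertools

# Simplified stand-in for gear-tooth optimization: brute-force integer tooth
# counts of a single-stage planetary (sun input, ring fixed, carrier output)
# to maximize reduction under size and meshing constraints. The real design
# uses a 3K compound planetary and a proper MINLP solver.
best = None
for zs, zp in itertools.product(range(15, 40), range(15, 60)):
    zr = zs + 2 * zp                 # meshing constraint for a planetary stage
    if zr > 110:                     # proxy for the outer-diameter limit
        continue
    ratio = 1 + zr / zs              # reduction with the ring gear fixed
    if best is None or ratio > best[0]:
        best = (ratio, zs, zp, zr)
print(best)    # maximal ratio and the tooth counts achieving it
```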
♻ ☆ Learning Novel Skills from Language-Generated Demonstrations ICLR
Ao-Qun Jin, Tian-Yu Xiang, Xiao-Hu Zhou, Mei-Jiang Gui, Xiao-Liang Xie, Shi-Qi Liu, Shuang-Yi Wang, Yue Cao, Sheng-Bin Duan, Fu-Chao Xie, Zeng-Guang Hou
Robots are increasingly deployed across diverse domains to tackle tasks
requiring novel skills. However, current robot learning algorithms for
acquiring novel skills often rely on demonstration datasets or environment
interactions, resulting in high labor costs and potential safety risks. To
address these challenges, this study proposes DemoGen, a skill-learning
framework that enables robots to acquire novel skills from natural language
instructions. DemoGen leverages the vision-language model and the video
diffusion model to generate demonstration videos of novel skills, enabling
robots to learn new skills effectively. Experimental evaluations in
the MetaWorld simulation environments demonstrate the pipeline's capability to
generate high-fidelity and reliable demonstrations. Using the generated
demonstrations, various skill-learning algorithms achieve an accomplishment
rate on novel tasks three times that of the original. These results highlight a novel
approach to robot learning, offering a foundation for the intuitive and
intelligent acquisition of novel robotic skills. (Project website:
https://aoqunjin.github.io/LNSLGD/)
comment: 10 pages, International Conference on Learning Representations (ICLR)
2025 Workshop on Generative Models for Robot Learning (GenBot)
♻ ★ MoE-Loco: Mixture of Experts for Multitask Locomotion
We present MoE-Loco, a Mixture of Experts (MoE) framework for multitask
locomotion for legged robots. Our method enables a single policy to handle
diverse terrains, including bars, pits, stairs, slopes, and baffles, while
supporting quadrupedal and bipedal gaits. Using MoE, we mitigate the gradient
conflicts that typically arise in multitask reinforcement learning, improving
both training efficiency and performance. Our experiments demonstrate that
different experts naturally specialize in distinct locomotion behaviors, which
can be leveraged for task migration and skill composition. We further validate
our approach in both simulation and real-world deployment, showcasing its
robustness and adaptability.
comment: 9 pages, 10 figures
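A minimal soft mixture-of-experts policy layer illustrates how a gate can mix
per-expert outputs so that experts specialize; the sizes and soft gating below
are assumptions for illustration, not the MoE-Loco architecture:

```python
import torch
import torch.nn as nn

# Minimal soft Mixture-of-Experts policy layer: a gate mixes per-expert
# outputs, letting experts specialize per task or terrain. Sizes and soft
# gating are illustrative assumptions, not the MoE-Loco architecture.
class MoEPolicy(nn.Module):
    def __init__(self, obs_dim=48, act_dim=12, n_experts=4, hidden=64):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(obs_dim, hidden), nn.ELU(),
                          nn.Linear(hidden, act_dim))
            for _ in range(n_experts))
        self.gate = nn.Linear(obs_dim, n_experts)

    def forward(self, obs):
        w = torch.softmax(self.gate(obs), dim=-1)                # (B, E) weights
        outs = torch.stack([e(obs) for e in self.experts], -1)   # (B, A, E)
        return (outs * w.unsqueeze(1)).sum(-1)                   # mixed action

policy = MoEPolicy()
actions = policy(torch.randn(8, 48))
print(actions.shape)   # torch.Size([8, 12])
```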
♻ ★ FaVoR: Features via Voxel Rendering for Camera Relocalization WACV
Camera relocalization methods range from dense image alignment to direct
camera pose regression from a query image. Among these, sparse feature matching
stands out as an efficient, versatile, and generally lightweight approach with
numerous applications. However, feature-based methods often struggle with
significant viewpoint and appearance changes, leading to matching failures and
inaccurate pose estimates. To overcome this limitation, we propose a novel
approach that leverages a globally sparse yet locally dense 3D representation
of 2D features. By tracking and triangulating landmarks over a sequence of
frames, we construct a sparse voxel map optimized to render image patch
descriptors observed during tracking. Given an initial pose estimate, we first
synthesize descriptors from the voxels using volumetric rendering and then
perform feature matching to estimate the camera pose. This methodology enables
the generation of descriptors for unseen views, enhancing robustness to view
changes. We extensively evaluate our method on the 7-Scenes and Cambridge
Landmarks datasets. Our results show that our method significantly outperforms
existing state-of-the-art feature representation techniques in indoor
environments, achieving up to a 39% improvement in median translation error.
Additionally, our approach yields comparable results to other methods for
outdoor scenarios while maintaining lower memory and computational costs.
comment: In Proceedings of the IEEE/CVF Winter Conference on Applications of
Computer Vision (WACV), Tucson, Arizona, US, Feb 28-Mar 4, 2025
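Rendering a descriptor along a ray follows standard volumetric alpha
compositing over per-sample densities; the toy sketch below illustrates the
idea only and is not FaVoR's actual renderer:

```python
import numpy as np

# Volumetric rendering of a feature descriptor along one ray via standard
# alpha compositing. A toy stand-in for rendering patch descriptors from the
# sparse voxel map described above.
def render_descriptor(densities, descriptors, deltas):
    alpha = 1.0 - np.exp(-densities * deltas)            # per-sample opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = trans * alpha                              # compositing weights
    return (weights[:, None] * descriptors).sum(0)       # rendered descriptor

n, d = 32, 128                                # samples per ray, descriptor dim
desc = render_descriptor(np.random.rand(n), np.random.randn(n, d),
                         np.full(n, 0.1))
print(desc.shape)   # (128,); match against query features to estimate pose
```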
♻ ☆ Reachable Sets-based Trajectory Planning Combining Reinforcement Learning and iLQR
The driving risk field is applicable to complex driving scenarios, providing
new approaches for safety-oriented decision-making and active vehicle control
in intricate environments. However, existing research often overlooks the
driving risk field and fails to consider the impact of risk distribution within
drivable areas on trajectory planning, which poses challenges for enhancing
safety. This paper proposes a trajectory planning method for intelligent
vehicles based on the risk reachable set to further improve the safety of
trajectory planning. First, we construct the reachable set incorporating the
driving risk field to more accurately assess and avoid potential risks in
drivable areas. Then, the initial trajectory is generated based on safe
reinforcement learning and projected onto the reachable set. Finally, we
introduce a trajectory planning method based on a constrained iterative linear
quadratic regulator (iLQR) to optimize the initial solution, ensuring that the planned
trajectory achieves optimal comfort, safety, and efficiency. We conduct
simulation tests of trajectory planning in high-speed lane-changing scenarios.
The results indicate that the proposed method can guarantee trajectory comfort
and driving efficiency, with the generated trajectory situated outside
high-risk boundaries, thereby ensuring vehicle safety during operation.
comment: We sincerely request the withdrawal of this paper. After further
research and review, we have found that certain parts of the content contain
uncertainties and are not sufficient to support the conclusions previously
drawn. To avoid any potential misunderstanding or misguidance to the research
community, we have decided to voluntarily withdraw the manuscript.
♻ ☆ Towards Robust Autonomous Landing Systems: Iterative Solutions and Key Lessons Learned
Sebastian Schroder, Yao Deng, Alice James, Avishkar Seth, Kye Morton, Subhas Mukhopadhyay, Richard Han, Xi Zheng
Uncrewed Aerial Vehicles (UAVs) have become a focal point of research, with
both established companies and startups investing heavily in their development.
This paper presents our iterative process in developing a robust autonomous
marker-based landing system, highlighting the key challenges encountered and
the solutions implemented. It also reviews existing autonomous landing systems,
aiming to contribute to the community by sharing the insights gained and
challenges faced during development and testing.