Robotics 34
☆ BOSS: Benchmark for Observation Space Shift in Long-Horizon Task
Robotics has long sought to develop visual-servoing robots capable of
completing previously unseen long-horizon tasks. Hierarchical approaches offer
a pathway for achieving this goal by executing skill combinations arranged by a
task planner, with each visuomotor skill pre-trained using a specific imitation
learning (IL) algorithm. However, even in simple long-horizon tasks like skill
chaining, hierarchical approaches often struggle due to a problem we identify
as Observation Space Shift (OSS), where the sequential execution of preceding
skills causes shifts in the observation space, disrupting the performance of
subsequent individually trained skill policies. To validate OSS and evaluate
its impact on long-horizon tasks, we introduce BOSS (a Benchmark for
Observation Space Shift). BOSS comprises three distinct challenges: "Single
Predicate Shift", "Accumulated Predicate Shift", and "Skill Chaining", each
designed to assess a different aspect of OSS's negative effect. We evaluated
several recent popular IL algorithms on BOSS, including three Behavioral
Cloning methods and the Visual Language Action model OpenVLA. Even on the
simplest challenge, we observed average performance drops of 67%, 35%, 34%, and
54%, respectively, when comparing skill performance with and without OSS.
Additionally, we investigate a potential mitigation for OSS: scaling up the
training data for each skill with a larger and more visually diverse set of
demonstrations. Our results show that this alone is not sufficient to resolve OSS.
The project page is: https://boss-benchmark.github.io/
☆ VaViM and VaVAM: Autonomous Driving through Video Generative Modeling
Florent Bartoccioni, Elias Ramzi, Victor Besnier, Shashanka Venkataramanan, Tuan-Hung Vu, Yihong Xu, Loick Chambon, Spyros Gidaris, Serkan Odabas, David Hurych, Renaud Marlet, Alexandre Boulch, Mickael Chen, Éloi Zablocki, Andrei Bursuc, Eduardo Valle, Matthieu Cord
We explore the potential of large-scale generative video models for
autonomous driving, introducing an open-source auto-regressive video model
(VaViM) and its companion video-action model (VaVAM) to investigate how video
pre-training transfers to real-world driving. VaViM is a simple auto-regressive
video model that predicts frames using spatio-temporal token sequences. We show
that it captures the semantics and dynamics of driving scenes. VaVAM, the
video-action model, leverages the learned representations of VaViM to generate
driving trajectories through imitation learning. Together, the models form a
complete perception-to-action pipeline. We evaluate our models in open- and
closed-loop driving scenarios, revealing that video-based pre-training holds
promise for autonomous driving. Key insights include the semantic richness of
the learned representations, the benefits of scaling for video synthesis, and
the complex relationship between model size, data, and safety metrics in
closed-loop evaluations. We release code and model weights at
https://github.com/valeoai/VideoActionModel
comment: Code and model: https://github.com/valeoai/VideoActionModel, project
page: https://valeoai.github.io/vavim-vavam/
☆ A Simulation Pipeline to Facilitate Real-World Robotic Reinforcement Learning Applications
Reinforcement learning (RL) has gained traction for its success in solving
complex tasks for robotic applications. However, its deployment on physical
robots remains challenging due to safety risks and the comparatively high costs
of training. To avoid these problems, RL agents are often trained in
simulation, which introduces a new problem: the gap between simulation and
reality. This paper presents an RL pipeline designed to help
reduce the reality gap and facilitate developing and deploying RL policies for
real-world robotic systems. The pipeline organizes the RL training process into
an initial step for system identification and three training stages: core
simulation training, high-fidelity simulation, and real-world deployment, each
adding a level of realism to reduce the sim-to-real gap. Each training stage
takes an input policy, improves it, and either passes the improved policy to
the next stage or loops it back for further improvement. This iterative process
continues until the policy achieves the desired performance. The pipeline's
effectiveness is shown through a case study with the Boston Dynamics Spot
mobile robot used in a surveillance application. The case study presents the
steps taken at each pipeline stage to obtain an RL agent to control the robot's
position and orientation.
comment: Paper accepted to be presented at IEEE SysCon 2025
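The staged, loop-until-good-enough structure described above can be summarized
in a short sketch. The stand-in functions (train_stage, evaluate) and the
success thresholds below are hypothetical placeholders, not the authors' code;
only the stage ordering and the loop-back logic follow the abstract.

    # Minimal sketch of the staged sim-to-real pipeline (hypothetical API).
    def train_stage(policy, stage):
        """Placeholder: improve the policy in the given stage."""
        return policy

    def evaluate(policy, stage):
        """Placeholder: return a performance score, e.g. task success rate."""
        return 1.0

    def run_pipeline(policy, stages, targets, max_rounds=10):
        for stage, target in zip(stages, targets):
            for _ in range(max_rounds):
                policy = train_stage(policy, stage)    # improve the input policy at this stage
                if evaluate(policy, stage) >= target:
                    break                              # pass the improved policy to the next stage
                # otherwise loop back for further improvement
            else:
                raise RuntimeError(f"stage '{stage}' did not reach its target")
        return policy

    policy = run_pipeline(policy=None,
                          stages=["core_simulation", "high_fidelity_simulation", "real_world"],
                          targets=[0.9, 0.8, 0.8])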
☆ Reduced-Order Model Guided Contact-Implicit Model Predictive Control for Humanoid Locomotion
Humanoid robots have great potential for real-world applications due to their
ability to operate in environments built for humans, but their deployment is
hindered by the challenge of controlling their underlying high-dimensional
nonlinear hybrid dynamics. While reduced-order models like the Hybrid Linear
Inverted Pendulum (HLIP) are simple and computationally efficient, they lose
whole-body expressiveness. Meanwhile, recent advances in Contact-Implicit Model
Predictive Control (CI-MPC) enable robots to plan through multiple hybrid
contact modes, but remain vulnerable to local minima and require significant
tuning. We propose a control framework that combines the strengths of HLIP and
CI-MPC. The reduced-order model generates a nominal gait, while CI-MPC manages
the whole-body dynamics and modifies the contact schedule as needed. We
demonstrate the effectiveness of this approach in simulation with a novel 24
degree-of-freedom humanoid robot: Achilles. Our proposed framework achieves
rough terrain walking, disturbance recovery, robustness under model and state
uncertainty, and allows the robot to interact with obstacles in the
environment, all while running online in real-time at 50 Hz.
☆ Pick-and-place Manipulation Across Grippers Without Retraining: A Learning-optimization Diffusion Policy Approach
Xiangtong Yao, Yirui Zhou, Yuan Meng, Liangyu Dong, Lin Hong, Zitao Zhang, Zhenshan Bing, Kai Huang, Fuchun Sun, Alois Knoll
Current robotic pick-and-place policies typically require consistent gripper
configurations across training and inference. This constraint imposes high
retraining or fine-tuning costs, especially for imitation learning-based
approaches, when adapting to new end-effectors. To mitigate this issue, we
present a diffusion-based policy with a hybrid learning-optimization framework,
enabling zero-shot adaptation to novel grippers without additional data
collection or policy retraining. During training, the policy learns
manipulation primitives from demonstrations collected using a base gripper. At
inference, a diffusion-based optimization strategy dynamically enforces
kinematic and safety constraints, ensuring that generated trajectories align
with the physical properties of unseen grippers. This is achieved through a
constrained denoising procedure that adapts trajectories to gripper-specific
parameters (e.g., tool-center-point offsets, jaw widths) while preserving
collision avoidance and task feasibility. We validate our method on a Franka
Panda robot across six gripper configurations, including 3D-printed fingertips,
a flexible silicone gripper, and the Robotiq 2F-85 gripper. Our approach achieves a
93.3% average task success rate across grippers (vs. 23.3-26.7% for diffusion
policy baselines), supporting tool-center-point variations of 16-23.5 cm and
jaw widths of 7.5-11.5 cm. The results demonstrate that constrained diffusion
enables robust cross-gripper manipulation while maintaining the sample
efficiency of imitation learning, eliminating the need for gripper-specific
retraining. Video and code are available at https://github.com/yaoxt3/GADP.
comment: Video and code are available at https://github.com/yaoxt3/GADP
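As a toy picture of the constrained denoising step described above, the sketch
below alternates a dummy denoising update with a projection onto
gripper-specific limits. The trajectory encoding, the nominal path, the
"denoiser", and the limit values are all illustrative assumptions; the actual
GADP policy is learned, and its constraint handling may differ in detail.

    import numpy as np

    # Toy constrained denoising: a trajectory is a (T, 2) array of
    # [flange_height_m, jaw_width_m] waypoints. The "denoiser" just pulls samples
    # toward a nominal trajectory; the projection step is what adapts the result
    # to an unseen gripper's parameters (TCP offset, jaw-width range).

    def project_to_gripper(traj, tcp_offset, jaw_min, jaw_max):
        out = traj.copy()
        out[:, 0] = np.maximum(out[:, 0], tcp_offset)      # keep the flange high enough for a longer tool
        out[:, 1] = np.clip(out[:, 1], jaw_min, jaw_max)   # respect the new jaw-width range
        return out

    def constrained_denoise(nominal, gripper, steps=50, noise0=0.05, seed=0):
        rng = np.random.default_rng(seed)
        x = nominal + rng.normal(0.0, noise0, nominal.shape)   # start from a noised trajectory
        for k in range(steps, 0, -1):
            sigma = noise0 * k / steps
            x = x + 0.2 * (nominal - x) + rng.normal(0.0, 0.1 * sigma, x.shape)  # dummy denoising step
            x = project_to_gripper(x, **gripper)                # enforce gripper constraints each step
        return x

    nominal = np.stack([np.linspace(0.30, 0.05, 20), np.full(20, 0.08)], axis=1)
    unseen_gripper = dict(tcp_offset=0.19, jaw_min=0.075, jaw_max=0.115)  # ranges echo the abstract, in metres
    traj = constrained_denoise(nominal, unseen_gripper)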
☆ Autonomous helicopter aerial refueling: controller design and performance guarantees
In this paper, we present a control design methodology, stability criteria,
and performance bounds for autonomous helicopter aerial refueling. Autonomous
aerial refueling is particularly difficult due to the aerodynamic interaction
between the wake of the tanker, the contact-sensitive nature of the maneuver,
and the uncertainty in drogue motion. Since the probe tip is located
significantly away from the helicopter's center-of-gravity, its position (and
velocity) is strongly sensitive to the helicopter's attitude (and angular
rates). In addition, the fact that the helicopter is operating at high speeds
to match the velocity of the tanker forces it to maintain a particular
orientation, making the docking maneuver especially challenging. In this paper,
we propose a novel outer-loop position controller that incorporates the probe
position and velocity into the feedback loop. The position and velocity of the
probe tip depend both on the position (velocity) and on the attitude (angular
rates) of the aircraft. We derive analytical guarantees for docking performance
in terms of the uncertainty of the drogue motion and the angular acceleration
of the helicopter, using the ultimate boundedness property of the closed-loop
error dynamics. Simulations are performed on a high-fidelity UH60 helicopter
model with a high-fidelity drogue motion under wind effects to validate the
proposed approach for realistic refueling scenarios. These high-fidelity
simulations reveal that the proposed control methodology yields an improvement
of 36% in the 2-norm docking error compared to the existing standard
controller.
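The attitude sensitivity noted in this abstract follows directly from
rigid-body kinematics: with the probe tip at a body-frame offset r from the
center of gravity, p_probe = p_cg + R r and v_probe = v_cg + omega x (R r). The
lever-arm length and angles in the small check below are illustrative, not
values from the paper.

    import numpy as np

    def rot_y(pitch):
        """Rotation matrix for a pitch angle about the body y-axis."""
        c, s = np.cos(pitch), np.sin(pitch)
        return np.array([[c, 0.0, s],
                         [0.0, 1.0, 0.0],
                         [-s, 0.0, c]])

    r_probe = np.array([6.0, 0.0, 0.0])            # probe tip mounted 6 m ahead of the c.g. (illustrative)

    # A 2-degree pitch change moves the probe tip by roughly 0.2 m even though the c.g. stays put.
    dp = rot_y(np.deg2rad(2.0)) @ r_probe - rot_y(0.0) @ r_probe
    print(np.linalg.norm(dp))                      # ~0.21 m

    # The probe-tip velocity picks up an attitude-rate term: v_probe = v_cg + omega x (R r).
    omega = np.array([0.0, np.deg2rad(5.0), 0.0])  # 5 deg/s pitch rate
    extra_velocity = np.cross(omega, rot_y(0.0) @ r_probe)
    print(np.linalg.norm(extra_velocity))          # ~0.52 m/s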
☆ Enhanced Probabilistic Collision Detection for Motion Planning Under Sensing Uncertainty
Probabilistic collision detection (PCD) is essential in motion planning for
robots operating in unstructured environments, where considering sensing
uncertainty helps prevent damage. Existing PCD methods mainly use simplified
geometric models and address only position estimation errors. This paper
presents an enhanced PCD method with two key advancements: (a) using
superquadrics for more accurate shape approximation and (b) accounting for both
position and orientation estimation errors to improve robustness under sensing
uncertainty. Our method first computes an enlarged surface for each object that
encapsulates its observed rotated copies, thereby addressing the orientation
estimation errors. Then, the collision probability under the position
estimation errors is formulated as a chance-constraint problem that is solved
with a tight upper bound. Both steps leverage the recently developed
normal parameterization of superquadric surfaces. Results show that our PCD
method is twice as close to the Monte-Carlo sampled baseline as the best
existing PCD method, and reduces path length by 30% and planning time by 37%.
A Real2Sim pipeline further validates the importance of
considering orientation estimation errors, showing that the collision
probability of executing the planned path in simulation is only 2%, compared to
9% and 29% when considering only position estimation errors or none at all.
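For reference, the Monte-Carlo baseline mentioned above amounts to sampling
both error components of the estimated pose and counting collisions. The toy
check below does this for a disc-shaped robot against a rectangular obstacle (a
crude 2D stand-in for the superquadric geometry used in the paper), with purely
illustrative numbers.

    import numpy as np

    def mc_collision_probability(rect_center, rect_half, robot_pt, robot_radius,
                                 pos_std, ang_std, n=20000, seed=0):
        """Monte-Carlo estimate of P(collision) between a disc robot and a rectangular
        obstacle whose estimated 2D pose (position + heading) is uncertain."""
        rng = np.random.default_rng(seed)
        centers = rect_center + rng.normal(0.0, pos_std, (n, 2))   # sampled position errors
        angles = rng.normal(0.0, ang_std, n)                       # sampled heading errors
        d = robot_pt - centers                                     # robot point relative to each sample
        c, s = np.cos(-angles), np.sin(-angles)                    # rotate into each obstacle frame
        local = np.stack([c * d[:, 0] - s * d[:, 1],
                          s * d[:, 0] + c * d[:, 1]], axis=1)
        closest = np.clip(local, -rect_half, rect_half)            # closest point on the rectangle
        dist = np.linalg.norm(local - closest, axis=1)
        return float(np.mean(dist <= robot_radius))

    rect_half = np.array([0.6, 0.1])      # a long, thin obstacle, so heading errors matter
    robot = np.array([0.45, 0.18])
    p_full = mc_collision_probability(np.zeros(2), rect_half, robot, 0.05, pos_std=0.03, ang_std=0.15)
    p_pos_only = mc_collision_probability(np.zeros(2), rect_half, robot, 0.05, pos_std=0.03, ang_std=0.0)
    # In this toy setup, ignoring the heading error (ang_std=0) gives a noticeably
    # lower, i.e. over-optimistic, collision probability.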
☆ Depth-aware Fusion Method based on Image and 4D Radar Spectrum for 3D Object Detection
Safety and reliability are crucial for the public acceptance of autonomous
driving. To ensure accurate and reliable environmental perception, intelligent
vehicles must exhibit accuracy and robustness in various environments.
Millimeter-wave radar, known for its high penetration capability, can operate
effectively in adverse weather conditions such as rain, snow, and fog.
Traditional 3D millimeter-wave radars can only provide range, Doppler, and
azimuth information for objects. Although the recent emergence of 4D
millimeter-wave radars has added elevation resolution, the radar point clouds
remain sparse due to Constant False Alarm Rate (CFAR) operations. In contrast,
cameras offer rich semantic details but are sensitive to lighting and weather
conditions. Hence, this paper leverages these two highly complementary and
cost-effective sensors, 4D millimeter-wave radar and camera. By integrating 4D
radar spectra with depth-aware camera images and employing attention
mechanisms, we fuse texture-rich images with depth-rich radar data in the
Bird's Eye View (BEV) perspective, enhancing 3D object detection. Additionally,
we propose using GAN-based networks to generate depth images from radar spectra
in the absence of depth sensors, further improving detection accuracy.
☆ Robust 4D Radar-aided Inertial Navigation for Aerial Vehicles
While LiDAR and cameras are becoming ubiquitous on unmanned aerial vehicles
(UAVs), they can be ineffective in challenging environments; 4D millimeter-wave
(MMW) radars, which provide robust 3D ranging and Doppler velocity
measurements, remain less exploited for aerial navigation. In this paper, we
develop an efficient and robust error-state Kalman filter (ESKF)-based
radar-inertial navigation for UAVs. The key idea of the proposed approach is
the point-to-distribution radar scan matching to provide motion constraints
with proper uncertainty quantification, which are used to update the navigation
states in a tightly coupled manner, along with the Doppler velocity
measurements. Moreover, we propose a robust keyframe-based matching scheme
against the prior map (if available) to bound the accumulated navigation errors
and thus provide a radar-based global localization solution with high accuracy.
Extensive real-world experimental validations have demonstrated that the
proposed radar-aided inertial navigation outperforms state-of-the-art methods
in both accuracy and robustness.
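As a reference point for the tightly coupled Doppler update mentioned above,
the snippet below is a generic Kalman-filter velocity update from per-point
radar Doppler measurements, written from the textbook equations (for a static
scene, the expected Doppler of a point with unit direction u is -u·v). It is a
sketch of the measurement model only, not the paper's ESKF implementation.

    import numpy as np

    def doppler_update(v_est, P, points, dopplers, meas_var=0.05**2):
        """Kalman update of the sensor-frame velocity estimate v_est (3,) and its
        covariance P (3,3) from per-point radar Doppler measurements."""
        u = points / np.linalg.norm(points, axis=1, keepdims=True)  # unit directions to targets
        H = -u                                                       # (N,3) measurement Jacobian
        z = np.asarray(dopplers)                                     # measured radial velocities
        R = meas_var * np.eye(len(z))
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)                               # Kalman gain
        v_new = v_est + K @ (z - H @ v_est)
        P_new = (np.eye(3) - K @ H) @ P
        return v_new, P_new

    # Tiny usage example with synthetic data (true velocity 1 m/s forward):
    pts = np.array([[10.0, 0.0, 0.0], [5.0, 5.0, 0.0], [3.0, 0.0, 4.0]])
    v_true = np.array([1.0, 0.0, 0.0])
    z = -(pts / np.linalg.norm(pts, axis=1, keepdims=True)) @ v_true
    v, P = doppler_update(np.zeros(3), np.eye(3), pts, z)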
☆ Learning Long-Horizon Robot Manipulation Skills via Privileged Action
Long-horizon contact-rich tasks are challenging to learn with reinforcement
learning, due to ineffective exploration of high-dimensional state spaces with
sparse rewards. The learning process often gets stuck in local optima and
demands task-specific reward fine-tuning for complex scenarios. In this work,
we propose a structured framework that leverages privileged actions with
curriculum learning, enabling the policy to efficiently acquire long-horizon
skills without relying on extensive reward engineering or reference
trajectories. Specifically, we use privileged actions in simulation with a
general training procedure that would be infeasible to implement in real-world
scenarios. These privileges include relaxed constraints and virtual forces that
enhance interaction and exploration with objects. Our results successfully
achieve complex multi-stage long-horizon tasks that naturally combine
non-prehensile manipulation with grasping to lift objects from non-graspable
poses. We demonstrate generality by maintaining a parsimonious reward structure
and showing convergence to diverse and robust behaviors across various
environments. Additionally, real-world experiments further confirm that the
skills acquired using our approach are transferable to real-world environments,
exhibiting robust and intricate performance. Our approach outperforms
state-of-the-art methods in these tasks, converging to solutions where others
fail.
☆ Self-Mixing Laser Interferometry for Robotic Tactile Sensing ICRA2025
Self-mixing interferometry (SMI) has been lauded for its sensitivity in
detecting microvibrations, while requiring no physical contact with its target.
In robotics, microvibrations have traditionally been interpreted as a marker
for object slip, and recently as a salient indicator of extrinsic contact. We
present the first-ever robotic fingertip making use of SMI for slip and
extrinsic contact sensing. The design is validated through measurement of
controlled vibration sources, both before and after encasing the readout
circuit in its fingertip package. Then, the SMI fingertip is compared to
acoustic sensing through three experiments. The results are distilled into a
technology decision map. SMI was found to be more sensitive to subtle slip
events and significantly more robust against ambient noise. We conclude that
the integration of SMI in robotic fingertips offers a new, promising branch of
tactile sensing in robotics.
comment: Accepted for ICRA2025
☆ Rapid Online Learning of Hip Exoskeleton Assistance Preferences
Hip exoskeletons are increasing in popularity due to their effectiveness
across various scenarios and their ability to adapt to different users.
However, personalizing the assistance often requires lengthy tuning procedures
and computationally intensive algorithms, and most existing methods do not
incorporate user feedback. In this work, we propose a novel approach for
rapidly learning users' preferences for hip exoskeleton assistance. We perform
pairwise comparisons of distinct randomly generated assistive profiles, and
collect participants' preferences through active querying. Users' feedback is
integrated into a preference-learning algorithm that updates its belief, learns
a user-dependent reward function, and changes the assistive torque profiles
accordingly. Results from eight healthy subjects display distinct preferred
torque profiles, and users' choices remain consistent when compared to a
perturbed profile. A comprehensive evaluation of users' preferences reveals a
close relationship with individual walking strategies. The tested torque
profiles do not disrupt kinematic joint synergies, and participants favor
assistive torques that are synchronized with their movements, resulting in
lower negative power from the device. This straightforward approach enables the
rapid learning of users' preferences and rewards, grounding future studies on
reward-based human-exoskeleton interaction.
comment: Copyright 2025 IEEE. Personal use of this material is permitted.
Permission from IEEE must be obtained for all other uses, in any current or
future media, including reprinting/republishing this material for advertising
or promotional purposes, creating new collective works, for resale or
redistribution to servers or lists, or reuse of any copyrighted component of
this work in other works
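One common way to implement the kind of preference learning described in this
abstract is a Bradley-Terry (logistic) model over torque-profile features,
updated after every pairwise query. The features, learning rate, and simulated
user below are illustrative assumptions rather than details from the paper.

    import numpy as np

    def update_reward(w, phi_a, phi_b, preferred_a, lr=0.5):
        """One gradient step on a Bradley-Terry preference model.
        w: reward weights; phi_a/phi_b: feature vectors of the two torque profiles;
        preferred_a: True if the participant chose profile A."""
        y = 1.0 if preferred_a else 0.0
        p_a = 1.0 / (1.0 + np.exp(-(w @ (phi_a - phi_b))))  # P(A preferred | w)
        return w + lr * (y - p_a) * (phi_a - phi_b)          # logistic-regression gradient

    # Illustrative profile features: [peak torque, peak timing, duration].
    rng = np.random.default_rng(0)
    w = np.zeros(3)
    for _ in range(20):                       # 20 pairwise queries
        phi_a, phi_b = rng.uniform(0.0, 1.0, (2, 3))
        answer = phi_a[1] > phi_b[1]          # simulated user who prefers a later peak timing
        w = update_reward(w, phi_a, phi_b, answer)
    best = max(rng.uniform(0.0, 1.0, (100, 3)), key=lambda phi: w @ phi)  # profile maximizing learned reward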
☆ Exploring Embodied Multimodal Large Models: Development, Datasets, and Future Directions
Embodied multimodal large models (EMLMs) have gained significant attention in
recent years due to their potential to bridge the gap between perception,
cognition, and action in complex, real-world environments. This comprehensive
review explores the development of such models, including Large Language Models
(LLMs) and Large Vision Models (LVMs), while also examining other emerging
architectures. We discuss the evolution of EMLMs, with a focus
on embodied perception, navigation, interaction, and simulation. Furthermore,
the review provides a detailed analysis of the datasets used for training and
evaluating these models, highlighting the importance of diverse, high-quality
data for effective learning. The paper also identifies key challenges faced by
EMLMs, including issues of scalability, generalization, and real-time
decision-making. Finally, we outline future directions, emphasizing the
integration of multimodal sensing, reasoning, and action to advance the
development of increasingly autonomous systems. By providing an in-depth
analysis of state-of-the-art methods and identifying critical gaps, this paper
aims to inspire future advancements in EMLMs and their applications across
diverse domains.
comment: 81 pages, submitted to a journal for review
☆ DynamicGSG: Dynamic 3D Gaussian Scene Graphs for Environment Adaptation
In real-world scenarios, the environment changes caused by agents or human
activities make it extremely challenging for robots to perform various
long-term tasks. To effectively understand and adapt to dynamic environments,
the perception system of a robot needs to extract instance-level semantic
information, reconstruct the environment in a fine-grained manner, and update
its environment representation in memory according to environment changes. To
address these challenges, we propose DynamicGSG, a dynamic,
high-fidelity, open-vocabulary scene graph generation system leveraging
Gaussian splatting. Our system comprises three key components: (1) constructing
hierarchical scene graphs using advanced vision foundation models to represent
the spatial and semantic relationships of objects in the environment, (2)
designing a joint feature loss to optimize the Gaussian map for incremental
high-fidelity reconstruction, and (3) updating the Gaussian map and scene graph
according to real environment changes for long-term environment adaptation.
Experiments and ablation studies demonstrate the performance and efficacy of
the proposed method in terms of semantic segmentation, language-guided object
retrieval, and reconstruction quality. Furthermore, we have validated the
dynamic updating capabilities of our system in real laboratory environments.
The source code will be released
at: https://github.com/GeLuzhou/Dynamic-GSG.
☆ OccProphet: Pushing Efficiency Frontier of Camera-Only 4D Occupancy Forecasting with Observer-Forecaster-Refiner Framework ICLR2025
Predicting variations in complex traffic environments is crucial for the
safety of autonomous driving. Recent advancements in occupancy forecasting have
enabled forecasting future 3D occupied status in driving environments by
observing historical 2D images. However, high computational demands make
occupancy forecasting less efficient during training and inference stages,
hindering its feasibility for deployment on edge agents. In this paper, we
propose a novel framework, i.e., OccProphet, to efficiently and effectively
learn occupancy forecasting with significantly lower computational requirements
while improving forecasting accuracy. OccProphet comprises three lightweight
components: Observer, Forecaster, and Refiner. The Observer extracts
spatio-temporal features from 3D multi-frame voxels using the proposed
Efficient 4D Aggregation with Tripling-Attention Fusion, while the Forecaster
and Refiner conditionally predict and refine future occupancy inferences.
Experimental results on nuScenes, Lyft-Level5, and nuScenes-Occupancy datasets
demonstrate that OccProphet is both training- and inference-friendly.
OccProphet reduces the computational cost by 58%-78% and achieves a 2.6×
speedup compared with the state-of-the-art Cam4DOcc. Moreover, it achieves
4%-18% relatively higher forecasting accuracy. Code and models are
publicly available at https://github.com/JLChen-C/OccProphet.
comment: Accepted by ICLR2025
☆ Realm: Real-Time Line-of-Sight Maintenance in Multi-Robot Navigation with Unknown Obstacles ICRA 2025
Multi-robot navigation in complex environments relies on inter-robot
communication and mutual observations for coordination and situational
awareness. This paper studies the multi-robot navigation problem in unknown
environments with line-of-sight (LoS) connectivity constraints. While previous
works are limited to known environment models to derive the LoS constraints,
this paper eliminates such requirements by directly formulating the LoS
constraints between robots from their real-time point cloud measurements,
leveraging point cloud visibility analysis techniques. We propose a novel
LoS-distance metric to quantify both the urgency and sensitivity of losing LoS
between robots considering potential robot movements. Moreover, to address the
imbalanced urgency of losing LoS between two robots, we design a fusion
function to capture the overall urgency while generating gradients that
facilitate robots' collaborative movement to maintain LoS. The LoS constraints
are encoded into a potential function that preserves the positivity of the
Fiedler eigenvalue of the robots' network graph to ensure connectivity.
Finally, we establish a LoS-constrained exploration framework that integrates
the proposed connectivity controller. We showcase its applications in
multi-robot exploration in complex unknown environments, where robots can
always maintain the LoS connectivity through distributed sensing and
communication, while collaboratively mapping the unknown environment. The
implementations are open-sourced at
https://github.com/bairuofei/LoS_constrained_navigation.
comment: 8 pages, 9 figures, accepted by IEEE ICRA 2025
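For reference, the Fiedler eigenvalue mentioned above is the second-smallest
eigenvalue of the graph Laplacian and is positive exactly when the network
graph is connected. The minimal check below also shows a log-barrier-style
potential of the general kind the abstract alludes to; the exact potential used
in the paper may differ.

    import numpy as np

    def fiedler_value(adjacency):
        """Second-smallest eigenvalue of the graph Laplacian; > 0 iff the graph is connected."""
        A = np.asarray(adjacency, dtype=float)
        L = np.diag(A.sum(axis=1)) - A
        return np.sort(np.linalg.eigvalsh(L))[1]

    def connectivity_potential(adjacency, eps=1e-3):
        """Barrier that grows rapidly as lambda_2 approaches zero (illustrative form)."""
        lam2 = fiedler_value(adjacency)
        return -np.log(max(lam2 - eps, 1e-12))

    # Three robots in a line: connected, so lambda_2 = 1 and the potential is finite.
    A_line = np.array([[0, 1, 0],
                       [1, 0, 1],
                       [0, 1, 0]])
    print(fiedler_value(A_line))             # 1.0
    # Remove one LoS edge and the graph splits: lambda_2 drops to 0.
    A_split = np.array([[0, 1, 0],
                        [1, 0, 0],
                        [0, 0, 0]])
    print(fiedler_value(A_split))            # ~0.0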
☆ CurricuVLM: Towards Safe Autonomous Driving via Personalized Safety-Critical Curriculum Learning with Vision-Language Models
Ensuring safety in autonomous driving systems remains a critical challenge,
particularly in handling rare but potentially catastrophic safety-critical
scenarios. While existing research has explored generating safety-critical
scenarios for autonomous vehicle (AV) testing, there is limited work on
effectively incorporating these scenarios into policy learning to enhance
safety. Furthermore, developing training curricula that adapt to an AV's
evolving behavioral patterns and performance bottlenecks remains largely
unexplored. To address these challenges, we propose CurricuVLM, a novel
framework that leverages Vision-Language Models (VLMs) to enable personalized
curriculum learning for autonomous driving agents. Our approach uniquely
exploits VLMs' multimodal understanding capabilities to analyze agent behavior,
identify performance weaknesses, and dynamically generate tailored training
scenarios for curriculum adaptation. Through comprehensive analysis of unsafe
driving situations with narrative descriptions, CurricuVLM performs in-depth
reasoning to evaluate the AV's capabilities and identify critical behavioral
patterns. The framework then synthesizes customized training scenarios
targeting these identified limitations, enabling effective and personalized
curriculum learning. Extensive experiments on the Waymo Open Motion Dataset
show that CurricuVLM outperforms state-of-the-art baselines across both regular
and safety-critical scenarios, achieving superior performance in terms of
navigation success, driving efficiency, and safety metrics. Further analysis
reveals that CurricuVLM serves as a general approach that can be integrated
with various RL algorithms to enhance autonomous driving systems. The code and
demo video are available at: https://zihaosheng.github.io/CurricuVLM/.
♻ ☆ An Open-Source Reproducible Chess Robot for Human-Robot Interaction Research
Recent advancements in AI have accelerated the evolution of versatile robot
designs. Chess provides a standardized environment for evaluating the impact of
robot behavior on human behavior. This article presents an open-source chess
robot for human-robot interaction (HRI) research, specifically focusing on
verbal and non-verbal interactions. OpenChessRobot recognizes chess pieces
using computer vision, executes moves, and interacts with the human player
through voice and robotic gestures. We detail the software design, provide
quantitative evaluations of the efficacy of the robot, and offer a guide for
its reproducibility. An online survey examining people's views of the robot in
three possible scenarios was conducted with 597 participants. The robot
received the highest ratings in the robotics education and the chess coach
scenarios, while the home entertainment scenario received the lowest scores.
The code and datasets are accessible on GitHub:
https://github.com/renchizhhhh/OpenChessRobot
♻ ☆ Shared Control with Black Box Agents using Oracle Queries
Shared control problems involve a robot learning to collaborate with a human.
When learning a shared control policy, short communication between the agents
can often significantly reduce running times and improve the system's accuracy.
We extend the shared control problem to include the ability to directly query a
cooperating agent. We consider two types of oracles that may answer such a
query: one that can provide the learner with the best action it should take,
even when that action might be myopically wrong, and one whose knowledge is
bounded and limited to its part of the system. Given this additional
information channel, this work further presents three heuristics for choosing
when to query: reinforcement learning-based, utility-based, and entropy-based.
These heuristics aim to reduce a system's overall learning cost. Empirical
results on two environments show the benefits of querying to learn a better
control policy and the tradeoffs between the proposed heuristics.
comment: Accepted for publication in the 2025 IEEE International Conference on
AI and Data Analytics (ICAD 2025)
♻ ☆ PROSKILL: A formal skill language for acting in robotics
Acting is an important decisional function for autonomous robots. Acting
relies on skills to implement and to model the activities it oversees:
refinement, local recovery, temporal dispatching, external asynchronous events,
and command execution, all done online. While sitting between planning and the
robotic platform, acting often relies on programming primitives and an
interpreter which executes these skills. Following our experience in providing
a formal framework to program the functional components of our robots, we
propose a new language, to program the acting skills. This language maps
unequivocally into a formal model which can then be used to check properties
offline or execute the skills, or more precisely their formal equivalent, and
perform runtime verification. We illustrate with a real example how we can
program a survey mission for a drone in this new language, prove some formal
properties on the program and directly execute the formal model on the drone to
perform the mission.
♻ ☆ HeRCULES: Heterogeneous Radar Dataset in Complex Urban Environment for Multi-session Radar SLAM ICRA 2025
Hanjun Kim, Minwoo Jung, Chiyun Noh, Sangwoo Jung, Hyunho Song, Wooseong Yang, Hyesu Jang, Ayoung Kim
Recently, radars have been widely featured in robotics for their robustness
in challenging weather conditions. Two commonly used radar types are spinning
radars and phased-array radars, each offering distinct sensor characteristics.
Existing datasets typically feature only a single type of radar, leading to the
development of algorithms limited to that specific kind. In this work, we
highlight that combining different radar types offers complementary advantages,
which can be leveraged through a heterogeneous radar dataset. Moreover, this
new dataset fosters research in multi-session and multi-robot scenarios where
robots are equipped with different types of radars. In this context, we
introduce the HeRCULES dataset, a comprehensive, multi-modal dataset with
heterogeneous radars, FMCW LiDAR, IMU, GPS, and cameras. This is the first
dataset to integrate 4D radar and spinning radar alongside FMCW LiDAR, offering
unparalleled localization, mapping, and place recognition capabilities. The
dataset covers diverse weather and lighting conditions and a range of urban
traffic scenarios, enabling a comprehensive analysis across various
environments. The sequence paths with multiple revisits and ground truth pose
for each sensor enhance its suitability for place recognition research. We
expect the HeRCULES dataset to facilitate odometry, mapping, place recognition,
and sensor fusion research. The dataset and development tools are available at
https://sites.google.com/view/herculesdataset.
comment: 2025 IEEE International Conference on Robotics and Automation (ICRA
2025)
♻ ☆ Feature Aggregation with Latent Generative Replay for Federated Continual Learning of Socially Appropriate Robot Behaviours
It is critical to explore Federated Learning (FL) settings for robots, where
several robots, deployed in parallel, can learn independently while also
sharing what they learn with each other. This collaborative learning in
real-world environments requires social robots to adapt dynamically to changing
and unpredictable situations and varying task settings. Our work contributes to
addressing these challenges by exploring a simulated living room environment
where robots need to learn the social appropriateness of their actions. First,
we propose Federated Root (FedRoot) averaging, a novel weight aggregation
strategy which disentangles feature learning across clients from individual
task-based learning. Second, to adapt to challenging environments, we extend
FedRoot to Federated Latent Generative Replay (FedLGR), a novel Federated
Continual Learning (FCL) strategy that uses FedRoot-based weight aggregation
and embeds each client with a generator model for pseudo-rehearsal of learnt
feature embeddings to mitigate forgetting in a resource-efficient manner. Our
results show that FedRoot-based methods offer competitive performance while
also resulting in a sizeable reduction in resource consumption (up to 86% for
CPU usage and up to 72% for GPU usage). Additionally, our results demonstrate
that FedRoot-based FCL methods outperform other methods while also offering an
efficient solution (up to 84% CPU and 92% GPU usage reduction), with FedLGR
providing the best results across evaluations.
comment: 8 pages, 4 figures, IEEE RA-L submission
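A minimal sketch of root-only weight aggregation in the spirit of FedRoot: only
the shared feature-extractor ("root") parameters are averaged across clients,
while each client keeps its task-specific head local. The parameter naming
convention and dictionary structure are assumptions for illustration, not the
authors' code.

    import numpy as np

    def fedroot_average(client_models, root_prefix="root."):
        """Average only the shared feature-extractor parameters across clients.
        client_models: list of dicts mapping parameter name -> np.ndarray."""
        agg = {}
        for name in client_models[0]:
            if name.startswith(root_prefix):                 # shared 'root' layers: average
                agg[name] = np.mean([m[name] for m in client_models], axis=0)
        for m in client_models:                              # broadcast the averaged root back;
            for name, value in agg.items():                  # task heads stay client-specific
                m[name] = value.copy()
        return client_models

    # Two clients with a shared root layer and individual heads:
    clients = [
        {"root.conv1": np.ones((3, 3)), "head.fc": np.full((2,), 0.1)},
        {"root.conv1": np.zeros((3, 3)), "head.fc": np.full((2,), 0.9)},
    ]
    clients = fedroot_average(clients)   # root.conv1 becomes 0.5 everywhere, heads untouched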
♻ ☆ GaRLIO: Gravity enhanced Radar-LiDAR-Inertial Odometry
Recently, gravity has been highlighted as a crucial constraint for state
estimation to alleviate potential vertical drift. Existing online gravity
estimation methods rely on pose estimation combined with IMU measurements,
which is considered best practice when direct velocity measurements are
unavailable. However, with radar sensors providing direct velocity data-a
measurement not yet utilized for gravity estimation-we found a significant
opportunity to improve gravity estimation accuracy substantially. GaRLIO, the
proposed gravity-enhanced Radar-LiDAR-Inertial Odometry, can robustly predict
gravity to reduce vertical drift while simultaneously enhancing state
estimation performance using pointwise velocity measurements. Furthermore,
GaRLIO ensures robustness in dynamic environments by utilizing radar to remove
dynamic objects from LiDAR point clouds. Our method is validated through
experiments in various environments prone to vertical drift, demonstrating
superior performance compared to traditional LiDAR-Inertial Odometry methods.
We make our source code publicly available to encourage further research and
development. https://github.com/ChiyunNoh/GaRLIO
♻ ☆ Highly dynamic physical interaction for robotics: design and control of an active remote center of compliance
Robot interaction control is often limited to low dynamics or low
flexibility, depending on whether an active or passive approach is chosen. In
this work, we introduce a hybrid control scheme that combines the advantages of
active and passive interaction control. To accomplish this, we propose the
design of a novel Active Remote Center of Compliance (ARCC), which is based on
a passive and an active element that can be used to directly control the
interaction forces. We introduce surrogate models for a dynamic comparison
against purely robot-based interaction schemes. In a comparative validation,
ARCC drastically improves the interaction dynamics, leading to an increase in
the motion bandwidth of up to 31 times. We further introduce our control
approach as well as its integration into the robot controller. Finally, we
analyze ARCC on different industrial benchmarks such as peg-in-hole, top-hat rail
assembly, and contour-following problems, and compare it against the state of the
art to highlight its dynamics and flexibility. The proposed system is
especially suited if the application requires a low cycle time combined with a
sensitive manipulation.
comment: 7 pages, 7 figures
♻ ☆ MapNav: A Novel Memory Representation via Annotated Semantic Maps for VLM-based Vision-and-Language Navigation
Lingfeng Zhang, Xiaoshuai Hao, Qinwen Xu, Qiang Zhang, Xinyao Zhang, Pengwei Wang, Jing Zhang, Zhongyuan Wang, Shanghang Zhang, Renjing Xu
Vision-and-language navigation (VLN) is a key task in Embodied AI, requiring
agents to navigate diverse and unseen environments while following natural
language instructions. Traditional approaches rely heavily on historical
observations as spatio-temporal contexts for decision making, leading to
significant storage and computational overhead. In this paper, we introduce
MapNav, a novel end-to-end VLN model that leverages Annotated Semantic Map
(ASM) to replace historical frames. Specifically, our approach constructs a
top-down semantic map at the start of each episode and updates it at each
timestep, allowing for precise object mapping and structured navigation
information. Then, we enhance this map with explicit textual labels for key
regions, transforming abstract semantics into clear navigation cues, to
generate our ASM. The MapNav agent uses the constructed ASM as input and exploits the
powerful end-to-end capabilities of VLMs to empower VLN. Extensive experiments
demonstrate that MapNav achieves state-of-the-art (SOTA) performance in both
simulated and real-world environments, validating the effectiveness of our
method. Moreover, we will release our ASM generation source code and dataset to
ensure reproducibility, contributing valuable resources to the field. We
believe that our proposed MapNav can be used as a new memory representation
method in VLN, paving the way for future research in this field.
♻ ☆ Interactive incremental learning of generalizable skills with local trajectory modulation
The problem of generalization in learning from demonstration (LfD) has
received considerable attention over the years, particularly within the context
of movement primitives, where a number of approaches have emerged. Recently,
two important approaches have gained recognition. While one leverages
via-points to adapt skills locally by modulating demonstrated trajectories,
another relies on so-called task-parameterized models that encode movements
with respect to different coordinate systems, using a product of probabilities
for generalization. While the former are well-suited to precise, local
modulations, the latter aim at generalizing over large regions of the workspace
and often involve multiple objects. Addressing the quality of generalization by
leveraging both approaches simultaneously has received little attention. In
this work, we propose an interactive imitation learning framework that
simultaneously leverages local and global modulations of trajectory
distributions. Building on the kernelized movement primitives (KMP) framework,
we introduce novel mechanisms for skill modulation from direct human corrective
feedback. Our approach particularly exploits the concept of via-points to
incrementally and interactively 1) improve the model accuracy locally, 2) add
new objects to the task during execution and 3) extend the skill into regions
where demonstrations were not provided. We evaluate our method on a bearing
ring-loading task using a torque-controlled, 7-DoF, DLR SARA robot.
comment: Accepted at IEEE Robotics and Automation Letters (RA-L), 16 pages, 19
figures, 6 tables. See
https://github.com/DLR-RM/interactive-incremental-learning for further
information and video
♻ ☆ Humanoid-VLA: Towards Universal Humanoid Control with Visual Integration
Pengxiang Ding, Jianfei Ma, Xinyang Tong, Binghong Zou, Xinxin Luo, Yiguo Fan, Ting Wang, Hongchao Lu, Panzhong Mo, Jinxin Liu, Yuefan Wang, Huaicheng Zhou, Wenshuo Feng, Jiacheng Liu, Siteng Huang, Donglin Wang
This paper addresses the limitations of current humanoid robot control
frameworks, which primarily rely on reactive mechanisms and lack autonomous
interaction capabilities due to data scarcity. We propose Humanoid-VLA, a novel
framework that integrates language understanding, egocentric scene perception,
and motion control, enabling universal humanoid control. Humanoid-VLA begins
with language-motion pre-alignment using non-egocentric human motion datasets
paired with textual descriptions, allowing the model to learn universal motion
patterns and action semantics. We then incorporate egocentric visual context
through a parameter efficient video-conditioned fine-tuning, enabling
context-aware motion generation. Furthermore, we introduce a self-supervised
data augmentation strategy that automatically generates pseudo-annotations
directly derived from motion data. This process converts raw motion sequences
into informative question-answer pairs, facilitating the effective use of
large-scale unlabeled video data. Built upon whole-body control architectures,
extensive experiments show that Humanoid-VLA achieves object interaction and
environment exploration tasks with enhanced contextual awareness, demonstrating
a more human-like capacity for adaptive and intelligent engagement.
♻ ☆ ChatVLA: Unified Multimodal Understanding and Robot Control with Vision-Language-Action Model
Zhongyi Zhou, Yichen Zhu, Minjie Zhu, Junjie Wen, Ning Liu, Zhiyuan Xu, Weibin Meng, Ran Cheng, Yaxin Peng, Chaomin Shen, Feifei Feng
Humans possess a unified cognitive ability to perceive, comprehend, and
interact with the physical world. Why can't large language models replicate
this holistic understanding? Through a systematic analysis of existing training
paradigms in vision-language-action models (VLA), we identify two key
challenges: spurious forgetting, where robot training overwrites crucial
visual-text alignments, and task interference, where competing control and
understanding tasks degrade performance when trained jointly. To overcome these
limitations, we propose ChatVLA, a novel framework featuring Phased Alignment
Training, which incrementally integrates multimodal data after initial control
mastery, and a Mixture-of-Experts architecture to minimize task interference.
ChatVLA demonstrates competitive performance on visual question-answering
datasets and significantly surpasses state-of-the-art vision-language-action
(VLA) methods on multimodal understanding benchmarks. Notably, it achieves a
six times higher performance on MMMU and scores 47.2% on MMStar with a more
parameter-efficient design than ECoT. Furthermore, ChatVLA demonstrates
superior performance on 25 real-world robot manipulation tasks compared to
existing VLA methods like OpenVLA. Our findings highlight the potential of our
unified framework for achieving both robust multimodal understanding and
effective robot control.
♻ ☆ VLAS: Vision-Language-Action Model With Speech Instructions For Customized Robot Manipulation ICLR 2025
Vision-language-action models (VLAs) have become increasingly popular in
robot manipulation for their end-to-end design and remarkable performance.
However, existing VLAs rely heavily on vision-language models (VLMs) that only
support text-based instructions, neglecting the more natural speech modality
for human-robot interaction. Traditional speech integration methods usually
involve a separate speech recognition system, which complicates the model and
introduces error propagation. Moreover, the transcription procedure would lose
non-semantic information in the raw speech, such as voiceprint, which may be
crucial for robots to successfully complete customized tasks. To overcome the above
challenges, we propose VLAS, a novel end-to-end VLA that integrates speech
recognition directly into the robot policy model. VLAS allows the robot to
understand spoken commands through inner speech-text alignment and produces
corresponding actions to fulfill the task. We also present two new datasets,
SQA and CSI, to support a three-stage tuning process for speech instructions,
which empowers VLAS with the ability of multimodal interaction across text,
image, speech, and robot actions. Taking a step further, a voice
retrieval-augmented generation (RAG) paradigm is designed to enable our model
to effectively handle tasks that require individual-specific knowledge. Our
extensive experiments show that VLAS can effectively accomplish robot
manipulation tasks with diverse speech commands, offering a seamless and
customized interaction experience.
comment: Accepted as a conference paper at ICLR 2025
♻ ☆ CoverLib: Classifiers-equipped Experience Library by Iterative Problem Distribution Coverage Maximization for Domain-tuned Motion Planning
Library-based methods are known to be very effective for fast motion planning
by adapting an experience retrieved from a precomputed library. This article
presents CoverLib, a principled approach for constructing and utilizing such a
library. CoverLib iteratively adds an experience-classifier-pair to the
library, where each classifier corresponds to an adaptable region of the
experience within the problem space. This iterative process is an active
procedure, as it selects the next experience based on its ability to
effectively cover the uncovered region. During the query phase, these
classifiers are utilized to select an experience that is expected to be
adaptable for a given problem. Experimental results demonstrate that CoverLib
effectively mitigates the trade-off between plannability and speed observed in
global (e.g. sampling-based) and local (e.g. optimization-based) methods. As a
result, it achieves both fast planning and high success rates over the problem
domain. Moreover, due to its adaptation-algorithm-agnostic nature, CoverLib
seamlessly integrates with various adaptation methods, including nonlinear
programming-based and sampling-based algorithms.
comment: Accepted for publication in IEEE Transactions on Robotics
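The iterative, coverage-driven construction described above can be caricatured
as a greedy loop: pick the candidate experience that newly covers the most
problems, then store a classifier for its adaptable region. The
distance-threshold "adaptation" test and all numbers below are toy stand-ins,
not the published method.

    import numpy as np

    rng = np.random.default_rng(0)
    problems = rng.uniform(0.0, 1.0, (500, 2))     # sampled problem parameters
    candidates = rng.uniform(0.0, 1.0, (50, 2))    # candidate experiences (e.g., precomputed plans)
    adapt_radius = 0.15                            # toy test: an experience adapts to problems
                                                   # within this distance in parameter space

    library = []                                   # list of (experience, classifier) pairs
    covered = np.zeros(len(problems), dtype=bool)
    for _ in range(10):                            # add up to 10 experience-classifier pairs
        gains = []
        for c in candidates:
            ok = np.linalg.norm(problems - c, axis=1) <= adapt_radius
            gains.append(np.count_nonzero(ok & ~covered))    # newly covered problems
        best = int(np.argmax(gains))
        if gains[best] == 0:
            break
        exp = candidates[best]
        classifier = lambda q, e=exp: np.linalg.norm(q - e) <= adapt_radius  # adaptable-region predictor
        library.append((exp, classifier))
        covered |= np.linalg.norm(problems - exp, axis=1) <= adapt_radius

    # Query phase: pick the first library entry whose classifier accepts the new problem.
    query = np.array([0.3, 0.7])
    match = next(((e, f) for e, f in library if f(query)), None)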
♻ ☆ FUNCTO: Function-Centric One-Shot Imitation Learning for Tool Manipulation
Learning tool use from a single human demonstration video offers a highly
intuitive and efficient approach to robot teaching. While humans can
effortlessly generalize a demonstrated tool manipulation skill to diverse tools
that support the same function (e.g., pouring with a mug versus a teapot),
current one-shot imitation learning (OSIL) methods struggle to achieve this. A
key challenge lies in establishing functional correspondences between
demonstration and test tools, considering significant geometric variations
among tools with the same function (i.e., intra-function variations). To
address this challenge, we propose FUNCTO (Function-Centric OSIL for Tool
Manipulation), an OSIL method that establishes function-centric correspondences
with a 3D functional keypoint representation, enabling robots to generalize
tool manipulation skills from a single human demonstration video to novel tools
with the same function despite significant intra-function variations. With this
formulation, we factorize FUNCTO into three stages: (1) functional keypoint
extraction, (2) function-centric correspondence establishment, and (3)
functional keypoint-based action planning. We evaluate FUNCTO against existing
modular OSIL methods and end-to-end behavioral cloning methods through
real-robot experiments on diverse tool manipulation tasks. The results
demonstrate the superiority of FUNCTO when generalizing to novel tools with
intra-function geometric variations. More details are available at
https://sites.google.com/view/functo.
♻ ☆ Exploring Quasi-Global Solutions to Compound Lens Based Computational Imaging Systems
Recently, joint design approaches that simultaneously optimize optical
systems and downstream algorithms through data-driven learning have
demonstrated superior performance over traditional separate design approaches.
However, current joint design approaches heavily rely on the manual
identification of initial lenses, posing challenges and limitations,
particularly for compound lens systems with multiple potential starting points.
In this work, we present Quasi-Global Search Optics (QGSO) to automatically
design compound lens based computational imaging systems through two parts: (i)
Fused Optimization Method for Automatic Optical Design (OptiFusion), which
searches for diverse initial optical systems under certain design
specifications; and (ii) Efficient Physics-aware Joint Optimization (EPJO),
which conducts parallel joint optimization of initial optical systems and image
reconstruction networks with the consideration of physical constraints,
culminating in the selection of the optimal solution in all search results.
Extensive experimental results illustrate that QGSO serves as a transformative
end-to-end lens design paradigm for superior global search ability, which
automatically provides compound lens based computational imaging systems with
higher imaging quality compared to existing paradigms. The source code will be
made publicly available at https://github.com/LiGpy/QGSO.
comment: Accepted to IEEE Transactions on Computational Imaging (TCI). The
source code will be made publicly available at https://github.com/LiGpy/QGSO
♻ ☆ Stability analysis through folds: An end-loaded elastic with a lever arm
Many physical systems can be modelled as parameter-dependent variational
problems. The associated equilibria may or may not be physically realizable, which can
only be determined by examining their stability. Hence, it is crucial to
determine the stability and track its transitions. Generally, the stability
characteristics of the equilibria change near folds in the parameter space. The
direction of stability changes is embedded in a specific projection of the
solutions, known as distinguished bifurcation diagrams. In this article, we
identify such projections for variational problems characterized by fixed-free
ends -- a class of problems frequently encountered in mechanics. Using these
diagrams, we study an Elastica subject to an end load applied through a rigid
lever arm. Several instances of snap-back instability are reported, along with
their dependence on system parameters through numerical examples. These
findings have potential applications in the design of soft robot arms and other
actuator designs.
comment: 20 pages, 12 figures
♻ ☆ Hierarchical Equivariant Policy via Frame Transfer
Haibo Zhao, Dian Wang, Yizhe Zhu, Xupeng Zhu, Owen Howell, Linfeng Zhao, Yaoyao Qian, Robin Walters, Robert Platt
Recent advances in hierarchical policy learning highlight the advantages of
decomposing systems into high-level and low-level agents, enabling efficient
long-horizon reasoning and precise fine-grained control. However, the interface
between these hierarchy levels remains underexplored, and existing hierarchical
methods often ignore domain symmetry, resulting in the need for extensive
demonstrations to achieve robust performance. To address these issues, we
propose Hierarchical Equivariant Policy (HEP), a novel hierarchical policy
framework. We introduce a frame transfer interface for hierarchical policy
learning, which uses the high-level agent's output as a coordinate frame for
the low-level agent, providing a strong inductive bias while retaining
flexibility. Additionally, we integrate domain symmetries into both levels and
theoretically demonstrate the system's overall equivariance. HEP achieves
state-of-the-art performance in complex robotic manipulation tasks,
demonstrating significant improvements in both simulation and real-world
settings.
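The frame transfer interface can be pictured with a small SE(2) example: the
high-level agent outputs a pose, and the low-level agent's observations and
actions are expressed relative to that pose instead of the world frame. This is
a toy rendering of the idea, not the authors' implementation.

    import numpy as np

    def se2(x, y, theta):
        """Homogeneous transform for a 2D pose (position + heading)."""
        c, s = np.cos(theta), np.sin(theta)
        return np.array([[c, -s, x],
                         [s,  c, y],
                         [0,  0, 1.0]])

    def to_frame(T_frame, p_world):
        """Express a world-frame 2D point in the coordinate frame T_frame (frame transfer)."""
        p = np.linalg.inv(T_frame) @ np.array([p_world[0], p_world[1], 1.0])
        return p[:2]

    # The high-level agent picks a frame (e.g., a grasp pose); the low-level agent
    # only ever sees observations relative to that frame, which is where the
    # inductive bias comes from.
    T_highlevel = se2(0.5, 0.2, np.pi / 2)
    obj_world = np.array([0.6, 0.2])
    obj_local = to_frame(T_highlevel, obj_world)      # [0.0, -0.1]: 0.1 m to the frame's right

    # A low-level action expressed in the frame maps back to the world with T_frame:
    act_local = np.array([0.0, 0.05, 1.0])
    act_world = (T_highlevel @ act_local)[:2]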