Robotics 17
★ RAM: Retrieval-Based Affordance Transfer for Generalizable Zero-Shot Robotic Manipulation
This work proposes a retrieve-and-transfer framework for zero-shot robotic
manipulation, dubbed RAM, featuring generalizability across various objects,
environments, and embodiments. Unlike existing approaches that learn
manipulation from expensive in-domain demonstrations, RAM capitalizes on a
retrieval-based affordance transfer paradigm to acquire versatile manipulation
capabilities from abundant out-of-domain data. First, RAM extracts unified
affordance at scale from diverse sources of demonstrations including robotic
data, human-object interaction (HOI) data, and custom data to construct a
comprehensive affordance memory. Then given a language instruction, RAM
hierarchically retrieves the most similar demonstration from the affordance
memory and transfers such out-of-domain 2D affordance to in-domain 3D
executable affordance in a zero-shot and embodiment-agnostic manner. Extensive
simulation and real-world evaluations demonstrate that our RAM consistently
outperforms existing works in diverse daily tasks. Additionally, RAM shows
significant potential for downstream applications such as automatic and
efficient data collection, one-shot visual imitation, and LLM/VLM-integrated
long-horizon manipulation. For more details, please check our website at
https://yxkryptonite.github.io/RAM/.
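The hierarchical retrieval step can be pictured as a nearest-neighbor lookup over an embedded affordance memory. The sketch below is a minimal illustration under assumed names (`cosine_sim`, `retrieve_demo`) and toy 3-D features, not the paper's actual pipeline:

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two feature vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def retrieve_demo(query_feat, memory):
    # Return the stored demonstration whose feature embedding is
    # most similar to the query (the instruction/observation encoding).
    return max(memory, key=lambda entry: cosine_sim(query_feat, entry["feat"]))

# Toy affordance memory (illustrative entries and dimensionality).
memory = [
    {"task": "open_drawer", "feat": np.array([1.0, 0.0, 0.0])},
    {"task": "pick_mug",    "feat": np.array([0.0, 1.0, 0.0])},
]
query = np.array([0.9, 0.1, 0.0])
best = retrieve_demo(query, memory)
print(best["task"])  # → open_drawer
```

The retrieved entry's 2D affordance would then be lifted to a 3D executable affordance in the target scene, which this sketch does not attempt to model.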
☆ A Tree-based Next-best-trajectory Method for 3D UAV Exploration
This work presents a fully integrated tree-based combined
exploration-planning algorithm: Exploration-RRT (ERRT). The algorithm is
focused on providing real-time solutions for local exploration in a fully
unknown and unstructured environment while directly incorporating exploratory
behavior, robot-safe path planning, and robot actuation into the central
problem. ERRT provides a complete sampling and tree-based solution for
evaluating "where to go next" by considering a trade-off between maximizing
information gain, and minimizing the distances travelled and the robot
actuation along the path. The complete scheme is evaluated in extensive
simulations, comparisons, as well as real-world field experiments in
constrained and narrow subterranean and GPS-denied environments. The framework
is fully ROS-integrated, straightforward to use, and we open-source it at
https://github.com/LTU-RAI/ExplorationRRT.
comment: 19 pages, 29 figures. Transactions on Robotics
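The "where to go next" trade-off the abstract describes can be sketched as a scalar utility over candidate trajectories; the weights and candidate fields below are illustrative assumptions, not ERRT's actual formulation:

```python
def trajectory_utility(info_gain, distance, actuation,
                       w_gain=1.0, w_dist=0.3, w_act=0.1):
    # Reward expected information gain; penalize travel distance
    # and actuation effort along the candidate trajectory.
    return w_gain * info_gain - w_dist * distance - w_act * actuation

# Two candidate tree branches (toy numbers).
candidates = [
    {"id": "short_scan",  "gain": 12.0, "dist": 4.0,  "act": 2.0},
    {"id": "long_detour", "gain": 15.0, "dist": 20.0, "act": 6.0},
]
best = max(candidates,
           key=lambda c: trajectory_utility(c["gain"], c["dist"], c["act"]))
print(best["id"])  # → short_scan
```

In the actual algorithm these candidates would be trajectories sampled from the tree, with information gain estimated against the current map.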
☆ Enhancing Safety for Autonomous Agents in Partly Concealed Urban Traffic Environments Through Representation-Based Shielding
Navigating unsignalized intersections in urban environments poses a complex
challenge for self-driving vehicles, where issues such as view obstructions,
unpredictable pedestrian crossings, and diverse traffic participants demand a
great focus on crash prevention. In this paper, we propose a novel state
representation for Reinforcement Learning (RL) agents centered around the
information perceivable by an autonomous agent, enabling the safe navigation of
previously uncharted road maps. Our approach surpasses several baseline models
by a significant margin in terms of safety and energy consumption metrics.
These improvements are achieved while maintaining a competitive average travel
speed. Our findings pave the way for more robust and reliable autonomous
navigation strategies, promising safer and more efficient urban traffic
environments.
☆ EAGERx: Graph-Based Framework for Sim2real Robot Learning
Sim2real, that is, the transfer of learned control policies from simulation
to real world, is an area of growing interest in robotics due to its potential
to efficiently handle complex tasks. The sim2real approach faces challenges due
to mismatches between simulation and reality. These discrepancies arise from
inaccuracies in modeling physical phenomena and asynchronous control, among
other factors. To address these challenges, we introduce EAGERx, a framework with a unified
software pipeline for both real and simulated robot learning. It can support
various simulators and aids in integrating state, action and time-scale
abstractions to facilitate learning. EAGERx's integrated delay simulation,
domain randomization features, and proposed synchronization algorithm
contribute to narrowing the sim2real gap. We demonstrate (in the context of
robot learning and beyond) the efficacy of EAGERx in accommodating diverse
robotic systems and maintaining consistent simulation behavior. EAGERx is open
source and its code is available at https://eagerx.readthedocs.io.
comment: For an introductory video, see
http://www.youtube.com/watch?v=D0CQNnTT010 . The documentation, tutorials,
and our open-source code can be found at http://eagerx.readthedocs.io
☆ Gradient-based Regularization for Action Smoothness in Robotic Control with Reinforcement Learning IROS
Deep Reinforcement Learning (DRL) has achieved remarkable success, ranging
from complex computer games to real-world applications, showing the potential
for intelligent agents capable of learning in dynamic environments. However,
its application in real-world scenarios presents challenges, including the
jerkiness problem, in which jerky trajectories not only compromise system safety
but also increase power consumption and shorten the service life of robotic and
autonomous systems. To address jerky actions, a method called conditioning for
action policy smoothness (CAPS) was proposed by adding regularization terms to
reduce the action changes. This paper further proposes a novel method, named
Gradient-based CAPS (Grad-CAPS), that modifies CAPS by reducing the difference
in the gradient of action and then uses displacement normalization to enable
the agent to adapt to invariant action scales. Consequently, our method
effectively reduces zigzagging action sequences while enhancing policy
expressiveness and the adaptability of our method across diverse scenarios and
environments. In the experiments, we integrated Grad-CAPS with different
reinforcement learning algorithms and evaluated its performance on various
robotic-related tasks in DeepMind Control Suite and OpenAI Gym environments.
The results demonstrate that Grad-CAPS effectively improves performance while
maintaining a comparable level of smoothness compared to CAPS and Vanilla
agents.
comment: Accepted to IEEE/RSJ International Conference on Intelligent Robots
and Systems (IROS) 2024
☆ Corki: Enabling Real-time Embodied AI Robots via Algorithm-Architecture Co-Design
Yiyang Huang, Yuhui Hao, Bo Yu, Feng Yan, Yuxin Yang, Feng Min, Yinhe Han, Lin Ma, Shaoshan Liu, Qiang Liu, Yiming Gan
Embodied AI robots have the potential to fundamentally improve the way human
beings live and manufacture. Continued progress in the burgeoning field of
using large language models to control robots depends critically on an
efficient computing substrate. In particular, today's computing systems for
embodied AI robots are designed purely around the interests of algorithm
developers, with robot actions executed on a discrete per-frame basis. Such
an execution pipeline creates high latency and energy consumption. This paper
proposes Corki, an algorithm-architecture co-design framework for real-time
embodied AI robot control. Our idea is to decouple LLM inference, robotic
control and data communication in the embodied AI robots compute pipeline.
Instead of predicting action for one single frame, Corki predicts the
trajectory for the near future to reduce the frequency of LLM inference. The
algorithm is coupled with hardware that accelerates transforming trajectories
into the actual torque signals used to control robots, and with an execution
pipeline that overlaps data communication with computation. Corki reduces LLM
inference frequency by up to 8.0x, resulting in up to 3.6x speedup. The
success rate improvement can be up to 17.3%. Code is provided for
re-implementation. https://github.com/hyy0613/Corki
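The core scheduling change can be illustrated by counting LLM invocations: predicting a trajectory covering the next `horizon` frames amortizes one inference over many control steps. The horizon of 8 below is chosen to match the abstract's up-to-8.0x figure and is otherwise an assumption:

```python
def inference_calls_framewise(n_frames):
    # Baseline pipeline: one LLM inference per control frame.
    return n_frames

def inference_calls_trajectory(n_frames, horizon):
    # Corki-style pipeline (sketch): one inference produces a trajectory
    # covering `horizon` frames, executed by a low-level controller.
    calls, t = 0, 0
    while t < n_frames:
        calls += 1
        t += horizon
    return calls

n = 64
ratio = inference_calls_framewise(n) / inference_calls_trajectory(n, 8)
print(ratio)  # → 8.0
```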
☆ WOMD-Reasoning: A Large-Scale Language Dataset for Interaction and Driving Intentions Reasoning
Yiheng Li, Chongjian Ge, Chenran Li, Chenfeng Xu, Masayoshi Tomizuka, Chen Tang, Mingyu Ding, Wei Zhan
We propose Waymo Open Motion Dataset-Reasoning (WOMD-Reasoning), a language
annotation dataset built on WOMD, with a focus on describing and reasoning
interactions and intentions in driving scenarios. Previous language datasets
primarily captured interactions caused by close distances. However,
interactions induced by traffic rules and human intentions, which can occur
over long distances, are not yet sufficiently covered, despite being very common
and more challenging for prediction or planning models to understand.
Therefore, our WOMD-Reasoning focuses extensively on these interactions,
providing a total of 409k Q&As for varying types of interactions. Additionally,
WOMD-Reasoning presents by far the largest Q&A dataset on real-world driving
scenarios, with around 3 million Q&As covering various topics of autonomous
driving from map descriptions, motion status descriptions, to narratives and
analyses of agents' interactions, behaviors, and intentions. This extensive
textual information enables fine-tuning driving-related Large Language Models
(LLMs) for a wide range of applications like scene description, prediction,
planning, etc. By incorporating interaction and intention language from
WOMD-Reasoning, we see significant enhancements in the performance of the
state-of-the-art trajectory prediction model, Multipath++, with improvements of
10.14% in $MR_6$ and 6.90% in $minFDE_6$, proving the effectiveness of
WOMD-Reasoning. We hope WOMD-Reasoning would empower LLMs in driving to offer
better interaction understanding and behavioral reasoning. The dataset is
available on https://waymo.com/open/download .
☆ PA-LOCO: Learning Perturbation-Adaptive Locomotion for Quadruped Robots IROS 2024
Numerous locomotion controllers have been designed based on Reinforcement
Learning (RL) to facilitate blind quadrupedal locomotion traversing challenging
terrains. Nevertheless, locomotion control is still a challenging task for
quadruped robots traversing diverse terrains amidst unforeseen disturbances.
Recently, privileged learning has been employed to learn reliable and robust
quadrupedal locomotion over various terrains based on a teacher-student
architecture. However, its one-encoder structure is not adequate in addressing
external force perturbations. The student policy would experience inevitable
performance degradation due to the feature embedding discrepancy between the
feature encoder of the teacher policy and that of the student policy. Hence,
this paper presents a privileged learning framework with multiple feature
encoders and a residual policy network for robust and reliable quadruped
locomotion subject to various external perturbations. The multi-encoder
structure can decouple latent features from different privileged information,
ultimately leading to enhanced performance of the learned policy in terms of
robustness, stability, and reliability. The efficiency of the proposed feature
encoding module is analyzed in depth using extensive simulation data. The
introduction of the residual policy network helps mitigate the performance
degradation experienced by the student policy that attempts to clone the
behaviors of a teacher policy. The proposed framework is evaluated on a Unitree
GO1 robot, showcasing its performance enhancement over the state-of-the-art
privileged learning algorithm through extensive experiments conducted on
diverse terrains. Ablation studies are conducted to illustrate the efficiency
of the residual policy network.
comment: 8 pages, Accepted by IROS 2024
☆ Safe MPC Alignment with Human Directional Feedback
In safety-critical robot planning or control, manually specifying safety
constraints or learning them from demonstrations can be challenging. In this
paper, we propose a certifiable alignment method for a robot to learn a safety
constraint in its model predictive control (MPC) policy with human online
directional feedback. To our knowledge, it is the first method to learn safety
constraints from human feedback. The proposed method is based on an empirical
observation: human directional feedback, when available, tends to guide the
robot toward safer regions. The method only requires the direction of human
feedback to update the learning hypothesis space. It is certifiable, providing
an upper bound on the total number of human feedback instances in the case of successful
learning of safety constraints, or declaring the misspecification of the
hypothesis space, i.e., the true implicit safety constraint cannot be found
within the specified hypothesis space. We evaluated the proposed method using
numerical examples and user studies in two developed simulation games.
Additionally, we implemented and tested the proposed method on a real-world
Franka robot arm performing mobile water-pouring tasks in a user study. The
simulation and experimental results demonstrate the efficacy and efficiency of
our method, showing that it enables a robot to successfully learn safety
constraints with a small handful (tens) of human directional corrections.
comment: 18 pages, submission to T-RO
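The hypothesis-space update can be pictured as a cutting-plane step: each directional correction discards candidate constraint parameters inconsistent with "this direction is safer". This is a toy 2-D sketch with assumed linear constraint hypotheses, not the paper's MPC formulation:

```python
import numpy as np

rng = np.random.default_rng(0)
# Candidate parameters of an unknown linear safety constraint (toy, 2-D).
hypotheses = rng.normal(size=(500, 2))

def apply_directional_feedback(hyps, direction):
    # Cutting-plane-style update (sketch): keep only hypotheses whose
    # constraint agrees that moving along `direction` increases safety.
    return hyps[hyps @ direction > 0.0]

n0 = len(hypotheses)
hypotheses = apply_directional_feedback(hypotheses, np.array([1.0, 0.0]))
hypotheses = apply_directional_feedback(hypotheses, np.array([0.0, 1.0]))
print(len(hypotheses) < n0)  # each correction shrinks the hypothesis set
```

The certifiability claim in the abstract corresponds to bounding how many such cuts are needed before the set collapses to the true constraint, or is declared empty (misspecified).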
♻ ☆ Generative Camera Dolly: Extreme Monocular Dynamic Novel View Synthesis ECCV 2024
Basile Van Hoorick, Rundi Wu, Ege Ozguroglu, Kyle Sargent, Ruoshi Liu, Pavel Tokmakov, Achal Dave, Changxi Zheng, Carl Vondrick
Accurate reconstruction of complex dynamic scenes from just a single
viewpoint continues to be a challenging task in computer vision. Current
dynamic novel view synthesis methods typically require videos from many
different camera viewpoints, necessitating careful recording setups, and
significantly restricting their utility in the wild as well as for embodied AI
applications. In this paper, we propose $\textbf{GCD}$, a
controllable monocular dynamic view synthesis pipeline that leverages
large-scale diffusion priors to, given a video of any scene, generate a
synchronous video from any other chosen perspective, conditioned on a set of
relative camera pose parameters. Our model does not require depth as input, and
does not explicitly model 3D scene geometry, instead performing end-to-end
video-to-video translation in order to achieve its goal efficiently. Despite
being trained on synthetic multi-view video data only, zero-shot real-world
generalization experiments show promising results in multiple domains,
including robotics, object permanence, and driving environments. We believe our
framework can potentially unlock powerful applications in rich dynamic scene
understanding, perception for robotics, and interactive 3D video viewing
experiences for virtual reality.
comment: Accepted to ECCV 2024. Project webpage is available at:
https://gcd.cs.columbia.edu/
♻ ☆ DexDiffuser: Generating Dexterous Grasps with Diffusion Models
We introduce DexDiffuser, a novel dexterous grasping method that generates,
evaluates, and refines grasps on partial object point clouds. DexDiffuser
includes the conditional diffusion-based grasp sampler DexSampler and the
dexterous grasp evaluator DexEvaluator. DexSampler generates high-quality
grasps conditioned on object point clouds by iterative denoising of randomly
sampled grasps. We also introduce two grasp refinement strategies:
Evaluator-Guided Diffusion (EGD) and Evaluator-based Sampling Refinement (ESR).
The experiment results demonstrate that DexDiffuser consistently outperforms
the state-of-the-art multi-finger grasp generation method FFHNet, achieving on
average 9.12% and 19.44% higher grasp success rates in simulation and real
robot experiments, respectively. Supplementary materials are available at
https://yulihn.github.io/DexDiffuser_page/
comment: 7 pages
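The Evaluator-based Sampling Refinement idea can be sketched as perturb-and-rescore: sample small perturbations of a grasp and keep the one the evaluator prefers. The Gaussian perturbation model, parameters, and toy distance-based evaluator below are illustrative assumptions:

```python
import numpy as np

def refine_grasp(grasp, evaluator, n_samples=32, sigma=0.05, seed=1):
    # Sampling refinement (sketch): perturb the grasp, score candidates
    # with the evaluator, keep the best only if it improves on the input.
    rng = np.random.default_rng(seed)
    candidates = grasp + sigma * rng.normal(size=(n_samples, grasp.shape[0]))
    best = candidates[int(np.argmax([evaluator(c) for c in candidates]))]
    return best if evaluator(best) > evaluator(grasp) else grasp

# Toy evaluator: higher score the closer the grasp is to a "good" pose.
target = np.array([0.5, 0.5, 0.5])
evaluator = lambda g: -float(np.linalg.norm(g - target))

initial = np.array([0.3, 0.3, 0.3])
refined = refine_grasp(initial, evaluator)
print(evaluator(refined) >= evaluator(initial))  # → True
```

In DexDiffuser the evaluator is the learned DexEvaluator network scoring grasps on partial point clouds; the guard in the last line makes refinement monotone under any evaluator.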
♻ ☆ Towards Tight Convex Relaxations for Contact-Rich Manipulation
Bernhard Paus Graesdal, Shao Yuan Chew Chia, Tobia Marcucci, Savva Morozov, Alexandre Amice, Pablo A. Parrilo, Russ Tedrake
We present a novel method for global motion planning of robotic systems that
interact with the environment through contacts. Our method directly handles the
hybrid nature of such tasks using tools from convex optimization. We formulate
the motion-planning problem as a shortest-path problem in a graph of convex
sets, where a path in the graph corresponds to a contact sequence and a convex
set models the quasi-static dynamics within a fixed contact mode. For each
contact mode, we use semidefinite programming to relax the nonconvex dynamics
that result from the simultaneous optimization of the object's pose, contact
locations, and contact forces. The result is a tight convex relaxation of the
overall planning problem that can be efficiently solved and quickly rounded to
find a feasible contact-rich trajectory. As an initial application for
evaluating our method, we apply it on the task of planar pushing. Exhaustive
experiments show that our convex-optimization method generates plans that are
consistently within a small percentage of the global optimum, without relying
on an initial guess, and that our method succeeds in finding trajectories where
a state-of-the-art baseline for contact-rich planning usually fails. We
demonstrate the quality of these plans on a real robotic system.
♻ ☆ CBGL: Fast Monte Carlo Passive Global Localisation of 2D LIDAR Sensor IROS
Navigation of a mobile robot is conditioned on the knowledge of its pose. In
observer-based localisation configurations its initial pose may not be knowable
in advance, leading to the need of its estimation. Solutions to the problem of
global localisation are either robust against noise and environment
arbitrariness but require motion and time, which may (need to) be economised
on, or require minimal estimation time but assume environmental structure, may
be sensitive to noise, and demand preprocessing and tuning. This article
proposes a method that retains the strengths and avoids the weaknesses of the
two approaches. The method leverages properties of the Cumulative Absolute
Error per Ray (CAER) metric with respect to the errors of pose hypotheses of a
2D LIDAR sensor, and utilises scan-to-map-scan matching for fine(r) pose
estimations. A large number of tests, in real and simulated conditions,
involving disparate environments and sensor properties, illustrate that the
proposed method outperforms state-of-the-art methods of both classes of
solutions in terms of pose discovery rate and execution time. The source code
is available for download.
comment: 2024 IEEE/RSJ International Conference on Intelligent Robots and
Systems (IROS)
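The CAER metric can be sketched as a plain sum of per-ray range errors between the observed scan and the scan the map would predict at a pose hypothesis; `raycast` below is a stand-in for the map's ray-casting function, and the entries are toy values:

```python
def caer(scan, map_scan):
    # Cumulative Absolute Error per Ray between the real scan and the
    # scan predicted from the map at a candidate pose.
    return sum(abs(r - m) for r, m in zip(scan, map_scan))

def best_pose(scan, hypotheses, raycast):
    # Rank pose hypotheses by CAER; the lowest error wins (sketch).
    return min(hypotheses, key=lambda pose: caer(scan, raycast(pose)))

# Toy map: predicted scans for two pose hypotheses (illustrative).
predicted = {
    (0.0, 0.0): [1.0, 2.0, 1.5],
    (1.0, 0.0): [3.0, 0.5, 2.0],
}
scan = [1.1, 2.0, 1.4]
print(best_pose(scan, list(predicted), lambda p: predicted[p]))  # → (0.0, 0.0)
```

The full method then hands the best coarse hypotheses to scan-to-map-scan matching for finer pose estimation.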
♻ ☆ Skill Q-Network: Learning Adaptive Skill Ensemble for Mapless Navigation in Unknown Environments IROS
This paper focuses on the acquisition of mapless navigation skills within
unknown environments. We introduce the Skill Q-Network (SQN), a novel
reinforcement learning method featuring an adaptive skill ensemble mechanism.
Unlike existing methods, our model concurrently learns a high-level skill
decision process alongside multiple low-level navigation skills, all without
the need for prior knowledge. Leveraging a tailored reward function for mapless
navigation, the SQN is capable of learning adaptive maneuvers that incorporate
both exploration and goal-directed skills, enabling effective navigation in new
environments. Our experiments demonstrate that our SQN can effectively navigate
complex environments, exhibiting a 40% higher performance compared to baseline
models. Without explicit guidance, SQN discovers how to combine low-level skill
policies, showcasing both goal-directed navigation to reach destinations and
exploration maneuvers to escape from local minimum regions in challenging
scenarios. Remarkably, our adaptive skill ensemble method enables zero-shot
transfer to out-of-distribution domains, characterized by unseen observations
from non-convex obstacles or uneven, subterranean-like environments.
comment: 8 pages, 8 figures, accepted at the International Conference on
Intelligent Robots and Systems (IROS) 2024
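The high-level/low-level split can be sketched as a Q-function over skills choosing which low-level policy acts at each step; the two toy skills and the hand-written Q-function below are illustrative assumptions, not the learned networks:

```python
import numpy as np

def sqn_step(obs, skill_q, skill_policies):
    # High-level decision: score every low-level skill for this
    # observation, then let the best-scoring skill produce the action.
    q_values = skill_q(obs)
    skill = int(np.argmax(q_values))
    return skill_policies[skill](obs), skill

# Toy skills: head toward the goal, or steer away from the nearest obstacle.
goal_skill    = lambda obs: obs["goal_dir"]
explore_skill = lambda obs: -obs["obstacle_dir"]
# Toy Q-function: prefer the escape/exploration skill near an obstacle.
skill_q = lambda obs: [1.0, 2.0] if obs["obstacle_dist"] < 0.5 else [2.0, 1.0]

obs = {"goal_dir": np.array([1.0, 0.0]),
       "obstacle_dir": np.array([0.0, 1.0]),
       "obstacle_dist": 0.3}
action, skill = sqn_step(obs, skill_q, [goal_skill, explore_skill])
print(skill)  # → 1  (exploration skill chosen near an obstacle)
```

In SQN both the skill Q-function and the skill policies are learned jointly from the mapless-navigation reward, with no prior knowledge of when each skill applies.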
♻ ☆ Deep Reinforcement Learning with Dynamic Graphs for Adaptive Informative Path Planning
Autonomous robots are often employed for data collection due to their
efficiency and low labour costs. A key task in robotic data acquisition is
planning paths through an initially unknown environment to collect observations
given platform-specific resource constraints, such as limited battery life.
Adaptive online path planning in 3D environments is challenging due to the
large set of valid actions and the presence of unknown occlusions. To address
these issues, we propose a novel deep reinforcement learning approach for
adaptively replanning robot paths to map targets of interest in unknown 3D
environments. A key aspect of our approach is a dynamically constructed graph
that restricts planning actions local to the robot, allowing us to react to
newly discovered static obstacles and targets of interest. For replanning, we
propose a new reward function that balances between exploring the unknown
environment and exploiting online-discovered targets of interest. Our
experiments show that our method enables more efficient target discovery
compared to state-of-the-art learning and non-learning baselines. We also
showcase our approach for orchard monitoring using an unmanned aerial vehicle
in a photorealistic simulator. We open-source our code and model at:
https://github.com/dmar-bonn/ipp-rl-3d.
comment: 8 pages, 6 figures
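The replanning reward's exploration/exploitation balance can be sketched as a weighted sum scored over candidate graph nodes; the weights and terms are illustrative assumptions, not the paper's tuned function:

```python
def replanning_reward(new_cells, new_targets, w_explore=0.1, w_exploit=1.0):
    # Trade off exploring unknown space (newly observed map cells)
    # against exploiting online-discovered targets of interest.
    return w_explore * new_cells + w_exploit * new_targets

# Candidate next nodes on the dynamically constructed local graph.
candidates = [
    {"node": "frontier",       "cells": 60, "targets": 0},
    {"node": "target_revisit", "cells": 5,  "targets": 2},
]
best = max(candidates,
           key=lambda c: replanning_reward(c["cells"], c["targets"]))
print(best["node"])  # → frontier
```

With these toy weights the exploration-heavy node wins; the learned policy effectively adapts this trade-off online as targets are discovered.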
♻ ☆ HOPE: A Reinforcement Learning-based Hybrid Policy Path Planner for Diverse Parking Scenarios
Automated parking stands as a highly anticipated application of autonomous
driving technology. However, existing path planning methodologies fall short of
addressing this need due to their incapability to handle the diverse and
complex parking scenarios in reality. While non-learning methods provide
reliable planning results, they struggle in intricate scenarios, whereas
learning-based ones are good at exploration but unstable in converging to
feasible solutions. To leverage the strengths of both approaches, we introduce
Hybrid pOlicy Path plannEr (HOPE). This novel solution integrates a
reinforcement learning agent with Reeds-Shepp curves, enabling effective
planning across diverse scenarios. HOPE guides the exploration of the
reinforcement learning agent by applying an action mask mechanism and employs a
transformer to integrate the perceived environmental information with the mask.
To facilitate the training and evaluation of the proposed planner, we propose a
criterion for categorizing the difficulty level of parking scenarios based on
space and obstacle distribution. Experimental results demonstrate that our
approach outperforms typical rule-based algorithms and traditional
reinforcement learning methods, showing higher planning success rates and
generalization across various scenarios. We also conduct real-world experiments
to verify the practicability of HOPE. The code for our solution will be openly
available on GitHub at https://github.com/jiamiya/HOPE.
comment: 10 pages, 6 tables, 5 figures, 4 page appendix
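The action-mask mechanism can be sketched as zeroing out the probability of infeasible actions before the agent samples; which actions get masked (e.g. by Reeds-Shepp feasibility or collision checks) is scenario-specific, and the logits below are toy values:

```python
import numpy as np

def masked_action_probs(logits, mask):
    # Infeasible actions (mask == 0) get -inf logits, hence zero
    # probability after the softmax; the agent can only pick valid moves.
    logits = np.where(np.asarray(mask, dtype=bool), logits, -np.inf)
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

logits = np.array([2.0, 1.0, 0.5, 3.0])
mask = [1, 0, 1, 0]          # e.g. actions 1 and 3 fail a collision check
probs = masked_action_probs(logits, mask)
print(probs[1], probs[3])    # → 0.0 0.0
```

Masking in this way guides exploration without changing the gradient of the valid actions' probabilities, which is why it pairs well with a learned transformer policy.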
♻ ☆ ASY-VRNet: Waterway Panoptic Driving Perception Model based on Asymmetric Fair Fusion of Vision and 4D mmWave Radar IROS 2024
Panoptic Driving Perception (PDP) is critical for the autonomous navigation
of Unmanned Surface Vehicles (USVs). A PDP model typically integrates multiple
tasks, necessitating the simultaneous and robust execution of various
perception tasks to facilitate downstream path planning. The fusion of visual
and radar sensors is currently acknowledged as a robust and cost-effective
approach. However, most existing research has primarily focused on fusing
visual and radar features dedicated to object detection or utilizing a shared
feature space for multiple tasks, neglecting the individual representation
differences between various tasks. To address this gap, we propose a pair of
Asymmetric Fair Fusion (AFF) modules with favorable explainability designed to
efficiently interact with independent features from both visual and radar
modalities, tailored to the specific requirements of object detection and
semantic segmentation tasks. The AFF modules treat image and radar maps as
irregular point sets and transform these features into a crossed-shared feature
space for multitasking, ensuring equitable treatment of vision and radar point
cloud features. Leveraging AFF modules, we propose a novel and efficient PDP
model, ASY-VRNet, which processes image and radar features based on irregular
super-pixel point sets. Additionally, we propose an effective multitask
learning method specifically designed for PDP models. Compared to other
lightweight models, ASY-VRNet achieves state-of-the-art performance in object
detection, semantic segmentation, and drivable-area segmentation on the
WaterScenes benchmark. Our project is publicly available at
https://github.com/GuanRunwei/ASY-VRNet.
comment: Accepted by IROS 2024