Google DeepMind has reached a new milestone in robotic intelligence by incorporating Gemini AI, significantly improving its robots' navigation and their ability to carry out complex tasks. The achievement, detailed in a research paper by DeepMind's robotics team, rests on Gemini 1.5 Pro's expansive context window, which enables natural-language interaction with the company's RT-2 robots.
Training via Multimodal Instruction Navigation
The system is trained through an approach the paper calls "Multimodal Instruction Navigation with demonstration Tours (MINT)." Training involves manually guiding the robot through an environment such as a home or office, or recording a walkthrough with a smartphone. The robot then "watches" this video to build an understanding of its surroundings and respond to instructions within them. For instance, when shown a phone and asked where it can be charged, a robot can lead the user to a power outlet. The study reports a 90 percent success rate across more than 50 user instructions in a space exceeding 9,000 square feet.
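To make the tour-then-ask idea concrete, the sketch below shows how a long-context multimodal model could pick a goal location out of a recorded walkthrough. It is a minimal illustration under stated assumptions, not DeepMind's on-robot pipeline: it uses the public google-generativeai SDK, and the frame-sampling interval, prompt wording, and the sample_tour_frames and find_goal_frame helpers are all hypothetical.

```python
# Hypothetical sketch: feed an entire demonstration-tour video plus a user
# instruction to a long-context multimodal model and ask it to name the tour
# frame closest to where the instruction can be fulfilled.
import cv2                           # pip install opencv-python
import PIL.Image
import google.generativeai as genai  # pip install google-generativeai

genai.configure(api_key="YOUR_API_KEY")  # assumption: public Gemini API
model = genai.GenerativeModel("gemini-1.5-pro")

def sample_tour_frames(video_path: str, every_n: int = 30) -> list[PIL.Image.Image]:
    """Decode the walkthrough video, keeping every n-th frame."""
    frames, i = [], 0
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, bgr = cap.read()
        if not ok:
            break
        if i % every_n == 0:
            frames.append(PIL.Image.fromarray(cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)))
        i += 1
    cap.release()
    return frames

def find_goal_frame(frames: list[PIL.Image.Image], instruction: str) -> int:
    """Ask the model which numbered tour frame best satisfies the instruction."""
    parts = [
        "The images below are a demonstration tour of a building. "
        f"Instruction: {instruction!r}. "
        "Reply with only the index of the frame closest to where the "
        "instruction can be fulfilled."
    ]
    for idx, frame in enumerate(frames):
        parts += [f"Frame {idx}:", frame]
    return int(model.generate_content(parts).text.strip())
```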
How can Gemini 1.5 Pro’s long context window help robots navigate the world? 🤖
A thread of our latest experiments. 🧵 pic.twitter.com/ZRQqQDEw98
— Google DeepMind (@GoogleDeepMind) July 11, 2024
A hierarchical Vision-Language-Action (VLA) navigation policy enables the robots to combine an understanding of physical space with common-sense reasoning. At the high level, the AI interprets the user's command and identifies a goal within the demonstration tour; at the low level, it constructs a topological map by matching visual input from its cameras to frames from the tour video and navigates accordingly. This method achieves end-to-end success rates of 86 percent and 90 percent on complex navigation tasks.
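As a rough illustration of the low-level half of such a hierarchy, the sketch below treats navigation as classical graph search: tour frames become nodes of a topological map, the live camera view is localized by cosine similarity against tour-frame embeddings, and the robot steps along the shortest path toward the goal frame chosen by the high-level model. Every function here is an assumption for illustration; the paper's actual policy is not public code.

```python
# Hypothetical low-level navigator over a topological map of tour frames.
# Embeddings are assumed to come from any image encoder (e.g., CLIP).
import numpy as np
import networkx as nx  # pip install networkx

def build_tour_graph(num_frames: int) -> nx.Graph:
    """Consecutive tour frames are adjacent; loop closures could add edges."""
    g = nx.Graph()
    g.add_edges_from((i, i + 1) for i in range(num_frames - 1))
    return g

def localize(camera_emb: np.ndarray, tour_embs: np.ndarray) -> int:
    """Return the index of the tour frame most similar to the camera view."""
    sims = tour_embs @ camera_emb
    sims = sims / (np.linalg.norm(tour_embs, axis=1) * np.linalg.norm(camera_emb))
    return int(np.argmax(sims))

def next_waypoint(graph: nx.Graph, current: int, goal: int) -> int:
    """First hop on the shortest path from the current node to the goal."""
    path = nx.shortest_path(graph, current, goal)
    return path[1] if len(path) > 1 else current
```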
Advanced Task Execution
Gemini 1.5 Pro also shows preliminary evidence of improving the robots' ability to plan more nuanced tasks. For instance, if a user surrounded by Coke cans asks whether there is any Coke left in the fridge, the robot can infer that it should check the fridge and report back. This marks a substantial step forward in robotic planning and task execution.
Despite these advancements, each instruction takes between 10 and 30 seconds to process, indicating room for further optimization. The Google team aims to refine these capabilities for better performance. While widespread adoption of such advanced robots in homes is still some way off, the current progress suggests they could soon assist with everyday activities like locating keys or wallets.
Real-World Application and Command Testing
In extensive real-world testing, commands such as "Take me to the conference room with the double doors," "Where can I borrow some hand sanitizer?" and "I want to store something out of sight from public view. Where should I go?" were used to evaluate the robots' practical abilities. These tests demonstrated their competence in handling intricate reasoning and multimodal user commands.
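Purely for illustration, those commands could be driven end to end through the two sketches above. Only the command strings come from the study; the toy embed() encoder and the use of the first tour frame as the robot's current view are stand-ins.

```python
# Hypothetical glue over the earlier sketches; only the command strings are
# from the study.
import numpy as np

def embed(img) -> np.ndarray:
    """Toy stand-in encoder: flatten a downscaled copy and L2-normalize it."""
    arr = np.asarray(img.resize((32, 32)), dtype=np.float32).ravel()
    return arr / (np.linalg.norm(arr) + 1e-8)

commands = [
    "Take me to the conference room with the double doors.",
    "Where can I borrow some hand sanitizer?",
    "I want to store something out of sight from public view. Where should I go?",
]

frames = sample_tour_frames("office_tour.mp4")   # helper from the first sketch
tour_embs = np.stack([embed(f) for f in frames])
graph = build_tour_graph(len(frames))            # helper from the second sketch

for cmd in commands:
    goal = find_goal_frame(frames, cmd)           # high level: choose a goal frame
    here = localize(embed(frames[0]), tour_embs)  # pretend the robot sits at frame 0
    step = next_waypoint(graph, here, goal)
    print(f"{cmd!r} -> goal frame {goal}, next waypoint {step}")
```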