
Google Unveils Project Astra, a Multimodal AI Visual Assistant

Astra will allow users to interact through speech, text, drawing, photos, and video.


Google has unveiled Project Astra, a new application powered by Google's Gemini AI and smartphone cameras to assist users in their daily activities. Announced during the Google I/O 2024 keynote, this initiative underscores Google's ongoing efforts to develop versatile AI agents capable of providing practical help.
The features closely resemble the version of ChatGPT running on the multimodal GPT-4o model, which OpenAI announced just a day earlier. Project Astra is part of a broader set of Gemini announcements at Google I/O 2024. These include new models such as Gemini 1.5 Flash for faster tasks, Veo for generating video from text prompts, and Gemini Nano for on-device use. The context window for Gemini 1.5 Pro is also doubling to 2 million tokens, enhancing its ability to follow long and complex instructions.

AI-Powered Visual Assistance

Project Astra operates as a camera-based AI application, primarily using a viewfinder as its interface. Users can point their phone cameras at various objects and verbally interact with the AI, named Gemini. For example, when a user asked the AI to identify a sound-making object in an office, Gemini recognized a speaker and provided detailed information about its components, such as identifying the tweeter and explaining its function.
The app also demonstrates creative capabilities. When prompted to create an alliteration for a cup of crayons, Gemini responded with “Creative crayons color cheerfully. They certainly craft colorful creations.”

Wearable Integration and Memory Recall

The demonstration included a segment where the AI remembered the location of items that were no longer in the camera's view. When asked about the location of misplaced glasses, Gemini accurately recalled that they were on a desk near a red apple. The user then wore the glasses, which appeared to be an advanced version of Google Glass, and the perspective shifted to the wearable's view. The glasses scanned the surroundings and provided contextual information, such as suggesting technical improvements to a system diagram on a whiteboard.

The AI's ability to process visual data in real-time and recall past observations is achieved through continuous encoding of video frames, combining video and speech inputs into a timeline of events, and caching this information for efficient recall. This technological advancement allows the AI to respond quickly and accurately, enhancing its practical utility.

Multimodal Interaction

Astra is designed to be multimodal, allowing users to interact through speech, text, drawing, photos, and video. Google is also introducing Gemini Live, a voice-only assistant for back-and-forth conversations, and a new feature in Google Lens for web searches via video narration.
Google is currently working on applications like trip planning, where Gemini can help build and edit itineraries. The DeepMind team is still researching how to best integrate multimodal models and balance large general models with smaller, focused ones.

Future Availability and Enhancements

While Project Astra is still in its early stages with no specific launch date, Google DeepMind CEO Demis Hassabis indicated that some capabilities of the AI would be integrated into Google products, like the Gemini app, later in the year. The company is also working on improving the vocal expressiveness of its AI, aiming to make interactions more natural and conversational. The potential applications of such technology, whether through smartphones or advanced wearables, could significantly enhance user experience and productivity.

Markus Kasanmascheff
Markus is the founder of WinBuzzer and has been playing with Windows and technology for more than 25 years. He holds a Master's degree in International Economics and previously worked as Lead Windows Expert for Softonic.com.