Google I/O 2024 has been, to nobody’s surprise, all about AI. For most of Tuesday’s two-hour keynote, Google talked up AI shopping, AI Workspace tools, AI education tools — you get the idea. A lot of it was impressive, but the most intriguing AI-related announcement was for what Google is calling Project Astra, an acronym for “advanced seeing and talking responsive agent.”



Astra represents Google’s efforts to help Gemini understand the real world through multimodal input, both video and audio. I got to see an in-person demo: in a small room fitted with a camera, a microphone, and a bunch of props, Gemini answered questions, made up stories, and played simple games using info from simultaneous audio and video input. We’ve been routinely burned by lofty AI promises for a couple of years now, but if Project Astra’s functionality actually reaches users and works as well as what I saw at I/O, I think it could be a very big deal.
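For a concrete sense of what "multimodal input" means here: below is a minimal sketch using Google's public google-generativeai Python SDK. To be clear, Astra's live, low-latency streaming pipeline is not part of that public API; this only approximates the idea by sending a single camera frame and an audio clip alongside a text prompt, and the file names and API key are placeholders of my own.

```python
# Rough approximation of multimodal prompting with the public Gemini SDK.
# Astra streams audio/video continuously; this sketch sends one frame and
# one audio clip as a single request instead.
import google.generativeai as genai
import PIL.Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder key
model = genai.GenerativeModel("gemini-1.5-pro")

frame = PIL.Image.open("camera_frame.jpg")        # a still from the camera feed
audio = genai.upload_file("spoken_question.mp3")  # the presenter's spoken prompt

response = model.generate_content(
    [frame, audio, "Answer the question in the audio about what the camera sees."]
)
print(response.text)
```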



The Project Astra demonstrations

I saw AI party tricks, but the software underpinning them seems promising

The Astra demos I participated in took place in front of a large touchscreen that showed a live feed from a camera pointed downward at a table (we were asked not to take photos, unfortunately). Across four different demonstrations, Google reps placed various objects in view of the camera. A voice that I’ll just call Gemini — the reps pointed out that all the demos were powered by the same Gemini 1.5 model — reacted to both the objects and the presenters’ questions and comments in a convincingly natural way. Aside from a couple of minor hiccups I’ll touch on later, it was all pretty impressive stuff.



Pictionary

The first and most interesting of Google’s Astra demos was a simple Pictionary-style game. On a large touchscreen, a Google rep drew a stick figure, which Gemini quickly identified — it even complimented the presenter’s drawing skills. One of the reps dragged and dropped a skull emoji from an on-screen menu onto the stick figure’s outstretched arm, and with the hint that the rep was “thinking about a play,” Gemini caught on straight away that it was Hamlet.

The presenter removed the skull and drew a second stick figure, along with a shared thought bubble above the pair. She added an alien emoji inside the bubble and told Gemini the image was a TV show, and Gemini put together that it was meant to be The X-Files about as quickly as I did.

Alliteration

In a second Astra demo highlighting Gemini's linguistic skills, the Google reps placed a series of toy food items in view of the camera: an apple, an ice cream cone, and a hot dog. Gemini described each object in alliterative sentences, strings of words that all started with the same letter. It could talk pretty coherently about individual items this way, but when asked more complex questions (like whether the three items together made for a healthy lunch), it struggled to actually answer, instead prioritizing the directive to speak in alliteration.
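That failure mode, a standing style directive crowding out the actual question, is easy to reproduce with the public SDK's system instructions. A minimal sketch, assuming the google-generativeai Python SDK; the prompts here are my own, not Google's demo script:

```python
# Sketch: pinning a style directive as a system instruction, which the
# model tends to honor even when a question deserves a plain answer.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key
model = genai.GenerativeModel(
    "gemini-1.5-pro",
    system_instruction="Describe everything in alliterative sentences.",
)

# The directive holds for simple descriptions...
print(model.generate_content("Describe this apple.").text)
# ...but can crowd out a direct answer to a more complex question.
print(model.generate_content("Do these three items make a healthy lunch?").text)
```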



Free form

In a demo Google called "free form," a rep placed several stuffed animal toys on the table, one at a time, telling Gemini each one's name as she did. Gemini remembered the details it was told about the objects, like their names, as well as visual information it picked up from the camera feed, like what type of animal each toy was.

Gemini could answer questions about what it had seen during the demo (like "what was the first thing I showed you?"), but it also hallucinated at one point, misremembering one of the toys' given names. Still, it got most of the details right and could respond in real time to both what it was seeing through the camera and the Google reps' spoken questions and comments.
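The recall on display here maps onto what the public SDK exposes as a chat session, where earlier turns, images included, stay in the model's context. A minimal sketch, again assuming the google-generativeai Python SDK, with hypothetical toy photos standing in for the live camera feed:

```python
# Sketch: a multi-turn chat where earlier multimodal turns stay in context,
# so the model can answer "what was the first thing I showed you?" later.
import google.generativeai as genai
import PIL.Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder key
model = genai.GenerativeModel("gemini-1.5-pro")
chat = model.start_chat(history=[])

chat.send_message([PIL.Image.open("toy1.jpg"), "This one is named Clementine."])
chat.send_message([PIL.Image.open("toy2.jpg"), "This one is named Bruno."])

# Recall depends on the whole session fitting in the model's context window.
reply = chat.send_message("What was the first toy I showed you, and what was its name?")
print(reply.text)
```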



Storyteller

For the final demo, I was asked to pick one of a bunch of objects arranged on a nearby shelf. I grabbed a toy crab and put it in front of the camera. Gemini started to tell me a pretty intricate story about the crab making its way to a sand castle it spotted on the beach, including details like how the sand felt.

The Google reps placed a fidget spinner on the table next to the crab and cut Gemini off mid-sentence to ask it to include the new toy in the story about the crab. Gemini went on with its story, and sure enough, a few sentences later, the crab stumbled across a fidget spinner on its trip to the sand castle.


Project Astra could be a very big deal

For more reasons than one

[Image: A scene from Google I/O 2024 with a large screen that reads "Project Astra." Source: Google]



Project Astra seems like the type of functionality devices like the Rabbit R1 and Humane's AI Pin are trying to deliver: simple, voice-based interaction with an LLM that can interpret the physical world in convincingly human ways. The advantage here for Google is that Gemini is available on off-the-shelf hardware — in a video demo shown during the I/O keynote, Project Astra was up and running on what looked like a Pixel 8 Pro.

None of the Project Astra demonstrations showcased specific, helpful use cases for Gemini’s growing multimodal understanding of the physical world, but it’s not hard to imagine ways the capabilities I saw could be useful. In the span of a few minutes, Google showed me that, with Astra, Gemini can take in multiple types of information at once and answer questions about what it’s seen and heard. It can also make sense of abstract symbols (like stick figures and emojis) and express ideas in ways that, coming from a person, I’d describe as creative.



According to Google, developments like this are stepping stones on the way to creating genuine artificial general intelligence — more Iron Man's JARVIS than Google Assistant. That's fascinating in itself, but more immediately, this type of functionality could be huge for people who have trouble interpreting their surroundings on their own. Astra's multimodal understanding built into a wearable camera (in smart glasses, for example) could be an enormous help for people with visual impairments.

Google describes Astra as “our vision for the future of AI assistants,” but we’ll have to wait and see how much of this actually materializes in a consumer product. Google says that “some of these capabilities” are coming to Gemini “later this year.”