Most of Google’s updates to Gemini don’t stand out to me. I’ve yet to see a significant improvement in its hallucination rate, and its ability to summarize the news and weather leaves a lot to be desired. However, a recent update that added video analysis capabilities to Gemini caught my eye as a tool I might use regularly.
Video analysis in Gemini builds on the AI’s existing ability to summarize YouTube videos. I took the tool for a test run to see just how powerful it is and whether I would use it in everyday life.
How well does Gemini’s video analysis work?
For testing, I selected a variety of videos from my camera roll and asked Gemini different questions each time. Depending on what you ask, Gemini will analyze the video differently, so for each clip I asked the questions most relevant to its content.
Test 1: Object recognition
Gemini correctly identified the type of ducks in my video with some prompting, and even pinpointed where the video was taken, thanks to a sign in the background.
The sign only showed the business name, but Gemini managed to identify where the video was recorded to within 100 meters. However, the clues in the video (the business name, Mandarin ducks, and canal) would have also led a human to the correct answer within minutes.
Test 2: Location recognition
I was quite impressed by Gemini’s ability to identify where my video was taken, but there were plenty of clues to help it. For my next test, I used a video of an eruption of the Kilauea volcano in Hawaii taken in May. Gemini managed to correctly identify the volcano, but it was unable to identify the date (the video was taken on May 26).
Test 3: Location recognition
Just like with Gemini’s other analysis features, you need to ask it the right question to get the right answer. This video I took of a small parade at Karneval in Cologne last year stumped Gemini.
It was unable to answer when I asked where the video was taken, but it managed to identify the country with further prompting. Interestingly, this prompt revealed that it recognized that the video was of a Karneval parade, but it couldn’t identify the city.
I tested Gemini again using a video of the main Karneval parade (which contained significantly more visual clues), but it was still unable to tell that the video was taken in Cologne despite the number of street signs, shop fronts, and Karneval costumes shown in the footage.
Test 4: Audio recognition
I was personally interested in Gemini’s audio recognition. Identifying songs that are currently playing is useful, but picking up a song in the background of an old video is even more helpful for me. Unfortunately, Gemini’s performance here was spotty at best. Here are some of my results:
- Incorrectly identified a 22-second recording of ‘Solid Rock’ by Dire Straits as ‘I Know Alone’ by HAIM.
- Incorrectly identified a 15-second recording of ‘Surfing with the Alien’ by Joe Satriani as ‘Can’t Stop’ by the Red Hot Chili Peppers.
- Correctly identified a 57-second recording of ‘Like a Rolling Stone’ by Bob Dylan. It also identified the song from an 11-second recording.
- Incorrectly identified an 11-second recording of ‘Wildflowers’ by Tom Petty as ‘You Belong to Me’ by The Duprees.
I tested Gemini more times with videos of varying lengths. Its accuracy was positively correlated with the length of the recording, but what surprised me was just how far off its wrong guesses were.
I highly recommend comparing the tracks above to hear how far Gemini’s guesses are from the real songs. Honestly, Gemini, how does Tom Petty sound anything like The Duprees?
Test 5: Explaining what happens in a video
One of the more practical uses of Gemini is to explain what happens in a video if you don’t have time to watch it yourself. I used one of my favorite videos, a clip of my friend’s cats fighting. Gemini had a fascinating take on this clip.
While you can clearly see the black-and-white cat attack and then chase away the black cat, Gemini concluded that the cats began to fight (notably using neutral phrasing, even though there was clearly an aggressor) and that the black cat then chased the black-and-white cat away.
Gemini’s take here is misleading and would leave the user with a completely incorrect understanding of the situation.
However, a follow-up question prompted Gemini to correctly identify the aggressor in the video. This is a funny case involving a harmless interaction between cats, but it’s a great example of how Gemini can mislead users. What if you used Gemini to analyze a video of people fighting?
Gemini’s video analysis is as unreliable as the rest of the AI’s services
One of the first tests I ran on Gemini’s video analysis was the Kilauea volcano eruption. That result impressed me, but in most of my subsequent tests, Gemini failed to deliver. It needed hard evidence like signs to accurately identify locations, and its song recognition is inferior to Google’s Song Search tool (which is also included in the Gemini app).
The most interesting test for me was Gemini’s analysis of the cat fight, as it drew the wrong conclusion despite clear visual evidence. I managed to get it to analyze the video correctly after multiple prompts, but that took longer than simply watching the clip myself. In conclusion, I’ll stick to watching and analyzing videos myself and shelve Gemini again.