OpenAI launched ChatGPT in 2022, revolutionizing the world of technology. ChatGPT is a conversational AI used as a chatbot and virtual assistant on the web and, through its API, in many other applications. You send a prompt, and ChatGPT responds; you can ask it anything, like the top book series of the month, or ask it to write a rap song featuring your favorite Marvel characters.




LLM-based (large language model) AI has been at the center of many conversations since ChatGPT's debut. It has also enabled new tech to evolve and software to flourish on everything from budget smartphones to flagship devices. ChatGPT's arrival has pushed other top tech giants to bring their own LLM-based AIs and tools to the public. One crucial feature is the ability to interpret images, letting users submit pictures instead of (or alongside) text. Previously, this feature was reserved for premium users, but OpenAI has rolled it into its latest GPT update.





What is ChatGPT vision?

If you’re familiar with generative AI, you’ve likely already heard of ChatGPT. ChatGPT debuted in 2022 with the public release of GPT-3.5 and later added GPT-4 as a paid option. According to an OpenAI paper published in 2023, GPT-4V “enables users to instruct GPT-4 to analyze image inputs provided by the user.” OpenAI completed GPT-4V’s training in 2022 and began granting early access in March 2023.

Figure 2 from OpenAI’s paper on GPT-4V, published in 2023
Source: OpenAI

GPT-4V underwent many iterations before the feature became publicly ready. It was tested and analyzed for disinformation risks, stereotyping, and ungrounded inferences. The developers did not want the vision feature to be misused or to spread misinformation on safety-critical and sensitive topics.




How can you access ChatGPT vision?

ChatGPT vision, also known as GPT-4 with vision (GPT-4V), initially rolled out as a premium feature for ChatGPT Plus subscribers ($20 per month). OpenAI has since brought the vision feature to all free users with GPT-4o (the “o” stands for omni), though it is rolling out in stages.

Free users have a usage cap, while Plus subscribers get a message limit up to five times higher than the free tier’s. Also, to access ChatGPT, users were previously required to sign up for a free account. OpenAI has since changed its policy; anyone can start using ChatGPT without creating an account. However, having an account still adds benefits, such as saving and reviewing your chat history and attaching images. So, if you plan to use the vision feature, it’s worth signing up for an account.


How to use ChatGPT vision

To get started with GPT-4o, log into chat.openai.com or open the mobile app and select Try it now when prompted.



red rectangle outline over try it now option in introducing GPT-4o window

From there, you can attach an image from your computer or paste the address of an image you’ve found online. ChatGPT will invite you to ask questions about it, or you can type your question as you attach the image.
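If you’d rather send images programmatically, OpenAI’s API accepts image inputs alongside text. Here’s a minimal sketch, assuming the official `openai` Python package and an API key in your environment; the file path and question are placeholders:

```python
import base64


def build_vision_message(question: str, image_path: str) -> dict:
    """Pair a text question with a local image in one chat message.

    The image is base64-encoded into a data URL, which the Chat
    Completions API accepts as an image_url content part.
    """
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{encoded}"},
            },
        ],
    }


# Sending the request (requires the `openai` package and an API key):
# from openai import OpenAI
# client = OpenAI()
# response = client.chat.completions.create(
#     model="gpt-4o",
#     messages=[build_vision_message("What is in this image?", "photo.jpg")],
# )
# print(response.choices[0].message.content)
```

Publicly reachable image URLs can also be passed directly in the `image_url` field, skipping the base64 step.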

ChatGPT is not perfect; it makes mistakes. In the prompt below, with three anime characters in one image (image credit: Screenrant), ChatGPT misidentified one of the three, so its answer was only about 66% correct.

It guessed Naruto, Goku, and Luffy. But in this image, Luffy isn’t present. Instead, we have Sailor Moon.

sample prompt of using an image to find information related to the image in chatgpt


Even if the feature isn’t perfect, you can still use it for a handful of image-related applications. You can ask ChatGPT for details it can glean (or make educated guesses about) from a photo alone. Below, we tried some prompts to see how well ChatGPT handles these requests.

Using GPT-4o vision for learning recipes

We sent this image to ChatGPT-4o and asked if it could discern the recipe (ingredients used) and calorie information based on the image.

close up of a taco salad in a bowl mixed
Source: Food.com

ChatGPT could discern that this was a taco salad and mentioned the typical ingredients. It also broke down the calories based on the ingredients used. The response was:



  • Calories: 655
  • Ground beef
  • Lettuce
  • Cherry tomatoes
  • Shredded cheese
  • Tortilla chips or Doritos
  • Black beans or pinto beans
  • Salsa or a similar dressing

The actual answer, according to a user on Food.com:

  • Calories: 855.3
  • Ground beef
  • Taco seasoning
  • Iceberg lettuce, chopped
  • Roma tomatoes, diced
  • Green onions, chopped
  • Red kidney beans or black beans, drained
  • Large black olives, sliced
  • Cheddar cheese, shredded
  • Catalina dressing
  • Plain Doritos, crumbled into big chunks

Though the ingredient list was more generalized than expected, it still gave a rough idea of the dish and its expected calorie count. Calories will vary with the dressing and portion size, which are difficult to judge from a photo.

Using GPT-4o vision to transcribe handwritten notes into text

Transcribing written notes takes a lot of time, especially when you want to keep digital copies. One cool use of ChatGPT’s vision is asking the AI to convert images of handwritten text into typed notes.



We asked ChatGPT to produce a text version of a slide:

a slide of handwritten notes for chemistry

ChatGPT’s answer:

transcribed handwritten notes into text form in chatgpt

The results were impressive. The AI even recognized handwritten symbols outside ordinary English text, as was the case with the notation for net charge.

Using GPT-4o vision to solve Captchas

Captchas help filter out bots by presenting distorted, difficult-to-discern images, usually filled with letters and numbers. Sometimes, though, solving a Captcha proves tricky even for a human. We tested whether ChatGPT can help you solve one.


We pulled an example of a Captcha on Cloudflare’s learning page.

an example of a captcha showing eight characters
Source: Cloudflare

We asked ChatGPT whether it could provide the characters in the image (without mentioning that it contained letters and numbers). The results were not accurate: ChatGPT answered “v6T9JBCD.” It’s understandable that the AI saw a letter “v,” since the squiggles in the image have a “v” shape, but it was surprising that the letter “S” was not considered at all.


What else can you do with GPT vision?

Uploading images and asking ChatGPT to interpret, analyze, and answer questions about them is only one part of its capabilities. You can also ask the AI to produce images from descriptions and specific instructions. For example, you can upload a screenshot and ask how it could be improved, or have ChatGPT generate an image from scratch with DALL·E 3.
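Image generation is also exposed through OpenAI’s API. As a sketch, here is one way to assemble the request parameters, assuming the official `openai` Python package; the model name and size options reflect OpenAI’s documented DALL·E 3 settings at the time of writing:

```python
def build_image_request(prompt: str, size: str = "1024x1024") -> dict:
    """Assemble keyword arguments for client.images.generate().

    DALL·E 3 supports square and two widescreen sizes; anything else
    is rejected before the request is sent.
    """
    allowed = {"1024x1024", "1792x1024", "1024x1792"}
    if size not in allowed:
        raise ValueError(f"unsupported size: {size}")
    return {"model": "dall-e-3", "prompt": prompt, "n": 1, "size": size}


# Usage (requires the `openai` package and an API key):
# from openai import OpenAI
# client = OpenAI()
# result = client.images.generate(
#     **build_image_request("a taco salad in a bowl, studio photo")
# )
# print(result.data[0].url)
```

Validating parameters locally like this avoids burning an API call on a request the service would reject anyway.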


ChatGPT’s vision feature handles mixed imagery, too. We rarely have perfect pictures, and some images contain both text and illustrations. You can have ChatGPT interpret an infographic and ask it questions, or even ask it to reproduce the infographic in a form you find easier to understand.

It can also help in day-to-day life: take a picture or a video, upload it to the AI, and ask for help. This comes in handy when you’re operating a device whose instructions are in another language.




ChatGPT with vision is still learning

AI can only improve as we feed it more visual data. The more images and questions we submit, the better the AI becomes at interpreting them realistically and consistently. It is similar to training a human brain: the more topics we expose ourselves to, the better equipped we become to handle them. The same principle applies to machine learning.

In its May 2024 update, OpenAI outlined its plans for ChatGPT’s visual learning. Eventually, it wants users to converse with the AI over real-time video, and it plans to improve the Voice Mode function so you can talk to the AI more naturally. If AI continues to interest you, you can try some impressive apps on the Google Play Store.
