OpenAI has unveiled a new AI model designed to analyze audio, visual and text input and provide answers based on what it "sees" and "hears".
The company behind the immensely popular AI tool ChatGPT announced its latest flagship model, GPT-4o ("omni"), which OpenAI describes as a step towards "much more natural human-computer interaction". The new AI model is expected to match the performance of GPT-4 Turbo at processing text and code input while being faster and 50% cheaper in the API, making it a more affordable choice for third-party app integration.
More specifically, users will be able to ask questions by voice about what the AI is able to "see" on the device's screen or through its camera. OpenAI demonstrated this with two people who verbally asked the AI "what game can we play". The AI used the smartphone camera to "see" the two people sitting in front of it and suggested playing rock, paper, scissors. The quick demonstration showed the AI model interacting fluently with the individuals and responding quickly to interruptions and new commands.
The individuals then asked the AI "who won?" and the AI responded, "It's a tie", demonstrating that it can see using the device's camera.
"It can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in a conversation," writes OpenAI.