You can now prompt ChatGPT with pictures and voice commands
OpenAI’s new text-to-speech model creates human-like audio from just text and a few seconds of sample speech
Most of the changes to ChatGPT so far have been about what it can answer: more questions, more information, improved models. This time, though, OpenAI is tweaking the way you use ChatGPT itself. A new version of the service will let you prompt the bot by speaking aloud or by uploading a photo, rather than just typing in a sentence. The new features are rolling out to paying ChatGPT subscribers in the next two weeks, and everyone else will get them “soon after,” according to OpenAI.
OpenAI is also rolling out a new text-to-speech model, which it says can create “human-like audio from just text and a few seconds of sample speech.” You’ll be able to choose ChatGPT’s voice from five options, but OpenAI seems to think the model has far more potential than that. The company is working with Spotify to translate podcasts into other languages, for instance, while preserving the sound of the podcaster’s voice. The synthetic voice industry is filled with interesting possibilities, and OpenAI appears poised to become a part of it.
But the ability to build a convincing synthetic voice from just a few seconds of audio also opens the door to problematic use cases. “These capabilities also present new risks, such as the potential for malicious actors to impersonate public figures or commit fraud,” the company says in a blog post announcing the new features. For that reason, the model will initially be available only in a controlled fashion and restricted to specific use cases.
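For a sense of what programmatic text-to-speech looks like in practice, here is a minimal sketch against the text-to-speech endpoint in OpenAI’s public API. This is an illustration, not the restricted voice-cloning model described above: the public endpoint offers preset voices only, and the model name, voice, and output handling shown here are assumptions representing one reasonable configuration. It assumes the official `openai` Python SDK (v1+) and an `OPENAI_API_KEY` environment variable.

```python
# Minimal sketch: generating speech with OpenAI's public text-to-speech
# endpoint. The public API uses preset voices only; the few-seconds
# voice-cloning capability described in the article is not exposed here.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.audio.speech.create(
    model="tts-1",   # OpenAI's text-to-speech model (assumed choice)
    voice="alloy",   # one of the preset voices
    input="Hello! This sentence will be rendered as human-like audio.",
)

# Write the returned audio bytes (MP3 by default) to disk.
with open("speech.mp3", "wb") as f:
    f.write(response.content)
```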
Image search, GPT-4, and the multimodal case for Google’s Gemini
The image search, meanwhile, works a bit like Google Lens. You snap a photo of whatever you’re interested in, and ChatGPT will try to suss out what you’re asking about and respond accordingly. You can also use the app’s drawing tool to highlight the part of the image you mean, or add a spoken or typed question alongside the photo. Because you’re prompting the bot rather than running a search, you can refine the answer in the same conversation instead of starting a new query each time, much like what Google is doing with its multisearch feature.
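Developers can do something similar outside the app: GPT-4’s vision-capable chat endpoint accepts images alongside text in a single prompt. A minimal sketch, assuming the official `openai` Python SDK (v1+), an `OPENAI_API_KEY` environment variable, a vision-capable model name, and a placeholder image URL:

```python
# Minimal sketch: asking a vision-capable GPT-4 model about a photo via
# the chat completions API. The model name and image URL are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable model
    messages=[
        {
            "role": "user",
            # Text and image travel as separate content parts in one message.
            "content": [
                {"type": "text", "text": "What is shown in this photo?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/photo.jpg"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)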
The most recent GPT-4 is a language model created using huge amounts of text from various sources around the web. But animal and human intelligence draws on many other types of sensory data, including audio and visual information, and creating more advanced AI may require feeding algorithms those signals as well.
Google’s next major AI model, Gemini, is widely rumored to be “multimodal,” meaning it will be able to handle more than just text, perhaps allowing video, images, and voice inputs. “From a model performance standpoint, intuitively we would expect multimodal models to outperform models trained on a single modality,” says Trevor Darrell, a professor at UC Berkeley and a cofounder of Prompt AI, a startup working on combining natural language with image generation and manipulation. “If we build a model using just language, no matter how powerful it is, it will only learn language.”