You can now prompt ChatGPT with pictures and voice commands
OpenAI’s new text-to-speech model creates human-like audio from just text and a few seconds of sample speech
Most of the changes to ChatGPT so far have been about what it can answer: more questions, more information, improved models. This time, though, OpenAI is tweaking the way you use ChatGPT itself. A new version of the service will let you prompt the bot by speaking aloud or by uploading a photo, rather than just typing in a sentence. The new features are rolling out to paying ChatGPT subscribers in the next two weeks, and everyone else will get them “soon after,” according to OpenAI.
OpenAI is also rolling out a new text-to-speech model, which it says can create “human-like audio from just text and a few seconds of sample speech.” You’ll be able to choose ChatGPT’s voice from five options, but OpenAI seems to think the model has far more potential than that. The company is working with Spotify to translate podcasts into other languages, for instance, while preserving the sound of the podcaster’s voice. The synthetic voice industry is filled with interesting possibilities, and OpenAI appears poised to become a part of it.
But the ability to build a convincing synthetic voice from just a few seconds of audio also opens the door to problematic use cases. “These capabilities also present new risks, such as the potential for malicious actors to impersonate public figures or commit fraud,” the company says in a blog post announcing the new features. For that reason, the model will initially be available only in a controlled fashion and restricted to specific use cases.
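For a sense of what programmatic text-to-speech looks like in practice, here is a minimal sketch against the text-to-speech endpoint in OpenAI’s public API. This is an illustration, not the restricted voice-cloning model described above: the public endpoint offers preset voices only, and the model name, voice, and output handling shown here are assumptions representing one reasonable configuration. It assumes the official `openai` Python SDK (v1+) and an `OPENAI_API_KEY` environment variable.

```python
# Minimal sketch: generating speech with OpenAI's public text-to-speech
# endpoint. The public API uses preset voices only; the few-seconds
# voice-cloning capability described in the article is not exposed here.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.audio.speech.create(
    model="tts-1",   # OpenAI's text-to-speech model (assumed choice)
    voice="alloy",   # one of the preset voices
    input="Hello! This sentence will be rendered as human-like audio.",
)

# Write the returned audio bytes (MP3 by default) to disk.
with open("speech.mp3", "wb") as f:
    f.write(response.content)
```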
Image search, GPT-4, and the multimodal case for Google’s Gemini
The image search, meanwhile, works a bit like Google Lens. You snap a photo of whatever you’re interested in, and ChatGPT will try to suss out what you’re asking about and respond accordingly. You can also use the app’s drawing tool to highlight the part of the image you mean, or add a spoken or typed question alongside the photo. Because you’re prompting the bot rather than running a search, you can refine the answer in the same conversation instead of starting a new query each time, much like what Google is doing with its multisearch feature.
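Developers can do something similar outside the app: GPT-4’s vision-capable chat endpoint accepts images alongside text in a single prompt. A minimal sketch, assuming the official `openai` Python SDK (v1+), an `OPENAI_API_KEY` environment variable, a vision-capable model name, and a placeholder image URL:

```python
# Minimal sketch: asking a vision-capable GPT-4 model about a photo via
# the chat completions API. The model name and image URL are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable model
    messages=[
        {
            "role": "user",
            # Text and image travel as separate content parts in one message.
            "content": [
                {"type": "text", "text": "What is shown in this photo?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/photo.jpg"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)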
The most recent GPT-4 is a language model created using huge amounts of text from various sources around the web. But animal and human intelligence draws on many other types of sensory data, including audio and visual information, and creating more advanced AI may require feeding algorithms those signals as well.
Google’s next major AI model, Gemini, is widely rumored to be “multimodal,” meaning it will be able to handle more than just text, perhaps allowing video, images, and voice inputs. “From a model performance standpoint, intuitively we would expect multimodal models to outperform models trained on a single modality,” says Trevor Darrell, a professor at UC Berkeley and a cofounder of Prompt AI, a startup working on combining natural language with image generation and manipulation. “If we build a model using just language, no matter how powerful it is, it will only learn language.”