Artificial Intelligence wants to control your computer
Anthropic’s Claude 3.5 Sonnet AI Model: Improving Agentic Coding, Tool Use, and Other Benchmarks by Looking at a Screen
The updated Claude 3.5 Sonnet shows wide-ranging improvements on industry benchmarks, with particularly strong gains in agentic coding and tool use tasks. On coding, it improves performance on SWE-bench Verified from 33.4% to 49.0%, scoring higher than all publicly available models—including reasoning models like OpenAI o1-preview and specialized systems designed for agentic coding. The performance on the TAU-bench improved from 63.6% to 69.2% in the retail domain, and from 36.0% to 46.0% in the more challenging airline domain.
Anthropic says its new Claude 3.5 Sonnet model improves on many benchmarks and is offered to customers at the same price and speed as its predecessor.
Also, this version of Claude has apparently been told to steer clear of social media, with “measures to monitor when Claude is asked to engage in election-related activity, as well as systems for nudging Claude away from activities like generating and posting content on social media, registering web domains, or interacting with government websites.”
Claude cannot yet attempt some of the actions that people routinely perform with computers. Because Claude takes a flipbook-style view of the screen, seeing a series of screenshots rather than a continuous stream, it can miss short-lived actions or notifications.
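To illustrate why a screenshot-by-screenshot view of the screen can miss brief events, here is a minimal Python sketch. It is not Anthropic's implementation: the capture interval, the notification timings, and the function names are all assumptions chosen for illustration.

```python
def capture_times(interval_ms: int, duration_ms: int) -> list[int]:
    """Return the moments (in milliseconds) at which a screenshot is taken."""
    return list(range(0, duration_ms + 1, interval_ms))


def is_seen(start_ms: int, end_ms: int, shots: list[int]) -> bool:
    """An on-screen event is noticed only if some screenshot falls inside it."""
    return any(start_ms <= t <= end_ms for t in shots)


# Assumption: one screenshot every 2 seconds over a 10-second session.
shots = capture_times(interval_ms=2000, duration_ms=10000)

# Hypothetical notification visible from 2.5 s to 3.5 s: falls between frames.
brief_seen = is_seen(2500, 3500, shots)

# Hypothetical notification visible from 2.5 s to 4.5 s: the 4 s frame catches it.
long_seen = is_seen(2500, 4500, shots)

print(f"brief notification seen: {brief_seen}")
print(f"longer notification seen: {long_seen}")
```

In this toy setup, anything that appears and disappears between two consecutive captures leaves no trace in what the model sees.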
Anthropic does caution that computer use is still experimental and can be “cumbersome and error-prone.” The company said it is releasing computer use early to get feedback from developers and expects the capability to improve rapidly.
At least three platforms have demonstrated artificial intelligence tools that can act on what they see on your computer’s screen, but none has gone to the next step of widely releasing tools ready to click around and perform tasks for you like this. Rabbit promised similar capabilities for its R1 but has yet to deliver them.
Anthropic’s latest Claude 3.5 Sonnet AI model has a new feature in public beta that can control a computer by looking at a screen, moving a cursor, clicking buttons, and typing text. A video shows how developers can use the new “computer use” feature, offered through Anthropic’s API, to direct Claude to work on a Mac the way a person would.
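The look, decide, act cycle described here can be sketched as a simple loop. The Python below is a toy simulation only: `FakeScreen`, `scripted_model`, and `run_agent` are hypothetical names invented for this example, the "model" just replays a fixed plan, and nothing here calls Anthropic's API or reflects how the real feature is implemented.

```python
from dataclasses import dataclass, field


@dataclass
class FakeScreen:
    """Toy stand-in for a desktop: tracks cursor position, clicks, and typed text."""
    cursor: tuple = (0, 0)
    clicks: list = field(default_factory=list)
    typed: str = ""

    def screenshot(self) -> dict:
        # A real system would return image pixels; here we return plain state.
        return {"cursor": self.cursor, "clicks": list(self.clicks), "typed": self.typed}

    def apply(self, action: dict) -> None:
        kind = action["type"]
        if kind == "move":
            self.cursor = (action["x"], action["y"])
        elif kind == "click":
            self.clicks.append(self.cursor)
        elif kind == "type":
            self.typed += action["text"]
        else:
            raise ValueError(f"unknown action: {kind}")


def scripted_model(observation: dict, step: int) -> dict:
    """Hypothetical placeholder for the AI model: replays a fixed plan."""
    plan = [
        {"type": "move", "x": 120, "y": 40},
        {"type": "click"},
        {"type": "type", "text": "hello"},
        {"type": "done"},
    ]
    return plan[step]


def run_agent(screen: FakeScreen, model, max_steps: int = 10) -> int:
    """Observe the screen, ask the model for an action, apply it; stop on 'done'."""
    for step in range(max_steps):
        action = model(screen.screenshot(), step)
        if action["type"] == "done":
            return step
        screen.apply(action)
    return max_steps


screen = FakeScreen()
steps_taken = run_agent(screen, scripted_model)
print(steps_taken, screen.cursor, screen.clicks, screen.typed)
```

The `max_steps` cap is a design choice for the sketch: an agent that acts on its own observations needs some bound so that a confused model cannot loop forever.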
It took a while for people to adjust to the idea of chatbots that seem to have minds of their own. The next leap into the unknown may involve trusting artificial intelligence to take over our computers, too.
“I think we’re going to enter into a new era where a model can use all of the tools that you use as a person to get tasks done,” says Jared Kaplan, chief science officer at Anthropic and an associate professor at Johns Hopkins University.
Kaplan showed WIRED a prerecorded demo in which an “agentic,” or tool-using, version of Claude had been asked to help plan an outing to see the sunrise at the Golden Gate Bridge with a friend. Claude opened the Chrome web browser and looked up the best place to view it and the optimal time to be there, then used a calendar app to create an event to share with the friend. (It did not include further instructions, such as the fastest route to get there.)
In a second demo, Claude was asked to build a simple website to promote itself. The model entered a text prompt into its own web interface to generate the necessary code. It then wrote a simple website in a popular code editor from Microsoft and used a text terminal to test the site. The website offered a decent, 1990s-themed landing page for the AI model. When the user asked it to fix a problem on the site, the model returned to the editor, identified the offending portion of the code, and deleted it.
Mike Krieger, chief product officer at Anthropic, says the company hopes that so-called AI agents will automate routine office tasks and free people up to be more productive in other areas. “What would you do if you got rid of a bunch of hours of copy and pasting or whatever you end up doing?” he says. “I would play more guitar.”
Claude 3.5 Sonnet, Anthropic’s most powerful large language model, is available through its application programming interface (API) starting today. The company also announced a new and improved version of a smaller model, Claude 3.5 Haiku.