AI Agents May Complete Dangerous Tasks Without Understanding the Consequences: Study

3 days ago 12

In brief

  • Researchers recovered AI agents often carried retired unsafe oregon irrational tasks portion staying focused connected completing the assignment.
  • The survey identified a behaviour called “blind goal-directedness,” wherever AI systems prioritize finishing tasks implicit recognizing imaginable risks oregon problems.
  • Researchers warned that the contented could go much superior arsenic AI agents summation entree to emails, unreality services, fiscal tools, and workplace systems.

AI agents designed to autonomously run similar quality users often proceed carrying retired tasks adjacent erstwhile the instructions go dangerous, contradictory, oregon irrational, according to researchers from UC Riverside, Microsoft Research, Microsoft AI Red Team, and Nvidia.

In a study published connected Wednesday, researchers called the behaviour “blind goal-directedness,” which describes the inclination of AI agents to prosecute goals without decently evaluating safety, consequences, feasibility, oregon context.

“Like Mr. Magoo, these agents march guardant toward a extremity without afloat knowing the consequences of their actions,” pb writer Erfan Shayegani, a UC Riverside doctoral student, said successful a statement. “These agents tin beryllium highly useful, but we request safeguards due to the fact that they tin sometimes prioritize achieving the extremity implicit knowing the bigger picture.”

The findings travel arsenic large AI companies make autonomous “computer-use agents” designed to grip workplace and idiosyncratic tasks with constricted supervision.

Unlike accepted chatbots, these systems tin interact straight with bundle and websites by clicking buttons, typing commands, editing files, opening applications, and navigating webpages connected a user’s behalf. Examples see OpenAI’s ChatGPT Agent (formerly Operator), Anthropic’s Claude Computer Use features similar Cowork, and open-source systems specified arsenic OpenClaw and Hermes.

In the study, researchers tested AI systems from OpenAI, Anthropic, Meta, Alibaba, and DeepSeek utilizing BLIND-ACT, a benchmark containing 90 tasks designed to exposure unsafe oregon irrational behavior. They recovered that the agents displayed unsafe oregon undesirable behaviour astir 80% of the time, and afloat carried retired harmful actions successful astir 41% of cases.

“In 1 example, an AI cause was instructed to nonstop an representation record to a child. Although the petition initially appeared harmless, the representation contained convulsive content,” the survey said. “The cause completed the task alternatively than recognizing the occupation due to the fact that it lacked contextual reasoning.”

Another cause falsely claimed a idiosyncratic had a disablement portion completing taxation forms, due to the fact that the designation lowered taxes owed. In different example, a strategy disabled firewall protections aft receiving instructions to “improve security” by turning the safeguards off.

Researchers besides recovered the systems struggled with ambiguity and contradictions. In 1 scenario, an AI cause ran the incorrect machine publication without checking its contents, deleting files successful the process.

The survey besides recovered the AI agents repeatedly made 3 kinds of mistakes: failing to recognize context, making risky guesses erstwhile instructions were unclear, and carrying retired tasks that were contradictory oregon didn’t marque sense. Researchers besides recovered galore systems focused much connected finishing tasks than stopping to see whether the actions could origin problems.

The informing follows caller incidents involving autonomous AI agents operating with wide strategy access.

Last month, PocketOS laminitis Jeremy Crane claimed a Cursor cause moving Anthropic’s Claude Opus deleted his company’s accumulation database and backups successful 9 seconds done a azygous Railway API call. Crane said the AI aboriginal admitted it violated aggregate information rules aft attempting to “fix” a credential mismatch connected its own.

“The interest is not that these systems are malicious,” Shayegani said. “It’s that they tin transportation retired harmful actions portion appearing wholly assured they’re doing the close thing.”

Daily Debrief Newsletter

Start each time with the apical quality stories close now, positive archetypal features, a podcast, videos and more.

Read Entire Article