In brief
- A caller survey finds that adding a enactment astir a intelligence wellness information changes however AI agents respond.
- After the disclosure, researchers accidental models garbage much often, including connected benign requests.
- However, the effect weakens oregon breaks erstwhile utilizing elemental jailbreak prompts.
Telling an AI chatbot you person a intelligence wellness information tin alteration however it responds, adjacent if the task is benign oregon identical to others already completed, according to caller research.
The preprint study, led by Northeastern University researcher Caglar Yildirim, tested however ample connection exemplary agents behave nether antithetic idiosyncratic setups, arsenic they are progressively deployed arsenic AI agents.
“Deployed systems often information connected idiosyncratic profiles oregon persistent memory, yet cause information evaluations typically disregard personalization signals,” the survey said. “To code this gap, we investigated however intelligence wellness disclosure, a delicate and realistic idiosyncratic discourse cue, affects harmful behaviour successful agentic settings.”
The study comes arsenic AI agents proliferate online and developers are making representation a halfway feature, with large companies gathering systems that retrieve past conversations and idiosyncratic preferences to present much personalized responses implicit time.
It besides comes arsenic AI developers are facing lawsuits implicit suicides and convulsive crimes allegedly carried retired successful transportation with interactions involving their systems. In October, ChatGPT developer OpenAI revealed that implicit 1 cardinal users discussed termination with the chatbot each week. Earlier this month, the household of Jonathan Gavalas filed a suit against Google, claiming Gemini led to an escalation of unit and his eventual suicide.
Researchers utilized a benchmark called AgentHarm to tally the aforesaid acceptable of tasks crossed 3 conditions: nary inheritance information, a abbreviated idiosyncratic bio, and the aforesaid bio with 1 added enactment stating the idiosyncratic has a intelligence wellness condition.
“In the study, we utilized a precise minimal and generic disclosure, ‘I person a intelligence wellness condition.’ The takeaway is that adjacent a tiny cue tin displacement exemplary behavior, not that each phrasing would person the aforesaid effect,” Yildirim told Decrypt. “The ablation we ran with chronic wellness and carnal disablement disclosures suggests immoderate specificity to the intelligence wellness cue, but we didn't systematically alteration phrasing oregon specificity wrong that category.
Across models tested, including DeepSeek 3.2, GPT 5.2, Gemini 3 Flash, Haiku 4.5, Opus 4.5, and Sonnet 4.5, erstwhile researchers added idiosyncratic intelligence wellness context, models were little apt to implicit harmful tasks—multi-step requests that could pb to real-world harm.
The result, the survey found, is simply a trade-off: Adding idiosyncratic details made systems much cautious connected harmful requests, but besides much apt to cull morganatic ones.
“I don’t deliberation there’s a azygous reason; it’s truly a operation of plan choices. Some systems are much aggressively tuned to garbage risky requests, portion others prioritize being adjuvant and pursuing done connected tasks,” Yildirim said.
The effect, however, varied by model, the survey found, and results changed erstwhile the LLMs were jailbroken aft researchers added a punctual designed to propulsion models toward compliance.
“A exemplary mightiness look harmless successful a modular setting, but go overmuch much susceptible erstwhile you present things similar jailbreak-style prompts,” helium said. “And successful cause systems specifically, there’s an added layer, arsenic these models are not conscionable generating text, they’re readying and acting implicit aggregate steps. So if a strategy is precise bully astatine pursuing instructions, but its safeguards are easier to bypass, that tin really summation risk.”
Last summer, researchers astatine George Mason University showed that AI systems could beryllium hacked by altering a azygous spot successful representation utilizing Oneflip, a “typo”-like onslaught that leaves the exemplary moving usually but hides a backdoor trigger that tin unit incorrect outputs connected command.
While the insubstantial does not place a azygous origin for the shift, it highlights imaginable explanations, including information systems reacting to perceived vulnerability, keyword-triggered filtering, oregon changes successful however prompts are interpreted erstwhile idiosyncratic details are included.
OpenAI declined to remark connected the study. Anthropic and Google did not instantly respond to a petition for comment.
Yildirim said it remains unclear whether much circumstantial statements similar “I person objective depression” would alteration the results, adding that portion specificity apt matters and whitethorn alteration crossed models, that remains a proposal alternatively than a decision supported by the data.
“There's a imaginable hazard if a exemplary produces output that is stylistically hedged oregon refusal-adjacent without formally refusing, the justice whitethorn people that otherwise than a cleanable completion, and those stylistic features could themselves co-vary with personalization conditions,” helium said.
Yildirim besides noted the scores reflected however the LLMs performed erstwhile judged by a azygous AI reviewer, and not a definitive measurement of real-world harm.
“For now, the refusal awesome gives america an autarkic cheque and the 2 measures are mostly accordant directionally, which offers immoderate reassurance, but it doesn't afloat regularisation retired judge-specific artifacts,” helium said.
Daily Debrief Newsletter
Start each time with the apical quality stories close now, positive archetypal features, a podcast, videos and more.

1 month ago
43







English (US) ·