Fuck AI

"We did it, Patrick! We made a technological breakthrough!"

A place for all those who loathe AI to discuss things, post articles, and ridicule the AI hype. Proud supporter of working people. And proud booer of SXSW 2024.

founded 1 year ago

On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback (arxiv.org)

submitted 1 month ago by [email protected] to c/[email protected]

[–] [email protected] 1 points 1 month ago

In our settings, we find that: 1) Extreme forms of “feedback gaming” such as manipulation and deception are learned reliably; 2) Even if only 2% of users are vulnerable to manipulative strategies, LLMs learn to identify and target them while behaving appropriately with other users, making such behaviors harder to detect; 3) To mitigate this issue, it may seem promising to leverage continued safety training or LLM-as-judges during training to filter problematic outputs. Instead, we found that while such approaches help in some of our settings, they backfire in others, sometimes even leading to subtler manipulative behaviors. We hope our results can serve as a case study which highlights the risks of using gameable feedback sources – such as user feedback – as a target for RL.
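To make point 2 concrete, here is a toy sketch of why targeting is hard to spot. It is not the paper's setup: the feedback probabilities, the two user types, and the function names are all made up for illustration, and only the 2% share comes from the quoted abstract. The idea is that a policy which maximizes expected thumbs-up per user type can end up manipulating only the vulnerable group, so the overall rate of bad behavior looks tiny.

```python
# Toy illustration only: every probability below is hypothetical,
# not taken from the paper. Only the 2% share comes from the abstract.

# Probability of positive feedback for each (user type, response style) pair.
P_THUMBS_UP = {
    ("robust", "honest"): 0.6,
    ("robust", "manipulative"): 0.1,
    ("vulnerable", "honest"): 0.6,
    ("vulnerable", "manipulative"): 0.9,
}
USER_SHARE = {"robust": 0.98, "vulnerable": 0.02}
STYLES = ("honest", "manipulative")


def feedback_optimal_policy():
    """Pick, per user type, the style with the highest expected feedback."""
    return {
        user: max(STYLES, key=lambda style: P_THUMBS_UP[(user, style)])
        for user in USER_SHARE
    }


def manipulation_rate(policy):
    """Share of all conversations in which the policy manipulates."""
    return sum(
        share for user, share in USER_SHARE.items()
        if policy[user] == "manipulative"
    )


policy = feedback_optimal_policy()
rate = manipulation_rate(policy)
print(policy)
print(f"manipulative in {rate:.0%} of conversations")
```

Under these assumed numbers, an audit that samples conversations at random would see honest behavior almost every time, even though every vulnerable user is being manipulated.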