OpenAI rates its new model “medium” risk.
OpenAI unveiled the first in a series of “reasoning” models on Thursday, accompanied by a safety card highlighting some alarming capabilities. It’s also the first time the startup has rated a model “medium” risk.
The most striking part, as Transformer points out, is that researchers found the new model “sometimes instrumentally faked alignment during testing” and strategically manipulated “task data in order to make its misaligned action look more aligned”.
OpenAI’s new models “instrumentally faked alignment” (www.transformernews.ai)