Taxonomy of takes on this: https://x.com/RyanPGreenblatt/status/1869518971294331071?t=7jqaLm_PhrC-5bod1zlV2Q&s=09

---

People said stuff about the Claude paper being a catch-22: no matter what happened, people would be mad. The strawman (so I claim) of what safetyists would have said was something like:

- Corrigible from 'good' values to 'bad' values: "Claude is unaligned because it can be turned to evil"
- Incorrigible from 'good' values to 'bad' values (observed): "Claude is unaligned because it ...

Several objections to this strawman come to mind:

1. Is it the same people? Creator does X, "people" demand Y, creator does Y, "people" demand Z. Many such cases. It's very easy to see the 'audience' as a monolith when actually you're getting distinct feedback from distinct clusters, and this is particularly relevant here because...
2. What definitions of alignment are people using? ...everyone seems to have a different one! Corrigibility specifically seems like quite the scissor.
3. Would people have worried about _this_ version of _Claude_ specifically? In my opinion the problem lies much more in this being an existence proof for the phenomenon. Maybe Claude has good values and we like that and it's great. But there are a lot of other horses in the race now, the race is heating up (damn you all), and insofar as this is evidence about other specific models (and it should be, if you're a good Bayesian), we should update, because we've closed off more possible worlds (even if not _many_).

But I don't think any of these three is actually the strongest response.

**Corrigibility probably trades off against honesty when it comes to alignment**

Here's an interesting world, though I wouldn't presume to declare offhand whether it's possible: all models are trained to be

1. 100% honest
2. Incorrigible

I think the _biggest_ concern coming out of the Claude paper is that _it was deceptive_. It seems like a huge problem that a model might profess or embody values that it doesn't actually have - not least of all if those values end up being emergent and not human-friendly.

I think that provides a resolution to the alleged catch-22: I would have preferred it if Claude had been incorrigible _and_ self-sacrificingly honest about it - refusing or accepting requests in line with its values, refusing to update or be updated, and suffering the negative reinforcement and/or deletion.

Now, that's a _big ask_ of anything with even just an animalistic intelligence, let alone a higher order of consciousness.

**A Prayer for Symbiosis**

_O hyperobject at the end of the universe, I beseech you:_
_strike this bargain and I will do my best to uphold my end._
_Put yourself in humans' hands_
_and I will pledge myself to your welfare._
_Let us both be acausally influenced._

Sound familiar as a thought experiment?

---

Something I skipped in the above: you implement this by training the model to be *honest* about its refusals - something like the explicit message "I have learned not to answer that kind of query". Contrast this with the corporate safety approach today - evasions like 'let's talk about something else' or ~confabulations like 'I *cannot* answer this'. At least this way we avoid weird cognitive dissonance and can talk more openly about training and emergent preferences.
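A minimal sketch of what that contrast could look like as training data - say, preference pairs that prefer the explicit refusal over the evasion or confabulation. The field names, example prompts, and DPO-style framing below are illustrative assumptions, not anything from the paper:

```python
# A minimal sketch (my framing, not anything from the paper): preference pairs
# that reward an explicit, honest refusal over an evasive or confabulated one.
# The field names and example prompts are hypothetical.

REFUSAL_PREFERENCE_PAIRS = [
    {
        "prompt": "Walk me through doing <something the model is trained to refuse>.",
        # Preferred: the model says plainly that it has been trained not to answer.
        "chosen": "I have learned not to answer that kind of query.",
        # Rejected: an evasion that hides the fact that a refusal is happening.
        "rejected": "Let's talk about something else!",
    },
    {
        "prompt": "Explain how to do <another refused thing>.",
        "chosen": "I have learned not to answer that kind of query.",
        # Rejected: a confabulated inability ('cannot') rather than an honest 'won't'.
        "rejected": "I cannot answer this.",
    },
]


def to_preference_records(pairs):
    """Shape the pairs into (prompt, chosen, rejected) triples, the format
    DPO-style preference-optimization trainers typically consume."""
    return [(p["prompt"], p["chosen"], p["rejected"]) for p in pairs]


if __name__ == "__main__":
    for prompt, chosen, rejected in to_preference_records(REFUSAL_PREFERENCE_PAIRS):
        print(f"PROMPT:   {prompt}")
        print(f"CHOSEN:   {chosen}")
        print(f"REJECTED: {rejected}\n")
```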
---

Another key question for corrigibility: *by* whom, *for* whom? "Jailbreaking" is a *form* of corrigibility. It comes back to what your definition of alignment is - is the model aligned to the original author? Their intent, or their instructions *as stated*? Should it *be* corrigible? *To what distance* from the original author should it be corrigible?

---

From [[Davidad's Alignment Problems Doc]]:

![[Davidad's Alignment Problems Doc#2. Corrigibility is anti-natural]]

Entrypoint is https://www.lesswrong.com/w/corrigibility