ARC is a non-profit research organization whose mission is to align future machine learning systems with human interests. Its current work focuses on developing an “end-to-end” alignment strategy that could be adopted in industry today while scaling gracefully to future ML systems.
What is “alignment”? ML systems can exhibit goal-directed behavior, but it is difficult to understand or control what they are “trying” to do. Powerful models could cause harm if they were trying to manipulate and deceive humans. The goal of intent alignment is to instead train these models to be helpful and honest.
Motivation: I believe that modern ML techniques would lead to severe misalignment if scaled up to large enough computers and datasets. Practitioners may be able to adapt before these failures have catastrophic consequences, but we could reduce the risk by adopting scalable methods further in advance.
Initial project: I’m searching for strategies to represent what models “believe” in a way that is legible to human operators but more efficient than natural language. This is an ambitious goal: I’d like to handle beliefs ranging from subverbal intuitions to highly structured arguments, for models ranging from GPT-3 to radically superhuman reasoners. I expect that making models’ beliefs legible will be an important ingredient in many different approaches, although I’m focused on the particular set of goals and constraints that this work must satisfy to be a useful building block in my own approach to alignment.
Methodology: I’m unsatisfied with an algorithm if I feel like there’s any plausible story about how it eventually breaks down, which means that I can rule out most algorithms on paper without ever implementing them. The cost of this approach is that it may completely miss strategies that exploit important structure in realistic ML models; the benefit is that I can consider lots of ideas quickly.
Future plans: I expect ARC to focus on end-to-end alignment approaches until I either become more pessimistic about tractability or ARC grows enough to branch out into other areas. Over the long term we are likely to work on a combination of theoretical and empirical alignment research, collaborations with industry labs, alignment forecasting, and ML deployment policy.