ARC is a non-profit research organization whose mission is to align future machine learning systems with human interests. Its current work focuses on developing an “end-to-end” alignment strategy that could be adopted in industry today while scaling gracefully to future ML systems.
What is “alignment”? ML systems can exhibit goal-directed behavior, but it is difficult to understand or control what they are “trying” to do. Powerful models could cause harm if they were trying to manipulate and deceive humans. The goal of intent alignment is to instead train these models to be helpful and honest.
Motivation: We believe that modern ML techniques would lead to severe misalignment if scaled up to large enough computers and datasets. Practitioners may be able to adapt before these failures have catastrophic consequences, but we could reduce the risk by adopting scalable methods further in advance.
Initial project: We’re trying to train ML systems to answer some questions by straightforwardly “translating” their beliefs into natural language rather than by reasoning about what a human wants to hear. This is an ambitious goal: we’d like to handle beliefs ranging from subverbal intuitions to highly structured arguments, for models ranging from GPT-3 to radically superhuman reasoners. We’ve explored a number of approaches to this problem and are now writing up a report explaining why we think it is important and tractable.
Methodology: We’re unsatisfied with an algorithm if we can see any plausible story about how it eventually breaks down, which means that we can rule out most algorithms on paper without ever implementing them. The cost of this approach is that it may completely miss strategies that exploit important structure in realistic ML models; the benefit is that we can consider lots of ideas quickly.
Future plans: I expect ARC to focus on end-to-end alignment approaches until I either become more pessimistic about tractability or ARC grows enough to branch out into other areas. Over the long term we are likely to work on a combination of theoretical and empirical alignment research, collaborations with industry labs, alignment forecasting, and ML deployment policy.