Xander Davies


Hi, I’m Xander. I’m a member of the technical staff at the UK AI Safety Institute, where I work on AI safety and security. I previously studied computer science at Harvard, where I founded and led the Harvard AI Safety Team, a student group that supports students in conducting research to reduce risks from advanced AI. I’ve also worked with David Krueger’s lab at the University of Cambridge, David Bau’s lab at Northeastern University, and Redwood Research. I enjoy writing, chess, listening to music, and playing piano. If you’d like to discuss any of this, feel free to email me at alexander_davies [at] college [dot] harvard [dot] edu.


Google Scholar

* = Equal Contribution.

Casper, S.*, Davies, X.*, Shi, C., Gilbert, T.K., Scheurer, J., Rando, J., Freedman, R., Korbak, T., Lindner, D., Freire, P., Wang, T., Marks, S., Segerie, C.-R., Carroll, M., Peng, A., Christoffersen, P., Damani, M., Slocum, S., Anwar, U., Siththaranjan, A., Nadeau, M., Michaud, E.J., Pfau, J., Krasheninnikov, D., Chen, X., Langosco, L., Hase, P., Bıyık, E., Dragan, A., Krueger, D., Sadigh, D., Hadfield-Menell, D. (2023). Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback. arXiv preprint arXiv:2307.15217.

Davies, X.*, Nadeau, M.*, Prakash, N.*, Shaham, T., & Bau, D. (2023). Discovering Variable Binding Circuitry with Desiderata. In 2023 ICML Workshop on Deployment Challenges for Generative AI.

Li, M.*, Davies, X.*, & Nadeau, M.* (2023). Circuit Breaking: Removing Model Behaviors with Targeted Ablation. In 2023 ICML Workshop on Deployment Challenges for Generative AI.

Davies, X.*, Langosco, L.*, & Krueger, D. (2022). Unifying Grokking and Double Descent. In 2022 NeurIPS ML Safety Workshop.

Bricken, T., Davies, X., Singh, D., Krotov, D., & Kreiman, G. (2023). Sparse Distributed Memory is a Continual Learner. In ICLR 2023.

Writing & Press

Harvard Tech Review: Alexander Davies – 2023 Top Ten Seniors In Innovation

Harvard Crimson: Undergraduates Ramp Up Harvard AI Safety Team Amid Concerns Over Increasingly Powerful AI Models

Update on Harvard AI Safety Team and MIT AI Alignment

Toy Grokking with a Linear Model

Gradient Descent’s Implicit Bias on Separable Data

Announcing the Harvard AI Safety Team