Andreas Stuhlmüller

I'm cofounder and CEO of Elicit, the AI research assistant. Our mission at Elicit is to scale up good reasoning. Previously I was cofounder of Ought, a non-profit doing research on using machine learning to support deliberation. Before that, I was a researcher in Noah Goodman's Computation & Cognition lab at Stanford and a Ph.D. student in the Department of Brain and Cognitive Sciences at MIT, working with Josh Tenenbaum.

The question that drives my work is: How can we better solve our problems? Machines are potentially much better at answering hard questions if such questions can be posed formally. For example, given a formal model that describes how diseases cause symptoms and a formalized prior belief about disease rates, machines can reason precisely about what observed symptoms imply.

In its various aspects, this question touches on mathematics, computer science, psychology, and philosophy:

Read more about my interests, publications and projects.


Factored Verification: Detecting and Reducing Hallucination in Summaries of Academic Papers abstract bibtex
Charlie George, Andreas Stuhlmüller (2023)
Published at the Second Workshop on Information Extraction from Scientific Publications (WIESP) at IJCNLP-AACL 2023

Hallucination plagues even frontier LLMs--but how bad is it really for summarizing academic papers? We evaluate Factored Verification, a simple automated method for detecting hallucinations in abstractive summaries. This method sets a new SotA on hallucination detection in the summarization task of the HaluEval benchmark, achieving 76.2% accuracy. We then use this method to estimate how often language models hallucinate when summarizing across multiple academic papers and find 0.62 hallucinations in the average ChatGPT (16k) summary, 0.84 for GPT-4, and 1.55 for Claude 2. We ask models to self-correct using Factored Critiques and find that this lowers the number of hallucinations to 0.49 for ChatGPT, 0.46 for GPT-4, and 0.95 for Claude 2. The hallucinations we find are often subtle, so we advise caution when using models to synthesize academic papers.

Iterated Decomposition: Improving Science Q&A by Supervising Reasoning Processes abstract bibtex
Justin Reppert, Ben Rachbach, Charlie George, Luke Stebbing Jungwon Byun, Maggie Appleton, Andreas Stuhlmüller (2023)
Ought Technical Report. arXiv:2301.01751 [cs.CL]

Language models (LMs) can perform complex reasoning either end-to-end, with hidden latent state, or compositionally, with transparent intermediate state. Composition offers benefits for interpretability and safety, but may need workflow support and infrastructure to remain competitive. We describe iterated decomposition, a human-in-the-loop workflow for developing and refining compositional LM programs. We improve the performance of compositions by zooming in on failing components and refining them through decomposition, additional context, chain of thought, etc. To support this workflow, we develop ICE, an open-source tool for visualizing the execution traces of LM programs. We apply iterated decomposition to three real-world tasks and improve the accuracy of LM programs over less compositional baselines: describing the placebo used in a randomized controlled trial (25% to 65%), evaluating participant adherence to a medical intervention (53% to 70%), and answering NLP questions on the Qasper dataset (38% to 69%). These applications serve as case studies for a workflow that, if automated, could keep ML systems interpretable and safe even as they scale to increasingly complex tasks.

Supervise Process, Not Outcomes abstract bibtex
Andreas Stuhlmüller and Jungwon Byun (2022)
Ought Blog Post.

We can think about machine learning systems on a spectrum from process-based to outcome-based: Process-based systems are built on human-understandable task decompositions, with direct supervision of reasoning steps. Outcome-based systems are built on end-to-end optimization, with supervision of final results. This post explains why Ought is devoted to process-based systems. The argument is: (1) In the short term, process-based ML systems have better differential capabilities: They help us apply ML to tasks where we don’t have access to outcomes. These tasks include long-range forecasting, policy decisions, and theoretical research. (2) In the long term, process-based ML systems help avoid catastrophic outcomes from systems gaming outcome measures and are thus more aligned. (3) Both process- and outcome-based evaluation are attractors to varying degrees: Once an architecture is entrenched, it’s hard to move away from it. This lock-in applies much more to outcome-based systems. (4) Whether the most powerful ML systems will primarily be process-based or outcome-based is up in the air. (5) So it’s crucial to push toward process-based training now.

Factored Cognition Primer bibtex
Andreas Stuhlmüller, Justin Reppert, and Luke Stebbing (2022)
Web Book.

RAFT: A Real-World Few-Shot Text Classification Benchmark abstract bibtex
Neel Alex, Eli Lifland, Lewis Tunstall, Abhishek Thakur, Pegah Maham, C. Jess Riedel, Emmie Hine, Carolyn Ashurst, Paul Sedille, Alexis Carlier, Michael Noetel, and Andreas Stuhlmüller (2021)
Advances in Neural Information Processing Systems (NeurIPS) 35.

Large pre-trained language models have shown promise for few-shot learning, completing text-based tasks given only a few task-specific examples. Will models soon solve classification tasks that have so far been reserved for human research assistants? Existing benchmarks are not designed to measure progress in applied settings, and so don't directly answer this question. The RAFT benchmark (Real-world Annotated Few-shot Tasks) focuses on naturally occurring tasks and uses an evaluation setup that mirrors deployment. Baseline evaluations on RAFT reveal areas current techniques struggle with: reasoning over long texts and tasks with many classes. Human baselines show that some classification tasks are difficult for non-expert humans, reflecting that real-world value sometimes depends on domain expertise. Yet even non-expert human baseline F1 scores exceed GPT-3 by an average of 0.11. The RAFT datasets and leaderboard will track which model improvements translate into real-world benefits at

Evaluating Arguments One Step at a Time abstract bibtex
William Saunders, Ben Rachbach, Owain Evans, Zachary Miller, Jungwon Byun, and Andreas Stuhlmüller (2020)
Ought Technical Report.

We’re studying factored cognition: under what conditions can a group of people accomplish complex cognitive tasks if each person only has minimal context?
In a recent experiment, we focused on dividing up the task of evaluating arguments. We created short, structured arguments for claims about movie reviews. We then tried to distinguish valid from invalid arguments by showing each participant only one step of the argument, not the review or the other steps.
In this experiment, we found that:
- Factored evaluation of arguments can distinguish some valid from invalid arguments by identifying implausible steps in arguments for false claims.
- However, experiment participants disagreed a lot about whether steps were valid or invalid. This method is therefore brittle in its current form, even for arguments which only have 1–5 steps.
- More diverse argument and evidence types (besides direct quotes from the text), larger trees, and different participant guidelines should improve results.
In this technical progress update, we describe these findings in depth.

Predicting Human Deliberative Judgments with Machine Learning abstract bibtex
Owain Evans, Andreas Stuhlmüller, Chris Cundy, Ryan Carey, Zachary Kenton, Thomas McGrath, Andrew Schreiber (2018)
FHI Technical Report.

Machine Learning (ML) has been successful in automating a range of cognitive tasks that humans solve effortlessly and quickly. Yet many realworld tasks are difficult and slow: people solve them by an extended process that involves analytical reasoning, gathering external information, and discussing with collaborators. Examples include medical advice, judging a criminal trial, and providing personalized recommendations for rich content such as books or academic papers. There is great demand for automating tasks that require deliberative judgment. Current ML approaches can be unreliable: this is partly because such tasks are intrinsically difficult (even AI-complete) and partly because assembling datasets of deliberative judgments is expensive (each label might take hours of human work). We consider addressing this data problem by collecting fast judgments and using them to help predict deliberative (slow) judgments. Instead of having a human spend hours on a task, we might instead collect their judgment after 30 seconds or 10 minutes. These fast judgments are combined with a smaller quantity of slow judgments to provide training data. The resulting prediction problem is related to semi-supervised learning and collaborative filtering. We designed two tasks for the purpose of testing ML algorithms on predicting human deliberative judgments. One task involves Fermi estimation (back-of-the-envelope estimation) and the other involves judging the veracity of political statements. We collected a dataset of 25,000 judgments from more than 800 people. We define an ML prediction task for predicting deliberative judgments given a training set that also contains fast judgments. We tested a variety of baseline algorithms on this task. Unfortunately our dataset has serious limitations. Additional work is required to create a good testbed for predicting human deliberative judgments. This technical report explains the motivation for our project (which might be built on in future work) and explains how further work can avoid our mistakes. Our dataset and code is available at

Evaluating Compositionality in Sentence Embeddings abstract bibtex
Ishita Dasgupta, Demi Guo, Andreas Stuhlmüller, Samuel J. Gershman, Noah D. Goodman (2018)
Proceedings of the 40th Annual Conference of the Cognitive Science Society.

An important frontier in the quest for human-like AI is compositional semantics: how do we design systems that understand an infinite number of expressions built from a finite vocabulary? Recent research has attempted to solve this problem by using deep neural networks to learn vector space embeddings of sentences, which then serve as input to supervised learning problems like paraphrase detection and sentiment analysis. Here we focus on 'natural language inference' (NLI) as a critical test of a system's capacity for semantic compositionality. In the NLI task, sentence pairs are assigned one of three categories: entailment, contradiction, or neutral. We present a new set of NLI sentence pairs that cannot be solved using only word-level knowledge and instead require some degree of compositionality. We use state of the art sentence embeddings trained on NLI (InferSent, Conneau et al. (2017)), and find that performance on our new dataset is poor, indicating that the representations learned by this model fail to capture the needed compositionality. We analyze some of the decision rules learned by InferSent and find that they are largely driven by simple heuristics at the word level that are ecologically valid in the SNLI dataset on which InferSent is trained. Further, we find that augmenting the training dataset with our new dataset improves performance on a held-out test set without loss of performance on the SNLI test set. This highlights the importance of structured datasets in better understanding, as well as improving the performance of, AI systems.

Learning Physical Parameters from Dynamic Scenes abstract bibtex
Tomer D. Ullman, Andreas Stuhlmüller, Noah D. Goodman and Joshua B. Tenenbaum (2018)
Cognitive Psychology

Humans acquire their most basic physical concepts early in development, and continue to enrich and expand their intuitive physics throughout life as they are exposed to more and varied dynamical environments. We introduce a hierarchical Bayesian framework to explain how people can learn physical parameters at multiple levels. In contrast to previous Bayesian models of theory acquisition (Tenenbaum et al., 2011), we work with more expressive probabilistic program representations suitable for learning the forces and properties that govern how objects interact in dynamic scenes unfolding over time. We compare our model to human learners on a challenging task of estimating multiple physical parameters in novel microworlds given short movies. This task requires people to reason simultaneously about multiple interacting physical laws and properties. People are generally able to learn in this setting and are consistent in their judgments. Yet they also make systematic errors indicative of the approximations people might make in solving this computationally demanding problem with limited computational resources. We propose two approximations that complement the top-down Bayesian approach. One approximation model relies on a more bottom-up feature-based inference scheme. The second approximation combines the strengths of the bottom-up and top-down approaches, by taking the feature-based inference as its point of departure for a search in physical-parameter space.

Trial without Error: Towards Safe Reinforcement Learning via Human Intervention abstract bibtex
William Saunders, Girish Sastry, Andreas Stuhlmüller, Owain Evans (2018)
Proc. of the 17th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2018)

AI systems are increasingly applied to complex tasks that involve interaction with humans. During training, such systems are potentially dangerous, as they haven't yet learned to avoid actions that could cause serious harm. How can an AI system explore and learn without making a single mistake that harms humans or otherwise causes serious damage? For model-free reinforcement learning, having a human "in the loop" and ready to intervene is currently the only way to prevent all catastrophes. We formalize human intervention for RL and show how to reduce the human labor required by training a supervised learner to imitate the human's intervention decisions. We evaluate this scheme on Atari games, with a Deep RL agent being overseen by a human for four hours. When the class of catastrophes is simple, we are able to prevent all catastrophes without affecting the agent's learning (whereas an RL baseline fails due to catastrophic forgetting). However, this scheme is less successful when catastrophes are more complex: it reduces but does not eliminate catastrophes and the supervised learner fails on adversarial examples found by the agent. Extrapolating to more challenging environments, we show that our implementation would not scale (due to the infeasible amount of human labor required). We outline extensions of the scheme that are necessary if we are to train model-free agents without a single catastrophe.

Dialog Markets abstract bibtex
Andreas Stuhlmüller (2016)
Technical Report

People frequently ask Google questions like “What career is right for me?” “How can I find a boyfriend?” or “How can I make more money?” While Google can answer self-contained questions like “What is the capital of Turkey?” it cannot answer vague questions, even though these tend to be the questions people care about the most. Answering these questions requires background information about the person who asked. Good advice is tailored to the particulars of the asker’s situation. A natural way to elicit such background is through a dialog in which follow-up questions are asked. If you ask a doctor “Why do I have stomach pain?” they won’t just give you an answer. They will ask clarifying questions: “When did it start?” — “Where is it?” — “Is it constant or intermittent?” And only then they suggest possible causes. In this set of notes, I propose dialog markets, a mechanism for creating high-quality conversations that resolve vague questions. A dialog starts when an individual asks a question and pledges a reward that will be used to pay for helpful contributions. Humans and machines then contribute follow-up questions and other responses. The core problem that the market mechanism needs to solve is how to distribute reward over contributions such that we incentivize conversations that are as valuable as possible for the asker. We can view this as a particular set of questions that we can delegate to the market: “How much should the asker pay for this contribution?” This view gives rise to the grounding problem for dialog markets: incentives will be aligned if answers to this meta-level question are good, but that in turn depends on aligned incentives. I outline skome approaches towards addressing this problem. Finally, I suggest an implementation plan: I discuss how to ensure that early dialogs have sufficiently high quality to get the project off the ground and name a few potential initial applications.

Modeling Agents with Probabilistic Programs abstract bibtex
Owain Evans, Andreas Stuhlmüller, John Salvatier, and Daniel Filan (electronic)

This book describes and implements models of rational agents for (PO)MDPs and Reinforcement Learning. One motivation is to create richer models of human planning, which capture human biases and bounded rationality. Agents are implemented as differentiable functional programs in a probabilistic programming language based on Javascript. Agents plan by recursively simulating their future selves or by simulating their opponents in multi-agent games. Our agents and environments run directly in the browser and are easy to modify and extend. The book assumes basic programming experience but is otherwise self-contained. It includes short introductions to "planning as inference", MDPs, POMDPs, inverse reinforcement learning, hyperbolic discounting, myopic planning, and multi-agent planning.

Agent-Agnostic Human-in-the-Loop Reinforcement Learning abstract bibtex
David Abel, John Salvatier, Andreas Stuhlmüller, and Owain Evans (2016)
Future of Interactive Learning Machines Workshop at NIPS 2016

Providing Reinforcement Learning agents with expert advice can dramatically improve various aspects of learning. To this end, prior work has developed teaching protocols that enable agents to learn efficiently in complex environments. In many of these methods, the teacher’s guidance is tailored to agents with a particular representation or underlying learning scheme, offering effective but highly specialized teaching procedures. In this work, we introduce protocol programs, an agent-agnostic schema for Human-in-the-Loop Reinforcement Learning. Our goal is to incorporate the beneficial properties of a human teacher into Reinforcement Learning without making strong assumptions about the inner workings of the agent. We show how to represent existing approaches such as action pruning, reward shaping, and training in simulation as special cases of our schema, and conduct preliminary experiments evaluating the effectiveness of protocols in simple domains.

Learning the Preferences of Ignorant, Inconsistent Agents abstract bibtex
Owain Evans, Andreas Stuhlmüller, and Noah D. Goodman (2016)
Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI-2016)

An important use of machine learning is to learn what people value. What posts or photos should a user be shown? Which jobs or activities would a person find rewarding? In each case, observations of people's past choices can inform our inferences about their likes and preferences. If we assume that choices are approximately optimal according to some utility function, we can treat preference inference as Bayesian inverse planning. That is, given a prior on utility functions and some observed choices, we invert an optimal decision-making process to infer a posterior distribution on utility functions. However, people often deviate from approximate optimality. They have false beliefs, their planning is sub-optimal, and their choices may be temporally inconsistent due to hyperbolic discounting and other biases. We demonstrate how to incorporate these deviations into algorithms for preference inference by constructing generative models of planning for agents who are subject to false beliefs and time inconsistency. We explore the inferences these models make about preferences, beliefs, and biases. We present a behavioral experiment in which human subjects perform preference inference given the same observations of choices as our model. Results show that human subjects (like our model) explain choices in terms of systematic deviations from optimal behavior and suggest that they take such deviations into account when inferring preferences.

C3: Lightweight Incrementalized MCMC for Probabilistic Programs using Continuations and Callsite Caching abstract bibtex
Daniel Ritchie, Andreas Stuhlmüller, and Noah D. Goodman (2016)
Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS)

Lightweight, source-to-source transformation approaches to implementing MCMC for probabilistic programming languages are popular for their simplicity, support of existing deterministic code, and ability to execute on existing fast runtimes. However, they are also slow, requiring a complete re-execution of the program on every Metropolis Hastings proposal. We present a new extension to the lightweight approach, C3, which enables efficient, incrementalized re-execution of MH proposals. C3 is based on two core ideas: transforming probabilistic programs into continuation passing style (CPS), and caching the results of function calls. We show that on several common models, C3 reduces proposal runtime by 20-100x, in some cases reducing runtime complexity from linear in model size to constant. We also demonstrate nearly an order of magnitude speedup on a complex inverse procedural modeling application.

Learning the Preferences of Bounded Agents bibtex
Owain Evans, Andreas Stuhlmüller, and Noah D. Goodman (2015)
Poster at the NIPS 2015 Workshop on Bounded Optimality

Coarse-to-Fine Sequential Monte Carlo for Probabilistic Programs abstract bibtex
Andreas Stuhlmüller, Robert X.D. Hawkins, N. Siddharth, and Noah D. Goodman (2015)
Technical Report. arXiv:1509.02962 [cs.AI]

Many practical techniques for probabilistic inference require a sequence of distributions that interpolate between a tractable distribution and an intractable distribution of interest. Usually, the sequences used are simple, e.g., based on geometric averages between distributions. When models are expressed as probabilistic programs, the models themselves are highly structured objects that can be used to derive annealing sequences that are more sensitive to domain structure. We propose an algorithm for transforming probabilistic programs to coarse-to-fine programs which have the same marginal distribution as the original programs, but generate the data at increasing levels of detail, from coarse to fine. We apply this algorithm to an Ising model, its depth-from-disparity variation, and a factorial hidden Markov model. We show preliminary evidence that the use of coarse-to-fine models can make existing generic inference algorithms more efficient.

Why do you ask? Good questions provoke informative answers abstract bibtex
Robert X. D. Hawkins, Andreas Stuhlmüller, Judith Degen, and Noah D. Goodman (2015)
Proceedings of the Thirty-Seventh Annual Conference of the Cognitive Science Society

What makes a question useful? What makes an answer appropriate? In this paper, we formulate a family of increasingly sophisticated models of question-answer behavior within the Rational Speech Act framework. We compare these models based on three different pieces of evidence: first, we demonstrate how our answerer models capture a classic effect in psycholinguistics showing that an answerer’s level of informativeness varies with the inferred questioner goal, while keeping the question constant. Second, we jointly test the questioner and answerer components of our model based on empirical evidence from a question-answer reasoning game. Third, we examine a special case of this game to further distinguish among the questioner models. We find that sophisticated pragmatic reasoning is needed to account for some of the data. People can use questions to provide cues to the answerer about their interest, and can select answers that are informative about inferred interests.

Modeling Cognition with Probabilistic Programs: Representations and Algorithms abstract bibtex
Andreas Stuhlmüller (2015)
Ph.D. thesis, Massachusetts Institute of Technology

This thesis develops probabilistic programming as a productive metaphor for understanding cognition, both with respect to mental representations and the manipulation of such representations.

In the first half of the thesis, I demonstrate the representational power of probabilistic programs in the domains of concept learning and social reasoning. I provide examples of richly structured concepts, defined in terms of systems of relations, subparts, and recursive embeddings, that are naturally expressed as programs and show initial experimental evidence that they match human generalization patterns. I then proceed to models of reasoning about reasoning, a domain where the expressive power of probabilistic programs is necessary to formalize our intuitive domain understanding due to the fact that, unlike previous formalisms, probabilistic programs allow conditioning to be represented a model, not just applied to a model. I illustrate this insight with programs that model nested reasoning in game theory, artificial intelligence, and linguistics.

In the second half, I develop three inference algorithms with the dual intent of showing how to efficiently compute the marginal distributions defined by probabilistic programs, and providing building blocks for process-level accounts of human cognition. First, I describe a Dynamic Programming algorithm for computing the marginal distribution of discrete probabilistic programs by compiling to systems of equations and show that it can make inference in models of "reasoning about reasoning" tractable by merging and reusing subcomputations. Second, I introduce the setting of amortized inference and show how learning inverse models lets us leverage samples generated by other inference algorithms to compile probabilistic models into fast recognition functions. Third, I develop a generic approach to coarse-to-fine inference in probabilistic programs and provide evidence that it can speed up inference in models with large state spaces that have appropriate hierarchical structure.

Finally, I substantiate the claim that probabilistic programming is a productive metaphor by outlining new research questions that have been opened up by this line of investigation.

The Design and Implementation of Probabilistic Programming Languages abstract bibtex
Noah D. Goodman and Andreas Stuhlmüller (electronic)

Probabilistic programming languages (PPLs) unify techniques for the formal description of computation and for the representation and use of uncertain knowledge. PPLs have seen recent interest from the artificial intelligence, programming languages, cognitive science, and natural languages communities. This book explains how to implement PPLs by lightweight embedding into a host language. We illustrate this by designing and implementing WebPPL, a small PPL embedded in Javascript. We show how to implement several algorithms for universal probabilistic inference, including priority-based enumeration with caching, particle filtering, and Markov chain Monte Carlo. We use program transformations to expose the information required by these algorithms, including continuations and stack addresses. We illustrate these ideas with examples drawn from semantic parsing, natural language pragmatics, and procedural graphics.

Learning physics from dynamical scenes abstract bibtex
Tomer D. Ullman, Andreas Stuhlmüller, Noah D. Goodman, and Joshua B. Tenenbaum (2014)
Proceedings of the Thirty-Sixth Annual Conference of the Cognitive Science society

Humans acquire their most basic physical concepts early in development, but continue to enrich and expand their intuitive physics throughout life as they are exposed to more and varied dynamical environments. We introduce a hierarchical Bayesian framework to explain how people can learn physical theories across multiple timescales and levels of abstraction. In contrast to previous Bayesian models of theory acquisition (Tenenbaum, Kemp, Griffiths, & Goodman, 2011), we work with more expressive probabilistic program representations suitable for learning the forces and properties that govern how objects interact in dynamic scenes unfolding over time. We compare our model and human learners on a challenging task of inferring novel physical laws in microworlds given short movies. People are generally able to perform this task and behave in line with model predictions. Yet they also make systematic errors suggestive of how a top-down Bayesian approach to learning might be complemented by a more bottomup feature-based approximate inference scheme, to best explain theory learning at an algorithmic level.

Learning Stochastic Inverses abstract bibtex
Andreas Stuhlmüller, Jessica Taylor, and Noah D. Goodman (2013)
Advances in Neural Information Processing Systems (NIPS) 27

We describe a class of algorithms for amortized inference in Bayesian networks. In this setting, we invest computation upfront to support rapid online inference for a wide range of queries. Our approach is based on learning an inverse factorization of a model's joint distribution: a factorization that turns observations into root nodes. Our algorithms accumulate information to estimate the local conditional distributions that constitute such a factorization. These stochastic inverses can be used to invert each of the computation steps leading to an observation, sampling backwards in order to quickly find a likely explanation. We show that estimated inverses converge asymptotically in number of (prior or posterior) training samples. To make use of inverses before convergence, we describe the Inverse MCMC algorithm, which uses stochastic inverses to make block proposals for a Metropolis-Hastings sampler. We explore the efficiency of this sampler for a variety of parameter regimes and Bayes nets.

Knowledge and implicature: Modeling language understanding as social cognition abstract bibtex
Noah D. Goodman and Andreas Stuhlmüller (2013)
Topics in Cognitive Science

Is language understanding a special case of social cognition? To help evaluate this view, we can formalize it as the rational speech-act theory: listeners assume that speakers choose their utterances approximately optimally, and listeners interpret an utterance by using Bayesian inference to "invert" this model of the speaker. We apply this framework to model scalar implicature ("some" implies "not all", and "N" implies "not more than N"). This model predicts an interaction between the speaker's knowledge state and the listener's interpretation. We test these predictions in two experiments, and find good fit between model predictions and human judgements.

Reasoning about Reasoning by Nested Conditioning: Modeling Theory of Mind with Probabilistic Programs abstract bibtex
Andreas Stuhlmüller and Noah D. Goodman (2013)
Journal of Cognitive Systems Research

A wide range of human reasoning patterns can be explained as conditioning in probabilistic models; however, conditioning has traditionally been viewed as an operation applied to such models, not represented in such models. We describe how probabilistic programs can explicitly represent conditioning as part of a model. This enables us to describe reasoning about others' reasoning using nested conditioning. Much of human reasoning is about the beliefs, desires, and intentions of other people; we use probabilistic programs to formalize these inferences in a way that captures the flexibility and inherent uncertainty of reasoning about other agents. We express examples from game theory, artificial intelligence, and linguistics as recursive probabilistic programs and illustrate how this representation language makes it easy to explore new directions in each of these fields. We discuss the algorithmic challenges posed by these kinds of models and describe how Dynamic Programming techniques can help address these challenges.

Learning Stochastic Inverses for Adaptive Inference in Probabilistic Programs abstract bibtex
Andreas Stuhlmüller and Noah D. Goodman (2012)
Poster at the NIPS 2012 Workshop on Probabilistic Programming

We describe an algorithm for adaptive inference in probabilistic programs. During sampling, the algorithm accumulates information about the local probability distributions that compose the program's overall distribution. We use this information to construct targeted samples: given a value for an intermediate expression, we stochastically invert each of the steps giving rise to this value, sampling {\em backwards} in order to assign values to random choices such that we get a likely parse of the intermediate value. We propose this algorithm for importance sampling and as a means of constructing blocked proposals for a Metropolis-Hastings sampler.

A Dynamic Programming Algorithm for Inference in Recursive Probabilistic Programs abstract bibtex
Andreas Stuhlmüller and Noah D. Goodman (2012)
Second Statistical Relational AI workshop at UAI 2012 (StaRAI-12)

We describe a dynamic programming algorithm for computing the marginal distribution of discrete probabilistic programs. This algorithm takes a functional interpreter for an arbitrary probabilistic programming language and turns it into an efficient marginalizer. Because direct caching of sub-distributions is impossible in the presence of recursion, we build a graph of dependencies between sub-distributions. This factored sum-product network makes (potentially cyclic) dependencies between subproblems explicit, and corresponds to a system of equations for the marginal distribution. We solve these equations by fixed-point iteration in topological order. We illustrate this algorithm on examples used in teaching probabilistic models, computational cognitive science research, and game theory.

Knowledge and implicature: Modeling language understanding as social cognition abstract bibtex
Noah D. Goodman and Andreas Stuhlmüller (2012)
Proceedings of the Thirty-Fourth Annual Conference of the Cognitive Science Society
2012 Cognitive Science Society computational modeling prize for Language

Is language understanding a special case of social cognition? To help evaluate this view, we can formalize it as the rational speech-act theory: listeners assume that speakers choose their utterances approximately optimally, and listeners interpret an utterance by using Bayesian inference to "invert" this model of the speaker. We apply this framework to model scalar implicature ("some" implies "not all", and "N" implies "not more than N"). This model predicts an interaction between the speaker's knowledge state and the listener's interpretation. We test these predictions in two experiments, and find good fit between model predictions and human judgements.

Why blame Bob? Probabilistic generative models, counterfactual reasoning, and blame attribution abstract bibtex
John McCoy*, Tomer Ullman*, Andreas Stuhlmüller, Tobias Gerstenberg, and Joshua B. Tenenbaum (2012)
Proceedings of the Thirty-Fourth Annual Conference of the Cognitive Science Society

We consider an approach to blame attribution based on counterfactual reasoning in probabilistic generative models. In this view, people intervene on each variable within their model and assign blame in proportion to how much a change to a variable would have improved the outcome. This approach raises two questions: First, what structure do people use to represent a given situation? Second, how do they choose what alternatives to consider when intervening on an event? We use a series of coin-tossing scenarios to compare empirical data to different models within the proposed framework. The results suggest that people sample their intervention values from a prior rather than deterministically switching the value of a variable. The results further suggest that people represent scenarios differently when asked to reason about their own blame attributions, compared with the blame attributions they believe others will assign.

Inducing Probabilistic Programs by Bayesian Program Merging abstract bibtex
Irvin Hwang, Andreas Stuhlmüller, and Noah D. Goodman (2011)
Technical Report. arXiv:1110.5667 [cs.AI]

This report outlines an approach to learning generative models from data. We express models as probabilistic programs, which allows us to capture abstract patterns within the examples. By choosing our language for programs to be an extension of the algebraic data type of the examples, we can begin with a program that generates all and only the examples. We then introduce greater abstraction, and hence generalization, incrementally to the extent that it improves the posterior probability of the examples given the program. Motivated by previous approaches to model merging and program induction, we search for such explanatory abstractions using program transformations. We consider two types of transformation: Abstraction merges common subexpressions within a program into new functions (a form of anti-unification). Deargumentation simplifies functions by reducing the number of arguments. We demonstrate that this approach finds key patterns in the domain of nested lists, including parameterized sub-functions and stochastic recursion.

Nonstandard Interpretations of Probabilistic Programs for Efficient Inference abstract bibtex
David Wingate, Noah D. Goodman, Andreas Stuhlmüller, and Jeffrey M. Siskind (2011)
Advances in Neural Information Processing Systems (NIPS) 24

Probabilistic programming languages allow modelers to specify a stochastic process using syntax that resembles modern programming languages. Because the program is in machine-readable format, a variety of techniques from compiler design and program analysis can be used to examine the structure of the distribution represented by the probabilistic program. We show how \emph{nonstandard interpretations} of probabilistic programs can be used to craft efficient inference algorithms: information about the structure of a distribution (such as gradients or dependencies) is generated as a monad-like side computation while executing the program. These interpretations can be easily coded using special-purpose objects and operator overloading. We implement two examples of nonstandard interpretations in two different languages, and use them as building blocks to construct inference algorithms: automatic differentiation, which enables gradient based methods, and provenance tracking, which enables efficient construction of global proposals.

Lightweight Implementations of Probabilistic Programming Languages Via Transformational Compilation abstract bibtex
David Wingate, Andreas Stuhlmüller, and Noah D. Goodman (2011)
Proceedings of the 14th international conference on Artificial Intelligence and Statistics

We describe a general method of transforming arbitrary programming languages into probabilistic programming languages with straightforward MCMC inference engines. Random choices in the program are "named" with information about their position in an execution trace; these names are used in conjunction with a database holding values of random variables to implement MCMC inference in the space of execution traces. We encode naming information using lightweight source-to-source compilers. Our method enables us to reuse existing infrastructure (compilers, profilers, etc.) with minimal additional code, implying fast models with low development overhead. We illustrate the technique on two languages, one functional and one imperative: Bher, a compiled version of the Church language which eliminates interpretive overhead of the original MIT-Church implementation, and Stochastic Matlab, a new open-source language.

Learning Structured Generative Concepts abstract bibtex
Andreas Stuhlmüller, Joshua B. Tenenbaum, and Noah D. Goodman (2010)
Proceedings of the 32nd Annual Conference of the Cognitive Science Society

Many real world concepts, such as "car", "house", and "tree", are more than simply a collection of features. These objects are richly structured, defined in terms of systems of relations, subparts, and recursive embeddings. We describe an approach to concept representation and learning that attempts to capture such structured objects. This approach builds on recent probabilistic approaches, viewing concepts as generative processes, and on recent rule-based approaches, constructing concepts inductively from a language of thought. Concepts are modeled as probabilistic programs that describe generative processes; these programs are described in a compositional language. In an exploratory concept learning experiment, we investigate human learning from sets of tree-like objects generated by processes that vary in their abstract structure, from simple prototypes to complex recursions. We compare human categorization judgements to predictions of the true generative process as well as a variety of exemplar-based heuristics.

Learning Compositional Representations abstract bibtex
Andreas Stuhlmüller (2009)
Bachelor's thesis at the University of Osnabrück

How could an agent learn and reason in a complexly structured, stochastic world? This problem is at the center of both artificial intelligence and psychology. One candidate answer to this question is that both learning and reasoning can be explained as probabilistic inference over a language-like hypothesis space for generative models. The goal of this thesis is to describe what makes this approach plausible and to demonstrate the learning of generative models from structured observations in a simple world consisting of stochastic, tree-generating programs. In order to make program inference feasible, I derive a new class of adaptive importance sampling algorithms for probabilistic programs that let us compute the likelihood of structured observations under a given generative model. Using this algorithm in combination with Markov Chain Monte Carlo methods, I quantitatively show that we can reliably infer generative models from observations in our demonstration world.


I am one of the authors of WebPPL, a probabilistic programming language for the web.

I run Forest, a repository for generative models.

I contributed to MIT-Church, the original implementation of the Church probabilistic programming language, and to Bher, a lightweight compiler for Church.

I wrote Cosh, a Church implementation based on exact inference with aggressive sharing of subcomputations.


Probabilistic Programming Languages

Probabilistic programs express generative models, i.e., formal descriptions of processes that generate data. To the extent that we can mirror mechanisms “out there in the world” in a formal language, we can pose queries within this language and expect the answer to line up with what happens in the real world. For example, if we can accurately describe how genes give rise to proteins, this helps us in predicting which genetic changes lead to desirable results and which lead to harm.

Traditionally, Bayesian networks have been used to formalize and reason about such processes. While extremely useful, this formalism is limited in its expressive power compared to programming languages in general. Probabilistic programming languages like Church provide a substrate for computational modeling that brings together Bayesian statistics and Turing-complete programming languages.

Get started with the Probabilistic Models of Cognition tutorial.

Universal Inference Algorithms

A useful system for probabilistic programming needs to provide more than the ability to formally describe mechanisms: It must answer questions about such mechanisms, i.e., infer the consequences of conditioning these mechanisms on propositions of interest. Using Bayesian terminology, a probabilistic program specifies a prior distribution over return values. When we observe a value for a particular variable within the program, the goal of inference is to compute the posterior distribution on the unobserved variables. For example, given a program expressing a prior over diseases and symptoms, we may condition this program on the values of observed symptoms and compute the posterior distribution on diseases.

The problem of efficiently computing the posterior distribution provides a rich set of research directions: How do existing Bayes net inference algorithms like Metropolis-Hastings, importance sampling, and belief propagation generalize to the setting of probabilistic programs? How can we improve upon existing algorithms by making use of additional structure exposed by probabilistic programs? How can compilation techniques like flow analysis and abstract interpretation help? How can the task of inference in a particular program be formalized itself and treated as a meta-inference problem?

Read The Design and Implementation of Probabilistic Programming Languages and learn how to build a probabilistic programming system with priority-based enumeration, particle filtering, and MCMC inference algorithms.

Structure of Concepts

What is the structure of thought? The language of thought hypothesis proposes that thoughts are expressions in a mental language, and that thinking consists of syntactic manipulation of these expressions. The probabilistic language of thought hypothesis suggests that these representations have probabilistic meaning. If we treat concepts as stochastic programs and thinking as approximate Bayesian inference, can we predict human behavior in experimental settings better than using other approaches, e.g., prototype theory? If we learn more about the primitives and composition mechanisms used in the brain to represent abstract concepts, can this inform how we construct tools for reasoning?

Read my CogSci 2010 paper with Josh Tenenbaum and Noah Goodman on concept learning experiments and modeling.

Decision Theory

When we attempt to solve any particular problem, we do so because we believe that, overall, this is what we should be doing; we believe that, taking everything we know into account, we prefer to solve this problem over doing anything else. This judgement can be mistaken: We can make errors in our reasoning, fail at introspection, and forget to consider consequences. Using our intuitions, we attempt to solve a particular decision problem—the problem that defines the correct answer to the question what should be done?—and our solution is only approximately correct.

In the long run, we want to improve at solving this decision problem. Autonomous intelligent systems, if ever built, need to contribute to solving this decision problem if the outcome is to be of moral worth. This seems to require that we improve upon our intuitive understanding of what this decision problem is. Currently, the most tractable approach to this seemingly impossible problem appears to be the study of decision theory and of related notions such as choice, control, and consequentialism.

Open source projects in JavaScript, Python and Scheme.
Shared photos and location updates.
Publications as indexed by Google Scholar.

Oakland, CA