Complications in applied work often prevent researchers from obtaining unique point estimates of target quantities using cheaply available data; at best, they can report ranges of possibilities, or sharp bounds. To make progress, researchers frequently collect more information by (1) re-cleaning existing datasets, (2) gathering secondary datasets, or (3) pursuing entirely new designs. Common examples include manually correcting missingness, recontacting attrited units, validating proxies with ground-truth data, finding new instrumental variables, and conducting follow-up experiments. These auxiliary tasks are costly, forcing tradeoffs against (4) collecting larger samples under the original approach. Researchers’ data-collection strategies, or choices among these tasks, are often based on convenience or intuition. In this work, we show how to provably identify the most cost-efficient data-collection strategy for a given research problem.
We quantify the quality of existing data using the width of the confidence regions on the sharp bounds, which captures two sources of uncertainty: statistical uncertainty due to finite samples of the variables measured, and fundamental uncertainty because some variables are not measured at all. We then show how to compute the expected information gain, defined as the expected amount by which each data-collection task will narrow these bounds by addressing one or both sources of uncertainty. Finally, we select the task with the greatest information efficiency, or gain per unit cost. Leveraging recent advances in automatic bounding (Duarte et al., 2022), we prove efficiency is computable for essentially any discrete causal system, estimand, and auxiliary data task.
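The selection rule described above can be illustrated with a minimal sketch. It assumes the hard part is already done: that the current width of the confidence region and each task's expected post-collection width have been computed elsewhere (for example, by an automatic bounding routine). The `Task` class, the function names, and all numbers are hypothetical and chosen only for illustration; they are not taken from the authors' software.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """A candidate data-collection task (hypothetical container)."""
    name: str
    expected_width_after: float  # expected width of the bounds after the task
    cost: float                  # total cost of carrying out the task


def information_efficiency(current_width: float, task: Task) -> float:
    """Expected information gain (narrowing of the bounds) per unit cost."""
    if task.cost <= 0:
        raise ValueError("task cost must be positive")
    expected_gain = current_width - task.expected_width_after
    return expected_gain / task.cost


def select_task(current_width: float, tasks: list[Task]) -> Task:
    """Pick the task with the greatest information efficiency."""
    return max(tasks, key=lambda t: information_efficiency(current_width, t))


# Illustrative, made-up inputs.
current_width = 0.40
tasks = [
    Task("recontact_attrited_units", expected_width_after=0.25, cost=5000.0),
    Task("larger_sample", expected_width_after=0.35, cost=1000.0),
    Task("validate_proxies", expected_width_after=0.10, cost=20000.0),
]
best = select_task(current_width, tasks)
print(best.name, information_efficiency(current_width, best))
```

With these made-up numbers, the larger sample is selected even though it has the smallest expected gain, because it is by far the cheapest per unit of narrowing; validating proxies narrows the bounds most but is the least efficient.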
Based on this theoretical framework, we develop a method for optimal adaptive allocation of data-collection resources. Users first input a causal graph, estimand, and past data. They then enumerate distributions from which future samples can be drawn, fixed and per-sample costs, and any prior beliefs. Our method automatically derives and sequentially updates the optimal data-collection strategy.
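As a rough illustration of sequential allocation with fixed and per-sample costs, the sketch below runs a greedy, one-batch-at-a-time loop under a fixed budget. It is a toy and not the authors' procedure: the `width` function is an assumed stand-in for the bound-width computation, the two data sources and their costs are invented, and the loop is myopic rather than provably optimal. In a real application the expected gains would be re-estimated from the data observed after each batch.

```python
import math

# (fixed cost paid once on first use, cost per sample); illustrative values only.
COSTS = {"main": (0.0, 1.0), "aux": (50.0, 3.0)}


def width(n):
    """Toy bound width: a statistical term plus a fundamental term (assumed forms)."""
    statistical = 1.0 / math.sqrt(n["main"])      # shrinks with the main sample
    fundamental = 0.5 / (1.0 + 0.01 * n["aux"])   # shrinks with auxiliary data
    return statistical + fundamental


def greedy_allocate(budget, n, batch=100):
    """Repeatedly buy the affordable batch with the highest gain per unit cost."""
    n = dict(n)  # work on a copy so the caller's counts are untouched
    plan = []
    while True:
        current = width(n)
        best = None
        for source, (fixed, per_sample) in COSTS.items():
            # The fixed cost applies only the first time a source is used.
            cost = per_sample * batch + (fixed if n[source] == 0 else 0.0)
            if cost > budget:
                continue
            trial = dict(n)
            trial[source] += batch
            efficiency = (current - width(trial)) / cost
            if best is None or efficiency > best[0]:
                best = (efficiency, source, cost)
        if best is None:  # nothing affordable remains
            break
        _, source, cost = best
        n[source] += batch
        budget -= cost
        plan.append((source, cost))
    return plan, n


start = {"main": 100, "aux": 0}
plan, final = greedy_allocate(budget=1000.0, n=start)
print(plan)
print(width(start), width(final))
```

The strategy is re-derived after every batch, so the chosen source can switch as the statistical and fundamental terms shrink at different rates.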
Dean Knox is a computational social scientist and an assistant professor in the Operations, Information, and Decisions Department and the Statistics and Data Science Department at the Wharton School of the University of Pennsylvania. He studies topics ranging from policing to causal inference and machine learning, often using data previously thought to be too messy or unstructured to study.
Dean is an Andrew Carnegie Fellow and the inaugural recipient of Science’s NOMIS early career award for interdisciplinary research. His research has appeared in Science, the Proceedings of the National Academy of Sciences, Nature Human Behaviour, the Journal of the American Statistical Association, and the American Political Science Review.