The Harvard Data Science Initiative Competitive Research Fund provides targeted seed and bridge funding to Harvard faculty who propose novel methods, innovations, or solutions to data science challenges. Since 2017, the HDSI has provided over $1 million in funding across the University.
The American Communities Computable Newspaper Database
Melissa Dell (Harvard Faculty of Arts and Sciences)
The American Communities Computable Newspaper Database uses recent advances in deep learning, computer vision, and natural language processing to create a computable database for over 7,000 historical newspapers in all 50 states, spanning over 12 million newspaper editions. The database will provide straightforward-to-use outputs from natural language processing analyses conducted on full article texts, image captions, and headlines that can be used to elucidate how the media has influenced American society.
Discovery of higher temperature superconductors by machine-learning strongly correlated descriptors of materials
Xin Li (Harvard John A. Paulson School of Engineering and Applied Sciences)
The project aims to search for and design higher-temperature superconductors by combining high-throughput ab initio simulations with machine learning analyses that focus on novel descriptors of ultrafast electron dynamics. The project may also lead to an understanding of the unconventional superconducting mechanism.
Refining pre-disaster strategic preparedness: A machine learning model for identification of communities facing the highest health risks from an impending tropical cyclone
Rachel Nethery (Harvard T.H. Chan School of Public Health)
Climate change is expected to increase the intensity of tropical cyclones (TCs), making them an escalating risk to human health over this century. Motivated by our large database of historic TC exposures and Medicare health records, we will create a predictive machine learning tool that provides real-time information about the areas of highest health risk, and the types of health risks anticipated, for an impending TC threatening the United States, with the goal of maximizing the protective impact of strategic preparedness efforts.
A theory of how deep networks generalize beyond their training set
Cengiz Pehlevan (Harvard John A. Paulson School of Engineering and Applied Sciences)
Deep networks are extremely successful in finding statistical patterns in data that generalize to previously unseen samples, yet how they can do so remains an open question. We will develop a theory of generalization in deep networks, adapting mathematical methods from the statistical physics of disordered systems.
The Causal Impact of Conflict on Gender Roles and Gender Discrimination: Big Data, Rumors, and Nature of War
Marcia Castro (Harvard T.H. Chan School of Public Health) and Jocelyn Finlay (Harvard T.H. Chan School of Public Health)
Ambient Noise Seismology using Cloud Computing
Marine Denolle (Harvard Faculty of Arts and Sciences)
Discovering the Foundations of Deep Learning: What if we could understand how computers learn?
Stratos Idreos (Harvard John A. Paulson School of Engineering and Applied Sciences)
Machine Learning Classification of Astrophysical Transient Events: The First Data-Driven Tests in Anticipation of the Large Synoptic Survey Telescope
Edo Berger (Harvard Faculty of Arts and Sciences)
Modeling Health System Resilience in Natural Disasters (California Wildfires)
Satchit Balsari (Harvard T.H. Chan School of Public Health)
Data-Driven Scientific Discovery In the Era of Large Astronomical Surveys
Douglas Finkbeiner (Harvard Faculty of Arts and Sciences)
Toward the development of city-level disease monitoring and forecasting platforms combining disparate mathematical modeling techniques and novel data sources
Caroline Buckee (Harvard T.H. Chan School of Public Health) and Mauricio Santillana (Harvard Medical School)
Planning proposal to establish a Harvard Initiative in Functional Neuroscience (HIFUN)
Adam Cohen (Harvard Faculty of Arts and Sciences)
In the development of models with biomedical applications, the ultimate goal is to provide accurate predictions for fully independent samples, originating from institutions and processed by laboratories that did not generate the training datasets. Our work forges a new path for addressing a major problem in biomedical research: the availability of high-dimensional personal-level information is opening vast opportunities for developing predictive and prognostic algorithms to support increasingly personalized medical care, but general strategies for ensuring that these algorithms generalize beyond the populations used for training are still lacking. While our motivating examples derive from the analysis of gene expression data in oncology, and our synthetic-data experiments are modeled on procedures previously employed in this field, our concepts and tools would be applicable to most areas of biomedicine and to a broad variety of labels.
This work on domain generalization to unseen populations also provides a new dimension of fairness: representation of potentially small, geographically remote populations not anticipated in earlier work.
Scalable algorithms for Bayesian inference with multiple sources of streaming data
Pierre Jacob (Harvard Faculty of Arts and Sciences)
In various domains, data sets take the form of collections of time series, which calls for specific methodological research in data science. In this project, we consider statistical questions arising with two types of data sets: one contains recordings of the activity of multiple neuronal units in the brains of mice during experiments, and the other contains partial genetic measurements collected on large populations of malaria parasites. In the first case, the project considers how to cluster time series into meaningful groups; in the second, how long each time series should be in order to obtain the desired estimation accuracy.
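The second question, how long a series must be to reach a target accuracy, can be illustrated with the textbook case of estimating a mean from independent observations, where the standard error shrinks as 1/sqrt(n). This is only a minimal sketch under an iid assumption; the function name is illustrative and the project's actual models for dependent genetic data are more involved.

```python
import math

def required_length(sigma, target_se):
    """Smallest number n of independent observations for which the
    standard error sigma / sqrt(n) is at most target_se.
    Solving sigma / sqrt(n) <= target_se gives n >= (sigma / target_se)^2."""
    return math.ceil((sigma / target_se) ** 2)

# Halving the target standard error quadruples the required length.
```

For correlated observations, the same calculation applies with an effective sample size smaller than n, which is one reason the question is nontrivial for real time series.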
Smartphone-based Digital Phenotyping of Structure and Content of Communication
JP Onnela (Harvard T.H. Chan School of Public Health)
Smartphone-based digital phenotyping makes use of active data (e.g., surveys) and passive data (e.g., location data, activity data, communication data) to learn about social, behavioral, and cognitive phenotypes in free-living or naturalistic settings. In this project, we will develop new statistical methods for quantifying longitudinal changes in the social structure of communication networks in individuals with central nervous system disorders.
New frontiers in statistical modeling and data science for uncertainty quantification in climate science
Natesh Pillai (Harvard Faculty of Arts and Sciences)
Accurate quantification of the long-term changes in temperature and careful analysis of the resulting implications constitute one of the pressing problems of our times. Long, high-quality records of temperature provide an important basis for our understanding of climate variability and change. However, even a perfunctory glance at most of the available temperature records will indicate a huge amount of uncertainty from various sources, including measurement error and preprocessing of the data before it is made publicly available. Thus, to meaningfully contribute to the global warming debate, it is imperative that we provide a scientifically rigorous uncertainty quantification of the temperature record, incorporating careful statistical modeling, domain knowledge, and technical expertise for handling large, complex data. This proposal is a call-to-arms to address this urgent issue.
Learning from Noisy and Strategically-Generated Data
Yiling Chen (Harvard John A. Paulson School of Engineering and Applied Sciences)
This project focuses on developing methods to obtain stronger insights from data in the presence of strategic behavior. It develops algorithms that learn efficiently when data are strategically generated, and incentive-aligned methods to elicit high-quality data for subsequent learning.
Improving Health Care System Performance: Computational Health Economics with Normative Data for Payment Calibration
Sherri Rose (Harvard Medical School)
In the conventional framework for designing health plan payment models, the regulator chooses variables to be used as risk adjustors, the risk adjustment weights, and other policy parameters, but the data from which estimates are derived are taken as given. This implicitly assumes the observed spending patterns are optimal, despite our knowledge that they are not. The work funded by this Harvard DSI grant, published in the Journal of Health Economics, developed new data transformation methods for the preprocessing phase to address this issue and induce a fairer and more efficient health care system.
Inference, Design of experiments, and Experimentation in an Automated Loop (IDEAL) for Aging Research
Jeffrey Miller (Harvard T.H. Chan School of Public Health)
Optimizing Sepsis Management with Reinforcement Learning
Finale Doshi-Velez (Harvard John A. Paulson School of Engineering and Applied Sciences)
Off-policy evaluation considers the task of estimating the value of some proposed treatment policy given data collected from some behavior policy (e.g., estimating how well an alternate treatment strategy might perform given data from doctors treating patients with current approaches). The main outcomes of this grant were (a) a description of common pitfalls that cause off-policy evaluation to fail in real settings, with guidelines for identifying when they may be occurring, and (b) novel approaches for improved off-policy evaluation.
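The off-policy evaluation task described above can be made concrete with the standard per-step importance-sampling estimator, which reweights returns observed under the behavior policy by how likely the evaluation policy was to take the same actions. This is a generic textbook sketch, not necessarily one of the methods developed under this grant; the trajectory format and function names are illustrative.

```python
def importance_sampling_value(trajectories, pi_e, pi_b):
    """Estimate the value of evaluation policy pi_e from trajectories
    collected under behavior policy pi_b.

    trajectories: list of trajectories, each a list of (state, action, reward)
    pi_e, pi_b:   functions (state, action) -> probability of that action

    Each trajectory's total return is reweighted by the product of
    per-step probability ratios pi_e(s, a) / pi_b(s, a)."""
    total = 0.0
    for traj in trajectories:
        weight = 1.0
        ret = 0.0
        for s, a, r in traj:
            weight *= pi_e(s, a) / pi_b(s, a)
            ret += r
        total += weight * ret
    return total / len(trajectories)
```

The estimator is unbiased when pi_b assigns positive probability to every action pi_e might take, but its variance grows with trajectory length, which is exactly the kind of pitfall that makes off-policy evaluation fragile in clinical settings.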
Deep Learning in Particle Physics
Matthew Schwartz (Harvard Faculty of Arts and Sciences)
This award supported research on applying modern machine learning techniques to the analysis of collider data in particle physics. Techniques were developed for signal and background discrimination, data purification, and interpretable data-driven weakly-supervised and unsupervised learning.