Competitive Research Fund 

The Harvard Data Science Initiative Competitive Research Fund provides targeted seed and bridge funding to Harvard faculty who propose novel methods, innovations, or solutions to data science challenges. Since 2017, the HDSI has provided over $2 million in funding across the University.

The application period for the 2023 funding cycle is now closed. We will accept applications for the 2024 funding cycle in early 2024.

2023 Projects

Rational design of polymeric materials using machine learning
Joanna Aizenberg (Harvard John A. Paulson School of Engineering and Applied Sciences)

The invention or design of new materials is one of the key factors in solving emergent biomedical, energy, and climate problems. However, the development of new materials still depends heavily on intuition and time-consuming trial-and-error strategies because of the intrinsic complexity and nonlinearity of the design process. In particular, structure-property relationships – a vital paradigm in materials science – are often nonlinear, and the pattern is likely to change with length and time scales, posing a huge challenge for rational design. For example, liquid crystal polymeric materials, which combine the ordering properties of liquid crystals with the rubbery elasticity of the polymer network, are promising for various biomedical and robotic applications. However, current liquid crystal polymeric materials still cannot fully satisfy all requirements, as their design requires understanding from the molecular scale to the macroscopic scale with high complexity. To address this, we propose to develop an approach to the inverse design of materials across different length scales using machine learning methods. In particular, we will develop a model that first maps the molecular structure to the resulting material property space, and then we will study how incorporation into a macroscopic material via fabrication pathways may alter that property space. By decoupling these effects, we can screen for novel materials that may be synthesized and applied in various applications independent of their post-processing steps – steps that can further serve as a fine-tuning process to obtain the final observed functions. The methods developed here are generalizable to other polymer design problems from the molecular to the macroscopic scale.
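The two-step strategy above (learn a structure-to-property map, then screen candidates before fabrication) can be illustrated with a minimal sketch. Everything here is hypothetical: the descriptors, the surrogate model choice, and the data are placeholders, not the project's actual pipeline.

```python
# Minimal sketch of surrogate-based screening, assuming hypothetical molecular
# descriptors and a measured property; not the project's actual method.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Placeholder training data: each row is a candidate molecule described by
# numeric features (e.g., length, rigidity, polarity); labels are a target property.
X_train = rng.normal(size=(200, 6))
y_train = X_train @ rng.normal(size=6) + 0.1 * rng.normal(size=200)

# Step 1: learn a surrogate map from molecular structure to material property.
surrogate = RandomForestRegressor(n_estimators=200, random_state=0)
surrogate.fit(X_train, y_train)

# Step 2: screen a large library of unsynthesized candidates and keep the
# best-scoring ones for downstream fabrication and fine-tuning.
X_candidates = rng.normal(size=(5000, 6))
scores = surrogate.predict(X_candidates)
top = np.argsort(scores)[-10:][::-1]
print("Top candidate indices:", top)
```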

Tracking the Footprint of Mining Extraction in the Brazilian Amazon with a Near Real-Time Artisanal Mining Alert System
Marcia Castro (Harvard T.H. Chan School of Public Health)

The pressure exerted by artisanal miners on Indigenous Lands (ILs) and Conservation Units (CUs) of the Brazilian Amazon is almost certain to maintain its rising momentum in the years to come. Over the past ten years, the area mined inside ILs has grown by 625%, while inside CUs it has grown by 352%. This proposal gives life to a near real-time “garimpo” detection system – called SAG. When fully implemented, SAG will track the evolution of artisanal mining activity over the last 38+ years, while simultaneously producing high-resolution early warning alerts of new mining sites arising inside or near ILs and CUs. SAG’s data may trigger actions to curb further negative consequences (e.g., protect indigenous rights, prevent disease outbreaks, enforce the legislation, and ultimately contribute to reducing the invasion of ILs and CUs by mining operations). All of SAG’s orbital inputs are free and publicly available (derived from NASA, ESA, and Planet public datasets).

Interplay of Symmetry and Scaling in Machine Learning
Melanie Weber (Harvard John A. Paulson School of Engineering and Applied Sciences)

The scale of machine learning systems has grown rapidly in recent years, creating an ever-increasing demand for high-quality training data and large-scale computing resources. Understanding the relationship between the size of the model (number of parameters), the size and structure of the training data, and the resources needed to train a model to a certain accuracy is crucial for the design of energy- and cost-conscious training strategies. In this project, we investigate how exploiting geometric structure in data can improve trainability and generalization in machine learning models, and contribute to the resource-efficient development of large-scale machine learning systems.
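As a small illustration of the kind of size-versus-accuracy relationship the project studies, one common empirical approach is to fit a power law between model size and test loss and extrapolate it. The data and constants below are synthetic placeholders, not results from this project.

```python
# Hedged sketch: fit a neural-scaling-style power law on synthetic measurements.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    # loss ~ a * n^(-b) + c, a common empirical form for scaling laws
    return a * np.power(n, -b) + c

# Hypothetical (model size, test loss) measurements generated from a known law.
rng = np.random.default_rng(0)
n_params = np.array([1e5, 3e5, 1e6, 3e6, 1e7, 3e7])
test_loss = power_law(n_params, 12.0, 0.15, 1.0) + rng.normal(0, 0.02, n_params.size)

(a, b, c), _ = curve_fit(power_law, n_params, test_loss, p0=(10.0, 0.1, 1.0), maxfev=10000)
print(f"fit: loss ~ {a:.2f} * N^(-{b:.3f}) + {c:.2f}")

# Extrapolate to a larger model to reason about the accuracy/resource trade-off.
print("predicted loss at N=1e8:", power_law(1e8, a, b, c))
```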

2022 Projects

Designing and Evaluating Reinforcement Learning Algorithms to Reduce Pretrial Incarceration
Sharad Goel and Todd Rogers (Harvard Kennedy School)

To reduce pretrial incarceration, and help individuals attend mandatory court dates, we will design reinforcement learning algorithms that identify individuals who would most likely benefit from a free ride to court. Research suggests that these rides will help a specific subset of individuals who struggle with access to transportation. For this population, a free ride via Lyft or Uber may reduce the chances that a judge chooses to send them to jail after a missed court date. We will evaluate this intervention in a randomized field experiment, in collaboration with our partner—the Santa Clara County Public Defender Office—which serves 20,000 clients a year in California’s sixth most populous county. Our intervention will also incorporate recent research developed by our team that minimizes racial disparities in the allocation of these benefits. We anticipate this experiment, and the research findings we generate, will help hundreds of individuals in Santa Clara County avoid jail incarceration each year.

Amend: Rewriting the Constitution
Jill Lepore (Harvard Faculty of Arts and Sciences)

Amend (amendmentsproject.org) aims to compile, classify, and analyze the text of proposed amendments to the U.S. Constitution from 1787 to 2021 (with periodic updates thereafter) for the purposes of producing an indexed archive for the public and for scholars, and of developing a podcast. The project builds on raw, incomplete, and inaccessible data compiled by the National Archives in 2016 and extends that collection through original archival research designed to discover proposals made by historically disenfranchised and poorly enfranchised groups. Through a collaboration with the team behind Constitute, a public-facing collection of the world’s written constitutions, Amend will provide access to this collection of proposed amendments to an interested general public that will include researchers and teachers, elected officials, political parties, non-profit organizations, constitutional reformers and, especially, students from kindergarteners to senior learners. The project aims to offer a radically new historical argument about the nature of constitutional change, advance civic education, and support emerging proposals for constitutional reform. The project aligns with HDSI’s evidence-based policy research theme, as it advances our understanding of democracy and governance and the mechanisms of structural political inequality.

Self-Supervision for Label-Efficient Medical Image Interpretation
Pranav Rajpurkar (Harvard Medical School)

In the field of medical AI, model development often relies on supervised learning, a training regimen requiring large, labeled training datasets. However, obtaining high-quality labels is particularly time-consuming and expensive in the medical field, so access to labeled data has been a major obstacle to progress. Here, we propose to develop multiple label-efficient approaches for medical AI model development, building on cutting-edge advances in self-supervised and multimodal learning. We focus on application to chest X-ray (CXR) interpretation, the most common medical imaging modality, with over 2 billion studies a year. In addition to developing multiple new high-performing models for CXR interpretation, our work has the potential to dramatically reduce or eliminate the need for costly labeled training data across medical AI tasks.
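To make the self-supervised idea concrete, the sketch below shows one standard formulation (a SimCLR-style contrastive loss): two augmented views of the same unlabeled image are pulled together in embedding space while other images are pushed apart. The tiny encoder, image size, and random tensors are placeholders, not the project's actual models or data.

```python
# Hedged PyTorch sketch of contrastive self-supervised pretraining on unlabeled images.
import torch
import torch.nn as nn
import torch.nn.functional as F

# A tiny stand-in encoder; real work would use a CNN backbone on full chest X-rays.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 128), nn.ReLU(), nn.Linear(128, 64))

def nt_xent(z1, z2, temperature=0.1):
    """SimCLR-style contrastive loss for paired embeddings z1, z2 of shape (B, D)."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2B, D) unit-norm embeddings
    sim = z @ z.t() / temperature                        # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))                    # exclude self-similarity
    b = z1.shape[0]
    targets = torch.cat([torch.arange(b, 2 * b), torch.arange(0, b)])  # positive = other view
    return F.cross_entropy(sim, targets)

# Two hypothetical augmented views of a batch of unlabeled 64x64 image crops.
view1, view2 = torch.randn(8, 1, 64, 64), torch.randn(8, 1, 64, 64)
loss = nt_xent(encoder(view1), encoder(view2))
loss.backward()
print("contrastive loss:", loss.item())
```

After pretraining in this way, a small labeled set can be used to fit a lightweight classification head, which is where the label savings come from.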

2021 Projects

New Techniques for Representation Learning
Boaz Barak (Harvard John A. Paulson School of Engineering and Applied Sciences)

This is methodological research in representation learning. Representation learning is arguably one of the least understood aspects of deep learning and poses challenges for robustness, transparency, and causal inference. We will obtain precise measures to quantify the representations corresponding to different parts of neural networks, and to connect them to one another. We will give rigorous bounds on the out-of-sample performance of representation-learning-based classifiers. We will investigate the use of the neural tangent kernel to interpolate between the “extract representation” and “fine tune” approaches to transfer learning, thereby mitigating “catastrophic forgetting”. The project will involve both extensive experimentation and proving rigorous bounds.
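For readers unfamiliar with the two transfer-learning regimes mentioned above, the sketch below contrasts them on a toy model: "extract representation" freezes a pretrained backbone and trains only a linear head, while "fine tune" also updates the backbone. The backbone, data, and hyperparameters are hypothetical stand-ins.

```python
# Hedged PyTorch sketch: linear probing ("extract") vs. fine-tuning a pretrained backbone.
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 64))  # stand-in for a pretrained network
head = nn.Linear(64, 10)

def make_optimizer(mode: str):
    if mode == "extract":            # linear probe: backbone stays fixed
        for p in backbone.parameters():
            p.requires_grad_(False)
        return torch.optim.SGD(head.parameters(), lr=1e-2)
    elif mode == "finetune":         # fine-tuning: all parameters are updated
        for p in backbone.parameters():
            p.requires_grad_(True)
        return torch.optim.SGD(list(backbone.parameters()) + list(head.parameters()), lr=1e-3)
    raise ValueError(mode)

x, y = torch.randn(16, 32), torch.randint(0, 10, (16,))
opt = make_optimizer("extract")
loss = nn.functional.cross_entropy(head(backbone(x)), y)
loss.backward()
opt.step()
print("probe loss:", loss.item())
```

The neural tangent kernel perspective mentioned in the abstract can be viewed as a way of moving continuously between these two endpoints.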

Detection and Parameter Estimation of Gravitational Wave Events using Deep Learning
Edo Berger (Harvard Faculty of Arts and Sciences)

In 1916, building on his revolutionary General Theory of Relativity, Einstein theorized that massive celestial objects can generate ripples in spacetime called gravitational waves. This prediction was confirmed a century later with the direct detection of gravitational waves produced by a pair of colliding black holes 1.3 billion light years from Earth. This was a watershed moment for both physics and astrophysics: after centuries of exploring the universe only through light, we now possess a fundamentally new probe of the universe. However, this potential is being limited by current detection and parameter estimation techniques, which are simplified, slow, and not scalable to the increasing rate of events being found. Recently, we began initial development and testing of a deep-learning-based pipeline aimed at real-time source detection and characterization. Using convolutional neural networks, we have found that we can distinguish different classes of astrophysical sources from real Gaussian and non-Gaussian gravitational wave detector noise, and can moreover provide estimates of the source masses that are competitive with current techniques but take seconds instead of days. Here we request funding to support several critical developments of this initial work that will bring the pipeline to fuller functionality as we seek external funding to enable full deployment in the next 2-3 years.
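A schematic example of the detection idea is a small 1-D convolutional network that labels short strain segments as "noise" or "signal". The architecture, segment length, and random inputs below are placeholders, not the pipeline described in the abstract.

```python
# Hedged PyTorch sketch: 1-D CNN classifier over gravitational-wave strain segments.
import torch
import torch.nn as nn

class StrainClassifier(nn.Module):
    def __init__(self, n_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=16, stride=4), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):                  # x: (batch, 1, n_samples)
        return self.classifier(self.features(x).squeeze(-1))

model = StrainClassifier()
segments = torch.randn(4, 1, 4096)         # hypothetical whitened strain segments
logits = model(segments)
print(logits.shape)                         # (4, 2): per-class scores per segment
```

The same backbone could, in principle, be extended with a regression head for rapid estimates of source parameters such as the component masses.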

Improving polygenic risk prediction in underrepresented populations through transfer and federated learning
Rui Duan (Harvard T.H. Chan School of Public Health)

Polygenic risk scores (PRS) have shown promising potential for early disease detection, prevention, and intervention. However, as existing genome-wide studies have been conducted predominantly in European-ancestry (EA) populations, the performance of PRS is much poorer in non-EA populations than in EA populations, which may exacerbate existing health disparities. The goal of our proposed work is to develop novel transfer and federated learning methods to improve the performance of PRS in underrepresented populations, based on two strategies: 1) leveraging existing knowledge learned from diverse populations; 2) incorporating available datasets and enabling multicenter collaboration through efficient and safe information-sharing strategies. We will (1) develop a transfer learning framework for constructing PRS in an underrepresented population, leveraging individual-level data and available GWAS summary statistics from multiple populations; and (2) develop communication-efficient federated learning algorithms that increase the training sample size for an underrepresented population by incorporating multiple biobanks while preserving communication efficiency and individual-level privacy. Completion of this project will lead to novel tools that produce high-quality PRS with better predictive performance for underrepresented non-EA populations, which can ultimately help advance precision medicine and reduce the disparity in PRS-based research.
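One simple way to picture the transfer idea is to blend effect sizes from a large EA GWAS with noisier effect sizes re-estimated in the smaller target population, choosing the blend by validation in the target population. The sketch below is purely illustrative with simulated data; the proposal's actual estimators are more sophisticated.

```python
# Hedged sketch: blending GWAS effect sizes across populations for a transfer-learned PRS.
import numpy as np

rng = np.random.default_rng(1)
n_snps, n_val = 500, 300

beta_ea = rng.normal(0, 0.05, n_snps)                  # effect sizes from a large EA GWAS (summary statistics)
beta_target = beta_ea + rng.normal(0, 0.03, n_snps)    # noisier estimates from the small target cohort
G_val = rng.binomial(2, 0.3, size=(n_val, n_snps))     # validation genotype dosages (0/1/2)
y_val = G_val @ beta_target + rng.normal(0, 1.0, n_val)

def prs(genotypes, weights):
    """Polygenic risk score: weighted sum of allele dosages."""
    return genotypes @ weights

# Choose the mixing weight that predicts best in the target-population validation set.
alphas = np.linspace(0, 1, 11)
corrs = [np.corrcoef(prs(G_val, (1 - a) * beta_ea + a * beta_target), y_val)[0, 1] for a in alphas]
best = alphas[int(np.argmax(corrs))]
print(f"best mixing weight alpha = {best:.1f}, validation correlation = {max(corrs):.3f}")
```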

Identifying variables and race surrogates responsible for race and ethnic disparities in US mortality
Chirag Patel (Harvard Medical School)

The research project involves identifying the major predictors of the racial/ethnic “survival paradox”. The survival paradox is the phenomenon by which an ethnic/racial subgroup is predicted to be “sicker” than another but, paradoxically, on average lives longer. It is unclear whether the paradox is due to sampling bias – the bias by which scientists sample their data – or is an actual biological phenomenon. We have tentatively observed the paradox in a large epidemiological survey, the US Centers for Disease Control and Prevention National Health and Nutrition Examination Survey (NHANES). In this proposal, we will leverage machine learning to identify whether other cohort data exhibit the survival paradox and to identify the indicators of the paradox, such as differences in nutrition, environmental exposure, and sampling selection.

2020 Projects

The American Communities Computable Newspaper Database
Melissa Dell (Harvard Faculty of Arts and Sciences)

The American Communities Computable Newspaper Database uses recent advances in deep learning, computer vision, and natural language processing to create a computable database for over 7,000 historical newspapers in all 50 states, spanning over 12 million newspaper editions. The database will provide straightforward-to-use outputs from natural language processing analyses conducted on full article texts, image captions, and headlines that can be used to elucidate how the media has influenced American society.

Discovery of higher temperature superconductors by machine-learning strongly correlated descriptors of materials
Xin Li (Harvard John A. Paulson School of Engineering and Applied Sciences)

The project aims to search for and design higher-temperature superconductors through a combination of high-throughput ab initio simulations and machine learning analyses that focus on novel descriptors of ultrafast electron dynamics. The project may also lead to an understanding of the unconventional superconducting mechanism.

Refining pre-disaster strategic preparedness: A machine learning model for identification of communities facing the highest health risks from an impending tropical cyclone.
Rachel Nethery (Harvard T.H. Chan School of Public Health)

Climate change is expected to increase the intensity of tropical cyclones (TCs), so they represent an escalating risk to human health over this century. Motivated by our large database of historic TC exposures and Medicare health records, we will create a predictive machine learning tool that provides real-time information about the areas of highest health risk and the types of health risks anticipated for an impending TC threatening the United States, with the goal of maximizing the protective impact of strategic preparedness efforts.

A theory of how deep networks generalize beyond their training set
Cengiz Pehlevan (Harvard John A. Paulson School of Engineering and Applied Sciences)

Deep networks are extremely successful in finding statistical patterns in data that generalize to previously unseen samples, yet how they can do so remains an open question. We will develop a theory of generalization in deep networks adopting mathematical methods from the statistical physics of disordered systems. 

2019 Projects

The Causal Impact of Conflict on Gender Roles and Gender Discrimination: Big Data, Rumors, and Nature of War
Marcia Castro (Harvard T.H. Chan School of Public Health) and Jocelyn Finlay (Harvard T.H. Chan School of Public Health)

Ambient Noise Seismology using Cloud Computing
Marine Denolle (Harvard Faculty of Arts and Sciences)

Discovering the Foundations of Deep Learning: What if we could understand how computers learn?
Stratos Idreos (Harvard John A. Paulson School of Engineering and Applied Sciences)

Neural networks are increasingly prevalent in applications that have strong potential to improve human life. However, they are extremely hard and complex to design. We present the Deep Collider, a fine-grained and holistic experimental infrastructure that helps derive the first principles of neural network design and decipher critical design problems. The first series of results contradicts conventional wisdom about several critical design decisions.

Machine Learning Classification of Astrophysical Transient Events: The First Data-Driven Tests in Anticipation of the Large Synoptic Survey Telescope
Edo Berger (Harvard Faculty of Arts and Sciences)

The primary aim of the project was to address a pressing need in time-domain astrophysics – the classification of optical transients based on their photometric data alone.  We developed, trained, published, utilized, and made public several machine learning algorithms designed for real-time identification of rare transient events, as well as for population-level studies.  These classification algorithms outperform previous work, and are currently implemented as part of on-going campaigns using some of the world’s largest telescopes.

Modeling Health System Resilience in Natural Disasters (California Wildfires)
Satchit Balsari (Harvard T.H. Chan School of Public Health)

Data-Driven Scientific Discovery In the Era of Large Astronomical Surveys
Douglas Finkbeiner (Harvard Faculty of Arts and Sciences)

2018 Projects

Toward the development of city-level disease monitoring and forecasting platforms combining disparate mathematical modeling techniques and novel data sources.
Caroline Buckee (Harvard T.H. Chan School of Public Health) and Mauricio Santillana (Harvard Medical School)

Planning proposal to establish a Harvard Initiative in Functional Neuroscience (HIFUN)
Adam Cohen (Harvard Faculty of Arts and Sciences)

Representation via Representations
Cynthia Dwork (Harvard John A. Paulson School of Engineering and Applied Sciences) and Giovanni Parmigiani (Harvard T.H. Chan School of Public Health)

The ultimate goal in the development of models with biomedical applications is to provide accurate predictions for fully independent samples, originating from institutions and processed by laboratories that did not generate the training datasets. Our work forges a new path for addressing a major problem in biomedical research: the availability of high-dimensional personal-level information is opening vast opportunities for developing predictive and prognostic algorithms to support increasingly personalized medical care, but general strategies to ensure the generalizability of these algorithms beyond the populations used for training are still lacking. While our motivating examples derive from the analysis of gene expression data in oncology, and our synthetic data experiments are modeled on procedures for generating synthetic data that have otherwise been employed in this field, our concepts and tools are applicable to most areas of biomedicine and to a broad variety of labels.

This work – domain generalization to unseen populations – provides a new dimension of fairness: representation of potentially small, geographically remote populations not anticipated in earlier work.

Scalable algorithms for Bayesian inference with multiple sources of streaming data
Pierre Jacob (Harvard Faculty of Arts and Sciences)

In various domains, data sets take the form of collections of time series, which calls for specific methodological research in data science. In this project, we consider statistical questions arising with two types of data sets: one contains recordings of the activity of multiple neuronal units in the brains of mice during experiments, and the other contains partial genetic measurements collected on large populations of malaria parasites. In the first case, the project considers how to cluster time series into meaningful groups; in the second, how long each time series should be in order to obtain the desired estimation accuracy.

Smartphone-based Digital Phenotyping of Structure and Content of Communication
JP Onnela (Harvard T.H. Chan School of Public Health)

Smartphone-based digital phenotyping makes use of active data (e.g., surveys) and passive data (e.g., location data, activity data, communication data) to learn about social, behavioral, and cognitive phenotypes in free-living or naturalistic settings. In this project, we will develop new statistical methods for quantifying longitudinal changes in the social structure of communication networks in individuals with central nervous system disorders.

New frontiers in statistical modeling and data science for uncertainty quantification in climate science
Natesh Pillai  (Harvard Faculty of Arts and Sciences)

Accurate quantification of the long-term changes in temperature and careful analysis of the resulting implications constitute one of the pressing problems of our times. Long, high-quality records of temperature provide an important basis for our understanding of climate variability and change. However, even a perfunctory glance at most of the available temperature records will indicate a huge amount of uncertainty from various sources, including measurement error, preprocessing of the data before it is made publicly available, etc. Thus, to meaningfully contribute to the global warming debate, it is imperative that we provide a scientifically rigorous uncertainty quantification of temperature, incorporating careful statistical modeling, domain knowledge, and technical expertise for handling large, complex data. This proposal is a call-to-arms to address this urgent issue.

2017 Projects

Learning from Noisy and Strategically-Generated Data
Yiling Chen (Harvard John A. Paulson School of Engineering and Applied Sciences)

This project focuses on developing methods to obtain stronger insights from data in the presence of strategic behavior.  It develops algorithms that are efficient for learning when data are strategically generated. It also develops incentive-aligned methods to elicit high-quality data for the subsequent learning. 

Improving Health Care System Performance: Computational Health Economics with Normative Data for Payment Calibration
Sherri Rose (Harvard Medical School)

In the conventional framework for designing health plan payment models, the regulator chooses variables to be used as risk adjustors, the risk adjustment weights, and other policy parameters, but the data from which estimates are derived are taken as given. This implicitly assumes the observed spending patterns are optimal, despite our knowledge that they are not. The work funded by this Harvard DSI grant, published in the Journal of Health Economics, developed new data transformation methods for the preprocessing phase to address this issue and induce a fairer and more efficient health care system.

Inference, Design of experiments, and Experimentation in an Automated Loop (IDEAL) for Aging Research
Jeffrey Miller (Harvard T.H. Chan School of Public Health)

Optimizing Sepsis Management with Reinforcement Learning
Finale Doshi-Velez (Harvard John A. Paulson School of Engineering and Applied Sciences)

Off-policy evaluation considers the task of estimating the value of some proposed treatment policy given data collected from some behavior policy (e.g. estimating how well an alternate treatment strategy may perform given data from doctors treating patients using current approaches). The main outcomes of this grant were (a) to describe a set of common pitfalls for when off-policy evaluation may fail in real settings, and provide guidelines to identify when those may be happening, and (b) to develop novel approaches for improved off-policy evaluation.
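A small illustration of off-policy evaluation is the inverse propensity scoring (importance sampling) estimator, one of the standard approaches in this area (not necessarily the grant's specific method): logged rewards are reweighted by the ratio of target-policy to behavior-policy action probabilities. The data below are simulated placeholders.

```python
# Hedged sketch: off-policy evaluation via inverse propensity scoring on simulated logs.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
actions = rng.integers(0, 2, size=n)                      # actions logged under the behavior policy
behavior_prob = np.where(actions == 1, 0.7, 0.3)          # pi_b(a), known from the logging policy
target_prob = np.where(actions == 1, 0.4, 0.6)            # pi_e(a), the policy we want to evaluate
rewards = rng.normal(loc=np.where(actions == 1, 1.0, 0.5))  # observed outcomes

weights = target_prob / behavior_prob                     # importance ratios
ips_estimate = np.mean(weights * rewards)                 # estimated value of the target policy
print("estimated value of target policy:", round(ips_estimate, 3))
```

Pitfalls of the kind the grant describes show up here directly: when the behavior and target policies disagree strongly, the importance weights become large and the estimate becomes unreliable.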

Deep Learning in Particle Physics
Matthew Schwartz (Harvard Faculty of Arts and Sciences)

This award supported research on applying modern machine learning techniques to the analysis of collider data in particle physics. Techniques were developed for signal and background discrimination, data purification, and interpretable data-driven weakly-supervised and unsupervised learning.