Postdoctoral Research Fund

The Harvard Data Science Initiative Postdoctoral Fellow Research Fund incentivizes and supports cross-disciplinary collaboration between data scientists at the postdoctoral level.

2021 Request For Proposals

2021 Projects


Who is most vulnerable? Causal inference and machine learning approaches to estimate health care costs of air pollution in the United States.
Falco Joannes Bargagli Stoffi (Harvard T.H. Chan School of Public Health)

The research project will have four integrated components: the creation of a novel national data set linking data on air pollution, health outcomes, and the related costs for over 60 million Americans with their individual and zip code-level characteristics (aim 1); the development and application of Bayesian machine learning for the estimation of individual-level health events and the related health costs attributable to exposure to air pollution (aim 2); the development of interpretable ML approaches for the detection of the subgroups that will suffer the highest costs (aim 3); and the dissemination of software code, data, and web applications to make our results reproducible and available to a wider public (aim 4). We anticipate that the output will include at least two peer-reviewed articles, data sets, reproducible code, and the dissemination of the research at leading national and international conferences.

A Spatial Approach to the Internal Migration of Minorities: How Service Delivery and Employment Opportunities Affect Relocation Patterns
Tugba Bozcaga (Harvard Kennedy School)

Open-source software to improve SARS-CoV-2 surveillance
James Hay (Harvard T.H. Chan School of Public Health)

Testing to find infections and track the spread of the virus is central to tackling the SARS-CoV-2 pandemic, helping decision makers know when and what interventions to put in place. While most people think of COVID-19 tests as being only positive or negative, most routine tests quantify how much virus genetic material there is in each sample. We have developed a new analytical tool to harness this information, which is otherwise discarded, to generate unbiased, accurate estimates of how quickly infection numbers are increasing or decreasing. However, uptake of scientific discoveries by the right people (local, state, and regional public health departments) is often hindered by the perceived inaccessibility of complicated methods. This project aims to develop and deploy an open-source software tool with a user interface to make this novel method accessible to non-specialist users, adding a much-needed option to the public health surveillance toolkit.

Can Machine Learning in the Classroom Bring More Diversity to STEM?
Haewon Jeong (Harvard John A. Paulson School of Engineering and Applied Sciences)

The lack of diversity in STEM fields is a longstanding societal challenge. Through interdisciplinary collaboration between social science and machine learning (ML), I want to find a data-driven approach that can tackle the challenge in early STEM education. Specifically, this project asks two interrelated questions: (i) Can we detect factors that lead to gender and racial biases in STEM education using ML tools? (ii) Can ML-based decision making employed in classrooms propagate bias and make unfair decisions that can discourage minorities from further STEM education? To study these questions, I will closely collaborate with Prof. Nilanjana Dasgupta, a social psychologist at UMass Amherst, who has been studying implicit biases, stereotypes, and STEM education for many years. We will make use of a unique dataset produced by Prof. Dasgupta and her team during a five-year longitudinal field study (2015-2019) at ten U.S. middle schools.

SEAS Press Release

2020 Projects


Exploring new methods to investigate adherence to treatment for drug-resistant tuberculosis
Stephanie Law (Harvard Medical School)

Every year, an estimated 10 million people contract tuberculosis (TB) and nearly 2 million people die from TB-related causes worldwide. The emergence of drug-resistant TB (DR-TB) is one of the major threats to controlling the global TB epidemic; at least 5% of all TB cases and 15% of all TB deaths are due to drug-resistant isolates. DR-TB treatment is extremely difficult and expensive, and has low treatment success rates (56% globally) and high mortality rates (40% to over 70%). Although poor treatment adherence is likely a major contributor to these low success rates, there is little research on patient adherence patterns and interventions that improve adherence. Patient-provider relationships, particularly the aspect of trust, can influence patient treatment adherence, thereby affecting health outcomes in TB patients. This research project will explore DR-TB treatment adherence patterns, identify DR-TB patients at highest risk of poor adherence, and evaluate whether provider trust mediates the effect of patient adherence risk factors on DR-TB treatment adherence patterns. Specifically, this will entail novel application of latent-class trajectory models, machine learning algorithms, and causal mediation methods to investigate treatment adherence in DR-TB patients. The findings will inform targeted strategies to improve adherence among high-risk patients, provide new methodologies to analyze and investigate TB adherence, and guide future interventions aimed at improving provider trust.


Regularized Maximum Likelihood Imaging Techniques: A New Method for Detecting Planets
Richard Teague (Harvard Faculty of Arts and Sciences)

The Atacama Large (sub-)Millimeter Array (ALMA) has revolutionized our understanding of planet formation. In particular, it has produced the highest-resolution images to date of the planet formation environment, the protoplanetary disk, revealing an exceptional level of structure. It has long been believed that these structures can be created by still-forming protoplanets that carve out gaps and rings in the dust of these systems. Despite this extensive evidence of protoplanets, attempts to detect them by traditional means (i.e. looking for their heat signatures directly) have had little success. Recently, we have shown that these protoplanets cause small disturbances in the gas around them, driving waves and ripples in the gas, similar to a fish swimming through water. Although we have started to observe the first of these signatures, we are still limited by the quality of the images we can take with ALMA. When reconstructing an image from multiple telescopes (an interferometer), we only sample different parts of the sky and have to make assumptions about what we missed, a problem that the Event Horizon Telescope team spent a considerable amount of time considering. This project aims to explore a new group of methods, ‘regularized maximum likelihood techniques’, to reconstruct these images, with a focus on recovering images that maximize the chances of detecting these embedded planets. Application of these methods to the growing archive of observations promises an entirely new approach to planet hunting.


Learning Peptide-specific T Cell Receptors in Human Cancers by Deep Neural Network and Structural Modeling
Songpeng Zu (Harvard Faculty of Arts and Sciences)

T cells recognize cancer-specific peptides with the complementarity-determining regions (CDRs) of T-cell receptors (TCRs). In this project, we explore the functional representation of T-cell receptor repertoires using a variational autoencoder trained on more than six million CDR sequences. Then, by applying a multiple-instance learning strategy to over two million CDR sequences from about 8,000 individuals across 30 tumor types, we discovered that T-cell repertoires from bulk RNA-sequencing datasets can be used for both cancer detection and cancer type classification, which sheds light on the corresponding clinical applications.

2019 Projects


Machine learning methods for integrating biological multi-omic datasets to decipher parasite development in the malaria mosquito
Duo Peng (Harvard T.H. Chan School of Public Health)

Duo Peng and colleagues identified mosquito metabolic pathways that favor the early development of Plasmodium falciparum, the deadliest human malaria parasite, in the Anopheles gambiae mosquito. Duo performed high-depth transcriptome sequencing of mosquitoes during early stages of malaria parasite infection. Using the data collected, he developed a machine learning model that predicts parasite load in mosquitoes from the expression values of mosquito genes. The model uses the Extreme Gradient Boosting algorithm framework. The fine-tuned model explains 65% of the variation in parasite load using mosquito gene expression values and is rigorously safeguarded against overfitting. According to the model, three genes involved in fatty acid metabolism and one gene involved in amino acid metabolism strongly shape malaria development in mosquitoes. Duo Peng and colleagues are currently validating these gene candidates experimentally. Confirmed genes and related metabolic pathways can serve as targets of malaria transmission control programs to block the transmission of this devastating pathogen.

Assessing the Air Pollution Effect on Hospital Admissions in the U.S.: A Matching Approach for Big Data
Maayan Yitshak Sade (Harvard T.H. Chan School of Public Health)

Numerous epidemiological studies have concluded that exposure to particulate air pollution (PM2.5, particulate matter smaller than 2.5 µm in diameter) increases the risk of mortality and of cardiovascular, respiratory, neurological, and psychological morbidity, and shortens life expectancy. Randomized controlled trials (RCTs) are not feasible when studying population-based air pollution effects; therefore, the majority of the evidence relies on classical regression methods with adjustment for possible confounders. Unlike RCTs, where randomization ensures that the exposure of interest is independent of all other parameters at the time of randomization, classical observational approaches are prone to confounding bias. Propensity score (PS) matching is a causal modeling approach that overcomes this limitation by mimicking the randomization process. We are using this method to assess the causal impact of high daily levels of PM2.5 on hospital admissions across the U.S. More specifically, we answer the question: how many cardiovascular hospital admissions could be prevented by lowering air pollution levels?
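
To illustrate the general idea of propensity score matching, here is a minimal Python sketch. It is not the project's actual pipeline: the toy records, the single binary confounder (a hypothetical "heat-wave day" flag), and the function names are all assumptions made for illustration. The score is estimated as the share of exposed days within each confounder stratum, and each exposed day is matched, with replacement, to the control day with the nearest score.

```python
from collections import defaultdict

# Hypothetical toy records: (confounder, exposed, admissions).
# confounder: 1 = heat-wave day, 0 = normal day (assumed for illustration)
# exposed:    1 = high-PM2.5 day, 0 = low-PM2.5 day
DAYS = [
    (0, 1, 12), (0, 0, 10), (0, 0, 10), (0, 0, 10),
    (1, 1, 22), (1, 1, 22), (1, 1, 22), (1, 0, 20),
]


def propensity_scores(records):
    """Estimate P(exposed | confounder) as the exposed share in each stratum."""
    totals = defaultdict(int)
    exposed = defaultdict(int)
    for x, t, _ in records:
        totals[x] += 1
        exposed[x] += t
    return {x: exposed[x] / totals[x] for x in totals}


def naive_effect(records):
    """Unadjusted difference in mean admissions, exposed minus control."""
    treated = [y for _, t, y in records if t == 1]
    control = [y for _, t, y in records if t == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


def matched_effect(records):
    """Effect on the exposed, using 1:1 nearest-score matching with replacement."""
    ps = propensity_scores(records)
    controls = [(ps[x], y) for x, t, y in records if t == 0]
    diffs = []
    for x, t, y in records:
        if t == 1:
            # Pick the control day whose propensity score is closest.
            _, y_control = min(controls, key=lambda c: abs(c[0] - ps[x]))
            diffs.append(y - y_control)
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    print("naive estimate:  ", naive_effect(DAYS))
    print("matched estimate:", matched_effect(DAYS))
```

In this toy data the exposure adds 2 admissions within each stratum, but exposed days are concentrated on heat-wave days, so the naive comparison overstates the effect while the matched comparison recovers it. A real analysis would estimate the score from many confounders (for example with logistic regression) and check covariate balance after matching.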

Personalizing mental health care: Bringing machine learning support into the clinic through user-centered design
Maia Jacobs (Harvard John A. Paulson School of Engineering and Applied Sciences)

The promise of machine learning (ML) in medicine is alluring, but few tools are actually being used in clinical practice. One area of healthcare that researchers have expected to benefit from the implementation of decision support tools (DSTs), but that has yet to adopt such technological support, is major depressive disorder (MDD). Towards the goal of translating ML predictions to real-world decision support tools, we explore to what extent clinical practice could be improved if clinicians were presented with recommendations produced by such models. Using a series of experiments and co-design sessions with healthcare providers, we found that the implementation of ML tools with high accuracy rates may be insufficient to improve treatment selection accuracy, and we also demonstrated the risk of overreliance when clinicians are shown incorrect treatment recommendations. Our findings also indicate that current trends in explainable AI may be inappropriate for clinical environments, and we consider paths towards designing these tools for real-world medical systems. Collectively, this work demonstrates the importance of human-computer interaction and data science collaborations in designing ML tools for clinical decision-making.