Postdoctoral Fellow Research Fund

The Harvard Data Science Initiative Postdoctoral Fellow Research Fund incentivizes and supports cross-disciplinary collaboration between data scientists at the postdoctoral level.

2023 Projects

When does the Past Matter? Using Deep Learning to Investigate Experimental Evidence on the Sunk Cost Effect
Manual Hoffman (Harvard Business School)

The sunk cost bias is the inclination to pursue goals that are no longer beneficial or profitable because of the amount of time, effort, or resources already invested in them. Despite recognizing that continuing may not yield desirable outcomes, people often feel compelled to persist due to their past investments. This tendency to overlook present circumstances and potential alternatives can hinder logical decision-making and generate significant costs and damage across various contexts. For example, government projects often continue well after they are shown to be wasteful. Similarly, leaders often continue to sacrifice lives in military operations that have been proven to be counterproductive. Although there is a wealth of anecdotal evidence regarding the sunk cost effect, quantifying this phenomenon empirically poses challenges due to individual differences.

This research project will use machine learning methods to identify the factors that influence the strength of the sunk cost effect from experimental data, enabling us to consider the diversity and variations within a given group or population. We will use an innovative approach that harnesses the power of deep neural networks to enhance precision, while avoiding potential bias caused by limited data samples. The findings can inform policies that attempt to reduce bias among decision-makers across a wide variety of settings where decisions are influenced by sunk costs.

Optimizing Vaccine Efficacy Analysis for Specific Populations through Federated Learning and Cross-Institutional EHR Data Harmonization
Doudou Zhou (Harvard T.H. Chan School of Public Health)

This research proposal aims to enhance our understanding of COVID-19 vaccine responses in specific populations, particularly those with underlying health conditions. Current data from authorized vaccine phase 3 trials is insufficient and lacks generalizability for these populations.

By applying federated learning (also called collaborative learning), a decentralized approach to training machine learning models, across multiple institutional and cohort studies, we intend to identify risk factors and their interactions. This approach will allow for a comprehensive understanding of the vaccine’s impact and will enhance the statistical reliability and generalizability of our analysis.
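As a rough illustration of the decentralized training idea described above, the following minimal sketch shows federated averaging on a toy one-parameter model. It is not the project's method: the model `y = w * x`, the function names, the learning rate, and the toy data are all assumptions chosen for illustration. Each site updates the model on its own records, and only the model parameter, never the raw data, is sent back for averaging.

```python
from typing import List, Tuple

# (x, y) pairs held privately by one institution (hypothetical toy data)
Site = List[Tuple[float, float]]


def local_update(w: float, data: Site, lr: float = 0.01, steps: int = 20) -> float:
    """Run a few gradient-descent steps on one site's data for the model y = w * x."""
    for _ in range(steps):
        # Gradient of the mean squared error with respect to w
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_average(sites: List[Site], rounds: int = 50) -> float:
    """Average locally updated parameters, weighted by each site's sample size."""
    w = 0.0
    total = sum(len(s) for s in sites)
    for _ in range(rounds):
        # Each site trains locally; only the updated parameter leaves the site
        local_ws = [local_update(w, s) for s in sites]
        w = sum(len(s) * lw for s, lw in zip(sites, local_ws)) / total
    return w


# Two hypothetical sites whose data both follow y = 2 * x
sites = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0), (4.0, 8.0), (5.0, 10.0)],
]
w_hat = federated_average(sites)
```

In this toy setting the shared parameter converges toward the common slope even though neither site ever sees the other's records.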

The task of harmonizing data across different institutions, especially within Electronic Health Record (EHR) systems, presents considerable challenges. These hurdles arise from variations in data formats, terminologies, and concerns regarding data privacy. We propose to address this by collaborating with Bordeaux University Hospital (BUH) in France, harmonizing their EHR data with three U.S. institutions. 

The expected outcomes include (1) the creation of an innovative, resilient, scalable, and privacy-preserving EHR data harmonization framework that employs advanced language models and graph neural networks and that will enable comprehensive multi-institutional studies on different clinical problems, and (2) an in-depth analysis of COVID-19 vaccine efficacy in specific populations. Our project stands to make a substantial contribution to global healthcare systems and the ongoing response to the COVID-19 pandemic.

Data science approaches to understand global nutrient supplies from aquatic foods 
Jessica Mason (Harvard T.H. Chan School of Public Health)

Policies that promote access to and consumption of safe and sustainably sourced aquatic foods could help tackle malnutrition and decrease diet-related noncommunicable diseases around the globe.

However, a lack of understanding of the heavy metal and nutrient composition of most aquatic foods, as well as their variability within and among species (e.g., fish, invertebrates, or seaweed), has delayed the necessary policy shifts toward effectively managing fisheries and aquaculture production for food and nutrition security in the context of environmental change.

This interdisciplinary project, which involves new and existing collaborations within and outside Harvard, aims to combine multiple existing datasets on aquatic species to predict location- and time-specific nutrient and heavy metal concentrations of aquatic species based on life-history traits and environmental conditions.

Overall, this project will provide nutrient and heavy metal concentration estimates for all living aquatic species, and increase our understanding of the safe nutrient supply potential from aquatic foods with ongoing environmental change.

2022 Projects

A step closer from discovery to clinic: an integrative breast cancer risk prediction tool for African American women.
Tian Gu (Harvard T.H. Chan School of Public Health)

The polygenic risk score (PRS) has shown significant clinical potential for breast cancer risk stratification through aggregated genetic risk effects. Yet its disparate performance across ancestry groups has hindered the use of PRS to refine current risk estimates based on clinical information. In particular, although African American (AA) women suffer from a higher incidence of an aggressive form of breast cancer at a younger age than others, few PRS models have been developed for AA women because of their limited representation in large-scale clinical and genomics studies. Moreover, existing risk prediction models based on clinical information or on PRS derived from European ancestry have been found to generalize poorly to African ancestry. Therefore, we propose to develop an integrative tool for data integration, risk factor detection, and prediction that aims to improve breast cancer risk prediction for AA women by combining PRS and clinical risk factors, leveraging shared information from multi-ancestry groups, and accounting for population heterogeneity. Our proposal includes (i) using statistical methods to identify and quantify the shared risk factors, both clinical and polygenic, across ancestry groups and further detect the AA-specific risk factors; (ii) using machine learning approaches to develop an integrative breast cancer risk prediction tool for AA women that accounts for clinical and polygenic risk factors, leverages information from other ancestry groups, and quantifies the prediction uncertainties; and (iii) developing a user-friendly platform for health professionals to assess breast cancer risk in AA women, with open-access and reproducible code.

Revealing Biological Functions behind the Visual Patterns in Single-Cell Spatial Omics Using XAI as an Interactive Visual Exploration Method
Qianwen Wang (Harvard Medical School)

Visual patterns of tissues and cells in microscopy images can unravel valuable insights to understand human bodies and treat diseases (e.g., histopathology).

This proposal employs an XAI technique that is commonly used to explain natural image models to extract visual patterns for analyzing multiplexed single-cell images. We will closely collaborate with participants in the Human BioMolecular Atlas Program (HuBMAP) and develop an interactive visual exploration tool to identify visual patterns in spatial omics images and identify the biological events behind these visual patterns.

Using Geospatial Data to Understand Ethnoracial Mental Health Disparities: The Impact of Greenspace and Air Quality on PTSD Symptom Trajectories
Elizabeth Webb (McLean Hospital)

Each year, up to 30 million people are hospitalized in the United States as the result of a traumatic injury and approximately 30% of these individuals will develop PTSD. Ethnoracially minoritized individuals disproportionately experience more severe and chronic PTSD compared to their white counterparts. A proposed driver of these health disparities is the environment; exposure to environmental factors, including air quality and greenspace, varies across people and places, with inequities attributed to structural racism. This project will test whether air quality and greenspace impact how an individual recovers after a traumatic event. Importantly, we will examine whether differences in exposure to air quality and greenspace help explain ethnoracial differences in PTSD symptoms. Given that differential access to greenspace (“tree inequity”) and air quality are components of structural racism and potential contributors to ethnoracial mental health disparities, this project applies a health equity focus and intentionally bridges public health and neuroscience.

Accelerating Discovery of Liquid Crystal Polymeric Materials with Extreme Properties using Machine Learning
Haichao Wu (Harvard John A. Paulson School of Engineering and Applied Sciences)

2021 Projects

Who is most vulnerable? Causal inference and machine learning approaches to estimate health care costs of air pollution in the United States.
Falco Joannes Bargagli Stoffi (Harvard T.H. Chan School of Public Health)

The research project will have four integrated components: the creation of a novel national data set linking data on air pollution, health outcomes, and the related costs for over 60 million Americans with their individual and zip code-level characteristics (aim 1); the development and application of Bayesian machine learning for the estimation of individual-level health events and the related health costs attributable to exposure to air pollution (aim 2); the development of interpretable ML approaches for the detection of the subgroups that will suffer the highest costs (aim 3); and the dissemination of software code, data, and web applications to make our results reproducible and available to a wider public (aim 4). We anticipate that the output will include at least two peer-reviewed articles, data sets, reproducible code, and the dissemination of the research at leading national and international conferences.

A Spatial Approach to the Internal Migration of Minorities: How Service Delivery and Employment Opportunities Affect Relocation Patterns
Tugba Bozcaga (Harvard Kennedy School)

Open-source software to improve SARS-CoV-2 surveillance
James Hay (Harvard T.H. Chan School of Public Health)

We developed new metrics for tracking epidemic trends based on semi-quantitative outputs from RT-qPCR testing that are routinely collected but currently discarded. During the award, we developed an open-source software package implementing our existing method, and initiated new collaborations with hospitals and public health departments to use outpatient hospital testing as a sentinel surveillance population. This work provides an additional approach to estimating SARS-CoV-2 infection incidence in the absence of widespread community testing.

Can Machine Learning in the Classroom Bring More Diversity to STEM?
Haewon Jeong (Harvard John A. Paulson School of Engineering and Applied Sciences)

The lack of diversity in STEM fields is a longstanding societal challenge. Through interdisciplinary collaboration between social science and machine learning (ML), I want to find a data-driven approach that can tackle the challenge in early STEM education. Specifically, this project asks two interrelated questions: (i) Can we detect factors that lead to gender and racial biases in STEM education using ML tools? (ii) Can ML-based decision making employed in classrooms propagate bias and make unfair decisions that can discourage minorities from further STEM education? To study these questions, I will closely collaborate with Prof. Nilanjana Dasgupta, a social psychologist at UMass Amherst, who has been studying implicit biases, stereotypes, and STEM education for many years. We will make use of a unique dataset produced by Prof. Dasgupta and her team during a five-year longitudinal field study (2015-2019) at ten U.S. middle schools.

SEAS Press Release

2020 Projects

Exploring new methods to investigate adherence to treatment for drug-resistant tuberculosis
Stephanie Law (Harvard Medical School)

Every year, an estimated 10 million people contract tuberculosis (TB) and nearly 2 million people die from TB-related causes worldwide. The emergence of drug-resistant TB (DR-TB) is one of the major threats to controlling the global TB epidemic; at least 5% of all TB cases and 15% of all TB deaths are due to drug-resistant isolates. DR-TB treatment is extremely difficult and expensive, and has low treatment success rates (56% globally) and high mortality rates (40% to over 70%). Although poor treatment adherence is likely a major contributor to these low success rates, there is little research on patient adherence patterns and interventions that improve adherence. Patient-provider relationships, particularly the aspect of trust, can influence patient treatment adherence, thereby impacting health outcomes in TB patients. This research project will explore DR-TB treatment adherence patterns, identify DR-TB patients at highest risk of poor adherence, and evaluate whether provider trust mediates the effect of patient adherence risk factors on DR-TB treatment adherence patterns. Specifically, this will entail the novel application of latent-class trajectory models, machine learning algorithms, and causal mediation methods to investigate treatment adherence in DR-TB patients. The findings will inform targeted strategies to improve adherence among high-risk patients, provide new methodologies to analyze and investigate TB adherence, and guide future interventions aimed at improving provider trust.

Regularized Maximum Likelihood Imaging Techniques: A New Method for Detecting Planets
Richard Teague (Harvard Faculty of Arts and Sciences)

The Atacama Large (sub-)Millimeter Array (ALMA) has revolutionized our understanding of planet formation. In particular, it has produced the highest-resolution images to date of the planet formation environment, the protoplanetary disk, revealing an exceptional level of structure. It has long been believed that these structures can be created by still-forming protoplanets, which carve out gaps and rings in the dust of these systems. Despite this extensive evidence of protoplanets, attempts to detect them by traditional means (i.e., looking for their heat signatures directly) have had little success. Recently, we have shown that these protoplanets will cause small disturbances in the gas around them, driving waves and ripples in the gas, similar to a fish swimming through water. Although we have started to observe the first of these signatures, we are still limited by the quality of the images we can take with ALMA. When reconstructing an image from multiple telescopes (an interferometer), we only sample different parts of the sky and we have to make assumptions about what we missed, a problem that the Event Horizon Telescope spent a considerable amount of time considering. This project aims to explore a new group of methods, ‘regularized maximum likelihood techniques’, to reconstruct these images with a focus on recovering images that maximize the chances of detecting these embedded planets. Application of these methods to the growing archive of observations promises to present an entirely new approach to planet hunting.

Learning Peptide-specific T Cell Receptors in Human Cancers by Deep Neural Network and Structural Modeling
Songpeng Zu (Harvard Faculty of Arts and Sciences)

T cells recognize cancer-specific peptides with the complementarity-determining regions (CDRs) of T-cell receptors (TCRs). In this project, we explore the functional representation of T-cell receptor repertoires using a variational auto-encoder learned from more than six million CDR sequences. Then, by applying a multiple-instance learning strategy to over two million CDR sequences from about 8,000 individuals across 30 tumor types, we discovered that T-cell repertoires from bulk RNA-sequencing datasets can be used for both cancer detection and cancer type classification, which sheds light on the corresponding clinical application.

2019 Projects

Machine learning methods for integrating biological multi-omic datasets to decipher parasite development in the malaria mosquito
Duo Peng (Harvard T.H. Chan School of Public Health)

Duo Peng and colleagues identified mosquito metabolic pathways that favor the early development of Plasmodium falciparum—the deadliest human malaria parasite—in the Anopheles gambiae mosquito. Duo performed high-depth transcriptome sequencing of mosquitoes during early stages of malaria parasite infection. Using the data collected, he developed a machine learning model that predicts parasite load in mosquitoes using expression values of mosquito genes. The machine learning model uses the Extreme Gradient Boosting algorithm framework. The fine-tuned model can explain 65% of the variation of parasite load using mosquito gene expression values while the model is rigorously safeguarded against overfitting. According to the model, three genes involved in fatty acid metabolism and one gene involved in the amino acid metabolism strongly shape malaria development in mosquitoes. Duo Peng and colleagues are currently validating these gene candidates experimentally. Confirmed genes and related metabolic pathways can serve as targets of malaria transmission control programs to block the transmission of this devastating pathogen.
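To make the "explains 65% of the variation" statement above concrete, the following minimal sketch computes the coefficient of determination (R²), the usual measure of the fraction of variance explained by a model's predictions. The function name and the toy numbers are assumptions for illustration and are unrelated to the project's data or code.

```python
from typing import Sequence


def r_squared(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Fraction of the variance in `observed` that is explained by `predicted`."""
    mean_obs = sum(observed) / len(observed)
    # Total variation of the observations around their mean
    ss_total = sum((y - mean_obs) ** 2 for y in observed)
    # Variation left unexplained by the predictions
    ss_resid = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return 1.0 - ss_resid / ss_total


# Hypothetical parasite loads and model predictions (toy values)
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 1.5, 3.5, 3.5]
score = r_squared(observed, predicted)
```

A value of 0.65 on held-out data would correspond to the model accounting for 65% of the variation in parasite load.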

Assessing the Air Pollution Effect on Hospital Admissions in the U.S.: A Matching Approach for Big Data
Maayan Yitshak Sade (Harvard T.H. Chan School of Public Health)

Numerous epidemiological studies have concluded that exposure to particulate air pollution (PM2.5, particulate matter smaller than 2.5 µm in diameter) increases the risk of mortality and of cardiovascular, respiratory, neurological, and psychological morbidity, and shortens life expectancy. Randomized controlled trials (RCTs) are not feasible when studying population-based air pollution effects; therefore, the majority of the evidence relies on classical regression methods with adjustment for possible confounders. Unlike RCTs, where randomization ensures that the exposure of interest is independent of all other parameters at the time of randomization, classical observational approaches are prone to confounding bias. Propensity score (PS) matching is a causal modeling approach that addresses this limitation by mimicking the randomization process. We are using this method to assess the causal impact of high daily levels of PM2.5 on hospital admissions across the U.S. More specifically, we answer the question: how many cardiovascular hospital admissions could be prevented by lowering air pollution levels?
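The matching step can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes propensity scores have already been estimated (for example, by a logistic regression of exposure on confounders), and the `Unit` record, the caliper value, and the toy data are hypothetical. Each exposed unit is paired with the unexposed unit whose propensity score is closest, and the average outcome difference across pairs estimates the effect among the exposed.

```python
from typing import List, NamedTuple, Optional


class Unit(NamedTuple):
    treated: bool      # e.g., a high-PM2.5 day (hypothetical coding)
    propensity: float  # estimated P(treated | confounders), assumed given
    outcome: float     # e.g., a hospital admission count (toy values)


def match_and_estimate(units: List[Unit], caliper: float = 0.05) -> Optional[float]:
    """1:1 nearest-neighbor matching without replacement on the propensity score.

    Returns the mean treated-minus-control outcome difference over matched
    pairs, or None if no pair falls within the caliper.
    """
    treated = [u for u in units if u.treated]
    controls = [u for u in units if not u.treated]
    diffs = []
    for t in treated:
        if not controls:
            break
        # Closest remaining control in propensity-score distance
        best = min(controls, key=lambda c: abs(c.propensity - t.propensity))
        if abs(best.propensity - t.propensity) <= caliper:
            diffs.append(t.outcome - best.outcome)
            controls.remove(best)  # each control is used at most once
    return sum(diffs) / len(diffs) if diffs else None


units = [
    Unit(True, 0.80, 12.0),
    Unit(True, 0.30, 9.0),
    Unit(False, 0.78, 10.0),
    Unit(False, 0.32, 8.0),
    Unit(False, 0.10, 5.0),
]
effect = match_and_estimate(units)
```

The caliper discards pairs whose scores are too far apart, which trades sample size for closer comparability between matched units.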

Personalizing mental health care: Bringing machine learning support into the clinic through user-centered design
Maia Jacobs (Harvard John A. Paulson School of Engineering and Applied Sciences)

The promise of machine learning (ML) in medicine is alluring, but few tools are actually being used in clinical practice. One area of healthcare that researchers have expected to benefit from decision support tools (DSTs), but that has yet to adopt such technological support, is the treatment of major depressive disorder (MDD). Toward the goal of translating ML predictions into real-world DSTs, we explore to what extent clinical practice could be improved if clinicians were presented with recommendations produced by such models. Using a series of experiments and co-design sessions with healthcare providers, we found that the implementation of ML tools with high accuracy rates may be insufficient to improve treatment selection accuracy, and we also demonstrated the risk of overreliance when clinicians are shown incorrect treatment recommendations. Our findings also indicate that current trends in explainable AI may be inappropriate for clinical environments, and we consider paths toward designing these tools for real-world medical systems. Collectively, this work demonstrates the importance of human-computer interaction and data science collaborations in designing ML tools for clinical decision-making.