Postdoctoral Research Fund

The Harvard Data Science Initiative Postdoctoral Fellow Research Fund incentivizes and supports cross-disciplinary collaboration between data scientists at the postdoctoral level.

2021 Request For Proposals

2020 Projects


How to see through the skull?: Predicting brain shift for safer neurosurgery
Adel Djellouli (Harvard John A. Paulson School of Engineering and Applied Sciences) & Nazim Haouchine (Harvard Medical School)

A craniotomy is the surgical removal of part of the bone from the skull to expose the brain to surgeons so they can remove a malignant brain tumor. Through this project we will develop a method to predict the geometry and appearance of the brain surface before opening the skull. Surgeons will be able to see through the patient's skull to optimize where and how they will expose the brain. Using machine learning techniques, we can display a predicted image of the deformed vessels, accounting for brain shift, and synthesize its appearance using Neural Style Transfer so that it can be easily interpreted by neurosurgeons. We believe this solution will lead to safer neurosurgical procedures in the operating theater.

Exploring new methods to investigate adherence to treatment for drug-resistant tuberculosis
Stephanie Law (Harvard Medical School)

Every year, an estimated 10 million people contract tuberculosis (TB) and nearly 2 million people die from TB-related causes worldwide. The emergence of drug-resistant TB (DR-TB) is one of the major threats to controlling the global TB epidemic; at least 5% of all TB cases and 15% of all TB deaths are due to drug-resistant isolates. DR-TB treatment is extremely difficult and expensive, and has low treatment success rates (56% globally) and high mortality rates (40% to over 70%). Although poor treatment adherence is likely a major contributor to these low success rates, there is little research on patient adherence patterns and on interventions that improve adherence. Patient-provider relationships, particularly the aspect of trust, can influence patient treatment adherence, thereby affecting health outcomes in TB patients. This research project will explore DR-TB treatment adherence patterns, identify DR-TB patients at highest risk of poor adherence, and evaluate whether provider trust mediates the effect of patient adherence risk factors on DR-TB treatment adherence patterns. Specifically, this will entail a novel application of latent-class trajectory models, machine learning algorithms, and causal mediation methods to investigate treatment adherence in DR-TB patients. The findings will inform targeted strategies to improve adherence among high-risk patients, provide new methodologies to analyze and investigate TB adherence, and guide future interventions aimed at improving provider trust.


Regularized Maximum Likelihood Imaging Techniques: A New Method for Detecting Planets
Richard Teague (Harvard Faculty of Arts and Sciences)

The Atacama Large (sub-)Millimeter Array (ALMA) has revolutionized our understanding of planet formation. In particular, it has produced the highest-resolution images to date of the planet formation environment, the protoplanetary disk, revealing an exceptional level of structure. It has long been believed that these structures can be created by still-forming protoplanets, which carve out gaps and rings in the dust of these systems. Despite this extensive evidence of protoplanets, attempts to detect them by traditional means (i.e., looking for their heat signatures directly) have had little success. Recently, we have shown that these protoplanets cause small disturbances in the gas around them, driving waves and ripples in the gas, similar to a fish swimming through water. Although we have started to observe the first of these signatures, we are still limited by the quality of the images we can take with ALMA. When reconstructing an image from multiple telescopes (an interferometer), we only sample different parts of the sky and have to make assumptions about what we missed, a problem that the Event Horizon Telescope spent a considerable amount of time considering. This project aims to explore a new group of methods, ‘regularized maximum likelihood techniques’, to reconstruct these images, with a focus on recovering images that maximize the chances of detecting these embedded planets. Application of these methods to the growing archive of observations promises to present an entirely new approach to planet hunting.
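
As a rough illustration of the idea behind regularized maximum likelihood imaging, the sketch below reconstructs a toy one-dimensional "image" from an incomplete set of Fourier measurements by minimizing a data misfit plus a penalty term. Everything here is an assumption made for illustration: the 16-pixel sky, the sampled frequencies, the simple squared-brightness penalty with a non-negativity constraint, and the function names are hypothetical and are not taken from the project, which works with real two-dimensional ALMA data.

```python
import cmath

def dft_row(k, n):
    """Row of the discrete Fourier transform for spatial frequency k."""
    return [cmath.exp(-2j * cmath.pi * k * j / n) for j in range(n)]

def chi_squared(image, rows, vis):
    """Sum of squared misfits between model and observed measurements."""
    return sum(abs(sum(w * p for w, p in zip(row, image)) - v) ** 2
               for row, v in zip(rows, vis))

def rml_reconstruct(vis, freqs, n, lam=0.01, iters=500):
    """Minimize chi^2 + lam * sum(pixel^2) over non-negative images."""
    rows = [dft_row(k, n) for k in freqs]
    step = 1.0 / (2.0 * (n + lam))  # conservative step size for this loss
    image = [0.0] * n
    for _ in range(iters):
        resid = [sum(w * p for w, p in zip(row, image)) - v
                 for row, v in zip(rows, vis)]
        grad = [2.0 * sum((row[j].conjugate() * r).real
                          for row, r in zip(rows, resid))
                + 2.0 * lam * image[j]
                for j in range(n)]
        # Gradient step, then projection onto non-negative brightness.
        image = [max(0.0, p - step * g) for p, g in zip(image, grad)]
    return image

# Toy sky (assumed): one point source at pixel 5, seen at 6 of 16 frequencies.
n = 16
freqs = [0, 1, 2, 3, 5, 7]
true_image = [0.0] * n
true_image[5] = 1.0
rows = [dft_row(k, n) for k in freqs]
vis = [sum(w * p for w, p in zip(row, true_image)) for row in rows]

recovered = rml_reconstruct(vis, freqs, n)
print(max(range(n), key=lambda j: recovered[j]))  # index of the brightest pixel
```

The penalty weight `lam` controls how strongly the assumed prior shapes the parts of the image that the measurements do not constrain; other regularizers could be swapped in by changing the penalty and its gradient.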


Learning Peptide-specific T Cell Receptors in Human Cancers by Deep Neural Network and Structural Modeling
Songpeng Zu (Harvard Faculty of Arts and Sciences)

T cells, a type of immune cell, play an essential role in discovering and killing tumor cells. Each T cell carries a unique type of protein on its surface called the T cell receptor (TCR). Some TCRs can target peptides presented specifically on tumor cell surfaces. Recognizing tumor-specific TCRs in T cell repertoires can help us assess an individual's chance of having cancer and design appropriate cancer immunotherapy for that individual.

To find tumor-specific TCRs, current approaches first collect the T cell repertoires from both the normal tissues and the tumor tissues of one patient and detect the TCRs enriched in tumor tissue. They then look for significantly tumor-tissue-enriched TCRs that are shared among patients. However, the shared TCRs seem to be commonly generated sequences and are unlikely to be tumor-specific.

The major issue with the first step is that the TCRs observed in normal tissue reflect only a small part of the overall normal TCRs in one individual. The tumor-tissue-enriched TCRs might therefore be normal TCRs that were simply not observed in the available normal tissue samples.

A direct way to address this issue is to use TCRs from a large number of normal tissues to describe the population of normal TCRs explicitly. Describing this population, however, is difficult. Inspired by the field of natural language processing, we treat each TCR in an individual like a sentence in a document. We can then use a deep learning-based representation model to describe the population composed of hundreds of millions of TCRs from normal tissues.


2019 Projects


Machine learning methods for integrating biological multi-omic datasets to decipher parasite development in the malaria mosquito
Duo Peng (Harvard T.H. Chan School of Public Health)

Duo Peng and colleagues identified mosquito metabolic pathways that favor the early development of Plasmodium falciparum—the deadliest human malaria parasite—in the Anopheles gambiae mosquito. Duo performed high-depth transcriptome sequencing of mosquitoes during early stages of malaria parasite infection. Using the data collected, he developed a machine learning model that predicts parasite load in mosquitoes using expression values of mosquito genes. The machine learning model uses the Extreme Gradient Boosting algorithm framework. The fine-tuned model can explain 65% of the variation of parasite load using mosquito gene expression values while the model is rigorously safeguarded against overfitting. According to the model, three genes involved in fatty acid metabolism and one gene involved in the amino acid metabolism strongly shape malaria development in mosquitoes. Duo Peng and colleagues are currently validating these gene candidates experimentally. Confirmed genes and related metabolic pathways can serve as targets of malaria transmission control programs to block the transmission of this devastating pathogen.
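
To make the modeling step more concrete, the sketch below shows plain gradient boosting with one-split regression trees ("stumps") predicting a load value from two expression features, then reports the fraction of variance explained (R²). This is a simplified stand-in, not the Extreme Gradient Boosting implementation used in the project, and the eight data points, feature values, and function names are invented for illustration. The R² here is computed on the training data only; guarding against overfitting would require scoring on held-out samples.

```python
def fit_stump(rows, residuals):
    """Find the single feature/threshold split with the lowest squared error."""
    best = None
    for f in range(len(rows[0])):
        for t in sorted({row[f] for row in rows}):
            left = [r for row, r in zip(rows, residuals) if row[f] <= t]
            right = [r for row, r in zip(rows, residuals) if row[f] > t]
            if not left or not right:
                continue
            lm = sum(left) / len(left)
            rm = sum(right) / len(right)
            sse = (sum((r - lm) ** 2 for r in left)
                   + sum((r - rm) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, f, t, lm, rm)
    _, f, t, lm, rm = best
    return lambda row: lm if row[f] <= t else rm

def boost(rows, y, rounds=10, lr=0.5):
    """Fit stumps to the current residuals and add them with shrinkage lr."""
    base = sum(y) / len(y)
    stumps = []
    pred = [base] * len(y)
    for _ in range(rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(rows, residuals)
        stumps.append(stump)
        pred = [pi + lr * stump(row) for pi, row in zip(pred, rows)]
    return lambda row: base + lr * sum(s(row) for s in stumps)

def r_squared(y, pred):
    """Fraction of the variance in y explained by the predictions."""
    mean = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - mean) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# Invented toy data: two expression features per mosquito and a load value
# that depends only on the first feature.
expression = [[0.1, 0.9], [0.2, 0.3], [0.3, 0.7], [0.4, 0.2],
              [0.6, 0.8], [0.7, 0.1], [0.8, 0.6], [0.9, 0.4]]
load = [0.0, 0.0, 0.0, 0.0, 10.0, 10.0, 10.0, 10.0]

model = boost(expression, load)
print(r_squared(load, [model(row) for row in expression]))
```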

Assessing the Air Pollution Effect on Hospital Admissions in the U.S.: A Matching Approach for Big Data
Maayan Yitshak Sade (Harvard T.H. Chan School of Public Health)

Numerous epidemiological studies have concluded that exposure to particulate air pollution (PM2.5, particulate matter smaller than 2.5 µm in diameter) increases the risk of mortality and of cardiovascular, respiratory, neurological, and psychological morbidity, and shortens life expectancy. Randomized controlled trials (RCTs) are not feasible when studying population-based air pollution effects, so the majority of the evidence relies on classical regression methods with adjustment for possible confounders. Unlike RCTs, where randomization ensures that the exposure of interest is independent of all other parameters at the time of randomization, classical observational approaches are prone to confounding bias. Propensity score (PS) matching is a causal modeling approach that overcomes this limitation by mimicking the randomization process. We are using this method to assess the causal impact of high daily levels of PM2.5 on hospital admissions across the U.S. More specifically, we answer the question: how many cardiovascular hospital admissions could be prevented by lowering air pollution levels?
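
The sketch below illustrates how matching on a propensity score can remove confounding that distorts a naive comparison. It is a minimal toy under stated assumptions: the eight "days", the single binary confounder (a hot day), the admission counts, and the function names are all invented, and the propensity score is estimated as the share of high-pollution days within each confounder stratum rather than with the models the project may actually use.

```python
from collections import defaultdict

def estimate_propensity(units):
    """Estimate P(treated | confounder) as the treated share per stratum."""
    counts = defaultdict(lambda: [0, 0])  # confounder -> [treated, total]
    for confounder, treated, _ in units:
        counts[confounder][0] += treated
        counts[confounder][1] += 1
    return {c: t / total for c, (t, total) in counts.items()}

def naive_effect(units):
    """Difference in mean outcome, ignoring the confounder entirely."""
    treated = [y for _, t, y in units if t]
    control = [y for _, t, y in units if not t]
    return sum(treated) / len(treated) - sum(control) / len(control)

def matched_effect(units):
    """Average effect on treated units, matching on the propensity score."""
    score = estimate_propensity(units)
    controls = [(score[c], y) for c, treated, y in units if not treated]
    diffs = []
    for c, treated, y in units:
        if not treated:
            continue
        # Nearest-neighbor match on the score; ties are averaged.
        nearest = min(abs(s - score[c]) for s, _ in controls)
        matches = [yc for s, yc in controls
                   if abs(abs(s - score[c]) - nearest) < 1e-12]
        diffs.append(y - sum(matches) / len(matches))
    return sum(diffs) / len(diffs)

# Invented toy data: (hot_day, high_pm25_day, hospital_admissions).
# Hot days have both more pollution and more admissions.
days = [
    (0, 1, 12), (0, 0, 10), (0, 0, 10), (0, 0, 10),
    (1, 1, 22), (1, 1, 22), (1, 1, 22), (1, 0, 20),
]

print(naive_effect(days))    # inflated by the hot-day confounder
print(matched_effect(days))  # compares days with similar propensity
```

In this toy data the unmatched comparison mixes the effect of pollution with the effect of hot weather, while the matched comparison only contrasts days that were about equally likely to be high-pollution days.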

Personalizing mental health care: Bringing machine learning support into the clinic through user-centered design
Maia Jacobs (Harvard John A. Paulson School of Engineering and Applied Sciences)

The promise of machine learning (ML) in medicine is alluring, but few tools are actually being used in clinical practice. One area of healthcare that researchers have expected to benefit from the implementation of decision support tools (DSTs), but that has yet to adopt such technological support, is major depressive disorder (MDD). Towards the goal of translating ML predictions to real-world decision support tools, we explore to what extent clinical practice could be improved if clinicians were presented with recommendations produced by such models. Using a series of experiments and co-design sessions with healthcare providers, we found that the implementation of ML tools with high accuracy rates may be insufficient to improve treatment selection accuracy, while also demonstrating the risk of overreliance when clinicians are shown incorrect treatment recommendations. Our findings also indicate that current trends in explainable AI may be inappropriate for clinical environments, and we consider paths towards designing these tools for real-world medical systems. Collectively, this work demonstrates the importance of human-computer interaction and data science collaborations in designing ML tools for clinical decision-making.