Become an MSSE Partner
Collaborate with students to develop scalable computational software and solve problems in the molecular sciences.
UC Berkeley’s Master of Molecular Science and Software Engineering (MSSE) prepares students with a background in chemistry, physics, biology, engineering, and computer science for careers in computational science, data science, machine learning, and software engineering. To learn more, view our curriculum.
At the end of the program, students partner with companies and labs to address complex and challenging machine learning and software engineering problems. Projects may be self-contained or part of a larger business initiative. Supervised by a faculty member, students will spend 10 hours a week working on their capstone for 16 weeks during their last spring semester.
Partner with Us
Project Tracks
Projects may consist of software development, machine learning, mathematical modeling and simulations, and high-performance computing. Given the wide variety of student backgrounds, professional interests, and computational science topics covered in the MSSE program, the capstone projects will be classified into one of the following three professional interdisciplinary tracks:
A project on this track focuses on the research and development of a computational science application. Students understand the scientific problem at hand and the impact that their project will have in a particular field of science.
A project on this track focuses on the development of large-scale software libraries, tools, or computational applications relevant to molecular science. Students understand the need for scaling a given computational application, the computational complexity of applications and key kernels, and the impact that the project deliverables will have in a particular field of science.
A project on this track focuses on the development of a library or software package for computational sciences applying the best practices of software engineering and covering all elements of the software engineering cycle. Students understand the need for such a library or service to the computational science community.
Capstone Partnership Timeline
Project Proposal Period Begins
Project Proposal Period Closes
Projects Recommended to Students
Students Assigned to Capstone Projects
Capstone Begins
Capstone Ends
Curriculum Highlights
- Software Engineering
- C++, Python
- Computational Science
- Machine Learning
- Data Science
- Deep Learning
- Mathematical Modeling & Simulations
- Molecular Science
- Data Visualization
- High Performance Computing
- Computational Quantum Chemistry
FAQ
Examples of Current Projects
Uncertainty Quantification for AMPL point predictions
A predictive model is only as useful as its uncertainty quantification (UQ), which measures how reliable each point prediction is. UQ is especially important for drug discovery as we explore chemical space beyond the training data distribution. In this project, students will learn to use AMPL, train models for property prediction, and evaluate the models' point predictions with UQ methods. They will examine classic UQ methods used for chemical compounds, as well as modern UQ methods designed for neural network models. Finally, they will apply various visualization methods to present the prediction uncertainties.
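As a rough illustration of what an uncertainty estimate attached to a point prediction can look like, the sketch below uses ensemble disagreement: several models predict the same property, and the spread of their predictions serves as the uncertainty. This is one common approach, not necessarily the one the project will adopt. The function name and the prediction values are hypothetical, and the code does not use AMPL's API.

```python
from statistics import mean, stdev

def ensemble_uncertainty(predictions_per_model):
    """Return (mean, sample std) per compound from an ensemble's predictions.

    predictions_per_model: one inner list per model, each holding that
    model's point prediction for every compound, in the same order.
    """
    per_compound = zip(*predictions_per_model)  # regroup values by compound
    return [(mean(values), stdev(values)) for values in per_compound]

# Hypothetical predictions from three models for two compounds.
preds = [
    [1.0, 2.0],
    [1.2, 3.0],
    [0.8, 4.0],
]
results = ensemble_uncertainty(preds)
for index, (mu, sigma) in enumerate(results):
    print(f"compound {index}: prediction={mu:.2f} +/- {sigma:.2f}")
```

In this toy input the models agree closely on the first compound and disagree on the second, so the second prediction carries the larger uncertainty.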
CNNs and UNets for dementia prediction
ADNI is an open-source dataset for tracking the progression of Alzheimer’s disease in thousands of subjects. CNNs have demonstrated relative success in predicting a patient’s diagnosis from images alone but, so far, the anatomic connectivity between brain regions has been ignored. You will be evaluating the impact of considering the known anatomical connectivity between brain regions to predict diagnosis in this dataset by using open-source Graph Convolutional NNs, CNNs or UNets.
Inference of biophysical models of pathology transmission in Alzheimer’s disease
Mathematical biophysical models of pathology progression in Alzheimer's disease and other degenerative diseases have now been available for a decade. These models attempt to capture how tau and amyloid pathology will spread on the brain's anatomic connectivity network. We are offering a new project that will combine these biophysical models with AI/ML, such that the biophysical model's parameters are inferred using Simulation-Based Inference (SBI) or another neural network tool. The student will be tasked with developing and extending this approach and applying it to both mouse brain and human brain pathology images. The end product will be a method that can infer these parameters for a patient and use them to simulate the patient's entire future pathoprogression trajectory.
Machine Learning for Protein Folding Stability in De Novo Design
Utilizing a curated dataset of high-quality protein folding stabilities, this project aims to enhance de novo design and structure prediction tools by incorporating a predictive layer for protein folding stability. Students will develop machine learning models to achieve this and are expected to evaluate their method's performance in comparison to existing approaches where applicable.
Peptide Digest
To perform top-tier science and deliver discoveries efficiently, scientists must stay consistently apprised of the latest relevant publications. However, thoroughly reading these papers is time-consuming and can often be wasteful when the details of the publication don’t satisfy the claims of the abstract or other criteria used for reading selection. Large Language Models (LLMs) have recently been democratized and are capable of efficiently reading, summarizing, and parsing text. We propose to customize an LLM to digest large numbers of potentially impactful publications and deliver efficient summaries and metadata to researchers. The output can include a “priority” score based on input parameters relevant to the department of interest; here we’ll start with computational peptide research. The models can additionally report relevant metadata such as the number of compounds, diversity of the series, presence of unnatural amino acids, cyclicity, etc. Not only can this help identify the most impactful papers, but it can also help with public dataset selection for methods development. Researchers will help identify the details of greatest interest and evaluate the quality of the results. This work can be done purely on non-proprietary data and could be highly publishable. Initial success with computational peptide research can be expanded to other domains. The work of the students can include the adaptation of a public model, the development of an API and web interface, and the automation of article scraping, digestion, and reporting.
10-K Reports Sentiment Analysis and Validation
Students will perform sentiment analysis on 10-K reports and validate the accuracy of the resulting predictions by comparing them with each company's actual financial performance in the following quarter, to see whether the sentiment analysis captures next-quarter performance.
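One simple way to frame the validation step is directional accuracy: does the sign of a report's sentiment score match the sign of the company's performance in the following quarter? The sketch below assumes sentiment is a signed score and performance is a signed return. The function name and all values are hypothetical, and a real study would likely use richer metrics.

```python
def directional_accuracy(sentiment_scores, next_quarter_returns):
    """Fraction of reports whose sentiment sign matches the sign of the
    following quarter's return. Pairs where either value is zero are skipped."""
    pairs = [
        (score, ret)
        for score, ret in zip(sentiment_scores, next_quarter_returns)
        if score != 0 and ret != 0
    ]
    if not pairs:
        raise ValueError("no usable (sentiment, return) pairs")
    hits = sum(1 for score, ret in pairs if (score > 0) == (ret > 0))
    return hits / len(pairs)

# Hypothetical sentiment scores and next-quarter returns for four reports.
sentiment = [0.6, -0.3, 0.1, -0.8]
returns = [0.04, 0.02, 0.01, -0.05]
accuracy = directional_accuracy(sentiment, returns)
print(f"directional accuracy: {accuracy:.2f}")
```

Here three of the four reports point in the same direction as the subsequent return; only the second report's negative sentiment is followed by a positive quarter.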
AI-Driven Therapeutic Target Identification
Numerous AI approaches have been developed to harness the data from high throughput assays, GWAS, and clinical records to identify novel therapeutic targets. More recently, LLMs have facilitated the target discovery process. The quickly growing list of published algorithms, each with strong performance demonstrated on particular tasks and datasets, poses a challenge for our computational team as we try to identify the best algorithms we can use to prioritize novel therapeutic targets we should invest in.
For this project, you will select a few of the most promising algorithms to implement and compare head-to-head performance when applied to target prediction tasks in disease areas and pathways with different types and amounts of data available. Your goals are to 1) characterize the strengths and weaknesses of the target prediction approaches and 2) identify the most important data types for successful target prediction. As a stretch goal, you will have the opportunity to propose and evaluate an appropriately scoped enhancement to an existing algorithm if you have identified a compelling opportunity.
Cell Failure Mode
A battery company has collected data from its battery testing cycles for the pouch cells built to evaluate proprietary product performance and benchmark it against leading industry-standard products. Some of these cells fail in testing for various reasons, and when this occurs it is important to take apart components of the cell with a structured teardown approach. There is limited capacity to complete the cell teardown process and collect images for these failed batteries. In this project, you will leverage advanced machine learning techniques, which may include convolutional neural networks (CNNs) and recurrent neural networks (RNNs), to predict cell failure based on early cell performance data, with a focus on materials-independent applications. The targeted approach involves transfer learning and anomaly detection methodologies to develop robust predictive models capable of identifying failure patterns across various cell types. We expect to use Python packages such as PyTorch for this project. Once models are trained, they will be deployed, versioned, and monitored on the deployment server.
Chemical Components Classifications for Machine Learning Training
Training machine learning models for fast, data-driven property prediction is essential in the search for novel chemical structures for electrolyte components. The training sets need a wide distribution of functional groups so that the models are accurate enough for lab staff to filter and target the best possible chemical candidates for product development. You will be writing a Python program to classify a library of over 1,000,000 compounds to build a design space of chemical functionality classes and down-select to approximately 50,000 compounds for quantum mechanical properties calculations. Further, this approach will facilitate broadening the current library and balancing the chemical functionality families that are available, to aid in producing a better training set for the ML model.
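The down-selection step can be pictured as drawing compounds round-robin across functionality classes so that rare classes are not crowded out by common ones. The sketch below assumes each compound already carries a class label (in practice a cheminformatics toolkit would likely assign these by substructure matching). The function name, identifiers, and class names are hypothetical.

```python
import random
from collections import Counter, defaultdict

def balanced_downselect(labeled_compounds, target_size, seed=0):
    """Pick up to target_size compounds, drawing round-robin across classes.

    labeled_compounds: iterable of (compound_id, functional_class) pairs.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for compound_id, functional_class in labeled_compounds:
        by_class[functional_class].append(compound_id)
    for members in by_class.values():
        rng.shuffle(members)  # randomize which members of a class are drawn

    selected = []
    # Keep cycling through the classes until the target is met or all are empty.
    while len(selected) < target_size and any(by_class.values()):
        for functional_class in sorted(by_class):
            members = by_class[functional_class]
            if members and len(selected) < target_size:
                selected.append((members.pop(), functional_class))
    return selected

# Hypothetical library: six esters, two ethers, one nitrile.
library = [(f"ester-{i}", "ester") for i in range(6)]
library += [("ether-0", "ether"), ("ether-1", "ether"), ("nitrile-0", "nitrile")]
picked = balanced_downselect(library, target_size=4)
class_counts = Counter(functional_class for _, functional_class in picked)
print(dict(class_counts))
```

Even though esters dominate the toy library, the selection of four includes the single nitrile and one ether, which is the balancing effect the project description is after.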
Examples of Past Capstone Projects
Implementation of a Flexible Parameterization Interface for the pymatgen.io.openmm Package
A collaboration with LBNL. This capstone project focused on enhancing the functionality of an io extension library for pymatgen, implementing flexible parameterization for improved accuracy in high throughput molecular simulations. The primary challenge addressed was modeling the behavior of atoms and molecules that can alter their charge depending on their environment. To tackle this issue, the project developed an interface that allows users to utilize a flexible array of force fields, including polarizable force fields, which offer more accurate simulation and prediction of material properties. This improvement is particularly significant for functional materials, such as catalysts and batteries, that are not accurately modeled using non-polarizable force fields. The successful implementation of this interface has expanded the capabilities of the Materials Project, contributing to the advancement of materials design and discovery.
An Open-Source Web Application for Molecular Descriptor Libraries and Datasets
A collaboration with MolSSI. Artificial intelligence and machine learning have greatly accelerated the rate of molecular discovery. However, the distribution and availability of data for these molecules still need to be improved. We have developed an open-source web application for molecular libraries and databases to address this challenge. The application uses containerization (Docker) and open-source libraries to create a modular and extensible platform for accessing molecular descriptors. This effort will streamline access to the descriptor data necessary for computational advancements. The application includes a web front-end built using React, a Python REST API backend, and a PostgreSQL database. Because the application is open-source and containerized, it can easily be modified or extended for other applications or data sets. Our use case is the Kolossal viRtual dAtabase for moleKular descriptors of orgaNophosphorus (Kraken) database, which contains descriptors for monodentate organophosphorus(III) ligands. The main deliverables were updated documentation and website architecture, automated back-end REST API testing, a molecule neighbor search feature based on dimensionality reduction (PCA and UMAP), and an updated substructure search page. The new web application aims to be easy to use, visually appealing, and packed with features that will make it a valuable resource for academic and industry researchers.
Integrating Industry Standard Rendering Techniques for Visualizing Solvation in Molecular Dynamics Simulations Using The MolecularNodes Software Plug-in
A collaboration with MolSSI. MolecularNodes is a Blender plugin that offers advanced rendering for molecular dynamics (MD) simulations. Unlike other visualization software, it employs industry-standard rendering techniques commonly used in film and video games. However, MolecularNodes is not able to natively display solvent interactions, which are often crucial in elucidating protein structure-function relationships. To address this, this project integrated SolvationAnalysis, a plugin that provides data structures for examining solute-solvent interactions, into MolecularNodes, enabling users to easily specify and visualize solute and solvent structures. This expansion of MolecularNodes’ functionality is expected to appeal to researchers, students, and non-specialists because it allows users to produce professional-quality visualizations of solvent interactions without needing sophisticated programming expertise.