Projects – UCSD-HBCU Computer Science for Social Impact Summer Institute

2024 projects

Advisor. Jingbo Shang
Title. Extracting Emerging Phrases from Massive Text Corpora
We have demonstrated a path to quality phrase extraction from massive text corpora using distant supervision from existing knowledge bases (e.g., Wikipedia) (Shang et al., 2018). This data-driven method unifies multiple statistical signals using ensemble learning techniques, but it requires scanning the entire corpus every time a new document is added. The method is also effective only when the phrases are frequent. Our goal in this project is to develop a novel method that more easily incorporates new documents and infrequent, emerging phrases. We will employ recent advances in neural language models, as well as statistical methods that detect trends in phrase usage.
The student would learn how to conduct parallel computing and train deep networks for this task, as well as develop new deep learning models for text processing.

Advisor. Gary Cottrell
Title. Aiding in Identifying Natural Molecules for Medicine
In collaboration with William Gerwick at Scripps Institution of Oceanography, we have been developing systems to speed up structure determination from NMR spectra of small molecules extracted from Natural Products (NPs) (Zhang et al., 2017; Li et al., 2020; Reher et al., 2020). Approximately 70% of all approved drugs are NPs, their analogues, or a chemical modification of an existing NP (Newman & Cragg, 2016). In addition to these academic and societal benefits, natural products research (NPR) provides a powerful incentive for the conservation and sustainable use of biodiversity and biodiverse habitats (Kursar et al., 2006). A bottleneck in this research is determining the structure of a new molecule. Each molecule is analyzed by obtaining its NMR spectrum, but it then takes a skilled researcher approximately two weeks to infer the structure from the spectrum. Our goal is to learn a mapping from the NMR spectra of natural products (sometimes called the “fingerprint” of a molecule) to their structure. We have been developing advanced deep learning techniques to do this.
We are developing improvements over our previous methods to more specifically produce the structure of a molecule in terms of SMILES strings. The student would learn how to train deep networks for this task.

Advisor. Fatemeh Asgarinejad
Title. Predicting Influenza Infections through Enhanced SEIR Modeling: Anticipating the Impact of Vaccination in the United States

The control of influenza outbreaks is an essential public health concern in the United States, where thousands of people die from the disease each year. In this study, we aim to predict the number of infected individuals by incorporating vaccination rates into the traditional SEIR (Susceptible-Exposed-Infectious-Removed) model. Leveraging time-series data on influenza vaccinations in various regions of the United States, we aim to forecast future vaccination rates. These predicted rates are then integrated into our modified SEIR model to project the number of infected cases. Finally, we aim to analyze infection and death rates under different vaccination rates. We also extend our model to a regional model for San Diego County.
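As a rough illustration of how a vaccination rate can enter an SEIR model, the sketch below integrates the four compartments with a forward-Euler step. It is not the project's model: the function name, the parameter values, and the choice to move vaccinated susceptibles directly into the Removed compartment (which assumes a fully effective vaccine and a constant daily vaccination rate) are all assumptions made for illustration.

```python
def simulate_seir_v(population, initial_infectious, beta, sigma, gamma,
                    vaccination_rate, days, dt=0.1):
    """Forward-Euler SEIR simulation with a constant vaccination rate.

    beta: transmission rate, sigma: rate of leaving the Exposed state,
    gamma: removal rate, vaccination_rate: fraction of susceptibles
    vaccinated per day (hypothetical, assumed fully protective).
    Returns one (S, E, I, R) tuple per day, including day 0.
    """
    s = float(population - initial_infectious)
    e, i, r = 0.0, float(initial_infectious), 0.0
    history = [(s, e, i, r)]
    steps_per_day = int(round(1.0 / dt))
    for _ in range(days):
        for _ in range(steps_per_day):
            new_exposed = beta * s * i / population
            new_infectious = sigma * e
            new_removed = gamma * i
            new_vaccinated = vaccination_rate * s
            s += dt * (-new_exposed - new_vaccinated)
            e += dt * (new_exposed - new_infectious)
            i += dt * (new_infectious - new_removed)
            r += dt * (new_removed + new_vaccinated)
        history.append((s, e, i, r))
    return history


# Illustrative (not fitted) parameters: compare no vaccination with 1% per day.
baseline = simulate_seir_v(1_000_000, 10, 0.4, 0.2, 0.1, 0.0, 300)
vaccinated = simulate_seir_v(1_000_000, 10, 0.4, 0.2, 0.1, 0.01, 300)
peak_baseline = max(i for _, _, i, _ in baseline)
peak_vaccinated = max(i for _, _, i, _ in vaccinated)
```

In a study like the one described, the constant vaccination_rate would be replaced by the forecast, time-varying rate for each region.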

Advisor. Ndapa Nakashole
Title. Few Shot Text Classification of Clinical Text
A challenge that arises when automatically analyzing natural language in clinical text is that clinicians are free to use their own choice of words to describe patient conditions, medications, and other items in the reports. If we analyze the data in its raw form, the results can be misleading due to false positives and false negatives arising from inconsistencies in the data. This lack of uniformity necessitates data normalization so that across different patient reports, even in the face of polysemy, abbreviations, spelling errors, or other variations, the same concepts are mapped to the same name. This project will study entity detection and normalization in biomedical text data in order to link mentions of entities such as symptoms and medications to their formal names in a biomedical ontology, the Unified Medical Language System (UMLS).
Labeled data in the medical domain is limited, which arises naturally due to rare events. We will therefore develop specialized algorithms for few-shot and zero-shot classification, where the goal is to perform classification with zero or only a few examples per class.
In this project the student will learn about and extend deep learning methods for few-shot classification. Implementation will be in PyTorch.
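To make the few-shot setting concrete, here is a minimal sketch of one common approach, nearest-prototype classification: each class is represented by the mean of its few labeled examples, and a query is assigned to the closest class mean. The project description does not name a specific method, so this choice is an assumption, and the two-dimensional vectors below are made-up stand-ins for embeddings that a trained text encoder would produce. The sketch uses plain Python rather than PyTorch to stay self-contained.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def build_prototypes(support):
    """Map each class label to the mean of its few support embeddings."""
    return {label: mean_vector(vecs) for label, vecs in support.items()}


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify(query, prototypes):
    """Return the label whose prototype is nearest to the query embedding."""
    return min(prototypes,
               key=lambda label: squared_distance(query, prototypes[label]))


# Two support examples per class (a "2-shot" task); values are hypothetical.
support = {
    "symptom": [[0.9, 0.1], [0.8, 0.2]],
    "medication": [[0.1, 0.9], [0.2, 0.7]],
}
prototypes = build_prototypes(support)
predicted = classify([0.85, 0.2], prototypes)
```

Adding a new class only requires a handful of labeled examples to form its prototype, which is what makes this family of methods attractive when labels are scarce.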

Advisor. Jorge Cortes, Department of Mechanical and Aerospace Engineering (MAE)
Title. Multi-agent Robotics

Our lab uses Turtlebot 4 ground robots and Crazyflie 2.0 drones for environments and objectives that require multiple robots. This project focuses on the multi-agent implementation of algorithms from one of the following themes: machine learning, path planning, mapping, or obstacle avoidance. The projects require knowledge of robot dynamics, control, and localization. More specifically, the available algorithms relate to cooperative multi-agent reinforcement learning, distributed estimation, and visibility graphs.

Student Responsibilities:
Learn about unicycle and UAV dynamics, controls, and localization utilizing Python and ROS2.
Implement algorithms such as path planning, reinforcement learning, dynamic obstacle avoidance, and Simultaneous Localization and Mapping (SLAM).

Experience with object-oriented programming, data structures, and algorithms is required.
Prior experience with Python, robotics, control, and the Robot Operating System (ROS, ideally ROS2) is preferred.

Advisor. Akbar Rafiey, EnCORE Postdoctoral Fellow
Title. Privacy and fairness in decentralized computation
As AI and ML technologies become more widespread, it’s crucial to ensure a balance between computational power and ethical standards. Collecting data at a central server is often impractical and raises privacy concerns, so decentralized and distributed approaches such as federated learning are in high demand. However, decentralized approaches face their own challenges regarding fairness and privacy. In this project we aim to develop efficient decentralized mechanisms that are applicable to various practical settings such as recommender systems, feature selection for sensitive data, and resource allocation. Our primary focus will be on the privacy and fairness of such mechanisms.
In this project the student will learn the key concepts in privacy-preserving and fair computation and will be involved in hands-on algorithm development and performance analysis.

Advisor. Neophytos Charalambides, EnCORE Postdoctoral Fellow
Title. Security in Distributed Computation
Since the time of Julius Caesar, securing communication has played an important role in delivering messages, and major advancements took place after World War II. More recently, with the advent of massive datasets, there has been interest in securely carrying out computations without revealing users' information, in order to solve optimization and machine learning problems in a distributed fashion. Federated learning is a collaborative, iterative technique in which each user works with its own dataset without revealing it. In this project, we will consider a traditional approach from differential privacy, adding noise to measurements in order to hide the information, while also allowing redundancy in order to recover exact or approximate gradients in a distributed fashion for a gradient descent procedure. While we will work with synthetic and real-valued data, part of the difficulty is determining what noise is appropriate so that the gradients do not reveal information about the data itself, which means the noise should be data dependent. The student collaborator will be asked to help determine the details of the technique and implement it in Python, Matlab, or R, in order to study the statistical guarantees and quantitatively survey the findings.
Student Responsibilities: Carry out a basic literature review on the topics and help explicitly define the privacy/communication protocol. Once this has been determined, the protocol will be implemented on different datasets in order to quantitatively study its privacy from a statistical perspective. The implementation will not take place on a distributed system; instead, we will emulate such a system on a local server.
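The noise-adding idea can be sketched on a single machine, in the spirit of emulating the distributed system locally. In the sketch below, each user computes a least-squares gradient on its own data, clips it to a maximum norm, adds Gaussian noise, and only then shares it with a server that averages the results. This is a generic clip-and-add-noise pattern, not the project's protocol: the function names, max_norm, and noise_std are hypothetical, the noise is not calibrated to any formal privacy guarantee, it is data independent (whereas the project aims for data-dependent noise), and the redundancy component is not modeled.

```python
import random


def local_gradient(weights, xs, ys):
    """Gradient of the mean of 0.5 * (w . x - y)^2 over one user's data."""
    grad = [0.0] * len(weights)
    for x, y in zip(xs, ys):
        err = sum(w * xi for w, xi in zip(weights, x)) - y
        for j, xj in enumerate(x):
            grad[j] += err * xj / len(xs)
    return grad


def clip(vec, max_norm):
    """Scale vec down so that its Euclidean norm is at most max_norm."""
    norm = sum(v * v for v in vec) ** 0.5
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in vec]


def privatize(grad, max_norm, noise_std, rng):
    """Clip a gradient, then add independent Gaussian noise per coordinate."""
    return [g + rng.gauss(0.0, noise_std) for g in clip(grad, max_norm)]


def aggregate(shared_grads):
    """Server side: average the gradients shared by all users."""
    count = len(shared_grads)
    dim = len(shared_grads[0])
    return [sum(g[j] for g in shared_grads) / count for j in range(dim)]


# Emulate two users locally; the server only ever sees the noisy gradients.
rng = random.Random(0)
weights = [0.0, 0.0]
users = [
    ([[1.0, 0.0], [0.0, 1.0]], [1.0, 2.0]),
    ([[1.0, 1.0]], [3.0]),
]
shared = [privatize(local_gradient(weights, xs, ys), 1.0, 0.1, rng)
          for xs, ys in users]
step_direction = aggregate(shared)
```

Studying how noise_std trades off gradient accuracy against what the shared values reveal is the kind of quantitative question the project describes.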

Advisor. Yusu Wang (Halıcıoğlu Data Science Institute)
Title. Exploring the Power of Graph Neural Networks in Geometric: A TILOS Research Initiative
Graph-structured data are ubiquitous in various scientific and engineering domains, e.g., social networks, bond graph representations of biochemical molecules, road networks, knowledge graphs, netlists from chip design, and so on. Many tasks require us either to make decisions on a collection of graphs (e.g., property prediction of drugs) or to perform optimization on an input graph (e.g., computing a maximum independent set of an input graph). Recently there has been tremendous progress in developing various graph neural networks (e.g., message-passing based or transformer based) to perform learning and optimization on graphs. In this project, we will explore the effectiveness of these graph learning models on some geometric optimization problems. This work is part of the broader research agenda at TILOS (https://www.tilos.ai/), an NSF National AI Institute with the goal of pioneering learning-enabled optimization that transforms chip design, robotics, networks, and other use domains.
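For readers new to message-passing graph neural networks, the sketch below shows a single round of message passing on a tiny graph: each node averages its neighbors' feature vectors and blends the result with its own. It is deliberately simplified and is an illustrative assumption rather than any model used in the project; a real layer would add learned weight matrices and a nonlinearity, typically in a framework such as PyTorch. The function name and the self_weight parameter are hypothetical.

```python
def message_passing_step(adjacency, features, self_weight=0.5):
    """One round of mean-aggregation message passing (no learned weights).

    adjacency: dict mapping each node to a list of its neighbours.
    features: dict mapping each node to a list of floats.
    Returns a new feature dict; the input is left unchanged.
    """
    updated = {}
    for node, own in features.items():
        neighbours = adjacency.get(node, [])
        if neighbours:
            aggregated = [
                sum(features[n][k] for n in neighbours) / len(neighbours)
                for k in range(len(own))
            ]
        else:
            aggregated = own  # isolated node keeps its own features
        updated[node] = [
            self_weight * a + (1.0 - self_weight) * b
            for a, b in zip(own, aggregated)
        ]
    return updated


# Path graph a - b - c with one-dimensional node features.
adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
features = {"a": [1.0], "b": [0.0], "c": [1.0]}
after_one_round = message_passing_step(adjacency, features)
```

Stacking several such rounds lets information travel further across the graph, which is why the number of layers matters for tasks that depend on long-range structure.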

Student Responsibilities: Software development in Python and at least one popular deep learning framework such as PyTorch.

Advisors. Tajana Rosing, Niema Moshiri
Title. Accelerating Bioinformatics Workflows: Next-Generation Architectures and Algorithms for Advanced Data Analysis in Biology
The world of biology is constantly evolving, and with recent advances in sequencing technologies and mass spectrometry, data production has sped up considerably, resulting in datasets that are orders of magnitude larger than ever before. This has created a need for faster tools to analyze these massive datasets. In this project, we will explore designing and implementing next-generation architectures and algorithms to accelerate the bioinformatics workflows used to analyze data produced by these evolving technologies.

Student Responsibilities: Software development in Python, bash scripting, experimental design, and data collection and analysis. Experience with Raspberry Pi or Arduino-like platforms is a plus. The student will gain experience with neural networks, basic optimization, and deep learning frameworks such as PyTorch.