Profile

Sarah Bratt

Assistant Professor, School of Information | Member of the Graduate Faculty

Research

Undergrad Research Opportunity For Credit
Using Machine Learning to Disambiguate Software Entity Mentions in PMC
We have a disambiguation problem for software mentions in the scientific literature. There is no "gold standard" dataset, so we have to manually annotate a set of mentions to train and evaluate our model.
Data: We are using the Chan Zuckerberg software citation mentions.
Students' roles: We need students to create a corpus of annotated software mentions to assist with disambiguating the software entities. We will then work with a CS professor to train a disambiguation model for the software entities.
Learning outcomes: data science applications, research methods, time management
Data Analysis: Once disambiguated, the software entities can be analyzed. We can ask: Where in a paper does software get cited in the scientific literature?
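As a rough illustration of how a hand-annotated corpus could be used to evaluate a disambiguation model, the sketch below computes simple accuracy over annotated mentions. The mention IDs, entity names, and the function `disambiguation_accuracy` are hypothetical and are not taken from the project's data or code; the project may well use a different metric.

```python
def disambiguation_accuracy(gold, predicted):
    """Fraction of annotated mentions whose predicted entity matches the annotation.

    gold and predicted map a mention ID to a canonical software entity name.
    Mentions missing from `predicted` count as errors.
    """
    if not gold:
        raise ValueError("gold annotations are empty")
    correct = sum(
        1 for mention_id, entity in gold.items()
        if predicted.get(mention_id) == entity
    )
    return correct / len(gold)


# Hypothetical annotations: student labels vs. model output.
gold = {"m1": "ImageJ", "m2": "ImageJ", "m3": "R", "m4": "SPSS"}
predicted = {"m1": "ImageJ", "m2": "Fiji", "m3": "R", "m4": "SPSS"}

accuracy = disambiguation_accuracy(gold, predicted)
print(f"accuracy = {accuracy:.2f}")
```

In practice the annotated corpus would be split so that the mentions used for evaluation are not also used for training.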
Undergrad Research Opportunity For Credit
Project Title: Mapping the Global "Supply Chain" of Genomics Research Datasets
Datasets are produced by scientists all over the world, who deposit data in international research data repositories. The datasets are then (re)used to pursue scientific knowledge, develop scientific innovations, and shape the allocation of resources to support further discovery. However, the infrastructures for producing and sharing data are skewed toward Western nations that possess the resources and historical cyberinfrastructure (e.g., computing systems, data repositories) to produce and share research data. Drawing on a novel source of data, the "big metadata" from GenBank, this project describes the volume of data produced by geographic region and analyzes the international distribution of labor on datasets in biomedical research and genomics. We also take a critical perspective, focusing on the power and influence of gatekeepers and industry actors in shaping the production and diffusion of knowledge in data-intensive science.
Data Collection: Scrape data from BioSamples. Find out whether papers on BioSample and BioProject are already linked. If not, scrape BioSample and BioProject data (using Selenium?) for as many GenBank records as possible. Store the BioSample and BioProject data on a server; link to GenBank records if possible.
Data Analysis: Calculate the extent of alignment between the genetic sample's country of origin and the authors of the associated publication. Analyze sample locations from 1992 to 2021: have the locations changed?
RQs (based on student interest):
1. Who are the industry, academic, and other collaborators on these projects?
2. Genomics has become a commercialized affair; the supply chain includes industry as well as academics. Who funds the projects with high misalignment? Who is the funder of the biosamples? (Use NIH ExPORTER data.)
3. Who is the PI on the grant providing the funding? (See NIH ExPORTER data.)
Data: GenBank, NCBI Taxonomy, S&T capacity index
Skills/tools preferred: Python, R, Linux, statistics, data visualization
Skills you will learn: social network analysis, visualization in R
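One way the alignment question in the Data Analysis step could be approached is sketched below: for each year, count the share of records whose sample country also appears among the countries of the publication's authors. The record layout (`year`, `sample_country`, `author_countries`) and the example values are assumptions made for illustration; they do not reflect the actual GenBank or BioSample schema or any real results.

```python
from collections import defaultdict


def alignment_by_year(records):
    """Return {year: share of records whose sample country is among the author countries}."""
    totals = defaultdict(int)
    aligned = defaultdict(int)
    for rec in records:
        year = rec["year"]
        totals[year] += 1
        if rec["sample_country"] in rec["author_countries"]:
            aligned[year] += 1
    return {year: aligned[year] / totals[year] for year in sorted(totals)}


# Hypothetical records, invented for this sketch.
records = [
    {"year": 1992, "sample_country": "Kenya", "author_countries": {"USA", "UK"}},
    {"year": 1992, "sample_country": "Brazil", "author_countries": {"Brazil"}},
    {"year": 2021, "sample_country": "Kenya", "author_countries": {"Kenya", "USA"}},
]

shares = alignment_by_year(records)
for year, share in shares.items():
    print(year, f"{share:.2f}")
```

A binary match is the simplest possible definition of alignment; weighting by author position or by the number of authors per country would be a natural refinement.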

Publications (7)

Articulating Institutionalization: How U.S. Academic Faculty Organize Work to Deposit Data and the Impacts on Long-Term Research Data Sustainability

2023

institutionalization, research data

Giving Research Software Engineers a Larger Stage Through the Better Scientific Software Fellowship

2022

software engineering, fellowship, research, scientific software, career development

The structural shift and collaboration capacity in GenBank Networks: A longitudinal study

2022

network analysis, genomic research, collaboration dynamics, structural shift

Truthiness: Challenges associated with employing machine learning on neurophysiological sensor data

2016

machine learning, neurophysiological sensor data, challenges, employing, truthiness

Emergence of collaboration networks around large scale data repositories: a study of the genomics community using GenBank

2016

collaboration networks, genomics community, large scale data repositories, emergence, genbank

Measuring situational awareness aptitude using functional near-infrared spectroscopy

2015

neuroscience, cognitive psychology, neuroimaging, situational awareness, medical technology

fNIRS: A new modality for brain activity-based biometric authentication

2015

biometric authentication, brain activity, fnirs, modality, research
