Assistant Professor, School of Information | Member of the Graduate Faculty
Using Machine Learning to Disambiguate Software Entity Mentions in PMC
We have a disambiguation problem for software mentions in the scientific literature. There is no "gold standard" dataset, so we have to manually annotate a dataset to train and evaluate our model.
Data: We are using the Chan Zuckerberg software citation mentions.
Students' roles: We need students to create a corpus of annotated software mentions to assist with disambiguating the software entities. We will then work with a CS professor to train a disambiguation model for the software entities.
Learning outcomes: data science applications, research methods, time management
Data Analysis: Once disambiguated, the software entities can be analyzed. We can ask: Where in a paper does software get cited in the scientific literature?
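The question of where in a paper software gets cited could be approached, once mentions are disambiguated, by tallying mentions per paper section. The following is a minimal sketch only: the record layout (paper ID, section label, software name) and the sample values are hypothetical and are not taken from the actual dataset.

```python
from collections import Counter

# Hypothetical records: (paper_id, section, disambiguated_software_name).
# These values are illustrative placeholders, not real annotations.
mentions = [
    ("PMC1", "Methods", "ImageJ"),
    ("PMC1", "Methods", "R"),
    ("PMC2", "Results", "ImageJ"),
    ("PMC2", "Methods", "ImageJ"),
    ("PMC3", "Introduction", "R"),
]


def section_counts(records):
    """Count software mentions per paper section across all papers."""
    return Counter(section for _, section, _ in records)


def section_shares(records):
    """Return each section's share of all software mentions."""
    counts = section_counts(records)
    total = sum(counts.values())
    return {section: n / total for section, n in counts.items()}


if __name__ == "__main__":
    for section, share in sorted(section_shares(mentions).items()):
        print(f"{section}: {share:.0%}")
```

The same tally could be broken down per software entity, which is only meaningful after disambiguation has merged variant names for the same tool.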
Project Title: Mapping the Global "Supply Chain" of Genomics Research Datasets
Datasets are produced by scientists all over the world, who deposit data in international research data repositories. The datasets are then (re)used to pursue scientific knowledge, develop scientific innovations, and shape the allocation of resources to support further discovery. However, the infrastructures for producing and sharing data are skewed toward Western nations, which possess the resources and historical cyberinfrastructure (e.g., computing systems, data repositories) to produce and share research data. Drawing from a novel source of data, the "big metadata" from GenBank, this project describes the volume of data produced by geographic region and analyzes the international distribution of labor on datasets in biomedical research and genomics. We also take a critical perspective in this project, focusing on the power and influence of gatekeepers and industry actors in shaping the production and diffusion of knowledge in data-intensive science.
Data Collection: Scrape data from BioSamples. Find out whether papers on BioSample and BioProject are already linked. If not, scrape BioSamples and BioProject data (using Selenium?) for as many GenBank records as possible. Store the BioSamples and BioProject data on the server; link to GenBank records if possible.
Data Analysis: Calculate the extent to which the genetic sample's country of origin aligns with the authors of the associated publication. Analyze the locations of the samples from 1992 to 2021: have the locations changed?
RQs (based on student interest):
1. Who are the industry, academic, and other collaborators on these datasets?
2. Genomics has become a commercialized affair. The supply chain includes industry as well as academics. Who funds the projects with high misalignment? Use NIH ExPORTER data. Who is the funder of the biosamples?
3. Who is the PI on the funding grant?
(See NIH ExPORTER data.)
Data: GenBank, NCBI Taxonomy, S&T capacity index
Skills/tools preferred: Python, R, Linux, statistics, data visualization
Skills you will learn: social network analysis, visualization in R
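The alignment measure named in the Data Analysis step could be operationalized in several ways. One possible sketch, under stated assumptions, treats a record as aligned when the sample's country of origin appears among the countries of the publication's authors, and reports the aligned share per year. The field names (`year`, `sample_country`, `author_countries`) and the sample records are hypothetical; deriving author countries from affiliations is an assumption and not something the project description specifies.

```python
from collections import defaultdict

# Hypothetical records; field names and values are illustrative only
# and do not come from real GenBank, BioSample, or BioProject data.
records = [
    {"year": 1995, "sample_country": "Kenya", "author_countries": {"USA", "UK"}},
    {"year": 1995, "sample_country": "USA", "author_countries": {"USA"}},
    {"year": 2020, "sample_country": "Kenya", "author_countries": {"Kenya", "USA"}},
    {"year": 2020, "sample_country": "Brazil", "author_countries": {"Brazil"}},
]


def is_aligned(record):
    """True when the sample's country is among the authors' countries."""
    return record["sample_country"] in record["author_countries"]


def alignment_by_year(rows):
    """Return the share of aligned records for each year, sorted by year."""
    totals = defaultdict(int)
    aligned = defaultdict(int)
    for row in rows:
        totals[row["year"]] += 1
        aligned[row["year"]] += int(is_aligned(row))
    return {year: aligned[year] / totals[year] for year in sorted(totals)}


if __name__ == "__main__":
    for year, share in alignment_by_year(records).items():
        print(f"{year}: {share:.0%} of records aligned")
```

Records with a low aligned share would be the "high misalignment" cases that the funding questions refer to. A stricter definition, such as requiring the first or last author to be based in the sample's country, would change only `is_aligned`.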