A paper co-authored by a student and two faculty members won the ‘Best Student Paper’ award at the 2021 ACM/IEEE Joint Conference on Digital Libraries (JCDL). Computer Science PhD student Joel Pepper wrote the paper with Alice B. Kroeger Professor Jane Greenberg, PhD, and Computer Science Professor David E. Breen, PhD.
Titled “Automatic Metadata Generation for Fish Specimen Image Collections,” the paper focuses on using machine learning (ML) to solve challenges in assessing publicly available fish images (full abstract listed below).
“This work is part of a project sponsored by the National Science Foundation (NSF) titled ‘Biology Guided Neural Networks’ (BGNN) that seeks to use ML to gain new scientific insights from large collections of digitized biological specimens available online,” Pepper explains. “We at Drexel developed a process to automatically generate important metadata properties for these specimen images. These properties include where the fish are located, which pixels comprise the fish, how many fish are in an image, how fish are oriented, quality of the image contrast, and specimen length in centimeters.”
Additional co-authors include Yasin Bakiş, PhD, and Henry Bart, PhD, from Tulane University, and Xiaojun Wang, PhD, from Western Michigan University.
JCDL is an international forum where global participants from a full range of disciplines and professions focus on digital libraries and associated technical, practical and social issues. CCI Assistant Professor of Information Science Mat Kelly, PhD, served as virtual meeting support committee chair at this conference.
Pepper is also a recipient of a 2021 National Science Foundation Graduate Research Fellowship (GRFP), which recognizes and supports outstanding graduate students in NSF-supported science, technology, engineering, and mathematics disciplines who are pursuing research-based master’s and doctoral degrees at accredited United States institutions. GRFP is the oldest graduate fellowship of its kind and has a long history of selecting recipients who achieve high levels of success in their future academic and professional careers. According to NSF's website, the reputation of the GRFP follows recipients and often helps them become life-long leaders who contribute significantly to both scientific innovation and teaching. Past fellows include numerous Nobel Prize winners, former U.S. Secretary of Energy Steven Chu, Google founder Sergey Brin, and Freakonomics co-author Steven Levitt.
Full Abstract:
Metadata are key descriptors of research data, particularly for researchers seeking to apply machine learning (ML) to the vast collections of digitized specimens. Unfortunately, the available metadata is often sparse and, at times, erroneous. Additionally, it is prohibitively expensive to address these limitations through traditional, manual means. This paper reports on research that applies machine-driven approaches to analyzing digitized fish images and extracting various important features from them. The digitized fish specimens are being analyzed as part of the Biology Guided Neural Networks (BGNN) initiative, which is developing a novel class of artificial neural networks using phylogenies and anatomy ontologies. Automatically generated metadata is crucial for identifying the high-quality images needed for the neural network’s predictive analytics. Methods that combine ML and image informatics techniques allow us to rapidly enrich the existing metadata associated with the 7,244 images from the Illinois Natural History Survey (INHS) used in our study. Results show we can accurately generate many key metadata properties relevant to the BGNN project, as well as general image quality metrics (e.g. brightness and contrast). Results also show that we can accurately generate bounding boxes and segmentation masks for fish, which are needed for subsequent machine learning analyses. The automatic process outperforms humans in terms of time and accuracy, and provides a novel solution for leveraging digitized specimens in ML. This research demonstrates the ability of computational methods to enhance the digital library services associated with the tens of thousands of digitized specimens stored in open-access repositories world-wide.