The world creates 2.5 quintillion bytes of data every day, according to tech giant IBM, with little sign of slowing down. Moreover, unstructured data accounts for nearly 80 percent of all information, meaning the majority of data in our information universe is not easily searchable, nor is its value easily discerned.
There is a pressing need, in both government and industry, to develop sophisticated data management methods to better organize and secure the world’s information. That is why the College of Computing & Informatics’ (CCI) new Metadata Research Center (MRC) is dedicated to advancing the analysis and understanding of metadata (“data about data”), semantics and ontologies in our information sources.
“There is some truth in saying: your data, or information, is only as good as your metadata,” said Center director Jane Greenberg, PhD. “You may have the most interesting, juicy, rich and important information, but if the metadata supporting discovery and access is not good, no one, or no machine, may find it.”
What we can learn from metadata is more than meets the eye. Metadata applies to nearly every discipline that relies on accurate and secure data every day, including healthcare, finance and cybersecurity.
“Metadata supports a range of functions such as discovery, access, use and provenance tracking,” Greenberg said. “It can tell us who used information and when, give us insight into the information’s lifecycle, and help with authenticating information.”
Data curators are essential to generating and maintaining good metadata, as they are skilled in discerning whether data is usable or corrupt, and whether it violates any laws (such as HIPAA privacy rules).
The MRC was established in 2006 at the University of North Carolina at Chapel Hill School of Information and Library Science, and moved to CCI this fall when Greenberg joined Drexel’s faculty as Alice B. Kroeger Professor.
The Center’s staff primarily study the data lifecycle by analyzing metadata capital (the data’s value), digital data (think binary code), data-at-risk (usually from non-digital or near-obsolete information sources), and knowledge organization.
One of the Center’s key research initiatives, Dryad, is an open-source, curated data repository that makes the data underlying scientific publications discoverable, freely reusable and citable. The NSF-funded repository served as a “test-bed” for another MRC project known as Helping Interdisciplinary Vocabulary Engineering (HIVE), which automatically generates metadata for each citation. This function is especially helpful to content creators and information professionals who are tasked with developing or interpreting complex vocabularies in cataloging publications. HIVE is now integrated into a policy-based data management infrastructure known as iRODS (Integrated Rule-Oriented Data System), in partnership with the DataNet Federation Consortium and the Renaissance Computing Institute’s (RENCI) iRODS Consortium, and into NSF’s Long Term Ecological Research Network (LTER), where it is regularly used for indexing research data.
Dryad is also helping scholarly search engines, such as Google Scholar, return more accurate search results. By sharing its data through the OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting), Dryad and other digital projects expose metadata in a standard way, Greenberg explains, so that Google Scholar can then “harvest” (or collect) the metadata and use it to improve its search results.
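For readers curious about what harvesting looks like in practice, the sketch below builds an OAI-PMH `ListRecords` request and extracts Dublin Core fields from a response. It is a minimal illustration, not Dryad’s or Google Scholar’s actual code: the endpoint `repository.example.org`, the sample record and the helper names `build_list_records_url` and `parse_records` are all assumptions made up for this example. Only the `verb`, `metadataPrefix` and `from` parameters and the XML namespaces come from the OAI-PMH and Dublin Core specifications.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical endpoint, for illustration only (not a real repository address).
BASE_URL = "https://repository.example.org/oai"

# Namespaces defined by the OAI-PMH 2.0 and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def build_list_records_url(base_url, metadata_prefix="oai_dc", from_date=None):
    """Build a ListRecords request URL; from_date limits results to newer records."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    return base_url + "?" + urlencode(params)


# Made-up response standing in for what a repository might return.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:repository.example.org:123</identifier>
        <datestamp>2014-10-01</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example dataset title</dc:title>
          <dc:creator>Example, Author</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
""".strip()


def parse_records(xml_text):
    """Return one dict per harvested record with its identifier, date, titles and creators."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": rec.findtext("oai:header/oai:identifier", namespaces=NS),
            "datestamp": rec.findtext("oai:header/oai:datestamp", namespaces=NS),
            "titles": [e.text for e in rec.iterfind(".//dc:title", NS)],
            "creators": [e.text for e in rec.iterfind(".//dc:creator", NS)],
        })
    return records


if __name__ == "__main__":
    print(build_list_records_url(BASE_URL, from_date="2014-10-01"))
    for record in parse_records(SAMPLE_RESPONSE):
        print(record)
```

A real harvester would fetch the URL over HTTP and follow the protocol’s resumption tokens to page through large result sets; both are left out here to keep the sketch self-contained.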
As a non-profit organization, Dryad is moving to a sustainable model in which organizations (universities, libraries, publishers, etc.) may purchase a membership to upload their citation data as part of the knowledge-sharing network. While Dryad has integrated data submissions for a growing list of scholarly journals, submission of data from other publications is also welcome. Researchers use and submit data on Dryad for free, and they benefit when others discover their research or credit them for reusing their data.
Through her position as 2014 Data Fellow at the National Consortium for Data Science (of which Drexel University is a member), Greenberg is leading another MRC project to target the fundamental aspects of metadata capital, or the cost and value of data over time, through the Metadata Capital Initiative (MetaDataCAPT’L).
“We can think about reuse of good quality metadata as building capital,” Greenberg said. “Metadata generation has a cost, and if that metadata is used over and over again, we might be able to say the data’s ROI increases.”
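One simple way to picture the idea is to spread a one-time metadata generation cost over the number of times the metadata is reused. The sketch below is only an illustrative reading of Greenberg’s remark, not the model used by MetaDataCAPT’L; the function name `cost_per_use` and the cost figures are hypothetical.

```python
def cost_per_use(generation_cost, uses):
    """Spread a one-time metadata generation cost evenly across every use."""
    if uses < 1:
        raise ValueError("uses must be at least 1")
    return generation_cost / uses


if __name__ == "__main__":
    # Hypothetical figure: a record that cost 100 units to describe.
    for uses in (1, 4, 20):
        print(uses, cost_per_use(100.0, uses))
```

Under this assumption, each additional reuse lowers the effective cost of every use, which is one sense in which reused metadata accumulates value.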
Other MRC research initiatives include the DCMI Science & Metadata Community, the RDA-Metadata Standards Directory Working Group and the CODATA/Data-at-Risk Working Group.
The Center is supported by a core staff of seven people, including CCI professional staff members Erin Clary (Dryad curator), Adrian Ogletree (research project manager) and Isaac Simmons (research engineer), and CCI students Jasmine Clark (MS in library and information science candidate), Edward Krause (MS in health informatics candidate) and Yue Zhang (doctoral student and research assistant).
CCI faculty affiliated with MRC include: Professor Chaomei Chen, PhD; Professor Xia Lin, PhD; Assistant Professor Julia Stoyanovich, PhD; and Associate Professor Jung-ran Park, PhD. CCI Assistant Professor Lori Richards, PhD, contributed to the Dryad project during the development of its sustainable model.
The Center will be hosting a “Metadata in the Metropolis” event in 2015 (to be announced on CCI’s website) to celebrate metadata research at CCI and in the Greater Philadelphia area.
With the aim of becoming a major hub for metadata research, the Center is interested in pursuing collaboration opportunities with faculty across Drexel, industry and other area institutions. Beyond its innovative research initiatives, MRC will continue to celebrate the often under-appreciated field of metadata research in its new home at Drexel.
“Often, metadata is an afterthought, or [elicits the response] ‘tell me which metadata standard to use,’” Greenberg said. “Everyone knows you need metadata, but no one wants to use anyone else’s metadata, so there is a proliferation of standards. To me, one of the most interesting and vital areas of research is finding the most efficient and effective means of metadata generation integrating automatic and human approaches.”
For more information about MRC and its research, or for information on collaboration opportunities, please visit www.drexel.edu/cci/mrc.