DC 2013 SAM Science and Metadata CAMP 4 DATA AGENDA

Cyber-infrastructure and Metadata Protocols: CAMP-4-Data Workshop (view the Proceedings)

Dublin Core/iPRES 2013, 6 September 2013 @ DC-2013 in Lisbon, Portugal, 9:00-18:00.
SPONSORS/SUPPORT: Dublin Core-Science and Metadata (DC-SAM) Community; the Research Data Alliance (RDA); Data Observation Network for Earth (DataONE); and the SILS-Metadata Research Center.
AGENDA/ABSTRACTS/SLIDES/BIOGRAPHICAL NOTES.

1 (REGISTRATION - 8:00-9:00)
2 INTRODUCTION
3 INFRASTRUCTURE MODELS AND FRAMEWORKS
4 USAGE AND TRACKING
5 (BREAK - 11:00-11:30)
6 PID (PERSISTENT IDENTIFIERS)
7 APPLICATIONS
8 (LUNCH - 13:00-14:30)
9 BREAKOUT GROUPS
10 (BREAK - 16:00-16:30)
11 BREAKOUT GROUPS/WORKSHOP WRAP-UP
12 WORKSHOP ADJOURNED: 18:00
13 PRESENTER BIOGRAPHICAL NOTES

(REGISTRATION - 8:00-9:00)

INTRODUCTION

9:00-9:30 - Welcome, workshop goals, logistics; participant introductions/Jane Greenberg/all (slides)
9:30-9:45 - The Metadata Zoo/Rebecca Koskela (Slides)
9:45-10:00 - DCC Scheme Directory/Alex Ball (slides)

INFRASTRUCTURE MODELS AND FRAMEWORKS

10:00-10:15 - A 3-Layer Model for Metadata/Keith Jeffery, Anne Asserson, Nikos Houssos and Brigitte Joerg (Slides)

Abstract: We present a 3-layer model for metadata of which the key component is CERIF in the middle, contextual, layer. CERIF forms the lowest, most detailed level of metadata information that is common across research objects such as datasets. Its richness of representation makes it a superset of many other metadata formats, allowing their congruent generation from CERIF. CERIF is used in 42 countries and is an EU Recommendation to Member States.

10:15-10:30 - Cross-Domain Metadata Interoperability: Lessons Learnt in INSPIRE / Andrea Perego, Michael Lutz, Max Craglia and Silvia Dalla Costa (slides)

Abstract: Since 2007, EU Member States have been involved in creating an infrastructure for spatial information in Europe (INSPIRE), based on a legal and technical interoperability framework. This paper presents some of the lessons learnt during the implementation of this infrastructure (which started in 2009) and during work on data and service interoperability coordinated with European and international initiatives. We describe a number of critical interoperability issues affecting both scientific and government data and metadata, and propose how these problems could be effectively addressed by a closer collaboration of the government and scientific communities, by taking advantage of their complementary competencies, and by influencing the development and adoption of standards.

USAGE AND TRACKING

10:30-10:45 - Usage data for metadata properties to support open data registries and semantic wikis/Muriel Foulonneau, Sébastien Martin, Jacques Ducloy, Thierry Daunois and Slim Turki

Abstract: Metadata and ontology repositories are critical to ensure the discovery of existing vocabularies and the reuse of vocabularies and/or individual properties. However, these infrastructures should take into consideration the decision-making process and criteria for the selection of a vocabulary or of an individual concept or property. Usage data in particular are important and can provide reassurance that the vocabulary is being maintained by a third party. These data are to some extent available through dedicated tools, such as semantic search engines. We illustrate the need for integrating usage data in the vocabulary infrastructures in order to support the reusability of vocabularies and therefore interoperability and data usability in science.

10:45-11:00 - Provenance Central: More Mileage from Provenance Metadata/Bertram Ludaescher and Paolo Missier

Abstract: We argue that to get the most value out of provenance it is critical to provide provenance integration and analysis capabilities. For the former, we are developing D-PROV, an extension of the W3C standard PROV that enriches the generic PROV model with important observables from scientific workflow systems and other provenance-enabled systems such as R. For the latter, we are developing PBase, a system prototype and associated language technologies to query and analyze provenance. PBase will be part of the DataONE data preservation infrastructure for Earth Science Observation (www.dataone.org). Our envisioned Provenance Central will be able to load and analyze provenance in order to connect data through its provenance with other datasets, workflows, and ontologies, but also with papers, scientific hypotheses, protocols, and users (i.e., authors and scientists). Discovering these connections requires analytical techniques that have not yet been applied to provenance. For example, since such provenance metadata will include, amongst other properties, data attribution information, we propose a novel type of analysis, which involves mining provenance through the entire repository, to elicit implicit social connections amongst the owners of the data. In summary, Provenance Central will be a new way of making data and social connections explicit, thus increasing data (re)usability in unprecedented ways.

(BREAK - 11:00-11:30)

PID (PERSISTENT IDENTIFIERS)

11:30-11:45 - Persistent Identifiers for Terms in a Crowd-Sourced Vocabulary/John Kunze, Greg Janee, Christopher Patton (Slides)

Abstract: Unique, persistent identifiers for vocabulary term concepts are critical for metadata (DC, SKOS, etc.). This comes as no surprise to followers of Linked Data, for whom this first principle of the semantic web is a sine qua non for automatic reasoning with web content. It is even more important to metadata users who need a precise way to reference a particular concept when the term may have more than one definition. Such is the case for the SeaIce Metadictionary, a crowd-sourced online dictionary of metadata terms in which multiple competing definitions are expected to be common and to co-exist indefinitely. Anyone can register and log in to create new terms, edit their own terms, and comment and vote on others’ terms. Typical use will be that someone, without logging in, searches for and inserts terms they find into metadata that they’re creating to describe their own research. If unsatisfied with the terms that they found – or didn’t find – they can log in and take action, which means anything from up- and down-voting terms, commenting on others’ terms, or adding and editing their own terms. Typical users will be research scientists trying to describe their datasets.

11:45-12:00 - Separation of Concerns: PID Information Types and Domain Metadata/Tobias Weigel and Timothy Dilauro (slides)

Abstract: We must define a pragmatic separation of concerns between metadata activities and the typed information associated with Persistent Identifiers. This distinction is important for ongoing debates within respective communities as well as in the RDA working groups. From a data archive’s viewpoint, a useful metaphor is that of the “black box” or “envelope”: Data management is increasingly done by machinery rather than human users. So the machinery must know what to do with the boxes that come in through various channels, but it cannot open them for various reasons. We propose that metadata is a concern that is – from this particular view of automated data management – located inside the black box. A metadata description may actually be a black box object that must be managed just like all the others. Still, some information must be written on the outside of the box to be interpreted by the machinery. This information may be a subset of metadata, but it may also contain additional information not interesting as domain metadata.

APPLICATIONS

12:00-12:15 - Ontology-Enabled Metadata Schema Generator: The Design Approach/Jian Qin, Xiaozhong Liu and Miao Chen (slides)

Abstract: Metadata standards are important for normalizing descriptions of publications and research data and for information discovery and use. Large, complex metadata standards, however, can complicate the creation, sharing, and maintenance of metadata and incur high costs for metadata operations, especially in the domain of scientific data. One strategy to solve the problems of large, complex metadata standards is to break them into independent modules to allow for reuse of elements and maximal possibility of automation. To implement this strategy, we need a metadata infrastructure that contains elements, vocabularies, and other metadata artifacts and that is easy to use. This short paper describes the design approach to an ontology-enabled metadata schema generator as part of the metadata infrastructure.

12:15-12:30 - Metadictionary: Advocating for a Community-driven Metadata Vocabulary Application/Jane Greenberg, Angela Murillo, John Kunze, Sarah Callaghan, Robert Guralnick, Greg Janee, Nassib Nassar, Christopher Patton, and Karthik Ram (slides)

Abstract: Metadata disorder and unnecessary costs are increasing due to the expanding population of scientific data schemes and standards. Metadata challenges are reviewed, and SeaIce, a community-driven metadata vocabulary application, is introduced as a potential solution. SeaIce functions and development challenges are presented. CAMP-4-DATA participants are called upon to experiment with the SeaIce application and actively participate in a discussion targeting noted metadata challenges.

12:30-12:40 - CLEPSYDRA Data Aggregation and Enrichment Framework/Cezary Mazurek, Marcin Mielnicki, Aleksandra Nowak, Krzysztof Sielski, Maciej Stroinski, Marcin Werla and Jan Węglarz
12:40-12:50 - RUresearch - Open Source Metadata Application Profile and Research Object Handling for Research Data/Grace Agnew and Mary Beth Weber

Abstract: The Rutgers University Libraries have developed an open source workflow management system that includes a cataloging utility and a compound object handling system that enables the creation of metadata and intelligent object handling to fully support documenting and sharing research data. The cataloging system, which can be used independently and can work with any repository architecture, supports both MODS and Dublin Core metadata schemas. The MODS application profile includes an event-based subschema, as a MODS extension schema, that can capture any useful event in the lifecycle of the data, from data capture, to data analysis, to data editing, to data reuse. The application profile also includes elements for type of research, research methodology, type of data and type of subject, mapped to MODS and Dublin Core genre and subject elements. The data compound object supports documentation (lab notebooks, images, etc.) and instrumentation (data capture, data analysis, etc.). In addition to relating resources to each other using RDF, the resource handling also includes support for hierarchical file uploads, exactly as they are stored on the researcher’s computer or server. The metadata and object handling will be presented through examples from the RUcore (Rutgers Community Repository) research data portal, RUresearch, http://rucore.libraries.rutgers.edu/research/

12:50-13:00 - Open discussion, setting the afternoon agenda; Brief remarks about RDA-3rd Plenary/Sandra Collins

(LUNCH - 13:00-14:30)

BREAKOUT GROUPS

14:30-14:40 - Overview of discussion topics (DRAFT slides)
14:40-15:10 - Breakout groups, Session 1: Infrastructure and design, policy, human and social aspects.
15:10-15:40 - Breakout groups, Session 2 (topic rotation from session 1).
15:40-16:00 - Report back from breakout groups.

(BREAK - 16:00-16:30)

BREAKOUT GROUPS/WORKSHOP WRAP-UP

16:30-16:45 - Delegates propose/vote on 'special' topics.
16:45-17:15 - Self-selected groups discuss a topic each.
17:15-17:50 - Report back from each group; discussion of possible action points.
17:50-18:00 - Closing remarks.

WORKSHOP ADJOURNED: 18:00

PRESENTER BIOGRAPHICAL NOTES

Grace Agnew is Associate University Librarian for Digital Library Systems at Rutgers, the State University of New Jersey. She has written books, articles and presentations on metadata and is one of the designers of the RUcore MODS-based metadata implementation. She consults on metadata design and teaches a course on metadata design, Mechanics of Metadata, for Library Juice Academy.

Alex Ball works for UKOLN Informatics at the University of Bath as an Institutional Support Officer of the Digital Curation Centre. He is co-moderator of the Dublin Core Science and Metadata Community, and was involved in the creation of the DataCite Dublin Core Application Profile. His interests include engineering data, Web archiving and the place of data within scholarly communications.

Dr. Sarah Callaghan is a senior scientific researcher and project manager for the British Atmospheric Data Centre, at STFC Rutherford Appleton, UK. She currently manages the NERC Data Citation and Publication project, a cross-NERC Data Centre project that aims to develop the mechanisms for the citation and publication of datasets. She is also a co-chair of the CODATA-ICSTI Task Group on Data Citations, a member of the DataCite Working Group on Criteria for Datacentres, and an associate editor for the scientific journal Atmospheric Science Letters, with a particular aim of developing the processes for data publication. She has experience of both creating and managing large datasets, and so understands well the frustrations that scientists can experience as a result of dealing with data!

Dr. Max Craglia has worked at the Digital Earth and Reference Data Unit, European Commission-Joint Research Centre since 2005. The Unit is responsible for the technical coordination of the INSPIRE Directive, aimed at creating an infrastructure for Spatial Information in Europe. Within the Unit Max has been responsible for the development of the INSPIRE Implementing Rules for Metadata, and for research on the impact assessment of SDIs and INSPIRE. Max was the technical coordinator of the EuroGEOSS project (www.eurogeoss.eu), an Integrated Project developing INSPIRE-compliant GEOSS Operating Capacity in three thematic areas: Drought, Biodiversity/Protected Areas, and Forestry. He is currently the scientific coordinator of the GEO Weather, Ocean, Water project (GEOWOW), which extends the approaches developed in EuroGEOSS to these other 3 thematic areas. During the last 3 years, he has led projects on citizen science in the area of forest fires, and the use of civilian drones to collect environmental information. He has been a member of the GEOSS Science & Technology Committee and the GEOSS Data Sharing Task Force. Max was one of the founders and serves as chief editor of the International Journal of Spatial Data Infrastructures Research (http://ijsdir.jrc.ec.europa.eu) and was also one of the founders of the Vespucci Initiative for the Advancement of Geographic Information Science (www.vespucci.org).

Muriel Foulonneau is a specialist in metadata, semantic interoperability and the quality and usability of data in distributed systems. She has worked for CNRS in France on open access to scientific data and the University of Illinois on large scale data aggregations. Member of the Advisory Board of the Dublin Core Metadata Initiative, she works on data management systems, recommenders and eLearning applications at the Public Research Centre Henri Tudor in Luxembourg.

Greg Janée is a software developer for the California Digital Library and a researcher in digital libraries and digital curation for the University of California at Santa Barbara. He previously served as technical leader of the National Geospatial Digital Archive (NGDA), Alexandria Digital Library (ADL), Alexandria Digital Earth Prototype (ADEPT), and ADL Gazetteer projects. Greg has an M.S. in computer science and a B.S. in mathematics, both from the University of California at Santa Barbara.

Keith Jeffery is now an independent consultant but was Director IT at STFC Rutherford Appleton Laboratory with 360,000 users, 1100 servers and 140 staff. Keith holds 3 honorary visiting professorships, is a Fellow of the Geological Society of London and the British Computer Society, is a Chartered Engineer and Chartered IT Professional and an Honorary Fellow of the Irish Computer Society. Keith is President of ERCIM and past President of euroCRIS, and serves on international expert groups, conference boards and assessment panels. He has advised government on security and green computing. He chaired the EC Expert Groups on GRIDs and on CLOUD Computing. His research passion (since the 1960s) is metadata.

Dr. Brigitte Jörg has been involved in metadata-related activities and work for many years. She is the founder and director of JeiBee Ltd., a UK-registered company providing consultancy in this area. In this capacity she acts as a Coordinator with CASRAI and other initiatives. Prior to the setup of JeiBee Ltd., Brigitte joined the Jisc Innovation Support Center at UKOLN, UK in the role of the National Coordinator with the CERIF Support Project. Before joining UKOLN, she worked with DFKI - the German Research Center for Artificial Intelligence (2001-2012), in Saarbrücken and Berlin, Germany, where a main responsibility was the management of the so-called Virtual Information Center in the field of Language Technology. During that time she was also involved in several EU and National projects related to Research Infrastructures. Since 2005, Brigitte has been actively involved in euroCRIS (www.euroCRIS.org), being a member of the Board and steering the CERIF task group from 2004 through 2012.

John Kunze is an Associate Director for the UC Curation Center in the California Digital Library. With a background in computer science and mathematics, he wrote BSD Unix software that comes pre-installed with Linux and Apple operating systems, and has contributed heavily to the standardization of URLs, Dublin Core metadata, and web archiving. His current work focuses on dataset publication, citation, and preservation.

Bertram Ludäscher is a professor at the Department of Computer Science and the Genome Center, UC Davis. His research focus includes modeling, design, and optimization of scientific workflows and databases; provenance; data integration; and knowledge representation and reasoning. He is one of the founders of the open source Kepler scientific workflow system project, and a co-lead of the DataONE Working Group on Provenance in Scientific Workflows. With members of his team at UC Davis he is currently developing automated and semi-automated curation methods for quality control in data processing pipelines. Prof. Ludäscher received his M.S. (Dipl.-Inform.) in Computer Science from the University of Karlsruhe in 1992 and his PhD (Dr.rer.nat.) from the University of Freiburg, Germany in 1998. Until 2004 he was a research scientist at the San Diego Supercomputer Center and an adjunct faculty member at the CSE Department at UC San Diego.

Dr. Michael Lutz holds an MSc degree in landscape ecology and a PhD in geoinformatics from the University of Münster, Germany. Between 2002 and 2008, he worked as a doctoral researcher in the Münster Semantic Interoperability Lab (MUSIL) and as a post-doctoral researcher at JRC on the semantic modelling of geospatial data and processes to support the discovery, composition and access of geographic information and geoprocessing services in SDIs. Since 2008, Michael has been supporting the INSPIRE data specification activity at the Digital Earth and Reference Data Unit at the European Commission Joint Research Centre.

Marcin Mielnicki has been a software engineer at Poznan Supercomputing and Networking Center since 2007. He received his M.Sc. in Computing Science from Poznan University of Technology in 2008. Currently he is the lead person responsible for the development and maintenance of the PIONIER Network Digital Libraries Federation and the CLEPSYDRA framework. Marcin's professional interests include agile software development and software testing.

Dr. Paolo Missier is a Lecturer in Information and Knowledge Management with the School of Computing Science, Newcastle University, UK. His current research interests include models and architectures for the management and exploitation of data provenance, specifically extensions of e-infrastructures for scientific provenance. Since 2010 he has been co-leading the Provenance Working Group of the NSF-funded DataONE project. Between 2011 and 2013 he was an active member of the W3C Working Group on Provenance on the Web, where he co-edited a number of the resulting recommendation documents.

Angela Murillo is a fourth-year doctoral student at the School of Information and Library Science at the University of North Carolina at Chapel Hill. She received her bachelor’s degrees in Geosciences, English, and Spanish and her MLIS from the University of Iowa. Angela was the Project Manager for the DigCCurr (http://ils.unc.edu/digccurr/index.html) Project, a Student Fellow for Earth Science Information Partners (http://www.esipfed.org/), and a Summer Intern for DataONE (http://www.dataone.org/). Her research interests include scientific data management, scientific data sharing and reuse, metadata and interoperability.

Christopher Patton is a new master's student in computer science at the University of California, Davis. His academic interests and job experience have been broad to date, ranging from computer vision to networks. In graduate school, he'll focus on the theoretical foundations of computer networks. His current work for DataONE involves designing and implementing a crowd-sourced, online dictionary for metadata terms.

Andrea Perego is a researcher at the European Commission's Joint Research Centre (JRC). His research interests include semantic interoperability, multilingual thesauri, ontology design and Semantic Web technologies for government and research data. He is a member of the JRC team in charge of the technical coordination of the INSPIRE Directive of the EU, aiming to establish a harmonised data infrastructure at the European level, to give cross-border access to information that can be used to support EU environmental policies.

Mary Beth Weber is Head of Central Technical Services at Rutgers University Libraries. She is the author of the book Describing Electronic, Digital, and Other Media Using AACR2 and RDA, and the editor of Library Resources and Technical Services, the official journal of the Association for Library Collections and Technical Services (ALCTS), a division of the American Library Association. She is a co-developer of the MODS metadata implementation for RUcore, Rutgers University's institutional repository, and leads the team tasked with further developing and maintaining RUcore metadata.

Tobias Weigel has been working at the German Climate Computing Center (DKRZ) since 2010 as a software engineer and architect for various e-science infrastructure and project activities, including IS-ENES and EUDAT, with emphasis on lightweight web services and re-usable components. Current activities concern usage scenarios and tool support for PIDs across these areas, which also relates to the work done in RDA, where he is co-leading the PID Information Types Working Group. He is also a PhD student at the University of Hamburg, working on a PID-centric topic with particular focus on massive PID usage at data centers and infrastructures.

(Back to DC-SAM page: http://wiki.dublincore.org/index.php/DCMI_Science_And_Metadata)
