Data come in all shapes and forms. Tables, for example, contain data but often lack context, or mix data with metadata. Such context might include the meanings of quantity names and units, acronyms or community jargon, or the inter-relatedness of data columns or rows. The DCO Data Science Team works on ways to make disparate datasets accessible to scientists in diverse scientific disciplines by focusing on scientific context.
Traditional data types are typically limited to the numerical type of the datum, such as integer, float, array, char (character), or string. For example, a researcher might receive a table of numbers from a colleague, the title of which includes the word "Thermodynamics", suggesting that the data are relevant to their research. Beyond this, the table's data may be described only by column headers, typically acronyms that one hopes are well known in the particular scientific domain. Moreover, the relationships between the table's columns are not obvious, let alone explicit. However, many researchers think in terms of categories of data, i.e. higher-level ways of describing the data they generate, for example Volcanic Gas Composition. To enable such science context in computer-enabled data environments, the notion of "data type" must be extended so that a given data type represents a scientifically useful description of what the datasets associated with the category name actually represent. The ability to specify data types in this fashion enables researchers to better understand the meaning, and ultimately the usefulness or relevance, of datasets in a given scientific context.
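The contrast between a primitive type and a science-context data type can be made concrete with a small sketch. The class names `Parameter` and `ScienceDataType` are hypothetical and are not taken from the DCO Ontology or the RDA Data Type Registry. The example reuses the data type Thermodynamics of Chemicals and Minerals and the parameters Mineral Name and Molecular Weight mentioned elsewhere in this article; the unit g/mol is an assumption added for illustration.

```python
from dataclasses import dataclass
from typing import Set, Tuple


@dataclass(frozen=True)
class Parameter:
    """One table column, described by its scientific meaning (hypothetical model)."""
    name: str       # spelled-out quantity name rather than an acronym
    primitive: str  # the traditional type of each datum, e.g. "float"
    unit: str = ""  # left empty when the quantity has no unit


@dataclass(frozen=True)
class ScienceDataType:
    """A category of data, described by the parameters its datasets contain."""
    name: str
    parameters: Tuple[Parameter, ...] = ()

    def parameter_names(self) -> Set[str]:
        return {p.name for p in self.parameters}


# Illustrative instance; the unit "g/mol" is an assumption, not from the article.
thermo = ScienceDataType(
    name="Thermodynamics of Chemicals and Minerals",
    parameters=(
        Parameter("Mineral Name", "string"),
        Parameter("Molecular Weight", "float", "g/mol"),
    ),
)
```

In this sketch the primitive type ("float") is kept, but it is only one attribute of a richer description that also carries the quantity name, its unit, and the category the table belongs to.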
The many DCO datasets currently registered by scientists in DCO’s four Communities (Deep Life, Reservoirs and Fluxes, Deep Energy, and Extreme Physics and Chemistry) include a wide variety of formats and quantities, with associated metadata about basic data types spanning Earth sciences, biological sciences, and beyond. The metadata (collected when datasets are registered with DCO) enable researchers to find and access DCO datasets via the Dataset Browser. A scientist might have a very specific request in mind, though, such as “I need thermodynamic data from the DCO Extreme Physics and Chemistry Community that includes Mineral Name and Molecular Weight.” Without proper metadata annotations, addressing this specific, but likely common, request is difficult, not only among DCO scientists but across scientific domains. Until recently, no widely agreed-upon approach had been taken to address requests with scientific context, and researchers had to resort to other means of finding and assessing datasets.
Such broad issues in data management are the focus of initiatives such as the Research Data Alliance (RDA). RDA is an international effort whose mission is to “... build the social and technical bridges that enable open sharing of data across technologies, disciplines, and countries”. One of the bridge-building efforts to improve data sharing and data use in science communities involves developing “scientific data types”. In 2015, the DCO Data Science Team from the Tetherless World Constellation at Rensselaer Polytechnic Institute leveraged funding from the National Science Foundation, via the RDA, to make the DCO Data Portal one of the first platforms to adopt two key RDA recommendations that greatly improved the modeling of scientific data types.
The RDA deliverables adopted by DCO are the Data Type Registry (DTR) and Persistent Identifier Information Types (PIT). The first addresses a core interoperability problem among data management systems: the ability to parse, understand, and potentially reuse data retrieved from others. The second addresses the essential types of information associated with persistent identifiers. The curation and reuse of registered datasets within the DCO Data Portal were well suited to testing deployments of RDA DTR and PIT: the work helps address the challenges described in the above example of searching for thermodynamic datasets of interest, and will provide valuable experience for other science communities that face the same issue.
In its implementation of RDA DTR and PIT, the DCO Data Science Team first made updates to the DCO Ontology, the backbone of the data portal, to incorporate concepts of data type and associated attributes. They collected data type instances from the DCO community and used them to annotate some initial datasets currently registered with the DCO Data Portal. Results from this work are evident in the faceted DCO dataset browser and data type browser (screen snapshots of a typical result).
In the above example, a researcher looking for thermodynamic data that includes Mineral Name and Molecular Weight can first look for a corresponding data type using the data type browser. In that browser, he or she can search for Mineral Name and Molecular Weight in the Parameters facet window and thereby locate a data type, such as Thermodynamics of Chemicals and Minerals. Once the researcher finds that data type, he or she can go to the dataset browser and use it to retrieve all relevant datasets. The researcher can also use the DCO Communities facet to restrict results to a subset, i.e. those generated by the DCO Extreme Physics and Chemistry Community.
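The two-step lookup described above (parameters to data type, then data type and community to datasets) can be sketched as a pair of simple filters. This is only an in-memory illustration and does not reflect how the DCO Data Portal is implemented. The function names, the record layout, the dataset identifiers, and the parameter listed for Volcanic Gas Composition are all invented placeholders.

```python
# Hypothetical registry: data type name -> set of parameter names.
DATA_TYPES = {
    "Thermodynamics of Chemicals and Minerals": {"Mineral Name", "Molecular Weight"},
    "Volcanic Gas Composition": {"Gas Species"},  # placeholder parameter
}

# Hypothetical dataset records; the ids are placeholders, not real DCO-IDs.
DATASETS = [
    {"id": "example-1",
     "data_type": "Thermodynamics of Chemicals and Minerals",
     "community": "Extreme Physics and Chemistry"},
    {"id": "example-2",
     "data_type": "Thermodynamics of Chemicals and Minerals",
     "community": "Deep Energy"},
    {"id": "example-3",
     "data_type": "Volcanic Gas Composition",
     "community": "Reservoirs and Fluxes"},
]


def find_data_types(required_parameters):
    """Step 1: return data types whose parameters include all required ones."""
    required = set(required_parameters)
    return sorted(name for name, params in DATA_TYPES.items()
                  if required <= params)


def find_datasets(data_type, community=None):
    """Step 2: return datasets of a data type, optionally limited to a community."""
    return [d for d in DATASETS
            if d["data_type"] == data_type
            and (community is None or d["community"] == community)]


matching_types = find_data_types(["Mineral Name", "Molecular Weight"])
results = find_datasets(matching_types[0],
                        community="Extreme Physics and Chemistry")
```

Each facet acts as one more filter condition, which is why the community restriction is modeled here as an optional argument rather than a separate step.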
By using and expanding the registered (science context) data types, the DCO Data Science Team foresees future innovations, such as recommending datasets to users based on their research interests and recommending data analysis tools for specific data types. Such efforts will significantly facilitate work on data curation and promote the sharing and usability of deposited data. We invite the DCO science communities to add or suggest new data types for their datasets, and to let the Data Science Team know of any new functionality in which they may be interested.
Register your data with DCO! Within the DCO Data Portal, the DCO Data Science Team manages a digital object identification, registration, and catalog service enabling anyone to explore registered data. The Data Portal data registration process (login required) is composed of two key elements: 1) assignment of an identifier for the dataset being registered, known as a DCO-ID and based on the Global Handle System, and 2) metadata collection for each registered DCO object, including its science context “data type.”
Please contact Patrick West to obtain more information or get involved in this activity.