DCO Webinar Wednesdays Summer Data Science Series

In this four-part series, members of DCO’s Data Science Team will walk through best practices for data acquisition, processing, and analytics in the geosciences, using Jupyter Notebooks to analyze the datasets used by all four of DCO’s science communities (Extreme Physics and Chemistry, Reservoirs and Fluxes, Deep Energy, and Deep Life). Synthesis Group 2019 and the DCO Engagement Team are hosting this series.

You can join the webinars live and follow along on Twitter using the hashtag #DCOWebWed.

All webinars will begin with a 25- to 30-minute presentation, followed by 15 minutes for open discussion and Q&A.

All webinars are archived within 24 hours of the live presentation, so you can catch up at any time.

Contact Katie Pratt (katie_pratt@uri.edu) or Darlene Trew Crist (dtcristdco@gmail.com) with any questions about the webinar series, or if you would like to propose a future series.


9 May 2018, 2PM EDT: Data Science for Geosciences: Data Acquisition – Hao Zhong, Rensselaer Polytechnic Institute, USA

Watch the webinar archive posted above

Over the past century, an enormous amount of data has been produced, archived, and published across the geoscience community. The development of experimental devices, analytical tools, and scientific methods has driven both the accelerating growth in the quantity of geoscience data and the improvement in its quality. In recent decades, this exponential growth has opened new, data-intensive approaches to research questions that could not previously be answered for lack of data, and has even encouraged the pursuit of many new discoveries.

Data science has, therefore, become more instrumental than ever in geoscience research. While numerous high-quality and comprehensive data sources are available to geoscientists, such as EarthChem, Mindat.org, Visualization and Analysis of Microbial Population Structures (VAMPS), and PetDB, many challenges remain in areas such as dark data acquisition, integration, quality management, processing, and analytics. Understanding and adopting suitable data science practices throughout the data life cycle is essential to maximize the utility of existing data and unleash the full potential of data-driven discovery in geoscience.

This webinar will kick off our four-part webinar series on data science in the geosciences, presented by the DCO Data Science Team at the Tetherless World Constellation, Rensselaer Polytechnic Institute. The series will cover the data life cycle in order: data acquisition, data processing, and data analysis. In this episode we will discuss general data acquisition in geoscience, featuring a recent example of legacy data rescue and management by the Data Science Team. Demonstrations will use optical character recognition and spreadsheet-processing software, as well as Jupyter notebooks running R and Python.


13 June 2018, 2PM EDT: Why and How to Cite Data – Mark Parsons, Rensselaer Polytechnic Institute, USA

Watch the webinar archive posted above

Increasingly, data, software, and other research artifacts are being recognized as first-class scientific objects, crucial to supporting the arguments in an article as well as to general transparency and reproducibility. The DCO Data Science Team has long recognized this and has assigned persistent identifiers (PIDs) to publications, people, organizations, instruments, data sets, and sample collections. This enables consistent and persistent reference to these artifacts over time. Indeed, the use of PIDs is becoming a routine part of scholarly communication. The most obvious example of this is the use of Digital Object Identifiers (DOIs) in the citations of published literature, but data and software repositories, funding agencies, and others are increasingly using PIDs to manage information.

The Research Data Alliance is an international organization that aims to remove barriers to data sharing by inviting both data users and providers in any field to solve data sharing issues as a team. Working Groups develop RDA Recommendations and strive to develop viable solutions to well-articulated, specific data-sharing problems. After community vetting and endorsement, the Recommendations are available for adoption by others in similar communities inside or outside of RDA. Many of the Recommendations relate to the formal implementation and application of PIDs and associated information.

The DCO Data Portal has adopted, or is in the process of adopting, several of these RDA Recommendations in order to further increase the visibility, validity, and accessibility of DCO data. In particular, the Data Portal is adopting the Scalable Dynamic Data Citation Methodology (DDC) and Scholix Recommendations. DDC provides a method for persistently referencing a specific subset of dynamically changing data by using PIDs. This allows researchers and other data users to link precisely to the data used in a study or in a particular provenance chain. This precise, immutable reference increases the reproducibility and validity of the resultant work. In a similar vein, Scholix provides a high-level framework for exchanging links and basic metadata between scholarly literature and data. The goal is to enable a better understanding of what data underpin the literature and how. Together, if broadly adopted, the DDC and Scholix Recommendations could significantly improve the traceability of data use and reuse, and advance goals ranging from the reproducibility of research to credit mechanisms for data providers and curators.
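To make the idea of citing a subset of changing data more concrete, here is a rough Python sketch of one way it can work: keep every version of each record with a validity interval, and store the query together with the time it was run under a PID. This is an illustration only, not the DCO Data Portal's implementation or the text of the RDA Recommendation. The table contents, the integer timestamps, the `hypothetical-pid:` prefix, and all function names are assumptions made up for this example.

```python
import hashlib
import json

# Hypothetical versioned table: each row records when it became valid and
# when (if ever) it was superseded. Timestamps are plain integers here.
ROWS = [
    {"id": 1, "mineral": "calcite", "d13C": -2.1, "valid_from": 1, "valid_to": 5},
    {"id": 1, "mineral": "calcite", "d13C": -2.4, "valid_from": 5, "valid_to": None},
    {"id": 2, "mineral": "diamond", "d13C": -5.0, "valid_from": 3, "valid_to": None},
]

QUERY_STORE = {}  # maps a PID to the stored query, its timestamp, and a result hash


def run_query(mineral, as_of):
    """Return rows for `mineral` as they existed at time `as_of`, in a stable order."""
    hits = [
        r for r in ROWS
        if r["mineral"] == mineral
        and r["valid_from"] <= as_of
        and (r["valid_to"] is None or as_of < r["valid_to"])
    ]
    return sorted(hits, key=lambda r: r["id"])


def fingerprint(obj):
    """Hash a JSON-serializable object so a result set can be verified later."""
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()


def cite(mineral, as_of):
    """Store the query and timestamp, and return a (made-up) PID for the subset."""
    result = run_query(mineral, as_of)
    pid = "hypothetical-pid:" + fingerprint([mineral, as_of])[:12]
    QUERY_STORE[pid] = {"mineral": mineral, "as_of": as_of, "hash": fingerprint(result)}
    return pid


def resolve(pid):
    """Re-run the stored query as of its timestamp and verify the result hash."""
    q = QUERY_STORE[pid]
    result = run_query(q["mineral"], q["as_of"])
    if fingerprint(result) != q["hash"]:
        raise ValueError("result set no longer matches the cited subset")
    return result


pid = cite("calcite", as_of=4)
print(resolve(pid))             # the calcite value as it stood at time 4
print(run_query("calcite", 6))  # the current value, after the update at time 5
```

Because the old version of the record is retained rather than overwritten, the PID keeps resolving to the subset that was originally cited, even though a fresh query now returns the updated value.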

This webinar will review these technologies and how they are being implemented in the DCO Data Portal. It should be of interest to repository managers, publishers, and researchers sharing data.



11 July 2018, 2PM EDT: Data Science for Geosciences: Data Processing – Fang Huang, Rensselaer Polytechnic Institute, USA

Watch the webinar archive posted above

Owing to the development of experimental and analytical equipment and methods, large amounts of data are being produced in labs all over the world. In the last few decades, a number of high-quality and comprehensive data resources have become available to geoscientists, including but not limited to EarthChem (geochemistry), mindat.org (mineralogy), Visualization and Analysis of Microbial Population Structures (VAMPS) (geobiology), and PetDB (petrology). The rapidly increasing volume and variety of geoscience-related data give researchers opportunities to answer scientific questions that cannot be answered using traditional methods. That is where big data analytic techniques come into play.

It is often said that 80% of data analysis time is spent cleaning and preparing the data. Moreover, data cleaning is not a one-time job – it is an ever-present need while performing data analysis. 

In this webinar, we will mainly focus on data processing. We will start by introducing rules that define a tidy dataset. Bearing these rules in mind, I will show how to use relatively simple Python code to process and visualize geoscience data. The last part of the webinar will highlight an ongoing project on methane experiments. The webinar should be of interest to any researchers working on data science-related projects.
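As a small illustration of what tidying can look like, the Python sketch below reshapes a "wide" table (one column per measured oxide) into a "long" one, following the commonly cited tidy-data conventions that each variable is a column and each observation is a row. It is not taken from the webinar. The sample names, oxide values, and the `melt` helper are invented for this example, and the webinar's own rules and code may differ.

```python
# Hypothetical wide table: one row per sample, one column per oxide (wt%).
wide_rows = [
    {"sample": "S1", "SiO2": 49.5, "MgO": 7.8},
    {"sample": "S2", "SiO2": 51.2, "MgO": None},  # missing measurement
]


def melt(rows, id_key, var_name="variable", value_name="value"):
    """Reshape wide rows into long rows: one row per (id, variable) observation.

    Missing values (None) are dropped rather than carried along as empty cells.
    """
    tidy = []
    for row in rows:
        for key, val in row.items():
            if key == id_key or val is None:
                continue
            tidy.append({id_key: row[id_key], var_name: key, value_name: val})
    return tidy


tidy_rows = melt(wide_rows, "sample", var_name="oxide", value_name="wt_percent")
for r in tidy_rows:
    print(r)
```

In the long form, filtering, grouping, and plotting by oxide no longer depend on which columns happen to exist, which is much of the appeal of the tidy layout. Libraries such as pandas offer equivalent reshaping functions for larger tables.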



8 August 2018, 2PM EDT: Data Science for Geosciences: Analytics – Anirudh Prabhu, Rensselaer Polytechnic Institute, USA

Watch the webinar archive posted above

The last 30 years have seen a revolution in the development and availability of big data resources across geoscience fields, including EarthChem, NavDat, mindat.org, and the Paleobiology Database. The growing volume, variety, and velocity of data make it important for geoscientists to know the methods and techniques of data management and analytics.

Data and information analytics extends analyses (descriptive and predictive models used to obtain knowledge from data) by using insight to recommend action or to guide and communicate decision-making. Thus, analytics is concerned less with individual analyses or analysis steps than with an entire methodology.

This webinar uses real-world geoscience use cases to introduce participants to methods for recognizing and applying quantitative algorithms and techniques and for interpreting results, helping participants solve scientific problems through data- and model-driven decision-making.
