In his role leading DCO’s Data Science Team, Peter Fox (Rensselaer Polytechnic Institute, USA) has been instrumental in helping the DCO deal with its data. As a professor of Earth and environmental science, computer science and cognitive science, chair in the Tetherless World Constellation, and a member of the DCO Executive Committee, he is ideally positioned to cultivate a culture of data sharing. His group has built an infrastructure to manage DCO datasets, collaborated on new research projects that utilize Earth science data for novel discoveries, and constructed plans to keep DCO data accessible to scientists far beyond 2019.
Fox spoke with DCO science writer Patricia Waldron about how data science creates new opportunities in Earth sciences research and the data science tools now available to DCO researchers.
First off, what is data science and how do you see its role evolving with regard to Earth science research?
Simply put, data science is doing science using other people’s data. Over the last 20 years, releasing data collected using public funds has taken off in a lot of areas of science, and the Earth sciences are no exception. Sharing data and using other people’s data have been common in atmospheric science and ocean science for many decades, but not in the solid Earth sciences. When the Deep Carbon Observatory came along, it was conceived as a worldwide community. Now, we’ve got a highly connected research network, and the obvious thing to do is to share and exchange data, build on other people’s data sets, and synthesize data sets to get new results.
When we take data science to a new field, sometimes we just advance capabilities by getting to the results faster. But then in other cases, we make completely new discoveries. Discovery science is the most exciting potential for data science.
How did the DCO data science community come about?
When DCO started, there were major valuable community data resources running on an individual’s desktop computer in the basement office at home. That’s a precarious situation to be in. We had to get a lot more serious and systematic about managing data.
We [the Tetherless World Constellation] were approached a couple of years after the Deep Carbon Observatory began. We started from scratch with this new community. There was data infrastructure work to be done, and there was the materialization of the network itself – how people, projects, field sites, data, and the papers that come out of them are all connected – so that people could actually see DCO as a network.
What kinds of research tools or resources are available to interested DCO scientists?
The most recent one deployed for the DCO is a web-based, electronic notebook environment called Jupyter. These notebooks came out of the open-source language Python, which is gaining tremendous popularity because it’s an easy enough programming language to learn, and there are a lot of packages being built for Python.
Jupyter notebooks are very much like traditional notebooks, except that they’re electronic. A notebook sits inside a web browser and is divided into cells. In each cell users can type in comments or a piece of code, load a statistical analysis package, or load a graphical library. It’s easy because users don’t have to start from scratch. When someone executes a cell, it puts the output right in the notebook. It’s even possible to type different code in different cells, and Jupyter notebooks will execute the code according to the programming language.
The thing that is enormously cool is that users can put their notebooks on the web and share them with collaborators. It’s a collaborative mechanism that really facilitates getting science done without having to sit in the same room at the same table while looking at the same thing. The notebooks also can be archived with all the steps required to re-run an analysis or to do some verification or validation. It’s really changing the way researchers are doing science these days.
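As a rough illustration of the cell workflow described above, the contents of a single notebook cell might look like the following minimal sketch. The measurement values are made up for this example and do not come from any DCO data set; the cell uses only Python's built-in `statistics` module, standing in for the statistical packages mentioned above.

```python
import statistics

# Hypothetical measurements, invented purely for illustration.
measurements = [2.0, 4.0, 4.0, 6.0]

# Summarize the values; in a notebook, the printed line would
# appear directly beneath the cell once it is executed.
mean_value = statistics.mean(measurements)
spread = statistics.pstdev(measurements)
print(f"mean={mean_value:.2f}, stdev={spread:.2f}")
```

Because the code and its output are stored together in the notebook, a collaborator who opens it can re-run the same cell to check the result.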
What other resources are available?
A dedicated computing facility for higher-end computational chemistry, physics, and molecular dynamics is available for community use. This facility is accessible via the web and has Python and other packages installed, so it is ready to go.
Can you give some examples of how DCO researchers are embracing data science?
DCO researchers have established a comprehensive program to monitor the planet’s 20 most active volcanoes and are making these volcano emission data available on the web. Previously, most volcano observation data were not available to anyone.
An activity that really grew out of the DCO and took on a life of its own was related to the evolution of minerals, fossils, and proteins. There is a small group of investigators, especially grad students and postdocs, who have really made a commitment to work from a data science perspective. They looked at a vast variety of data resources and were open to applying new algorithmic techniques. One thing that was quickly introduced was network analysis of minerals, an approach similar to those common in the analysis of social networks, traffic, and telecommunications. Minerals have a social life: they coexist and show remarkable network patterns.
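The idea of a mineral network can be sketched in a few lines of Python. This is only an illustrative example, not the DCO group's actual method or data: the locality names and mineral lists are hypothetical, and the functions `cooccurrence_edges` and `degree` are names invented here. Each mineral is a node, and two minerals are linked when they are reported at the same locality.

```python
from collections import Counter
from itertools import combinations

# Hypothetical localities and the minerals reported at each one
# (illustrative values only, not real occurrence records).
localities = {
    "site_a": {"quartz", "calcite", "pyrite"},
    "site_b": {"quartz", "calcite"},
    "site_c": {"calcite", "dolomite"},
    "site_d": {"quartz", "pyrite"},
}


def cooccurrence_edges(localities):
    """Count how many localities each pair of minerals shares."""
    edges = Counter()
    for minerals in localities.values():
        # Sort so each pair is always stored in the same order.
        for pair in combinations(sorted(minerals), 2):
            edges[pair] += 1
    return edges


def degree(edges):
    """Count the distinct co-occurring partners of each mineral."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    return {mineral: len(partners) for mineral, partners in neighbors.items()}


edges = cooccurrence_edges(localities)
degrees = degree(edges)
```

In this toy data set, calcite co-occurs with three other minerals while dolomite co-occurs with only one, which is the kind of "social" pattern that network measures are designed to surface.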
What kind of responses have you had from the researchers?
Back in March at our Third International Science Meeting, there was an evening workshop on data science tools. We were expecting 15 to 20 people and 95 people showed up at the door. Both early career and senior scientists made up the audience. This type of culture change can take 10 or 15 years, but with DCO, the change has been quick and adoption of data science techniques is well underway.
How will this information live on in the future?
The aim of the next couple of years is to migrate DCO data into community repositories, or into national or international repositories. For those data, there’s already a logical home. RPI (Rensselaer Polytechnic Institute) is going to run the data portal for the foreseeable future, which includes catalogs from other databases.
Another thrust is to help DCO formalize its data legacy. People will identify the important data sets and step up to host them or organize community sessions around those data sets. But we’ve still got a couple of years to sort out how to make those data sets accessible over the long term.
DCO researchers interested in working with the Data Science Team can contact its members, listed here.