DCO Project Summary

Project Title
DCO-DS Boundary Activity: Thermodynamic Data Legacy Rescue
A large number of legacy datasets are contained in the published literature, and extracting, organizing and reusing them is valuable work. This project focuses on tables in scanned PDF documents, which are common in literature published before the 1990s. To explore methods and techniques for data rescue and management, the DCO-DS team and DCO-EPC member Prof. Mark Ghiorso organized a boundary activity focusing on thermodynamic datasets, more specifically the enthalpy and entropy of chemicals.
Project Updates

Reporting Year 2015

  • [2015-05-20] Thermodynamic Data Legacy Rescue - submitted on May 20, 2015

    Update Details:

    1. Rescued Datasets:
    A number of rescued datasets have been registered on the DCO data portal. They are accessible through the DCO dataset browser under the EPC community: http://deepcarbon.net/dco_datasets

    2. Data Extraction Workflow:
    A data extraction workflow was established using the currently available facilities and was used to rescue the datasets.

    (1) Archive the PDF documents on a web server.
    (2) Log in to the virtual machine remotely with the user name ‘dco_user’. The password can be requested from Patrick (westp@rpi.edu).
    (3) Download the PDF documents from the web server to the virtual machine.
    (4) Launch the PDF2XL tool on the virtual machine. Read the PDF2XL documentation to learn the tool's features.
    (5) Load a PDF document into PDF2XL. Draw a bounding box around each table on a page, and the tool will recognize the structure of the table (i.e. columns and rows), the record in each cell and the column headers. The result is shown in a spreadsheet in the same user interface.
    (6) Check the quality of the result by moving the highlight through the cells of the spreadsheet. Note that the spreadsheet cannot be edited directly in PDF2XL.
    (7) Export the result to an Excel table and then correct the errors identified in Step 6.
    (8) Locate the PDF's record on the publisher's website (in this case study all PDFs are journal papers) and download the citation information and the DOI. Paste the citation information and DOI into the Excel table generated in Step 7.
    (9) Register metadata of the original paper and the rescued dataset on a data portal, such as the DCO data portal.
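
    Steps (7) and (8) above are manual, but part of the error checking could be scripted. The following is a minimal sketch, assuming the Excel table has been saved as CSV; the column names, the list of OCR confusions, the citation and the DOI are all hypothetical placeholders, not values from the rescued datasets.

```python
import csv
import io

# Common OCR confusions in numeric cells (an assumption; adjust per document).
OCR_FIXES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "S": "5"})


def parse_numeric_cell(text):
    """Return a float, or None when the cell needs manual review."""
    cleaned = text.strip().translate(OCR_FIXES).replace(" ", "")
    # Scanned tables often contain a Unicode minus sign instead of a hyphen.
    cleaned = cleaned.replace("\u2212", "-")
    try:
        return float(cleaned)
    except ValueError:
        return None


def flag_suspect_cells(rows, numeric_columns):
    """List (row_index, column, raw_text) for cells that fail to parse."""
    suspects = []
    for i, row in enumerate(rows):
        for col in numeric_columns:
            if parse_numeric_cell(row[col]) is None:
                suspects.append((i, col, row[col]))
    return suspects


def attach_citation(rows, citation, doi):
    """Copy the citation and DOI onto every row so each record is self-describing."""
    return [dict(row, citation=citation, doi=doi) for row in rows]


# Hypothetical exported table with made-up values and typical OCR damage.
exported = "T_K,Cp\n298.15,44.6\n3OO.0,44.8\n35O.O,4?.1\n"
rows = list(csv.DictReader(io.StringIO(exported)))

suspects = flag_suspect_cells(rows, ["T_K", "Cp"])
records = attach_citation(rows, "Hypothetical et al. (1975)", "10.0000/example")
print(suspects)
```

    Cells that can be repaired automatically are parsed, while anything else is only flagged, so the final correction still happens by hand against the original PDF.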

    3. Further Work:
    Dr. Mark Ghiorso suggested a workflow and, in particular, the metadata that a geoscientist would want to capture.
    (1) select a reference
    (2) locate that reference as an electronic document in the university library
    (3) download the reference
    (4) metadata: identify and record the chemical composition of the material being studied in the paper; this could be a mineral name found in the title of the paper or in the text, or it could be a chemical composition recorded in a table; identify the source of this material: is it natural or synthetic?
    (5) metadata: identify the structure of the material: is it a solid, liquid or gas? If the material is a solid, what crystal system or space group characterizes the structure?
    (6) metadata: identify how the experiments were done: (a) high-temperature drop calorimetry, (b) low-temperature adiabatic calorimetry, (c) differential scanning calorimetry, or (d) something else
    (7) metadata: identify where the experiments were done: the name and location of the lab
    (8) metadata: identify the experimental device that was used and who manufactured the device
    (9) metadata: identify the standards that were used, if any, to calibrate the measurement device
    (10) metadata: identify the capsule material, if any, that contained the experimental material; was the capsule sealed or open?
    (11) metadata: identify the temperature-time history of the experiment, if that information is reported
    (12) metadata: identify the measurement units for temperature, and heat content or heat capacity.
    (13) metadata: identify the mass of material used in the experiment
    (14) metadata: identify whether the experiment was performed in air or in contact with some other gas and, if so, which gas
    (15) metadata: identify how the experimental results are presented: in tabular form or as graphs
    (16) data: tabulate experimental results for each data type (as in #6) as a function of measured temperature; include the precision of each measurement, if reported; the precision may be stated in general terms somewhere in the text rather than in the data tables.
    (17) store the original PDF document in the data repository for future reference and clarification of metadata
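
    The metadata items listed above could be collected in a single structured record per paper. The sketch below is one possible layout, not a schema adopted by the project: every field name, the technique codes and the example values are assumptions made for illustration.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional

# Technique codes loosely following item (6); the codes themselves are assumptions.
TECHNIQUES = {"drop", "adiabatic", "dsc", "other"}


@dataclass
class Measurement:
    temperature: float
    value: float                       # heat content or heat capacity
    precision: Optional[float] = None  # None when the paper does not report it


@dataclass
class CalorimetryRecord:
    reference: str
    composition: str
    material_source: Optional[str] = None      # "natural" or "synthetic"
    phase_state: Optional[str] = None          # "solid", "liquid" or "gas"
    crystal_structure: Optional[str] = None    # crystal system or space group
    technique: Optional[str] = None
    lab_name: Optional[str] = None
    lab_location: Optional[str] = None
    device: Optional[str] = None
    device_manufacturer: Optional[str] = None
    calibration_standards: List[str] = field(default_factory=list)
    capsule_material: Optional[str] = None
    capsule_sealed: Optional[bool] = None
    temperature_time_history: Optional[str] = None
    temperature_unit: Optional[str] = None
    quantity_unit: Optional[str] = None
    sample_mass: Optional[str] = None
    atmosphere: Optional[str] = None
    presentation: Optional[str] = None         # "table" or "graph"
    measurements: List[Measurement] = field(default_factory=list)

    def missing_fields(self):
        """Names of metadata items that are still unset, for later review."""
        return [name for name, value in asdict(self).items() if value is None]

    def validate(self):
        """Return a list of human-readable problems; empty means no issue found."""
        problems = []
        if self.technique is not None and self.technique not in TECHNIQUES:
            problems.append(f"unknown technique: {self.technique}")
        if self.material_source not in (None, "natural", "synthetic"):
            problems.append(f"unknown material source: {self.material_source}")
        if self.phase_state == "solid" and self.crystal_structure is None:
            problems.append("solid without crystal system or space group")
        if self.measurements and not (self.temperature_unit and self.quantity_unit):
            problems.append("measurements present but units missing")
        return problems


# Hypothetical record; the reference and the numeric values are placeholders.
record = CalorimetryRecord(
    reference="Hypothetical et al. (1975)",
    composition="Mg2SiO4",
    material_source="synthetic",
    phase_state="solid",
    technique="drop",
    temperature_unit="K",
    quantity_unit="J/mol",
    measurements=[Measurement(temperature=400.0, value=1.0)],
)
print(record.validate())
print(record.missing_fields())
```

    Leaving unreported items as None, rather than guessing, keeps the distinction between "not stated in the paper" and "not yet extracted" visible to whoever reviews the record.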

    For more automated document processing approaches such as NLP, Prof. Ghiorso also offered a few suggestions on questions that geoscientists are interested in:
    (1) How many papers report measurements using the technique of low-temperature adiabatic calorimetry (AC)? How many use high-temperature drop calorimetry (DC)? How many use differential scanning calorimetry (DSC)?
    (2) What substances have been investigated over what temperature ranges? Are there patterns to this distribution?
    (3) For each substance, are there measurements made using AC and DC/DSC? These measurements could be reported in separate papers.
    (4) What laboratories perform the majority of experiments? Make a list of labs in descending order.
    (5) When were the experiments done (i.e. date of publication)? Code these by type of experiment (AC, DC, DSC).
    (6) Where were the experiments done (i.e. location of lab)? Code these by type of experiment (AC, DC, DSC).
    (7) What substances investigated have experimental results that report heat content or heat capacity anomalies associated with: (a) structural phase transitions, (b) magnetic phase transitions, (c) electronic phase transitions, (d) cation-ordering effects in minerals, or (e) change in phase state (i.e. solid to liquid, liquid to gas)?
    (8) Are there substances that have been investigated by more than one laboratory?
    (9) As a function of time (i.e. date of publication), what units are used to report results of measurements?
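
    Some of these questions, such as counting papers per technique or finding substances measured by both AC and DC/DSC, could be approximated with simple keyword matching before any heavier NLP is attempted. The sketch below assumes each paper is already available as plain text with a substance label attached; the patterns, the dictionary keys and the three example papers are hypothetical.

```python
import re
from collections import Counter

# Keyword patterns per technique code; "calorimet" matches both
# "calorimetry" and "calorimeter". The patterns are assumptions.
TECHNIQUE_PATTERNS = {
    "AC": re.compile(r"adiabatic\s+calorimet", re.IGNORECASE),
    "DC": re.compile(r"drop\s+calorimet", re.IGNORECASE),
    "DSC": re.compile(r"differential\s+scanning\s+calorimet", re.IGNORECASE),
}


def techniques_in(text):
    """Return the set of technique codes whose keywords occur in the text."""
    return {code for code, pat in TECHNIQUE_PATTERNS.items() if pat.search(text)}


def count_papers_by_technique(papers):
    """Count how many papers mention each technique at least once."""
    counts = Counter()
    for paper in papers:
        counts.update(techniques_in(paper["text"]))
    return counts


def substances_with_ac_and_high_t(papers):
    """Substances measured by AC and by DC or DSC, possibly in separate papers."""
    seen = {}
    for paper in papers:
        seen.setdefault(paper["substance"], set()).update(techniques_in(paper["text"]))
    return sorted(s for s, t in seen.items() if "AC" in t and t & {"DC", "DSC"})


# Hypothetical corpus; the texts are invented snippets, not real abstracts.
papers = [
    {"substance": "forsterite", "text": "Low-temperature adiabatic calorimetry was used."},
    {"substance": "forsterite", "text": "Heat contents by high-temperature drop calorimetry."},
    {"substance": "quartz", "text": "Measured with a differential scanning calorimeter."},
]

counts = count_papers_by_technique(papers)
both = substances_with_ac_and_high_t(papers)
print(dict(counts), both)
```

    Keyword matching of this kind will miss papers that describe a technique in other words, so its counts are better read as a lower bound than as a final answer.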
Related Projects (Project URI, DCO ID)
http://info.deepcarbon.net/individual/n3704 11121/2553-4209-7782-8281-CC
http://info.deepcarbon.net/individual/n2020 11121/3790-2019-8122-2610-CC