Talk Rapid Information Processing Based on Self-Documented Primary Data

Presented by Andreas Liehr in Scientific Applications 2008 on 2008/07/27 from 15:00 to 15:45
Abstract
The bottleneck in communicating scientific primary data is the lack of a standard for simple tabular data sets. While complex binary data sets can be stored comfortably with the Hierarchical Data Format (HDF5) or the Network Common Data Form (netCDF), these formats impose too much overhead for small tabular data sets. As a consequence, most scientists save their data in text files consisting of bare, unannotated columns of numbers. Because these data files are always written in the scientist's personal data format, which is rarely documented, the primary data is very often lost once the project is finished. The data then has to be recreated again and again, which causes unnecessary extra work. In order to overcome this problem, we have invented the Full Metadata Format (FMF), a text-based format that takes into account the most basic needs of the average scientist. The grammar of FMF has been formally specified with ANTLR and has been integrated into the Pyphant data analysis framework. This allows us to demonstrate the increase in research performance that arises from the simple fact that primary data is stored in a standardised way together with its metadata. The examples comprise the automatic visualization of data files as publication-ready labelled diagrams, the analysis of data sets with unit and error propagation, and automated data interpretation, which gives rise to new machine learning paradigms for the natural and engineering sciences.
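
To make the idea of self-documented tabular data concrete, the following is a minimal Python sketch. It does not use the actual FMF grammar or the Pyphant API; the header syntax (`# key: value`, `# column: name [unit]`), the sample values, and the function names `parse_self_documented` and `axis_label` are all hypothetical and chosen only for illustration. It shows how metadata stored in the same text file as the numbers could be used to derive axis labels automatically.

```python
# Hypothetical self-documented table. This is NOT the FMF grammar, only an
# illustration of keeping metadata and columns together in one text file.
SAMPLE = """\
# title: Cooling curve
# creator: A. Scientist
# column: time [s]
# column: temperature [K]
0.0\t350.0
10.0\t341.5
20.0\t334.2
"""


def parse_self_documented(text):
    """Split a text table into metadata, column descriptions, and rows."""
    metadata = {}
    columns = []  # list of (name, unit) tuples
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Header line of the assumed form "# key: value".
            key, _, value = line[1:].partition(":")
            key, value = key.strip(), value.strip()
            if key == "column":
                # Assumed form "name [unit]"; the unit is optional.
                name, _, unit = value.partition("[")
                columns.append((name.strip(), unit.rstrip("]").strip()))
            else:
                metadata[key] = value
        else:
            # Data line: whitespace-separated numbers.
            rows.append([float(field) for field in line.split()])
    return metadata, columns, rows


def axis_label(column):
    """Build a diagram label such as 'time / s' from a (name, unit) pair."""
    name, unit = column
    return f"{name} / {unit}" if unit else name


metadata, columns, rows = parse_self_documented(SAMPLE)
labels = [axis_label(column) for column in columns]
```

Because the column names and units travel with the numbers, a plotting routine could take `labels` and `metadata["title"]` directly from the file instead of relying on the author's memory of what each column meant.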