E-Locus - Institutional Repository of the University of Crete - Trial & error : data transformation and provenance techniques in scientific workflows

Identifier

000334421

Title

Trial & error : data transformation and provenance techniques in scientific workflows

Alternative Title

Trial & Error: Data Transformation and Provenance Techniques in Scientific Workflow Systems

Author

Τσισπαράς, Βασίλης Κ

Thesis advisor

Χριστοφίδης, Βασίλης

Abstract

Information integration of heterogeneous, complementary data sources under a global Schema is a scientific area with many perspectives and many open issues. Integrating such complementary information is innovative and of immense potential value for the cultural heritage domain, e-science and other applications. The rate at which museums, libraries and archives (MLA) and Digital Libraries create Databases has increased in recent years. Much information exists only as simple semi-structured text, and manual Data Transformation from analog to digital form, together with Data Cleaning, is laborious and time-consuming.
In many cases the data sources to be integrated are manually formatted text files such as dictionaries, so-called "corpora and encyclopedic material", with very complex encoding rules and many exceptions. These files need to be transformed into a Database-compatible format conforming to a global Schema, so that all information can be accessed and queried in a uniform way. Most past systems and many recent ones employ single-step data transformation procedures. Single-step transformations are mainly idiosyncratic, i.e. different for each source, and require the implementation of very specific tools. Creating such transformation software requires repeated testing, yet the resulting software may be used only once.
In this Thesis we created an application called Trial & Error, which supports a multiple-step data transformation technique and thereby enables wider use of generic components. We found empirically, from a set of examples, that the data transformation process can be broken down into many small steps of a more generic nature. The tools used in these steps were designed to be as elementary as possible to increase the chance of reuse: the smaller the steps are, the more generic they can be. With generic components and semi-automatic execution we succeeded in reducing execution time and human intervention and in improving error handling. Trial & Error uses an existing Workflow Management System (WFMS) to associate every data transformation step with a workflow task. We extended the functionality of the WFMS by embedding program code in it to support control flow. For a particular transformation procedure we select existing software applications, or create small software components suited to our requirements, and integrate them into the WFMS as tasks. Our application supports both the creation and the execution of workflow instances. It also supports storing and querying the Provenance information of each workflow instance, which is very important in this domain. We demonstrate our application by converting archaeological corpora written in Microsoft Word format into an RDF format compatible with CIDOC CRM.
This Thesis presents a novel application for Data Transformation and Cleaning and proposes a solution for all scientific domains that need to convert data lying in books and corpora into digital form.
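The multiple-step idea described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation and does not use the WFMS, which the abstract does not name. All function and class names, the sample input, and the choice of content digests as provenance data are assumptions made for illustration only. Each small, generic step is an ordinary function, and every execution appends a provenance record that can be queried afterwards.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A transformation step is modelled here as a plain function from text to text (assumption).
Step = Callable[[str], str]


@dataclass
class ProvenanceRecord:
    """Hypothetical provenance entry: which step ran, on what input, producing what output."""
    step_name: str
    input_digest: str
    output_digest: str


def digest(text: str) -> str:
    """Short content fingerprint used to link one step's output to the next step's input."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


def run_pipeline(data: str, steps: List[Tuple[str, Step]]) -> Tuple[str, List[ProvenanceRecord]]:
    """Apply the steps in order and record one provenance entry per step."""
    provenance: List[ProvenanceRecord] = []
    for name, step in steps:
        result = step(data)
        provenance.append(ProvenanceRecord(name, digest(data), digest(result)))
        data = result
    return data, provenance


# Small generic steps; none of them is specific to a particular source.
def strip_whitespace(text: str) -> str:
    return "\n".join(line.strip() for line in text.splitlines())


def drop_blank_lines(text: str) -> str:
    return "\n".join(line for line in text.splitlines() if line)


def split_fields(text: str) -> str:
    # Turn ';'-separated fields into tab-separated fields.
    return "\n".join(
        "\t".join(part.strip() for part in line.split(";"))
        for line in text.splitlines()
    )


# Invented sample input standing in for a manually formatted text source.
raw = "  vase ; Knossos ; clay  \n\n  seal ; Phaistos ; stone \n"

steps = [
    ("strip_whitespace", strip_whitespace),
    ("drop_blank_lines", drop_blank_lines),
    ("split_fields", split_fields),
]

result, provenance = run_pipeline(raw, steps)

# Querying provenance: list the steps that actually changed the data.
changed = [r.step_name for r in provenance if r.input_digest != r.output_digest]
```

Because each step does one small thing, the same functions could be recombined for a different source, and the chain of digests shows which step produced which intermediate result.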

Physical description

viii, 118 p. : ill. ; 30 cm.

Language

English

Issue date

2008-07-22

Collection

School/Department--School of Sciences and Engineering--Department of Computer Science--Post-graduate theses

Type of Work--Post-graduate theses
