Data Warehouses and OLAP

Publications on Data Warehousing and OLAP

A Generic and Customizable Framework for the Design of ETL Scenarios

P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, and S. Skiadopoulos
In Information Systems, 30(7):492--525, 2005.

Abstract:
Extraction-Transformation-Loading (ETL) tools are pieces of software responsible for the extraction of data from several sources, their cleansing, customization and insertion into a data warehouse. In this paper, we delve into the logical design of ETL scenarios and provide a generic and customizable framework in order to support the DW designer in his task. First, we present a metamodel particularly customized for the definition of ETL activities. We follow a workflow-like approach, where the output of a certain activity can either be stored persistently or passed to a subsequent activity. Also, we employ a declarative database programming language, LDL, to define the semantics of each activity. The metamodel is generic enough to capture any possible ETL activity. Nevertheless, in the pursuit of higher reusability and flexibility, we specialize the set of our generic metamodel constructs with a palette of frequently-used ETL activities, which we call templates. Moreover, in order to achieve a uniform extensibility mechanism for this library of built-ins, we have to deal with specific language issues. Therefore, we also discuss the mechanics of template instantiation to concrete activities. The design concepts that we introduce have been implemented in a tool, ARKTOS II, which is also presented.
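The workflow-like approach described in this abstract can be illustrated with a minimal Python sketch. The paper itself defines activity semantics in LDL; the code below only mimics the idea of instantiating templates into concrete activities and chaining them, and every name in it (not_null_template, rename_template, run_workflow, the staging dictionary) is a hypothetical choice made for illustration, not part of ARKTOS II.

```python
from typing import Callable, Dict, Iterable, List, Optional

Row = Dict[str, object]
Activity = Callable[[Iterable[Row]], List[Row]]


def not_null_template(field: str) -> Activity:
    """Hypothetical template: instantiated with a field name, it yields
    a concrete activity that drops rows where that field is missing."""
    def activity(rows: Iterable[Row]) -> List[Row]:
        return [r for r in rows if r.get(field) is not None]
    return activity


def rename_template(old: str, new: str) -> Activity:
    """Hypothetical template: instantiated with two attribute names, it
    yields a concrete activity that renames one attribute."""
    def activity(rows: Iterable[Row]) -> List[Row]:
        return [{(new if k == old else k): v for k, v in r.items()} for r in rows]
    return activity


def run_workflow(rows: Iterable[Row],
                 activities: List[Activity],
                 store: Optional[Dict[int, List[Row]]] = None) -> List[Row]:
    """Pass the output of each activity to the next one; if a store is
    given, also keep each intermediate result persistently."""
    current = list(rows)
    for step, activity in enumerate(activities):
        current = activity(current)
        if store is not None:
            store[step] = list(current)
    return current


source = [{"id": 1, "cost": 10}, {"id": 2, "cost": None}]
workflow = [not_null_template("cost"), rename_template("cost", "cost_eur")]
staging: Dict[int, List[Row]] = {}
result = run_workflow(source, workflow, store=staging)
```

Here each template is a function that returns an activity, so extending the palette only requires adding another such function; this is one possible design, chosen for brevity.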

Note: This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.

Advanced Visualization for Mobile OLAP

A. Maniatis, P. Vassiliadis, S. Skiadopoulos, G. Mavrogonatos, and I. Michalarias
International Journal of Data Warehousing and Mining, 1(1):1--36, 2005.

Abstract:
Data visualization is one of the major issues of database research, and OLAP, being a decision support technology, is clearly at the center of this effort. Still, so far, visualization has not been incorporated in the abstraction levels of DBMS architecture (conceptual, logical, physical), nor has it been formally treated in this context. In this paper we start by reconsidering the separation of the aforementioned abstraction levels to take visualization into consideration. Then, we present the Cube Presentation Model (CPM), a novel presentational model for OLAP screens. The proposal rests on the fundamental idea of separating the logical part of a data cube computation from the presentational part of the client tool. CPM can then be naturally mapped onto the Table Lens, an advanced visualization technique from the Human-Computer Interaction area, particularly tailored for cross-tab reports. Based on the particularities of Table Lens, we propose automated proactive support to the user for the interaction with an OLAP screen. Finally, we discuss implementation and usage issues in the context of an academic prototype system (CubeView) that we have implemented.

Advanced Visualization for OLAP

A. Maniatis, P. Vassiliadis, S. Skiadopoulos, and Y. Vassiliou
In Proceedings of the ACM 6th Int'l Workshop on Data Warehousing and OLAP (DOLAP'03), pages 9--16. ACM Press, November 2003.

Abstract:
Data visualization is one of the big issues of database research. OLAP, as a decision support technology, is highly related to developments in the data visualization area. In this paper we demonstrate how the Cube Presentation Model (CPM), a novel presentational model for OLAP screens, can be naturally mapped onto the Table Lens, an advanced visualization technique from the Human-Computer Interaction area, particularly tailored for cross-tab reports. We consider how the user interacts with an OLAP screen and, based on the particularities of Table Lens, we propose automated proactive user support. Finally, we discuss the necessity and the applicability of advanced visualization techniques in the presence of recent technological developments.

Arktos: A Tool For Data Cleaning and Transformation in Data Warehouse Environments

P. Vassiliadis, Z. Vagena, S. Skiadopoulos, N. Karayannidis, and T. Sellis
IEEE Data Engineering Bulletin, 23(4):42--47, 2000.

Abstract:
Extraction-Transformation-Loading (ETL) and Data Cleaning tools are pieces of software responsible for the extraction of data from several sources, their cleaning, customization and insertion into a data warehouse. To deal with the complexity and efficiency of the transformation and cleaning tasks we have developed a tool, namely ARKTOS, capable of modeling and executing practical scenarios, by providing explicit primitives for the capturing of common tasks. ARKTOS provides three ways to describe such a scenario, including a graphical point-and-click front end and two declarative languages: XADL (an XML variant), which is more verbose and easy to read and SADL (an SQL-like language) which has a quite compact syntax and is, thus, easier for authoring.

Arktos: Towards the Modeling, Design, Control and Execution of ETL Processes

P. Vassiliadis, Z. Vagena, S. Skiadopoulos, N. Karayannidis, and T. Sellis
In Information Systems, 26(8):537--561, 2001.

Abstract:
Extraction-Transformation-Loading (ETL) tools are pieces of software responsible for the extraction of data from several sources, their cleansing, customization and insertion into a data warehouse. Literature and personal experience have guided us to conclude that the problems concerning the ETL tools are primarily problems of complexity, usability and price. To deal with these problems we provide a uniform metamodel for ETL processes, covering the aspects of data warehouse architecture, activity modeling, contingency treatment and quality management. The ETL tool we have developed, namely ARKTOS, is capable of modeling and executing practical ETL scenarios by providing explicit primitives for the capturing of common tasks. ARKTOS provides three ways to describe an ETL scenario: a graphical point-and-click front end and two declarative languages: XADL (an XML variant), which is more verbose and easy to read, and SADL (an SQL-like language), which has a quite compact syntax and is, thus, easier for authoring.

Blueprints for ETL workflows

P. Vassiliadis, A. Simitsis, M. Terrovitis, and S. Skiadopoulos
In Proceedings of the 24th International Conference on Conceptual Modeling (ER'05), volume 3716 of LNCS, pages 385--400. Springer, 2005.

Abstract:
Extract-Transform-Load (ETL) workflows are data-centric workflows responsible for transferring, cleaning, and loading data from their respective sources to the warehouse. Previous research has identified graph-based techniques that construct the blueprints for the structure of such workflows. In this paper, we extend existing results by explicitly incorporating the internal semantics of each activity in the workflow graph. Apart from the value that blueprints have per se, we exploit our modeling to introduce rigorous techniques for the measurement of ETL workflows. To this end, we build upon an existing formal framework for software quality metrics and formally prove how our quality measures fit within this framework.

CPM: A Cube Presentation Model for OLAP

A. Maniatis, P. Vassiliadis, S. Skiadopoulos, and Y. Vassiliou

Conceptual Modelling for ETL Processes

P. Vassiliadis, A. Simitsis, and S. Skiadopoulos
In Proceedings of the ACM 5th Int'l Workshop on Data Warehousing and OLAP (DOLAP'02), pages 14--21. ACM Press, July 2002.

Abstract:
Extraction-Transformation-Loading (ETL) tools are pieces of software responsible for the extraction of data from several sources, their cleansing, customization and insertion into a data warehouse. In this paper, we focus on the problem of the definition of ETL activities and provide formal foundations for their conceptual representation. The proposed conceptual model is (a) customized for the tracing of inter-attribute relationships and the respective ETL activities in the early stages of a data warehouse project; (b) enriched with a 'palette' of a set of frequently used ETL activities, like the assignment of surrogate keys, the check for null values, etc; and (c) constructed in a customizable and extensible manner, so that the designer can enrich it with his own re-occurring patterns for ETL activities.
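As an illustration of one of the frequently used activities named in this abstract, the assignment of surrogate keys, the following Python sketch maps each natural (source) key to a warehouse-side integer key. It is not taken from the paper: the function name, the in-memory lookup dictionary and the field names are all assumptions made for the example.

```python
from typing import Dict, Hashable, List

Row = Dict[str, object]


def assign_surrogate_keys(rows: List[Row],
                          natural_key: str,
                          lookup: Dict[Hashable, int],
                          sk_field: str = "sk") -> List[Row]:
    """Attach a surrogate key to every row (hypothetical helper).

    `lookup` plays the role of a persistent key-mapping table: known
    natural keys reuse their surrogate key, unseen ones receive the
    next free integer.
    """
    out = []
    for row in rows:
        nk = row[natural_key]
        if nk not in lookup:
            lookup[nk] = max(lookup.values(), default=0) + 1
        out.append({**row, sk_field: lookup[nk]})
    return out


lookup_table: Dict[Hashable, int] = {}
loaded = assign_surrogate_keys(
    [{"part": "A"}, {"part": "B"}, {"part": "A"}],
    natural_key="part",
    lookup=lookup_table,
)
```

Keeping the lookup outside the function lets repeated loads reuse the same mapping, which is the property a surrogate key activity needs across ETL runs.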

Graph-Based Modeling of ETL Activities with Multi-Level Transformations and Updates

A. Simitsis, P. Vassiliadis, M. Terrovitis, and S. Skiadopoulos
In Proceedings of the 7th Int'l Conference on Data Warehousing and Knowledge Discovery (DaWaK'05), volume 2589 of LNCS, pages 43--52. Springer, 2005

Abstract:
Extract-Transform-Load (ETL) workflows are data centric workflows responsible for transferring, cleaning, and loading data from their respective sources to the warehouse. In this paper, we build upon existing graph-based modeling techniques that treat ETL workflows as graphs by (a) extending the activity semantics to incorporate negation, aggregation and self-joins, (b) complementing querying semantics with insertions, deletions and updates, and (c) transforming the graph to allow zoom-in/out at multiple levels of abstraction (i.e., passing from the detailed description of the graph at the attribute level to more compact variants involving programs, relations and queries and vice-versa).

Modelling ETL Activities as Graphs

P. Vassiliadis, A. Simitsis, and S. Skiadopoulos
In Proceedings of the 4th Int'l Workshop on the Design and Management of Data Warehouses (DMDW'02), pages 52--61. CEUR Workshop Proceedings, May 2002

Abstract:
Extraction-Transformation-Loading (ETL) tools are pieces of software responsible for the extraction of data from several sources, their cleansing, customization and insertion into a data warehouse. In this paper, we focus on the logical design of the ETL scenario of a data warehouse. Based on a formal logical model that includes the data stores, activities and their constituent parts, we model an ETL scenario as a graph, which we call the Architecture Graph. We model all the aforementioned entities as nodes and four different kinds of relationships (instance-of, part-of, regulator and provider relationships) as edges. In addition, we provide simple graph transformations that reduce the complexity of the graph. Finally, in order to support the engineering of the design and the evolution of the warehouse, we introduce specific importance metrics, namely dependence and responsibility, to measure the degree to which entities are bound to each other.
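The graph structure described in this abstract can be sketched in Python as a set of typed edges. The four edge kinds come from the abstract; everything else is an assumption. In particular, the abstract does not define the dependence and responsibility metrics, so the sketch uses one plausible, simplified reading (the number of nodes reachable backward or forward along provider edges), and the paper's actual definitions may differ.

```python
from collections import defaultdict

# The four relationship kinds named in the abstract.
EDGE_KINDS = {"instance-of", "part-of", "regulator", "provider"}


class ArchitectureGraph:
    """Minimal, hypothetical container for a typed-edge scenario graph."""

    def __init__(self):
        self.edges = defaultdict(set)  # kind -> set of (source, target)

    def add_edge(self, kind, source, target):
        if kind not in EDGE_KINDS:
            raise ValueError(f"unknown edge kind: {kind}")
        self.edges[kind].add((source, target))

    def _reachable(self, start, forward):
        # Build adjacency over provider edges only, in the chosen direction.
        adjacency = defaultdict(set)
        for source, target in self.edges["provider"]:
            if forward:
                adjacency[source].add(target)
            else:
                adjacency[target].add(source)
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        seen.discard(start)
        return seen

    def dependence(self, node):
        """Assumed reading: how many nodes feed data into `node`."""
        return len(self._reachable(node, forward=False))

    def responsibility(self, node):
        """Assumed reading: how many nodes receive data from `node`."""
        return len(self._reachable(node, forward=True))


g = ArchitectureGraph()
g.add_edge("part-of", "S", "S.cost")
g.add_edge("provider", "S.cost", "A1.in")
g.add_edge("provider", "A1.in", "A1.out")
g.add_edge("provider", "A1.out", "DW.cost")
```

With this toy scenario, the warehouse attribute depends on every upstream attribute, while the source attribute is responsible for every downstream one.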

Modelling ETL Processes as Graphs

P. Vassiliadis, A. Simitsis, and S. Skiadopoulos
In Proceedings of the Hellenic Data Management Symposium (HDMS'03), September 2003.

Abstract:
This paper concerns the logical design of ETL (Extraction-Transformation-Loading) scenarios for data warehouses. Based on a formal logical model consisting of data stores, activities and their constituent parts, an ETL scenario is modeled as a graph, called the Architecture Graph. All the aforementioned entities form the nodes of the graph, and the four different kinds of relationships among them (instance-of, part-of, regulator and provider relationships) form the edges. To support the design and the evolution of the data warehouse, specific importance metrics are defined, namely dependence and responsibility, to compute the degree to which the entities of the scenario are bound to each other.

Modelling and Language Support for the Management of Pattern-Bases

M. Terrovitis, P. Vassiliadis, S. Skiadopoulos, E. Bertino, B. Catania, and A. Maddalena
In Proceedings of the 16th International Conference on Scientific and Statistical Database Management (SSDBM'04), pages 265--274. IEEE Computer Society, 2004

Abstract:
In our days knowledge extraction methods are able to produce artifacts (also called patterns) that concisely represent data. Patterns are usually quite heterogeneous and require ad-hoc processing techniques. So far, little emphasis has been posed on developing an overall integrated environment for uniformly representing and querying different types of patterns. Within the larger context of modelling, storing, and querying patterns, in this paper, we: (a) formally define the logical foundations for the global setting of pattern management through a model that covers data, patterns and their intermediate mappings; (b) present a pattern specification language for pattern management along with safety restrictions; and (c) introduce queries and query operators and identify interesting query classes.

Modelling and Optimization Issues for Multidimensional Databases

P. Vassiliadis and S. Skiadopoulos.
In Proceedings of CAiSE'00, volume 1789 of LNCS, pages 482--497. Springer, June 2000.

Abstract:
It is commonly agreed that multidimensional data cubes form the basic logical data model for OLAP applications. Still, there seems to be no agreement on a common model for cubes. In this paper we propose a logical model for cubes based on the key observation that a cube is not a self-existing entity, but rather a view over an underlying data set. We accompany our model with syntactic characterisations for the problem of cube usability. To this end, we have developed algorithms to check whether (a) the marginal conditions of two cubes are appropriate for a rewriting, in the presence of aggregation hierarchies and (b) an implication exists between two selection conditions that involve different levels of aggregation of the same dimension hierarchy. Finally, we present a rewriting algorithm for the cube usability problem.
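The key observation of this abstract, that a cube is a view over an underlying data set, and the related cube usability question can be illustrated with a small Python sketch. It is not the paper's algorithm: it assumes a single time dimension with a day, month, year hierarchy encoded in ISO date strings, a SUM aggregate (which is distributive, so coarser cubes can be derived from finer ones), and hypothetical helper names cube and roll_up.

```python
from collections import defaultdict


def cube(detail_rows, level_of, measure="sales"):
    """Compute a cube as an aggregate view over the detailed data set.

    `level_of` maps a detail row to its value at the chosen level of
    the dimension hierarchy (hypothetical helper).
    """
    totals = defaultdict(int)
    for row in detail_rows:
        totals[level_of(row)] += row[measure]
    return dict(totals)


def roll_up(existing_cube, to_parent):
    """Answer a coarser cube from an existing finer cube (valid for SUM)."""
    totals = defaultdict(int)
    for key, value in existing_cube.items():
        totals[to_parent(key)] += value
    return dict(totals)


detail = [
    {"day": "2004-12-31", "sales": 4},
    {"day": "2005-01-03", "sales": 5},
    {"day": "2005-01-20", "sales": 7},
    {"day": "2005-02-01", "sales": 1},
]

by_month = cube(detail, lambda r: r["day"][:7])          # month-level cube
by_year_direct = cube(detail, lambda r: r["day"][:4])    # year-level cube
by_year_from_month = roll_up(by_month, lambda m: m[:4])  # rewritten over by_month
```

In this toy setting the month-level cube is usable for the year-level one because months nest inside years and the aggregate is a sum; the paper's syntactic checks address the general case, including selection conditions at different levels.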

Modelling and Querying Multidimensional Databases

P. Vassiliadis and S. Skiadopoulos
In Proceedings of the Hellenic Data Management Symposium (HDMS'02), July 2002.

Abstract:
On-Line Analytical Processing (OLAP) is a trend in database technology based on the multidimensional view of information at the client level. Despite the common acceptance of multidimensional cubes as the central logical model for OLAP and the plethora of research proposals, there is little agreement on a common terminology and semantics for the logical data model. This paper proposes one more logical model for cubes, based on the observation that a cube is not a self-existing entity but rather a view over an underlying data set. The proposed model is powerful enough to cover all the usual OLAP operations, such as selection, roll-up and drill-down across levels of granularity, through a sound and complete algebra. It is also shown how this model can serve as the basis for processing operations on cubes, and syntactic characterisations are presented for the cube usability problems (i.e., the problem of using data from one cube to compute another cube).

On the Logical Modelling of ETL Processes

P. Vassiliadis, A. Simitsis, and S. Skiadopoulos
In Proceedings of the 14th Conference on Advanced Information Systems Engineering (CAiSE'02), short paper, volume 2348 of LNCS, pages 782--786. Springer, May 2002

Abstract:
Extraction-Transformation-Loading (ETL) tools are pieces of software responsible for the extraction of data from several sources, their cleansing, customization and insertion into a data warehouse. Research has only recently dealt with the above problem and provided few models, tools and techniques to address the issues around the ETL environment [1,2,3,5]. In this paper, we present a logical model for ETL processes. The proposed model is characterized by several templates, representing frequently used ETL activities along with their semantics and their interconnection. In the full version of the paper [4] we present more details on the aforementioned issues and complement them with results on the characterization of the content of the involved data stores after the execution of an ETL scenario and impact-analysis results in the presence of changes.
