Optimising ETL processes for network databases

Status: finished
Master: Information Sciences
Student name: Anil Bhikhie
Student number: 1286307
Start date: 2009/06/15
End date: 2009/12/25
Supervisor: Rahul Premraj
Second reader: Steven Klusener
Company: Capgemini
Poster: Media:Posternaam.pdf


In data warehouse projects, one of the most important phases is the ETL process. ETL stands for Extract, Transform and Load: data is extracted from source systems, transformed, and loaded into the data warehouse. The ETL process is known as the most time-consuming task in the development of a data warehouse, accounting for approximately 60-80% of the work. The challenge is to reduce the time spent on it. More precisely: if we concentrate on the ETL process, how can we make it more effective and save time? What factors influence the process?

During the literature study, some important factors influencing the ETL process were found:

- The detection of data anomalies
- The volume of the data warehouse
- The time window available for ETL
- Design of the ETL architecture
- The importance of metadata
- Tool-based vs hand-coded ETL
- Complexity of the ETL process
- Users

Some recommendations were then presented to help improve the ETL process and reduce the time spent on it.

So far, no distinction was made between the different source systems from which data has to be extracted. Capgemini’s legacy systems use aged network databases. Extracting data from network databases is quite difficult, because ETL tools are mainly developed for relational databases. A demand for management information may arise in the near future, and Capgemini wants to know how it can extract data from its network databases, which run on Oracle CODASYL DBMS. The core question now is:

How to implement an efficient ETL process for Capgemini’s CODASYL DBMS?
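
To illustrate why network databases are awkward for relational ETL tools, the following is a minimal Python sketch of the extraction problem: records in a network database are linked through owner-member sets that must be navigated, while a relational target expects flat rows with foreign keys. The `Record` class, the `flatten` helper, the set name `CUSTOMER_ORDERS`, and the sample data are all hypothetical; they are not taken from Capgemini's systems or from the Oracle CODASYL DBMS interface.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """Hypothetical stand-in for a network-database record.

    A record has its own fields and may own named sets of member records.
    """
    record_type: str
    fields: dict
    sets: dict = field(default_factory=dict)  # set name -> list of member Records


def flatten(owner, set_name, owner_key):
    """Walk one owner-member set and yield relational-style rows.

    Each member row gets the owner's key added as a foreign-key column,
    which is the shape a relational staging table would expect.
    """
    fk_column = f"{owner.record_type.lower()}_{owner_key}"
    for member in owner.sets.get(set_name, []):
        row = dict(member.fields)  # copy so the source record is not modified
        row[fk_column] = owner.fields[owner_key]
        yield row


# Hypothetical sample data: one customer owning two orders.
customer = Record("CUSTOMER", {"id": 7, "name": "Acme"})
customer.sets["CUSTOMER_ORDERS"] = [
    Record("ORDER", {"order_no": 101, "amount": 250}),
    Record("ORDER", {"order_no": 102, "amount": 75}),
]

rows = list(flatten(customer, "CUSTOMER_ORDERS", "id"))
```

Each resulting row carries the order's own fields plus a `customer_id` column. In a real extraction the navigation step would go through the DBMS's own access layer rather than in-memory objects, which is where most of the effort is likely to lie.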