Optimising ETL processes for network databases
|has title::Optimising ETL processes for network databases|
|Master:||project within::Information Sciences|
|Student name:||student name::Anil Bhikhie|
|Second reader:||has second reader::Steven Klusener|
In data warehouse projects one of the most important phases is the ETL process. ETL stands for Extract, Transform and Load: a process by which data is extracted from source systems, transformed, and loaded into the data warehouse. The ETL process is known as the most time-consuming task during the development of a data warehouse; approximately 60-80% of the work of developing a data warehouse is spent on it. The challenge is to reduce the time spent on the ETL process. More concisely: if we concentrate on the ETL process, how can we make it more effective and save time, and which factors influence it?
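The three ETL stages described above can be sketched as follows. This is a minimal, hypothetical illustration using in-memory Python structures in place of real source and target systems; all names (`extract`, `transform`, `load`, `SOURCE_ROWS`) are invented for the example.

```python
# Hypothetical sketch of the three ETL stages; real ETL runs against
# external source systems and a warehouse, not Python lists.
SOURCE_ROWS = [
    {"id": 1, "name": " Alice ", "amount": "100"},
    {"id": 2, "name": "Bob", "amount": "250"},
]

def extract(source):
    """Extract: read raw records from the source system."""
    return list(source)

def transform(rows):
    """Transform: clean values and convert them to the warehouse format."""
    return [
        {"id": r["id"], "name": r["name"].strip(), "amount": int(r["amount"])}
        for r in rows
    ]

def load(rows, warehouse):
    """Load: append the transformed records to the warehouse table."""
    warehouse.extend(rows)
    return warehouse

warehouse = load(transform(extract(SOURCE_ROWS)), [])
```

Even in this toy form, the transform step is where data anomalies (stray whitespace, values stored as text) are detected and repaired, which is why it tends to dominate the development effort.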
During the literature study some important influencing factors for the ETL process were found:
- The detection of data anomalies
- The volume of the data warehouse
- The time window available for ETL
- Design of the ETL architecture
- The importance of metadata
- Tool-based vs hand-coded ETL
- Complexity of the ETL process
- Users
Based on these factors, recommendations were presented to help improve the ETL process and reduce the time spent on it.
So far, no distinction has been made between the different source systems from which data has to be extracted. Capgemini’s legacy systems use ageing network databases. Extracting data from network databases is quite difficult, because ETL tools are mainly developed for relational databases. Since a demand for management information may well arise in the near future, Capgemini wants to know how it can extract data from its network databases, which run on Oracle CODASYL DBMS. The core question now is:
How to implement an efficient ETL process for Capgemini’s CODASYL DBMS?
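To illustrate why extraction from a network database is harder than from a relational one: a CODASYL database links records through owner-member "sets" that must be navigated record by record, whereas ETL tools expect flat relational rows. A common extraction approach is to walk each set and emit one row per owner-member pair. The sketch below is purely illustrative; the record types (`Customer` owning `Order` records) and field names are invented and do not reflect Capgemini's actual schema.

```python
# Hypothetical sketch: flattening a CODASYL-style owner-member set
# into relational-style rows suitable for an ETL load step.
class Record:
    """A network-database record with named fields and set members."""
    def __init__(self, **fields):
        self.fields = fields
        self.members = []  # records owned via a CODASYL set

# A tiny owner-member structure: one Customer owning two Orders.
customer = Record(cust_id="C1", name="Acme")
customer.members = [
    Record(order_id="O1", total=100),
    Record(order_id="O2", total=250),
]

def extract_set(owner):
    """Flatten one owner and its members into relational-style rows."""
    rows = []
    for member in owner.members:  # roughly: FIND NEXT WITHIN SET
        row = dict(owner.fields)  # owner fields are repeated per member
        row.update(member.fields)
        rows.append(row)
    return rows

rows = extract_set(customer)
```

The navigational loop stands in for the record-at-a-time DML of a real CODASYL system; in practice this traversal logic is exactly what standard relational ETL tools lack, which is why hand-written extraction code of this shape is often needed.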