ETL Concepts

Context

Now-a-days, every organization possess a huge set of data and they are finding a lot of difficulty in storing and maintaining the data. The data come from a varied sources and they need to be processed in such a way so as to give a meaningful information in order to make better business decisions.  But, the bitter truth is that these data come along with a lot of challenges:

  • Giant organizations with loads of data: Such huge chunks of data can be in any format. They might be available in an unstructured manner(such as in files) or in different databases.
  • Lack of channelized communication between the databases: Different databases do not communicate well. As a result, they need to be collated, compared and combined to work together as a seamless whole.
  • Relying on single database: Many organizations use single database for storing their data. Every pair of database requires a unique interface. When a particular database is changed, many interfaces need to be upgraded.

The solution to such problems is what we call ‘Data Integration‘. It allows a communication channel between different databases and file formats.

Well, there are different architectures used in data integration technology, but Informatica uses the ETL architecture, and this is the most popular architecture which is used for performing data integration.

ETL stands for:

  •  E – Extract
  •  T – Transform
  •  L – Load

ETL – Definition

ETL is short-form for extract, transform, load, three functions that are combined into one tool to pull data out of one database/file (commonly called as source) and place it into another database/file (better known as target). It is said to be the foundation of a data-warehouse.

  • Extract is the process of reading data from a database/file. Here, the data is collected, often from multiple and different types of sources.
  • Transform is the process of converting the extracted data from its previous form into the form it needs to be in so that it can be placed into another database.
  • Load is the process of writing the data into the target database.

ETL process can be described further into 4 processes:

 Data Capture  ———> Scrub ———-> Transform ———> Load & Index

(Operational Systems)—————————————> (Enterprise Data warehouse)

  • ETL Process 1 : Data Capture -> Data capture is actually extraction of data and obtaining a snapshot of the chosen data from source and loading into the data warehouse.

Further, there can be two types of data extraction:

i. Static Extract : Capturing a snapshot of source data at a point in time.

ii. Incremental Extract : Capturing changes that have occurred since last static.

  • ETL Process 2 : Data Scrub -> Data scrub or data-cleanse uses pattern recognition and AI techniques to upgrade data quality.

The technique includes Fixing errors and AI Usage.

i. Fixing errors : Mis-spellings, Erroneous dates, incorrect field usages,  mismatched address, missing data, duplicate data, inconsistencies, etc. fall under  ‘fixing-error’ categories.

ii. AI Usage : Decoding, Reformatting, Time-stamping, Conversion, Key-generation, Merging, Error-detection/Logging, Locating, etc. fall under ‘AI Usage’.

  • ETL Process 3 : Data Transform -> Transformation means conversion of data from Operational System format to EDW (Enterprise Data warehouse) format.

The transformation process can be of two types:

i. Record Level : This again includes:

  • Selection – Partitioning the data (Selective Selection)
  • Joining – Combining the data
  • Aggregation – Summarization of the data

ii. Field Level : This includes:

  • Single Field– From one field to one field (1 : 1)
  • Multi Field– From one field to many fields, or from many fields to one field (1 : M, M : 1)
  • ETL Process 4 : Data Load & Index -> Place transformed data to the enterprise data warehouse and create index.

Data Load & Index can happen in two modes:

i. Refresh Mode : This means bulk re-writing of target data at periodic intervals.

ii. Update Mode : Only the changes made to the source data and written to the target enterprise data warehouse.

So, now, as we all have a fair idea on ETL and its processes, we are in a better position to judge why Informatica is on a role for providing best ETL solutions. We will also come across the real life scenarios wherein Informatica can come very handy in the upcoming blog.

Next

2 Comments

Leave a comment