What is an ETL?
66Extract, Transform and Load (ETL) Data
What is an ETL
ETL is an acronym for Extract, Transform and Load (ETL) Data. It is a process that involves extracting or harvesting data from outside sources, cleansing and converting the same for operational needs and deploying or uploading it onto the target database or system. ETL process is often used for data outsourcing for a new system such as data warehousing. ETL process is also employed in data integration or interfacing with different systems.
Data Extraction
Now a days data lies in different heterogeneous systems such as in web pages, blogs, e-commerce sites, wikis, forums, search engine result pages, relational databases, flat files, excel spreadsheets, xml files, ERP applications, content management systems, web services, documents and so on. Data extraction from such variety of systems is often quite challenging. Various methods and techniques can be employed in this step such as data scraping, web spidering, screen scraping, sql scripting, scripting, programming, using API etc.
Data Transformation or Conversion
Transforming the data to the required format is the next step. Usually mapping, search and replace, manual entry and editing, sql queries are used to transform collected data. Most common tool in this phase can be simple excel spreadsheet but ms access is very powerful and convenient if you know how to design and write sql queries. Using ms access (or any other relational database) it is very easy to validate the data for uniqueness, mandatory entries, referential integrity, and so on. These are either difficult or troublesome to attain using an excel spreadsheet.
The transformed data can be often converted to xml files or sql dump files which can be readily fed into the target system database. Often target systems have built in data importer or uploader that expects data in a specified format. Therefore data conversion from one format to another is an important pre requisite for the final step of data loading.
Data Loading
Ultimately transformed or converted data is uploaded onto the target system application or database, usually data warehouse.
Data loading can be done using target system's built-in importer. In absence of such tool, with the help of scripting (often sql dumping) or programming data can be fed into the target system.
Data loading can be one-off activity or a continuous and dynamic activity depending on the nature of the sources and targets and objective of the project.
Challenges
ETL process can become considerably complex sometimes. A common source of problem is dependencies among various tasks of ETL processes. Duplicate, inconsistent, incomplete or inaccurate data can pose quite a challenge for data cleansing. Use of wrong techniques or tools can produce invalid results. If the process is poorly planned and executed, significant operational problem will occur.
Thus on the surface ETL can be seemed quite simple but it is only a feeling of inexperienced. Executing a successful ETL project requires proper knowledge of database theory, foresightedness and detailed plan. Adequate knowledge about business, entities and their relationships and data normalization is essential. To avoid chaos during go-live and day-to-day operations employ knowledgeable resources in your ETL projects.






