Extract Transform Load Processes Computer Science Essay

Business intelligence is a set of applications and knowledge used to gather and analyse data in order to help users make better decisions [1]. One important area covered by business intelligence is the ETL process.

ETL stands for Extraction-Transformation-Loading, a process used to extract data from different sources, then cleanse, customize and insert it into a data warehouse [3]. The extraction process is mainly used to pull data from external sources; the transformation process converts the data to meet operational requirements, ensuring consistency and satisfying business needs; and the loading process loads the transformed data into the data warehouse or database [4]. ETL is commonly used in data warehousing, where the process extracts data from one or more external or internal sources, cleanses and filters it, and then loads it into the data warehouse so that users can analyse the information and make critical decisions.

There are many well-known ETL tools in use, and this research focuses mainly on the Microsoft SQL Server ETL tool. Microsoft SQL Server is a relational database server, and one of its components is SQL Server Integration Services (SSIS). SSIS is primarily a platform for workflow applications and data integration, and it provides good data warehousing tooling for the ETL process [5]. SSIS contains a set of tools that support the ETL process, such as the Data Flow Engine, the scripting environment and the Data Profiler.

Chapter 2 Extract-Transform-Load Processes

Extract-Transform-Load processes, commonly known as ETL processes, form a continuous, staged process in data warehousing. The first stage is extraction, which extracts data from different databases and various data formats. After extraction, the second stage is transformation, which reformats and cleanses the data to meet business and operational needs. Finally, the data is loaded into a data warehouse, database or data mart for analysis. Besides extracting, transforming and loading data, an ETL tool in a data warehouse can also be used to move data from one operational system to another. [6]

2.1 Extraction

Extraction is the first part of the ETL process, in which data is pulled from external or internal sources. These sources are usually maintained on different operating systems and hardware that are not compatible with each other.

2.1.1 Data Extraction Method

When starting an ETL process, the first thing to consider is the method of extraction, because it affects the transportation process, the source systems and also the refresh time of the warehouse. [7] There are two types of data extraction methods: logical extraction and physical extraction.

Logical Extraction

The logical extraction method is divided into two kinds: full extraction and incremental extraction. Full extraction extracts the data completely from the sources; it does not need to keep track of changes to the data sources because it replicates all of the latest data in the source system. An example of full extraction is an export file of a distinct table. Incremental extraction works quite differently, because it extracts data in stages rather than all at once. Incremental extraction needs to monitor the changes made to the sources, and at a specific point in time only the data that has changed is extracted. That point may be the time of the last extraction run or a more complex business transaction or event, such as the last booking day of a fiscal period. The changed data in the sources must be identifiable for the specific period. Change information can be provided by the source data itself in an application column that reflects the latest change timestamp, or in a table that includes a mechanism to keep track of the changes. [7]
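
As a minimal sketch of incremental extraction, the following Transact-SQL pulls only the rows changed since the previous run. The Orders table, its LastModified column and the etl.LastExtract bookkeeping table are assumed names used for illustration, not part of any particular source system.

    -- Read the timestamp of the previous successful extraction (assumed bookkeeping table).
    DECLARE @LastRun DATETIME;
    SELECT @LastRun = LastExtractTime FROM etl.LastExtract WHERE SourceName = 'Orders';

    -- Extract only rows changed since the last run.
    SELECT OrderID, CustomerID, OrderAmount, LastModified
    FROM dbo.Orders
    WHERE LastModified > @LastRun;

    -- Record the new high-water mark for the next incremental run.
    UPDATE etl.LastExtract
    SET LastExtractTime = GETDATE()
    WHERE SourceName = 'Orders';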

Physical Extraction

The physical extraction method is also divided into two types: online extraction and offline extraction. Online extraction extracts data directly from the source system; the process either connects directly to the source system to access the data tables themselves or connects to an intermediate system that stores the data in a preconfigured manner, e.g. snapshot logs and change tables. Online extraction can extract the data or transaction information using original source objects or prepared source objects. In contrast, offline extraction does not pull data directly from the source system; instead the data is staged explicitly outside of the original source system. Such data either has an existing structure, e.g. archive logs and redo logs, or is created by an extraction routine. [7]

2.1.2 Data Extraction Design

When designing the data extraction process, there are a few factors to consider because they affect the project cost, comprehensibility, and ease of maintenance and development. These factors are:

Data Consistency

If it is necessary to extract data from more than one source, then the data from these systems must represent a single, consistent business view rather than different business states captured at different points in time. In addition, not every source system includes date and time stamps on every single transaction. The extraction sometimes needs to work around this by running after all the batch and interface runs are complete, or by time-stamping all of the extracted data so that dimensions are interlocked with the time dimension. [8]

Reliable Sources

The data to be extracted must be dependable, which means it must come from a reliable and correct original source. For example, customer data may exist in several systems, so it must be extracted from the source that is the most up to date and complete. [8]

Timely Availability

The extraction process must be completed on time and in the right format so that the next stage of the ETL process can continue processing on schedule. For example, if the transformation process needs to start at nine in the morning, then the extraction process should be completed before that time. [8]

Extraction Quality

A quality extraction process must guarantee complete synchronization between the extracted data in the staging area and the data in the production databases. This can be monitored by performing count and aggregate matches on the important fields in both the staging and source systems. [8] In addition, poor quality in the extracted data, such as missing values, values in an inappropriate format and referential integrity issues, must be addressed, because the data loaded into the data warehouse must be accurate. For example, if the database is used for marketing, the customer's address must be validated first to avoid returned-mail issues. [9]
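
The count-and-aggregate check described above can be implemented with a simple reconciliation query. This is a minimal sketch, assuming a source table dbo.Sales and a staging copy staging.Sales with a numeric Amount column; the names are illustrative only.

    -- Compare row counts and totals between source and staging (hypothetical table names).
    SELECT
        (SELECT COUNT(*)    FROM dbo.Sales)     AS SourceRowCount,
        (SELECT COUNT(*)    FROM staging.Sales) AS StagingRowCount,
        (SELECT SUM(Amount) FROM dbo.Sales)     AS SourceTotal,
        (SELECT SUM(Amount) FROM staging.Sales) AS StagingTotal;
    -- Any mismatch between the pairs signals an incomplete or inconsistent extraction.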

2.1.3 Extraction Sources

Sources in the ETL process differ, because they depend on the platform, the database management system and the language used to connect to the source, for example COBOL, Transact-SQL and so on. These sources include:

Open Database Connectivity (ODBC)

Open Database Connectivity is an interface used to connect to database management systems and also flat files. The ODBC interface contains an ODBC manager that connects the ETL application with the ODBC driver. [10]

Flat Files

A flat file, usually known as a text file, is a basic, non-relational data set. Flat files allow data to be extracted from a database that cannot be accessed directly. Flat files are very important because much of the data available is in flat-file format, and handling a flat file is easier than handling a database management system. In addition, flat files allow bulk loading in fixed-length or delimited format. [10]
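
A bulk load of a delimited flat file can be performed with the BULK INSERT statement, as in the minimal sketch below; the file path and staging table name are illustrative assumptions.

    -- Load a comma-delimited text file into a staging table (assumed path and table).
    BULK INSERT staging.CustomerFlatFile
    FROM 'C:\etl\customers.csv'
    WITH (
        FIELDTERMINATOR = ',',   -- delimited format
        ROWTERMINATOR   = '\n',
        FIRSTROW        = 2      -- skip a header row
    );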

Enterprise Resource Planning (ERP) System Sources

Enterprise Resource Planning systems are used to integrate various data into one database or platform. These systems are quite complex because they usually contain hundreds of tables; therefore, extracting from an ERP system is quite difficult. The best solution is to use special adapters to communicate with these systems, but the adapters usually cost a lot. Examples of ERP systems are Oracle and SAP. [10]

XML

XML stands for Extensible Markup Language; it is a data language that is independent of platform. Data from different platform-dependent languages can be exchanged by using XML. XML is designed to describe data content and structure. [10]
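
As a sketch of how platform-independent XML data can be shredded into relational rows, the query below uses SQL Server's xml data type; the document and element names are assumptions made for illustration.

    -- Hypothetical XML document describing customers.
    DECLARE @doc XML = N'<Customers>
        <Customer><ID>1</ID><Name>Alice</Name></Customer>
        <Customer><ID>2</ID><Name>Bob</Name></Customer>
    </Customers>';

    -- Shred the XML into rows and columns.
    SELECT  c.value('(ID)[1]',   'INT')           AS CustomerID,
            c.value('(Name)[1]', 'NVARCHAR(100)') AS CustomerName
    FROM @doc.nodes('/Customers/Customer') AS t(c);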

Web Logs and Clickstream Data

A web log records every user that enters a particular website, and a clickstream records and stores the actions carried out by the computer user. This source is important because it stores information about the preferences and needs of users. [10]

2.2 Transformation

Transformation is the second stage of the ETL process; it transforms the extracted data into a data warehouse schema that is consistent with the business rules and requirements. Data transformation is the most complicated part of the ETL process, because it needs to ensure the accuracy and validity of the data, convert data types and apply the business rules. [11] Data transformation covers a few areas, which are:

Cleansing

Cleansing is the process that changes data that violates the business rules so that it conforms to those rules. This is done by ETL programs that determine the correct data values and apply the corrections. [12]

Summarization

Data values are summarized to obtain total figures, which are stored at multiple levels as business facts in multidimensional fact tables. [12]

Derivation

New data is created from the extracted source data by using calculations, table lookups or program logic. For example, a customer's age can be calculated from the current year and their date of birth. [12]
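
The age example can be sketched in Transact-SQL as a derived column; the Customer table and BirthDate column are assumed names.

    -- Derive an Age attribute from the date of birth.
    -- DATEDIFF(YEAR, ...) is approximate: it ignores whether the birthday has passed this year.
    SELECT  CustomerID,
            BirthDate,
            DATEDIFF(YEAR, BirthDate, GETDATE()) AS Age
    FROM dbo.Customer;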

Aggregation

Data elements for customers can be aggregated from various source files and databases, e.g. a customer file and a sales file. [12]

Integration

The main purpose of integration is for each data element to be known by one standard definition and approved name. Integration forces the need to reconcile different extracted data names and values for the same data element. Each data element must be associated with its own source databases and also with the business intelligence target databases. Data standardization is one of the business objectives. [12]

Sometimes the transformation of a data source can seem like an endless process, because a great deal of data may need to be transformed to enforce data integrity and the business requirements or rules. [12]

2.2.1 Data Cleansing

Data cleansing, also known as data scrubbing, is the process of detecting, correcting or removing records found to be inaccurate or corrupt in the extracted data. Data that is wrong, inaccurate, irrelevant or incomplete must be replaced, modified or deleted, because such dirty data would mislead or reduce the accuracy of analysis performed against the data warehouse or database. [13] The data cleansing operation is very important, because inaccurate data in the database or data warehouse may cause serious problems, such as wrong decisions being made on the basis of unreliable data. Data cleansing can be performed on a single data set or on multiple sets of data. There are two ways of performing data cleansing, depending on the complexity of the data: manual data cleansing is used in the simplest cases, while automated data cleansing is used for complex operations. [14]

Manual Data Cleansing

Manual data cleansing is done by a person who reads the extracted data and verifies its accuracy. They correct data that contains spelling mistakes or data with completely missing entries. Unnecessary data is removed in this process to increase the efficiency of data processing. [14]

Automated Data Cleansing

In some cases manual data cleansing is not efficient enough, for example when the number of records is massive or the operation is complex and must be completed within a specific period. In such operations the human work is replaced by computer programs. [14]

Data cleansing is a very complex and time-consuming process that requires a fair amount of work; it is also an important element of the ETL process, because business failure may result if the data provided in the data warehouse is inaccurate. [14] The data cleansing process is divided into a few stages, which are:

Data Auditing

The extracted data is audited using statistical methods to detect data anomalies and contradictions. The result gives an indication of the characteristics of the anomalies and their locations in the data. [13]

Workflow Specification

The workflow comes after data auditing; it considers and specifies how the anomalies found in the earlier stage should be handled. For example, if there are typing errors from the data input stage, the keyboard layout can be used to work out the likely corrections. [13]

Workflow Execution

After the correctness of the specification is verified, the workflow is executed, and it should run efficiently even on huge sets of data. [13]

Post-Processing and Controlling

After the cleansing workflow has executed, the result must be checked to verify its correctness. People need to do manual checking and data correction if the workflow did not correct the errors. [14]

During the data cleansing operation, some popular methods are commonly used, such as parsing, duplicate elimination and statistical methods.

Parsing

The purpose of the parsing method is to detect syntax errors in the data. The parser decides whether a data value is acceptable within the data specification or not. [13]

Duplicate Elimination

Using an algorithm, data that contains duplicated information or entities is detected. To make the detection faster, the data is sorted by a key that brings duplicated entries closer together. [13]
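
One common way to detect and remove duplicates in SQL Server is to number the rows within each duplicate group and keep only the first. This is a minimal sketch, assuming a staging.Customer table logically keyed on name and date of birth; the table and column names are hypothetical.

    -- Keep one row per logical key and delete the rest (assumed key columns).
    WITH Ranked AS (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY CustomerName, BirthDate
                                  ORDER BY LoadDate DESC) AS rn
        FROM staging.Customer
    )
    DELETE FROM Ranked WHERE rn > 1;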

Statistical Method

Using standard deviation, range or clustering algorithms, data with unexpected values can be found, and such data is usually erroneous. Statistical methods can also be used to replace missing values in the data with one or more plausible values obtained from an extended data augmentation algorithm. [13]
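
A simple version of the statistical check is to flag values that fall more than a few standard deviations from the mean; a sketch follows, assuming a staging.Sales table with a decimal Amount column and an arbitrary three-sigma threshold.

    -- Flag rows whose Amount is more than three standard deviations from the mean.
    WITH Stats AS (
        SELECT AVG(Amount) AS MeanAmount, STDEV(Amount) AS StdAmount
        FROM staging.Sales
    )
    SELECT s.*
    FROM staging.Sales AS s
    CROSS JOIN Stats
    WHERE ABS(s.Amount - Stats.MeanAmount) > 3 * Stats.StdAmount;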

The output of the data transformation should achieve a set of quality criteria, which include:

Accuracy

Data must be accurate, which can be measured by an aggregate value over consistency and integrity. [13]

Integrity

Data integrity is the most significant of the quality criteria; it can be measured by an aggregate value over completeness and data validity. [13]

Completeness

Data must be complete; corrections must be made to data that contains anomalies, for example typing errors or empty values. [13]

Uniqueness

The data should be unique and contain no duplicated records, because duplicates affect the accuracy of the analysis. [13]

Validity

Data validity can be checked by the amount of data that meets the integrity constraint requirements. [13]

2.2.2 Other Types of Transformation Done on Data

Integration is applied to the data during data transformation; it links the data from numerous sources into a well-connected whole. Data integration has some key elements, which include:

Creating Common Keys

Since data extraction pulls data from different sources with different systems, different systems may use different keys to represent the same entity. Creating a common key helps solve this problem and avoid confusion. For example, the production system may have agent codes that differ from those in the sales system, and the same may apply to customer or other codes. [15]

Creating Surrogate Keys

Surrogate keys are created for use in loading data and in the star schema. Although creating surrogate keys takes effort, the method itself is straightforward. At a high level, we identify the dimension for which the key needs to be created, list out the maximum possible instances, assign a series of surrogate keys (numbers or combinations of numbers and characters) for the dimension, and finally map the dimension key to the surrogate key range. [15]
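
In SQL Server a surrogate key is often generated with an IDENTITY column on the dimension table. The sketch below makes that assumption and uses a hypothetical customer dimension.

    -- Dimension table with a system-generated surrogate key.
    CREATE TABLE dbo.DimCustomer (
        CustomerKey   INT IDENTITY(1,1) PRIMARY KEY,   -- surrogate key
        CustomerCode  NVARCHAR(20)  NOT NULL,          -- natural/business key from the source
        CustomerName  NVARCHAR(100) NOT NULL
    );

    -- The surrogate key is assigned automatically during the load.
    INSERT INTO dbo.DimCustomer (CustomerCode, CustomerName)
    SELECT CustomerCode, CustomerName
    FROM staging.Customer;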

Standardization

This transformation focuses on standardizing descriptions and textual attributes as well. Different data or entities in different systems may have different descriptions. This process standardizes these differing attributes into one descriptive attribute for all masters and codes. [15]

Data Type Conversion

Some fields in the data source are in an inappropriate format; for example, a credit card field should be in numeric format, but the data is stored as characters. Some data type conversion may also happen during the data extraction process because the target system and the source use different database models. [15]
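
The credit-card example can be handled by validating the character data before casting it to a numeric type. This is a minimal sketch that works on SQL Server 2008; the staging.Payments table and CreditCardNumber column are assumed names.

    -- Convert character data to a numeric type, keeping invalid values as NULL for later cleansing.
    SELECT  CreditCardNumber,
            CASE WHEN CreditCardNumber NOT LIKE '%[^0-9]%'       -- digits only
                      AND LEN(CreditCardNumber) BETWEEN 1 AND 18 -- fits in BIGINT
                 THEN CAST(CreditCardNumber AS BIGINT)
                 ELSE NULL
            END AS CreditCardNumeric
    FROM staging.Payments;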

Create Derived Attributes

From the original data source, we can create derived attributes from existing fields by applying transformation rules. For example, a date field can be used to derive the week of the year, the month of the year, the quarter of the year and so on.

De-normalization

Some data warehouse models may require the data model to undergo a change to meet their needs. The data therefore needs to be de-normalized or normalized in some situations. In most cases, de-normalization of a dimensional model is more common than normalization. For example, a table may have separate fields such as city, zone, state and so on; de-normalization can be used to join these fields into one field called address. [15]
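
The address example can be sketched as a simple concatenation of the separate fields into one de-normalized column; the table and field names are illustrative, and NULL handling is omitted for brevity.

    -- Combine separate location fields into a single Address attribute (assumed columns).
    -- If any field can be NULL, wrap it in ISNULL(..., '') to avoid the whole result becoming NULL.
    SELECT  CustomerID,
            City + ', ' + Zone + ', ' + State AS Address
    FROM dbo.CustomerLocation;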

Normalization

Normalization is applied less often than de-normalization. It may still be required in some situations, because tables that are heavily de-normalized can lead to problems such as slow response times when queries involve many joins. [15]

Data Relevance

The data loaded into the data warehouse should be relevant, so any entity or element that is not required for analysis should be eliminated. During the data extraction process some irrelevant data is filtered out, but some detailed data still remains in the data source. This transformation therefore involves the removal of unnecessary fields, the removal of entire entities, and the truncation or elimination of codes. [15]

2.3 Loading

Loading is the last stage of the ETL process; it is the phase that loads the extracted data, now cleansed and filtered, into the end target, which is usually a data warehouse. The loading process differs based on the needs of the organization: some overwrite the data cumulatively, some update the extracted data daily, weekly or even monthly, and some data warehouses load new data in historicized form, e.g. hourly. [4]

In order to achieve effective and faster loading, several methods can be used.

Turn Off Logging

During the loading period, logging should be turned off, because when logging is on the system writes a transaction log entry for every write action against the database. Since loading data into a data warehouse involves a huge amount of data writing, turning logging off avoids this overhead. [16]
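
On SQL Server, logging cannot be switched off entirely, but the overhead can be reduced by putting the database into a minimally logged recovery model for the duration of the load. A hedged sketch follows, with a hypothetical database name.

    -- Reduce transaction-log overhead during the bulk load (assumed database name).
    ALTER DATABASE SalesDW SET RECOVERY BULK_LOGGED;

    -- ... perform the bulk load here ...

    -- Restore full logging once the load is complete.
    ALTER DATABASE SalesDW SET RECOVERY FULL;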

Drop and recreate indexes

Similar to the logging problem, indexing in the database also causes overhead. To solve this, the indexes should be dropped before the table load starts and recreated after the load is completed. This makes indexing a single batch operation rather than an individual activity performed every time a new record is added to the system. This method only reduces the overhead if the amount of data loaded into the table is fairly large; if the amount of data is small, recreating the index takes more time than maintaining it individually for each new record. [16]
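
A sketch of the drop-load-recreate pattern follows, assuming a fact table with a single non-clustered index; all object names are illustrative.

    -- Drop the index before the load (assumed index and table names).
    DROP INDEX IX_FactSales_DateKey ON dbo.FactSales;

    -- Bulk load the table, e.g. with INSERT ... SELECT or BULK INSERT.
    INSERT INTO dbo.FactSales (DateKey, CustomerKey, Amount)
    SELECT DateKey, CustomerKey, Amount
    FROM staging.Sales;

    -- Recreate the index once the load is finished, so indexing happens in one batch.
    CREATE NONCLUSTERED INDEX IX_FactSales_DateKey
        ON dbo.FactSales (DateKey);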

Pre-sort the file

The file can be pre-sorted on the primary key index for the data warehouse load; this speeds up the indexing process greatly. Among all the methods, this is the most recommended technique. [16]

Use Append

Instead of using a full refresh, it is better to append to the updated table, because adding new records is faster than rebuilding the whole table. However, if a large number of new records needs to be added, it is better to drop and recreate. [16]

Parallelize table loads

Bulk loading in parallel while the indexes are dropped will improve the load time, and most data warehouse platforms have the capability to perform this. [16]

Manage Partitions

Partitions should be managed when loading data, because partitioning can be used to divide a table into many smaller parts for administration purposes and also improves query performance on a large fact table. Partitioning a table by date, such as by year, month or quarter, is the most recommended approach. [17]
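
Date-based partitioning in SQL Server is set up with a partition function and a partition scheme. This is a minimal sketch; the boundary values, filegroup choice and object names are purely illustrative.

    -- Partition by month boundaries (assumed dates).
    CREATE PARTITION FUNCTION pfSalesByMonth (DATETIME)
        AS RANGE RIGHT FOR VALUES ('2024-01-01', '2024-02-01', '2024-03-01');

    -- Map every partition to the primary filegroup for simplicity.
    CREATE PARTITION SCHEME psSalesByMonth
        AS PARTITION pfSalesByMonth ALL TO ([PRIMARY]);

    -- Create the fact table on the partition scheme, keyed by the date column.
    CREATE TABLE dbo.FactSalesPartitioned (
        SaleDate DATETIME       NOT NULL,
        Amount   DECIMAL(18,2)  NOT NULL
    ) ON psSalesByMonth (SaleDate);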

Allow incremental loads

Incremental loads should be enabled because they keep the database synchronized with the source system. [17]
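
Incremental loading is often implemented in SQL Server with a MERGE statement (available since SQL Server 2008) that inserts new rows and updates changed ones; the table names below are assumed, reusing the hypothetical customer dimension from earlier.

    -- Synchronize the warehouse table with the latest staged extract.
    MERGE dbo.DimCustomer AS target
    USING staging.Customer AS source
        ON target.CustomerCode = source.CustomerCode
    WHEN MATCHED AND target.CustomerName <> source.CustomerName THEN
        UPDATE SET target.CustomerName = source.CustomerName
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (CustomerCode, CustomerName)
        VALUES (source.CustomerCode, source.CustomerName);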

Chapter 3 SSIS

3.1 SQL Server 2008 Architecture

Figure: Component interfaces with Integration Services (SQL Server Integration Services, 2009) [18]

SQL Server 2008 provides many very useful features, including the database engine, Analysis Services for multidimensional data, Analysis Services for data mining, Integration Services, replication, Reporting Services and SQL Server Service Broker. The function of each feature is explained below:

Database Engine

The database engine is the core service for storing, securing and processing data. The database engine has controlled access and rapid transaction processing capabilities to meet the requirements of data-consuming applications, and it also provides support for sustaining high availability. [19]

Analysis Services for Multidimensional Data

This feature supports Online Analytical Processing (OLAP), which can be used to create, design and manage multidimensional structures that contain data from other data sources. [19]

Analysis Services for Data Mining

This feature is for creating, designing and visualizing data mining models. These models can be built from other data sources by using industry-standard data mining algorithms. [19]

Integration Services

Integration Services is the platform for data integration solutions. The main process in Integration Services is extract, transform and load, which is mainly used for data warehousing. [19]

Replication

Replication is the technology for copying and distributing data or database objects from one database to another and synchronizing between the databases to maintain consistency. [19]

Reporting Services

Reporting Services is web-enabled, enterprise-level reporting functionality that can generate reports from various data sources, and the reports can be produced in various formats. [19]

Service Broker

Service Broker is used to build secure and scalable database applications. It provides a message-based communication platform that allows independent application components to work together as a functioning whole. [19]