1 / 25

Agenda 02/21/2013

Agenda 02/21/2013. Discuss exercise Answer questions in task #1 Put up your sample databases for tasks #2 and #3 Define ETL in more depth by the activities performed. Discuss the “controversy” in ETL activities. Discussed in prior classes. Lots of data.

sorcha
Download Presentation

Agenda 02/21/2013

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Agenda 02/21/2013 • Discuss exercise • Answer questions in task #1 • Put up your sample databases for tasks #2 and #3 • Define ETL in more depth by the activities performed. • Discuss the “controversy” in ETL activities

  2. Discussed in prior classes... • Lots of data. • Traditional transaction processing systems • Non-traditional transaction processing • Call center; Click-stream; Loyalty card; Warranty cards/product registration information • External data from government and commercial entities. • Lots of poor quality data for lots of reasons that can be traced back to lots of people.

  3. Master Data Management: What does it mean and why is it so difficult to manage master data?

  4. Populating the data warehouse • Extract • Take data from source systems. • May require middleware to gather all necessary data. • Transformation • Put data into consistent format and content. • Validate data – check for accuracy, consistency using pre-defined and agreed-upon business rules. • Convert data as necessary. • Load • Use a batch (bulk) update operation that keeps track of what is loaded, where, when and how. • Keep a detailed load log to audit updates to the data warehouse.

  5. Data Cleansing • Source systems contain “dirty data” that must be cleansed • ETL software contains rudimentary to very sophisticated data cleansing capabilities • Industry-specific data cleansing software is often used. Important for performing name and address correction • Leading data cleansing vendors include general hardware/software vendors such as IBM, Oracle, SAP, Microsoft and specialty vendors Information Builders (DataMigrator), Harte-Hanks (Trillium), CloverETL, Talend, and BusinessObjects (Centric)

  6. Steps in data cleansing • Parsing • Correcting • Standardizing • Matching • Consolidating

  7. Parsing • Parsing locates and identifies individual data elements in the source files and then isolates these data elements in the target files. • Examples include parsing the first, middle, and last name; street number and street name; and city and state.

  8. Parsing

  9. Correcting • Corrects parsed individual data components using sophisticated data algorithms and secondary data sources.

  10. Correcting

  11. Standardizing • Standardizing applies conversion routines to transform data into its preferred (and consistent) format using both standard and custom business rules.

  12. Standardizing

  13. Matching • Searching and matching records within and across the parsed, corrected and standardized data based on predefined business rules to eliminate duplications.

  14. Matching

  15. Consolidating • Analyzing and identifying relationships between matched records and consolidating/merging them into ONE representation.

  16. Consolidating

  17. Source system view – 3 clients Account# 1238891 Policy No.ME309451-2 Transaction B498/97

  18. The reality – ONE client Account# 1238891 Policy No.ME309451-2 Transaction B498/97

  19. William Parker Beth Lewis Karen Parker-Lewis William Parker-Lewis, Jr. Consolidating whole groups

  20. ETL Products • SQL Server 2012 Integration Services from Microsoft • Power Mart/Power Center from Informatica • Warehouse Builder from Oracle • Teradata Warehouse Builder from Teradata • DataMigrator from Information Builders • SAS System from SAS Institute • Connectivity Solutions from OpenText • Ab Initio

  21. What about unstructured data? • What is unstructured data? • What percentage of data in organizations is considered to be “unstructured”? • Examples • Why store it in a data warehouse? • Does it do any good in large text fields? • Special ETL for unstructured data

  22. Unstructured Data Example • Notes about post-service of a product: • The hub bent when the bicycle hit a large pothole. • The plane takes off sluggishly during high-altitude departures. • The product won’t allow entry of a 1098-T when the person is declared as a dependent. • “Text analytics” are used to transform the data.

  23. Text analytics • Parses text and extracts facts (complaints, problems, issues) about key entities (customers, products, locations). • Uses natural language processes (NLP). • NLP converts human language into more formal representations that are easier for a computer program to manipulate. • Combination of computational linguistics and artificial intelligence.

  24. Goal of ETL • Structured and unstructured data stored in a relational database. • Data is complete, accurate, consistent, and in conformance with the business rules of the organization.

  25. Controversy in ETL • Is it necessary? • Has the advent of big data changed our need for ETL? • ETL vs. ELT • Does the use of Hadoop eliminate the need for ETL software???

More Related