1 / 20

ISQS 6339, Business Intelligence Supplemental Notes on the Term Project

ISQS 6339, Business Intelligence Supplemental Notes on the Term Project. Zhangxi Lin Texas Tech University. Term project. 3-6 students form a team to fulfill a data mart development project. Stage 1 (10%): SQL Server Project proposal. March 4

theta
Download Presentation

ISQS 6339, Business Intelligence Supplemental Notes on the Term Project

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. ISQS 6339, Business IntelligenceSupplemental Notes on the Term Project Zhangxi Lin Texas Tech University

  2. Term project • 3-6 students form a team to fulfill a data mart development project. • Stage 1 (10%): SQL Server Project proposal. March 4 • Stage 2 (25%): Data mart Implementation. March 27 • Stage 3 (10%): Hadoop Project proposal. Due April 15 • Stage 4 (25%): Hadoop Project completed. Due April 29 • Stage 5 (10%): Presentation. May 1 & 6 • Stage 6 (30%): Final report. Due May 13 • Detailed instructions: http://zlin.ba.ttu.edu/6339/Projects14.html

  3. Merits of the project outcomes • Carefully developed project proposal demonstrating the understanding of the business requirements, attractive analytics themes, and clearly defined project goal and objectives • Comprehensive data mart design, such as multiple fact tables, with supporting analytic themes • Applications of advanced ETL model or techniques, such as slowly changing dimensions, the use of containers, etc. • Advanced OLAP cube design, and/or optional MDX scripting by self-taught • Rich data analysis outcomes • Well-presented final report • Demonstrating the creative ideas and skillful data warehousing ability

  4. Project Datasets ISQS 6339 2014-01

  5. Qiyi (奇易网) Data of Online Purchases • Website: http://www.6695.com • Data format: MySQL • Data size: about 3GB • Data structure: 11 tables • This data is to be installed to MySQL. It can be applied for two purposes • Data warehousing with Hadoop/HBase, MySQL, or SQL Server 2008 • Data mining for credit assessment or purchase preference (ISQS 6347) • Note: some contents are in Chinese

  6. Vehicle locations provided by Beijing 1039 traffic radio Website: http://www.fm1039.com/index/index.htm Data format: MongoDB Data Size: about 3BG for traffic status in a week Data structure: (x, y) of vehicles’ locations This data is to be installed on MongoDB This data can be installed either at Hadoop or SQL Serve 2008 for traffic condition analyses

  7. Lending Club Data Website: https://www.lendingclub.com/ This is a famous P2P lending company to go IPO in 2014. Google bought its 7% share in 2013 in $125 million. Data size: 5 compressed tables, total about 80 MB

  8. Propsper.com dataset Website: http://www.prosper.com Propsper.com is another famous P2P lending company with a differentiated business model from that used by Lending Club. Data size is to be determined.

  9. Helsinki cell phone user data 15 users’ activity data, 20-22 tables, about 30 GB

  10. Hadoop projects

  11. Components • Load Balancer • Oozie • Solr, SolrCloud, SolrJ, HA • NewSQL • Kafka, Storm, Impala • REST • ZK • MySQL • Nginx/HA-Proxy • Flume • Sqoop • Ganglia • Technology stack • Tomcat, Jetty  • Avro

  12. Data Warehousing Methodology - Implementing data warehouse systematically 13

  13. Dimensional Modeling Process • Preparation • Identify roles and participants • Understanding the data architecture strategy • Setting up the modeling environment • Establishing naming conventions • Data profiling and research • Data profiling and source system exploration • Interacting with source system experts • Identifying core business users • Studying existing reporting systems • Building Dimensional models • High-level dimensional model design • Identifying dimension and fact attributes • Developing the detailed dimensional model • Testing the model • Reviewing and validating the model

  14. Business Dimensional Lifecycle Business Req’ts definition Technical Arch. Design Product Selection & Installation Growth Dimensional Modeling Physical Design ETL design & Development Deployment Project Planning BI Appl. Specification BI Application Development Maintenance Project Management 15

  15. Data Profiling • Data profiling is a methodology for learning about he characteristics of the data • It is a hierarchical process that attempt to build an assessment of the metadata associated with a collection of data sets. • Three levels • Bottom – characterizing the values associated with individual attributes • Middle – the assessment looking at relationships between multiple columns within a single table. • Highest level – the profile describing relationships that exist between data attributes across different tables. • Can run a program against the sandbox source system to obtain the needed information. 16

  16. ETL Methodology • Develop a high-level map • Build a sandbox source system (optional) • Detailed data profiling • Make decisions • The source-to-target mapping • How often loading tables • The strategy for partitioning the relational and Analysis Services fact table • The strategy for extracting data from each source system • De-duplicate key data from each source system (optional) • Develop a strategy for distributing dimension tables across multiple database servers (optional) 17

  17. Sandbox Source System • Sandbox • A protected, limited environment where applications are allowed to "play" without risking damage to the rest of the system. • A term for the R&D department at many software and computer companies. The term is half-derisive, but reflects the truth that research is a form of creative play. • In the DW/BI context, sandbox source system is a subset of source database for analytic exploration tasks • How to create • Set up a static snapshot of the database • By sampling 18

  18. Decision Issues in ETL System Design Source-to-target mapping Load frequency How much history is needed 19

  19. Strategies for Extracting Data • Extracting data from packaged source systems –self-contained data sources • May not be good to use their APIs • May not be good to use their add-on analytic system • Extracting directly from the source databases • Strategies vary depending on the nature of the source database • Extracting data from incremental loads • How the source database records the changes of the rows • Extracting historical data 20

More Related