1 / 27

Scraping the Glass: NBA Rebounding A nalysis

Scraping the Glass: NBA Rebounding A nalysis. Web Scraping and Predictive Analytics with Alteryx. Greg Murray Hector Amaya. Use Alteryx to scrape BasketballReference.com Return a list of all basketball players in catalog Return season statistics for all players in 2000 or greater

dagan
Download Presentation

Scraping the Glass: NBA Rebounding A nalysis

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Scraping the Glass: NBA Rebounding Analysis Web Scraping and Predictive Analytics with Alteryx Greg Murray Hector Amaya

  2. Use Alteryx to scrape BasketballReference.com • Return a list of all basketball players in catalog • Return season statistics for all players in 2000 or greater • Use the resulting data set to build a predictive model Objectives

  3. Batch Macros • Download Tool • XML parse Tool • Regex Tool and Regex Parsing • Alteryx Predictive Suite Key Concepts

  4. Step 1: Examine the webpage(s) that will be scraped Objective 1a:Collect list of Players from BR Catalog

  5. Step 2: Examine the structure of HTML for the desired section • Use your web browsers inspect function (right click  Inspect) Objective 1a:Collect list of Players from BR Catalog

  6. Basic structure for HTML tables Objective 1a:Collect list of Players from BR Catalog <Table> - Table Tag <thead> - Table Heading Tag <tr> - Table Row Tag <th>Column 1</th> - Column Headers <th>Column 2</th> <th>Column 3</th> </tr>- Table Row End Tag </thead>- Table Heading End Tag <tbody> - Table Body Tag <tr> <td> data 1</td> - Table Detail (cell) <td> data 2</td> <td> data 3</td> </tr> </tbody>-Table Body End Tag </Table> - Table End Tag

  7. Step 3: Use Alteryx to retrieve the HTML • Pass first URL to download tool using a text input tool • Use the Text columns tool parsing the HTML into rows • Add record IDs to rows Objective 1a:Collect list of Players from BR Catalog

  8. Step 4: Isolate code containing table data • Challenge: We need to identify the table dynamically • Table can be located on different lines on each page • Table can have different # of rows depending on # of players Objective 1a:Collect list of Players from BR Catalog

  9. Step 5: Isolate and parse the table headers Objective 1a:Collect list of Players from BR Catalog • Isolate table headers by reusing the section of tools we used to isolate the table • Parse the table headers using an XML Parse tool

  10. Step 6: Isolate and parse the table detail Objective 1a:Collect list of Players from BR Catalog • Isolate table detail by reusing the section of tools we used to isolate the table and table headers

  11. Step 6: Isolate and Parse Table Detail Objective 1a:Collect list of Players from BR Catalog • Parse the table detail using a combination of XML Parsing and Regex

  12. Step 7: Union headers to detail Objective 1a:Collect list of Players from BR Catalog • Use the Autoc0nfig by Position option • Ensure the header data stream in first in the Output Order

  13. Step 8: Validate the output Objective 1a:Collect list of Players from BR Catalog • Compare data in the workflow to the webpage

  14. Step 9: Convert the workflow into a batch macro Objective 1a:Collect list of Players from BR Catalog • Add a Control Parameter tool to Text input tool at the beginning of the workflow • Add a Macro Output tool the end of the workflow

  15. Step 9: Convert the workflow into a batch macro Objective 1a:Collect list of Players from BR Catalog • Save the workflow as a Batch Macro into your macro repository • If necessary add a new macro repository through the user settings

  16. Step 10: Create a new workflow using the new macro Objective 1a:Collect list of Players from BR Catalog • Pass the macro the player catalog URLs to download and parse all of the Player Catalog webpages

  17. Step 1: Examine the webpage(s) that will be scraped Objective 1b:Return Season Statistics for individual Players

  18. Step 2: Examine the structure of HTML for the desired section • Use your web browsers inspect function (right click  Inspect) Objective 1b:Return Season Statistics for Individual Players

  19. Step 3: Use Alteryx to retrieve the HTML • Pass first URL to download tool using a text input tool • Use the Text columns tool parsing the HTML into rows • Add record IDs to rows Objective 1b:Return Season Statistics for Individual Players

  20. Step 4: Isolate code containing table data • Challenge: We need to identify the table dynamically • Table can be located on different lines on each page • Table can have different # of rows depending on seasons played • There are multiple tables on a player’s page Objective 1b:Return Season Statistics for Individual Players

  21. Step 5: Isolate and parse the table headers Objective 1b:Return Season Statistics for Individual Players • Isolate table headers by reusing the section of tools we used to isolate the table • Parse the table headers using an XML Parse tool

  22. Step 6: Isolate and the parse table detail Objective 1b:Return Season Statistics for Individual Players • Isolate table detail by reusing the section of tools we used to isolate the table and table headers • Parse the table detail using a combination of XML Parsing and Regex

  23. Step 7: Union headers to detail/Append URL suffix Objective 1b:Return Season Statistics for Individual Players • Use the Autoc0nfig by Position option • Ensure the header data stream in first in the Output Order

  24. Step 8: Convert the workflow into a batch macro Objective 1b:Return Season Statistics for Individual Players • Add a Control Parameter tool to Text input tool at the beginning of the workflow • Add a Macro Output tool the end of the workflow

  25. Step 8: Convert the workflow into a batch macro Objective 1b:Return Season Statistics for Individual Players • Open the Interface Designer (CTRL +ALT + D) • Change Batch Macro Output mode to : ‘Auto Configure by Name’ • Save the workflow as a Batch Macro into your macro repository

  26. Step 9: Use the batch macro to parse the player pages Objective 1b:Return Season Statistics for Individual Players • Open a New workflow • Use the Player Catalog list as your input data • Concatenate ‘https://www.basketball-reference.com’ with the Player URL suffix • Pass the Concatenated URL into the batch macro • Limit the number of records per run using sample or filter tool

  27. Resources • Regex : www.regexr.com • Batch Macros: https://community.alteryx.com/t5/Live-Training/Live-Training-Build-Your-First-Batch-Macro/m-p/52900 • Alteryx Tool Mastery: • Download • Regex

More Related