1 / 22

Predicting Innovative Firms using Web Mining and Deep Learning

Predicting Innovative Firms using Web Mining and Deep Learning. Jan Kinne & David Lenz. May 2019. IGL 2019. Motivation: Shortcomings of Traditional Innovation Indicators. Timeliness Coverage Data collection costs Granularity

hawa
Download Presentation

Predicting Innovative Firms using Web Mining and Deep Learning

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Predicting Innovative Firms using Web Mining and Deep Learning Jan Kinne & David Lenz May 2019 IGL 2019

  2. Motivation:Shortcomings of Traditional Innovation Indicators • Timeliness • Coverage • Data collection costs • Granularity STI policy making requires an accurate and timely picture of the current state of the STI system in order to plan and evaluate policy measures in an evidence-based manner.

  3. Motivation:Web Mining of Firm Websites • Firms want us (as customers) to know how innovative they are. • Almost all (significant) firms have websites nowadays. • Important medium to tender and promote services and products. • Firms have an incentive to keep their websites up-to-date. • Firm website content is often related to product, service, and organisational innovations. Using this openly accessible data to create timely and granular firm-level innovation indicators.

  4. Web Mining of Firm Websites:Analysis Framework Surveys Download of Web Address Web Scraper Established Innovation Indicator Databases Website Content Patents LBIO Firm Database Metadata Data Mining (e.g. Sector, Firm Size) … Match at Extraction of Innovation Indicators Firm-level Innovation-related Information

  5. Deep Learning of Website Text:Framework Scraped firm website text + trainingdata Database with traditionalinnovation indicators ! output tf-idf web scraper input ? Vector representationof website text Firm with unknown innovation activity (product innovator = ?) Firm with predicted innovation activity (product innovator = 0.83) Scraped firm website text Neural network trained on labelled (innovative/non-innovative) website texts

  6. ARGUS Web Scraper Scraped firm website text + trainingdata Database with traditionalinnovation indicators ! output tf-idf web scraper input ? Vector representationof website text Firm with unknown innovation activity (product innovator = ?) Firm with predicted innovation activity (product innovator = 0.83) Scraped firm website text Neural network trained on labelled (innovative/non-innovative) website texts

  7. ARGUS Web Scraper:Overview • Free and easy-to-use web scraping tool with GUI [github.com/datawizard1337/ARGUS] • ~1,000 webpages per minuteper CPU core • ~5 days for 50m webpages on an office-grade PC • Scraping of hyperlinks and texts

  8. Text Vectorization Scraped firm website text + trainingdata Database with traditionalinnovation indicators ! output tf-idf web scraper input ? Vector representationof website text Firm with unknown innovation activity (product innovator = ?) Firm with predicted innovation activity (product innovator = 0.83) Scraped firm website text Neural network trained on labelled (innovative/non-innovative) website texts

  9. Text Vectorization:Workflow 650,000 fixedsize (6,150) vectors = 650,000 websitetextsoffirmswithunknowninnovationstatus • Web scraping of 650,000 firm websites with max. 25 webpages each • 50 GB of raw text data • Deep learning based filtering of “bloat” webpages (e.g. imprints) • Aggregating webpages to the website level • Keeping max. 5,000 words per website • TF-IDF vectorization with popularity-based filtering (dictionary of ~6,150 words)

  10. Training Data Scraped firm website text + trainingdata Database with traditionalinnovation indicators ! output tf-idf web scraper input ? Vector representationof website text Firm with unknown innovation activity (product innovator = ?) Firm with predicted innovation activity (product innovator = 0.83) Scraped firm website text Neural network trained on labelled (innovative/non-innovative) website texts

  11. Training data:MIP Survey Data 4,000 labels: productinnovatoryes | no 4,000 fixedsize(6,150) vectors • 2017 Mannheim Innovation Panel (MIP) traditional questionnaire-based survey of ~4,000 firms in Germany • Binary target variable: Firm was product innovator (YES|NO) • Web scraping of website texts  TF-IDF vectorization = 4,000 websitetextsoffirmswithknowninnovationstatus

  12. Neural NetworkTraining Scraped firm website text + trainingdata Database with traditionalinnovation indicators ! output tf-idf web scraper input ? Vector representationof website text Firm with unknown innovation activity (product innovator = ?) Firm with predicted innovation activity (product innovator = 0.83) Scraped firm website text Neural network trained on labelled (innovative/non-innovative) website texts

  13. Neural Network Training:UndercompleteAutoencoder-like Architecture 4 HIDDEN LAYERS 250| 5 | 5 | 125 SIGMOID (LOGISTIC) OUTPUT LAYER 0.0 to 1.0 INPUT LAYER 6,100 neurons 80% of survey data used for training = 3,200 firm websites as TF-IDF vectors “BOTTLENECK” survey labels Predicted probabilities Training (optimization) of 1.5m parameters using backpropagation

  14. Neural Network Training:Prediction Performance 81% of firms predicted “non-innovative” are actually “non-innovative” Composite measure of precision and recall 91% of actually “non-innovative” firms are recovered 20% of survey data as test set

  15. Neural NetworkPrediction Scraped firm website text + trainingdata Database with traditionalinnovation indicators ! output tf-idf web scraper input ? Vector representationof website text Firm with unknown innovation activity (product innovator = ?) Firm with predicted innovation activity (product innovator = 0.83) Scraped firm website text Neural network trained on labelled (innovative/non-innovative) website texts

  16. Neural Network Predictions:Out-of-sample innovation prediction 650.000 “product innovator” probabilities 650.000 firm website texts as TF-IDF vectors

  17. Neural Network Predictions:A Continuous Innovation Indicator 650.000 “product innovator” probabilities Classification threshold of 0.5: 10.31% innovative firms

  18. Neural Network Predictions:Innovativeness by Firm Size

  19. Neural Network Predictions:Innovative Districts • Higher shares of product innovators in urban areas • Correlation with population density: 0.61 • Lower shares of product innovators in East and North

  20. Neural Network Predictions:Microgeographic Pattern

  21. Future Directions:Supporting big data based Policy Making • Industry-specific prediction models • Developing further web-based innovation indicators • Formalization and dissemination of a coherent methodology • Building up a panel database of web data • Application in policy evaluation projects • Investigating microgeographic diffusion of innovation and technology

  22. Thanks! github.com/datawizard1337 twitter.com/jan_kinne zew.de/en/team/jki

More Related