
Web Crawling: A tool for Analysing the Labour Market with Online Job Vacancies





Presentation Transcript


  1. Web Crawling: A tool for Analysing the Labour Market with Online Job Vacancies • Why should we analyse online job vacancies? • Because the Internet is increasingly serving as a new source of data: it has large potential • Because data collection from the Internet is cost-effective • Because it allows us to cover important gaps in data and knowledge, such as wages and understanding employers' demand • Because it allows matching demand with supply

  2. Web Crawling: A tool for Analysing the Labour Market with Online Job Vacancies • Why should we analyse online job vacancies? A wide range of policy applicability: • Demand-led approach to labour market policy: what types of skills should be 'given' to persons in a disadvantaged situation in the labour market • Education policy - second-chance education, curricula formation: if conducted over time, it could help track which skills are on the rise or in decline • Social policy: labour market discrimination

  3. Web Crawling: a tool for Analysing the Labour Market using Online Job Vacancies II. An example of what you can obtain with online job vacancies:

  4. Web Crawling: a tool for Analysing the Labour Market using Online Job Vacancies What's a Web Crawler? A Web Crawler [or Web Spider or Scraper] is a computer program that automatically gathers, analyses and files information from the internet at many times the speed of a human. (based on Wikipedia) • Pulls information from a website

  5. Web Crawling: A Tool for Analysing the Labour Market with Online Job Vacancies III. What's a Web Crawler? A Web Crawler [or Web Spider or Scraper] is a computer program that automatically gathers, analyses and files information from the internet at many times the speed of a human. (based on Wikipedia) • Pulls information from a website

  6. Web Crawling: A Tool for Analysing the Labour Market with Online Job Vacancies III. What's a Web Crawler: Computer program = a set of instructions for a computer, communicated in a certain programming language (e.g. C++, Pascal or Perl) using a software environment (operating system, compiler, interface) How to write a computer program? Use 'R' = a language and software environment

  7. Web Crawling: A Tool for Analysing the Labour Market with Online Job Vacancies IV. Why we should use a Web Crawler • It's free. The only cost is the time of a computer specialist • It allows for greater specificity: it can go to a level of detail normally not viable with paper surveys • It's in real time: you get the latest data - it's labour market analysis for NOW • Even if a country is not in a position to use it now, sooner or later it will be able to

  8. Web Crawling: A Tool for Analysing the Labour Market with Online Job Vacancies V. How it works Generally, the web crawling program instructs the computer to: • Visit a list of URLs ("Uniform Resource Locator", e.g. www.google.com) - the seeds • Identify further hyperlinks ("references to further data", e.g. www.google.com/answers) - the crawl frontier You can then instruct the computer to: • Gather information from the hyperlinks, such as job descriptions, and save it in e.g. an ACCESS file.
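The seed/crawl-frontier idea above can be sketched in a few lines of R. This is a minimal, self-contained illustration: the "pages" here are made-up strings of HTML rather than live websites (a real crawler would fetch each seed with readLines(url), as shown on later slides), and the regular expression is a simplification compared to a proper HTML parser.

```r
# Made-up "pages": in a real crawler each seed URL would be fetched with readLines()
pages <- list(
  seed1 = c('<a href="http://example.com/jobs/1">ad 1</a>',
            '<a href="http://example.com/jobs/2">ad 2</a>')
)
frontier <- character(0)                        # hyperlinks discovered so far
for (page in pages) {
  hits <- regmatches(page, gregexpr('href="[^"]+"', page))  # find href="..." parts
  links <- gsub('href="|"', '', unlist(hits))  # strip the href="..." wrapper
  frontier <- unique(c(frontier, links))       # grow the crawl frontier
}
frontier
```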

  9. Web Crawling: A Tool for Analysing the Labour Market with Online Job Vacancies • You can download free web crawlers from the web and customize them; e.g. http://en.wikipedia.org/wiki/Category:Free_web_crawlers • Buy a customized web crawler; e.g. http://ficstar.com/web-data-mining-web-scraping/?gclid=CKeYtKGi87oCFWmWtAodEB4AIQ • Or write it yourself  - which I will tell you about in the next 8 minutes

  10. Web Crawling: A Tool for Analysing the Labour Market with Online Job Vacancies VI. How to write your own simple web crawler: • Download & install the open source ('free') software environment 'R': http://www.r-project.org/

  11. Web Crawling: A tool for Analysing the Labour Market with Online Job Vacancies • Open the EURES website http://ec.europa.eu/eures/main.jsp?acro=job&lang=en&catId=482&parentCategory=482 • Select the "Hotel, Catering and Personal Services staff" occupation in the field "Select an occupation from the drop-down menus". Do not choose subcategories. • Select "Austria" in the field "Single-country selection". • Press "Search".

  12. Web Crawling: A tool for Analysing the Labour Market with Online Job Vacancies • Select 10 in the “Results per page”. • Press “refine”. • Click on the link to the job ads: for instance “Austria: 8336 job(s) matched (9858 post(s))”. • Click on “Next page”. • Right click on “Previous page”. Select “open in a new tab”

  13. Web Crawling: A tool for Analysing the Labour Market with Online Job Vacancies • Copy the URL from the new tab. For instance: http://ec.europa.eu/eures/eures-searchengine/servlet/BrowseCountryJVsServlet?lg=EN&isco=51&country=AT&multipleCountries=AT-%25&multipleRegions=%25&date=01%2F01%2F1975&title=&durex=&exp=&qual=&pageSize=10&totalCount=8336&startIndexes=0-1o1-1o2-1I0-1o1-11o2-1I0-1o1-21o2-1I&page=1 • Now you have the correct URL to paste into the R program.

  14. Web Crawling: A tool for Analysing the Labour Market with Online Job Vacancies For the EURES website this pattern needs to be repeated • twice: once for the page URLs (for instance 1734 for Belgium) and once for the job ads (for instance 52 000 for Belgium) • and in the form of loops: for each page {i in 1:N} and each job ad {j in 1:N} In the following I show one step of the loop for finding characteristics of job ads. The commands cannot be copy-pasted but need to be adjusted.
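The two nested loops described on this slide can be sketched as follows. To keep the example self-contained, the URL pattern and the counts are placeholders, not the real EURES values; in practice each constructed URL would be fetched with readLines() and parsed as shown on the following slides.

```r
n_pages <- 2                                  # placeholder: pages of search results
ads_per_page <- 3                             # placeholder: job ads per result page
ad_urls <- character(0)
for (i in 1:n_pages) {                        # outer loop: over result pages
  page_url <- paste0("http://example.com/results?page=", i)
  for (j in 1:ads_per_page) {                 # inner loop: over job ads on page i
    ad_urls <- c(ad_urls, paste0(page_url, "&ad=", j))
  }
}
length(ad_urls)                               # one URL per job ad
```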

  15. Web Crawling: A tool for Analysing the Labour Market with Online Job Vacancies • Open R. You have the 'Console' to show your commands and output and a 'Script' to write your program. Open a new script. • Write your own little programme. It consists of 4 main steps: a. Find and read a webpage in HTML code b. Find a pattern in the HTML code: e.g. "Job Title". c. Extract information from the identified HTML tag: e.g. "Car mechanic". d. Save the data in an ACCESS or EXCEL file. HTML = "HyperText Markup Language", the language in which web pages are written

  16. Web Crawling: A tool for Analysing the Labour Market with Online Job Vacancies a. Find and read a webpage in HTML: first_url <- "PASTE YOUR URL IN HERE" first_page <- readLines(first_url) The command 'readLines' reads text from a specified connection, where a connection can be a URL that is opened by the computer. (Note: pass the variable first_url without quotes; readLines("first_url") would look for a file literally named first_url.) EASY.
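A tiny self-contained illustration of step a: readLines works on any connection, so here a text connection stands in for a web page (the HTML line is made up), avoiding the need for a live URL.

```r
# A text connection standing in for a fetched web page
page <- readLines(textConnection("<html><body>Job Title: Car mechanic</body></html>"))
page    # a character vector, one element per line of the page
```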

  17. Web Crawling: A tool for Analysing the Labour Market with Online Job Vacancies b. Find a pattern in the HTML text: pattern <- '<th colspan="1"> Job Title: </th>' location <- grep(pattern, first_page) The command 'grep' matches strings of text using regular expressions: it searches for "Job Title" on the page of interest and returns the line numbers in the HTML text. Regular expression = a concise and flexible pattern language that is understood by a regular expression processor; e.g. [^abc] matches any character in a text except a, b or c; [^0-9] matches only non-digits
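A small, self-contained illustration of step b; the HTML lines are made up for the example, but the calls show exactly what grep returns:

```r
lines <- c("<th>Job Title:</th>", "<td>Car mechanic</td>", "<th>Location:</th>")
grep("Job Title", lines)         # line numbers where the pattern occurs: 1
grep("[^0-9]", c("123", "12a"))  # elements containing a non-digit character: 2
```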

  18. Web Crawling: A tool for Analysing the Labour Market with Online Job Vacancies c. Extract information from HTML tags: job_title <- gsub('\t\t\t\t<td colspan="3"><span>|</span></td>', '', job_ad) The command 'gsub' replaces every match of a regular expression in a text with a replacement string - here the empty string - where the text is job_ad; the alternation (|) strips the HTML code before and after the job title.
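Step c in a self-contained form (job_ad is a made-up line of HTML in the style of the slide): gsub deletes the tag text on either side of the value, leaving only the job title.

```r
job_ad <- '<td colspan="3"><span>Car mechanic</span></td>'
# The alternation (|) matches the HTML before OR after the value; both are removed
gsub('<td colspan="3"><span>|</span></td>', '', job_ad)   # "Car mechanic"
```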

  19. Web Crawling: A tool for Analysing the Labour Market with Online Job Vacancies d. Save the data: output_new <- as.data.frame(job_title) eures <- "C:/Documents and Settings/thum/My Documents/eures16.csv" write.csv(output_new, eures) The command 'write.csv' saves the output 'output_new' to the location 'eures'. EASY.
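A runnable sketch of step d with placeholder job titles and a temporary file instead of the Windows path on the slide:

```r
job_title <- c("Car mechanic", "Waiter")       # placeholder extracted titles
output_new <- as.data.frame(job_title)
out_file <- file.path(tempdir(), "eures_sample.csv")  # placeholder location
write.csv(output_new, out_file, row.names = FALSE)
```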

  20. Web Crawling: A tool for Analysing the Labour Market with Online Job Vacancies VII. What do we obtain with it (1): See officeUK.csv (open with ACCESS)

  21. Web Crawling for Job Ads VII. What do we obtain with it (2): See next slide

  22. Web Crawling: A tool for Analysing the Labour Market with Online Job Vacancies VII. What do we obtain with it (2): See Slide 3.

  23. Web Crawling for Job Ads A last word on the 'politeness policy' Web crawlers can degrade the functioning of websites because they search at greater depth and speed than a human. They can • Crash servers • Cause server overload • Disrupt networks So always ask the webmaster for permission.

  24. Web Crawling: A tool for Analysing the Labour Market with Online Job Vacancies VIII. Limitations: • Availability of data on the internet. But this is changing • Learning curve: it takes time for a person to learn how to use it.

  25. Web Crawling: A tool for Analysing the Labour Market with Online Job Vacancies If all this sounded very confusing: http://statistics.berkeley.edu/computing/r-reading-webpages More references: • Spector, P. (2011): Reading Data from Web Pages with R, University of California, Berkeley, Class Notes s133 • Spector, P. (2011): Stat 133 Class Notes - Spring 2011, pp. 70-88 • Jockers, M. (2013): Text Analysis with R, under review with Springer • R help, ETH Zurich, https://stat.ethz.ch/mailman/listinfo/r-help

  26. Web Crawling: A tool for Analysing the Labour Market with Online Job Vacancies Prepared by Yamina Guidoum from the work of: Dr. Anna-Elisabeth Thum (anna.thum@ceps.eu), Associate Research Fellow at CEPS and Economic Analyst at DG ECFIN, European Commission ("The views in this presentation do not reflect the views of the European Commission") and of: Dr. Lucia Mytna Kureková, Slovak Governance Institute & Central European University. From the INGRID Winter School "Skills and Occupations in Europe", November 2013, CEPS, Brussels
