Web Crawling: A tool for Analysing the Labour Market with Online Job Vacancies

  • Why should we analyse online job vacancies?

  • Because the Internet is increasingly serving as a new source of data with large potential

  • Because data collection from the Internet is cost-effective

  • Because it helps fill important gaps in data and knowledge, such as wages and employers’ demand

  • Because it allows matching demand with supply


  • Why should we analyse online job vacancies?

    A wide range of policy applications:

  • Demand-led approach to labour market policy: which skills should be provided to people in a disadvantaged position in the labour market.

  • Education policy (second-chance education, curriculum design): if conducted over time, the analysis could help track which skills are on the rise or in decline.

  • Social policy: labour market discrimination


II. An example of what you can obtain with online job vacancies:


III. What’s a Web Crawler?

A Web Crawler [or Web Spider or Scraper] is a computer program that automatically gathers, analyses and files information from the internet at many times the speed of a human. (based on Wikipedia)

  • Pulls information from a website


III. What’s a Web Crawler:

Computer program = a set of instructions for a computer, written in a programming language (e.g. C++, Pascal or Perl) and run in a software environment (operating system, compiler, interface)

How to write a computer program?

Use ‘R’ = a language and software environment


IV. Why we should use a Web Crawler

  • It’s free. The only cost is the programmer’s time.

  • Allows for greater specificity: it can reach a level of detail that is normally not viable with paper surveys

  • It’s in real time: you get the latest data, so it is labour market analysis for NOW.

  • Even if a country is not in a position to use it now, sooner or later it will be.


V. How it works

Generally, the web crawling program instructs the computer to:

  • Visit a list of URLs (“Uniform Resource Locators”, e.g. www.google.com): the seeds

  • Identify further hyperlinks (references to further data, e.g. www.google.com/answers): the crawl frontier

    You can then instruct the computer to:

  • Gather information from the hyperlinks, such as job descriptions, and save it in, e.g., an Access file.
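
The seed and crawl-frontier idea can be sketched in a few lines of R. This is only an illustration under stated assumptions: the `pages` list is a made-up, in-memory stand-in for real websites (so the example runs without an internet connection), and `extract_links` and `crawl` are hypothetical helper names, not part of R or of the original slides.

```r
# Hypothetical in-memory "web": URL -> HTML text (stands in for real pages).
pages <- list(
  "http://example.org/jobs"   = '<a href="http://example.org/jobs/1">Cook</a> <a href="http://example.org/jobs/2">Waiter</a>',
  "http://example.org/jobs/1" = '<h1>Cook</h1> <a href="http://example.org/jobs">Back</a>',
  "http://example.org/jobs/2" = '<h1>Waiter</h1>'
)

# Pull every href="..." target out of one string of HTML.
extract_links <- function(html) {
  hits <- regmatches(html, gregexpr('href="[^"]+"', html))[[1]]
  gsub('^href="|"$', "", hits)
}

# Visit the seeds, then keep visiting newly discovered links (the frontier).
crawl <- function(seeds, fetch, max_pages = 100) {
  frontier <- seeds          # URLs still to visit
  visited  <- character(0)   # URLs already visited
  while (length(frontier) > 0 && length(visited) < max_pages) {
    url <- frontier[1]
    frontier <- frontier[-1]
    if (url %in% visited) next
    visited <- c(visited, url)
    html <- fetch(url)
    if (is.null(html)) next  # unknown page: nothing to extract
    new_links <- setdiff(extract_links(html), c(visited, frontier))
    frontier <- c(frontier, new_links)
  }
  visited
}

visited <- crawl("http://example.org/jobs", function(u) pages[[u]])
print(visited)
```

On a real site, the `fetch` argument could be something like `function(u) paste(readLines(u), collapse = "\n")`, and the information-gathering step would go where the HTML is available inside the loop.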


  • You can download a free web crawler from the web and customize it, e.g.

    http://en.wikipedia.org/wiki/Category:Free_web_crawlers

  • Buy a customized web crawler, e.g.

    http://ficstar.com/web-data-mining-web-scraping/?gclid=CKeYtKGi87oCFWmWtAodEB4AIQ

  • Or write it yourself, which I will tell you about in the next 8 minutes


VI. How to write your own simple web crawler:

  • Download and install the open-source (free) software environment ‘R’: http://www.r-project.org/


  • Open the EURES website

    http://ec.europa.eu/eures/main.jsp?acro=job&lang=en&catId=482&parentCategory=482

  • Select the “Hotel, Catering and Personal Services staff” occupation in the field “Select an occupation from the drop-down menus:”. Do not choose subcategories.

  • Select “Austria” in the field “Single-country selection”.

  • Press “search”.


  • Select 10 in the “Results per page”.

  • Press “refine”.

  • Click on the link to the job ads: for instance “Austria: 8336 job(s) matched (9858 post(s))”.

  • Click on “Next page”.

  • Right click on “Previous page”. Select “open in a new tab”


  • Copy the URL from the new tab. For instance:

    http://ec.europa.eu/eures/eures-searchengine/servlet/BrowseCountryJVsServlet?lg=EN&isco=51&country=AT&multipleCountries=AT-%25&multipleRegions=%25&date=01%2F01%2F1975&title=&durex=&exp=&qual=&pageSize=10&totalCount=8336&startIndexes=0-1o1-1o2-1I0-1o1-11o2-1I0-1o1-21o2-1I&page=1

  • Now you have the correct URL to paste into the R program.


For the EURES website this pattern needs to be repeated

  • twice: once for page URLs (for instance 1734 for Belgium) and once for job ads (for instance 52 000 for Belgium)

  • and in the form of loops: for each page {i in 1:N} and each job ad {j in 1:N}

    In the following I show one step of the loop for finding characteristics of job ads. The commands cannot be copy-pasted but need to be adjusted.
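
The outer loop over pages might look roughly like the sketch below. It assumes that only the trailing `page=` parameter changes from one result page to the next, which may not hold for the real EURES site; the shortened `example.org` URL is a hypothetical stand-in for the one copied from the browser. The page count follows from the figures in the example URL: 8336 results at 10 per page gives 834 pages.

```r
# Shortened, hypothetical stand-in for the URL copied from the browser.
first_url <- "http://example.org/search?country=AT&pageSize=10&totalCount=8336&page=1"

total_count <- 8336   # from the totalCount parameter
page_size   <- 10     # from the pageSize parameter
n_pages     <- ceiling(total_count / page_size)

# Build one URL per result page by replacing the page number at the end.
page_urls <- character(n_pages)
for (i in 1:n_pages) {
  page_urls[i] <- sub("page=[0-9]+$", paste0("page=", i), first_url)
}

# An inner loop over the job ads found on each page (j in 1:N)
# would go inside the loop above, after reading page_urls[i].
print(n_pages)
print(page_urls[c(1, n_pages)])
```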


  • Open R. You have the ‘Console’, which shows your commands and output, and a ‘Script’, where you write your program. Open a new script.

  • Write your own little program. It consists of 4 main steps:

    a. Find and read a webpage in HTML code

    b. Find a pattern in the HTML code: e.g. “Job Title”.

    c. Extract information from the identified HTML tag: e.g. “Car mechanic”.

    d. Save the data in an Access or Excel file.

    HTML = “HyperText Markup Language”, the language used to write a web page


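a. Find and read a webpage in HTML:

# Store the URL copied from the browser as a character string
first_url <- 'PASTE YOUR URL IN HERE'

# Pass the variable itself (no quotes), otherwise R looks for a file called "first_url"
first_page <- readLines(first_url)

The command ‘readLines’ reads text from a specified connection, where a connection can be a URL that the computer can open. It returns a character vector with one element per line of HTML.

EASY.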
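
To see what ‘readLines’ returns without needing an internet connection, the following sketch reads a small local file instead of a live URL. The HTML lines are invented for illustration and only loosely imitate a job-ad page; with a real page, `tmp` would simply be replaced by the URL string.

```r
# Made-up HTML standing in for a job-ad page.
html <- c(
  "<table>",
  "<th colspan=\"1\"> Job Title: </th>",
  "<td colspan=\"3\"><span>Car mechanic</span></td>",
  "</table>"
)

# Write it to a temporary file so readLines has something to open.
tmp <- tempfile(fileext = ".html")
writeLines(html, tmp)

# readLines returns a character vector: one element per line of the file.
first_page <- readLines(tmp)
print(length(first_page))
print(first_page[2])
```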


b. Find a pattern in the HTML text:

# The pattern must be a quoted string that matches the HTML exactly, including spaces
Pattern <- '<th colspan="1"> Job Title: </th>'

location <- grep(Pattern, first_page)

The command ‘grep’ matches strings of text using regular expressions: it searches for “Job Title” in the page of interest and returns the indices of the lines of HTML text that match.

Regular expression = a concise and flexible pattern language that is understood by a regular expression processor; e.g. [^abc] matches any single character except a, b or c; [^0-9] matches any single non-digit character
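
A small, self-contained illustration of ‘grep’ and of the two character classes mentioned above. The `first_page` vector is invented for the example, and the idea that the job title sits on the line right after the label is an assumption about the page layout, not something the real site guarantees.

```r
# Made-up HTML lines standing in for the output of readLines.
first_page <- c(
  "<table>",
  "<th colspan=\"1\"> Job Title: </th>",
  "<td colspan=\"3\"><span>Car mechanic</span></td>",
  "</table>"
)

# grep returns the index of every line that contains the pattern.
location <- grep("Job Title:", first_page, fixed = TRUE)
print(location)

# Assumption: the value belonging to the label is on the following line.
job_ad <- first_page[location + 1]
print(job_ad)

# [^0-9] matches any non-digit, so deleting all matches keeps only the digits.
digits_only <- gsub("[^0-9]", "", "Ref. no. 8336")
print(digits_only)

# [^abc] matches any character other than a, b or c.
has_other <- grepl("[^abc]", c("abc", "abd"))
print(has_other)
```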


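c. Extract information from HTML tags:

# Delete the opening tags or (|) the closing tags, leaving only the text between them
job_title <- gsub('\t\t\t\t<td colspan="3"><span>|</span></td>', '', job_ad)

The command ‘gsub’ replaces every match of a regular expression with a replacement string and returns the resulting character vector. Here the pattern matches either the tags at the ‘beginning of text’ or (|) those at the ‘end of text’ and replaces them with nothing, where the text is job_ad, the line of HTML containing the job title.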
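
The extraction step can be tried on a single invented line of HTML. The sketch shows two variants: deleting the surrounding tags with ‘gsub’, and, as an alternative not taken from the slides, capturing the text between the tags with a group and a ‘\\1’ back-reference in ‘sub’. The content of `job_ad` is made up for illustration.

```r
# Made-up line of HTML holding one job title, indented with four tabs.
job_ad <- "\t\t\t\t<td colspan=\"3\"><span>Car mechanic</span></td>"

# Variant 1: delete the opening tags or (|) the closing tags.
job_title <- gsub("\t\t\t\t<td colspan=\"3\"><span>|</span></td>", "", job_ad)
print(job_title)

# Variant 2: capture whatever sits between <span> and </span> in group 1
# and replace the whole line with that group.
job_title_alt <- sub(".*<span>(.*)</span>.*", "\\1", job_ad)
print(job_title_alt)
```

The second variant is less sensitive to the exact indentation and attributes of the surrounding tags, at the cost of a slightly harder-to-read pattern.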


d. Save the data:

output_new <- as.data.frame(job_title)

eures <- "C:/Documents and Settings/thum/My Documents/eures16.csv"

write.csv(output_new, eures)

The command ‘write.csv’ saves the contents of ‘output_new’ to the location ‘eures’.

EASY.
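
A quick way to check the saving step is to write to a temporary file and read it back. The job titles below are invented, and the temporary path replaces the personal folder used above; `row.names = FALSE` is an optional choice made here to keep R's row numbers out of the file.

```r
# Made-up job titles standing in for the extracted data.
job_title <- c("Car mechanic", "Cook", "Waiter")
output_new <- data.frame(job_title = job_title, stringsAsFactors = FALSE)

# Save to a temporary CSV file (use your own path in practice).
eures <- tempfile(fileext = ".csv")
write.csv(output_new, eures, row.names = FALSE)

# Read the file back to confirm that the data survived the round trip.
check <- read.csv(eures, stringsAsFactors = FALSE)
print(check)
```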


VII. What do we obtain with it (1):

See officeUK.csv (open with ACCESS)


VII. What do we obtain with it (2):

See next slide


VII. What do we obtain with it (2):

See Slide 3.


A last word on the ‘politeness policy’

Web crawlers can decrease the functionality of websites as they search in greater depth and at greater speed than a human. They can

  • Crash servers

  • Cause server overload

  • Disrupt networks

    So, always ask the webmaster for permission.
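
Besides asking for permission, one common way to reduce the load a crawler puts on a server is to pause between requests. The sketch below is an illustration only, not part of the original slides: `polite_fetch_all` is a hypothetical helper, the one-second default is an arbitrary choice, and the demo uses a dummy fetch function so that nothing is actually downloaded.

```r
# Fetch a list of URLs one by one, waiting between consecutive requests.
# Assumes fetch() always returns a value (not NULL) for each URL.
polite_fetch_all <- function(urls, fetch, delay_seconds = 1) {
  results <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    results[[i]] <- fetch(urls[i])
    # Pause before the next request, but not after the last one.
    if (i < length(urls)) Sys.sleep(delay_seconds)
  }
  results
}

# Demo with a dummy fetch function and a very short delay.
demo <- polite_fetch_all(c("a", "b"), function(u) paste("page", u), delay_seconds = 0.01)
print(demo)
```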


VIII. Limitations:

  • Availability of data on the internet, although this is improving

  • Learning curve: someone has to learn how to use the tool


If all this sounded very confusing, see: http://statistics.berkeley.edu/computing/r-reading-webpages

More references:

  • Spector, P. (2011): Reading Data from Web Pages with R, University of California, Berkeley, Class Notes s133.

  • Spector, P. (2011): Stat 133 Class Notes – Spring, 2011; pp 70-88

  • Jockers, M. (2013): Text Analysis with R, under review with Springer

  • R help, ETH Zurich, https://stat.ethz.ch/mailman/listinfo/r-help


Prepared by Yamina Guidoum from the work of:

Dr. Anna-Elisabeth Thum

Associate Research Fellow at CEPS and Economic Analyst at DG ECFIN, European Commission

“The views in this presentation do not reflect the views of the European Commission”

And of:

Dr. Lucia Mytna Kureková

Slovak Governance Institute & Central European University

From the INGRID Winter School “Skills and Occupations in Europe”, November 2013, CEPS, Brussels

