Web Mining. Shah Mohammad Nur Alam Sawn, 03/03/2014.

Exploiting Geographical Location Information of Web Pages. Orkut Buyukkokten, Junghoo Cho.

Web Mining

Shah Mohammad Nur Alam Sawn


What is Web Mining?

Discovering desired and useful information from the World Wide Web

Exploiting Geographical Location Information of Web Pages

“Proof of Concept” using mapping databases

Ways of exploiting information from the Internet:

  • Improve search engines, for example by not showing results that are irrelevant to the query.

  • Identify the “globality” of resources: by analyzing hyperlinks and information about web sites, one can estimate how global a web entity is.

Problems in exploiting the geographical location information of entities

  • How to compute geographical information?

  • How to exploit this information?

Computing geographical information

  • Information extraction: automatically analyze web pages to extract geographic entities such as area codes or zip codes.

  • Network IP address analysis: focus on the location of the web sites that host the pages.

Exploiting the information using databases

  • Site Mapper (http://www.internic.net/)

    This database has the phone numbers of the network administrators of all Class A and B domains. From it, the authors extracted the area code of each domain administrator and built a Site-Mapper table with area code information for IP addresses belonging to Class A and Class B networks.

  • Area Mapper (http://www.zipinfo.com/)

    It maps cities and townships to a given area code. In some cases an entire state (e.g., Montana) corresponds to one area code; in others, a big city has multiple area codes (e.g., Los Angeles). Scripts then convert this data into a table that keeps, for each area code, the corresponding set of cities/counties.

  • Zip-Code Mapper (http://www.zipinfo.com/)

    This maps each zip code to a range of longitudes and latitudes.
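The three lookups above chain together: network prefix to area code, area code to cities, and zip code to a latitude/longitude range. The following Python sketch shows one way such tables could be queried. Every table row, the two-octet prefix key, and the function names `cities_for_ip` and `zip_center` are hypothetical placeholders chosen for illustration; they are not values or code from the actual prototype.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical placeholder rows (not real data from the databases above).
SITE_MAPPER: Dict[str, str] = {"10.1": "555"}            # network prefix -> area code
AREA_MAPPER: Dict[str, Set[str]] = {                     # area code -> cities/counties
    "555": {"Example City", "Sample Town"},
}
ZIP_MAPPER: Dict[str, Tuple[Tuple[float, float], Tuple[float, float]]] = {
    "00000": ((10.0, 10.5), (20.0, 20.5)),               # zip -> (lat range, lon range)
}


def network_prefix(ip: str) -> str:
    """Return the first two octets of a dotted-quad IPv4 address.

    Assumption: a Class B style two-octet key; a Class A lookup would
    use only the first octet.
    """
    octets = ip.split(".")
    if len(octets) != 4:
        raise ValueError(f"not an IPv4 address: {ip!r}")
    return ".".join(octets[:2])


def cities_for_ip(ip: str) -> Set[str]:
    """Chain Site Mapper and Area Mapper: IP -> area code -> cities."""
    area_code = SITE_MAPPER.get(network_prefix(ip))
    if area_code is None:
        return set()
    return AREA_MAPPER.get(area_code, set())


def zip_center(zip_code: str) -> Optional[Tuple[float, float]]:
    """Return the midpoint (lat, lon) of the range stored for a zip code."""
    ranges = ZIP_MAPPER.get(zip_code)
    if ranges is None:
        return None
    (lat_lo, lat_hi), (lon_lo, lon_hi) = ranges
    return ((lat_lo + lat_hi) / 2, (lon_lo + lon_hi) / 2)
```

An unknown prefix or zip code returns an empty set or `None`, so a caller can tell "no location known" apart from a real result.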

Graphical Interface of Proof of Concept Prototype

Output of search (figure panels: Zip Code, Area Code)


Geospatial Data Mining on the Web: Discovering Locations of Emergency Service Facilities (2012)

Wenwen Li, Michael F. Goodchild, Richard L. Church, and Bin Zhou

  • GeoDa Center for Geospatial Analysis and Computation, School of Geographical Sciences and Urban Planning, Arizona State University, Tempe, AZ 85287

  • Department of Geography, University of California, Santa Barbara, Santa Barbara, CA 93106, {good,church}@geog.ucsb.edu

  • Institute of Oceanographic Instrumentation, Shandong Academy of Sciences, Qingdao, Shandong, China 266001

Google search image of fire station

Actual Location

Google result

Process of Web Crawler

A Web crawler is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing. A Web crawler may also be called a Web spider, or an automatic indexer.
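To make the crawling process concrete, here is a minimal breadth-first crawler sketch in Python. It is an illustration under stated assumptions, not the crawler used in the paper: the `crawl` function, the `fetch` callable, and the in-memory `PAGES` dictionary are hypothetical stand-ins so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable, Dict, List
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds: List[str], fetch: Callable[[str], str], max_pages: int = 100) -> List[str]:
    """Visit pages breadth-first from the seed URLs; return URLs in visit order."""
    queue = deque(seeds)
    seen = set(seeds)
    visited: List[str] = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except KeyError:
            # Assumption: the hypothetical fetch raises KeyError for missing pages.
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# Hypothetical in-memory "web" standing in for real HTTP requests.
PAGES: Dict[str, str] = {
    "http://example.org/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.org/a": '<a href="/b">B</a>',
    "http://example.org/b": '<a href="/">home</a> <a href="/missing">x</a>',
}

order = crawl(["http://example.org/"], PAGES.__getitem__)
```

The `seen` set is what keeps the crawler from revisiting pages that link to each other. A real crawler would also need politeness delays and robots.txt handling, which are omitted here.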

Cont.

d1: Distance between p and the location of the first digit in the number block closest to (and before) location p.

d2: Distance between p and the location of the last digit of the first number that appears (for detecting a 5-digit ZIP code), or the last digit of the second number after p if the token distance of the first and second number block equals

r1: regular expression [1-9][0-9]*[\\s\\r\\n\\t]*([a-zA-Z0-9\\.]+[\\s\\r\\n\\t])+

r2: regular expression "city-Pattern "[\\s\\r\\n\\t,]?+
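As a rough illustration of how r1 behaves, the sketch below applies it with Python's `re` module. The doubled backslashes on the slide look like string-literal escaping (an assumption), so the pattern is written here as a Python raw string with single backslashes. The helper name `street_candidates` and the sample text are hypothetical. r2 is not shown because "city-Pattern" is not defined on the slide.

```python
import re
from typing import List, Tuple

# r1 from the slide: a number not starting with 0, optional whitespace,
# then one or more whitespace-terminated alphanumeric tokens.
R1 = re.compile(r"[1-9][0-9]*[\s\r\n\t]*([a-zA-Z0-9\.]+[\s\r\n\t])+")


def street_candidates(text: str) -> List[Tuple[int, str]]:
    """Return (start offset, matched text) for every r1 match in text."""
    return [(m.start(), m.group(0)) for m in R1.finditer(text)]


# Hypothetical sample text, not taken from the paper's data.
sample = "Address: 1234 Main Street\nSanta Barbara, CA 93106"
candidates = street_candidates(sample)
```

The pattern is deliberately loose: a match keeps going while whitespace-terminated tokens follow, so it can run past the street name. It is best read as a candidate generator whose matches still need filtering.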


Decision rules of desired addresses by training data based on semantic information

Figure labels: Station + Num; Key word Station and title web page as fire; Station on web page title

Architecture of Proposed Cyber Miner

  • Here the input is the seeding web URLs and the output is the target addresses.

Search Results of Cyber Miner

Locations of all fire stations obtained by Cyber Miner from the address database

Web-based geographic search engine for location-aware search in Singapore

  • Flora S. Tsai, School of Electrical & Electronic Engineering, Nanyang Technological University, Singapore 639798, Singapore. 2010.

Geo search

This search engine finds location-specific information in Singapore-based Web sites. Users can view their search locations on a satellite map instead of the two-dimensional maps currently used in street directories. It can search for locations by area name, building name, group of landmark types, business name, and business category. Users can also supply their current coordinates as a parameter so that results are returned in order of distance from their current location.

Google Earth

Google Earth is used for the search.

Keyhole Markup Language

Keyhole Markup Language (KML) is a file format used to display geographic data in an earth browser such as Google Earth, Google Maps and Google Maps for mobile.
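For readers who have not seen KML, the sketch below builds a minimal document with one Placemark using Python's standard `xml.etree.ElementTree`. The function name `placemark_kml`, the place name, and the coordinates are hypothetical values for illustration. The element names (`kml`, `Document`, `Placemark`, `name`, `Point`, `coordinates`) and the namespace are standard KML 2.2.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def placemark_kml(name: str, lat: float, lon: float) -> str:
    """Return a KML document (as a string) containing a single point."""
    ET.register_namespace("", KML_NS)  # serialize without a namespace prefix
    kml = ET.Element(f"{{{KML_NS}}}kml")
    doc = ET.SubElement(kml, f"{{{KML_NS}}}Document")
    placemark = ET.SubElement(doc, f"{{{KML_NS}}}Placemark")
    ET.SubElement(placemark, f"{{{KML_NS}}}name").text = name
    point = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
    # KML orders coordinates as longitude,latitude[,altitude].
    ET.SubElement(point, f"{{{KML_NS}}}coordinates").text = f"{lon},{lat},0"
    return ET.tostring(kml, encoding="unicode")


# Hypothetical example point with rough Singapore-area coordinates.
kml_text = placemark_kml("Example place", lat=1.3, lon=103.8)
```

The longitude-first coordinate order is a common source of bugs when a database stores latitude first.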

Street directory

Useful for mobile phones only; it is also a web map service that merges with Google Earth.

Global Positioning System

Google Earth allows tracks and waypoints to be downloaded from GPS devices and creates KML files for the downloaded waypoints and tracks.

Design

Design Cont.

  • BusinessAreaAddress, where the address is stored without the postal code;

  • BusinessAreaPostal, where the postal code is stored;

  • Area, where the keywords of the area are stored, e.g. Causeway Point;

  • General Area, where the General Area of the location is stored, e.g. Yishun.
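One way to picture the four fields above is as columns of a single table. The sketch below does this with Python's built-in `sqlite3`. Treating them as columns of one table is an assumption, and the table name `business_area`, the addresses, the postal codes, and the function `search_by_area` are hypothetical. Only the field names and the keywords Causeway Point and Yishun come from the slide.

```python
import sqlite3
from typing import List, Tuple

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE business_area (
           BusinessAreaAddress TEXT,  -- address without the postal code
           BusinessAreaPostal  TEXT,  -- postal code only
           Area                TEXT,  -- area keywords, e.g. Causeway Point
           GeneralArea         TEXT   -- general area, e.g. Yishun
       )"""
)
# Placeholder rows: the addresses and postal codes are made up.
conn.executemany(
    "INSERT INTO business_area VALUES (?, ?, ?, ?)",
    [
        ("1 Placeholder Road", "000001", "Causeway Point", "Woodlands"),
        ("2 Placeholder Avenue", "000002", "Placeholder Mall", "Yishun"),
    ],
)


def search_by_area(keyword: str) -> List[Tuple[str, str]]:
    """Find entries whose Area or GeneralArea contains the keyword."""
    pattern = f"%{keyword}%"
    rows = conn.execute(
        "SELECT BusinessAreaAddress, BusinessAreaPostal FROM business_area "
        "WHERE Area LIKE ? OR GeneralArea LIKE ? ORDER BY BusinessAreaPostal",
        (pattern, pattern),
    )
    return rows.fetchall()
```

Storing the postal code separately from the address, as the slide describes, lets a search match on either one without parsing the address string.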

Algorithms

The haversine formula is used here for faster processing.

  • Consider two points on a sphere of radius R with latitudes φ1 and φ2, latitude separation Δφ = φ1 - φ2, and longitude separation Δλ, where all angles are in radians.

  • With haversin(θ) = sin²(θ/2), the distance d between the two points is related to their locations by the formula:

    h = haversin(d/R) = haversin(Δφ) + cos(φ1) cos(φ2) haversin(Δλ) ……(1)

Algorithms Cont.

  • Let h denote haversin(d/R) as given above. d can then be solved for either by applying the inverse haversine (if available) or by using the arcsine (inverse sine) function:

  • d = R haversin⁻¹(h) = 2R arcsin(√h) ……(2)

  • This formula is only an approximation when applied to the Earth, which is not a perfect sphere: its radius R varies from 6356.78 km at the poles to 6378.14 km at the equator. The resulting error is about 0.1%, depending on the location, assuming that the geometric mean R = 6367.45 km is used.

  • The output of this formula is the distance between two coordinates.
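Equations (1) and (2) translate directly into code. The Python sketch below is a minimal illustration rather than the search engine's actual implementation. The function names and the `(name, lat, lon)` tuple layout are hypothetical, and the radius is the 6367.45 km value quoted above.

```python
import math
from typing import List, Tuple

EARTH_RADIUS_KM = 6367.45  # mean radius quoted on the slide


def haversin(theta: float) -> float:
    """haversin(theta) = sin^2(theta / 2), with theta in radians."""
    return math.sin(theta / 2) ** 2


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float,
                 radius: float = EARTH_RADIUS_KM) -> float:
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi1 - phi2
    d_lambda = math.radians(lon1 - lon2)
    # Equation (1): h = haversin(d / R)
    h = haversin(d_phi) + math.cos(phi1) * math.cos(phi2) * haversin(d_lambda)
    # Equation (2): d = 2 R arcsin(sqrt(h)); clamp h against rounding error.
    return 2 * radius * math.asin(math.sqrt(min(1.0, h)))


def sort_by_distance(origin: Tuple[float, float],
                     places: List[Tuple[str, float, float]]) -> List[Tuple[str, float, float]]:
    """Order (name, lat, lon) places by distance from origin (lat, lon)."""
    return sorted(places, key=lambda p: haversine_km(origin[0], origin[1], p[1], p[2]))
```

Sorting candidates with `sort_by_distance` is one way to return results in order of distance from the user's current coordinates, as the search engine description requires.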

Result

The database from which these results are taken contains 1652 entries in the following categories:

  • Apparel, Bank, Cinema, Department Store, Duty Free Shop, Electronics, F&B (food and beverage), Fast Food, Food Court, Furniture, Health and Beauty, Minimart, Musical Instruments, Restaurant, Snack Bar, Sports, Stationery, Seafood, and Supermarket.

  • The landmark types searched for are Buildings, Roads, MRT stations, Schools, and Shopping Centres. The General Area option under Advanced search groups various roads into one big area, e.g. Tanjong Katong and Haig Road are both grouped under the Katong area.

Simple search



Advanced search

Thank you for your patience!