
From the Inside Out

Michael Hunter

Reference Librarian

Hobart and William Smith Colleges


Google from the Inside Out

  • Hardware and Database Creation

  • Relevance Ranking and Link Analysis

  • Advanced and “Hidden” Search Features

  • Hands-on Session

  • Pay-for-Placement and Revenue Issues

  • Our Google “Wish List”

  • Other Services to Keep Our Eyes On


Google’s Beginnings

  • 1996: Sergey Brin and Larry Page of Stanford develop “BackRub”, based on analysis of links TO a page from other sites

  • Sept. 7, 1998, Menlo Park, CA: Google launches in beta, handling over 10,000 queries a day

  • December 1998: Listed in PC Magazine’s Top 100 Websites


What’s in a name?

  • “Google” is a play on “googol”, a term coined by Milton Sirotta, nephew of mathematician Edward Kasner, to refer to the number one followed by 100 zeros


Google’s Hardware

  • Over 10,000 servers in two locations containing “hundreds of copies of the database”

  • Index of more than 3 billion web documents

  • Handles thousands of queries on a sub-second basis

  • Interviews in MP3 format with Chief Operations Engineer Jim Reese

    • //technetcast.com/tnc_play_stream.html?stream_id=420 (1 hr. 13 min)

    • //technetcast.com/tnc_play_stream.html?stream_id=421 (15 min.)


Google’s Multi-faceted Database

  • Indexed html pages

  • Unindexed html pages

  • Other file types

  • Html pages that are re-indexed daily



What types of pages are unindexed? (25%)

  • Dead or inaccurate links

  • Duplicate pages

  • Database-generated URLs

  • Pages excluded by robots.txt or noindex meta tags

  • Pages on an intranet

  • Pages “waiting” to be indexed fully
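The robots.txt and noindex exclusions in the list above can be sketched with Python's standard library. This is only an illustration of how a well-behaved crawler might apply the two rules, not a description of Google's crawler; the robots.txt content, the `ExampleBot` agent name and the helper names are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site (not taken from the slides).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


class NoindexDetector(HTMLParser):
    """Sets .noindex when a <meta name="robots"> tag contains 'noindex'."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        content = (attr.get("content") or "").lower()
        if name == "robots" and "noindex" in content:
            self.noindex = True


def may_crawl(robots_txt, url, agent="ExampleBot"):
    """True if the robots.txt rules allow this agent to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


def has_noindex(html):
    """True if the page asks not to be indexed via a robots meta tag."""
    detector = NoindexDetector()
    detector.feed(html)
    return detector.noindex


print(may_crawl(ROBOTS_TXT, "http://example.com/private/report.html"))
print(has_noindex('<meta name="robots" content="noindex,follow">'))
```

A URL that fails either check can still be known to the crawler from links on other pages, which is how such pages end up listed without being fully indexed.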


How did they get into Google?

  • Google crawls and downloads links in the documents it encounters

  • Some of these links are dead or inaccurate, or cannot be crawled for other reasons (intranets, robots.txt)

  • The URLs are in the database, but the documents are not


Why does Google leave them in?

  • They are not COMPLETELY unindexed

  • Indexed elements include

    • Words in the URL

      http://members.home.net/gourdeaud/

    • Words in the anchor text on indexed pages that link to the unindexed URL

      <a href="http://members.home.net/gourdeaud/">Gourdeaud’s biography</a>

    • Can be useful in URL searches or unique term queries and PageRank
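A minimal sketch of the two indexed elements named above: words taken from the URL itself, and anchor text collected from a page that links to it. The tokenizing rule and the helper names are assumptions made for illustration; how Google actually splits URLs is not stated in the slides.

```python
import re
from html.parser import HTMLParser


def url_words(url):
    """Split a URL into lowercase word-like tokens (assumed tokenizing rule)."""
    return re.findall(r"[a-z0-9]+", url.lower())


class AnchorCollector(HTMLParser):
    """Collects (href, anchor text) pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the <a> element currently open, if any
        self._text = []     # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def anchor_texts(html):
    """Return every (href, anchor text) pair found in the HTML."""
    collector = AnchorCollector()
    collector.feed(html)
    return collector.links


page = '<p><a href="http://members.home.net/gourdeaud/">Gourdeaud\'s biography</a></p>'
print(url_words("http://members.home.net/gourdeaud/"))
print(anchor_texts(page))
```

Both outputs describe the target URL without the target document ever being downloaded.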


How can I distinguish unindexed pages in search results?

  • No extract

  • No page size

  • No cached copy of the page


Deep Web Components: Non-html filetypes (1.75%)

SEARCH SYNTAX: “california power shortage” filetype:pdf

  • Adobe Portable Document Format (pdf)

  • Adobe PostScript (ps)

  • Lotus 1-2-3 (wk1, wk2, wk3, wk4, wk5, wki, wk

  • Lotus WordPro (lwp)

  • MacWrite (mw)

  • Microsoft Excel (xls)

  • Microsoft PowerPoint (ppt)

  • Microsoft Word (doc)

  • Microsoft Works (wks, wps, wdb)

  • Microsoft Write (wri)

  • Rich Text Format (rtf)

  • Text (ans, txt)


Google Non-html Filetypes: Warning!

  • FOR NON-HTML FILES

    • Clicking on a title in the results list opens the application as well, involving risk of a virus or worm that may be attached to the file

    • INSTEAD, click the “View as HTML” option; no applications will be opened and no risk of virus or worm

    • NOTE: Titles for non-html files are frequently not descriptive of content


Non-html filetypes in Google: Notess Study, March 6, 2002 – 25 One-Word Searches



Deep Web Components: Daily re-indexed pages (.15%)

  • Over 3 million

  • Regular html pages that Google has noticed are frequently updated.

  • Google re-indexes these “every day or so”

  • Date of Google’s last visit to the page appears in the results listing


Google’s Database

  • Freshness

  • Breadth

  • Depth


Database Freshness

  • Refreshes its entire web index “on a roughly monthly basis, about every 28 days”.

  • On-going process

  • Some segments fresher than others


Notess Study, April 6, 2002: Pages that are updated daily and report that date


Database Breadth (Size)

  • About 3 billion documents (indexed and unindexed)

  • Daily figure on the homepage

    3,083,324,652 on March 8, 2003

    (Not including Images or Usenet)

  • FAST (alltheweb.com) claimed

    2.1 billion indexed documents,

    March 8, 2003


Database Depth

  • Google “typically” downloads the first 110 K of a web document

  • Download includes URLs of outgoing links


Database “Blending”

  • Results from Google’s News vertical engine are included in results for all searches

  • Blending is increasingly common among search services

    • News

    • Shopping

    • Directory



Relevance Ranking and Link Analysis

Google’s “PageRank”

Demystified


Relevance Ranking

  • Processing and presenting retrieved results

  • Proprietary information

  • Search Engine Optimization Industry has made it even more so

  • “How can I make my site rank high in Google?”


What happens when I enter a search at Google?

  • Check of search syntax and spelling

  • Query routed to the appropriate server “based on the [database] segment on which the answer is likely to be found”


What happens when I enter a search at Google?

  • Processing of Visible text

    • Search term(s) position – title, heading, text

    • Search term(s) frequency

    • Search term(s) proximity

  • Processing of Invisible text

    • Meta tags

    • Anchor text (the link text between <a> and </a>)

      <a href="http://www.hws.edu">Hobart and William Smith Colleges</a>
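The three visible-text signals listed above (position, frequency, proximity) can be made concrete with a short sketch. The tokenizer and function names are hypothetical, and how Google weights these signals is proprietary; this only shows what each signal measures.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def in_title(title, term):
    """Position signal: does the term appear in the page title?"""
    return term.lower() in tokenize(title)


def term_frequency(tokens, term):
    """Frequency signal: how many times the term occurs in the text."""
    return tokens.count(term.lower())


def min_proximity(tokens, term_a, term_b):
    """Proximity signal: smallest distance in words between the two terms.

    Returns None when either term is missing from the text.
    """
    pos_a = [i for i, t in enumerate(tokens) if t == term_a.lower()]
    pos_b = [i for i, t in enumerate(tokens) if t == term_b.lower()]
    if not pos_a or not pos_b:
        return None
    return min(abs(a - b) for a in pos_a for b in pos_b)


body = tokenize("Synchrotron radiation is light. The synchrotron emits radiation.")
print(in_title("Synchrotron Radiation Basics", "radiation"))
print(term_frequency(body, "synchrotron"))
print(min_proximity(body, "synchrotron", "radiation"))
```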


What happens when I enter a search at Google?

  • PageRank link analysis applied

  • Click popularity (Google Toolbar voting data)

  • Link context (Proximity of links to your search term(s) within the document)

  • Final dynamic mix of “about 25 factors”


PageRank Demystified

  • Patented link analysis program

  • Part of Google since its beginnings

  • Objective – To make ranking more of a “human process”

  • Assigns each page in Google a PageRank score, which is dynamic (changeable)

  • Weighs heavily in final ranking of results


PageRank’s Multi-layered processing

  • Layer I

    • Do others think your site is of value as demonstrated by linking to you?

      IF SO …

  • Layer II

    • Are these “others” in turn linked to by sites recognized through linkage within “web communities”?


PageRank’s Multi-layered processing

  • A Favorable Ranking Scenario

    A .com site selling prosthetics linked

    TO by

    A local orthopedic association in turn linked TO by

    A national orthopedic group in turn linked TO by

    The National Institutes of Health


Visualizing Linkage in Google’s Database with TouchGraph

  • Browser:

    http://www.touchgraph.com/TGGoogleBrowser.html

  • Instructions:

    http://www.touchgraph.com/TGGB_FullInstructions.html


How Does Google Identify “Web Communities”?

  • Mutual linkage patterns

  • Metadata elements and keywords found in common

  • Human examination/verification of the quality of key sites within the community

  • Other proprietary factors ???????


PageRank Nitty Gritty

  • Every page of a site can have a PageRank score, not just the main page

  • The value of a link from Site B to Site A is decreased with each additional link from Site B to any other site

    Rationale: If Site B has only a few links, each one could be more important than if Site B has hundreds of outgoing links
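The dilution effect described above can be seen in a small sketch of the PageRank formula as published by Brin and Page, where each page passes its score in equal shares to the pages it links to. The damping factor of 0.85, the handling of pages with no outgoing links, and the five-page example graphs are assumptions for illustration; Google's production scoring is proprietary and mixes in other factors.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute PageRank scores.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Each outgoing link carries an equal share of the page's score.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Same five pages in both graphs; only Site B's outgoing links differ.
few_links = {"B": ["A"], "C": ["B"], "D": ["B"], "E": ["B"]}
many_links = {"B": ["A", "C", "D", "E"], "C": ["B"], "D": ["B"], "E": ["B"]}

print(pagerank(few_links)["A"])
print(pagerank(many_links)["A"])
```

Site A scores higher in the first graph, where it receives all of what Site B passes on rather than a quarter of it.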


PageRank Nitty Gritty

  • Requires human adjustment in the case of large subject directories and quality lists of links

  • PageRank scoring is a dynamic process always in flux

  • To find a page’s PageRank score, go to the Toolbar and click on the green meter


PageRank Feedback

  • Site A has NO outgoing links, but is linked TO by Site B

  • Site A decides to create a single link to Site B

  • This increases Site B’s PageRank score

  • Site B’s increased score in turn automatically increases Site A’s score


Sounds easy to manipulate…

  • Possibilities include

    • Spam

    • Link “farms”

    • Cloaking (sneaky re-directs)

  • Google is vigilant

  • If Google detects any manipulation of PageRank, it eliminates the domain from its database and never crawls there again.


PageRank Processing

  • How does Google know who has linked to Site A, for example?

  • By searching its database for all sites with links to Site A

  • No way to do this by examining Site A, as there is no physical change to a document when it is linked TO
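Finding who links to Site A amounts to inverting the outgoing-link data gathered during the crawl. A minimal sketch, assuming the crawl is available as a simple mapping from each page to its outgoing links (the site names and function name are hypothetical):

```python
from collections import defaultdict


def build_backlinks(outlinks):
    """Invert {source: [targets]} into {target: sorted list of sources}."""
    backlinks = defaultdict(set)
    for source, targets in outlinks.items():
        for target in targets:
            backlinks[target].add(source)
    return {target: sorted(sources) for target, sources in backlinks.items()}


# Outgoing links as recorded while crawling each site.
crawl = {
    "siteA": [],
    "siteB": ["siteA"],
    "siteC": ["siteA", "siteB"],
}

print(build_backlinks(crawl))
```

Nothing in Site A's own record changes; its incoming links exist only in the inverted table, which is why the completeness of the crawl matters so much.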


Implications of PageRank

  • PageRank is entirely dependent on linkage data derived from the Google database

  • Breadth, depth and freshness of the crawl are critical to accurate and current data for PageRank scoring


A Different Perspective on PR: Anti-Google

  • Daniel Brandt claims

    • “PageRank discriminates against new web sites” (which may not yet be linked to by other sites).

    • “Careless custodian of private information” (Google associates each search with a cookie, set to last 36 years)

    • Maintains googlewatch.org


PageRank: A Summary

Not all links are created equal

  • Is this site linked TO by “good” web pages associated with this topic?

  • EXAMPLE: If a page is linked to by a subject directory (Yahoo, OD, LII) its rank will be higher than another page with many links from personal web pages, link “farms”, etc.

  • NOTE: Link Analysis (PageRank) is not the same as Link Popularity (number of links)



Searching Google: Touring the Known and the Unknown

Please share your discoveries with us!


Command Searching with Google’s Fields (aka Search Operators)

  • Field Searches that cannot be combined with other search elements:

  • NOTE: No space allowed between operator and following text

    • cache: retrieves cached version of the specified URL

    • link: retrieves pages that have links to the specified URL

    • related: retrieves pages that are “similar” to the specified URL (same as Similar Pages feature in results listing)


Command Searching with Google’s Fields (aka Search Operators)

  • Field Searches that cannot be combined with other search elements:

    • info: retrieves information that Google has about the specified URL

    • stocks: retrieves stock information about the companies whose ticker symbols follow the stocks: operator

      stocks:intc (Intel)


Command Searching with Google’s Fields (aka Search Operators)

  • Field Searches that can be combined with other search elements:

    • site: restrict results to those from the specified domain

      site:www.google.com PageRank

      NOTE: retrieves all pages from www.google.com that contain PageRank anywhere


Field Searches that can be combined with other search elements:

  • allintitle: restrict results to those with all terms present in the html title element

    allintitle:synchrotron radiation

  • intitle: restrict results to those with this single term in the title element

    intitle:synchrotron intitle:radiation

    NOTE: intitle:synchrotron radiation retrieves

    synchrotron in title and radiation anywhere


Field Searches that can be combined with other search elements:

  • allinurl: restrict results to those with all terms present in the URL

    Note: ignores all punctuation

    allinurl:usda pesticides

  • inurl: restrict results to those with this single term in the URL

    inurl:usda inurl:pesticides

    NOTE: inurl:usda pesticides retrieves

    usda in URL and pesticides anywhere
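The difference between the single-term operators and their all-terms forms is easy to get wrong by hand, so here is a small sketch that builds both kinds of query string. The helper names are hypothetical; the code only assembles text to type into the search box and does not contact Google.

```python
def per_term(operator, terms):
    """Prefix every term with the operator, with no space after the colon."""
    return " ".join(f"{operator}:{term}" for term in terms)


def all_terms(operator, terms):
    """Use the 'all' form once; it applies to every term that follows."""
    return f"all{operator}:" + " ".join(terms)


print(per_term("intitle", ["synchrotron", "radiation"]))
# -> intitle:synchrotron intitle:radiation
print(all_terms("inurl", ["usda", "pesticides"]))
# -> allinurl:usda pesticides
```

Writing `intitle:` only once, as in `intitle:synchrotron radiation`, restricts just the first term, which is the pitfall the notes above point out.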


Google Answers

  • Fee Based answer service

  • User sets fee ($2.50-up) and time frame for question (Guidelines offered)

  • Searchable archive available

  • Comments can be added (by anyone) to unanswered questions

  • Users rate answers


Google Answers: Who are the “researchers”?

  • Must be 18 years old

  • Write an essay on why you want to be a researcher

  • Answer 5 sample questions

  • Training manual available at

    http://answers.google.com/answers/researchertraining.html


Google’s API: Application Programming Interface

  • Free programs for developers and researchers interested in incorporating Google in their applications

    • Iterative searches on a topic (SDI)

    • Search via non-html interfaces

    • Games that play with Web information

  • Daily limit of 1,000 queries

  • Uses SOAP (Simple Object Access Protocol) that is XML-based

  • More at //google.com/apis/index.html


Froogle

  • New Service launched in Dec, 2002

  • Locates information about products for sale online

  • Gives URLs of sites offering the item

  • Provides links to exact page in the site where you can make the purchase


Froogle

  • Ranking follows normal Google ranking processes

  • Paid placements always clearly marked

  • “Sort by price” may be a future enhancement

  • Access at http://froogle.google.com or via Google Advanced Search


Google’s Hidden Features

  • Daterange search

  • “Wildcard” words

  • Phonebook command search

  • info: field search

  • “Dictionary” feature


Daterange

  • Not officially supported at google.com (unreliable)

  • Reliable only through API programs

  • At google.com, MAY be most reliable for the past 1 or 2 days

  • Searches the date of the document’s entry into the database, not its creation.


Daterange Search Results (???): each day’s entries for a “dog” search executed on Oct. 9

  • Oct. 9 No hits

  • Oct. 8 6

  • Oct. 7 “about 212,000” many dated 10/7

  • Oct. 6 “about 8980” many dated 10/7

  • Oct. 5 “about 5900” many dated 10/7

  • Oct. 6-7 “about 57,100” !!!!

    NOT TRUE DATERANGE FUNCTIONALITY


With those caveats …..

  • Daterange uses Julian Day numbers (not the Julian calendar): a continuous count of days since noon UTC on Jan. 1, 4713 BC

  • Date changes at noon, not midnight

    2452565 = 12:00 pm Oct. 17 to 11:59 am Oct. 18, 2002 (UTC)

  • Often used in astronomical and military contexts

  • JD converter:

    //aa.usno.navy.mil/data/docs/JulianDate.html
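For readers who would rather compute the number than use an online converter, the conversion can be sketched with Python's `datetime` module. The constant below is the fixed offset between Python's day ordinal and the Julian Day at 00:00 UTC. The choice to truncate the midnight value to a whole number for the `daterange:` operator is an assumption, picked because it reproduces the figure used in the Oct. 14 search in these slides; Google did not document which convention it expected.

```python
from datetime import date

# Offset between Python's proleptic Gregorian ordinal (day 1 = 0001-01-01)
# and the Julian Day at 00:00 UTC on the same date.
ORDINAL_TO_JD = 1721424.5


def julian_day_at_midnight(d):
    """Julian Day (with fraction) at 00:00 UTC on the given date."""
    return d.toordinal() + ORDINAL_TO_JD


def julian_day_starting_at_noon(d):
    """Whole Julian Day number that begins at 12:00 UTC on the given date."""
    return int(d.toordinal() + ORDINAL_TO_JD + 0.5)


def daterange_term(start, end):
    """Build a daterange: term, truncating each midnight value (assumed convention)."""
    first = int(julian_day_at_midnight(start))
    last = int(julian_day_at_midnight(end))
    return f"daterange:{first}-{last}"


print(julian_day_at_midnight(date(2000, 1, 1)))   # 2451544.5
print(daterange_term(date(2002, 10, 14), date(2002, 10, 14)))
# -> daterange:2452561-2452561
```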


Daterange Search for Oct. 14

news daterange:2452561-2452561 (4,450 hits)


Phonebook Command Search

  • Searches US residential (rphonebook:) and business (bphonebook:) listings of Yahoo, MapQuest and other services

  • rphonebook:

    • MUST INCLUDE

      • Last name, plus city and/or state

    • MAY INCLUDE

      • First name

  • bphonebook:

    • MUST INCLUDE

      • Business name (min. 1 word), plus city and/or state

    • MAY INCLUDE

      • Full Business name


Wildcard Words

  • Google offers a word-sized asterisk to function as a wildcard

  • Stands for a whole word

  • Cannot be used for part of a word

    • “three * mice” = 22,000

    • “three bl* mice” = 0
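The whole-word behaviour can be imitated locally with a regular expression, which makes it clear why the second query finds nothing. This is only an analogy for how the wildcard behaves, not how Google implements it; the function name is hypothetical.

```python
import re


def phrase_to_regex(phrase):
    """Compile a phrase in which a standalone * matches exactly one whole word.

    An asterisk attached to letters (as in "bl*") is treated as a literal
    character, so it never acts as a truncation symbol.
    """
    parts = []
    for token in phrase.split():
        if token == "*":
            parts.append(r"\w+")
        else:
            parts.append(re.escape(token))
    return re.compile(r"\b" + r"\s+".join(parts) + r"\b", re.IGNORECASE)


text = "Three blind mice, see how they run"
print(bool(phrase_to_regex("three * mice").search(text)))    # True
print(bool(phrase_to_regex("three bl* mice").search(text)))  # False
```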


Wildcard Words

  • Several * can be used together

    milosevic “International * * Hague”

    Retrieves military tribunal OR military court OR war tribunal


info:

  • Not exactly hidden, but not well-known

  • Searches for any information Google has about a site

  • Convenient way to monitor linkage


Dictionary Feature

  • Term(s) in a query for which Google has definitions are underlined in the text above the results listing (“Searched the Web for …”)

  • Clicking on the term(s) sends you to the dictionary provider (you leave Google).

  • Definitions are provided from sources “selected solely on the basis of quality”


A Few Good Alternatives to Google

  • FAST - //alltheweb.com

  • Teoma - //teoma.com

  • Gigablast - //gigablast.com


Pay-For-Placement and Other Revenue Issues


Revenue at Google: Selling Search Software

  • Provides search software and interface for portals and corporate intranets -“Powered by Google”

  • Over 150 customers worldwide (Yahoo, Sony, AOL/Netscape, Cisco Systems)

  • Google charges an initial set-up fee and a charge per 1,000 searches


Revenue at Google: Advertising (AdWords)

  • Ads located to the right of search results

  • Cost-per-click model (pay only if someone actually clicks into your site from Google)

  • No monthly minimum charge


Revenue at Google: Advertising (AdWords)

  • Highest bidder does NOT take top placement

  • Google measures number of visitors to an advertiser’s site and length of visits

  • This popularity-based relevance helps determine position of an ad

  • Offers smaller businesses a chance to compete for visibility


Revenue at Google: Premium Sponsorships

  • Launched in mid-2002

  • Advertisers purchase keywords or phrases

  • Limited to no more than two sites per keyword or phrase

  • Highest bidder’s site appears at the top of results listing, labeled Sponsored Site


$$$ and Ranking: A Mini-Glossary

  • Pay-for-Placement

    • Paying for a specific position within search results retrieved using specific search terms

  • Pay-for-Inclusion

    • Paying for inclusion anywhere within search results retrieved using specific search terms

  • Pay-for-Submission

    • Paying to be included in the database (no special ranking treatment)

  • To date, no pay for inclusion or submission at Google


Revenue at Google: The Professional’s View

  • To date, advertising clearly labeled at Google

  • If revenues decline, database size and quality may be affected

  • Development and support of search features and enhancements will be driven by the commercial sector

  • Change in ownership can alter the nature and educational value of any search service


The Last 12 Months at Google

  • Dec. 2001 - Database is at 3 billion:

    • 2 Billion Web documents (all types)

    • 700 Million Usenet Postings

    • 330 Million Image files

  • March - 3rd party sells advertising based on PageRank scores

  • Ongoing - Accused of censorship and manipulation of ranking algorithms


The Last 12 Months at Google

  • Sept 2 - Access to Google (and Altavista) blocked in China by Chinese Government

  • Sept 11 - Chinese government restores access, but continues to monitor Google

  • Sept. 23 - Re-designed News Service launched

  • December - Froogle launched

  • Year-End Zeitgeist at

    • http://www.google.com/press/zeitgeist2002.html


Google is Good, but here’s a Wish List for Future Improvements

  • Categorization of Results (Folders)

    • Teoma, WiseNut and FAST all do this

  • Nesting

  • Way to limit link: search to external links only

  • Indexing XML documents that have no html equivalents

  • Crawling Deep Web databases

  • Advanced NEWS search

  • OTHERS??????


Thank you and best of luck in Getting MORE from Google!!!

Michael Hunter

Reference Librarian

Hobart and William Smith Colleges

Geneva, NY 14456

(315) 781-3552

