From the Inside Out Michael Hunter Reference Librarian Hobart and William Smith Colleges

From the Inside Out

Michael Hunter

Reference Librarian

Hobart and William Smith Colleges

Google from the Inside Out
  • Hardware and Database Creation
  • Relevance Ranking and Link Analysis
  • Advanced and “Hidden” Search Features
  • Hands-on Session
  • Pay-for-Placement and Revenue Issues
  • Our Google “Wish List”
  • Other Services to Keep Our Eyes On
Google’s Beginnings
  • 1996: Sergey Brin and Larry Page of Stanford develop “BackRub”, based on analysis of links TO a page from other sites
  • Sept. 7, 1998: Google launches in beta in Menlo Park, CA, with over 10,000 queries a day
  • December 1998: Listed in PC Magazine’s Top 100 Websites
What’s in a name?
  • “Google” is a play on “googol”, a term coined by Milton Sirotta, nephew of mathematician Edward Kasner, to refer to the number one followed by 100 zeros
Google’s Hardware
  • Over 10,000 servers in two locations containing “hundreds of copies of the database”
  • Index of more than 3 billion web documents
  • Handles thousands of queries on a sub-second basis
  • Interviews in MP3 format with Chief Operations Engineer Jim Reese
    • //technetcast.com/tnc_play_stream.html?stream_id=420 (1 hr. 13 min.)
    • //technetcast.com/tnc_play_stream.html?stream_id=421 (15 min.)

Google’s Multi-faceted Database
  • Indexed html pages
  • Unindexed html pages
  • Other file types
  • Html pages that are re-indexed daily
What types of pages are unindexed? (25%)
  • Dead or inaccurate links
  • Duplicate pages
  • Database-generated URLs
  • Pages with robots.txt or noindex meta tags
  • Pages on an intranet
  • Pages “waiting” to be indexed fully
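
The robots.txt exclusion listed above can be illustrated with Python's standard-library parser. This is only a minimal sketch and says nothing about how Google's own crawler works: the rules, the user-agent name `ExampleBot`, and the example.com URLs are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_crawl(url: str, agent: str = "ExampleBot") -> bool:
    """Return True if the parsed rules let `agent` fetch `url`."""
    return parser.can_fetch(agent, url)


print(may_crawl("http://example.com/index.html"))
print(may_crawl("http://example.com/private/report.html"))
```

A crawler that honors these rules would skip everything under /private/, which is one way a URL can be known to a search engine while its content stays unindexed.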
How did they get into Google?
  • Google crawls and downloads links in the documents it encounters
  • Some of these links are dead, inaccurate, or cannot be crawled for other reasons (intranets, robots.txt)
  • The URLs are in the database, but the documents are not
Why does Google leave them in?
  • They are not COMPLETELY unindexed
  • Indexed elements include
    • Words in the URL

http://members.home.net/gourdeaud/

    • Words in the anchor text on indexed pages that link to the unindexed URL

<a href="http://members.home.net/gourdeaud/">Gourdeaud’s biography</a>

  • These elements can be useful in URL searches, unique-term queries, and PageRank
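
The two indexed elements named on this slide, words in the URL and words in the anchor text, can be pulled out of a page with a few lines of Python. This is a rough sketch using only the standard library; the helper names `url_words` and `AnchorCollector` are made up for illustration, and nothing here reflects Google's actual indexing code.

```python
import re
from html.parser import HTMLParser


def url_words(url: str) -> list[str]:
    """Split a URL into lowercase alphanumeric 'words'."""
    return [w.lower() for w in re.findall(r"[A-Za-z0-9]+", url)]


class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.links = []    # finished (href, text) pairs
        self._href = None  # href of the <a> element currently open, if any
        self._text = []    # text pieces seen inside the open <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


page = '<p><a href="http://members.home.net/gourdeaud/">Gourdeaud\'s biography</a></p>'
collector = AnchorCollector()
collector.feed(page)
collector.close()

print(url_words("http://members.home.net/gourdeaud/"))
print(collector.links)
```

Even if the target page itself is never downloaded, the words collected this way are enough to match a query such as the name in the URL.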
How can I distinguish unindexed pages in search results?
  • No extract
  • No page size
  • No cached copy of the page
Deep Web Components: Non-html filetypes (1.75%)
  • SEARCH SYNTAX: “california power shortage” filetype:pdf
  • Adobe Portable Document Format (pdf)
  • Adobe PostScript (ps)
  • Lotus 1-2-3 (wk1, wk2, wk3, wk4, wk5, wki, wk
  • Lotus WordPro (lwp)
  • MacWrite (mw)
  • Microsoft Excel (xls)
  • Microsoft PowerPoint (ppt)
  • Microsoft Word (doc)
  • Microsoft Works (wks, wps, wdb)
  • Microsoft Write (wri)
  • Rich Text Format (rtf)
  • Text (ans, txt)
Google Non-html Filetypes: Warning!
  • FOR NON-HTML FILES
    • Clicking on a title in the results list opens the application as well, involving risk of a virus or worm that may be attached to the file
    • INSTEAD, click the “View as HTML” option; no applications will be opened and no risk of virus or worm
    • NOTE: Titles for non-html files are frequently not descriptive of content
Deep Web Components: Daily re-indexed pages (.15%)
  • Over 3 million
  • Regular html pages that Google has noticed are frequently updated.
  • Google re-indexes these “every day or so”
  • Date of Google’s last visit to the page appears in the results listing
Google’s Database
  • Freshness
  • Breadth
  • Depth
Database Freshness
  • Refreshes its entire web index “on a roughly monthly basis, about every 28 days”.
  • On-going process
  • Some segments fresher than others
Database Breadth (Size)
  • About 3 billion documents (indexed and unindexed)
  • Daily figure on the homepage: 3,083,324,652 on March 8, 2003 (not including Images or Usenet)
  • FAST (alltheweb.com) claimed 2.1 billion indexed documents, March 8, 2003

Database Depth
  • Google “typically” downloads the first 110 K of a web document
  • Download includes URLs of outgoing links
Database “Blending”
  • Results from Google’s News vertical engine are included in results for all searches
  • Blending is increasingly common among search services
    • News
    • Shopping
    • Directory

Relevance Ranking and Link Analysis

Google’s “PageRank”

Demystified

Relevance Ranking
  • Processing and presenting retrieved results
  • Proprietary information
  • Search Engine Optimization Industry has made it even more so
  • “How can I make my site rank high in Google?”
What happens when I enter a search at Google?
  • Check of search syntax and spelling
  • Query routed to the appropriate server “based on the [database] segment on which the answer is likely to be found”
What happens when I enter a search at Google?
  • Processing of Visible text
    • Search term(s) position – title, heading, text
    • Search term(s) frequency
    • Search term(s) proximity
  • Processing of Invisible text
    • Meta tags
    • Anchor text (the clickable text inside the <a> element)

<a href="http://www.hws.edu">Hobart and William Smith Colleges</a>
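
The three visible-text signals on this slide (position, frequency, proximity) can be made concrete with a toy calculation. The sketch below is purely illustrative: the function `term_signals`, its output keys, and the way each signal is measured are assumptions, not Google's scoring method.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_signals(terms, title, body):
    """Return toy position, frequency, and proximity signals for a query."""
    title_tokens = tokenize(title)
    body_tokens = tokenize(body)
    terms = [t.lower() for t in terms]

    # Position: how many query terms appear in the title.
    in_title = sum(1 for t in terms if t in title_tokens)

    # Frequency: total occurrences of the query terms in the body.
    frequency = sum(body_tokens.count(t) for t in terms)

    # Proximity: smallest distance (in words) between the first two terms,
    # or None when there are fewer than two terms or one is absent.
    proximity = None
    if len(terms) >= 2:
        first = [i for i, tok in enumerate(body_tokens) if tok == terms[0]]
        second = [i for i, tok in enumerate(body_tokens) if tok == terms[1]]
        if first and second:
            proximity = min(abs(i - j) for i in first for j in second)

    return {"in_title": in_title, "frequency": frequency, "proximity": proximity}


signals = term_signals(
    ["synchrotron", "radiation"],
    title="Synchrotron Radiation Sources",
    body="Synchrotron light is radiation emitted by a synchrotron.",
)
print(signals)
```

A real engine would combine signals like these with many others; the slide that follows mentions a mix of “about 25 factors”.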

What happens when I enter a search at Google?
  • PageRank link analysis applied
  • Click popularity (Google Toolbar voting data)
  • Link context (Proximity of links to your search term(s) within the document)
  • Final dynamic mix of “about 25 factors”
PageRank Demystified
  • Patented link analysis program
  • Part of Google since its beginnings
  • Objective – To make ranking more of a “human process”
  • Assigns each page in Google a PageRank score, which is dynamic (changeable)
  • Weighs heavily in final ranking of results
PageRank’s Multi-layered processing
  • Layer I
    • Do others think your site is of value as demonstrated by linking to you?

IF SO …

  • Layer II
    • Are these “others” in turn linked to by sites recognized through linkage within “web communities”?
PageRank’s Multi-layered processing
  • A Favorable Ranking Scenario

    • A .com site selling prosthetics, linked TO by
    • a local orthopedic association, in turn linked TO by
    • a national orthopedic group, in turn linked TO by
    • the National Institutes of Health

Visualizing Linkage in Google’s Database with TouchGraph
  • Browser: http://www.touchgraph.com/TGGoogleBrowser.html
  • Instructions: http://www.touchgraph.com/TGGB_FullInstructions.html

How Does Google Identify “Web Communities”?
  • Mutual linkage patterns
  • Metadata elements and keywords found in common
  • Human examination/verification of the quality of key sites within the community
  • Other proprietary factors ???????
PageRank Nitty Gritty
  • Every page of a site can have a PageRank score, not just the main page
  • The value of a link from Site B to Site A is decreased with each additional link from Site B to any other site

Rationale: If Site B has only a few links, each one could be more important than if Site B has hundreds of outgoing links
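
The dilution described in the rationale can be shown with simple arithmetic. The sketch assumes the formula Brin and Page published in 1998, PR(A) = (1 - d) + d × (PR(T1)/C(T1) + …), where C(T) is the number of outgoing links on page T. The function name `link_value` and the 0.85 damping factor are illustrative choices, and Google's production ranking may well differ.

```python
DAMPING = 0.85  # damping factor suggested in the 1998 paper; an assumption here


def link_value(source_rank: float, outgoing_links: int, damping: float = DAMPING) -> float:
    """Rank passed along ONE link from a page that has `outgoing_links` links."""
    if outgoing_links < 1:
        raise ValueError("the source page must have at least one outgoing link")
    return damping * source_rank / outgoing_links


# The same Site B (rank 1.0) passes less through each link as it adds more links.
for n in (1, 10, 100):
    print(n, link_value(1.0, n))
```

Under this formula, a link from a page with a single outgoing link carries ten times the value of a link from an equally ranked page with ten.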

PageRank Nitty Gritty
  • Requires human adjustment in the case of large subject directories and quality lists of links
  • PageRank scoring is a dynamic process always in flux
  • To find a page’s PageRank score, go to the Toolbar and click on the green meter
PageRank Feedback
  • Site A has NO outgoing links, but is linked TO by Site B
  • Site A decides to create a single link to Site B
  • This increases Site B’s PageRank score
  • Site B’s increased score in turn automatically increases Site A’s score
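
The feedback loop above can be reproduced on a tiny graph. This sketch assumes the simplified, unnormalized formula from Brin and Page's 1998 paper, PR(p) = (1 - d) + d × Σ PR(q)/C(q), and adds a hypothetical Site C that links to B so that B has some rank to pass on. Variants that normalize scores or redistribute rank from pages without outgoing links can behave differently for Site A, so treat the result as an illustration of the slide rather than a statement about Google's system.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Iterate PR(p) = (1 - d) + d * sum(PR(q) / C(q)) over pages q linking to p."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    rank = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_rank = {}
        for p in pages:
            # Sum the share of rank that each linking page q sends to p.
            incoming = sum(
                rank[q] / len(targets)
                for q, targets in links.items()
                if p in targets
            )
            new_rank[p] = (1 - damping) + damping * incoming
        rank = new_rank
    return rank


# Before: Site A has no outgoing links. After: Site A links back to Site B.
before = pagerank({"C": ["B"], "B": ["A"], "A": []})
after = pagerank({"C": ["B"], "B": ["A"], "A": ["B"]})

for site in ("A", "B"):
    print(site, round(before[site], 4), "->", round(after[site], 4))
```

In this model both scores rise once A links back to B, because rank that previously stopped at A now circulates between the two sites.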
Sounds easy to manipulate…
  • Possibilities include
    • Spam
    • Link “farms”
    • Cloaking (sneaky re-directs)
  • Google is vigilant
  • If Google detects any manipulation of PageRank, it eliminates the domain from its database and never crawls there again.
PageRank Processing
  • How does Google know who has linked to Site A, for example?
  • By searching its database for all sites with links to Site A
  • No way to do this by examining Site A, as there is no physical change to a document when it is linked TO
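
Finding everything that links TO a page means inverting the outgoing-link data gathered during the crawl. A minimal sketch of that inversion follows; the site names and the `build_backlink_index` helper are hypothetical, and a real search engine would do this at a vastly larger scale.

```python
from collections import defaultdict


def build_backlink_index(outgoing):
    """Invert {page: [pages it links to]} into {page: {pages linking to it}}."""
    backlinks = defaultdict(set)
    for source, targets in outgoing.items():
        for target in targets:
            backlinks[target].add(source)
    return backlinks


# Hypothetical crawl data: each key is a crawled site, each value its outgoing links.
crawl = {
    "site-b.example": ["site-a.example", "site-c.example"],
    "site-c.example": ["site-a.example"],
    "site-a.example": [],
}

backlinks = build_backlink_index(crawl)
print(sorted(backlinks["site-a.example"]))
```

Site A's own entry is empty, yet the inverted index still reports who links to it, which is why the answer has to come from the database rather than from Site A.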
Implications of PageRank
  • PageRank is entirely dependent on linkage data derived from the Google database
  • Breadth, depth and freshness of the crawl are critical to accurate and current data for PageRank scoring
A Different Perspective on PR: Anti-Google
  • Daniel Brandt claims
    • “PageRank discriminates against new web sites” (which may not yet be linked to by other sites).
    • “Careless custodian of private information” (Google associates each search with a cookie, set to last 36 years)
    • Maintains googlewatch.org
PageRank: A Summary

All links are not created equal

  • Is this site linked TO by “good” web pages associated with this topic?
  • EXAMPLE: If a page is linked to by a subject directory (Yahoo, Open Directory, LII), its rank will be higher than another page with many links from personal web pages, link “farms”, etc.
  • NOTE: Link Analysis (PageRank) is not the same as Link Popularity (number of links)

Searching Google: Touring the Known and the Unknown

Please share your discoveries with us!

Command Searching with Google’s Fields (aka Search Operators)
  • Field Searches that cannot be combined with other search elements:
  • NOTE: No space allowed between operator and following text
    • cache: retrieves cached version of the specified URL
    • link: retrieves pages that have links to the specified URL
    • related: retrieves pages that are “similar” to the specified URL (same as Similar Pages feature in results listing)
Command Searching with Google’s Fields (aka Search Operators)
  • Field Searches that cannot be combined with other search elements:
    • info: retrieves information that Google has about the specified URL
    • stocks: retrieves stock information about the companies whose ticker symbols follow the stocks: operator

stocks:intc (Intel)

Command Searching with Google’s Fields (aka Search Operators)
  • Field Searches that can be combined with other search elements:
    • site: restrict results to those from the specified domain

site:www.google.com PageRank

NOTE: retrieves all pages from www.google.com that contain PageRank anywhere

Field Searches that can be combined with other search elements:
  • allintitle: restrict results to those with all terms present in the html title element

allintitle:synchrotron radiation

  • intitle: restrict results to those with this single term in the title element

intitle:synchrotron intitle:radiation

NOTE: intitle:synchrotron radiation retrieves synchrotron in title and radiation anywhere

Field Searches that can be combined with other search elements:
  • allinurl: restrict results to those with all terms present in the URL

Note: ignores all punctuation

allinurl:usda pesticides

  • inurl: restrict results to those with this single term in the URL

inurl:usda inurl:pesticides

NOTE: inurl:usda pesticides retrieves usda in URL and pesticides anywhere
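
The difference between the single-term operators and their all-terms counterparts is easy to get wrong when typing queries by hand. The small helpers below, whose names are invented for this sketch, build the query strings shown on these slides and keep the operator attached to its term with no space.

```python
def field_query(operator: str, terms: list[str]) -> str:
    """Repeat a single-term operator (e.g. intitle, inurl) for every term."""
    # No space is allowed between the operator's colon and the term.
    return " ".join(f"{operator}:{term}" for term in terms)


def site_query(domain: str, text: str) -> str:
    """Restrict a free-text query to one domain with the site: operator."""
    return f"site:{domain} {text}"


print(field_query("intitle", ["synchrotron", "radiation"]))
print(field_query("inurl", ["usda", "pesticides"]))
print(site_query("www.google.com", "PageRank"))
```

Repeating the operator for each term gives the same restriction as allintitle: or allinurl: while leaving room to add unrestricted terms to the query.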

Google Answers
  • Fee-based answer service
  • User sets fee ($2.50-up) and time frame for question (Guidelines offered)
  • Searchable archive available
  • Comments can be added (by anyone) to unanswered questions
  • Users rate answers
Google Answers: Who are the “researchers”?
  • Must be 18 years old
  • Write an essay on why you want to be a researcher
  • Answer 5 sample questions
  • Training manual available at http://answers.google.com/answers/researchertraining.html

Google’s API: Application Program Interface
  • Free programs for developers and researchers interested in incorporating Google in their applications
    • Iterative searches on a topic (SDI)
    • Search via non-html interfaces
    • Games that play with Web information
  • Daily limit of 1,000 queries
  • Uses SOAP (Simple Object Access Protocol), which is XML-based
  • More at //google.com/apis/index.html
Froogle
  • New Service launched in Dec, 2002
  • Locates information about products for sale online
  • Gives URL’s of sites offering the item
  • Provides links to exact page in the site where you can make the purchase
Froogle
  • Ranking follows normal Google ranking processes
  • Paid placements always clearly marked
  • “Sort by price” may be a future enhancement
  • Access at http://froogle.google.com or via Google Advanced Search
Google’s Hidden Features
  • Daterange search
  • “Wildcard” words
  • Phonebook command search
  • info: field search
  • “Dictionary” feature
Daterange
  • Not officially supported at google.com (unreliable)
  • Reliable only through API programs
  • At google.com, MAY be most reliable for the past 1 or 2 days
  • Searches the date of the document’s entry into the database, not its creation.
Daterange Search Results (???): each day’s entries for a “dog” search executed on Oct. 9
  • Oct. 9 No hits
  • Oct. 8 6
  • Oct. 7 “about 212,000” many dated 10/7
  • Oct. 6 “about 8980” many dated 10/7
  • Oct. 5 “about 5900” many dated 10/7
  • Oct. 6-7 “about 57,100” !!!!

NOT TRUE DATERANGE FUNCTIONALITY

With those caveats …..
  • Daterange uses Julian dates (Julian Day numbers, not the Julian calendar): a continuous count of days since noon UTC on Jan. 1, 4713 BC
  • Date changes at noon, not midnight
    • 2452564 = 12:00 noon Oct. 16 to 11:59 am Oct. 17, 2002 (UTC)
  • Often used in astronomical and military contexts
  • JD converter: //aa.usno.navy.mil/data/docs/JulianDate.html
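
For readers who would rather compute the day number than look it up, Python's standard library is enough. The sketch below returns the Julian Day number that begins at noon UTC on a given calendar date and then builds a daterange: query from it. The function names are invented here, and which integer Google expects for a given civil date is an assumption; because the day changes at noon, a single calendar day overlaps two Julian Day numbers.

```python
from datetime import date

# date.toordinal() counts days from 0001-01-01 (= 1) in the proleptic
# Gregorian calendar; adding this offset gives the Julian Day number
# that starts at noon UTC on the given date.
ORDINAL_TO_JDN = 1721425


def julian_day_number(d: date) -> int:
    """Julian Day number that begins at noon UTC on calendar date `d`."""
    return d.toordinal() + ORDINAL_TO_JDN


def daterange_query(terms: str, start: date, end: date) -> str:
    """Build a daterange: query string from two calendar dates."""
    return f"{terms} daterange:{julian_day_number(start)}-{julian_day_number(end)}"


print(julian_day_number(date(2000, 1, 1)))  # 2451545, the J2000 reference day
print(daterange_query("news", date(2000, 1, 1), date(2000, 1, 2)))
```

The J2000 reference day (noon on Jan. 1, 2000 = JD 2451545) is a convenient check that the offset is right.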

Daterange Search for Oct 14: news daterange:2452561-2452561 (4,450 hits)

Phonebook Command Search
  • Searches US residential (rphonebook:) and business (bphonebook:) listings of Yahoo, MapQuest and other services
  • rphonebook:
    • MUST INCLUDE
      • Last name
      • City and/or State
    • MAY INCLUDE
      • First name
  • bphonebook:
    • MUST INCLUDE
      • Business name (min. 1 word)
      • City and/or State
    • MAY INCLUDE
      • Full Business name
Wildcard Words
  • Google offers a word-sized asterisk to function as a wildcard
  • Stands for a whole word
  • Cannot be used for part of a word
    • “three * mice” = 22,000
    • “three bl* mice” = 0
Wildcard Words
  • Several * can be used together

milosevic “International * * Hague”

Retrieves military tribunal OR military court OR war tribunal
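
The whole-word behavior of the asterisk can be imitated with a regular expression, which makes the rule easy to test locally. This is only an analogy for how the wildcard behaves in a phrase search; the function name is made up, and Google's matching is certainly not implemented this way.

```python
import re


def wildcard_phrase_to_regex(phrase: str) -> re.Pattern:
    """Compile a phrase where each standalone * stands for exactly one word."""
    parts = [r"\w+" if word == "*" else re.escape(word) for word in phrase.split()]
    return re.compile(r"\b" + r"\W+".join(parts) + r"\b", re.IGNORECASE)


pattern = wildcard_phrase_to_regex("three * mice")
print(bool(pattern.search("Three blind mice, see how they run")))
print(bool(pattern.search("three mice")))

# An asterisk inside a word is treated literally, so it does not expand.
partial = wildcard_phrase_to_regex("three bl* mice")
print(bool(partial.search("three blind mice")))
```

As on the slide, the standalone asterisk fills in for one full word, while an asterisk attached to part of a word matches nothing.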

info:
  • Not exactly hidden, but not well-known
  • Searches for any information Google has about a site
  • Convenient way to monitor linkage
Dictionary Feature
  • Term(s) in a query for which Google has definitions are underlined in the text above the results listing (“Searched the Web for …”)
  • Clicking on the term(s) sends you to the dictionary provider (you leave Google).
  • Definitions are provided from sources “selected solely on the basis of quality”
A Few Good Alternatives to Google
  • FAST - //alltheweb.com
  • Teoma - //teoma.com
  • Gigablast - //gigablast.com
Revenue at Google: Selling Search Software
  • Provides search software and interface for portals and corporate intranets (“Powered by Google”)
  • Over 150 customers worldwide (Yahoo, Sony, AOL/Netscape, Cisco Systems)
  • Google charges an initial set-up fee and a charge per 1,000 searches
Revenue at Google: Advertising: AdWords
  • Ads located to the right of search results
  • Cost-per-click model (pay only if someone actually clicks into your site from Google)
  • No monthly minimum charge
Revenue at Google: Advertising: AdWords
  • Highest bidder does NOT take top placement
  • Google measures number of visitors to an advertiser’s site and length of visits
  • This popularity-based relevance helps determine position of an ad
  • Offers smaller businesses a chance to compete for visibility
Revenue at Google: Premium Sponsorships
  • Launched in mid-2002
  • Advertisers purchase keywords or phrases
  • Limited to no more than two sites per keyword or phrase
  • Highest bidder’s site appears at the top of results listing, labeled Sponsored Site
$$$ and Ranking: A Mini-Glossary
  • Pay-for-Placement
    • Paying for a specific position within search results retrieved using specific search terms
  • Pay-for-Inclusion
    • Paying for inclusion anywhere within search results retrieved using specific search terms
  • Pay-for-Submission
    • Paying to be included in the database (no special ranking treatment)
  • To date, no pay for inclusion or submission at Google
Revenue at Google: The Professional’s View
  • To date, advertising clearly labeled at Google
  • If revenues decline, database size and quality may be affected
  • Development and support of search features and enhancements will be driven by commercial sector
  • Change in ownership can alter the nature and educational value of any search service
The Last 12 Months at Google
  • Dec. 2001 - Database is at 3 billion:
    • 2 Billion Web documents (all types)
    • 700 Million Usenet Postings
    • 330 Million Image files
  • March - 3rd party sells advertising based on PageRank scores
  • Ongoing - Accused of censorship and manipulation of ranking algorithms
The Last 12 Months at Google
  • Sept. 2 - Access to Google (and AltaVista) blocked in China by Chinese Government
  • Sept 11 - Chinese government restores access, but continues to monitor Google
  • Sept. 23 - Re-designed News Service launched
  • December - Froogle launched
  • Year-End Zeitgeist at
    • http://www.google.com/press/zeitgeist2002.html
Google is Good, but here’s a Wish List for Future Improvements
  • Categorization of Results (Folders)
    • Teoma, WiseNut, and FAST all do this
  • Nesting
  • Way to limit link: search to external links only
  • Indexing XML documents that have no html equivalents
  • Crawling Deep Web databases
  • Advanced NEWS search
  • OTHERS??????

Thank you and best of luck in Getting MORE from Google!!!

Michael Hunter

Reference Librarian

Hobart and William Smith Colleges

Geneva, NY 14456

(315) 781-3552 [email protected]
