Grouping Search-Engine Returned Citations for Person Name Queries




Presentation Transcript



Grouping Search-Engine Returned Citations for Person Name Queries

Reema Al-Kamha

Research Supported by NSF



The Problem

  • Search engines return too many citations.

    • Example: “Kelly Flanagan”.

    • Google returns around 685 citations.

  • Many people named “Kelly Flanagan”

    • It would help to group the citations by person.

    • How do we group them?



“Kelly Flanagan” Query to Google



Our Solution

  • A Multi-faceted approach

    • Attributes

    • Links

    • Page Similarity

  • Confidence matrix for each facet

  • Final confidence matrix

  • Grouping algorithm



A Multi-faceted Approach

  • Gather evidence from each of several different facets

  • Combine the evidence



Attributes

  • Phone number, email address, state, city, zip code.

  • Regular expression for each attribute.
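The attribute facet can be sketched as one pattern per attribute. The slides do not give the actual expressions, so the patterns below are illustrative assumptions, as are the names `ATTRIBUTE_PATTERNS`, `extract_attributes`, and `shared_attributes`. City and state are left out because they would likely need a list of place names rather than a simple pattern.

```python
import re

# Illustrative patterns only; the presentation does not show the real ones.
ATTRIBUTE_PATTERNS = {
    "phone": re.compile(r"\(?\b\d{3}\)?[-.\s]\d{3}[-.\s]\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "zip": re.compile(r"\b\d{5}(?:-\d{4})?\b"),
}


def extract_attributes(text):
    """Return the set of values found for each attribute in a page's text."""
    return {name: set(p.findall(text)) for name, p in ATTRIBUTE_PATTERNS.items()}


def shared_attributes(text_a, text_b):
    """Return the attribute values that two pages have in common."""
    a, b = extract_attributes(text_a), extract_attributes(text_b)
    return {name: a[name] & b[name] for name in a if a[name] & b[name]}


# Hypothetical page snippets, made up for this example.
page_a = "Contact: kelly@example.edu, (801) 555-1234, Provo, UT 84604"
page_b = "Office in Provo, UT 84604. Email kelly@example.edu"
shared = shared_attributes(page_a, page_b)
```

Here the two snippets share an email address and a zip code but no phone number, so only those two attributes count as evidence.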



Links

  • People usually post information on only a few host servers.

    • Evidence: two returned citations have the same host.

  • People often link one page about a person to another page about the same person.

    • Evidence: the URL of one citation has the same host as one of the URLs found on the web page referenced by the other citation.
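Both kinds of link evidence reduce to comparing host names. The sketch below uses Python's standard `urllib.parse`; the function names and the example URLs are hypothetical, and how the outgoing links of a page are collected is not covered by the slides, so they are passed in as a plain list.

```python
from urllib.parse import urlparse


def host_of(url):
    """Return the lower-cased host of a URL, or '' if it has none."""
    return (urlparse(url).hostname or "").lower()


def same_host(url_a, url_b):
    """First kind of evidence: both citations live on the same host."""
    host = host_of(url_a)
    return host != "" and host == host_of(url_b)


def links_to_host(outgoing_links, other_url):
    """Second kind: a page links to the host of the other citation."""
    target = host_of(other_url)
    return target != "" and any(host_of(link) == target for link in outgoing_links)


# Hypothetical URLs, made up for this example.
c1 = "http://cs.example.edu/~kelly/"
c2 = "http://CS.example.edu/faculty/kelly.html"
c1_links = ["http://lab.example.edu/people"]
```

With these values `same_host(c1, c2)` holds, while `links_to_host(c1_links, c2)` does not, because the only outgoing link points at a different host.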



Links (Cont)



Page Similarity

“adjacent cap-word pairs”:

Cap-Word (Connector | Preposition (Article)? | (Capital-LetterDot))? Cap-Word.



Page Similarity

  • Evidence: the number of adjacent cap-word pairs that two pages share (1, 2, 3, or 4 or more).

  • Ignore adjacent cap-word pairs that occur on many web pages (such as Home Page and Privacy Policy) by constructing a stop-word list.
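A minimal sketch of the page similarity facet, assuming a simplified version of the adjacent cap-word pattern. The word lists for connectors, prepositions, and articles, the two-entry stop list, and all names in the code are assumptions chosen for illustration. A lookahead is used so that overlapping pairs such as Desert Real and Real Estate are both found.

```python
import re

CAP = r"[A-Z][a-z]+"
# Optional middle part: connector, preposition (plus article), or an initial.
MIDDLE = r"(?:and|or|&|(?:of|in|at|for|on)(?: (?:the|a|an))?|[A-Z]\.)"
PAIR = re.compile(rf"\b(?=({CAP}) (?:{MIDDLE} )?({CAP})\b)")

# Assumed stop list; the slides name only these two examples.
STOP_PAIRS = frozenset({("Home", "Page"), ("Privacy", "Policy")})


def cap_word_pairs(text, stop_pairs=STOP_PAIRS):
    """Return the adjacent cap-word pairs in text, minus the stop pairs."""
    pairs = set()
    for match in PAIR.finditer(text):
        pair = (match.group(1), match.group(2))
        if pair not in stop_pairs:
            pairs.add(pair)
    return pairs


def share_level(text_a, text_b):
    """Number of shared pairs, capped at 4 to mean '4 or more'."""
    return min(len(cap_word_pairs(text_a) & cap_word_pairs(text_b)), 4)


# Hypothetical page texts, made up for this example.
page_a = "Kelly Flanagan is an Associate Professor at Brigham Young University. Home Page"
page_b = "Brigham Young faculty: Associate Professor Brent E. Nelson. Privacy Policy"
shared = cap_word_pairs(page_a) & cap_word_pairs(page_b)
```

The two texts share Associate Professor and Brigham Young, so `share_level(page_a, page_b)` is 2. Home Page and Privacy Policy never count because they are on the stop list.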



Confidence Matrix Construction

  • For each facet we construct a confidence matrix.

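$$
C_{ij} =
\begin{cases}
0 & \text{if there is no evidence for facet } f \\
P(C_i \text{ and } C_j \text{ refer to the same person} \mid \text{evidence for facet } f) & \text{otherwise}
\end{cases}
$$

We use a training set to compute the conditional probabilities.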



Confidence Matrix Construction (Cont)

We select 9 person names.

For each name we collect the first 50 citations.

For 50 citations we have 1,225 comparison pairs.

The size of our training set is 11,025.
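The counts above follow from comparing every citation with every other citation once. A short check, with hypothetical citation labels:

```python
from itertools import combinations
from math import comb

# 50 citations per name, labelled C1..C50 for illustration.
citations = [f"C{i}" for i in range(1, 51)]

# Every unordered pair of distinct citations is one comparison.
pairs = list(combinations(citations, 2))
pairs_per_name = len(pairs)

# 9 person names, each contributing the same number of pairs.
training_size = 9 * pairs_per_name
```

This gives 50 * 49 / 2 = 1,225 pairs per name and 9 * 1,225 = 11,025 pairs in total.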



Confidence Matrix Construction (Cont)

For attribute facet

P(Same Person = “Yes” | Email = “Yes”)

P(Same Person = “Yes” | City = “Yes” and State = “Yes”)

For link facet

P(Same Person = “Yes” | Host1 = “Yes” and Host1 is non-popular)

For page similarity facet

P(Same Person = “Yes” | Share2 = “Yes”)
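One plausible way to obtain such a probability is as a relative frequency over the labelled training pairs. This is an assumption about the estimation method, which the slides do not spell out. The record layout, the function name, and the four toy records below are made up and are not data from the study.

```python
def conditional_probability(pairs, evidence):
    """Estimate P(same person | evidence) as a relative frequency.

    Returns 0.0 when no training pair shows the evidence.
    """
    matching = [p for p in pairs if evidence(p)]
    if not matching:
        return 0.0
    return sum(p["same_person"] for p in matching) / len(matching)


# Toy labelled pairs (hypothetical, for illustration only).
toy_pairs = [
    {"email": True, "same_person": True},
    {"email": True, "same_person": True},
    {"email": True, "same_person": True},
    {"email": True, "same_person": False},
    {"email": False, "same_person": False},
]

p_email = conditional_probability(toy_pairs, lambda p: p["email"])
```

In the toy data four pairs share an email address and three of them refer to the same person, so the estimate is 0.75.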



Confidence Matrix for Attribute Facet

C1 and C2 have the same city, state, and zip code, which are “Provo”, “UT”, and “84604”.

C1 and C8, and C2 and C8, have the same city and state, which are “Provo” and “UT”.

C4 and C7 have the same city and state, which are “Palm Desert” and “California”.



Confidence Matrix for Link Facet

C1 and C2 have the same host name, and C1 refers to the host of C2.

C5 and C6 have the same host name.

C3 refers to the host of C5, and C3 refers to the host of C6.



Confidence Matrix for Page Similarity Facet

C1 and C2 share Associate Professor, Brigham Young, Performance Evaluation, Trace Collection, Computer Organization, Computer Architecture.

C2 and C3 share Memory Hierarchy, Brent E. Nelson, System-Assisted Disk, Simulation Technique, Stochastic Disk, Winter Simulation, Chordal Spoke, Interconnection Network, Transaction Processing, Benchmarks Using, Performance Studies, Incomplete Trace, Heng Zho.

C1 and C8, and C2 and C8, share Brigham Young.

C4 and C7 share Palm Desert, Real Estate, Desert Real.



Final Matrix

  • Combine the confidence matrices for the three facets using Stanford Certainty Measure.

  • For some observation B, if CF(E1) is the certainty factor associated with E1 and CF(E2) is the certainty factor associated with E2, then the new certainty factor for B is:

    CF(E1) + CF(E2) - CF(E1) * CF(E2).



Final Matrix (Cont)

Confidence Matrix for Attributes

Confidence Matrix for Links

Confidence Matrix for Page Similarity

0.96 + 0 + 0.78 - 0.96 * 0 - 0.96 * 0.78 - 0.78 * 0 + 0.96 * 0 * 0.78 = 0.9912
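The three-facet sum above is what results from applying the two-factor rule CF(E1) + CF(E2) - CF(E1) * CF(E2) repeatedly. A small sketch that reproduces the arithmetic (the function names are ours):

```python
from functools import reduce


def combine(cf_a, cf_b):
    """Combine two certainty factors with the two-factor rule."""
    return cf_a + cf_b - cf_a * cf_b


def combine_all(factors):
    """Fold the rule over any number of facets, starting from no evidence."""
    return reduce(combine, factors, 0.0)


# Attribute, link, and page similarity confidences for one pair of citations.
final = combine_all([0.96, 0.0, 0.78])
```

Up to floating-point rounding, `final` is 0.9912. A facet with confidence 0 leaves the result unchanged, and the combined value is never lower than the strongest single facet.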



Final Confidence Matrix



Grouping Algorithm

  • Input: the final confidence matrix.

  • Output: groups of search engine returned citations, such that each group refers to the same person.

  • The idea is:

    If we are highly confident that {Ci, Cj} refer to the same person and that {Cj, Ck} refer to the same person, then we group {Ci, Cj, Ck}.

    The threshold we use for “highly confident” is 0.8.


Grouping Algorithm (Cont)

Highly confident pairs: {C1, C2}, {C2, C3}, {C3, C5}, {C3, C6}, {C4, C7}, {C1, C8}, {C2, C8}

Group 1: {C1, C2, C3, C5, C6, C8}, Group 2: {C4, C7}, Group 3: {C9}, Group 4: {C10}
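Chaining confident pairs in this way amounts to finding connected components, which can be sketched with a union-find structure. This is one possible implementation, not necessarily the one used in the work. The confidence value 0.9 is a placeholder because the slide lists only the pairs, and treating the 0.8 threshold as inclusive is an assumption.

```python
def group_citations(n, confidence, threshold=0.8):
    """Group citations 1..n by chaining pairs at or above the threshold."""
    parent = list(range(n + 1))  # index 0 is unused

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), value in confidence.items():
        if value >= threshold:  # assumed inclusive
            parent[find(i)] = find(j)

    groups = {}
    for c in range(1, n + 1):
        groups.setdefault(find(c), []).append(c)
    return sorted(groups.values())


confident_pairs = [(1, 2), (2, 3), (3, 5), (3, 6), (4, 7), (1, 8), (2, 8)]
confidence = {pair: 0.9 for pair in confident_pairs}  # placeholder values
confidence[(9, 10)] = 0.3  # placeholder for a pair below the threshold

groups = group_citations(10, confidence)
```

For these pairs the result is the four groups listed above: C1, C2, C3, C5, C6, C8 together, C4 with C7, and C9 and C10 each alone.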



Experimental Results

  • We choose 10 arbitrary, different names.

  • For each name we get the first 50 returned citations.

  • The size of the test set is 500.

  • Use split and merge measures.

    • Consider 8 returned citations: C1, C2, C3, C4, C5, C6, C7, C8.

    • The correct grouping result:

      Group 1: {C1, C2, C4, C6, C7}, Group 2: {C3, C8}, Group 3: {C5}

    • The grouping result of our system:

      Group 1: {C1, C2, C4}, Group 2: {C3, C6, C7}, Group 3: {C5, C8}

    • The number of splits is 0 + 1 + 1 = 2.

    • The total number of merges is 2.

    • We normalize the split and merge scores.
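The slides do not define the two measures, so the sketch below uses a definition inferred from the example: a split separates a system group that mixes citations from different correct groups, and a merge joins the resulting pieces that belong to the same correct group. Under that assumption the code reproduces the counts above. Normalization is omitted because its formula is not given.

```python
def count_splits_and_merges(correct, system):
    """Count the splits and merges needed to turn system into correct."""
    label = {c: idx for idx, group in enumerate(correct) for c in group}
    splits = 0
    pieces = [0] * len(correct)  # pieces of each correct group after splitting

    for group in system:
        labels = {label[c] for c in group}
        splits += len(labels) - 1  # one split per extra correct group
        for idx in labels:
            pieces[idx] += 1

    merges = sum(n - 1 for n in pieces if n > 0)
    return splits, merges


correct = [{1, 2, 4, 6, 7}, {3, 8}, {5}]
system = [{1, 2, 4}, {3, 6, 7}, {5, 8}]
splits, merges = count_splits_and_merges(correct, system)
```

System groups 2 and 3 each mix two correct groups, giving 0 + 1 + 1 = 2 splits. The pieces {C1, C2, C4} and {C6, C7} then need one merge, as do {C3} and {C8}, giving 2 merges.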



Experimental Results (Cont)

Official College, Sports Network, Student Advantage.


Cases that Caused Missing Merges: Attributes Facet

  • No shared attributes.

    • 1,030 pairs (out of 1,036 pairs) in 41 groups for “Larry Wild”.

  • Only the value of the attribute State is shared.

    • 6 pairs in 41 groups for “Larry Wild”.


Techniques Used to Judge in Case of No Evidence or Weak Evidence



Conclusions

  • The multi-faceted approach is useful: it yields a low normalized split score (0.004) and a low normalized merge score (0.014).

  • No individual facet scored better than using all facets together.



Contributions

  • Grouped the citations returned for person-name queries by person.

  • Provided an additional tool for search engine queries.

