Disambiguation March 7, 2003
Problem • Many people have the same name. • Example: Michael Jordan, basketball star or professor? • Relying on prior knowledge of each sense is not feasible. • Disambiguate based on context instead. • Example (basketball star): Scottie Pippen, Dennis Rodman, Phil Jackson • Example (professor): U.C. Berkeley, David Cohn
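Graph [Figure: a single Michael Jordan node connected to David Cohn, Scottie Pippen, Phil Jackson, U.C. Berkeley, and Dennis Rodman]
Graph [Figure: the Michael Jordan node split in two, one copy grouped with David Cohn and U.C. Berkeley, the other with Scottie Pippen, Phil Jackson, and Dennis Rodman]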
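One way to represent such a graph is as co-occurrence counts between names. The sketch below is an illustration only: the function name `cooccurrence_graph` and the toy documents are assumptions, not something the slides specify.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_graph(docs):
    """Count, for every unordered pair of names, how many documents mention both."""
    edges = Counter()
    for names in docs:
        for a, b in combinations(sorted(set(names)), 2):
            edges[frozenset((a, b))] += 1
    return edges

# Toy documents (hypothetical), each given as the set of names it mentions.
docs = [
    {"Michael Jordan", "Scottie Pippen", "Phil Jackson"},
    {"Michael Jordan", "Scottie Pippen", "Dennis Rodman"},
    {"Michael Jordan", "U.C. Berkeley", "David Cohn"},
]
edges = cooccurrence_graph(docs)
```

Edges are keyed by `frozenset` so that the pair (a, b) and the pair (b, a) share one count.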
Algorithm • Choose the people most relevant to Michael Jordan (MJ). • Relevance is measured by P(MJ | p) for each person p.
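A minimal sketch of the relevance measure, assuming P(MJ | p) is estimated as the fraction of documents mentioning p that also mention MJ. The slides do not say how the probability is estimated, so this estimator, the function name `relevance`, and the toy documents are all assumptions.

```python
from collections import Counter

def relevance(docs, target):
    """Estimate P(target | p): the share of p's documents that also mention target."""
    appears = Counter()    # documents mentioning p
    together = Counter()   # documents mentioning both p and target
    for names in docs:
        names = set(names)
        for p in names - {target}:
            appears[p] += 1
            if target in names:
                together[p] += 1
    return {p: together[p] / appears[p] for p in appears}

# Toy documents (hypothetical).
docs = [
    {"Michael Jordan", "Scottie Pippen", "Phil Jackson"},
    {"Michael Jordan", "Scottie Pippen", "Dennis Rodman"},
    {"Michael Jordan", "U.C. Berkeley", "David Cohn"},
    {"Scottie Pippen", "Dennis Rodman"},
    {"David Cohn"},
]
scores = relevance(docs, "Michael Jordan")
ranked = sorted(scores, key=scores.get, reverse=True)  # most relevant first
```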
Choosing Seed Values • We need a starting point: seeds, people who each correspond to one sense of MJ. • How well do the seeds separate people into camps? • Exhaustive search through all pairs of people.
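Good Seeds [Figure: graph of David Cohn, Scottie Pippen, Phil Jackson, U.C. Berkeley, and Dennis Rodman with a seed pair marked]
Bad Seeds [Figure: graph of David Cohn, Scottie Pippen, Phil Jackson, U.C. Berkeley, and Dennis Rodman with a seed pair marked]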
Choosing Seeds I • Let Sj be the jth sense. Denote S0 as basketball star and S1 as professor (the labels are interchangeable because there is no prior knowledge). • In the exhaustive search, we arbitrarily pick one person to be seed0 and another to be seed1, where seed0 corresponds to S0 and seed1 to S1. • Let P(MJ = S1 | MJ, seed1) = 1 and P(MJ = S0 | MJ, seed1) = 0, and vice versa for seed0. This probability could be wrong, but it is just an arbitrary assignment.
Choosing Seeds II • For each person p and sense Sj: • P(MJ = Sj | MJ, p) = n(seedj, p) P(MJ | seedj) • A person belongs to camp Sj only if P(MJ = Sj | MJ, p) > 0.95. • Use the harmonic mean to score how well seed0 and seed1 assign people to camps.
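The seed search could be sketched as follows. Several details are assumptions the slides leave open: n(seedj, p) is read as a co-occurrence count, the two products are normalised so they sum to 1 before the 0.95 threshold is applied, and the harmonic mean is taken over the two camp sizes. All names and the toy counts are hypothetical.

```python
from itertools import combinations

def sense_posterior(p, seeds, cooc, rel):
    """n(seed_j, p) * P(MJ | seed_j) for each seed, normalised to sum to 1 (assumed)."""
    weights = [cooc.get(frozenset((s, p)), 0) * rel[s] for s in seeds]
    total = sum(weights)
    return [w / total if total else 0.0 for w in weights]

def harmonic_mean(a, b):
    return 2 * a * b / (a + b) if a + b else 0.0

def score_seeds(seeds, people, cooc, rel, threshold=0.95):
    """Harmonic mean of the two camp sizes (assumed reading of the score)."""
    camps = [0, 0]
    for p in people:
        if p in seeds:
            continue
        post = sense_posterior(p, seeds, cooc, rel)
        for j in (0, 1):
            if post[j] > threshold:
                camps[j] += 1
    return harmonic_mean(*camps)

def best_seeds(people, cooc, rel):
    """Exhaustive search over all unordered pairs of people."""
    return max(combinations(people, 2),
               key=lambda s: score_seeds(s, people, cooc, rel))

# Toy data (hypothetical co-occurrence counts and relevance values).
people = ["Scottie Pippen", "Phil Jackson", "Dennis Rodman",
          "David Cohn", "U.C. Berkeley"]
cooc = {
    frozenset(("Scottie Pippen", "Phil Jackson")): 3,
    frozenset(("Scottie Pippen", "Dennis Rodman")): 4,
    frozenset(("Phil Jackson", "Dennis Rodman")): 2,
    frozenset(("David Cohn", "U.C. Berkeley")): 2,
}
rel = {"Scottie Pippen": 0.6, "Phil Jackson": 0.9, "Dennis Rodman": 0.5,
       "David Cohn": 0.5, "U.C. Berkeley": 0.3}
seeds = best_seeds(people, cooc, rel)
```

On this toy data, a pair drawn from the same group (for example Scottie Pippen and Phil Jackson) leaves both camps empty and scores 0, while a pair with one seed from each group fills both camps.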
Iteration I • Now that we have the best seeds, we assign P(MJ = Sj | p) for each person p. • Step 1: Begin with every person except the seeds in the unknown set. • Step 2: For each person in the unknown set and each sense, calculate P(MJ = Sj | p) = P(MJ | p) P(MJ = Sj | MJ, p)
Iteration II • Step 3: For each sense, take the person p with the highest P(MJ = Sj | p) and remove p from the unknown set. • Step 4: Repeat steps 2 and 3 until everyone is out of the unknown set.
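Steps 1 to 4 can be sketched as a greedy loop. The slides do not say how P(MJ = Sj | MJ, p) is recomputed as people leave the unknown set, so the sketch takes it as a callback; the co-occurrence-weighted callback shown here is an assumption, as are all names and the toy data.

```python
def assign_senses(people, seeds, rel, sense_given_mj):
    """Greedy assignment of [P(MJ=S0 | p), P(MJ=S1 | p)] to every person.

    rel[p]                      -- P(MJ | p)
    sense_given_mj(p, j, known) -- P(MJ = S_j | MJ, p), a hypothetical callback
    """
    known = {seeds[0]: [rel[seeds[0]], 0.0],
             seeds[1]: [0.0, rel[seeds[1]]]}
    unknown = sorted(set(people) - set(seeds))            # step 1
    while unknown:                                        # step 4
        # Step 2: score every unknown person for every sense.
        scores = {p: [rel[p] * sense_given_mj(p, j, known) for j in (0, 1)]
                  for p in unknown}
        # Step 3: for each sense, move its top-scoring person out of unknown.
        for j in (0, 1):
            if not unknown:
                break
            best = max(unknown, key=lambda p: scores[p][j])
            known[best] = scores[best]
            unknown.remove(best)
    return known

def make_sense_given_mj(cooc):
    """Assumed callback: share of p's co-occurrence mass that falls on each sense."""
    def sense_given_mj(p, j, known):
        mass = [sum(cooc.get(frozenset((p, q)), 0) * known[q][k] for q in known)
                for k in (0, 1)]
        total = sum(mass)
        return mass[j] / total if total else 0.0
    return sense_given_mj

# Toy data (hypothetical).
people = ["Scottie Pippen", "Phil Jackson", "Dennis Rodman",
          "David Cohn", "U.C. Berkeley"]
cooc = {
    frozenset(("Scottie Pippen", "Phil Jackson")): 3,
    frozenset(("Scottie Pippen", "Dennis Rodman")): 4,
    frozenset(("Phil Jackson", "Dennis Rodman")): 2,
    frozenset(("David Cohn", "U.C. Berkeley")): 2,
}
rel = {"Scottie Pippen": 0.6, "Phil Jackson": 0.9, "Dennis Rodman": 0.5,
       "David Cohn": 0.5, "U.C. Berkeley": 0.3}
known = assign_senses(people, ("Scottie Pippen", "David Cohn"), rel,
                      make_sense_given_mj(cooc))
```

Because each round removes at least one person from the unknown set, the loop always terminates.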
Prediction • Given a link, simply add up the probabilities of all the names in it for each sense. • So MJ in the link is S0 or S1. We don’t know anything about basketball stars or professors.
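A minimal sketch of the prediction step, assuming a link is given as the set of names it contains and that names without an assigned probability contribute nothing. Senses are numbered 0 and 1 to match seed0 and seed1; the function name and the probability table are hypothetical.

```python
def predict_sense(names, prob):
    """Sum the per-sense probabilities over the names in a link; pick the larger total."""
    totals = [0.0, 0.0]
    for p in names:
        for j, value in enumerate(prob.get(p, (0.0, 0.0))):
            totals[j] += value
    sense = max((0, 1), key=lambda j: totals[j])
    return sense, totals

# Hypothetical table of [P(MJ=S0 | p), P(MJ=S1 | p)] for each person p.
prob = {
    "Scottie Pippen": [0.6, 0.0],
    "Dennis Rodman": [0.5, 0.0],
    "David Cohn": [0.0, 0.5],
    "U.C. Berkeley": [0.0, 0.3],
}
sense, totals = predict_sense({"Scottie Pippen", "Dennis Rodman", "David Cohn"}, prob)
```

The result is only a sense index: nothing in the procedure knows that one sense is a basketball star and the other a professor.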
Dataset • Movie database from IMDB • 230,000 actors • 40,000 movies • Randomly pick two actors from those who appeared in 15 or more movies (4000 actors). • Treat them as the same person, run the algorithm, and see which sense each movie belongs to. • Repeat 100 times. • Average accuracy: 75%
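The evaluation setup could be sketched as below, assuming the cast data is a mapping from movie to a set of actor names. The helper names, the merged label, and the toy casts are hypothetical. A movie featuring both actors is labelled sense 0 here, which is an arbitrary choice the slides do not address.

```python
def merge_actors(cast, a, b, merged="MERGED"):
    """Replace actors a and b with one shared name; record the true sense per movie."""
    new_cast, truth = {}, {}
    for movie, actors in cast.items():
        actors = set(actors)
        if a in actors or b in actors:
            truth[movie] = 0 if a in actors else 1   # sense 0 = a, sense 1 = b
            actors = (actors - {a, b}) | {merged}
        new_cast[movie] = actors
    return new_cast, truth

def accuracy(predicted, truth):
    """Fraction of the merged actor's movies assigned to the correct sense."""
    return sum(predicted[m] == s for m, s in truth.items()) / len(truth)

# Toy cast lists (hypothetical).
cast = {"m1": {"A", "X"}, "m2": {"A", "Y"}, "m3": {"B", "Z"}, "m4": {"X", "Z"}}
new_cast, truth = merge_actors(cast, "A", "B")
acc = accuracy({"m1": 0, "m2": 1, "m3": 1}, truth)
```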
Good Example • Blandick__Clara(38) vs Gibson__Henry(19): final score = 0.982456 • 38 out of 38 correct; Blandick__Clara has seed Phelps__Lee • 18 out of 19 correct; Gibson__Henry has seed Davies__John__IV_ • Clara Blandick from 1910s to 1950s • Lee Phelps also from that era, appeared in 6 movies with Clara • Henry Gibson from 1960s to 2000s • John Davies IV also from that era, appeared in 2 movies with Henry
Bad Example • Marsh__Mae(25) vs Moorehead__Agnes(19): final score = 0.500000 • 16 out of 25 correct; Marsh__Mae has seed Morin__Alberto__I_ • 6 out of 19 correct; Moorehead__Agnes has seed Wolfe__Ian • Mae Marsh, Agnes Moorehead, Alberto Morin, and Ian Wolfe all appeared in movies from 1940s to 1970s.
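The reported final scores are consistent with pooling correct assignments over both actors' movies: 38 and 18 correct out of 38 and 19 movies in the good example, 16 and 6 correct out of 25 and 19 in the bad one. This reading is inferred from the numbers, not stated in the slides, and `final_score` is a hypothetical helper.

```python
def final_score(correct, totals):
    """Pooled accuracy over both actors' movies (assumed definition)."""
    return sum(correct) / sum(totals)

good = final_score([38, 18], [38, 19])   # 56 / 57
bad = final_score([16, 6], [25, 19])     # 22 / 44
print(f"{good:.6f} {bad:.6f}")           # prints "0.982456 0.500000"
```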