
Web Information Extraction Learning based on Probabilistic Graphical Models

Web Information Extraction Learning based on Probabilistic Graphical Models. Wai Lam, joint work with Tak-Lam Wong, The Chinese University of Hong Kong. Introduction. Building advanced Web mining applications requires precise text information extraction from a large number of different Web sites.





Presentation Transcript


  1. Web Information Extraction Learning based on Probabilistic Graphical Models Wai Lam Joint work with Tak-Lam Wong The Chinese University of Hong Kong

  2. Introduction • Building advanced Web mining applications requires precise text information extraction from a large number of different Web sites. • Substantial human effort is needed for the information extraction task, due to: • diverse layout formats • content variation

  3. Wrapper Adaptation Problem (1)

  4. Wrapper Adaptation Problem (2) (Diagram: wrapper learning produces a learned wrapper.)

  5. Product Attribute Extraction and Resolution Problem (1) • The Web contains a huge number of online stores selling millions of different kinds of products.

  6. Product Attribute Extraction and Resolution Problem (2) • Traditional search engines typically treat every term in a Web document in a uniform fashion. • Consider the digital camera domain. Suppose a user supplies the query “auto white balance”, trying to find cameras related to the product attribute “white balance”. • A possible result is “auto ISO”, which is about “light sensitivity”, a product attribute different from “white balance”.

  7. Product Attribute Extraction and Resolution Problem (3) • A related desirable task is to resolve the extracted data according to their semantics. • This can improve the indexing of product Web pages and support intelligent tasks such as product search or product matching.

  8. Our Approach • We have investigated learning frameworks for solving each of the Web information extraction tasks just presented. • Probabilistic graphical models provide a principled paradigm for handling the uncertainty that arises during learning. • A graphical model capturing information extraction knowledge for solving wrapper adaptation (ACM TOIT 2007). • A graphical model for unsupervised learning to extract and resolve product attributes (SIGIR 2008).

  9. Motivating Example (Source: http://www.crayeon3.com) (Source: http://www.superwarehouse.com)

  10. Product Attribute Extraction • To extract product attributes: • In the beginning, only the attribute “resolution” is known. • Effective sensor resolution • Layout format • White balance, shutter speed • Mutual cooperation • Light sensitivity

  11. Product Attribute Resolution • Samples of extracted text fragments from a page: • cloudy, daylight, etc. • What do they refer to? • A text fragment extracted from another page: • white balance auto, daylight, cloudy, tungsten, … • Product attribute resolution: • cluster text fragments referring to the same attribute into the same group • better indexing for product search • easier understanding and interpretation

  12. Existing Works (Supervised Learning) • Supervised wrapper learning (Chang et al., IEEE TKDE 2006) • requires training examples. • The wrapper learned from a Web site cannot be applied to other sites. • Template-independent extraction (Zhu et al., SIGKDD 2007) • cannot handle previously unseen attributes.

  13. Existing Works (Unsupervised Learning) • Methods that handle Web pages generated from the same template (Crescenzi et al., VLDB 2001): • data may not be synchronized, e.g., “Aug 1993 $16.38” extracted from one page vs. “Paperback Feb 1985 $6.95” extracted from another. • Synchronized data extraction (Chuang et al., VLDB 2007): • requires a field model (an HMM) for each field, together with manually prepared training examples; • can only be applied to Web pages that contain multiple records.

  14. Our Framework • An unsupervised learning framework for jointly extracting and resolving product attributes from different Web sites (SIGIR 2008). • Our framework consists of a graphical model which considers page-independent content information and page-dependent layout information. • It can extract an unlimited number of product attributes (via a Dirichlet process prior). • The resolved product attributes can be used for other intelligent tasks such as product search (AAAI 2008).

  15. Problem Definition (1) • A product domain, e.g., the digital camera domain. • A set of reference attributes, e.g., “resolution”, “white balance”, etc. • A special element representing “not-an-attribute”. • A collection of Web pages from arbitrary Web sites, each of which contains a single product. • Any text fragment from a Web page is a candidate for extraction.

  16. Problem Definition (2) A text fragment is the content between two line separators (here, the <TR> tags) in the HTML source: <TR> <TD> <P> <SPAN> White balance </SPAN> </P> </TD> <TD> <P> <SPAN> Auto, daylight, cloudy, tungsten, fluorescent, fluorescent H, custom </SPAN> </P> </TD> </TR>
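The parsing step above can be sketched in code. The following is a minimal illustration (not the authors' implementation), using Python's standard html.parser to split a product-specification table row into a candidate attribute-value pair:

```python
from html.parser import HTMLParser

# Sketch: collect the text of each <TD> cell in a table row, yielding a
# candidate (attribute, value) pair such as ("White balance", "Auto, ...").
class RowExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.cells = []          # collected text for each <TD> cell
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":          # HTMLParser lowercases tag names
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell and data.strip():
            self.cells[-1] += data.strip()

row = ("<TR><TD><P><SPAN>White balance</SPAN></P></TD>"
       "<TD><P><SPAN>Auto, daylight, cloudy, tungsten</SPAN></P></TD></TR>")
parser = RowExtractor()
parser.feed(row)
attribute, value = parser.cells
print(attribute, "->", value)
```

In the actual problem setting, each such pair is only a candidate; whether it is really an attribute-value pair is what the model must decide.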

  17. Problem Definition (3) Example text fragment: “White balance Auto, daylight, …” • Content information: the text “White balance Auto, daylight, …” • Layout information: boldface, in-table • Target information: 1 (related to an attribute) • Attribute information: “white balance”

  18. Problem Definition (4) Example text fragment: “View larger image” • Layout information: boldface, underline • Target information: 0 (irrelevant) • Attribute information: not-an-attribute

  19. Problem Definition (5) • Attribute extraction: predict the target information of a text fragment (whether it is related to an attribute) from its layout and content information. • Attribute resolution: predict the attribute information (which reference attribute the fragment refers to). • Joint attribute extraction and resolution: predict both simultaneously.

  20. Graphical Models (1) • A graphical model is a family of probability distributions defined in terms of a directed or undirected graph. • Nodes: random variables. • Joint distribution: a product of functions defined on connected sets of nodes. • Graphical models provide general algorithms for computing marginal and conditional probabilities of interest. • They also provide control over the computational complexity associated with these operations.

  21. Graphical Models (2) • One kind of graphical model is based on a directed graph. • Let G = (V, E) be a directed acyclic graph, where V are the nodes and E are the edges. • Denote by pa(v) the parents of node v. • Denote by {X_v : v in V} the collection of random variables indexed by the nodes. • The joint probability distribution is expressed as: p(x_1, …, x_N) = prod_{v in V} p(x_v | x_pa(v))
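The factorization over parents can be checked numerically. In this toy sketch (the chain A → B → C and all probabilities are invented for illustration, not from the slides), the joint is the product of each node's conditional given its parents, and it sums to one over all assignments:

```python
# Conditional probability tables for a binary chain A -> B -> C.
cpds = {
    "A": lambda x, pa: 0.6 if x == 1 else 0.4,
    "B": lambda x, pa: 0.7 if x == pa["A"] else 0.3,
    "C": lambda x, pa: 0.9 if x == pa["B"] else 0.1,
}
parents = {"A": [], "B": ["A"], "C": ["B"]}

def joint(assignment):
    """Multiply p(x_v | x_pa(v)) over all nodes v."""
    p = 1.0
    for node, cpd in cpds.items():
        pa = {q: assignment[q] for q in parents[node]}
        p *= cpd(assignment[node], pa)
    return p

# The joint distribution sums to 1 over all 8 assignments.
total = sum(joint({"A": a, "B": b, "C": c})
            for a in (0, 1) for b in (0, 1) for c in (0, 1))
print(round(total, 6))
```

The same factorization is what makes marginal and conditional queries tractable by local computation rather than brute-force summation.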

  22. Graphical Models (3) (Diagram: a node θ with children Z_1, Z_2, …, Z_N.) • E.g.: p(z_1, …, z_N | θ) = prod_{n=1}^{N} p(z_n | θ) • This model asserts that the variables Z_1, …, Z_N are conditionally independent and identically distributed given θ.

  23. Graphical Models (4) (Diagram: plate notation — θ with child Z_n, enclosed in a plate repeated N times.) • A plate is used to show the repetition of variables. • It compactly expresses factorial and nested structures.

  24. Graphical Models (5) Finite Mixture Model • A generative approach to clustering: • pick one of K clusters from a distribution over clusters; • generate a data point from a cluster-specific probability distribution. • This yields a finite mixture model: p(x | π, θ) = sum_{k=1}^{K} π_k f(x | θ_k) • where π = (π_1, …, π_K) and θ = (θ_1, …, θ_K) are the parameters, and each cluster uses the same parameterized family f. • Data are assumed to be generated conditionally IID from this mixture.
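The two-step generative story can be sketched directly. In this minimal example the components are Gaussians with invented parameters (the slides leave the family f abstract):

```python
import math
import random

pi = [0.3, 0.7]                      # mixing proportions
mu, sigma = [-2.0, 3.0], [1.0, 0.5]  # per-component Gaussian parameters

rng = random.Random(0)

def sample():
    """Pick a component k ~ pi, then draw x from that component."""
    k = 0 if rng.random() < pi[0] else 1
    return k, rng.gauss(mu[k], sigma[k])

def density(x):
    """Mixture density: sum_k pi_k * N(x | mu_k, sigma_k)."""
    return sum(p * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
               for p, m, s in zip(pi, mu, sigma))

draws = [sample() for _ in range(1000)]
frac_k1 = sum(k for k, _ in draws) / len(draws)
print(round(frac_k1, 2))   # close to pi[1] = 0.7
```

Each draw records which component generated it, which is exactly the cluster membership the next slide refers to.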

  25. Graphical Models (6) Finite Mixture Model • Mixture models make the assumption that each data point arises from a single mixture component. • the k-th cluster is by definition the set of data points arising from the k-th mixture component.

  26. Graphical Models (7) Finite Mixture Model • Another way to express this: define an underlying measure G = sum_{k=1}^{K} π_k δ_{θ*_k}, where δ_{θ*_k} is an atom at θ*_k. • Then define the process of obtaining a sample from a finite mixture model as follows. For i = 1, …, N: θ_i ~ G, x_i ~ f(· | θ_i). • Note that each θ_i is equal to one of the underlying θ*_k. • Indeed, the subset of {θ_i} that maps to θ*_k is exactly the k-th cluster.

  27. Graphical Models (8) Finite Mixture Model (Diagram: G with child θ_i, which generates x_i, enclosed in a plate repeated N times.)

  28. Graphical Models (9) Dirichlet Process Mixture • Define a countably infinite mixture model by taking K to infinity and hoping that G = sum_{k=1}^{∞} π_k δ_{ψ_k} still means something. (Diagram: a concentration parameter α governs the weights π_k; a base distribution G_0 generates the atoms ψ_k; Z_i selects a component for x_i, repeated N times.)
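The key behavior, a prior over an unbounded number of clusters, can be illustrated with the Chinese restaurant process, an equivalent view of the Dirichlet process (this sketch is only an illustration of that view, not the stick-breaking construction on the slide):

```python
import random

def crp(n, alpha, rng):
    """Chinese restaurant process: customer i joins an existing cluster
    with probability proportional to its size, or opens a new cluster
    with probability proportional to alpha."""
    counts = []                      # customers per cluster
    assignments = []
    for i in range(n):
        # total mass: i existing customers plus alpha for a new cluster
        r = rng.uniform(0, i + alpha)
        acc = 0.0
        k = len(counts)              # default: open a new cluster
        for j, c in enumerate(counts):
            acc += c
            if r < acc:
                k = j
                break
        if k == len(counts):
            counts.append(0)
        counts[k] += 1
        assignments.append(k)
    return assignments, counts

rng = random.Random(1)
assignments, counts = crp(100, alpha=2.0, rng=rng)
print(len(counts))   # number of clusters is not fixed in advance
```

The number of occupied clusters grows roughly as alpha * log(n), which is why such a prior can accommodate an unlimited number of product attributes.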

  29. Our Model (1) • Our graphical model can be regarded as an extension of the Dirichlet process mixture model. • Each mixture component: • refers to a reference attribute; • consists of two distributions characterizing the content information and target information. • A Dirichlet process prior is employed. • It can handle an unlimited number of reference attributes.

  30. Our Model (2) • The model addresses attribute extraction, attribute resolution, and joint attribute extraction and resolution by relating each text fragment's target, layout, and content information to its attribute information.

  31. Our Model (3) (Diagram: a Dirichlet process prior — an infinite mixture model — with an outer plate over the S different Web sites and an inner plate over the N text fragments.)

  32. Our Model (4) (Diagram detail: for the k-th component of the infinite mixture under the Dirichlet process prior, the model includes the proportion of the component in the mixture, a parameter for its target information, and a parameter for its content information; each of the N text fragments carries target, layout, and content information.)

  33. Our Model (5) (Diagram detail: the layout format is site-dependent, modeled within a plate over the S different Web sites.)

  34. Our Model (6) (Diagram detail: the Dirichlet process prior of the infinite mixture model has a concentration parameter, a base distribution for the content information, and a base distribution for the target information.)

  35. Generation Process (1)

  36. Generation Process (2) • The joint probability for generating a particular text fragment is defined given the model parameters. • Inference: compute the posterior probability of the unobservable variables given the observable variables and the model parameters, where the observable variables include the layout and content information of the text fragments. • This posterior is intractable to compute exactly.

  37. Variational Method (1) • The inference problem is transformed into an optimization problem. • The resulting variational optimization problems admit principled approximate solutions. • The solution to variational problems is often given in terms of fixed-point equations that capture necessary conditions for optimality. • In contrast to other approximation methods such as MCMC, variational methods are deterministic.

  38. Variational Method (2) • Finding the exact posterior over the unobservable variables is intractable. • Our goal: transform the problem into an optimization problem — choose a tractable variational distribution q and minimize D(q ‖ p), where D denotes the KL-divergence and p is the true posterior. • The KL-divergence must be non-negative.

  39. Variational Method (3) • The KL-divergence is zero if q equals the true posterior probability. • Let L(q) = E_q[log p(observables, unobservables)] − E_q[log q]. • Since log p(observables) = L(q) + D(q ‖ posterior), and the KL term is non-negative, we have a lower bound on the desired log-marginal probability: log p(observables) ≥ L(q). • The LHS is the log-likelihood of the observable variables.
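The bound can be verified numerically on a tiny model. In this sketch (all probabilities invented for illustration) the latent variable z takes two values, and for any q(z) the quantity sum_z q(z) log(p(x, z) / q(z)) stays below log p(x), with equality at the posterior:

```python
import math

# p(x, z) for a fixed observed x and a two-valued latent z.
p_joint = {0: 0.12, 1: 0.28}
log_evidence = math.log(sum(p_joint.values()))   # log p(x)

def elbo(q):
    """Variational lower bound: E_q[log p(x, z)] - E_q[log q(z)]."""
    return sum(q[z] * math.log(p_joint[z] / q[z]) for z in (0, 1) if q[z] > 0)

arbitrary_q = {0: 0.5, 1: 0.5}
posterior = {z: p / sum(p_joint.values()) for z, p in p_joint.items()}

gap = log_evidence - elbo(arbitrary_q)   # equals D(q || posterior), >= 0
print(round(gap, 4))
```

Maximizing the bound over q is therefore equivalent to minimizing the KL-divergence to the true posterior.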

  40. Variational Method (4) • The problem becomes maximizing the variational lower bound.

  41. Variational Method (5) • Truncated stick-breaking process (Ishwaran and James, 2001). • Replace the infinite number of mixture components with a finite truncation level K.
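A short sketch of the truncated construction (an illustration of the general technique, not the paper's exact variational family): draw stick-breaking fractions v_k ~ Beta(1, alpha) for k < K and fix v_K = 1, so the K weights sum exactly to one.

```python
import random

def truncated_stick_breaking(alpha, K, rng):
    """pi_k = v_k * prod_{j<k} (1 - v_j), with v_K = 1 to close the stick."""
    pis, remaining = [], 1.0
    for k in range(K):
        v = 1.0 if k == K - 1 else rng.betavariate(1.0, alpha)
        pis.append(v * remaining)    # break off a fraction of what is left
        remaining *= (1.0 - v)
    return pis

rng = random.Random(0)
pi = truncated_stick_breaking(alpha=1.0, K=20, rng=rng)
print(round(sum(pi), 6))   # 1.0
```

Fixing the last fraction to one is what turns the infinite mixture into a finite one that variational updates can handle.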

  42. Variational Method (6) • The content information is modeled as a mixture of tokens. • The layout information is represented as a set of binary features, each modeled as a binary variable. • Conjugate priors are placed on the corresponding parameters.

  43. Variational Method (7) • Solve by a coordinate ascent algorithm. • One important variational parameter expresses how likely a text fragment comes from the k-th component. • This yields attribute resolution.

  44. Variational Method (8) • Another important variational parameter expresses how likely a text fragment should be extracted. • This yields attribute extraction.

  45. Variational Method (9) • Other variational parameters:

  46. Initialization • What should be extracted? • Make use of a very small amount of prior information about a domain: • only a few terms naming product attributes, e.g., resolution, light sensitivity. • These can be easily obtained, for example, by just highlighting the attributes of one single Web page. • The terms are used to initialize the variational parameters.

  47. EM Algorithm for Layout Parameters • Our framework considers the page-dependent layout format of text fragments to enhance extraction. • However, the layout information of an unseen Web page is unknown, so the layout parameters cannot be predefined; they must be estimated. • E-step: apply the coordinate ascent algorithm until convergence to reach the optimality conditions for all variational parameters. • M-step: re-estimate the site-dependent layout parameters.

  48. Experiments • We have conducted experiments on four different domains: • Digital camera: 85 Web pages from 41 different sites • MP3 player: 96 Web pages from 62 different sites • Camcorder: 111 Web pages from 61 different sites • Restaurant: 29 Web pages from the LA-Weekly Restaurant Guide • In each domain, we conducted 10 runs of experiments. • In each run, we randomly selected a Web page and picked a few terms inside it for initialization.

  49. Evaluation on Attribute Resolution • Baseline approach (Bilenko & Mooney, SIGKDD 2003): • agglomerative clustering • edit distance between text fragments • Evaluation metrics: • pairwise recall (R) • pairwise precision (P) • pairwise F1-measure (F)
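The pairwise metrics can be computed directly from cluster labels: a pair of fragments counts as a true positive when both the predicted and the gold clustering put the two items in the same cluster. A small self-contained sketch (the toy labels are invented for the example):

```python
from itertools import combinations

def pairwise_prf(true_labels, pred_labels):
    """Pairwise precision, recall, and F1 over same-cluster item pairs."""
    idx = range(len(true_labels))
    true_pairs = {(i, j) for i, j in combinations(idx, 2)
                  if true_labels[i] == true_labels[j]}
    pred_pairs = {(i, j) for i, j in combinations(idx, 2)
                  if pred_labels[i] == pred_labels[j]}
    tp = len(true_pairs & pred_pairs)
    p = tp / len(pred_pairs) if pred_pairs else 0.0
    r = tp / len(true_pairs) if true_pairs else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Toy example: items 0-1 truly belong together, as do items 2-3,
# but the predicted clustering lumps item 2 in with 0 and 1.
p, r, f = pairwise_prf([0, 0, 1, 1], [0, 0, 0, 1])
print(p, r, f)
```

Because the metrics compare pairs rather than label names, they are invariant to how the clusters happen to be numbered.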

  50. Results of Attribute Resolution
