From Tessellations to Table Interpretation

Ramana C. Jandhyala, DocLab, RPI

Introduction • Novel aspects of our work • Focus on computer-constructed web tables • Using commercial software • Describing tables using XY trees • Extracting the relationship of headers to content cells • Formalizes the 200-table experiment conducted by Raghav. These tables were imported from 10 websites into Excel and manually edited into a form that can be processed algorithmically. • Average editing time – 104 sec. • Average table size – 587 cells. • Augmentations not considered!

Rectangular Tessellations • Rectangular Tiling/Discrete Rectangular Tessellation • Partition of an isothetic rectangle into rectangles • Geometry uniquely defined by locations and types of junction points • Number $N_{all}(m)$ increases exponentially with table size. • XY Tessellations • Special case of rectangular tessellations • Obtained by successive horizontal and vertical cuts • Number of XY tilings $N_{xy}(m)$ decreases rapidly relative to $N_{all}(m)$ (Klarner-Magliveras), i.e. $\lim_{m \to \infty} N_{xy}(m) / N_{all}(m) = 0$

Taxonomy of web tables • All tables have a stub, row headings, column headings and data cells. • Some common layouts – admissible tessellations

Taxonomy of web tables (contd.) • Human-understandable tables – $N_{T,S,xy}(m)$, a mathematically indefinable and unknown number • Convert them to a smaller set of admissible tables – $N_{A,S,xy}(m)$ • Layout-equivalent tables are enough for algorithmic analysis.

Taxonomy of web tables (contd.) • Number of different layout-equivalent admissible candidates – $N_{L,S,xy}(m)$ • For now, $N_{L,S,xy}(m) < N_{A,S,xy}(m)$ • Context-free grammars – characterize entire families of layout-equivalent tables

Logical Structure of Tables • XY trees only capture physical layout • To understand a table – need to analyse logical structure, i.e. the relationship between header cells and content cells [Wang]. • Wang notation – consists of category trees (headings) and delta cells (content). • Number of category trees – dimensionality of the table • Cartesian product of category trees leads to delta cells. • Size of table – product of number of rows and columns of delta cells
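The counting on this slide can be illustrated with a minimal sketch. The nested-dict encoding of category trees and the `Year`/`Region` labels are assumptions made for the example, not the representation used in the talk.

```python
from math import prod

# Hypothetical category trees: each key is a heading, each value its subtree.
categories = {
    "Year": {"2004": {}, "2005": {}},
    "Region": {"Canada": {}, "Quebec": {}, "Ontario": {}},
}


def leaf_paths(label, subtree):
    """Return dotted paths from this heading down to each leaf heading."""
    if not subtree:
        return [label]
    paths = []
    for child, grandchildren in subtree.items():
        paths.extend(f"{label}.{p}" for p in leaf_paths(child, grandchildren))
    return paths


paths_per_category = {root: leaf_paths(root, sub) for root, sub in categories.items()}

# Dimensionality is the number of category trees.
dimensionality = len(categories)

# Each delta cell is addressed by one leaf path per category, so the number
# of delta cells is the product of the leaf counts.
num_delta_cells = prod(len(p) for p in paths_per_category.values())

print(dimensionality, num_delta_cells)
```

For a two-dimensional table the two leaf counts are the numbers of rows and columns of delta cells, so the product is the table size as defined above.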

Logical Structure of Tables (contd.) • Well-formed tables – Labeled table candidates for which Wang Notation exists • Most tables not well-formed, but easily convertible into well-formed format using virtual headers. • Analyzing logical structure not sufficient for table understanding!

Our project – front end for creating narrow-domain ontologies by combining information from web tables • Our work is based on the following inequalities: $N_{L,S,xy}(m) < N_{A,S,xy}(m) < N_{T,S,xy}(m) \ll N_{S,xy}(m) \ll N_{xy}(m) \ll N_{all}(m)$ • Examples of each class are shown in the next slide.

Tessellations to XY trees • Horizontally and vertically ordered lists of junction points – not sufficient for reconstructing XY tree! • Do not capture the adjacency topology. • Need coordinates and junction types (NE-corner, T-junction, crossing etc.)

Table to XY tree – EX2XY • Applicable to any tessellation for which an XY tree exists. • Input – Excel table • Output – XY tree (parenthesized notation) • Algorithm: • CutV(R) – cuts a rectangle R vertically and returns the leftmost sub-rectangle. • CutH(R) – cuts R horizontally and returns the topmost sub-rectangle. • Both are used in a pair of procedures, P1 and P2, which call each other recursively. • P1 cuts the given rectangle vertically and submits the first sub-rectangle to P2 for horizontal cuts. P2 likewise cuts horizontally and submits the first sub-rectangle to P1 for vertical cuts. • Main procedure calls P1 for vertical cuts, and P2 for horizontal cuts.
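The alternating recursion can be sketched as follows. This is not the EX2XY code. It assumes the table is given as a grid in which a merged cell repeats the same label in every position it covers, it finds all cuts in one direction at once instead of peeling off the first sub-rectangle, and it uses `[ ]` for vertical cuts and `{ }` for horizontal cuts, a convention inferred from the output snippet. The names `find_cuts` and `xy_tree` are hypothetical.

```python
def find_cuts(grid, top, left, bottom, right, vertical):
    """Return the grid lines inside the region that no cell crosses."""
    if vertical:
        return [c for c in range(left + 1, right)
                if all(grid[r][c - 1] != grid[r][c] for r in range(top, bottom))]
    return [r for r in range(top + 1, bottom)
            if all(grid[r - 1][c] != grid[r][c] for c in range(left, right))]


def xy_tree(grid, top=0, left=0, bottom=None, right=None, vertical=True):
    """Return the XY tree of a region in parenthesized notation."""
    if bottom is None:
        bottom, right = len(grid), len(grid[0])
    ids = {grid[r][c] for r in range(top, bottom) for c in range(left, right)}
    if len(ids) == 1:  # the region is a single (possibly merged) cell
        return str(next(iter(ids)))
    # Try the requested direction first, then fall back to the other one.
    for direction in (vertical, not vertical):
        cuts = find_cuts(grid, top, left, bottom, right, direction)
        if not cuts:
            continue
        children = []
        if direction:  # vertical cuts: children run left to right
            bounds = [left] + cuts + [right]
            for a, b in zip(bounds, bounds[1:]):
                children.append(xy_tree(grid, top, a, bottom, b, vertical=False))
            return "[" + " ".join(children) + "]"
        bounds = [top] + cuts + [bottom]  # horizontal cuts: top to bottom
        for a, b in zip(bounds, bounds[1:]):
            children.append(xy_tree(grid, a, left, b, right, vertical=True))
        return "{" + " ".join(children) + "}"
    raise ValueError("no XY tree exists for this tessellation")


# Hypothetical table: a stub, a spanning "Year" header, and four data cells.
grid = [
    ["stub",   "Year", "Year"],
    ["stub",   "2004", "2005"],
    ["Canada", "d11",  "d12"],
    ["Quebec", "d21",  "d22"],
]
print(xy_tree(grid))
```

A pinwheel arrangement of five rectangles has no cut in either direction, so the sketch raises `ValueError` for it, matching the restriction to tessellations for which an XY tree exists.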

Example – Original HTML table

Example (contd.) – After import into Excel

Example – After Editing

A snippet of the output (both parenthetical and XML outputs)

Parenthetical version of the output
( [ {
      ::15,2:15,2
      ::16,2:16,2
      Real gross domestic product, expenditure-based, by province and territory (millions of chained (2002) dollars)::17,2:30,2
    }
    {
      ::15,3:15,3
      ::16,3:16,3
      Canada::17,3:17,3
      Newfoundland and Labrador::18,3:18,3
      Prince Edward Island::19,3:19,3
      Nova Scotia::20,3:20,3
      New Brunswick::21,3:21,3
      Quebec::22,3:22,3
      Ontario::23,3:23,3
      Manitoba::24,3:24,3
      Saskatchewan::25,3:25,3
      Alberta::26,3:26,3
      British Columbia::27,3:27,3
      Yukon::28,3:28,3
      Northwest Territories::29,3:29,3
      Nunavut::30,3:30,3
    }
    {
      Year::15,4:15,8
      [ 2004::16,4:16,4 2005::16,5:16,5 2006::16,6:16,6 2007::16,7:16,7 2008::16,8:16,8 ]
      . . .

XML version of the output
. .
<block id='1.1.2.1' range='17,2:30,2'>
  <content>Real gross domestic product, expenditure-based, by province and territory (millions of chained (2002) dollars)</content>
</block>
<block id='1.1.2.2' range='17,3:30,3'>
  <content></content>
</block>
<block id='1.2.2.1' range='16,4:16,4'>
  <content>2004</content>
</block>
<block id='1.2.2.2' range='16,5:16,5'>
  <content>2005</content>
</block>
<block id='1.2.2.3' range='16,6:16,6'>
  <content>2006</content>
</block>
<block id='1.2.2.4' range='16,7:16,7'>
  <content>2007</content>
</block>
. . .

Grammar for tables • The grammar uses nested parenthetical notation (P-notation). • P-notation has 1:1 correspondence with general trees. • For above table, the XY tree sentence is: Sxy = {c [c c] c [c {c [c c]} c {c [c c]}]} (neglecting the textual labels)
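The 1:1 correspondence between P-notation and general trees can be illustrated with a small parser. This is a sketch that assumes single-character tokens (`c`, `{`, `}`, `[`, `]`) with textual labels already dropped. The function names and the `(bracket, children)` tuple encoding are hypothetical.

```python
OPEN_TO_CLOSE = {"{": "}", "[": "]"}


def parse_node(tokens, pos):
    """Parse one node starting at tokens[pos]; return (node, next position)."""
    tok = tokens[pos]
    if tok == "c":
        return "c", pos + 1
    if tok not in OPEN_TO_CLOSE:
        raise ValueError(f"unexpected token {tok!r} at position {pos}")
    close = OPEN_TO_CLOSE[tok]
    children = []
    pos += 1
    while pos < len(tokens) and tokens[pos] != close:
        child, pos = parse_node(tokens, pos)
        children.append(child)
    if pos >= len(tokens):
        raise ValueError("unbalanced brackets")
    return (tok, children), pos + 1


def parse_p_notation(text):
    """Turn a P-notation sentence into a tree of (bracket, children) nodes."""
    tokens = [ch for ch in text if not ch.isspace()]
    if not tokens:
        raise ValueError("empty sentence")
    node, pos = parse_node(tokens, 0)
    if pos != len(tokens):
        raise ValueError("trailing tokens after the root node")
    return node


def to_p_notation(node):
    """Serialize a tree back to P-notation (inverse of parse_p_notation)."""
    if node == "c":
        return "c"
    open_tok, children = node
    inner = " ".join(to_p_notation(ch) for ch in children)
    return open_tok + inner + OPEN_TO_CLOSE[open_tok]


def count_leaves(node):
    """Count the cells (leaf nodes) of the tree."""
    if node == "c":
        return 1
    return sum(count_leaves(ch) for ch in node[1])


s_xy = "{c [c c] c [c {c [c c]} c {c [c c]}]}"
tree = parse_p_notation(s_xy)
print(count_leaves(tree), to_p_notation(tree) == s_xy)
```

Each bracket pair becomes one internal node and each `c` one leaf, so parsing and re-serializing the sentence returns the original string.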

Grammar • Grammar for parsing the column headers of all such layout-equivalent tessellations:
S := A (Rule 1)
A := {B} (Rule 2)
B := c [X] B | c [X] (Rules 3 and 4)
X := c X | A X | A | c (Rules 5, 6, 7 and 8)
where
• S – start symbol
• A – nonterminal that generates all admissible strings for column headers
• B – generates >=1 instances of categories in the form c[X]
• Each c becomes a root category and X generates its subcategory tree
• X generates strings of size >=1 with arbitrary occurrences of c and A.
• The derivation for the previous example using an LALR parser is shown on the next slide
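The eight rules can be checked by hand with a small recognizer. The sketch below uses recursive descent, not the LALR parser mentioned on the slide. It reads Rules 3 and 4 as "one or more c [X] units" and Rules 5 to 8 as "one or more items, each c or A". All function names are hypothetical.

```python
def parse_a(t, i):
    """A := { B }. Return the position after A, or None on failure."""
    if i >= len(t) or t[i] != "{":
        return None
    j = parse_b(t, i + 1)
    if j is None or j >= len(t) or t[j] != "}":
        return None
    return j + 1


def parse_unit(t, i):
    """One root category with its subcategory tree: c [ X ]."""
    if i + 1 >= len(t) or t[i] != "c" or t[i + 1] != "[":
        return None
    j = parse_x(t, i + 2)
    if j is None or j >= len(t) or t[j] != "]":
        return None
    return j + 1


def parse_b(t, i):
    """B := c [X] B | c [X], i.e. one or more units."""
    j = parse_unit(t, i)
    while j is not None:
        k = parse_unit(t, j)
        if k is None:
            return j
        j = k
    return None


def parse_item(t, i):
    """A single c or a nested A inside X."""
    if i < len(t) and t[i] == "c":
        return i + 1
    return parse_a(t, i)


def parse_x(t, i):
    """X := c X | A X | A | c, i.e. one or more items."""
    j = parse_item(t, i)
    while j is not None:
        k = parse_item(t, j)
        if k is None:
            return j
        j = k
    return None


def accepts(sentence):
    """S := A. True if the whole sentence is derivable from S."""
    tokens = [ch for ch in sentence if not ch.isspace()]
    return parse_a(tokens, 0) == len(tokens)


print(accepts("{c [c c] c [c {c [c c]} c {c [c c]}]}"))
```

Greedy repetition is safe here because B is always followed by `}` and X by `]`, so one token of lookahead decides every choice.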

Example demonstrates both the power and the limitation of grammars. • A grammar can recognize broad classes. • But grammars cannot check that headings are proper labels for well-formed tables. • If a table is accepted by the grammar, additional geometric alignment and lexical checks are needed to verify Wang notation.

XY tree to Wang Notation • XY2WANG converts an XY tree generated from a restricted family of admissible tables to Wang Notation. • Example: • Uses an indented table-of-contents format as a data structure.

XY2WANG • Input – XY trees with arbitrary number of categories and arbitrary nesting. • Output – XML version of Wang Notation • For a table $T = (C, \delta)$:
• Category notation: $C = \{ (A, \{(A_1, \phi), (A_2, \phi)\}), (B, \{(B_1, \phi), (B_2, \phi), (B_3, \phi)\}) \}$
• Delta mappings: $\delta(\{A.A_1, B.B_1\}) = d_{11}$, $\delta(\{A.A_1, B.B_2\}) = d_{12}$, …

XY2WANG: Algorithm • Algorithm: • First locate the 4 principal regions – stub, row headers, column headers and content cells. • Extract Wang labeled domains under the assumption that each spanning cell is the header of the smaller cells either to its right (row headers) or below it (column headers). • Compute the Cartesian product of category paths and match each key to the content of a delta cell.
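The last two steps can be sketched under strong simplifying assumptions: the four regions are already located, each header level is given as a list with one label per leaf position (a spanning cell repeats its label across the positions it covers), and content cells are in row-major order. The names `header_paths` and `delta_map` are hypothetical and this is not the XY2WANG implementation.

```python
from itertools import product


def header_paths(levels):
    """Join the labels covering each leaf position, outermost level first."""
    return [".".join(labels) for labels in zip(*levels)]


def delta_map(row_levels, col_levels, data):
    """Map each {row path, column path} key to the content of its delta cell."""
    row_paths = header_paths(row_levels)
    col_paths = header_paths(col_levels)
    if len(data) != len(row_paths) or any(len(row) != len(col_paths) for row in data):
        raise ValueError("content cells do not match the Cartesian product of headers")
    return {
        frozenset((rp, cp)): data[i][j]
        for (i, rp), (j, cp) in product(enumerate(row_paths), enumerate(col_paths))
    }


# Category A spans two rows, category B spans three columns.
row_levels = [["A", "A"], ["A1", "A2"]]
col_levels = [["B", "B", "B"], ["B1", "B2", "B3"]]
data = [
    ["d11", "d12", "d13"],
    ["d21", "d22", "d23"],
]

delta = delta_map(row_levels, col_levels, data)
print(delta[frozenset({"A.A1", "B.B2"})])
```

Keys are unordered sets of category paths, mirroring the set-valued argument of the delta mapping. A mismatch between the number of paths and the shape of the content region is one simple symptom of a table that is not well-formed.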

XY2WANG: Table-of-contents data structure • Example of a table and its corresponding table-of-contents data structure is shown

XY2WANG also handles more complex scenarios like: • Higher Wang dimensionality • Deeper nesting of headers • Repetitive headers • Detection of not well-formed tables • These are included in the following pseudocode

Conclusion • Hierarchical structure of categories and flat structure of data cells are recovered from XY trees. • Geometric and topological equivalence classes on tessellations and their XY trees are defined. • Commonly encountered tables are examples of such classes. • These tables are identified by parsing XY trees with a grammar. • Assuming the header labels are consistent, Wang category notation is extracted.

Future work • Account for aggregates – major component of web tables. • Need to integrate other augmentations (footnotes, units, captions etc.) • Expand on the grammar: current version accounts only for column headers. • Automate the conversion from imported web tables to standard formats. • Semantic interpretation of groups of conceptually overlapping tables based on precise representation of layout-invariant syntax.

Current Work • Converting web tables to standard formats for ease of processing. • Internal conventions: A’, A’’, hybrids • Learning from XY trees using tree edit distance • Learning from existing manipulations. • Ex: The user modifies table T1 to a standard format T1’. The steps are all recorded. Now use this information to predict the standard format of a new table T2.

Current work (contd.) • Relation of tree-edit distance to pre-order and post-order string edit distance • Some interesting results and conjectures, but still half-baked! • (Result) Pre- and post-order traversals are enough for reconstructing a general tree. • (Conjecture) For 2 XY trees, distances between corresponding pre- and post-order strings are equal, but not for general trees! • (Conjecture) For 2 XY trees, tree-edit distance is equal to pre/post-order distances • Are tables with the same content, but different layouts, collinear (in terms of string/tree edit distance)? • Developing software to calculate tree edit distances, which should clarify many things. (Any suggestions?)
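The reconstruction result can be illustrated with a short sketch. It assumes an ordered tree whose node labels are all distinct, which the slide does not state, and encodes a node as `(label, [children])`. The function names are hypothetical.

```python
def preorder(node):
    """List labels with each parent before its children."""
    label, children = node
    return [label] + [x for ch in children for x in preorder(ch)]


def postorder(node):
    """List labels with each parent after its children."""
    label, children = node
    return [x for ch in children for x in postorder(ch)] + [label]


def rebuild(pre, post):
    """Rebuild an ordered tree from its two traversals (distinct labels)."""
    if not pre or len(pre) != len(post) or pre[0] != post[-1]:
        raise ValueError("traversals are inconsistent")
    root, children = pre[0], []
    i = 1  # position in pre of the next child's root
    j = 0  # position in post where the next child's subtree begins
    while i < len(pre):
        # The child's root closes its subtree in post-order, which gives
        # the subtree size.
        size = post.index(pre[i], j) - j + 1
        children.append(rebuild(pre[i:i + size], post[j:j + size]))
        i += size
        j += size
    return (root, children)


tree = ("a", [("b", [("d", []), ("e", [])]), ("c", [])])
print(rebuild(preorder(tree), postorder(tree)) == tree)
```

With repeated labels, as in unlabeled XY trees where every leaf is `c`, the subtree boundaries are no longer determined by a label lookup, so this sketch does not apply directly.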
