Table Extraction Using MaxEnt

This presentation discusses table extraction and table formats, focusing on HTML tables and plain-text tables. Tags help in understanding HTML tables; for plain-text tables, which have no tags, it proposes a MaxEnt model for table extraction. A data set from the CS department at the University of Massachusetts Amherst is used for training and testing the model. The presentation also includes an error analysis and suggests future improvements.


Presentation Transcript


  1. Table Extraction Using MaxEnt Zonghui Lian

  2. Introduction • Table extraction • Table format

  3. Problem • HTML table • Tags can help us to understand it • How about plain text table?

  4. An Example • title, title, title • separator • header, header, header, header • datarow, datarow, datarow, datarow, datarow, datarow

  5. MaxEnt • How to define features • How to learn model weights
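The two questions on this slide can be made concrete with a small sketch. It assumes the standard conditional MaxEnt (log-linear) form, in which each label's score is a weighted sum of feature values and the weights are fit by gradient ascent on the conditional log-likelihood. The slides do not say which training procedure or toolkit was used, so the function names, the learning rate, and the toy feature names below are all hypothetical.

```python
import math
from collections import defaultdict

def predict_proba(weights, features, labels):
    """Return p(label | features) under a log-linear (MaxEnt) model."""
    scores = {
        y: sum(weights[(name, y)] * value for name, value in features.items())
        for y in labels
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}

def train(data, labels, epochs=200, lr=0.5):
    """Fit weights by batch gradient ascent (epochs and lr are assumptions)."""
    weights = defaultdict(float)
    for _ in range(epochs):
        grad = defaultdict(float)
        for features, gold in data:
            probs = predict_proba(weights, features, labels)
            for name, value in features.items():
                grad[(name, gold)] += value              # observed count
                for y in labels:
                    grad[(name, y)] -= probs[y] * value  # expected count
        for key, g in grad.items():
            weights[key] += lr * g / len(data)
    return weights

# Toy usage with hypothetical features for two line labels.
labels = ["DATAROW", "SEPARATOR"]
data = [
    ({"bias": 1.0, "digit_heavy": 1.0}, "DATAROW"),
    ({"bias": 1.0, "mostly_dashes": 1.0}, "SEPARATOR"),
]
weights = train(data, labels)
print(predict_proba(weights, {"bias": 1.0, "digit_heavy": 1.0}, labels))
```

The gradient is the difference between observed and expected feature counts, which is the usual way MaxEnt weights are learned; a production system would normally add regularization and a faster optimizer.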

  6. Data Set • CS Dept., University of Massachusetts Amherst (FedStats.gov) • Training data: 9321 • Test data: 1200 • Format

  7. Features • White space: large gaps / small gaps, four-space indents, space percentage • Text features: digit percentage, month and year
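As an illustration of how the white-space and text features on this slide might be computed for one line of a plain-text table, here is a minimal sketch. The slide does not define the features precisely, so the thresholds are assumptions: a "small gap" is taken to be a run of 2 to 3 spaces, a "large gap" a run of 4 or more, and years are limited to 1900 to 2099. The function and key names are hypothetical.

```python
import re

MONTH_RE = re.compile(
    r"\b(?:jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?|"
    r"aug(?:ust)?|sep(?:t(?:ember)?)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)\b",
    re.IGNORECASE,
)
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")

def line_features(line, large_gap=4):
    """Compute white-space and text features for a single line."""
    body = line.rstrip("\n")
    n = len(body)
    # Runs of two or more spaces inside the line (leading indent excluded).
    gaps = [len(g) for g in re.findall(r" {2,}", body.strip())]
    return {
        "large_gaps": sum(1 for g in gaps if g >= large_gap),
        "small_gaps": sum(1 for g in gaps if g < large_gap),
        "four_space_indent": body.startswith("    "),
        "space_pct": body.count(" ") / n if n else 0.0,
        "digit_pct": sum(c.isdigit() for c in body) / n if n else 0.0,
        "has_month": bool(MONTH_RE.search(body)),
        "has_year": bool(YEAR_RE.search(body)),
    }

print(line_features("    Corn      1,200    1999"))
```

Real-valued features such as the percentages can be passed to a MaxEnt model directly or bucketed into binary indicators; the slides do not say which was done.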

  8. Features • Special characters -, +, =, :, |, .
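The special-character features can be sketched in the same spirit. The character set is the one listed on the slide; how the characters were turned into features is not stated, so the per-character counts, the overall percentage, and the `only_special` flag (a rough cue for separator lines such as rows of dashes) are assumptions, as are all names used here.

```python
SPECIAL_CHARS = "-+=:|."  # the characters listed on the slide

def special_char_features(line):
    """Count special characters in a line and summarize their share."""
    text = line.strip()
    non_space = [c for c in text if not c.isspace()]
    counts = {c: text.count(c) for c in SPECIAL_CHARS}
    total = sum(counts.values())
    feats = {f"count[{c}]": n for c, n in counts.items()}
    # Share of non-space characters that are special characters.
    feats["special_pct"] = total / len(non_space) if non_space else 0.0
    # True when the line consists only of special characters and spaces.
    feats["only_special"] = bool(non_space) and total == len(non_space)
    return feats

print(special_char_features("-----  -----"))
```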

  9. Result

  10. Error Analysis • Most errors occurred when recognizing … [TABLEFOOTNOTE : 0.2719665271966527, DATAROW : 0.12552301255230125, TABLEHEADER : 0.11715481171548117 • TABLEFOOTNOTE -> NONTABLE DATAROW • DATAROW -> SECTIONDATAROW • TABLEHEADER -> SUPERHEADER • Examples: TABLEFOOTNOTE 1 Includes Hawaii. • TABLEFOOTNOTE 2 Includes processing total for dual usage crops.
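A breakdown like the one on this slide can be produced from paired gold and predicted labels. The slide does not say how its figures were computed; the three printed values are numerically consistent with 65, 30, and 28 out of 239, which would fit a share of all errors per gold label, so that is what this sketch computes. That reading is an assumption, and the function names and the toy labels are hypothetical.

```python
from collections import Counter

def error_share_by_label(gold, predicted):
    """Share of all errors attributable to each gold label."""
    errors = Counter(g for g, p in zip(gold, predicted) if g != p)
    total = sum(errors.values())
    return {label: n / total for label, n in errors.items()} if total else {}

def confusions(gold, predicted):
    """Count (gold, predicted) pairs for misclassified lines only."""
    return Counter((g, p) for g, p in zip(gold, predicted) if g != p)

# Toy labels for illustration only; these are not the presentation's data.
gold = ["TABLEFOOTNOTE", "TABLEFOOTNOTE", "DATAROW", "DATAROW"]
pred = ["NONTABLE", "TABLEFOOTNOTE", "SECTIONDATAROW", "DATAROW"]
print(error_share_by_label(gold, pred))
print(confusions(gold, pred).most_common())
```

The confusion counts give the "X -> Y" pairs directly, which makes it easy to see which label pairs the model mixes up most often.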

  11. Future Work • Improve the performance • Features, for example: alphabet characters, previous label, next label • Data set size

  12. Future Work • Identify columns • Add tags • Use table understanding algorithm
