Information Extraction on Real Estate Rental Classifieds

Eddy Hartanto

Ryohei Takahashi


Overview

  • We want to extract 10 fields:

  • Security deposit

  • Square footage

  • Number of bathrooms

  • Contact person’s name

  • Contact phone number

  • Nearby landmarks

  • Cost of parking

  • Date available

  • Building style / architecture

  • Number of units in building

  • These fields can’t easily be served by keyword search


Approach

  • Hand-labeled test set as the basis for computing precision and recall

  • Pattern matching approach with Rapier

  • Statistical approach using HMM with different structures


Demo …


Hidden Markov Models

  • We consider three different HMM structures

  • We train one HMM per field

  • Words in postings are output symbols of HMM

  • Hexagons represent target states, which emit the relevant words for that field


Training Data

  • We use a randomly selected set of 110 postings as the training data

  • We manually label which words in each posting are relevant to each of the 10 fields


HMM Structure #1

  • A single prefix state and single suffix state

  • Prefixes and suffixes can be of arbitrary length
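
A minimal sketch of how a Structure #1-style HMM could tag a posting. The state names, transition probabilities, and the `floor` value for unseen words are illustrative assumptions, not values from the slides. The self-loops on the prefix and suffix states are what let them absorb any number of words; Viterbi decoding then picks out the words assigned to the target state.

```python
import math

# Hypothetical Structure #1 topology: one prefix, one target, one suffix state.
# The prefix and suffix states loop on themselves (assumed probabilities).
TRANS = {
    "prefix": {"prefix": 0.8, "target": 0.2},
    "target": {"target": 0.5, "suffix": 0.5},
    "suffix": {"suffix": 1.0},
}
START = {"prefix": 1.0}


def viterbi(words, emit, floor=1e-6):
    """Return the most likely state sequence for `words`.

    `emit[state][word]` is an emission probability; words unseen in a
    state get the small probability `floor` (an assumption of this sketch).
    """
    def log_emit(state, word):
        return math.log(emit[state].get(word, floor))

    # best[state] = (log probability of best path ending in state, that path)
    best = {s: (math.log(p) + log_emit(s, words[0]), [s])
            for s, p in START.items()}
    for word in words[1:]:
        nxt = {}
        for prev, (lp, path) in best.items():
            for s, p in TRANS[prev].items():
                cand = lp + math.log(p) + log_emit(s, word)
                if s not in nxt or cand > nxt[s][0]:
                    nxt[s] = (cand, path + [s])
        best = nxt
    return max(best.values(), key=lambda item: item[0])[1]


def extract(words, emit):
    """Return the words that the best path assigns to the target state."""
    return [w for w, s in zip(words, viterbi(words, emit)) if s == "target"]
```

Working in log space here also sidesteps underflow for the decoding step.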


HMM Structure #2

  • Varying numbers of prefix, suffix, and target states


HMM Structure #3

  • Varying numbers of prefix, suffix, and target states

  • Prefixes and suffixes are fixed in length
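
One way to picture a fixed-length topology is as a chain of states without self-loops, so each prefix and suffix state is visited exactly once. The sketch below builds the set of allowed transitions for a given number of prefix, target, and suffix states. The `background` state, the self-loop on the last target state, and all names are assumptions made for illustration; the slides do not spell out the exact wiring.

```python
def build_topology(n_prefix, n_target, n_suffix):
    """Return {state: set of allowed next states} for a chain-shaped HMM.

    Assumed layout: background -> prefix chain -> target chain ->
    suffix chain -> background. Prefix and suffix states have no
    self-loops, which fixes the prefix and suffix lengths.
    """
    if n_target < 1:
        raise ValueError("need at least one target state")
    prefix = [f"prefix{i}" for i in range(n_prefix)]
    target = [f"target{i}" for i in range(n_target)]
    suffix = [f"suffix{i}" for i in range(n_suffix)]
    chain = prefix + target + suffix

    allowed = {s: set() for s in ["background"] + chain}
    # The background state absorbs words far away from the field.
    allowed["background"] = {"background", chain[0]}
    for current, following in zip(chain, chain[1:]):
        allowed[current].add(following)
    allowed[chain[-1]].add("background")
    # Assumption: the last target state may repeat, so a field can be
    # longer than the number of target states.
    allowed[target[-1]].add(target[-1])
    return allowed
```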


Cross-Validation

  • We use cross-validation to find the optimal number of prefix, suffix, and target states
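
The search over state counts can be sketched as a grid search with k-fold cross-validation. The number of folds, the candidate sizes, and the `score_fn` callback (which would train an HMM with the given configuration and return a score such as F-measure on the held-out fold) are all assumptions; the slides do not state them.

```python
from itertools import product


def k_fold(items, k):
    """Yield (train, held_out) splits; item j goes to fold j % k."""
    for i in range(k):
        held_out = items[i::k]
        train = [x for j, x in enumerate(items) if j % k != i]
        yield train, held_out


def select_structure(postings, score_fn, sizes=(1, 2, 3, 4), k=5):
    """Return the (n_prefix, n_suffix, n_target) with the best mean score.

    `score_fn(config, train, held_out)` is a hypothetical callback that
    trains on `train` and returns a score on `held_out`.
    """
    best_config, best_score = None, float("-inf")
    for config in product(sizes, repeat=3):
        scores = [score_fn(config, train, held_out)
                  for train, held_out in k_fold(postings, k)]
        mean = sum(scores) / len(scores)
        if mean > best_score:
            best_config, best_score = config, mean
    return best_config, best_score
```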


Preventing Underflow

  • Postings are hundreds of words long

  • Forward and backward probabilities become extremely small, causing floating-point underflow

  • To avoid underflow, we normalize the forward probabilities at each time step instead of using the unnormalized values
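
A minimal sketch of a normalized forward pass, assuming the common scaling scheme in which the forward values are rescaled to sum to 1 at every step. The product of the scale factors equals the probability of the observation sequence, so summing their logarithms gives the log-likelihood without ever forming a tiny number. The function name and the dictionary-based model layout are illustrative assumptions.

```python
import math


def forward_log_likelihood(obs, states, start, trans, emit):
    """Return log P(obs) using per-step normalization of forward values.

    start[s], trans[r][s], and emit[s][symbol] are plain probabilities.
    """
    log_prob = 0.0
    alpha = {}
    for t, symbol in enumerate(obs):
        if t == 0:
            alpha = {s: start[s] * emit[s][symbol] for s in states}
        else:
            alpha = {s: emit[s][symbol]
                     * sum(alpha[r] * trans[r][s] for r in states)
                     for s in states}
        scale = sum(alpha.values())
        if scale == 0.0:
            return float("-inf")  # the model cannot produce this sequence
        # Rescale so the forward values sum to 1, and keep the log of the factor.
        alpha = {s: a / scale for s, a in alpha.items()}
        log_prob += math.log(scale)
    return log_prob
```

For a posting of a few thousand words, the unnormalized probability would round to zero in double precision, while the log-likelihood above stays finite.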


Smoothing

  • We perform add-one smoothing for the emission probabilities, so that words never seen in a state during training still get a nonzero probability
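
A minimal sketch of add-one (Laplace) smoothing for emissions: each word's count in a state is increased by 1, and the state's total count is increased by the vocabulary size, so the probabilities still sum to 1. The function name and the (state, word) input layout are assumptions made for illustration.

```python
from collections import Counter


def add_one_emissions(pairs, vocab):
    """Return emit[state][word] with add-one smoothing.

    pairs: iterable of (state, word) observations from labeled data.
    vocab: set of all words the model should be able to emit.
    """
    pairs = list(pairs)
    pair_counts = Counter(pairs)
    state_counts = Counter(state for state, _ in pairs)
    v = len(vocab)
    return {
        state: {
            word: (pair_counts[(state, word)] + 1) / (state_counts[state] + v)
            for word in vocab
        }
        for state in state_counts
    }
```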


Rapier

  • Rapier automatically learns rules to extract fields from training examples

  • We use the same 110 training postings as for the HMMs


Data Preparation

  • Sentence Splitter (Cognitive Computation Group at UIUC, http://l2r.cs.uiuc.edu/~cogcomp/tools.php): puts one sentence on each line

  • Stanford Tagger (Stanford NLP Group, http://nlp.stanford.edu/software/tagger.shtml): tags each word with part of speech

  • We then manually create a template file for each of the files, with the information for the 10 fields filled in


Test Data

  • We use a randomly selected set of 100 postings as the test data

  • We manually label these 100 postings with the fields


Rapier Results

  • We use Rapier’s “test2” program to evaluate performance on the labeled postings

  • Training Set

    • Precision: 0.990099

    • Recall: 0.408998

    • F-measure: 0.578871

  • Test Set

    • Precision: 0.747126

    • Recall: 0.151869

    • F-measure: 0.252427
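
The reported F-measures are consistent with the balanced F-measure, the harmonic mean of precision and recall. A short check, assuming that definition (the function name is illustrative):

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


train_f = f_measure(0.990099, 0.408998)  # training-set figures from the slide
test_f = f_measure(0.747126, 0.151869)   # test-set figures from the slide
```

Both values agree with the slide's figures to within rounding. Because the harmonic mean is pulled toward the smaller of the two inputs, the low recall dominates both scores.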


Another run at Rapier


HMM Structure #1


HMM Structure #2


HMM Structure #3


Insights

  • Relatively good performance with Rapier

  • Weaker performance with the HMMs, due to lack of training data: only 100 postings (0.67%) sampled randomly from 15000, while the test data is 1500 postings (10%) sampled from the same 15000.

  • Automatic spelling correction remains limited, even when enhanced with California town, city, and county names and with people's first names.

  • A more advanced ontology would help, as WordNet is somewhat limited: it does not recognize entities such as SJSU, Albertson, or street names


Question & Answer

