Information Extraction on Real Estate Rental Classifieds

Eddy Hartanto

Ryohei Takahashi


Overview

  • We want to extract 10 fields:

    • Security deposit

    • Square footage

    • Number of bathrooms

    • Contact person’s name

    • Contact phone number

    • Nearby landmarks

    • Cost of parking

    • Date available

    • Building style / architecture

    • Number of units in building

  • These fields can’t easily be served by keyword search


Approach

  • Hand-labeled test set as the basis for computing precision and recall

  • Pattern-matching approach with Rapier

  • Statistical approach using HMMs with different structures


Demo …


Hidden Markov Models

  • We consider three different HMM structures

  • We train one HMM per field

  • Words in postings are the output symbols of the HMM

  • Hexagons represent target states, which emit the relevant words for that field
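
A minimal sketch of how a per-field HMM could be used for extraction: decode the most likely state sequence with the Viterbi algorithm, then keep the words aligned with target states. The state names, the toy probabilities, the `floor` value for unseen words, and the `extract` helper are all illustrative assumptions, not the authors' model.

```python
import math

def viterbi(words, states, start_p, trans_p, emit_p, floor=1e-6):
    """Return the most likely state sequence for `words` (log-space Viterbi)."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    def emit(state, word):
        # Unseen words get a small floor probability (an assumption of this sketch).
        return emit_p[state].get(word, floor)

    # best[t][s] = (log-probability of the best path ending in s at time t, backpointer)
    best = [{s: (log(start_p.get(s, 0.0)) + log(emit(s, words[0])), None) for s in states}]
    for t in range(1, len(words)):
        column = {}
        for s in states:
            score, prev = max(
                (best[t - 1][r][0] + log(trans_p[r].get(s, 0.0)), r) for r in states
            )
            column[s] = (score + log(emit(s, words[t])), prev)
        best.append(column)

    # Trace back from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for t in range(len(words) - 1, 0, -1):
        state = best[t][state][1]
        path.append(state)
    return path[::-1]

def extract(words, path, target_states=("target",)):
    """Keep only the words that the decoded path assigns to a target state."""
    return [w for w, s in zip(words, path) if s in target_states]

# Hypothetical toy model for the "security deposit" field.
states = ("prefix", "target", "suffix")
start_p = {"prefix": 1.0}
trans_p = {
    "prefix": {"prefix": 0.7, "target": 0.3},
    "target": {"target": 0.5, "suffix": 0.5},
    "suffix": {"suffix": 1.0},
}
emit_p = {
    "prefix": {"security": 0.3, "deposit": 0.3, "is": 0.2},
    "target": {"$500": 0.5, "$1000": 0.4},
    "suffix": {"call": 0.3, "now": 0.3},
}

words = "security deposit is $500 call now".split()
path = viterbi(words, states, start_p, trans_p, emit_p)
extracted = extract(words, path)
```

Working in log space keeps the decoder stable on postings that are hundreds of words long.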


Training Data

  • We use a randomly selected set of 110 postings as the training data

  • We manually label which words in each posting are relevant to each of the 10 fields


HMM Structure #1

  • A single prefix state and single suffix state

  • Prefixes and suffixes can be of arbitrary length
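
As a sketch of how hand-labeled postings could train this structure, the snippet below maps each word to a prefix, target, or suffix state and estimates transition probabilities by counting. It assumes exactly one contiguous target span per posting and only these three states; the function names and the sample posting are hypothetical.

```python
from collections import Counter, defaultdict

def label_states(words, target_start, target_end):
    """Label words[target_start:target_end] as target, earlier words as prefix, later ones as suffix."""
    labels = []
    for i in range(len(words)):
        if i < target_start:
            labels.append("prefix")
        elif i < target_end:
            labels.append("target")
        else:
            labels.append("suffix")
    return labels

def estimate_transitions(label_sequences):
    """Maximum-likelihood transition probabilities from labeled state sequences."""
    counts = defaultdict(Counter)
    for labels in label_sequences:
        for prev, nxt in zip(labels, labels[1:]):
            counts[prev][nxt] += 1
    return {
        prev: {nxt: n / sum(row.values()) for nxt, n in row.items()}
        for prev, row in counts.items()
    }

# Hypothetical posting; the target for "number of bathrooms" is the word "2".
words = "spacious unit with 2 bath near campus".split()
labels = label_states(words, 3, 4)
transitions = estimate_transitions([labels])
```

The self-loops that appear on the prefix and suffix states are what allow prefixes and suffixes of arbitrary length.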


HMM Structure #2

  • Varying numbers of prefix, suffix, and target states


HMM Structure #3

  • Varying numbers of prefix, suffix, and target states

  • Prefixes and suffixes are fixed in length
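
One plausible reading of a fixed-length structure is a left-to-right chain in which prefix and suffix states have no self-loops, so a prefix of n states always consumes exactly n words. The sketch below builds such a topology as a map of allowed transitions. The `background` state, the target self-loops, and the function name are assumptions made for illustration; the slides do not specify the exact connectivity.

```python
def build_fixed_length_topology(n_prefix, n_target, n_suffix):
    """Return {state: set of allowed next states} for a left-to-right chain.

    Requires n_target >= 1. Prefix and suffix states have no self-loops,
    so their lengths are fixed by the number of states.
    """
    prefix = [f"prefix{i}" for i in range(1, n_prefix + 1)]
    target = [f"target{i}" for i in range(1, n_target + 1)]
    suffix = [f"suffix{i}" for i in range(1, n_suffix + 1)]
    chain = prefix + target + suffix

    # Assumption: a background state absorbs all words far from the field.
    allowed = {"background": {"background", chain[0]}}
    for state, nxt in zip(chain, chain[1:]):
        allowed[state] = {nxt}
    allowed[chain[-1]] = {"background"}

    # Assumption: targets may repeat or end early so field values can vary in length.
    for state in target:
        allowed[state].add(state)
        allowed[state].add(suffix[0] if suffix else "background")
    return allowed

topology = build_fixed_length_topology(n_prefix=2, n_target=2, n_suffix=1)
```

Transitions outside the allowed sets would be held at probability zero during training, which is what distinguishes one structure from another.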


Cross-Validation

  • We use cross-validation to find the optimal number of prefix, suffix, and target states
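
A minimal sketch of this selection step, assuming k-fold cross-validation over a small grid of state counts. The fold count, the grid of 1 to 3 states, and the `train_and_score` callback (which would train an HMM with the given structure and return a held-out score such as F-measure) are hypothetical; the slides do not give these details. Postings are assumed to be shuffled beforehand.

```python
from itertools import product

def k_fold_indices(n_items, k):
    """Split range(n_items) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n_items // k + (1 if i < n_items % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def select_structure(postings, candidates, train_and_score, k=5):
    """Return the (n_prefix, n_target, n_suffix) with the best mean held-out score."""
    folds = k_fold_indices(len(postings), k)
    best_config, best_score = None, float("-inf")
    for config in candidates:
        scores = []
        for held_out in folds:
            held = set(held_out)
            train = [p for i, p in enumerate(postings) if i not in held]
            valid = [postings[i] for i in held_out]
            scores.append(train_and_score(config, train, valid))
        mean = sum(scores) / len(scores)
        if mean > best_score:
            best_config, best_score = config, mean
    return best_config, best_score

# Candidate grid: every combination of 1 to 3 prefix, target, and suffix states.
candidates = list(product(range(1, 4), repeat=3))
```

Because one HMM is trained per field, a search like this would run once for each of the 10 fields.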


Preventing Underflow

  • Postings are hundreds of words long

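  • Forward and backward probabilities become extremely small, which causes floating-point underflow

  • To avoid underflow, we normalize the forward probabilities instead of using the raw, unnormalized values (formulas not preserved in this transcript)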
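
The sketch below shows one standard way to normalize the forward pass: rescale the forward probabilities to sum to 1 at every time step and accumulate the logs of the scaling factors to recover the log-likelihood. This is the common textbook scaling scheme, offered as an assumption about what the slide's formula expressed; the toy two-state model is hypothetical, and emissions are assumed nonzero (for example, after smoothing).

```python
import math

def scaled_forward(obs, states, start_p, trans_p, emit_p):
    """Forward pass that rescales alpha to sum to 1 at every step.

    Returns the normalized alphas and log P(obs), recovered as the sum
    of the logs of the per-step scaling factors.
    """
    alphas, log_likelihood = [], 0.0
    prev = None
    for t, symbol in enumerate(obs):
        raw = {}
        for s in states:
            if t == 0:
                incoming = start_p[s]
            else:
                incoming = sum(prev[r] * trans_p[r][s] for r in states)
            raw[s] = incoming * emit_p[s][symbol]
        scale = sum(raw.values())  # c_t = P(o_t | o_1 .. o_{t-1})
        prev = {s: raw[s] / scale for s in states}
        alphas.append(prev)
        log_likelihood += math.log(scale)
    return alphas, log_likelihood

# Hypothetical toy model with two states and two output symbols.
states = ("A", "B")
start_p = {"A": 0.5, "B": 0.5}
trans_p = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.2, "B": 0.8}}
emit_p = {"A": {"x": 0.7, "y": 0.3}, "B": {"x": 0.1, "y": 0.9}}

short_alphas, short_ll = scaled_forward(["x", "y"], states, start_p, trans_p, emit_p)
long_alphas, long_ll = scaled_forward(["x", "y"] * 5000, states, start_p, trans_p, emit_p)
```

On the long sequence the unscaled probability is far below the smallest positive double, while the scaled version still yields a finite log-likelihood.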


Smoothing

  • We perform add-one smoothing for the emission probabilities:
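
Add-one (Laplace) smoothing adds 1 to every word count before normalizing, so that no word has zero emission probability in any state. A minimal sketch follows; the vocabulary, the `<UNK>` token, and the sample counts are hypothetical, and every observed word is assumed to be in the vocabulary.

```python
from collections import Counter

def smoothed_emissions(state_words, vocabulary):
    """P(word | state) = (count(state, word) + 1) / (count(state) + |V|)."""
    counts = Counter(state_words)
    total = len(state_words)
    vocab_size = len(vocabulary)
    return {w: (counts[w] + 1) / (total + vocab_size) for w in vocabulary}

vocabulary = ["deposit", "$500", "call", "<UNK>"]
target_words = ["$500", "$500", "deposit"]  # words emitted by one state in training
probs = smoothed_emissions(target_words, vocabulary)
```

Words never seen in a state, such as `call` here, still receive a small nonzero probability, which keeps the forward and backward passes from collapsing to zero.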


Rapier

  • Rapier automatically learns rules to extract fields from training examples

  • We use the same 110 training postings as for the HMMs


Data Preparation

  • Sentence Splitter (Cognitive Computation Group at UIUC, http://l2r.cs.uiuc.edu/~cogcomp/tools.php): puts one sentence on each line

  • Stanford Tagger (Stanford NLP Group, http://nlp.stanford.edu/software/tagger.shtml): tags each word with part of speech

  • We then manually create a template file for each of the files, with the information for the 10 fields filled in


Test Data

  • We use a randomly selected set of 100 postings as the test data

  • We manually label these 100 postings with the fields


Rapier Results

  • We use Rapier’s “test2” program to evaluate performance on the labeled postings

  • Training Set

    • Precision: 0.990099

    • Recall: 0.408998

    • F-measure: 0.578871

  • Test Set

    • Precision: 0.747126

    • Recall: 0.151869

    • F-measure: 0.252427
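
The reported F-measures are consistent with the balanced F-measure (F1), the harmonic mean of precision and recall. The short check below recomputes them from the precision and recall figures above; the function name is our own.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

train_f = f_measure(0.990099, 0.408998)  # training-set figures from the slide
test_f = f_measure(0.747126, 0.151869)   # test-set figures from the slide
```

Both values agree with the slide to within rounding. Because the harmonic mean is dominated by the smaller term, the low recall pulls the F-measure down despite high precision.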


Another Run at Rapier


HMM Structure #1


HMM Structure #2


HMM Structure #3


Insights

  • Relatively good performance with Rapier

  • Weaker performance with the HMMs, due to lack of training data: only 0.67% of the corpus (100 postings sampled randomly from 15,000), while the test data is 10% (1,500 postings sampled from 15,000)

  • Automatic spelling correction is limited, even though we enhanced it with California town, city, and county names and with first names

  • A more advanced ontology would help, since WordNet is somewhat limited; it should recognize entities such as SJSU, Albertson, and street names


Question & Answer

