1 / 33

MINING REAL ESTATE LISTINGS USING ORACLE DATA WAREHOUSING AND PREDICTIVE REGRESSION

MINING REAL ESTATE LISTINGS USING ORACLE DATA WAREHOUSING AND PREDICTIVE REGRESSION. Wuri Wedyawati, Meiliu Lu Department of Computer Science California State University Sacramento, CA 95819-6021 mlu@csus.edu. Outline. Introduction Data Warehousing Building a data warehouse

jesse
Download Presentation

MINING REAL ESTATE LISTINGS USING ORACLE DATA WAREHOUSING AND PREDICTIVE REGRESSION

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. MINING REAL ESTATE LISTINGS USING ORACLE DATA WAREHOUSING AND PREDICTIVE REGRESSION Wuri Wedyawati, Meiliu Lu Department of Computer Science California State University Sacramento, CA 95819-6021 mlu@csus.edu

  2. Outline • Introduction • Data Warehousing • Building a data warehouse • MasterDW: the data warehouse • Predictive Regression • Real Estate Price Prediction • Conclusion • Future work

  3. Introduction • The objective is to develop a knowledge discovery system for prospective real estate sellers and buyers to determine their properties price based on local sold listings. • The prediction of properties selling price, is modeled by predictive regression. • Building a data warehouse is a prerequisite for efficient mining of large and operational data like Multiple Listings Services (MLS) – data source for this system.

  4. Data Warehouse • A decision support database that is maintained separately from the organization’s operational database. • Support decision-making by providing a platform of consolidated, historical data for analysis. • Our data warehouse is based on a multidimensional data model called star schema with one large fact table surrounded by a set of dimension tables.

  5. Data Warehousing • Process of building a data warehouse: 1. Extraction 2. Transformation and cleansing 3. Modeling 4. Transport

  6. 1. Extraction • Document the sources of data • Identify the databases and files containing the data of interest • Analyze and document the business meaning of the data, data relationships and business rules • Determine data that need to be extracted • Extract all of subset of the data from the source • Use unload utility • Use data manipulation language statement • Extract the changes made to the source data • Use a recovery log • Use a database trigger

  7. 2. Transformation and Cleansing • Check the integrity of the source data to verify that it conforms to the business rules and relationships identified in extraction step. • Check the accuracy of the source data. • Identify the tasks required for data cleansing. • Transform and integrate the cleaned data into the format required by the target system – data warehouse.

  8. 3. Modeling • Star Schema shows data as a collection of two types: facts and dimensions. • A Facttable is the primary table in a dimensional model and it contains the names of the facts or numerical measures, as well as keys to each of the related dimension tables. Examples of facts: sales, credit cards accounts, residential records. • A Dimension tableis used to describe a specific dimension with a set of attributes. Examples of dimensions: time, students, areas.

  9. An Example Star Schema

  10. AGENTS Dimension Table OFFICES Dimension Table RESIDENTIAL Fact Table AREAS Dimension Table MasterDW Modeling

  11. 4. Transport • Identify the tools and techniques to be used for loading the data into the target system • SQL*Loader utility (for flat file data) • Transportable tablespaces (for Oracle database) • Evaluate the need for data compression and encryption if captured or transformed data is to be transported across a network

  12. RESI.TXT (Data Source) Transformation and Cleansing RESSOLDLOG.TXT (Log File) Update RES.TXT Transformation and Cleansing 2 OFCSRC.TXT AGTSRC.TXT Duplicate Detection OFFICE.TXT AGENT.TXT RESIDENTIAL.TXT AREA.TXT Load Load Load OFFICES TABLE AGENTS TABLE RESIDENTIAL TABLE AREA TABLE MasterDW Data Warehousing

  13. MasterDW Extraction • The operational data source is extracted from Sacramento, El Dorado, Placer, and Yolo Counties Multiple Listings Services (MLS) database. • It captures all the residential data in the source system since January 1, 1998 until January 9, 2004. • The source data is in the “|” delimited flat file and contains of 191 fields and 295787 rows (“RESI.TXT”).

  14. MasterDW Transformation and Cleansing There are four steps : • Transformation and cleansing 1 • Update process for the result of transformation and cleansing 1 • Transformation and cleansing 2 • Duplication detection for office and agent records

  15. 1. Transformation and Cleansing 1 • Listing Price Check If intLP <= 0 Or intLP > 99999999 Then LPCheck = strMLSNo & " : INVALID LP = " & Str(intLP) ElseIf intLP < 10000 Or intLP > 50000000 Then LPCheck = strMLSNo & " : LP EXCEEDS LIMIT = " & Str(intLP) End If • Square Footage Check If intSQFT = 0 Then SQFTCheck = strMLSNo & " : SQFT IS NULL = " & Str(intSQFT) ElseIf intSQFT > 10000 And intLP < 1000000 Then SQFTCheck = strMLSNo & " : SQFT EXCEEDS LIMIT = " & Str(intSQFT) End If

  16. 1. Transformation and Cleansing 1 (cont.) • Listing Date Check If strLD = "0000-00-00" Or Len(strLD) < 8 Then LDCheck = strMLSNo & " : INVALID LD = " & strLD ElseIf DateValue(strLD) < "1900-01-01" Then LDCheck = strMLSNo & " : LD EXCEEDS LIMIT = " & strLD End If • Number of Full Bathroom and Half Bathroom Check If intFull <= 0 Then BathCheck = strMLSNo & " : NO FULL BATHROOM = " & Str(intFull) & " AND " & Str(intHalf) End If

  17. 1. Transformation and Cleansing 1 (cont.) • Number of Bedroom Check If intBed <= 0 Then BedCheck = strMLSNo & " : NO BEDROOM = " & Str(intBed) End If • Year Built Check If Len(strYearBlt) = 0 Then YearBltCheck = strMLSNo & " : NO YEAR BUILT = " & strYearBlt ElseIf Val(strYearBlt) <= 1900 Then YearBltCheck = strMLSNo & " : INVALID YEAR BUILT = " & strYearBlt End If

  18. 1. Transformation and Cleansing 1 (cont.) • Pending Date Check If strSD = "0000-00-00" Or Len(strSD) < 8 Then SDCheck = strMLSNo & " : INVALID SD = " & strSD ElseIf Len(LDCheck(strMLSNo, strLD)) = 0 And DateValue(strSD) < DateValue(strLD) Then SDCheck = strMLSNo & " : SD / LD = " & strSD & " / " & strLD ElseIf Len(PDCheck(strMLSNo, strPD, strLD)) = 0 And DateValue(strSD) < DateValue(strPD) Then SDCheck = strMLSNo & " : SD / PD = " & strSD & " / " & strPD ElseIf DateValue(strSD) < "1990-01-01" Then SDCheck = strMLSNo & " : SD EXCEEDS LIMIT = " & strSD End If • Days on Market Check If intSP <= 0 Or intSP > 99999999 Then SPCheck = strMLSNo & " : INVALID SP = " & Str(intSP) ElseIf intSP < 10000 Or intSP > 50000000 Then SPCheck = strMLSNo & " : SP EXCEEDS LIMIT = " & Str(intSP) ElseIf Len(LPCheck(strMLSNo, intLP)) = 0 And (intSP < intLP / 1.5 Or intSP > intLP * 1.5) Then SPCheck = strMLSNo & " : LP / SP EXCEEDS NORM = " & intLP & " / " & intSP End If

  19. 1. Transformation and Cleansing 1 (cont.) • Sold Date Check If strPD = "0000-00-00" Or Len(strPD) < 8 Then PDCheck = strMLSNo & " : INVALID PD = " & strPD ElseIf Len(LDCheck(strMLSNo, strLD)) = 0 And DateValue(strPD) < DateValue(strLD) Then PDCheck = strMLSNo & " : PD IS LESS THAN LD => PD = " & strPD & " & LD = " & strLD ElseIf DateValue(strPD) < "1990-01-01" Then PDCheck = strMLSNo & " : PD EXCEEDS LIMIT = " & strPD End If • Sold Price Check If (LDCheck(strMLSNo, strLD)) = 0 And Len(PDCheck(strMLSNo, strPD, strLD)) = 0 And DateDiff(DateInterval.Day, DateValue(strPD), DateValue(strLD)) > 730 Then DOMCheck = strMLSNo & " : DOM TOO LARGE = " & DateDiff(DateInterval.Day, DateValue(strPD), DateValue(strLD)) End If

  20. 2. Update Process for the Result of Transformation and Cleansing 1 • 132110169 : LP EXCEEDS LIMIT = 132 (132000) • 30015346 : SQFT EXCEEDS LIMIT = 12700 (1270) • 30015611 : LD EXCEEDS LIMIT = 1920-05-07 (2000-05-07) • 30015755 : NO FULL BATHROOM = 0 AND 3 (3 AND 0) • 102100090 : INVALID YEAR BUILT = 96 (1996) • 30028591 : INVALID YEAR BUILT = 1056 (1956) • 102000035 : PD IS LESS THAN LD => PD = 2000-03-30 & LD = 2020-01-26 (2000-01-26) • 102000035 : SD / LD = 2000-05-31 / 2020-01-26 (2000-01-26) • 122003643 : SD / PD = 2000-11-01 / 2000-11-05 (2000-11-05 / 2000-11-01) • 132001727 : SP EXCEEDS LIMIT = 124 (124000) • 30016715 : LP / SP EXCEEDS NORM = 226000 / 22600000 (226000)

  21. 3. Transformation and Cleansing 2 • The agents table (“AGTSRC.TXT”) fields: Listing agent id, Listing agent name, Listing agent phone 1, Listing agent phone 2, Listing agent phone 3, Listing agent phone type 1, Listing agent phone type 2, Listing agent phone type 3, Listing office id, Listing co-agent id, Listing co-agent name, Listing co-agent phone, Listing co-office id, Selling agent id, Selling agent name, Selling agent phone, Selling office id, Selling co-agent id, Selling co-agent name, Selling co-office id. Example: “SREIDMAR|Marjorie Reid|916-485-5124|916-485- 5124||1|2||LYON01”

  22. 3.Transformation and Cleansing 2 (cont.) • The offices table (“OFCSRC.TXT”) fields: Listing office id, Listing office name, Listing office phone, Listing office address, Listing office zip, Listing co-office id, Listing co-office name, Listing co-office phone, Selling office id, Selling office name, Selling office phone, Selling co-office id, Selling co-office name. Example: “LYON01|Lyon Real Estate|916-481-3840|2580 Fair Oaks Blvd. #20 Sacramento, CA 95825|95825|Sacramento”

  23. 3. Transformation and Cleansing 2 (cont.) • The areas table (“AREA.TXT”) fields: Area number, area name, county. Example: “10819|East Sacramento & Vicinity|Sacramento County”

  24. 3. Transformation and Cleansing 2 (cont.) • The residential table (“RESIDENTIAL.TXT”) fields: Example: “15501835|2367|Glen Ellen|95822|Sacramento|10822 ||Thomas Bros. (PL,SA)|317 D-5|035-0132-012-0000|Residential|Sold|22-Jan-95|28-Jan-95|20-Apr-99|01-Jan-00|20-May-99|01-Jan-00|1549|1700 |764705.9|17.56|2|1|4|130000|159500|SGREENCA ||GCNA||130000|SSTANWIL||CLBA20||1959|3|FHA |Sacramento|Sacramento Unified|Sacramento Unified|Sacramento Unified”

  25. 4. Duplication and Detection for Agent and Office Records • “AGTSRC.TXT” contains duplicate records. An agent can be a selling agent, a buyer agent, or both in a listing. An agent can have more than one listing in “RES.TXT”. Example: “SAKBARIR|Rouhi N. Akbari|916-484-5456|916-223- 7647||1|C||LYON01” • “OFCSRC.TXT” contains duplicate records. An office can be a selling office, a buyer office, or both in a listing. An office can have more than one listing in “RES.TXT”. Example: “LYON01|Lyon Real Estate|916-481-3840|2580 Fair Oaks Blvd. #20 Sacramento, CA 95825|95825|Sacramento”

  26. AGENTS Dimension Table OFFICES Dimension Table RESIDENTIAL Fact Table AREAS Dimension Table MasterDW Modeling: Ready to load the clean data into the 4 tables

  27. MasterDW Transport • Load “AREA.TXT” to AREAS dimension table c:\>sqlldr masterdw/masterdw control=area.ctl log=area.log • Load “OFFICE.TXT” to OFFICE dimension table c:\>sqlldr masterdw/masterdw control=office.ctl log=office.log • Load “AGENT.TXT” to AGENTS dimension table c:\>sqlldr masterdw/masterdw control=agent.ctl log=agent.log • Load “RESIDENTIAL.TXT” to RESIDENTIAL dimension table c:\>sqlldr masterdw/masterdw control=residential.ctl log=residential.log

  28. Predictive Regression Predictiveregression is regression that uses continuous values in the data set to predict unknown or future values of other variables of interest. The objective of regression analysis is to determine the best model that can relate the output variable to various input variables. nn β = [ ∑ (xi – meanx) . (yi – meany) ] / [ ∑ (xi – meanx)2 ] i=1i=1 α = meany - β .meanx y = α + β.x

  29. Regression: input and output • Input Data: X, β, α to be determined by query selection result from MasterDW based on user request parameters • example: • “Select * from Residential where (Status = ‘Sold’) and (Area_Number) = ‘10835’ and (Square_Footage between ‘2000’ and ‘3000’) and (Bedrooms = ‘4’) and (Bathrooms_Full = ‘2’) and (Bathrooms_Half = ‘0’) and (Year_Built = ‘2001’)” • Assumption: Bull housing market • Output result: Y predicted house price today

  30. Interface • Interface: (user => MasterDW => Predictive Regression => user) • Visual Basic .NET is used to create user interface. • The communication between Oracle and .NET framework is established by adding Oracle Provider for OLE DB (OraOLEDB) component as reference.

  31. Conclusion • understand the knowledge domain: Real estate terms and transaction process • Technology used: • Building a data warehouse using Oracle data warehousing tools • Statistical data analysis (predictive regression method) • Visual Basic .NET programming • Oracle Provider for OLE DB (ORAOLEDB)

  32. Future Work • Towards tightly coupling data mining architecture. • Enhance this project by making it an online service for public. • Integrate current market trend factor • Determine what kind of house improvement that a real estate seller can do to increase property value on the market.

More Related