
Spatial Data Mining Toolkit for Refining MSDS (aka TopoAssistant)

Spatial Data Mining Toolkit for Refining MSDS (aka TopoAssistant). TEC SBIR Phase I A03-129 Status Update Ranga Ramanujan Sid Kudige Shashi Shekhar Gene Proctor


Presentation Transcript


  1. Spatial Data Mining Toolkit for Refining MSDS (aka TopoAssistant) TEC SBIR Phase I A03-129 Status Update • Ranga Ramanujan, 952-829-5864 (x120), ranga@atcorp.com • Sid Kudige, 952-829-5864 (x163), skudige@atcorp.com • Shashi Shekhar, 612-624-8307, shekhar@cs.umn.edu • Gene Proctor, 202-293-9701 (x113), gproctor@atcorp-dc.com

  2. Agenda • 09:00 - 12:00 SBIR Review (Kudige) • 12:00 - 01:00 Lunch • 01:00 - 01:45 ATC R&D Overview (Ramanujan) • 01:45 - 02:15 Spatial Data Mining Research at UMN (Shekhar) • 02:15 - 02:45 Facility Tour (Proctor)

  3. Outline • SBIR goal, motivation and innovations • Phase I results • Phase I prototype demonstration • Technical challenges • Phase II technical approach • Phase II work plan • Summary

  4. Overall SBIR Goal • Develop TopoAssistant tool for assisting Army topographers with refinement of feature data for “just-in-time” MSDS • Phase I Goal • Develop architecture and design of TopoAssistant software tool • Build rapid prototype to establish implementation feasibility • Phase II Goal • Build full-scale operational prototype of TopoAssistant • Phase III Goal • Transition TopoAssistant to fielded system • Team • Sid Kudige - PI • Ranga Ramanujan - Tech. Advisor • Prof. Shashi Shekhar - Consultant • Gene Proctor - Commercialization

  5. Motivation and Payoff • Current process for refining MSDS feature data is time consuming and expensive • Study estimate of 2,400 production hours for DTOP 5 data set for 15’X15’ cell size [Kabinier] • TopoAssistant tool will use innovative spatial data mining techniques to • Significantly automate feature data refinement • Detection of errors in source data • Prediction of positional errors • Prediction of extra/erroneous/missing features • Predicting mislabeled features • Feature attribution • Prediction of missing features (categorical) • Prediction of erroneous/missing attribute values (numerical) • Support timely and cost-effective Army co-production and value adding for MSDS feature data

  6. TopoAssistant Innovations • Novel approach for automating the feature data refinement using spatial data mining techniques • Detection of errors • Spatial outlier detection • statistical/empirical rules • collocation based rules • Feature attribution • Attribute/Location prediction techniques • collocation based rules • Open/Extensible implementation architecture • Plug-in/add-on spatial data mining techniques • C/JMTK framework compliant • Seamless integration with commercial GIS products

  7. Outline • SBIR goal, motivation and innovations • Phase I results • Phase I prototype demonstration • Technical challenges • Phase II technical approach • Phase II work plan • Summary

  8. Phase I Results • Demonstrated TopoAssistant feasibility • Implementation feasibility: Built prototype • Concept feasibility: Designed prototype evaluation methodology for TEC datasets • Concept feasibility: Applied spatial data mining techniques for • Detection of errors • Prediction of positional errors • Prediction of extra/erroneous/missing features • Prediction of mislabeled features • Feature attribution • Prediction of missing features • Identified technical challenges and Phase II approach for addressing them

  9. Implementation Feasibility: Phase I Prototype Architecture • Back-end spatial database component: shapefile dataset, shapefile to SQL conversion (shp2pgsql), load SQL tables into Postgres/PostGIS, spatial joins using SQL queries • JDBC bridge • Front-end spatial data mining component: outlier detection/collocation package (Weka), convert SQL tables into shapefiles • Data visualization component: visualize shapefiles as maps with ArcExplorer

  10. Architecture Components • Back-end Spatial Database Component • PostGIS - Spatially enables PostgreSQL tables; OGIS compliant • Shp2pgsql tool - Shapefile to SQL table conversion • Bulk loader - Loads SQL tables into the spatially enabled database • Front-end Data Mining Component • Weka - Java-based public domain software that implements classical data mining techniques • Custom spatial data mining classes - Spatial outlier detection/collocation pattern detection package implemented for Weka • Pgsql2shp - Converts SQL tables returned by an outlier detection/collocation pattern detection operation into shapefiles

  11. Architecture Components • Connector Component - JDBC Bridge • Java client in Weka can access PostGIS “geometry” objects in Postgres database using JDBC extensions bundled with Postgres and PostGIS. • JDBC bridge successfully tested on test machine • Map Visualization Component • ArcExplorer for shapefile visualization

  12. Prototype Evaluation Methodology • Received Korea dataset from TEC • Reviewed dataset using ArcExplorer • Leveraged spatial database component to convert shapefiles to SQL scripts • Loaded tables in Postgres/PostGIS • Formulated and ran SQL3/OGIS queries to mine outliers/collocation patterns and compute interest measures • Converted resulting tables into shapefiles • Visualized results using ArcExplorer

  13. TEC Dataset Overview • Korea dataset • Latitude 37deg15min to 37deg30min • Longitude 128deg23min51sec to 128deg23min52sec • Layers • Obstacles (Cut, embankment, depression) • Surface drainage (River, stream, island, common open water, ford, dam) • Slope • Soils (Poorly graded gravel, clayey sand, organic silt, disturbed soil) • Vegetation (Land subject to inundation, cropland, rice field, evergreen trees, mixed trees) • Transport (Roads, cart roads, railways)

  14. TEC Dataset Overview • Visualized using ArcExplorer except elevation data • Interpreted feature sets in TEC datasets • Using FACC • Except common open water feature (surface drain layer) • Pattern rich • Numerous spatial outliers • Collocation patterns • Promising test dataset for spatial data mining

  15. Phase I Results • Demonstrated TopoAssistant feasibility • Implementation feasibility: Built prototype • Concept feasibility: Designed prototype evaluation methodology for TEC datasets • Concept feasibility: Applied spatial data mining techniques for • Detection of errors • Prediction of positional errors • Prediction of extra/erroneous/missing features • Prediction of mislabeled features • Feature attribution • Prediction of missing features • Identified technical challenges and Phase II approach for addressing them

  16. Detecting Errors via Spatial Outliers • Motivation - Improve map accuracy by detecting/predicting • Positional errors • Extra/erroneous/missing features • Mislabeled/misclassified features • Spatial outlier detection techniques • Statistical/user defined tests • Collocation patterns

  17. Spatial Outliers Detected • Statistical/user defined tests • Disconnected road • Overlapping road and river

  18. Statistical/Empirically Derived Outliers: Positional Error, Disconnected Roads • 6 disconnected roads discovered • Visual inspection may not reveal disconnect without further zooming • May be indicative of positional error • Distance threshold is 0.001 units [Map labels: Road 1 through Road 6; legend: Disconnected Road]

  19. Statistical/Empirically Derived Outliers: Positional Error, Disconnected Roads • 6 disconnected roads discovered • Visual inspection may not reveal disconnect without further zooming • May be indicative of positional error • Distance threshold is 0.001 units [Map labels: Road 1 through Road 6 with six Disconnect markers; legend: Disconnected Road]

  20. Disconnected Road: Magnified View Road 1 Disconnected

  21. Disconnected Road: Magnified View Disconnected Road 2

  22. Disconnected Road: Magnified View Disconnected Road 3 Disconnected

  23. Disconnected Road: Magnified View Disconnected Road 3

  24. Disconnected Road: Magnified View, Frontage Road Example • End point of road geometry • Road 4: disconnected? • Interesting because the end point of Road 4 does not appear visually to be close to the end point of another road. Or is it? • Afterthought: Road 4 resembles a frontage road

  25. Disconnected Road: Magnified View Road 5 Disconnected

  26. Disconnected Road: Magnified View Disconnected Road 6

  27. Disconnected Road: Additional Outlier Discovered Disconnected Outlier! Road 6

  28. Detecting Disconnected Roads: Empirical Technique Used • Determine and store the start point and end point of each road in the road table • Calculate the distance between the start point and end point of each road and the start point and end point of every other road • Flag as outliers the roads that do not touch another road but have an end point less than 0.001 units from one of that road's end points
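The three steps above can be sketched in plain Python, outside the database. This is a minimal illustration, not the prototype's code: the function names and toy coordinates are hypothetical, roads are reduced to lists of points, and "the roads do not touch" is approximated as "no shared end point".

```python
from itertools import permutations
from math import dist

THRESHOLD = 0.001  # distance threshold quoted on the slide, in dataset units


def endpoints(road):
    """Return the start point and end point of a road polyline."""
    return road[0], road[-1]


def find_disconnected(roads, threshold=THRESHOLD):
    """Flag roads whose end point nearly, but not exactly, meets another road's end point."""
    flagged = set()
    for (id1, r1), (_id2, r2) in permutations(roads.items(), 2):
        gaps = [dist(p, q) for p in endpoints(r1) for q in endpoints(r2)]
        # Assumption: two roads "touch" only when they share an end point (gap of 0).
        if 0 < min(gaps) < threshold:
            flagged.add(id1)
    return flagged


# Hypothetical toy roads, each a list of (x, y) vertices.
roads = {
    "A": [(0.0, 0.0), (1.0, 0.0)],
    "B": [(1.0005, 0.0), (2.0, 0.0)],  # starts 0.0005 units past the end of A
    "C": [(2.0, 0.0), (3.0, 0.0)],     # shares an end point with B
    "D": [(5.0, 5.0), (6.0, 5.0)],     # far from every other road
}
print(sorted(find_disconnected(roads)))  # ['A', 'B']
```

Comparing every pair of roads is quadratic in the number of roads, which may be acceptable for a single map cell but would likely need a spatial index on larger datasets.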

  29. Detecting Disconnected Roads: Spatial Query Fragment

          -- Start point and end point of every road
          CREATE VIEW Road AS
          SELECT T.id AS Road_id,
                 T.the_geom AS Road_Geometry,
                 startpoint ( T.the_geom ) AS Road_Start_Point,
                 endpoint ( T.the_geom ) AS Road_End_Point
          FROM Road_Line_Table T ;

          -- Roads that do not touch but whose end points are less than 0.001 units apart
          CREATE VIEW Disconnected_Road AS
          SELECT R1.Road_id AS Disconnected_Road_id
          FROM Road R1, Road R2
          WHERE ( disjoint ( R1.Road_Geometry, R2.Road_Geometry ) = true )
            AND ( distance ( R1.Road_Start_Point, R2.Road_Start_Point ) < 0.001
               OR distance ( R1.Road_Start_Point, R2.Road_End_Point ) < 0.001
               OR distance ( R1.Road_End_Point, R2.Road_Start_Point ) < 0.001
               OR distance ( R1.Road_End_Point, R2.Road_End_Point ) < 0.001 ) ;

          CREATE TABLE Disconnected_Road_Outlier AS
          SELECT DISTINCT R.*
          FROM Road_Line_Table R, Disconnected_Road D
          WHERE R.id = D.Disconnected_Road_id ;

  30. Detecting Disconnected Roads: Spatial Query Performance • Machine used - 1.4 GHz Athlon with 512 MB RAM • Total execution time - 4.5 minutes

  31. Statistical/Empirically Derived Outliers: Road Frequently Crossing River • Road frequently crossing river • Visual inspection may not reveal outlier without further zooming • May be indicative of positional error • Threshold = 0.001 units [Map labels: Road 1, Road 2, Road 3; legend: River, Road]

  32. Statistical/Empirically Derived Outliers: Road Frequently Crossing River • Road frequently crossing river • May be indicative of positional error [Map labels: Road 1, Road 2, Road 3, each marked Outlier; legend: River, Road]

  33. Road Frequently Crossing River: Magnified View Outlier Outlier Road 1 Legend River Road Bridge

  34. Road Frequently Crossing River: Magnified View Road 2 Legend River Outlier Road Bridge

  35. Road Frequently Crossing River: Magnified View Legend River Road 3 Road Bridge Outlier

  36. Detecting Road Frequently Crossing River: Empirical Technique Used • Determine intersections of roads and rivers • Pair up the intersection locations • If the two locations of a pair are less than 0.001 units apart, classify the road as an outlier • Ensure that there is no bridge geometry feature between the two locations of the pair
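As a rough sketch of this technique, the check can be written in plain Python once the crossing locations are known. Everything here is illustrative: the crossing points are supplied directly rather than computed from geometries, the names are hypothetical, and "a bridge between the two locations" is simplified to "a bridge point within the threshold of their midpoint".

```python
from itertools import combinations
from math import dist

THRESHOLD = 0.001  # distance threshold quoted on the slide, in dataset units


def frequent_crossers(crossings, bridges, threshold=THRESHOLD):
    """Return ids of roads with two river crossings closer than threshold and no bridge between them."""
    flagged = set()
    for road_id, points in crossings.items():
        for p, q in combinations(points, 2):
            if not (0 < dist(p, q) < threshold):
                continue
            # Assumption: a bridge lies "between" the crossings when it is near their midpoint.
            mid = ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)
            if all(dist(mid, b) >= threshold for b in bridges):
                flagged.add(road_id)
    return flagged


# Hypothetical toy data: crossing points per road, plus known bridge locations.
crossings = {
    "road_1": [(10.0, 5.0), (10.0004, 5.0)],  # two crossings 0.0004 apart, no bridge
    "road_2": [(20.0, 5.0), (20.0004, 5.0)],  # same gap, but a bridge sits between them
    "road_3": [(30.0, 5.0), (30.5, 5.0)],     # crossings far apart
}
bridges = [(20.0002, 5.0)]
print(sorted(frequent_crossers(crossings, bridges)))  # ['road_1']
```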

  37. Detecting Road Frequently Crossing River: Spatial Query Fragment

          -- Geometry of each road/river crossing
          CREATE VIEW Road_River_Cross_Geometry AS
          SELECT T.id AS Road_Cross_RiverID,
                 intersection ( T.the_geom, S.the_geom ) AS Road_Cross_River
          FROM Road_Line_Table T, River_Area_Table S
          WHERE intersects ( T.the_geom, S.the_geom ) = true ;

          -- Crossings that are separate but less than 0.001 units apart
          CREATE VIEW Roads_Crossing_River_Frequently AS
          SELECT R1.Road_Cross_RiverID AS Road_Cross_River_OutlierID
          FROM Road_River_Cross_Geometry R1, Road_River_Cross_Geometry R2
          WHERE disjoint ( R1.Road_Cross_River, R2.Road_Cross_River )
            AND distance ( R1.Road_Cross_River, R2.Road_Cross_River ) < 0.001 ;

          CREATE TABLE Road_Crossing_River_Outlier AS
          SELECT DISTINCT T.*
          FROM Road_Line_Table T, Roads_Crossing_River_Frequently R
          WHERE T.id = R.Road_Cross_River_OutlierID ;

  38. Detecting Road Frequently Crossing River Spatial Query Performance • Machine used - 1.4 GHz Athlon with 512 MB RAM • Total execution time - 5 minutes

  39. River Becoming Stream: Predicting Mislabeled Features • Streams usually become rivers but rivers rarely become streams unless a lake is nearby • River becoming a stream is a local spatial outlier Stream River

  40. Detecting River Becoming Stream: Empirical Technique Used • Determine intersections of rivers and streams • If no lake lies within 0.01 units of an intersection point, classify the river feature as an outlier
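The slides give no query for this rule, so the following is only a minimal Python sketch of how it might look. The function name and toy data are hypothetical, the river/stream junctions are assumed to be precomputed, and each lake is reduced to a single representative point.

```python
from math import dist

LAKE_THRESHOLD = 0.01  # lake search distance quoted on the slide, in dataset units


def rivers_without_nearby_lake(junctions, lakes, threshold=LAKE_THRESHOLD):
    """Return sorted ids of rivers that meet a stream with no lake within threshold of the junction."""
    return sorted({
        river_id
        for river_id, point in junctions
        if all(dist(point, lake) >= threshold for lake in lakes)
    })


# Hypothetical toy data: (river id, junction point) pairs and lake points.
junctions = [("river_1", (1.0, 1.0)), ("river_2", (4.0, 4.0))]
lakes = [(1.005, 1.0)]  # 0.005 units from the river_1 junction
print(rivers_without_nearby_lake(junctions, lakes))  # ['river_2']
```

As stated on the slide, the rule does not check the direction of flow, so this sketch flags any river/stream junction without a nearby lake.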

  41. Phase I Results • Demonstrated TopoAssistant feasibility • Implementation feasibility: Built prototype • Concept feasibility: Designed prototype evaluation methodology for TEC datasets • Concept feasibility: Applied spatial data mining techniques to • Detection of errors • Prediction of positional errors • Prediction of extra/erroneous/missing features • Prediction of mislabeled features • Feature attribution • Prediction of missing features • Identified technical challenges and Phase II approach for addressing them

  42. Feature Attribution via Collocation • Motivation - Improve feature attribution by • Prediction of missing features • Approach - collocation patterns • Collocation patterns detected • Crop land/rice fields: ends of roads/cart roads/rivers/streams • Road collocated with river/stream

  43. Detecting Collocation Patterns: Algorithmic Basis • To calculate the degree of collocation we use an interest measure • E.g., 96.5% of the cropland features are close to a road/river • The interest measure represents a conditional probability, i.e., the probability of finding a road or river nearby, given a cropland, is 0.965 • Cropland not close to road/river may predict missing road or river feature • Cropland not close to road/river may also indicate positional error of cropland
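The conditional probability can be illustrated with a few lines of Python. This is a sketch under stated assumptions: the function name is hypothetical, and the count of 192 collocated features is not given in the slides. It is inferred here because, out of the 199 cropland features reported later in the deck, 192 is the only whole count that rounds to 96.5%.

```python
def interest_measure(has_neighbour):
    """Estimate P(road or river nearby | cropland) from one boolean flag per cropland feature."""
    flags = list(has_neighbour)
    if not flags:
        raise ValueError("at least one cropland feature is required")
    return sum(flags) / len(flags)


# Assumption: 192 of the 199 cropland features are collocated (inferred, not stated in the slides).
flags = [True] * 192 + [False] * 7
print(f"{interest_measure(flags):.1%}")  # 96.5%
```

Under that assumption, the remaining 7 features would be the candidates for a missing road or river, or for a positional error of the cropland itself.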

  44. Predicting Missing Features using Collocation Patterns • Cropland collocated with river, stream or road • May predict missing river, stream or road features River/stream Cropland Road Non collocated cropland

  45. Spatial Outlier Detection using Collocation Patterns • Cropland collocated with river, stream or road • Cropland outlier may also predict positional error of cropland River/stream Cropland Road Cropland outlier

  46. Cropland/Road/River: Interest Measure • Total number of cropland features = 199 • Distance threshold = 0.001 • 96.5 % of all cropland features collocated with road or river

  47. Cropland/Road/River Collocation Pattern: Technique Used • Cropland pattern detected using collocation pattern detection techniques • Step 1: Cropland areas collocated with cart road/road determined • Step 2: Cropland areas collocated with stream/river determined • Step 3: Cropland areas collocated with cart road/road or stream/river determined • Cropland outliers are cropland areas which are not collocated with either road, cart road, stream or river features
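The three steps and the outlier rule map naturally onto set operations. The sketch below is illustrative only: the function names and coordinates are hypothetical, and every feature is reduced to a point, whereas the real layers hold lines and polygons.

```python
from math import dist


def collocated(croplands, features, threshold):
    """Return ids of croplands with at least one feature point closer than threshold."""
    return {
        cid
        for cid, point in croplands.items()
        if any(dist(point, f) < threshold for f in features)
    }


def cropland_outliers(croplands, road_points, water_points, threshold=0.001):
    """Return ids of croplands collocated with neither road/cart road nor stream/river points."""
    near_road = collocated(croplands, road_points, threshold)    # Step 1
    near_water = collocated(croplands, water_points, threshold)  # Step 2
    near_either = near_road | near_water                         # Step 3
    return set(croplands) - near_either


# Hypothetical toy data.
croplands = {"c1": (0.0, 0.0), "c2": (1.0, 1.0), "c3": (5.0, 5.0)}
road_points = [(0.0005, 0.0)]   # close to c1
water_points = [(1.0, 1.0005)]  # close to c2
print(cropland_outliers(croplands, road_points, water_points))  # {'c3'}
```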

  48. Cropland/Road/River Collocation Pattern: Spatial Query Fragment

          CREATE TABLE Cropland_River_Collocate AS
          SELECT C.*
          FROM River_Area_Table R, Veg_Area_Table C
          WHERE ( C.f_code_des = 'Cropland' AND distance ( C.the_geom, R.the_geom ) < 0.01 )
             OR ( C.f_code_des = 'Rice Field' AND distance ( C.the_geom, R.the_geom ) < 0.01 ) ;

          CREATE TABLE Cropland_Stream_Collocate AS
          SELECT C.*
          FROM Stream_Line_Table R, Veg_Area_Table C
          WHERE ( C.f_code_des = 'Cropland' AND distance ( C.the_geom, R.the_geom ) < 0.001 )
             OR ( C.f_code_des = 'Rice Field' AND distance ( C.the_geom, R.the_geom ) < 0.001 ) ;

          CREATE TABLE Cropland_Road_Collocate AS
          SELECT C.*
          FROM Road_Line_Table R, Veg_Area_Table C
          WHERE ( C.f_code_des = 'Cropland' AND distance ( C.the_geom, R.the_geom ) < 0.001 )
             OR ( C.f_code_des = 'Rice Field' AND distance ( C.the_geom, R.the_geom ) < 0.001 ) ;

          CREATE TABLE Cropland_Cartroad_Collocate AS
          SELECT C.*
          FROM Cartroad_Line_Table R, Veg_Area_Table C
          WHERE ( C.f_code_des = 'Cropland' AND distance ( C.the_geom, R.the_geom ) < 0.001 )
             OR ( C.f_code_des = 'Rice Field' AND distance ( C.the_geom, R.the_geom ) < 0.001 ) ;

  49. Cropland/Road/River Collocation Pattern: Spatial Query Performance • Machine used - 1.4 GHz Athlon with 512 MB RAM • Total execution time - 13.5 minutes

  50. Collocation Pattern: Roads with Rivers • Road collocated with river/stream • Open question: could this pattern be used to predict anything? • May predict missing streams River/Stream Collocated Roads Non collocated Roads
