
Database Research: Data Mining & Other Areas

Database Research: Data Mining & Other Areas. Dr. Aparna Varde Ph.D., Computer Science, WPI, MA Assistant Professor, Computer Science, VSU, VA Presentation at Montclair State University, NJ May 2, 2008. Agenda. Database Systems Introduction to Databases and Research Areas Data Mining


Presentation Transcript


  1. Database Research: Data Mining & Other Areas • Dr. Aparna Varde (Ph.D., Computer Science, WPI, MA; Assistant Professor, Computer Science, VSU, VA) • Presentation at Montclair State University, NJ, May 2, 2008

  2. Agenda • Database Systems • Introduction to Databases and Research Areas • Data Mining • Research Problem in Graphical Data Mining • Other Areas • Data Warehousing • Web Databases

  3. Data in Various Forms • Flat Files (Unprocessed) • Documents (Processed) • Raw Data (Handwritten) • Images (Complex) • Human Mind (Too much data) • Simple Tables (Organized)

  4. Need for Databases • Integration of data • Efficient storage • Fast retrieval • Ease of modification • Security of information • Recovery from failures

  5. Database System Environment • Users • Application Programs / Queries • Database System • DBMS (Database Management System) • Database

  6. Roles in the Database World • Database Administrator • Database Application Programmer • Database User • Database Researcher

  7. Examples of Database Research Areas • Query Processing and Optimization • Privacy and Security • Storage and Indexing • Data Mining • Data Warehousing • Web Databases

  8. Data Mining • Discovering knowledge from data • Non-trivial process of finding novel and interesting patterns in large datasets to guide future decisions • Types of Data • Numbers • Graphs • Images • Text

  9. Data Mining Techniques • Association Rule Mining • Discovering relationships of the type A => B • Clustering • Grouping objects based on similarity • Classification • Predicting the class of a target
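
As a small illustration of a rule of the type A => B, the sketch below computes the two standard measures used to judge such a rule, support and confidence. The transactions and item names are invented for the example and are not from the presentation.

```python
# Toy transaction data; the items are hypothetical and only for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) for the rule A => B."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


# Rule: bread => milk
rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(rule_support, rule_confidence)
```

A rule is usually reported only when both values exceed user-chosen thresholds.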

  10. Graphical Data Mining Problem • Experimental results in scientific domains plotted as graphs • Users pose queries for predictive analysis: • Given input conditions, predict most likely graph • Given desired graph, predict most likely conditions • Need for mining graphical data to discover knowledge

  11. Proposed Approach: AutoDomainMine

  12. AutoDomainMine: Prediction of Graph

  13. AutoDomainMine: Prediction of Conditions

  14. Main Tasks • Task 1: AutoDomainMine Learning Strategy of Integrating Clustering and Classification [AAAI-06 Poster, ACM SIGART’s ICICIS-05] • Task 2: Learning Domain-Specific Distance Metrics for Graphs [ACM KDD’s MDM-05, MTAP-06 Journal] • Task 3: Designing Semantics-Preserving Representatives for Clusters [ACM SIGMOD’s IQIS-06, ACM CIKM-06]

  15. Learning Distance Metrics for Graphs • Various distance metrics • Absolute position of points • Statistical observations • Critical features • Issues • Not known what metrics apply • Multiple metrics may be relevant • Need for distance metric learning in graphs (Figure: Example of domain-specific problem)

  16. Proposed Distance Metric Learning Approach: LearnMet • Given • Training set with actual clusters of graphs • Additional Input • Components: distance metrics applicable to graphs • LearnMet Metric • $D = \sum_i w_i D_i$
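
The weighted-sum form of the LearnMet metric can be sketched as follows. This is a minimal illustration, assuming each graph is a list of y-values sampled at the same x positions; the three component functions are hypothetical stand-ins for the position-based, statistical, and critical-feature metrics named on the previous slide, not the components used in the actual system.

```python
import math


def euclidean(g1, g2):
    """Position-based component: point-by-point Euclidean distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(g1, g2)))


def mean_difference(g1, g2):
    """Statistical component: difference between the mean y-values."""
    return abs(sum(g1) / len(g1) - sum(g2) / len(g2))


def max_difference(g1, g2):
    """Critical-feature component: difference between the peak values."""
    return abs(max(g1) - max(g2))


def weighted_distance(g1, g2, components, weights):
    """D = sum over i of w_i * D_i(g1, g2)."""
    return sum(w * d(g1, g2) for w, d in zip(weights, components))


components = [euclidean, mean_difference, max_difference]
weights = [0.5, 0.0, 1.0]  # hypothetical weights; LearnMet would learn these
d = weighted_distance([0, 1, 2], [0, 1, 4], components, weights)
print(d)
```

Setting a weight to zero drops that component, which is how a learned metric can express that a candidate distance is irrelevant in a given domain.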

  17. Evaluate Accuracy • Use pairs of graphs • A pair (ga, gb) is • TP - same predicted cluster, same actual cluster: (g1, g2) • TN - different predicted clusters, different actual clusters: (g2, g3) • FP - same predicted cluster, different actual clusters: (g3, g4) • FN - different predicted clusters, same actual cluster: (g4, g5)

  18. Evaluate Accuracy (Contd.) • How do we compute error for whole set of graphs? • For all pairs • Error Measure • Failure Rate FR • FR = (FP+FN) / (TP+TN+FP+FN) • Error Threshold (t) • Extent of FR allowed • If (FR < t) then clustering is accurate
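
The pair-based evaluation can be sketched as below. The cluster labels are invented so that the pairs (g1, g2), (g2, g3), (g3, g4), and (g4, g5) fall into the TP, TN, FP, and FN categories respectively, as in the slide's examples; the threshold value is also an assumption.

```python
from itertools import combinations


def pair_counts(predicted, actual):
    """Classify every pair of graphs as TP, TN, FP, or FN."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, b in combinations(sorted(predicted), 2):
        same_predicted = predicted[a] == predicted[b]
        same_actual = actual[a] == actual[b]
        if same_predicted and same_actual:
            counts["TP"] += 1
        elif not same_predicted and not same_actual:
            counts["TN"] += 1
        elif same_predicted:
            counts["FP"] += 1  # clustered together, but should be apart
        else:
            counts["FN"] += 1  # clustered apart, but should be together
    return counts


def failure_rate(counts):
    """FR = (FP + FN) / (TP + TN + FP + FN)."""
    return (counts["FP"] + counts["FN"]) / sum(counts.values())


# Hypothetical cluster labels for five graphs.
predicted = {"g1": "A", "g2": "A", "g3": "B", "g4": "B", "g5": "C"}
actual = {"g1": "X", "g2": "X", "g3": "Y", "g4": "Z", "g5": "Z"}

counts = pair_counts(predicted, actual)
fr = failure_rate(counts)
t = 0.25  # hypothetical error threshold
print(counts, fr, fr < t)
```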

  19. Adjust the Metric • Weight Adjustment Heuristic: for each $D_i$ • New $w_i = w_i - sf_i \left( \frac{DFN_i}{DFN} + \frac{DFP_i}{DFP} \right)$ [KDD’s MDM-05]
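
The update rule on this slide can be applied mechanically as sketched below. The slide does not define its symbols, so the reading here is an assumption: DFN_i and DFP_i are taken to be the contribution of component i to the distances of the false-negative and false-positive pairs, DFN and DFP the corresponding totals, and sf_i a per-component scaling factor. All numbers are invented; the cited MDM-05 paper has the authoritative definitions.

```python
def adjust_weights(weights, scaling, dfn_parts, dfn_total, dfp_parts, dfp_total):
    """Apply w_i <- w_i - sf_i * (DFN_i / DFN + DFP_i / DFP) to every weight.

    Assumption: a share is treated as 0 when its total is 0, to avoid
    division by zero; the slide does not cover this case.
    """
    new_weights = []
    for w, sf, dfn_i, dfp_i in zip(weights, scaling, dfn_parts, dfp_parts):
        fn_share = dfn_i / dfn_total if dfn_total else 0.0
        fp_share = dfp_i / dfp_total if dfp_total else 0.0
        new_weights.append(w - sf * (fn_share + fp_share))
    return new_weights


# Hypothetical values for two distance components.
new_w = adjust_weights(
    weights=[1.0, 1.0],
    scaling=[0.1, 0.1],
    dfn_parts=[3.0, 1.0],
    dfn_total=4.0,
    dfp_parts=[1.0, 1.0],
    dfp_total=2.0,
)
print(new_w)
```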

  20. Testing of LearnMet • Details: MTAP-06 • Effect of pairs per epoch (ppe) • G = number of graphs, e.g., G = 25 • $\binom{G}{2}$ = total number of pairs, e.g., $\binom{25}{2} = 300$ • Select a subset of the $\binom{G}{2}$ pairs per epoch • Observations • Highest accuracy with middle range of ppe • Learning efficiency best with low ppe (Figure panels: Accuracy of Learned Metrics over Test Set; Learning Efficiency over Training Set)
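
The pair count and the per-epoch subset selection can be checked with a few lines. The ppe value of 50 and the fixed random seed are arbitrary choices for illustration, not settings reported in the presentation.

```python
import math
import random
from itertools import combinations

G = 25  # number of graphs, as in the slide's example
total_pairs = math.comb(G, 2)  # "G choose 2"

# Enumerate all unordered pairs of graph indices.
all_pairs = list(combinations(range(G), 2))


def sample_epoch(pairs, ppe, rng):
    """Pick ppe distinct pairs for one training epoch."""
    return rng.sample(pairs, ppe)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
epoch_pairs = sample_epoch(all_pairs, 50, rng)  # ppe = 50 is hypothetical
print(total_pairs, len(epoch_pairs))
```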

  21. User Surveys of the AutoDomainMine System • Formal user surveys in different applications • Evaluation Process • Compare estimation with real data in test set • If they match, the estimation is accurate • Observations • Estimation accuracy around 90 to 95% (Figure panels: Accuracy: Estimating Conditions; Accuracy: Estimating Graphs)

  22. Related Work • Similarity Search [HK-01, WF-00] • Non-matching conditions could be significant • Mathematical Modeling [M-95, S-60] • Existing models not applicable under certain situations • Case-based Reasoning [K-93, AP-03] • Adaptation of cases not feasible with graphs • Learning nearest neighbor in high-dimensional spaces: [HAK-00] • Focus is dimensionality reduction, do not deal with graphs • Distance metric learning given basic formula: [XNJR-03] • Deal with position-based distances for points, no graphs involved • Similarity search in multimedia databases [KB-04] • Use various metrics in different applications, do not learn a single metric • Image Rating: [HH-01] • User intervention involved in manual rating • Semantic Fish Eye Views: [JP-04] • Display multiple objects in small space, no representatives • PDA Displays in Levels of Detail: [BGMP-01] • Do not evaluate different types of representatives

  23. Data Warehousing • Data Warehouse • Subject-oriented, integrated repository of relevant data from various information sources (Figure: DW view defined through a mediator over information sources IS1, IS2, IS3 with relations R11, R12, R21, R22, R23, R31)

  24. Research Problem in Data Warehousing • View Maintenance (VM) • Keeping warehouse view consistent with respect to change in sources • Incremental VM • Update warehouse as the source data changes • Propagate only the updates, not all data • Concurrency Conflicts • Two or more sources / relations try to send updates at the same time • Problem • Solve concurrency conflicts in view maintenance in multi-source multi-relation environments
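
The difference between incremental view maintenance and full re-computation can be sketched with a view defined as a join of two relations. This is a single-process toy under simplifying assumptions (only insertions, relations held as in-memory sets); it does not model the concurrency conflicts that are the actual research problem, and the relation contents are invented.

```python
def recompute_view(r1, r2):
    """Full re-computation: join every tuple of r1 with every tuple of r2."""
    return {(k1, a, b) for (k1, a) in r1 for (k2, b) in r2 if k1 == k2}


def apply_insert_r1(view, r2, new_tuple):
    """Incremental maintenance: join only the inserted tuple with r2."""
    key, a = new_tuple
    delta = {(key, a, b) for (k2, b) in r2 if k2 == key}
    view |= delta  # propagate only the update, not all data
    return delta


# Hypothetical source relations, each a set of (key, value) tuples.
r1 = {(1, "x"), (2, "y")}
r2 = {(1, "p"), (3, "q")}
view = recompute_view(r1, r2)

# A source reports an insertion into r1.
new_tuple = (3, "z")
r1.add(new_tuple)
delta = apply_insert_r1(view, r2, new_tuple)
print(delta, view == recompute_view(r1, r2))
```

The incremental path touches only the tuples matching the inserted key, while re-computation scans both relations in full.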

  25. Proposed Solution: MEDWRAP (MEDiator WRAPper compensation) (Figure: the data warehouse view V sits above a mediator running a multi-source VM algorithm; below the mediator, each information source IS1, IS2, IS3 has a wrapper running a single-source VM algorithm over that source's relations, labeled R11, R21, R22, R23, R31, R32)

  26. Advantages of MEDWRAP • Generic for any compensation based algorithms • Allows sources to be semi-autonomous • Sources do not participate in maintenance beyond processing queries and reporting updates • No locking needed • Low Storage Cost • Additional views not stored at wrappers • Copies of source relations not stored at warehouse • Efficient Processing Time • No need to re-compute whole view • Details in DEXA-2002 paper

  27. Related Work • RV: Re-computation of View (Traditional) • Rewrite all tuples, not only affected ones • Highly inefficient if done for every update • SM: Self Maintenance [Q-96, G-96] • DW stores copies of source relations for maintenance • Huge storage at warehouse • Version Control: [K-99, C-00] • Versions of transactions / tuples stored at wrappers • Latest version used to answer queries • Huge storage at source wrappers

  28. Web Databases • Management of Data on the Web • XML, the eXtensible Markup Language • Widespread standard in storing and publishing data • Domain-specific markup languages designed with XML tag sets • Standardization bodies extend these to include additional semantics • Aspects such as domain knowledge and XML constraints are important

  29. Domain-specific Markup Language • Medium of communication for potential users of the domain • Follows XML syntax • Encompasses the semantics of the domain • Examples • MML: Medical Markup Language • ChemML: Chemical Markup Language (Figure labels: Industries, Markup Language, Publishers, Consumers, Research Organizations, Universities)

  30. Markup Language Development Steps • 1. Acquisition of Domain Knowledge - Familiarity with related markups • 2. Data Modeling - E.g., Entity Relationship models • 3. Requirements Specification - E.g., Interviews with Domain Experts • 4. Ontology Creation - Analogous to pilot version of software • 5. Revision of Ontology - Alpha version • 6. Schema Definition - Beta version • 7. Reiteration of Schema until Standardization - Release version (Figure: Snapshot of Final Schema with data storage)

  31. Desired Features of Markup Languages • Avoidance of Redundancy • No duplicate information • Non-Ambiguous Presentation of Data • Issues such as synonymy & polysemy • Easy Interpretability of Data • E.g. in scientific domains, store experimental input conditions before results • Incorporation of Domain-Specific Requirements • E.g. conflicts such as: in financial domains, a person can be either insolvent or asset-holder but not both • Extensibility of the Markup • Users should be able to capture additional semantics

  32. Application of XML Constraints • Sequence Constraint • To control the order of tags • Choice Constraint • To use either one tag or the other • Key Constraint • To identify an attribute as a unique primary key • Occurrence Constraint • To declare minimum and maximum occurrences
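
In practice these four constraints are declared in an XML Schema (xs:sequence, xs:choice, xs:key, and minOccurs/maxOccurs). The sketch below only mimics their effect with hand-written checks using Python's standard library, to show what each constraint rules out. The element names (experiment, conditions, graph, table), the id attribute, and the 1-to-3 occurrence limit are all invented for the example.

```python
import xml.etree.ElementTree as ET


def check_experiment(exp):
    """Return the names of the constraints this <experiment> violates."""
    tags = [child.tag for child in exp]
    errors = []
    # Sequence: <conditions> must come before the results.
    if not tags or tags[0] != "conditions":
        errors.append("sequence")
    # Choice: results are either <graph> or <table>, but not both.
    if ("graph" in tags) == ("table" in tags):
        errors.append("choice")
    # Occurrence: between 1 and 3 result elements.
    if not 1 <= tags.count("graph") + tags.count("table") <= 3:
        errors.append("occurrence")
    return errors


def check_document(root):
    """Check every experiment, including the key constraint on @id."""
    report, seen = [], set()
    for exp in root.findall("experiment"):
        errors = check_experiment(exp)
        key = exp.get("id")
        if key is None or key in seen:
            errors.append("key")  # id must be present and unique
        seen.add(key)
        if errors:
            report.append((key, errors))
    return report


valid = ET.fromstring(
    "<experiments>"
    "<experiment id='e1'><conditions>c1</conditions><graph>g1</graph></experiment>"
    "<experiment id='e2'><conditions>c2</conditions><table>t2</table></experiment>"
    "</experiments>"
)
invalid = ET.fromstring(
    "<experiments>"
    "<experiment id='e1'><conditions>c1</conditions><graph>g1</graph></experiment>"
    "<experiment id='e1'><conditions>c2</conditions><graph>g2</graph><table>t2</table></experiment>"
    "</experiments>"
)
print(check_document(valid))
print(check_document(invalid))
```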

  33. Convenient Access to Information • Data stored using XML-based markup languages can be easily accessed using languages such as • XQuery: XML Query Language • XSLT: Extensible Stylesheet Language Transformations • XPath: XML Path Language • Details on markup language development • Chapter on “XML Based Markup Languages for Specific Domains” by Varde et al. in book “XML Based Support Systems”, Springer 2008
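
As an illustration of path-based access, the sketch below runs two XPath-style queries with Python's standard xml.etree.ElementTree module, which supports a limited subset of XPath. The document structure and element names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical data in a small domain-specific markup.
DOC = (
    "<experiments>"
    "<experiment id='e1'><conditions>c1</conditions><graph>g1</graph></experiment>"
    "<experiment id='e2'><conditions>c2</conditions><graph>g2</graph></experiment>"
    "</experiments>"
)

root = ET.fromstring(DOC)

# All input conditions, in document order.
all_conditions = [e.text for e in root.findall("./experiment/conditions")]

# The graph recorded for the experiment whose id attribute is 'e2'.
e2_graph = root.find("./experiment[@id='e2']/graph").text

print(all_conditions, e2_graph)
```

Full XPath, XQuery, and XSLT processors offer much richer expressions; the standard library subset is enough to show the idea of navigating by path and filtering by attribute.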

  34. Related Work • Semantic Extensions of XML for Advanced Applications [YKB-2001] • Versions and Standards of HTML [B-95] • The Latest MML (Medical Markup Language) Version 2.3 - XML based Standard for Medical Data Exchange/ Storage [GATSSTSNY-2003] • XQuery 1.0: An XML Query Language [BFFRS-2003] • Handbook of Modern Finance [SL-2004] • Propagating XML Constraints to Relations [DFHQ-2003]

  35. Conclusions and Ongoing Work • Data Mining • Graphical Data Mining Area, AutoDomainMine approach • Ongoing Work • Feature Selection in Image Mining (with colleagues in VSU and WPI: NSF Grants involved) • Mining Genomic and Proteomic Data (with ISB: Institute for Systems Biology) • Data Warehousing • View Maintenance Area, MEDWRAP approach • Ongoing Work • Data Warehouse Maintenance in real-time environments (with researchers at Microsoft Search Labs) • Web Databases • Book Chapter on XML Based Markup Languages for Specific Domains • Ongoing Work • Development of domain-specific markups (with NIST: National Institute of Standards and Technology)
