Self-Organised Data Mining – 20 Years after GUHA-80

Martin Kejkula, KEG, 8th April 2004. http://gama.vse.cz/keg/

Presentation Transcript


  1. Self-Organised Data Mining – 20 Years after GUHA-80 • Martin Kejkula • KEG • 8th April 2004 • http://gama.vse.cz/keg/

  2. Agenda • Idea of Self-Organised Data Mining • GUHA-80 revival • Process of Self-Organised Data Mining • Key factors for Self-Organised Data Mining • Metabase, Knowledge Base, etc. • Proposed EverMiner system for Self-Organised Data Mining

  3. Introduction • Motivation: support X-Miner users • Collection of best practices and known problems • Mueller, Lemke: Self-Organising Data Mining (2000) • My thesis: • Design and test strings of jobs for EverMiner • Formalize and use heuristics

  4. References (1) • Hájek, P. – Havránek, T.: GUHA 80: An Application of Artificial Intelligence to Data Analysis. Computers and Artificial Intelligence, Vol. 1, 1982, pp. 107-134 • Hájek, P. – Ivánek, J.: Artificial Intelligence and Data Analysis. Proc. COMPSTAT’82, Wien, Physica Verlag 1982, pp. 54-60

  5. References (2) • Hájek, P. – Havránek, T.: GUHA-80 – An Application of Artificial Intelligence to Data Analysis. Matematické středisko biologických ústavů ČSAV, Praha, 1982 • Jirků, P. – Havránek, T.: On Verbosity Levels in Cognitive Problem Solvers. Proc. Computational Linguistics, 1982, http://acl.eldoc.ub.rug.nl/mirror/C/C82/

  6. References (3) • Rauch, J.: EverMiner – studie projektu. Dokumentace projektu LISp-Miner, 2003. • Mueller, J.-A. – Lemke, F.: Self-Organising Data Mining. Extracting Knowledge from Data. Dresden, Berlin, 2000.

  7. GUHA-80: Main Features • Application of artificial intelligence to exploratory data analysis • Generates interesting views of given empirical data (recognizes interesting logical patterns) • Views should be relevant and useful

  8. GUHA-80 Sources (1) • GUHA • Automatically generate all interesting hypotheses • Lenat’s AM • Jobs (tasks) • Agenda of jobs • Hundreds of heuristic rules • Concepts

  9. GUHA-80 Sources (2) • GUHA-80 vs. Lenat’s AM • Data • Data-processing procedures • Statistical program packages • Effective modules

  10. GUHA-80 Paradigm • Open-ended data analysis • Maximize the interestingness value • Hundreds of heuristic rules • Guide to defining and studying the next step • Access potentially relevant rules, find truly relevant rules, follow truly relevant rules

  11. Interestingness in GUHA-80 • No explicit definition • Determined by interplay of • Heuristic rules • Weighting mechanisms • Testing in practice (adequate behaviour?) • No algorithm, but constraints

  12. Principles of GUHA-80 • Domain dependence (…exploratory data analysis) • Combine human capabilities with those of the machine • More heuristics are relevant • Interactivity with user • Non-routine (GUHA-80 is not for everyday data processing)

  13. GUHA-80 Structure (1)

  14. GUHA-80 Structure (2) • Input empirical data • Input parameters • How “interestingness” is understood • Effective modules (system’s knowledge) • Clustering procedures • GUHA procedures • Agenda of jobs (priority/weight)
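
The "agenda of jobs (priority/weight)" idea can be illustrated with a small priority queue. This is only a sketch under assumptions, not the GUHA-80 design: the class names `Job` and `Agenda`, the job names, and the numeric weights are all hypothetical. It shows the one property the slide names, namely that the job with the highest weight runs next and that a running job may put new jobs on the agenda.

```python
import heapq
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass(order=True)
class Job:
    # heapq is a min-heap, so the priority is stored negated.
    neg_priority: float
    seq: int  # tie-breaker: earlier jobs first among equal priorities
    name: str = field(compare=False)
    run: Callable[["Agenda"], None] = field(compare=False)


class Agenda:
    """Hypothetical agenda: always runs the highest-priority job next."""

    def __init__(self) -> None:
        self._heap: List[Job] = []
        self._seq = 0
        self.log: List[str] = []  # names of jobs in execution order

    def add(self, name: str, priority: float,
            run: Callable[["Agenda"], None]) -> None:
        heapq.heappush(self._heap, Job(-priority, self._seq, name, run))
        self._seq += 1

    def run_all(self) -> None:
        while self._heap:
            job = heapq.heappop(self._heap)
            self.log.append(job.name)
            job.run(self)  # a job may add follow-up jobs to the agenda


def clustering(agenda: Agenda) -> None:
    # Assumed follow-up: clustering results make a GUHA run worthwhile.
    agenda.add("guha_on_clusters", 5.0, lambda a: None)


agenda = Agenda()
agenda.add("basic_statistics", 1.0, lambda a: None)
agenda.add("clustering", 3.0, clustering)
agenda.run_all()
print(agenda.log)
```

The follow-up job created by `clustering` overtakes `basic_statistics` because it carries a higher weight, which is the kind of re-ordering an agenda allows and a fixed job list does not.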

  15. GUHA-80 Structure (3) • Heuristics: optimal way to realize a job • Changing system of concepts • Hierarchy of concepts (applicability) • Possible unification of heuristics, jobs,…

  16. GUHA-80 Input • Data • Input information • Decompositions/orderings of sets of quantities • Help understand “interestingness”

  17. GUHA-80 Effective modules • Evaluation of usual statistical characteristics,… • Complicated procedures • Synthesis of parameters (“job on job”)

  18. GUHA-80 • Hundreds of heuristic rules • No explicit definition of interestingness (exploration in a space) • Interactivity with the user • Non-routine character

  19. Process of S-O Data Mining • Input: Empirical Data, Domain Knowledge,… • Chains of Data & Knowledge Processing Tasks (DataSource, TimeTransf, SumatraTT, 4ft, KL, CF, …) • Output: All Interesting Views, Patterns

  20. Process of S-O Data Mining

  21. Key Factors of S-O Data Mining • Data Preparation • Modeling • Evaluation • Knowledge Base • Domain Knowledge

  22. Data Preparation • Discretization • Attribute Type dependent: • Nominal/Ordinal/Interval/Ratio • Type of coefficient dependent • Discretization-Modeling Cycle (KL, 4ft, CF,…) • Known problem with intervals of categories without values • Usually not one target attribute

  23. Attribute-type-dependent discretization • Nominal • Classes of values • Ordinal • Extreme/missing values • Type of coefficient • Usually not one target attribute

  24. Intervals of Categories without Values

  25. Intervals of Categories without Values • Solution: • Statistics – extreme values • 4ft Task: correlations, implications • Potentially interesting patterns
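
A small sketch of how categories without values can arise and be detected. It assumes equal-width discretization, which the slides do not prescribe; the function names and the sample data are hypothetical. A single extreme value stretches the range, so most intervals end up empty, which is why the slide points to statistics on extreme values.

```python
from typing import List, Tuple

Interval = Tuple[float, float]


def equal_width_bins(values: List[float],
                     n_bins: int) -> List[Tuple[Interval, int]]:
    """Split [min, max] into n_bins equal intervals and count values in each."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; no intervals can be built")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum falls on the upper edge, so clamp it into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [(lo + i * width, lo + (i + 1) * width) for i in range(n_bins)]
    return list(zip(edges, counts))


def empty_intervals(bins: List[Tuple[Interval, int]]) -> List[Interval]:
    """Return the intervals (categories) that contain no values."""
    return [edge for edge, count in bins if count == 0]


# Hypothetical column: six ordinary ages and one extreme value.
ages = [21, 22, 23, 24, 25, 26, 95]
bins = equal_width_bins(ages, 5)
for (left, right), count in bins:
    print(f"[{left:.1f}, {right:.1f}): {count}")
print("empty categories:", len(empty_intervals(bins)))
```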

  26. Extreme/Missing Values • 4ft: Find associations between extreme/missing values (implications/correlations) • CF, KL: Find patterns with extreme/missing values

  27. Data Preparation • Classes of attributes • Partial cedents • Associations between attributes in one class • Associations between partial cedents

  28. Evaluation-Modeling • Input information for partial cedents • Mining for Interesting Patterns • Exceptions • Missing values • Extreme values • Discovered hypotheses • Groups of hypotheses • Coverage of input data by hypotheses

  29. Heuristic Rules (1) • Examples: • IF more extreme/missing values found THEN search for associations with extreme/missing values • IF 0 hypotheses found THEN set weaker quantifier (p, Base) values • IF subset of input data not covered by hypotheses THEN search for associations covering these data
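
The three example rules can be written as condition/action pairs. This is a minimal sketch under stated assumptions, not EverMiner's rule format: `MiningState`, `Rule`, the job names, the threshold `EXTREME_THRESHOLD`, and the amounts by which `p` and `Base` are relaxed are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical cut-off for "more extreme/missing values found".
EXTREME_THRESHOLD = 10


@dataclass
class MiningState:
    n_hypotheses: int          # hypotheses found by the last task
    n_extreme_or_missing: int  # extreme/missing values seen in the data
    n_uncovered_rows: int      # rows not covered by any hypothesis
    p: float                   # quantifier parameter p
    base: int                  # quantifier parameter Base


@dataclass
class Rule:
    name: str
    condition: Callable[[MiningState], bool]
    action: Callable[[MiningState], str]  # returns the suggested next job


def relax_quantifier(state: MiningState) -> str:
    # Assumed relaxation: lower p by 0.05 (not below 0.5) and halve Base.
    state.p = round(max(0.5, state.p - 0.05), 2)
    state.base = max(1, state.base // 2)
    return "rerun_4ft_with_weaker_quantifier"


RULES: List[Rule] = [
    Rule("extreme_or_missing",
         lambda s: s.n_extreme_or_missing >= EXTREME_THRESHOLD,
         lambda s: "search_associations_with_extreme_or_missing_values"),
    Rule("no_hypotheses",
         lambda s: s.n_hypotheses == 0,
         relax_quantifier),
    Rule("uncovered_data",
         lambda s: s.n_uncovered_rows > 0,
         lambda s: "search_associations_covering_uncovered_rows"),
]


def suggest_jobs(state: MiningState) -> List[str]:
    """Fire every rule whose condition holds and collect the suggested jobs."""
    return [rule.action(state) for rule in RULES if rule.condition(state)]


state = MiningState(n_hypotheses=0, n_extreme_or_missing=3,
                    n_uncovered_rows=120, p=0.9, base=50)
jobs = suggest_jobs(state)
print(jobs, state.p, state.base)
```

Keeping the rules as data rather than hard-coded branches makes it possible to collect, weight, and test them separately.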

  30. Heuristic Rules (2) • Examples: • IF nominal type of column (input data matrix) AND no associated table for discretization THEN each value is one category (attribute creation) • Use “subset” coefficient type for nominal attributes
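
The first rule on this slide can be sketched as a function that builds categories for a nominal column. The function name, the shape of the discretization table (a plain value-to-label mapping), and the sample data are assumptions made for illustration only.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Optional, Set


def nominal_categories(
    column: Iterable[Hashable],
    table: Optional[Dict[Hashable, Hashable]] = None,
) -> Dict[Hashable, Set[Hashable]]:
    """Group the values of a nominal column into categories.

    With a discretization table, values are grouped under the table's
    labels. Without one, each distinct value becomes its own category.
    """
    groups: Dict[Hashable, Set[Hashable]] = defaultdict(set)
    for value in column:
        label = table[value] if table is not None else value
        groups[label].add(value)
    return dict(groups)


transport = ["car", "bus", "car", "tram"]

# No associated table: one category per value.
print(nominal_categories(transport))

# Hypothetical table: values merged into coarser categories.
print(nominal_categories(transport,
                         {"car": "road", "bus": "road", "tram": "rail"}))
```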

  31. Metabase, Knowledge Base • Metadata (Knowledge): • Results of Previous X-Miner Tasks • Domain Knowledge • Interaction with User (learning?)

  32. GUHA-80 vs. X-Miner (1) • Task parameters (partial cedents, …) • SW, HW • Experience with LM applications,…

  33. GUHA-80 vs. X-Miner (2) • More complex heuristics

  34. EverMiner – Features • Based on LISp-Miner (X-Miners) • Agenda of jobs, priority/strings • Heuristics • Interaction with user • Allows the process to be repeated on new data (“check” vs. new KDD process)

  35. EverMiner – where we are • Experience (medicine, traffic, shares, sociology,…) • Heuristics collection (www, brainstorming) • Co-operation with data preparation experts (FEL, SumatraTT) • Testing “strings of jobs” (learning)

  36. Discussion
