

  1. Information Mining. Prof. Dr.-Ing. Raimar J. Scherer, Institute of Construction Informatics, Dresden, 04.05.2005

  2. Quality of the Data [slide diagram: data on interval and ratio scales; try to transfer between the scale types]

  3. Quality of Attributes Quality of attributes = semantic importance (weight); usually not given. Implicitly assumed: each attribute is equally important (i.e. weighting factor = 1.0). Better: explicit transfer into numeric weights. Example: project aim (cost, duration, reputation). Implicit: project aim = 1.0 x cost + 1.0 x duration + 1.0 x reputation. Explicit: e.g. project aim = 2.0 x cost + 1.0 x duration + 1.5 x reputation
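
As a small illustration of the explicit weighting, here is a sketch in Python; the numeric scores are invented for the example and assumed to be normalised:

```python
# Explicit attribute weighting (hypothetical scores, assumed normalised to [0, 1]).
weights = {"cost": 2.0, "duration": 1.0, "reputation": 1.5}   # explicit semantic importance
scores  = {"cost": 0.7, "duration": 0.4, "reputation": 0.9}   # example project scores (assumed)

# Implicit case: every attribute weighted 1.0
implicit_aim = sum(scores.values())

# Explicit case: each attribute scaled by its semantic weight
explicit_aim = sum(weights[a] * scores[a] for a in scores)

print(implicit_aim, explicit_aim)
```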

  4. Data Mining = procedure of machine learning methods • Identification of patterns (principles) • Deduction of structures (rules, models) • Forecasting of behaviour (application of the model)

  5. Data Mining = procedure of machine learning methods • Example of a pattern • Example of a structure: a generalised theory by which the observations are explained • Using the theory, the information and the data to simulate and forecast not-observed scenarios

  6. Terminology • Data = recorded facts • Information = set of patterns or expectations • Knowledge = accumulation of sets of expectations • Wisdom = usefulness, related to the knowledge

  7. Plato's Cave Analogy He can only observe shadows and has to interpret what the original "thing" / "meaning" is. The problem: we can never see (record) the whole reality, but only an incomplete mapping. [slide figure: dancing people and the shadow of the dancing people]

  8. Data Structure for Formalization of Information and Knowledge 1 Object = a thing with a certain meaning and a certain appearance, given by its name, its attributes and the data (values) of the attributes. A thing can be • a real object, e.g. a window • a behaviour, e.g. opened/closed, transparent/clear, aging • a behaviour due to the interaction of several things, e.g. a window opening and closing due to the wind, or a window aging due to rain, wind, sun and (good/bad) operation by humans

  9. What can we observe? • Object: geometric form, colour, material, position • Relationship: location (in the wall), topology (to the ground) • Behaviour: stress distribution, deflection, vibration, aging, and so on ... Each is described by one or more attributes. Each attribute is expressed by a datum (value) from a set of data (values). Some or all attributes can themselves be modelled as (sub-)objects.

  10. Closed World This means: • We already know what a window is and we evaluate the observed data with respect to windows • We already know the (possible) sets of attributes • We already know the (possible) classes constituted by the values of the attributes. If we know that we are describing / observing windows, we can evaluate the attributes of the schema (concept) window and determine which kind of window the particular one is, i.e. we classify the particular window into one of the several classes represented by the values of the attributes. Hence we have a closed (pre-determined) world and therefore we can do straightforward classification.

  11. Open World If we do not know what we observe (e.g. image analysis) but we have recorded a lot of data (taken a lot of photos, where each photo consists of many pixels), we can nevertheless identify windows - but also doors, gates, etc. instead of windows (!) - when we extend our procedure by two steps, namely: • Analyse the sets of data to find similarities / dissimilarities between the sets, by partitioning each set of data into subsets and comparing the subsets. This is called identification / analysis of patterns. • Generalise the patterns and find an objective structure (theory) which explains the patterns, i.e. synthesize the result of the patterns. This is called building a concept. A concept can be the schema of an object with its attributes and with the value range of each attribute (in an ideal way). A concept is a schema of an object and hence a class structure. • Classify further observations (as explained in the beginning) in order to • identify the particular object, if the "thing" in question is an object • forecast the object behaviour, if the "thing" in question is a behaviour • identify the relationship between the objects, if there is more than one

  12. Hierarchy of Methods (top to bottom): Knowledge Management • Information Mining • Data Mining • Machine Learning • Data Analysis • Signal Processing • Statistics • Data Collection • Sensors (sensor systems) • Design of observation

  13. Data Collection => Fact Table • Fact table (or records). Example: the relation (behaviour) weather-play. [slide table: the 14-record weather-play fact table]
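
The fact table itself is only shown as an image on the slide; the following sketch reproduces the standard 14-record weather-play data set from the data-mining literature, which is consistent with all counts used on the later slides (the 2+2+0 = 4 wrong decisions, the coverage and accuracy examples, etc.):

```python
# The 14-instance weather-play fact table (reconstructed; standard example data set).
# Attributes: outlook, temperature, humidity, windy -> class attribute: play
weather = [
    ("sunny",    "hot",  "high",   False, "no"),
    ("sunny",    "hot",  "high",   True,  "no"),
    ("overcast", "hot",  "high",   False, "yes"),
    ("rainy",    "mild", "high",   False, "yes"),
    ("rainy",    "cool", "normal", False, "yes"),
    ("rainy",    "cool", "normal", True,  "no"),
    ("overcast", "cool", "normal", True,  "yes"),
    ("sunny",    "mild", "high",   False, "no"),
    ("sunny",    "cool", "normal", False, "yes"),
    ("rainy",    "mild", "normal", False, "yes"),
    ("sunny",    "mild", "normal", True,  "yes"),
    ("overcast", "mild", "high",   True,  "yes"),
    ("overcast", "hot",  "normal", False, "yes"),
    ("rainy",    "mild", "high",   True,  "no"),
]
```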

  14. Knowledge Representation Knowledge is usually represented by rules. A rule has the form • Premise (if) • Conclusion (then). The 4 main forms to represent (the rules which contain) the knowledge are: • Decision Tables • Decision Trees • Classification Rules • Association Rules

  15. Knowledge Representation - Decision Tables • Decision tables (look-up tables) look like a fact table. The only differences are that: - each row is interpreted as one rule - the attribute conditions within a row are combined with AND

  16. Decision Tables In decision tables all possible combinations of values of all attributes have to be explicitly represented (ideal case): N = n_a1 x n_a2 x ... x n_am, where m is the number of attributes and n_ai is the number of values of attribute ai. For the given example of the relation "weather-play", which has m = 4 attributes (outlook, temperature, humidity, windy) plus the class attribute play, there exist 3 x 3 x 2 x 2 = 36 combinations. For a new set of attribute values we only have to look it up in the table, i.e. find the row which shows a 100% match with the given set, and we can read off the result, namely play = "yes/no". This is the ideal case.
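
A minimal look-up sketch on the weather list from above (the function name and structure are illustrative):

```python
# Decision table look-up: find the row that matches the new attribute values 100%
# and read off the decision. Returns None if the combination is not in the table
# (a "gap" in the decision table).
def look_up(table, outlook, temperature, humidity, windy):
    for o, t, h, w, play in table:
        if (o, t, h, w) == (outlook, temperature, humidity, windy):
            return play
    return None

print(look_up(weather, "sunny", "hot", "high", False))   # -> "no"
print(look_up(weather, "sunny", "hot", "normal", True))  # -> None (not observed)
```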

  17. Objectives of Decision Making Usually we do not know all combinations. For real problems there can be several thousand! Therefore we reduce the possible number of combinations to the most important ones. Doing this by only deleting rows in the decision table, we end up with information / knowledge gaps, and we would have partitioned our world into a decidable part and an undecidable part. The latter would be called "stupid". This is not what we want to have. In addition, we are usually never able to observe all possible cases, and hence we would have natural gaps. Our objective is always to end up with a decision, whether correct or false, but never with abstention (if not explicitly allowed). Of course, we want to avoid or at least minimize false decisions.

  18. Generalisation Therefore we have to generalize the remaining rows in such a way that they cover all the decisions of the deleted and unknown (not observed) rows with as few wrong decisions as possible. If we made the generalisation without allowing any wrong decision for the observed cases, we would have an overdetermined problem, which may also contain attributes or attribute combinations which are dependent, i.e. there are identical rules in the rule base. However • it is hard to find all or enough dependent combinations • to find the dependent combinations we would first have to set up the full decision table • usually we want to reduce the ideal decision table much more than only by the dependent combinations • usually we can never observe all possible cases, i.e. we always have natural gaps

  19. Shortcomings of Generalization Therefore we have to merge several rows into one row, which is possible. The simplest way is to neglect (the values of) one or more attributes. This is the simplest form of generalization (remark: it is the only form of generalization in relational databases). Say we keep only outlook; the decision table reduces to: outlook = sunny -> play = no; outlook = rainy -> play = yes; outlook = overcast -> play = yes. As a consequence, we make some wrong decisions. But we fulfil the first and main objective, namely we are always able to make a decision. For our example, this leads for the 14 given combinations (i.e. our known world) to 2+2+0 = 4 wrong decisions.

  20. Shortcomings of Generalization In the given example we reduced the 36 possible combinations (rows), each expressed by a rule like: if outlook = sunny and temperature = hot and humidity = high and windy = false then play = no, to 3 simple rules: if outlook = sunny then play = no; if outlook = rainy then play = yes; if outlook = overcast then play = yes
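
Continuing from the weather list above, a short sketch that applies the three reduced rules to the 14 observed cases and counts the wrong decisions:

```python
# The three generalised rules: decide play only from the outlook attribute.
outlook_rule = {"sunny": "no", "rainy": "yes", "overcast": "yes"}

# Apply the rules to the 14 observed cases and count wrong decisions.
wrong = sum(1 for o, t, h, w, play in weather if outlook_rule[o] != play)
print(wrong)  # -> 4, i.e. the 2 + 2 + 0 wrong decisions mentioned on the slide
```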

  21. Range of wrong Decisions We know that for 4 out of the 36 possible cases we would make a wrong decision, i.e. for roughly 10%. However, we do not know how many further wrong decisions we will make, namely 0 or up to 22 wrong decisions or something in between, because we only know that we have an observation gap of 22 cases. This statement is based on the assumption that we have described our problem (UoD) completely by 4 attributes. However, if we take into consideration that the UoD may be biased, say it were governed by 5 attributes, i.e. 1 additional attribute we do not know, then we would have an unknown range of 36 x (number of values the unknown attribute can take) cases. Remark: a hint for an unknown attribute is given if there are two rows in the decision table with identical values but two different decisions (play = yes / play = no).

  22. Liability of Knowledge We can now apply naive statistics in order to estimate the number of further wrong decisions. Namely, when we assume that our known world, represented by 14 rules, • is a representative part of the whole world, i.e. the sample is representative for the Universe of Discourse (UoD) • the rules are unbiased, i.e. all known rules are error free • all attributes are known, i.e. the UoD is unbiased, then we can estimate that about 10 out of the 36 decisions would be wrong (4 wrong out of 14 observed cases gives a mean error rate of 4/14, and 4/14 of 36 is about 10). What we did can be explained by statistical theory: we evaluated the mean rate of wrong decisions in our known world, assumed that this mean value is the true value of the total world (UoD), and forecast the number of wrong decisions using that mean value. Note: we do not consider any uncertainty here.
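
The same estimate as a one-line calculation:

```python
# Naive estimate: assume the error rate observed on the 14 known cases
# also holds for all 36 possible attribute combinations.
error_rate = 4 / 14           # wrong decisions in the known world
expected_wrong = error_rate * 36
print(round(expected_wrong))  # -> 10
```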

  23. Decision Trees We have seen from the decision tables that each value set of the attributes, i.e. each row in the table, can be expressed as one rule with a simple semantics, namely: if {all attributes show a certain value} then {classify b}, i.e. if { a1 = vj1 and a2 = vj2 and ... and am = vjm } then bl = vk. This means we have used a sequential system for our rule system. However, it is well known that parallel systems are also possible. There, the status of only one attribute is evaluated, i.e. checked against all possible values, before in a separate step the next attribute is considered. Applying this to a rule system we arrive at nested rules. The graphical representation of nested rules is a tree structure, and we call this new representation a decision tree.

  24. General Structure of Decision Trees In general terms a decision tree can be expressed as: if {state a1} = {a1 = v1} then if {state a2} ... ; ... ; if {state a1} = {a1 = vm} then if {state a2} ... ; end if. And in each branch vj this has to be repeated for the next attribute ai, and so on for all i = 1, N.
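
A minimal sketch of such nested rules in code; the particular tree over outlook, humidity and windy is illustrative and not taken from the slides:

```python
# Nested rules = decision tree: one attribute is evaluated per level,
# and each branch continues with the next attribute.
def classify(outlook, humidity, windy):
    if outlook == "sunny":
        # the next attribute is considered only within this branch
        return "no" if humidity == "high" else "yes"
    elif outlook == "overcast":
        return "yes"
    else:  # outlook == "rainy"
        return "no" if windy else "yes"

print(classify("sunny", "normal", False))  # -> "yes"
```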

  25. Ranking in a Decision Tree If we know all combinations of the UoD and we want to express them all (which is our ideal goal in order to avoid wrong decisions), like we did for the decision table, then the ranking of attributes, i.e. what is evaluated first, second, ..., does not matter at all. We can apply the general formula straightforwardly, using any arbitrary order. For convenience we can choose i = 1, 2, 3, ..., N and we will end up with a layered tree. [slide figure: layered tree over a1, a2, a3, a4, ..., an leading to the decision Y]

  26. Normalisation to a Binary Decision Tree For several conveniences (memory consumption, processing time, search time, etc.) the multi-branching tree is transformed into a binary tree, or is already built up as a binary tree from the beginning. This means that we apply the following transformation rule to all non-binary branches: if M > 2 then: if (state ai = vj) then yes-branch else no-branch, ... which means that we divide the value range at each layer into two half-spaces, namely the currently considered value and all the values not yet considered. This results in the tree shown explicitly below only for a1 and a2.
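
A short sketch of this transformation for the three-valued outlook attribute (illustrative):

```python
# Multi-way split on outlook (3 branches) rewritten as nested binary tests:
# each node asks "attribute == value?" and splits the remaining values into
# the 'yes' half-space and the 'not yet considered' half-space.
def outlook_binary(outlook):
    if outlook == "sunny":        # a1 = v1 ?
        return "branch: sunny"
    else:
        if outlook == "overcast": # a1 = v2 ?
            return "branch: overcast"
        else:                     # only v3 = rainy remains
            return "branch: rainy"

print(outlook_binary("overcast"))
```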

  27. Explosion of Layers through the Binary Tree Representation [slide figure: binary tree with yes/no nodes a1=v1, a1=v2, a1=v3, each followed by a2=v1, a2=v2, a3=v1, a3=v2, a4=v1, ..., showing how sub-trees are replicated in every branch]

  28. Shortcomings of Simple Explicit Binary Decision Trees A property of binary decision trees is the replication of sub-trees: every attribute value generates a new replication, namely replications of a sub-tree = number of values - 1, and this repeats for each attribute in each sub-tree again and again. As long as we want to (and can) express all combinations there would be no shortcoming, but • we do not want to consider all combinations, only the important ones, i.e. we generalize our explicit knowledge space • we usually do not know all combinations, which can be interpreted as an uncontrolled generalization. Both lead to the result that • the ranking of the attributes is important • attributes are no longer sorted into layers but mixed, in order to obtain an optimal structure of the tree

  29. Generalised Decision Tree In decision trees the generalisation process is much more visible and hence controllable. Generalisation means e.g. deleting a sub-tree and substituting it with only one decision. [slide figure: tree over a1, a2, a3, a4, ..., an with a pruned sub-tree]

  30. Classification (Rules) Representation of knowledge by classification rules means: if {ai and/or/not/etc. aj} then {bk}, with i, j = 1, N and k = 1, M. We have already used this representation when we explained the meaning of decision tables and decision trees. Hence decision tables and decision trees are only another ("visual") representation of a set of rules.

  31. Classification (Rules) This is true with the exception that it is straightforward to transform decision trees and tables into classification rules; but when transforming classification rules into decision tables, we have to expand all ORs into ANDs, because decision tables are look-up tables and therefore use only ANDs.

  32. Classification (Rules) For decision trees a straightforward transformation is formally possible and correct, but the advantages of decision trees get lost, namely their optimised arrangement: either • the size of the tree is minimised, or • the readability is maximised (e.g. one layer per attribute), or • a combination of both, e.g. optimised for human understanding

  33. Ranking Dependence As long as we have all cases represented, the ranking of • rows in decision tables • attributes and values in decision trees • classification rules in rule bases does not influence the result. However, we • do not have all cases (observations) • want to reduce rows, branches and rules by generalisation. This results in a ranking dependency problem. This holds also for classification rules, which may be overlooked, because at first glance a rule may seem self-standing, independent of the other knowledge, which is definitely not the case. Each rule is always embedded in its context, represented by the other rules and expressed by the ranking. This means the solution is always path dependent! So the ranking is already a part of the representation of the knowledge.

  34. Association Rules 1 Association rules express the relationship between arbitrary attribute states: if {ai = state1} then {aj = state2} for all i ≠ j, with i = 1, N and j = 1, M, where ai, aj ∈ {A, B}. If we restrict all aj to be elements of B only, i.e. if {ai = state1} then {bj = state2}, then we have a classification rule. Hence, classification rules are a subset of association rules.

  35. Association Rules 2 With association rules we can combine any attribute state (attribute value) with any other attribute state or any grouping of attribute states. There is no limitation. As a consequence, we allow dependencies (or redundancies) between the rules. It would not be wise to express all or even many association rules, because we would produce an uncontrollable sub-space of the inherent knowledge with many redundant rules, i.e. - some information is not expressed at all - some information is expressed once - some information is expressed several times. Thereby we would lose our basic weighting criteria, namely - that each rule is equally important - that the importance of an attribute or an attribute value is the frequency of its appearance in the rules - both may be generalized by adding a verifiable arbitrary weighting factor

  36. Objectives of Association Rules Association rules should only be applied as a shortcut, in addition to a clearly specified minimal rule set without redundancies. Such shortcuts are used for - important relationships - often appearing relationships - simplified solutions in order to considerably reduce the search time.

  37. Examples of Association Rules Examples: If temperature = low Then humidity = normal If windy = false and play = no Then outlook = sunny and humidity = high If humidity = high and windy = false and play = no Then outlook = sunny All are correct expressions (correct "knowledge" expressed in a rule).

  38. Coverage and Accuracy Coverage (or strength or support) is the number of instances for which a rule predicts correctly. Accuracy (or confidence) is the ratio of the instances the rule predicts correctly (consequences) to all instances it applies to (premise): accuracy = correct consequences (coverage) / correct premises (applications)

  39. Coverage and Accuracy: Examples if temperature = cool then humidity = normal: applies = 4 (temp = cool), coverage = 4 (humidity = normal | temp = cool), accuracy = 1.0 (100%). if outlook = sunny then play = yes: applies = 5, coverage = 2, accuracy = 40%. if outlook = sunny and temperature = mild then play = yes: applies = 2, coverage = 1, accuracy = 50%.
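
Continuing from the weather list above, a small sketch that computes applications, coverage and accuracy of a rule given as premise and consequence predicates (the helper is illustrative):

```python
# Coverage and accuracy of a rule on the fact table.
# premise and consequence are functions mapping a row to True/False.
def rule_stats(table, premise, consequence):
    applies  = [row for row in table if premise(row)]
    coverage = [row for row in applies if consequence(row)]
    accuracy = len(coverage) / len(applies) if applies else 0.0
    return len(applies), len(coverage), accuracy

# "if outlook = sunny then play = yes"
stats = rule_stats(
    weather,
    premise=lambda r: r[0] == "sunny",
    consequence=lambda r: r[4] == "yes",
)
print(stats)  # -> (5, 2, 0.4), i.e. applies = 5, coverage = 2, accuracy = 40%
```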

  40. Rules with Exceptions With the possibility to formulate exceptions like "but not for {ai = state}" we are able to refine generally applicable rules in a very efficient way: we divide the value range of the attribute ai into two half-spaces, namely 1) the value = state and 2) all the remaining values, by specifying only one value. This means we keep the coverage while reducing the applications as little as possible (removing mainly the wrongly predicted cases), hence we maximise the accuracy. So we can start with a simple and very general rule and sharpen it by adding exceptions. [slide figure: value range split into the included (true) values and the excluded (= false) value]

  41. Rules with Exceptions: Example if outlook = rainy then play = yes: applications = 5, coverage = 3, accuracy = 60%. When we add "and windy = not true": applications = 3, coverage = 3, accuracy = 100%. When we would instead add "and temperature = not cool": applications = 3, coverage = 2, accuracy = 66%.

  42. Rules with Relations Propositional Rules: Up to now, we have only evaluated each attribute separately, i.e. we compared the value of the attribute with a given value set. Such rules are called propositional rules and they have the same power as the propositional calculus of logical reasoning. Relational Rules: Sometimes it is convenient to compare two attributes, like: if {ai} is in some relation to {aj} then {bi}. This implies that ai and aj have the same unit or can be transformed into one and the same unit. The comparison can be any Boolean operation. Example: if (height > length) then object = column
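
A minimal sketch of the relational rule from the example; the attribute names and the fallback class are assumptions for illustration:

```python
# Relational rule: compare two attributes of the same unit (here both in metres)
# instead of comparing one attribute against a fixed value.
def classify_element(height_m, length_m):
    if height_m > length_m:   # relation between two attributes
        return "column"
    return "beam"             # fallback class, assumed for illustration

print(classify_element(3.2, 0.4))  # -> "column"
```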

  43. Rules for Numerical Values All numerically valued attributes can be dealt with in the same manner as nominal values if we apply the half-space principle, namely • divide the range of values into two half-spaces • the two half-spaces do not necessarily have to be symmetric or contain an equal number of values • repeat this recursively until sufficiently small intervals remain • the number of recursions is independent between branches. This leads straightforwardly to a binary decision tree. [slide figure: recursive splitting of the value range of attribute a1]

  44. Rules for Numerical Values Equivalently, we can pre-divide the range of values into equally (a2) or arbitrarily (a1) sized intervals and directly indicate into which interval the observed attribute value falls. This is equivalent to a multi-branching tree. The test of equality (=) is possible but not feasible, because it is, e.g. for values in R, an arbitrarily rare event. Semantically and ordinally ranked data can be dealt with in the same way. There, the test of equivalence may be feasible, because of the very limited number of values. [slide figure: value ranges of a1 and a2 divided into intervals]
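
A short sketch of pre-divided intervals for a numeric attribute; the interval boundaries and labels are assumptions for illustration:

```python
# Pre-divide a numeric value range into intervals (here: temperature in degrees C)
# and report into which interval an observed value falls - the equivalent of one
# multi-branching node in a decision tree.
import bisect

boundaries = [10.0, 20.0, 30.0]          # arbitrary interval limits (assumed)
labels = ["cold", "mild", "warm", "hot"] # one label per interval

def temperature_interval(value):
    return labels[bisect.bisect_right(boundaries, value)]

print(temperature_interval(24.5))  # -> "warm"
```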

  45. Instance-based Representation In contrast to the rule-based representation, where we test each observation for equality, namely if {ai = state} then ..., in the instance-based representation we test the attribute value set t against a given state consisting of n sets and evaluate the minimal distance. Each attribute value set is a vector with as many components as there are attributes, e.g. one row of the fact table: s1 = [sunny, hot, high, false]^T

  46. Instance-based Representation The state is now a given set of attribute value sets, e.g. 10 rows of the fact table: state = [s1 ... sn]. So we test a vector t against a vector set [s1 ... sn]: if {distance(t, si) = min} then {bt = bsi}, and we use the consequence of the closest vector for the prognosis (decision). Remark: nominal values are usually transformed into numbers, e.g. true = 0, false = 1.
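
Continuing from the weather list above, a minimal sketch of the nearest-instance decision; the mismatch-count distance for nominal values is an assumption for illustration:

```python
# Instance-based decision: classify a new attribute vector t by the class of
# the closest stored instance. Here the distance between two nominal vectors
# is simply the number of differing attribute values (an assumed metric).
def distance(t, s):
    return sum(1 for a, b in zip(t, s) if a != b)

def nearest_instance_decision(table, t):
    closest = min(table, key=lambda row: distance(t, row[:4]))
    return closest[4]  # use the consequence (play) of the closest vector

t = ("sunny", "cool", "high", True)        # new, not observed case
print(nearest_instance_decision(weather, t))  # -> "no" (closest: sunny, hot, high, true)
```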

  47. Comparison of Instance-based to Rule-based Compared to the rule-based representation of numerically valued data, where we deal with fixed intervals, here we deal with non-fixed intervals - but only at first glance. In fact, we also have fixed boundaries, namely boundaries defined halfway between the state vectors. The only difference is that the intervals are arbitrary in size and that we do not explicitly define the intervals, but define the centres (lines) of the intervals (classes). Another advantage over the explicitly expressed interval procedure of the rule-based representation is that we can easily add an additional instance for a better representation of the knowledge space, i.e. for a refinement of the space. [slide figure: state vectors s1, s2, s3 with the halfway boundaries between them]

  48. Requirements for Instance-based Representation There are 3 requirements 1) We do need a metric. 2) All attributes have to be presentable in one and the same metric. 3) We do need a distance metric, also called norm.

  49. Norm A norm is defined as a mapping which fulfils the 3 requirements. The most well-known norm is the Euclidean norm (= geometric distance in Euclidean space): ||d|| = sqrt( sum_i d_i^2 ), with d = a - b, d_i = a_i - b_i.

  50. Norm In general terms the Euclidean norm can be written as ||d||^2 = d^T d = sum_i d_i^2. This can be generalised to the diagonal norm ||d||_A^2 = d^T A d, with A = diag(a_1, ..., a_n), where the a_i are arbitrary values, which can be interpreted as weighting factors (one for each component of the instance vector). We can now also imagine off-diagonal values: we can include dependencies between attributes in our distance measure when we set the off-diagonal values of A to non-zero, i.e. we evaluate relationships between attributes (relational rules).
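
A short sketch of the Euclidean, diagonal and full-matrix distances; the weights and vectors are illustrative:

```python
import numpy as np

# Generalised squared distance ||d||_A^2 = d^T A d between two instance vectors.
# A = identity        -> Euclidean norm
# A = diagonal matrix -> per-attribute weighting factors
# A with off-diagonal -> dependencies between attributes are included
def squared_distance(a, b, A):
    d = np.asarray(a, float) - np.asarray(b, float)
    return float(d @ A @ d)

a, b = [1.0, 2.0, 0.0], [0.0, 1.0, 1.0]           # example vectors (assumed)
print(squared_distance(a, b, np.eye(3)))          # Euclidean: 3.0
print(squared_distance(a, b, np.diag([2, 1, 1]))) # weighted: 4.0
```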
