A Kit For Knowledge Discovery


Data, Data everywhere yet ...
• I can’t find the data I need
  • data is scattered over the network
  • many versions, subtle differences
• I can’t get the data I need
  • need an expert to get the data
• I can’t understand the data I found
  • available data poorly documented
• I can’t use the data I found
  • results are unexpected
  • data needs to be transformed from one form to another

• There is a sequence of steps (with eventual feedback loops) that should be followed to discover knowledge (e.g., patterns) in data.
• Achieving a standardized process model

What is KDD?
Knowledge Discovery in Data is the significant method of evaluating legitimate, innovative, probably useful, accurate and understandable patterns in data.

Knowledge Discovery Process
[Figure. Labels shown: Raw Data, Data Warehouse, Selection & Cleaning, Target Data, Transformation, Transformed Data, Data Mining, Patterns and Rules, Interpretation & Evaluation, Knowledge, Understanding, Knowledge Integration]

Outcomes of Data Mining
• Clustering based on attributes
• Association (events correlation)
• Sequencing events ~ later predictions
• Forecasting future
• Classification on recognizing patterns

Data Mining
• Look for hidden patterns and trends in data that are not immediately apparent from summarizing the data

Data Mining
Data + Interestingness criteria = Hidden patterns
• Type of data
• Type of interestingness criteria
• Type of patterns

What is a Data Warehouse?
A single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a way they can understand and use in a business context.

What is Data Warehousing?
A process of transforming data into information and making it available to users in a timely enough manner to make a difference
[Figure labels: Data, Information]

Data Mining Process
1. Problem Definition
2. Data Integration & Cleaning
3. Model Framing & Evaluation
4. Knowledge Discovery

Data Mining Task
Basic Operations in DM
• Descriptive:
  • Clustering / Similarity Matching
  • Association rules
  • Deviation detection
• Predictive:
  • Regression
  • Classification
  • Collaborative Filtering
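As a rough illustration of the association rules operation listed above, the sketch below computes the two measures usually quoted for a rule A => C: support and confidence. The transactions and the function names are made up for this example and are not taken from the slides or from any particular tool.

```python
# Hypothetical market-basket transactions (illustrative data only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent): support(A and C) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


# Rule: bread => milk
rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(rule_support, rule_confidence)
```

An association rule miner would typically keep only the rules whose support and confidence exceed user-chosen thresholds.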

Why Machine Learning
Growing flood of online data
Budding industry
Progress in algorithms and theory
• Data mining: using historical data to improve decisions
  • medical records ⇒ medical knowledge
  • log data to model user
• Software applications we can’t program by hand
  • autonomous driving
  • speech recognition
• Self customizing programs
  • Newsreader that learns user interests

Machine Learning
• Supervised: presence of a target attribute. Discover patterns in the data.
• Unsupervised: data have no target attribute. Explore data to find patterns.

Applications Of Data Mining

Applications of Data Mining
• Fraud/Non-Compliance Anomaly detection
  • Isolate the factors that lead to fraud, waste and abuse
  • Target auditing and investigative efforts more effectively
• Credit/Risk Scoring
• Intrusion detection
• Recruiting/Attracting customers
• Maximizing profitability (cross selling, identifying profitable customers)
• Service Delivery and Customer Retention
  • Build profiles of customers likely to use which services

Tools For Data Mining
LinkOut, NCBI Sequin, Rapid Miner, LibSvm, ADaM, etc.

Why Weka Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.

About WEKA
• Waikato Environment for Knowledge Analysis (WEKA)
• Developed by the Department of Computer Science, University of Waikato, New Zealand
• Machine learning/data mining software coded in Java
• Used for research, education, and applications
• Exclusively for KDD.
• Various versions are available, such as Version 2.3, 1998; Version 3.0, 1999; Version 3.4, 2003; Version 3.6, 2008.

Weka GUI Chooser

A Vital Part In Weka: Explorer


Weka’s Structural Layout
• Explorer: An environment for exploring data with WEKA
• Experimenter: Performing experiments and conducting statistical tests between learning schemes
• Knowledge Flow: Supports the same functions as the Explorer but with drag-and-drop
• Simple CLI: Provides a simple command-line interface that allows direct execution of WEKA commands

Algorithms

WEKA File
• WEKA stores data in flat files (ARFF format).
• It is easy to transform an Excel file to ARFF format.
• An ARFF file consists of a list of instances.
• An ARFF file can be created using Notepad or Word.
Attribute Relation File Format (ARFF)
• The name of the dataset is given with @relation
• Attribute information is given with @attribute
• The data follows @data.
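To make the three ARFF sections concrete, here is a minimal sketch that renders a small table as ARFF text. The helper name to_arff and the weather-style sample values are assumptions made for this illustration; the sketch does not quote names or values that contain spaces or commas, which a real ARFF file would need.

```python
def to_arff(relation, attributes, rows):
    """Render a dataset as ARFF text.

    attributes: list of (name, kind) pairs, where kind is the string
    "numeric" or a list of nominal values.
    rows: list of value tuples; None stands for a missing value.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, list):
            # Nominal attributes list their allowed values in braces.
            kind = "{" + ",".join(kind) + "}"
        lines.append(f"@attribute {name} {kind}")
    lines += ["", "@data"]
    for row in rows:
        # ARFF marks a missing value with a question mark.
        lines.append(",".join("?" if v is None else str(v) for v in row))
    return "\n".join(lines) + "\n"


# Hypothetical sample data, for illustration only.
text = to_arff(
    "weather",
    [
        ("outlook", ["sunny", "overcast", "rainy"]),
        ("temperature", "numeric"),
        ("play", ["yes", "no"]),
    ],
    [("sunny", 85, "no"), ("overcast", None, "yes")],
)
print(text)
```

The printed text can be saved with an .arff extension and opened from the Preprocess panel.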

Sample ARFF

Intrinsic Operations
1. Preprocess
2. Classify
3. Cluster
4. Associate
5. Select Attributes

Pre-Processing

Preprocessing
• Changing data formats as per the needs.
• Varies as per mining datasets.
• Some of the preprocessing steps:
  • Adding/removing attributes
  • Attribute value substitution
  • Discretization (MDL, Kononenko, etc.)
  • Time series filters (delta, shift)
  • Sampling, randomization
  • Missing value management
  • Normalization and other numeric transformations
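Three of the preprocessing steps listed above can be sketched in a few lines of plain Python. This is an illustration only: the function names are hypothetical, missing values are filled with the column mean, and the discretization shown is simple equal-width binning, not the MDL or Kononenko methods named in the list.

```python
def replace_missing_with_mean(values):
    """Missing value management: fill None entries with the column mean."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]


def normalize(values):
    """Normalization: rescale numeric values linearly into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def discretize_equal_width(values, bins):
    """Discretization: map each value to the index of an equal-width bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), bins - 1) for v in values]


ages = [20, None, 40, 60]  # hypothetical column with one missing value
filled = replace_missing_with_mean(ages)
scaled = normalize(filled)
binned = discretize_equal_width(filled, bins=2)
print(filled, scaled, binned)
```

Each function takes one attribute column at a time, which loosely mirrors how a filter is applied to selected attributes in the Preprocess panel.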

Algorithms

Pre-Processing
• Opening Files: Browse for the data file in the local file system.
• Current Relation
• Operations
[Panel labels shown: Relations, Instances, Schema, Attributes, Filters]

Weka – Formulating Files

Dataset -.txt Format

Weka ~ Datasets

Missing Values

GenericObjectEditor
• A property editor for objects that have been defined as editable in the GenericObjectEditor configuration file, which lists the possible values that can be selected and themselves configured.
• The configuration file is called "GenericObjectEditor.props" and may live in either the location given by "user.home" or the current directory (the latter takes precedence); a default properties file is read from the Weka distribution.

Weka ~ GenericObjectEditor
• This editor allows you to configure a filter.
• The same kind of dialog box is used to configure other objects, such as classifiers and clusterers.

Sample - Cluster Attributes for Cluster

Weka’s Viewer

PCA Analysis

Pre-Processing Retrievals Before After

Retrieving Significant Attributes

Select Attribute !

Algorithms

Feature Selection
• Some columns are noisy or redundant. This noise makes it more difficult to discover meaningful patterns from the data.
• To discover quality patterns, most data mining algorithms require a much larger training data set when the data are high-dimensional.
• Feature selection, also known as variable selection, feature reduction, attribute selection or variable subset selection, is the technique of selecting a subset of relevant features for building robust learning models.

Attribute Selection
• Attribute selection involves searching through all possible combinations of attributes in the data to find which subset of attributes works best for prediction.
• To do this, two objects must be set up:
  • The evaluator determines what method is used to assign a worth to each subset of attributes.
  • The search method determines what style of search is performed.
• The Attribute Selection Mode box has two options:
  1. Use full training set.
  2. Cross-validation.
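The split between an evaluator and a search method can be sketched as two small functions. This is not Weka's implementation: the evaluator below is a stand-in that scores a subset by the training-set accuracy of a majority-vote lookup table (loosely comparable to the "use full training set" mode), the search simply tries every non-empty subset, and all names and data are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def evaluate_subset(rows, labels, subset):
    """Evaluator: assign a worth to one subset of attribute indices.

    Worth is the training-set accuracy of a lookup table that predicts
    the majority label for each combination of the subset's values.
    """
    groups = defaultdict(Counter)
    for row, label in zip(rows, labels):
        key = tuple(row[i] for i in subset)
        groups[key][label] += 1
    correct = sum(max(counts.values()) for counts in groups.values())
    return correct / len(rows)


def exhaustive_search(rows, labels, n_attributes):
    """Search method: try every non-empty subset, smallest subsets first.

    Ties keep the subset found first, so smaller subsets are preferred.
    """
    best_subset, best_worth = (), -1.0
    for size in range(1, n_attributes + 1):
        for subset in combinations(range(n_attributes), size):
            worth = evaluate_subset(rows, labels, subset)
            if worth > best_worth:
                best_subset, best_worth = subset, worth
    return best_subset, best_worth


# Hypothetical data: columns are (outlook, windy, noise); label is "play".
rows = [
    ("sunny", "no", "a"),
    ("sunny", "yes", "b"),
    ("rainy", "no", "b"),
    ("rainy", "yes", "a"),
]
labels = ["yes", "yes", "no", "no"]
best_subset, best_worth = exhaustive_search(rows, labels, n_attributes=3)
print(best_subset, best_worth)
```

Exhaustive search grows exponentially with the number of attributes, which is why practical tools offer heuristic search methods in its place.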

Attribute Selection
• Very flexible: arbitrary combination of search and evaluation methods
• Both filtering and wrapping methods
• Search methods
  • best-first
  • genetic
  • ranking ...
• Evaluation measures
  • Relief
  • information gain
  • gain ratio ...
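One of the combinations named above, ranking with information gain as the evaluation measure, can be sketched as follows. The function names and the sample data are assumptions made for illustration; the code handles nominal attributes only and is not Weka's implementation.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(column, labels):
    """Reduction in label entropy after splitting on a nominal attribute."""
    total = len(labels)
    remainder = 0.0
    for value in set(column):
        subset = [lab for v, lab in zip(column, labels) if v == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def rank_attributes(rows, labels, names):
    """Ranking search: score each attribute alone, sort by descending gain."""
    gains = {
        name: information_gain([row[i] for row in rows], labels)
        for i, name in enumerate(names)
    }
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


# Hypothetical data: "outlook" determines the label, "windy" does not.
rows = [("sunny", "no"), ("sunny", "yes"), ("rainy", "no"), ("rainy", "yes")]
labels = ["yes", "yes", "no", "no"]
ranking = rank_attributes(rows, labels, ["outlook", "windy"])
print(ranking)
```

Ranking scores each attribute on its own, so it is fast but cannot detect attributes that are only useful in combination.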

Applying Algorithm
