A Kit For Knowledge Discovery
Data, Data everywhere yet ...
• I can’t find the data I need
  • data is scattered over the network
  • many versions, subtle differences
• I can’t get the data I need
  • need an expert to get the data
• I can’t understand the data I found
  • available data poorly documented
• I can’t use the data I found
  • results are unexpected
  • data needs to be transformed from one form to another
• There is a sequence of steps (with eventual feedback loops) that should be followed to discover knowledge (e.g., patterns) in data.
• Achieving a standardized process model
What is KDD?
Knowledge Discovery in Data is the significant method of evaluating patterns in data that are:
• Legitimate
• Innovative
• Potentially useful
• Accurate and understandable
Knowledge Discovery Process [figure; stage labels: Raw Data, Integration, Data Warehouse, Selection & Cleaning, Target Data, Transformation, Transformed Data, Data Mining, Patterns and Rules, Interpretation & Evaluation, Understanding, Knowledge]
Outcomes of Data Mining
• Clustering: based on attributes
• Association: events correlation
• Sequencing: events ~ later predictions
• Forecasting: future
• Classification: on recognizing patterns
Data Mining
• Look for hidden patterns and trends in data that are not immediately apparent from summarizing the data
Data Mining
Data + Interestingness criteria = Hidden patterns
• Type of data
• Type of interestingness criteria
• Type of patterns
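The formula on this slide can be illustrated with a small sketch. It assumes the data is a list of shopping transactions, the interestingness criterion is a minimum support threshold, and the hidden patterns are item pairs that occur together often enough. The function name `frequent_pairs` and the sample transactions are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Return item pairs whose support (share of transactions) meets the threshold."""
    counts = Counter()
    for items in transactions:
        # Count each unordered pair once per transaction.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    total = len(transactions)
    return {pair: c / total for pair, c in counts.items() if c / total >= min_support}


# Data (illustrative values, not from the slides)
transactions = [
    ["bread", "milk"],
    ["bread", "milk", "eggs"],
    ["bread", "milk"],
    ["eggs", "tea"],
]

# Interestingness criterion: a pair must appear in at least half of the transactions.
patterns = frequent_pairs(transactions, min_support=0.5)
print(patterns)
```

Other criteria (confidence, lift, novelty) could be swapped in for support; the structure of "data plus a criterion yields patterns" stays the same.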
What is a Data Warehouse?
A single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a way they can understand and use in a business context.
What is Data Warehousing?
A process of transforming data into information and making it available to users in a timely enough manner to make a difference. [figure: Data → Information]
Data Mining Process
1. Problem Definition
2. Data Integration & Cleaning
3. Model Framing & Evaluation
4. Knowledge Discovery
Data Mining Task: Basic Operations in DM
• Descriptive:
  • Clustering / Similarity Matching
  • Association rules
  • Deviation detection
• Predictive:
  • Regression
  • Classification
  • Collaborative Filtering
Why Machine Learning
• Growing flood of online data
• Budding industry
• Progress in algorithms and theory
• Data mining: using historical data to improve decisions
  • medical records ⇒ medical knowledge
  • log data to model users
• Software applications we can’t program by hand
  • autonomous driving
  • speech recognition
• Self-customizing programs
  • newsreader that learns user interests
Machine Learning (in Data Mining)
• Supervised: a target attribute is present. Discover patterns in the data.
• Unsupervised: data have no target attribute. Explore the data to find patterns.
Applications of Data Mining
• Fraud/Non-Compliance Anomaly detection
  • Isolate the factors that lead to fraud, waste and abuse
  • Target auditing and investigative efforts more effectively
• Credit/Risk Scoring
• Intrusion detection
• Recruiting/Attracting customers
• Maximizing profitability (cross selling, identifying profitable customers)
• Service Delivery and Customer Retention
  • Build profiles of customers likely to use which services
Tools For Data Mining
• LinkOut
• NCBI Sequin
• Rapid Miner
• LibSvm
• ADaM
• etc.
Why Weka Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.
About WEKA
• Waikato Environment for Knowledge Analysis (WEKA)
• Developed by the Department of Computer Science, University of Waikato, New Zealand
• Machine learning/data mining software coded in Java
• Used for research, education, and applications
• Exclusively for KDD
• Various versions are available, such as Version 2.3 (1998), Version 3.0 (1999), Version 3.4 (2003), and Version 3.6 (2008)
A Vital Part In Weka: Explorer
Weka’s Structural Layout
• Explorer: an environment for exploring data with WEKA
• Experimenter: performing experiments and conducting statistical tests between learning schemes
• Knowledge Flow: supports the same functions as the Explorer but with drag-and-drop
• Simple CLI: provides a simple command-line interface that allows direct execution of WEKA commands
WEKA File
• WEKA stores data in flat files (ARFF format).
• It is easy to transform an Excel file to ARFF format.
• An ARFF file consists of a list of instances.
• An ARFF file is plain text and can be created using Notepad or Word.
Attribute-Relation File Format (ARFF)
• The name of the dataset is given with @relation
• Attribute information is given with @attribute
• The data follows @data
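As a rough illustration of the three ARFF sections, the sketch below writes a tiny dataset as ARFF text. The helper `to_arff`, the relation name, and the sample values are assumptions made for this example; it does not handle quoting, missing values, or the other attribute types that the full format supports.

```python
def to_arff(relation, attributes, rows):
    """Render a dataset as ARFF text.

    attributes is a list of (name, kind) pairs, where kind is either the
    string "numeric" or a list of allowed nominal values.
    """
    lines = ["@relation " + relation, ""]
    for name, kind in attributes:
        if isinstance(kind, str):
            lines.append("@attribute " + name + " " + kind)
        else:
            # Nominal attributes list their allowed values in braces.
            lines.append("@attribute " + name + " {" + ",".join(kind) + "}")
    lines += ["", "@data"]
    for row in rows:
        # Each instance is one comma-separated line.
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


# Illustrative dataset (values invented for this sketch)
arff_text = to_arff(
    "weather",
    [
        ("outlook", ["sunny", "overcast", "rainy"]),
        ("temperature", "numeric"),
        ("play", ["yes", "no"]),
    ],
    [("sunny", 85, "no"), ("overcast", 83, "yes"), ("rainy", 70, "yes")],
)
print(arff_text)
```

Saving the printed text with an .arff extension should give a file that the Explorer can open, assuming the values contain no spaces or commas.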
Intrinsic Operations
1. Preprocess
2. Classify
3. Cluster
4. Associate
5. Select Attributes
Preprocessing
• Changes data formats as needed.
• Varies with the dataset being mined.
• Some of the preprocessing steps:
  • Adding/removing attributes
  • Attribute value substitution
  • Discretization (MDL, Kononenko, etc.)
  • Time series filters (delta, shift)
  • Sampling, randomization
  • Missing value management
  • Normalization and other numeric transformations
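Three of the steps listed above can be sketched in a few lines. This is a simplified illustration, not WEKA's filters: missing values are replaced by the mean, normalization is min-max scaling to [0, 1], and discretization uses plain equal-width bins rather than the MDL or Kononenko methods named on the slide. All function names and sample values are hypothetical.

```python
def replace_missing(values):
    """Replace None entries with the mean of the values that are present."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]


def min_max_normalize(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def equal_width_bins(values, bins):
    """Assign each value a bin index from 0 to bins - 1 using equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would land in bin index `bins`, so clamp it to the last bin.
    return [min(int((v - lo) / width), bins - 1) for v in values]


# Illustrative temperature column with one missing reading
temperatures = [64, None, 70, 85, 75]
filled = replace_missing(temperatures)
normalized = min_max_normalize(filled)
binned = equal_width_bins(filled, bins=3)
print(filled, normalized, binned)
```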
Pre-Processing
Opening Files: browse for the data file in the local file system.
Current Relation
Operations
• Relations
• Instances
• Schema
• Attributes
• Filters
GenericObjectEditor
• A property editor for objects that are defined as editable in the GenericObjectEditor configuration file, which lists the possible values that can be selected and then themselves configured.
• The configuration file is called "GenericObjectEditor.props" and may live in either the location given by "user.home" or the current directory (the latter takes precedence); a default properties file is read from the WEKA distribution.
Weka ~ GenericObjectEditor
• This editor lets you configure a filter.
• The same kind of dialog box is used to configure other objects, such as classifiers and clusterers.
Sample - Cluster Attributes for Cluster
Pre-Processing Retrievals Before After
Feature Selection
• Some columns are noisy or redundant. This noise makes it more difficult to discover meaningful patterns from the data.
• To discover quality patterns, most data mining algorithms require a much larger training data set when the data is high-dimensional.
• Feature selection, also known as variable selection, feature reduction, attribute selection or variable subset selection, is the technique of selecting a subset of relevant features for building robust learning models.
Attribute Selection
• Attribute selection involves searching through all possible combinations of attributes in the data to find which subset of attributes works best for prediction.
• To do this, two objects must be set up:
  • The evaluator determines what method is used to assign a worth to each subset of attributes.
  • The search method determines what style of search is performed.
• The Attribute Selection Mode box has two options:
  1. Use full training set.
  2. Cross-validation.
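The split between evaluator and search method can be sketched as two plain functions. This is a toy illustration under stated assumptions, not WEKA's implementation: the evaluator `subset_worth` scores a subset by how well a majority vote within groups of identical attribute values reproduces the labels, the search is exhaustive over all combinations, and worth is measured on the full training set (the first mode above). All names and data are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def subset_worth(rows, labels, subset):
    """Evaluator: share of rows predicted correctly by a per-group majority vote."""
    groups = defaultdict(Counter)
    for row, label in zip(rows, labels):
        key = tuple(row[i] for i in subset)
        groups[key][label] += 1
    correct = sum(counts.most_common(1)[0][1] for counts in groups.values())
    return correct / len(rows)


def exhaustive_search(rows, labels, evaluator):
    """Search method: try every attribute subset, smallest first."""
    n_attributes = len(rows[0])
    best_subset, best_worth = (), evaluator(rows, labels, ())
    for size in range(1, n_attributes + 1):
        for subset in combinations(range(n_attributes), size):
            worth = evaluator(rows, labels, subset)
            # Strict comparison: on ties the smaller, earlier subset is kept.
            if worth > best_worth:
                best_subset, best_worth = subset, worth
    return best_subset, best_worth


# Columns: outlook, windy, noise. The label depends only on outlook and windy.
rows = [
    ("sunny", "no", "a"), ("sunny", "no", "b"),
    ("sunny", "yes", "a"), ("sunny", "yes", "b"),
    ("rainy", "no", "a"), ("rainy", "no", "b"),
    ("rainy", "yes", "a"), ("rainy", "yes", "b"),
]
labels = ["yes", "yes", "no", "no", "no", "no", "no", "no"]

best_subset, best_worth = exhaustive_search(rows, labels, subset_worth)
print(best_subset, best_worth)
```

Exhaustive search grows exponentially with the number of attributes, which is why heuristic searches such as best-first are used in practice; swapping the search function leaves the evaluator untouched.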
Attribute Selection
• Very flexible: arbitrary combination of search and evaluation methods
• Both filtering and wrapping methods
• Search methods
  • best-first
  • genetic
  • ranking ...
• Evaluation measures
  • Relief
  • information gain
  • gain ratio ...
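Pairing the ranking search with the information gain measure gives a simple filter method, sketched below. The code is a minimal illustration for nominal attributes only; the function names and the four-row dataset are invented for this example and are not taken from WEKA.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, index):
    """Reduction in label entropy after splitting on the attribute at `index`."""
    total = len(rows)
    remainder = 0.0
    for value in set(row[index] for row in rows):
        subset = [label for row, label in zip(rows, labels) if row[index] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def rank_attributes(rows, labels, names):
    """Ranking search: score each attribute on its own and sort, best first."""
    scored = [(information_gain(rows, labels, i), name) for i, name in enumerate(names)]
    return sorted(scored, reverse=True)


# Columns: windy, noise. The label is fully determined by windy.
rows = [("yes", "a"), ("yes", "b"), ("no", "a"), ("no", "b")]
labels = ["stay", "stay", "play", "play"]

ranking = rank_attributes(rows, labels, ["windy", "noise"])
print(ranking)
```

A ranking scores attributes one at a time, so it cannot detect attributes that are only useful in combination; subset searches such as best-first address that at higher cost.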