Frequent Word Combinations Mining and Indexing on HBase

Hemanth Gokavarapu

Santhosh Kumar Saminathan


Introduction

  • Many projects use HBase to store large amounts of data for distributed computation

  • Processing this data becomes a challenge for programmers

  • Frequent terms are useful in many ways in machine learning

  • E.g., frequently purchased items, frequently asked questions, etc.


Problem

  • These HBase projects build indexes over their data

  • With these indexes, finding the frequency of a single word is easy

  • Finding the frequency of a combination of words is hard

  • For example: “cloud computing”

  • Searching for these words separately may return results like “scientific computing” or “cloud platform”
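To make the problem concrete, here is a minimal Python sketch over a hypothetical three-document corpus. The documents and the helper names `docs_with_word` and `docs_with_phrase` are assumptions for illustration, not part of the project. It contrasts looking up each word separately with looking up the adjacent word pair.

```python
# Hypothetical toy corpus; these documents are assumptions for illustration.
docs = {
    "d1": "cloud computing on hbase",
    "d2": "scientific computing with clusters",
    "d3": "a cloud platform for storage",
}

def docs_with_word(word):
    """Return ids of documents that contain the single word."""
    return {doc_id for doc_id, text in docs.items() if word in text.split()}

def docs_with_phrase(first, second):
    """Return ids of documents where `first` is immediately followed by `second`."""
    hits = set()
    for doc_id, text in docs.items():
        words = text.split()
        if (first, second) in zip(words, words[1:]):
            hits.add(doc_id)
    return hits

# Searching the words separately matches every document.
separate = docs_with_word("cloud") | docs_with_word("computing")
# Searching the combination matches only the document with the phrase.
combined = docs_with_phrase("cloud", "computing")

print(sorted(separate))  # ['d1', 'd2', 'd3']
print(sorted(combined))  # ['d1']
```

The separate lookups pull in "scientific computing" and "cloud platform" documents, which is the false-match behavior the slide describes.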


Objective

  • This project focuses on finding the frequency of a combination of words

  • We use data mining concepts and the Apriori algorithm

  • We use MapReduce and HBase for the implementation


Survey Topics

  • Apriori Algorithm

  • HBase

  • MapReduce


Data Mining

What is Data Mining?

  • The process of analyzing data from different perspectives

  • Summarizing the data into useful information


Data Mining

How does Data Mining work?

  • Data mining analyzes relationships and patterns in stored transaction data based on open-ended user queries

    What technological infrastructure is needed?

    Two critical technological drivers answer this question:

  • Size of the database

  • Query complexity


Apriori Algorithm

  • Apriori is an influential algorithm for mining frequent itemsets for Boolean association rules.

  • Association rules are a widely applied data mining approach.

  • Association rules are derived from frequent itemsets.

  • It uses a level-wise search based on the frequent itemset property: every subset of a frequent itemset must also be frequent.
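The level-wise search can be sketched as follows. This is a minimal in-memory Python version, not the project's MapReduce implementation; the function name `apriori`, the example word sets, and the 50% threshold are assumptions for illustration.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain the whole itemset.
        return sum(itemset <= t for t in transactions) / n

    def keep_frequent(candidates):
        scored = {c: support(c) for c in candidates}
        return {c: s for c, s in scored.items() if s >= min_support}

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = keep_frequent({frozenset([item]) for item in items})
    frequent = dict(current)

    k = 2
    while current:
        prev = list(current)
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = keep_frequent(candidates)
        frequent.update(current)
        k += 1
    return frequent

# Hypothetical "transactions": the set of words in each sentence.
sentences = [
    {"cloud", "computing", "hbase"},
    {"cloud", "computing"},
    {"cloud", "platform"},
    {"scientific", "computing"},
]
result = apriori(sentences, min_support=0.5)
for itemset, sup in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

Each pass of the loop corresponds to one level: it only builds k-item candidates from itemsets that were already frequent at level k-1, which is what keeps the candidate space small.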



Apriori Algorithm & Problem Description

If the minimum support is 50%, then {Shoes, Jacket} is the only 2-itemset that satisfies the minimum support.

If the minimum confidence is 50%, then the only two rules generated from this 2-itemset that have confidence greater than 50% are:

Shoes → Jacket: Support = 50%, Confidence = 66%

Jacket → Shoes: Support = 50%, Confidence = 100%
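The transaction table behind these figures is not part of the transcript. The Python sketch below uses an assumed four-transaction database chosen only to be consistent with the stated support and confidence values; the transactions and the helper names are assumptions.

```python
# Assumed transactions: the slide's original table is not in the transcript,
# so this data is hypothetical and chosen to match the stated percentages.
transactions = [
    {"Shoes", "Shirt", "Jacket"},
    {"Shoes", "Jacket"},
    {"Shoes", "Jeans"},
    {"Shirt", "Sweatshirt"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """support(antecedent and consequent together) / support(antecedent)."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"Shoes", "Jacket"}))                 # 0.5
print(round(confidence({"Shoes"}, {"Jacket"}), 2))  # 0.67
print(confidence({"Jacket"}, {"Shoes"}))            # 1.0
```

Under this assumed data, Shoes appears in 3 of 4 transactions and the pair in 2 of 4, so the confidence of Shoes → Jacket is 2/3, which the slide states as 66%.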


Apriori Algorithm Example

Min support = 50%

[Figure: Database D is scanned repeatedly to build candidate sets C1, C2, C3 and frequent sets L1, L2, L3.]


Apriori Advantages & Disadvantages

  • ADVANTAGES:

    Uses the large itemset property

    Easily parallelized

    Easy to implement

  • DISADVANTAGES:

    Assumes the transaction database is memory resident

    Requires many database scans


HBase

What is HBase?

  • A Hadoop database

  • Non-relational

  • Open-source, distributed, versioned, column-oriented store

  • Modeled after Google Bigtable

  • Runs on top of HDFS (Hadoop Distributed File System)
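One way to picture "versioned, column-oriented" is as a nested map: row key, then column (family:qualifier), then timestamp, then value. The sketch below is a toy in-memory model in Python, not the HBase client API; the class `ToyTable`, the row key, and the column name `freq:count` are hypothetical.

```python
from collections import defaultdict

class ToyTable:
    """Toy in-memory stand-in for a versioned, column-oriented table."""

    def __init__(self):
        # row key -> "family:qualifier" -> {timestamp: value}
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, value, timestamp):
        """Store a new version of the cell without overwriting older ones."""
        self._rows[row][column][timestamp] = value

    def get(self, row, column, versions=1):
        """Return up to `versions` (timestamp, value) pairs, newest first."""
        cells = self._rows.get(row, {}).get(column, {})
        return sorted(cells.items(), reverse=True)[:versions]

table = ToyTable()
table.put("cloud computing", "freq:count", 3, timestamp=1)
table.put("cloud computing", "freq:count", 5, timestamp=2)

print(table.get("cloud computing", "freq:count"))              # [(2, 5)]
print(table.get("cloud computing", "freq:count", versions=2))  # [(2, 5), (1, 3)]
```

A word combination used as the row key, with its count in a column, is one plausible layout for a frequency index, but the transcript does not describe the schema the project actually used.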


MapReduce

  • Framework for processing highly distributable problems across huge datasets using a large number of nodes (a cluster).

  • Processing occurs on data stored either in a filesystem (unstructured) or in a database (structured)
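The map, shuffle, and reduce phases can be imitated on a single machine. This Python sketch is a simulation under assumed names (`map_phase`, `shuffle`, `reduce_phase`), not Hadoop code; in a real cluster the framework runs the mappers and reducers on different nodes and performs the shuffle itself.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.lower().split()]

def shuffle(pairs):
    """Group emitted values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)

# Hypothetical input lines.
lines = ["cloud computing", "scientific computing", "cloud platform"]

emitted = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(emitted).items())
print(counts)
```

Because each line is mapped independently and each key is reduced independently, both phases can be spread across many nodes.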



Mapper and Reducer

  • Mappers

    • FrequentItemsMap

      - Finds the combinations and assigns a key-value pair to each combination

    • CandidateGenMap

    • AssociationRuleMap

  • Reducers

    • FrequentItemsReduce

    • CandidateGenReduce

    • AssociationRuleReduce
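The transcript names the mappers and reducers but does not show their code. As a rough illustration of the frequent-items stage only, the Python sketch below emits one key-value pair per word combination and then sums and filters them. The function names echo the slide, but the bodies, the `size` and `min_count` parameters, and the input lines are assumptions.

```python
from collections import Counter
from itertools import combinations

def frequent_items_map(line, size=2):
    """Emit (combination, 1) for each `size`-word combination in the line."""
    words = sorted(set(line.lower().split()))
    return [(combo, 1) for combo in combinations(words, size)]

def frequent_items_reduce(pairs, min_count):
    """Sum counts per combination and keep those seen at least `min_count` times."""
    totals = Counter()
    for combo, count in pairs:
        totals[combo] += count
    return {combo: total for combo, total in totals.items() if total >= min_count}

# Hypothetical input lines.
lines = [
    "cloud computing on hbase",
    "cloud computing",
    "scientific computing",
]

pairs = [pair for line in lines for pair in frequent_items_map(line)]
frequent = frequent_items_reduce(pairs, min_count=2)
print(frequent)  # {('cloud', 'computing'): 2}
```

Sorting the words before combining them gives each combination a single canonical key, so the same pair emitted from different lines lands in the same reduce group.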


Flow Chart

  • Start

  • Find Frequent Items

  • Find Candidate Itemsets

  • Find Frequent Items

  • Set Null? If no, return to Find Candidate Itemsets; if yes, continue

  • Generate Association Rules


Schedule

  • 1 week – Talking to the experts at FutureGrid

  • 1 week – Survey of HBase and the Apriori algorithm

  • 4 weeks – Starting the implementation of the Apriori algorithm

  • 2 weeks – Testing the code and collecting the results



Conclusion

  • Execution takes more time on a single node

  • As the number of mappers increases, performance improves

  • When the data is very large, single-node execution takes more time and behaves unpredictably



Known Issues

  • When word frequencies are very low in a large data set, the reducer takes more time

  • E.g., a text paragraph in which words are rarely repeated


Future Work

  • The analysis can be repeated with Twister and other platforms

  • The algorithm can be extended to other applications that use machine learning techniques

