Frequent Word Combinations Mining and Indexing on HBase

Hemanth Gokavarapu

Santhosh Kumar Saminathan


Introduction

  • Many projects use HBase to store large amounts of data for distributed computation

  • Processing this data is a challenge for programmers

  • Frequent terms are useful in many areas of machine learning

  • E.g., frequently purchased items, frequently asked questions, etc.


Problem

  • These HBase projects create indexes over multiple data sets

  • Using these indexes, the frequency of a single word is easy to find

  • The frequency of a combination of words is hard to find

  • For example: “cloud computing”

  • Searching for the two words separately may return results such as “scientific computing” or “cloud platform”
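
To make the problem concrete, here is a minimal Python sketch. The three sample documents and the helper names `word_counts` and `phrase_count` are hypothetical, and treating a "combination" as adjacent words is an assumption rather than something the slides state. It shows that per-word counts alone cannot tell how often the words occur together.

```python
from collections import Counter

def word_counts(docs):
    """Single-word frequencies, as a per-word index would provide."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return counts

def phrase_count(docs, phrase):
    """Count occurrences of the words of `phrase` appearing next to each other."""
    target = phrase.lower().split()
    n = len(target)
    total = 0
    for doc in docs:
        words = doc.lower().split()
        total += sum(1 for i in range(len(words) - n + 1)
                     if words[i:i + n] == target)
    return total

# Hypothetical sample documents
docs = [
    "cloud computing on hbase",
    "scientific computing workloads",
    "a cloud platform for storage",
]

single = word_counts(docs)
print(single["cloud"], single["computing"])   # 2 2
print(phrase_count(docs, "cloud computing"))  # 1
```

Each word occurs twice on its own, yet the combination occurs only once, which is the gap the project aims to close.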


Objective

  • This project focuses on finding the frequency of a combination of words

  • We use data mining concepts and the Apriori algorithm

  • We implement it with MapReduce and HBase


Survey Topics

  • Apriori Algorithm

  • HBase

  • MapReduce


Data Mining

What is data mining?

  • The process of analyzing data from different perspectives

  • Summarizing the data into useful information


Data Mining

How does data mining work?

  • Data mining analyzes relationships and patterns in stored transaction data based on open-ended user queries

What technological infrastructure is needed?

Two critical technological drivers answer this question:

  • Size of the database

  • Query complexity


Apriori Algorithm

  • The Apriori algorithm is an influential algorithm for mining frequent itemsets for Boolean association rules.

  • Association rule mining is a widely applied data mining approach.

  • Association rules are derived from frequent itemsets.

  • It uses a level-wise search based on the frequent itemset property: every subset of a frequent itemset must also be frequent.
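
The level-wise step can be sketched in Python as follows. The function name `generate_candidates` and the sample itemsets are hypothetical; this illustrates the general join-and-prune idea, not the presenters' code. Frequent k-itemsets are joined into (k+1)-candidates, and a candidate is kept only if all of its k-subsets are frequent.

```python
from itertools import combinations

def generate_candidates(frequent_k, k):
    """Join frequent k-itemsets into (k+1)-itemsets and prune any candidate
    that has an infrequent k-subset (the frequent itemset property)."""
    frequent_k = set(frequent_k)
    candidates = set()
    for a in frequent_k:
        for b in frequent_k:
            union = a | b
            if len(union) != k + 1:
                continue
            # Keep the candidate only if every k-subset is itself frequent
            if all(frozenset(s) in frequent_k for s in combinations(union, k)):
                candidates.add(union)
    return candidates

# Hypothetical frequent 2-itemsets
l2 = {frozenset(p) for p in [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d")]}
c3 = generate_candidates(l2, 2)
print(sorted(sorted(c) for c in c3))  # [['a', 'b', 'c']]
```

Here {a, b, d} and {b, c, d} are pruned without scanning the database, because {a, d} and {c, d} are not frequent.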


Algorithm Flow


Apriori Algorithm & Problem Description

If the minimum support is 50%, then {Shoes, Jacket} is the only 2-itemset that satisfies the minimum support.

If the minimum confidence is 50%, then the only two rules generated from this 2-itemset that have confidence greater than 50% are:

Shoes → Jacket: Support = 50%, Confidence = 66%

Jacket → Shoes: Support = 50%, Confidence = 100%
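
The transaction table behind these numbers is not part of the transcript. The sketch below uses a hypothetical four-transaction database chosen to be consistent with the figures above. Support is the fraction of transactions that contain an itemset, and the confidence of a rule A → B is support(A ∪ B) / support(A).

```python
# Hypothetical transactions, chosen to match the support and confidence
# values quoted above; the original table is not in the transcript.
transactions = [
    {"Shoes", "Shirt", "Jacket"},
    {"Shoes", "Jacket"},
    {"Shoes", "Jeans"},
    {"Shirt", "Sweatshirt"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent):
    """support(antecedent and consequent together) / support(antecedent)."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"Shoes", "Jacket"}))                   # 0.5
print(round(confidence({"Shoes"}, {"Jacket"}), 2))    # 0.67
print(confidence({"Jacket"}, {"Shoes"}))              # 1.0
```

Shoes appears in three of the four transactions, so Shoes → Jacket has confidence 0.5 / 0.75, about two thirds, while every transaction containing Jacket also contains Shoes.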


Apriori Algorithm Example

Min support = 50%

[Figure: Database D is scanned to count the candidate sets C1, C2 and C3, which yield the frequent itemsets L1, L2 and L3.]


Apriori Advantages & Disadvantages

  • ADVANTAGES:

    Uses the large itemset property

    Easily parallelized

    Easy to implement

  • DISADVANTAGES:

    Assumes the transaction database is memory resident

    Requires many database scans


HBase

What is HBase?

  • The Hadoop database

  • Non-relational

  • Open-source, distributed, versioned, column-oriented store

  • Modeled after Google's Bigtable

  • Runs on top of HDFS (Hadoop Distributed File System)
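
To illustrate what "versioned, column-oriented" means, here is a toy in-memory model in Python. It is not the HBase client API; the class `ToyTable`, the row key and the column name `freq:count` are all hypothetical. Each cell is addressed by row key and "family:qualifier" column, and keeps several timestamped versions.

```python
import time
from collections import defaultdict

class ToyTable:
    """Toy stand-in for a versioned, column-oriented table (not HBase itself)."""

    def __init__(self):
        # row key -> "family:qualifier" -> list of (timestamp, value), newest first
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, value, timestamp=None):
        ts = timestamp if timestamp is not None else int(time.time() * 1000)
        cells = self._rows[row][column]
        cells.append((ts, value))
        cells.sort(key=lambda cell: cell[0], reverse=True)

    def get(self, row, column, versions=1):
        """Return up to `versions` values for the cell, newest first."""
        cells = self._rows.get(row, {}).get(column, [])
        return [value for _, value in cells[:versions]]

table = ToyTable()
table.put("cloud computing", "freq:count", 3, timestamp=1)
table.put("cloud computing", "freq:count", 5, timestamp=2)
print(table.get("cloud computing", "freq:count"))              # [5]
print(table.get("cloud computing", "freq:count", versions=2))  # [5, 3]
```

Using the word combination as the row key is one plausible way to index combination frequencies; the slides do not describe the actual table layout.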


MapReduce

  • A framework for processing highly distributable problems across huge data sets using a large number of nodes (a cluster)

  • Processing can occur on data stored either in a file system (unstructured) or in a database (structured)
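
The map, shuffle and reduce phases can be imitated in plain Python on a single machine. The sketch below is an illustration under assumptions: the function names and sample documents are hypothetical, it does not use Hadoop, and it treats a combination as two words co-occurring in the same document.

```python
from collections import defaultdict
from itertools import combinations

def map_pairs(doc):
    """Map phase: emit (word pair, 1) for every pair of distinct words in a document."""
    words = sorted(set(doc.lower().split()))
    for pair in combinations(words, 2):
        yield pair, 1

def shuffle(mapped):
    """Shuffle phase: group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reduce_counts(key, values):
    """Reduce phase: sum the counts for one key."""
    return key, sum(values)

def run_job(docs):
    mapped = (kv for doc in docs for kv in map_pairs(doc))
    return dict(reduce_counts(k, v) for k, v in shuffle(mapped).items())

# Hypothetical sample documents
docs = ["cloud computing platform", "cloud computing", "scientific computing"]
counts = run_job(docs)
print(counts[("cloud", "computing")])  # 2
```

In a real cluster the mappers run in parallel on separate input splits and the framework performs the shuffle; only the two user-supplied functions correspond to `map_pairs` and `reduce_counts`.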


Mapper and Reducer

  • Mappers

    • FrequentItemsMap: finds the word combinations and assigns a key and value to each combination

    • CandidateGenMap

    • AssociationRuleMap

  • Reducers

    • FrequentItemsReduce

    • CandidateGenReduce

    • AssociationRuleReduce


Flow Chart

[Figure: Start → Find Frequent Items → Find Candidate Itemsets → Find Frequent Items → Set Null? If no, repeat from Find Candidate Itemsets; if yes, Generate Association Rules.]
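
The loop in the flow chart can be sketched as a single-machine Python driver. The function names, the sample transactions and the threshold are hypothetical. In the project each step would presumably be a MapReduce job, and the final rule-generation step is omitted here.

```python
from itertools import combinations

def find_frequent(transactions, candidates, min_support):
    """Keep the candidates whose support reaches the threshold."""
    n = len(transactions)
    return {c for c in candidates
            if sum(1 for t in transactions if c <= t) / n >= min_support}

def find_candidates(frequent, k):
    """Build k-itemsets whose (k-1)-subsets are all frequent."""
    items = sorted(set().union(*frequent))
    return {frozenset(c) for c in combinations(items, k)
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))}

def apriori(transactions, min_support):
    singles = {frozenset([item]) for t in transactions for item in t}
    frequent = find_frequent(transactions, singles, min_support)
    levels = []
    k = 1
    while frequent:  # the "Set Null?" check: stop once no frequent itemsets remain
        levels.append(frequent)
        k += 1
        candidates = find_candidates(frequent, k)
        frequent = find_frequent(transactions, candidates, min_support)
    return levels

# Hypothetical transactions: the set of words in each document
transactions = [
    {"cloud", "computing", "hbase"},
    {"cloud", "computing"},
    {"scientific", "computing"},
    {"cloud", "platform"},
]
levels = apriori(transactions, min_support=0.5)
print([sorted(sorted(s) for s in level) for level in levels])
# [[['cloud'], ['computing']], [['cloud', 'computing']]]
```

Each pass over `transactions` in `find_frequent` corresponds to one database scan, which is why the number of scans grows with the size of the largest frequent itemset.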


Schedule

  • 1 week: talking to the experts at FutureGrid

  • 1 week: survey of HBase and the Apriori algorithm

  • 4 weeks: starting the implementation of the Apriori algorithm

  • 2 weeks: testing the code and collecting the results


Results


Conclusion

  • Execution takes longer on a single node

  • Performance improves as the number of mappers increases

  • When the data set is very large, single-node execution takes even longer and behaves erratically


Known Issues

  • When word frequencies are very low in a large data set, the reducer takes more time

  • E.g., a text paragraph in which words are rarely repeated


Future Work

  • The analysis can be repeated with Twister and other platforms

  • The algorithm can be extended to other applications that use machine learning techniques


Questions?

