Frequent Word Combinations Mining and Indexing on HBase

Hemanth Gokavarapu
Santhosh Kumar Saminathan

Introduction

  • Many projects use HBase to store large amounts of data for distributed computation

  • Processing this data becomes a challenge for programmers

  • Frequent terms help in many ways in the field of machine learning

  • E.g., frequently purchased items, frequently asked questions, etc.


  • These projects on HBase create indexes on multiple data

  • Using these indexes, the frequency of a single word is easy to find

  • The frequency of a combination of words is hard to find

  • For example: “cloud computing”

  • Searching for these words separately may return results like “scientific computing” or “cloud platform”


  • This project focuses on finding the frequency of a combination of words

  • We use data mining concepts and the Apriori algorithm for this project

  • We use Map-Reduce and HBase for this project

Survey Topics

  • Apriori Algorithm

  • HBase

  • Map – Reduce

Data Mining

What is Data Mining?

  • The process of analyzing data from different perspectives

  • Summarizing data into useful information

Data Mining

How does Data Mining work?

  • Data mining analyzes relationships and patterns in stored transaction data based on open-ended user queries

    What technological infrastructure is needed?

    Two critical technological drivers answer this question:

  • Size of the database

  • Query complexity

Apriori Algorithm

  • Apriori is an influential algorithm for mining frequent itemsets for Boolean association rules.

  • Association rules are a widely applied data mining approach.

  • Association rules are derived from frequent itemsets.

  • It uses a level-wise search based on the frequent itemset property: every subset of a frequent itemset is also frequent.
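The level-wise step can be illustrated with a minimal sketch of candidate generation. It assumes itemsets are represented as Python frozensets; the function name `generate_candidates` and the sample itemsets are hypothetical and do not come from the slides. Frequent (k-1)-itemsets are joined into k-itemset candidates, and a candidate is kept only if all of its (k-1)-subsets are frequent.

```python
from itertools import combinations

def generate_candidates(frequent_prev, k):
    """Join frequent (k-1)-itemsets into k-itemsets, then prune.

    A candidate is kept only if every one of its (k-1)-subsets is
    itself frequent (the frequent itemset property).
    """
    frequent_prev = set(frequent_prev)
    candidates = set()
    for a, b in combinations(frequent_prev, 2):
        union = a | b
        if len(union) != k:
            continue  # the two itemsets do not share k-2 items
        if all(frozenset(sub) in frequent_prev
               for sub in combinations(union, k - 1)):
            candidates.add(union)
    return candidates

# Hypothetical frequent 2-itemsets, for illustration only.
frequent_2 = {frozenset(p) for p in [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d")]}
candidates_3 = generate_candidates(frequent_2, 3)
print(sorted(sorted(c) for c in candidates_3))  # [['a', 'b', 'c']]
```

Here {a, b, d} and {b, c, d} are pruned without scanning the data, because {a, d} and {c, d} are not frequent.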

Apriori Algorithm & Problem Description

If the minimum support is 50%, then {Shoes, Jacket} is the only 2-itemset that satisfies the minimum support.

If the minimum confidence is 50%, then the only two rules generated from this 2-itemset that have confidence greater than 50% are:

Shoes → Jacket: Support = 50%, Confidence = 66%

Jacket → Shoes: Support = 50%, Confidence = 100%

Apriori Algorithm Example

[Figure: Apriori worked example on Database D, with three scans of D; min support = 50%]

Apriori Advantages & Disadvantages


    Advantages

    Uses the large itemset property

    Easily parallelized

    Easy to implement


    Disadvantages

    Assumes the transaction database is memory resident

    Requires many database scans


What is HBase?

  • A Hadoop Database

  • Non-relational

  • Open-source, distributed, versioned, column-oriented store

  • Modeled after Google Bigtable

  • Runs on top of HDFS (Hadoop Distributed File System)
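To make "versioned, column-oriented" concrete, here is a toy in-memory model of the idea. It is not the HBase API. The class name `TinyVersionedTable`, the row key "cloud computing", and the column "freq:count" are all hypothetical choices for illustration. Each cell is addressed by row key and column, and keeps several timestamped versions.

```python
import time
from collections import defaultdict

class TinyVersionedTable:
    """Toy model of a versioned, column-oriented table (not the HBase API)."""

    def __init__(self):
        # row key -> "family:qualifier" -> list of (timestamp, value)
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, value, timestamp=None):
        """Store a new version of the cell; older versions are kept."""
        ts = timestamp if timestamp is not None else time.time_ns()
        self._rows[row][column].append((ts, value))

    def get(self, row, column, versions=1):
        """Return up to `versions` values, newest first."""
        cells = self._rows.get(row, {}).get(column, [])
        newest_first = sorted(cells, key=lambda c: c[0], reverse=True)
        return [value for _, value in newest_first[:versions]]

table = TinyVersionedTable()
table.put("cloud computing", "freq:count", 3, timestamp=1)
table.put("cloud computing", "freq:count", 5, timestamp=2)
print(table.get("cloud computing", "freq:count"))              # [5]
print(table.get("cloud computing", "freq:count", versions=2))  # [5, 3]
```

Using the word combination itself as the row key is just one possible layout for a frequency index; the slides do not specify the schema.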

Map Reduce

  • Framework for processing highly distributable problems across huge datasets using a large number of nodes (a cluster).

  • Processing occurs on data stored either in a filesystem (unstructured) or in a database (structured)
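The map, shuffle, and reduce phases can be sketched in a single process. This is only an illustration of the programming model, not Hadoop code; the function names and the sample lines are assumptions. The mapper emits a count of 1 for each pair of adjacent words, the shuffle groups values by key, and the reducer sums them.

```python
from collections import defaultdict

def map_phase(line):
    """Emit ((word, next_word), 1) for each adjacent word pair in a line."""
    words = line.lower().split()
    for first, second in zip(words, words[1:]):
        yield (first, second), 1

def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Sum the counts for one key."""
    return key, sum(values)

# Hypothetical input lines, for illustration only.
lines = [
    "cloud computing on hbase",
    "scientific computing",
    "cloud computing platform",
]
mapped = (pair for line in lines for pair in map_phase(line))
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts[("cloud", "computing")])  # 2
```

On a real cluster the map calls would run in parallel on different nodes, and each reducer would receive only a partition of the keys.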

Mapper and Reducer

  • Mappers

    • FrequentItemsMap

      • Finds the combinations and assigns a key value to each combination

    • CandidateGenMap

    • AssociationRuleMap

  • Reducers

    • FrequentItemsReduce

    • CandidateGenReduce

    • AssociationRuleReduce

    Flow Chart

    [Figure: flow chart with the steps Find Frequent Items, Find Candidate Itemsets, Find Frequent Items, Set Null?, Generate Association Rules]
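The flow chart's steps can be read as a driver loop. The sketch below is a single-process approximation and not the authors' MapReduce jobs. It assumes, as in standard Apriori, that the "Set Null?" check loops back to candidate generation until no candidates remain. All function names and the sample documents are hypothetical.

```python
from itertools import combinations

def find_frequent(candidates, transactions, min_count):
    """Count each candidate and keep those meeting the minimum count."""
    counts = {c: sum(c <= t for t in transactions) for c in candidates}
    return {c: n for c, n in counts.items() if n >= min_count}

def find_candidates(frequent, k):
    """Build k-itemsets whose (k-1)-subsets are all frequent."""
    items = sorted({i for itemset in frequent for i in itemset})
    return {frozenset(c) for c in combinations(items, k)
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))}

def generate_rules(all_frequent, min_conf):
    """Derive rules lhs -> rhs from every frequent itemset of size >= 2."""
    rules = []
    for itemset, count in all_frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                conf = count / all_frequent[lhs]
                if conf >= min_conf:
                    rules.append((set(lhs), set(itemset - lhs), conf))
    return rules

def run(transactions, min_count, min_conf):
    singles = {frozenset([w]) for t in transactions for w in t}
    frequent = find_frequent(singles, transactions, min_count)
    all_frequent, k = dict(frequent), 2
    while True:
        candidates = find_candidates(frequent, k)
        if not candidates:  # "Set Null?" answered yes: stop iterating
            break
        frequent = find_frequent(candidates, transactions, min_count)
        all_frequent.update(frequent)
        k += 1
    return all_frequent, generate_rules(all_frequent, min_conf)

# Hypothetical documents, each reduced to its set of words.
docs = [
    {"cloud", "computing", "hbase"},
    {"cloud", "computing"},
    {"scientific", "computing"},
    {"cloud", "platform"},
]
all_frequent, rules = run(docs, min_count=2, min_conf=0.6)
print(all_frequent[frozenset({"cloud", "computing"})])  # 2
```

In a MapReduce setting, each call to `find_frequent` would correspond to one counting job over the full data set, which is why the number of passes matters for performance.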


    • 1 Week – Talking to the experts at FutureGrid

    • 1 Week – Survey of HBase and the Apriori algorithm

    • 4 Weeks – Kick-start on implementing the Apriori algorithm

    • 2 Weeks – Testing the code and getting the results


    • Execution takes more time on a single node

    • As the number of mappers increases, performance improves

    • When the data is very large, single-node execution takes more time and behaves erratically

    Known Issues

    • When the frequency is very low for a large data set, the reducer takes more time

    • E.g., a text paragraph in which words are rarely repeated

    Future Work

    • The analysis can also be done with Twister and other platforms

    • The algorithm can be extended to other applications that use machine learning techniques