Adding
Download
1 / 34

Adding Search to the Hadoop Ecosystem - PowerPoint PPT Presentation


  • 156 Views
  • Uploaded on

Adding Search to the Hadoop Ecosystem. Gregory Chanan ( gchanan AT cloudera.com) LA HUG Sept 2013. Agenda. Big Data and Search – setting the stage Cloudera Search Architecture Component deep dive Security Conclusion. Why Search?. Hadoop for everyone Typical case:

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' Adding Search to the Hadoop Ecosystem' - magar


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

  • Adding Search to the

  • Hadoop Ecosystem

Gregory Chanan (gchanan AT cloudera.com)

LA HUG Sept 2013


Agenda
Agenda

  • Big Data and Search – setting the stage

  • Cloudera Search Architecture

  • Component deep dive

  • Security

  • Conclusion


Why search
Why Search?

  • Hadoop for everyone

  • Typical case:

    • Ingest data to storage engine (HDFS, HBase, etc)

    • Process data (MapReduce, Hive, Impala)

  • Experts know MapReduce

  • Savvy people know SQL

  • Everyone knows Search!


Why search1
Why Search?

An Integrated Part of the Hadoop System

One pool of data

One security framework

One set of system resources

One management interface


Benefits of search
Benefits of Search

  • Improved Big Data ROI

    • An interactive experience without technical knowledge

    • Single data set for multiple computing frameworks

  • Faster time to insight

    • Exploratory analysis, esp. unstructured data

    • Broad range of indexing options to accommodate needs

  • Cost efficiency

    • Single scalable platform; no incremental investment

    • No need for separate systems, storage


What is cloudera search
What is Cloudera Search?

  • Full-text, interactive search with faceted navigation

  • Apache Solr integrated with CDH

    • Established, mature search with vibrant community

    • In production environments for years

  • Open Source

    • 100% Apache, 100% Solr

    • Standard SolrAPIs

  • Batch, near real-time, and on-demand indexing

  • Released 1.0 (GA) last week!


Cloudera search components
Cloudera Search Components

  • HDFS/MR/Lucene/Solr/SolrCloud

  • Indexing

    • Near Real Time (NRT) indexing

    • Batch

  • ETL – Cloudera Morphlines

  • Querying


Apache hadoop
Apache Hadoop

  • Apache HDFS

    • Distributed file system

    • High reliability

    • High throughput

  • Apache MapReduce

    • Parallel, distributed programming model

    • Allows processing of large datasets

    • Fault tolerant


Apache lucene
Apache Lucene

  • Full text search

    • Indexing

    • Query

  • Traditional inverted index

  • Batch and Incremental indexing

  • We are using version 4.4 in current release


Apache solr
Apache Solr

  • Search service built using Lucene

    • Ships with Lucene (same TLP at Apache)

  • Provides XML/HTTP/JSON/Python/Ruby/… APIs

    • Indexing

    • Query

    • Administrative interface

    • Also rich web admin GUI via HTTP


Apache solrcloud
Apache SolrCloud

  • Provides distributed Search capability

  • Part of Solr (not a separate library/codebase)

  • Shards – provide scalability

    • partition index for size

    • replicate for query performance

  • Uses ZooKeeper for coordination

    • No split-brain issues

    • Simplifies operations


Solrcloud architecture
SolrCloud Architecture

  • If sent to a replica, the document is forwarded to the leader for processing.

  • If the machine is a leader, SolrCloud determines which shard the document should go to, forwards the document the leader for that shard, indexes the document for this shard, and forwards the index notation to itself and any replicas.


Distributed search on hadoop
Distributed Search on Hadoop

SolrCloud

ZK

Hue UI

query

Flume

index

Solr

Custom UI

query

Solr

index

HBase

query

Custom App

Solr

index

MR

HDFS

Hadoop Cluster


Indexing
Indexing

  • Near Real Time (NRT)

    • Flume

    • HBase Indexer

  • Batch (MR)


Indexing1
Indexing

  • Near Real Time (NRT)

    • Flume

    • HBase Indexer

  • Batch (MR)


Near real time indexing with flume
Near Real Time Indexing with Flume

Other

Log File

Log File

  • Solr and Flume

  • Data ingest at scale

  • Flexible extraction and mapping

  • Indexing at data ingest

Flume Agent

Flume Agent

HDFS

Indexer

Indexer


Apache flume morphlinesolrsink
Apache Flume - MorphlineSolrSink

  • A Flume Source…

    • Receives/gathers events

  • A Flume Channel…

    • Carries the event – MemoryChannel or reliable FileChannel

  • A Flume Sink…

    • Sends the events on to the next location

  • Flume MorphlineSolrSink

    • Integrates Cloudera Morphlines library

      • ETL, more on that in a bit

    • Does batching

    • Results sent to Solr for indexing


Indexing2
Indexing

  • Near Real Time (NRT)

    • Flume

    • HBase Indexer

  • Batch (MR)


Near real time indexing of apache hbase
Near Real Time Indexing of Apache HBase

planet-sized tabular data

immediate access & updates

fast & flexible informationdiscovery

=

+

Search

BIG DATA DATAMANAGEMENT

HBase

HBase Indexer(s)

Solr server

Solr server

Trigger

Solr server

interactive load

Solr server

Solr server

HDFS


Lily hbase indexer
Lily HBase Indexer

  • Collaboration between NGData & Cloudera

    • NGData are creators of the Lily data management platform

  • Lily HBase Indexer

    • Service which acts as a HBase replication listener

      • HBase replication features, such as filtering, supported

    • Replication updates trigger indexing of updates (rows)

    • Integrates Cloudera Morphlines library for ETL of rows

    • AL2 licensed on githubhttps://github.com/ngdata


Indexing3
Indexing

  • Near Real Time (NRT)

    • Flume

    • HBase Indexer

  • Batch (MR)


Scalable batch indexing
Scalable Batch Indexing

GOLIVE

Solr server

  • Solr and MapReduce

  • Flexible, scalable batch indexing

  • Start serving new indices with no downtime

  • On-demand indexing, cost-efficient re-indexing

Solr server

Index shard

Index shard

Indexer

HDFS

Indexer

Files

Files


Mapreduce indexer
MapReduce Indexer

MapReduce Job with two parts

1) Scan HDFS for files to be indexed

  • Much like Unix “find” – see HADOOP-8989

  • Output is NLineInputFormat’ed file

    2) Mapper/Reducer indexing step

  • Mapper extracts content via Cloudera Morphlines

  • Reducer indexes documents via embedded Solr server

  • Originally based on SOLR-1301

    • Many modifications to enable linear scalability


Mapreduce indexer golive
MapReduce Indexer “golive”

  • Cloudera created this to bridge the gap between NRT (low latency, expensive) and Batch (high latency, cheap at scale) indexing

  • Results of MR indexing operation are immediately merged into a live SolrCloud serving cluster

    • No downtime for users

    • No NRT expense

    • Linear scale out to the size of your MR cluster


Cloudera morphlines
Cloudera Morphlines

  • Open Source framework for simple ETL

  • Simplify ETL

    • Built-in commands and library support (Avro format, HadoopSequenceFiles, grok for syslog messages)

    • Configuration over coding

  • Standardize ETL

  • Ships as part Cloudera Developer Kit (CDK)

    • It’s a Java library

    • AL2 licensed on githubhttps://github.com/cloudera/cdk


  • Cloudera morphlines architecture
    Cloudera Morphlines Architecture

    Morphlines can be embedded in any application…

    SolrCloud

    Solr

    Flume, MR Indexer, HBase indexer, etc...

    Or your application!

    Logs, tweets, social media, html, images, pdf, text….

    Anything you want to index

    Solr

    Morphline Library

    Solr


    Extraction and mapping
    Extraction and Mapping

    syslog

    Flume Agent

    • Modeled after Unix pipelines

    • Simple and flexible data transformation

    • Reusable across multiple index workloads

    • Over time, extend and re-use across platform workloads

    Event

    Solr sink

    Record

    Morphline Library

    Command: readLine

    Record

    Command: grok

    Record

    Command: loadSolr

    Document

    Solr


    Morphline example syslog with grok
    Morphline Example – syslog with grok

    morphlines : [

     {

       id : morphline1

    importCommands : ["com.cloudera.**", "org.apache.solr.**"]

       commands : [

         { readLine {} }                    

         {

    grok {

    dictionaryFiles : [/tmp/grok-dictionaries]                               

             expressions : {

               message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}"""

             }

           }

         }

         { loadSolr {} }     

        ]

     }

    ]

    Example Input

    <164>Feb  4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22

    Output Record

    syslog_pri:164

    syslog_timestamp:Feb  4 10:46:14

    syslog_hostname:syslog

    syslog_program:sshd

    syslog_pid:607

    syslog_message:listening on 0.0.0.0 port 22.


    Current c ommand l ibrary
    Current Command Library

    • Integrate with and load into Apache Solr

    • Flexible log file analysis

    • Single-line record, multi-line records, CSV files

    • Regex based pattern matching and extraction

    • Integration with Avro

    • Integration with Apache Hadoop Sequence Files

    • Integration with SolrCell and all Apache Tika parsers

    • Auto-detection of MIME types from binary data using Apache Tika


    Current c ommand library cont
    Current Command Library (cont)

    • Scripting support for dynamic java code

    • Operations on fields for assignment and comparison

    • Operations on fields with list and set semantics

    • if-then-else conditionals

    • Asmall rules engine (tryRules)

    • String and timestamp conversions

    • slf4j logging

    • Yammer metrics and counters

    • Decompression and unpacking of arbitrarily nested container file formats

    • Etc…


    Querying
    Querying

    • Built-in solr web UI

    • Write your own

    • Hue


    Simple customizable search interface
    Simple, Customizable Search Interface

    • Hue

    • Simple UI

    • Navigated, faceted drill down

    • Customizable display

    • Full text search, standard Solr API and query language


    Security
    Security

    • Upstream Solr doesn’t really deal with security

    • Search 1.0 supports kerberos authentication

      • Similar to Oozie / WebHDFS

    • Actively working on Index-level authorization using Apache Sentry

    • Future: more granular authorization


    Conclusion
    Conclusion

    • Cloudera Search now Generally Available (1.0)

      • Free Download

      • Extensive documentation

      • Send your questions and feedback to [email protected]

      • Take the Search online training

    • Cloudera Manager Standard (i.e. the free version)

      • Simple management of Search

      • Free Download

    • QuickStart VM also available!


    ad