slide1 n.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
Hive 实战 PowerPoint Presentation
Download Presentation
Hive 实战

Loading in 2 Seconds...

play fullscreen
1 / 19

Hive 实战 - PowerPoint PPT Presentation


  • 283 Views
  • Uploaded on

Hive 实战. 数据平台及产品部 少杰. Agenda. 简介 Hive QL Hive 扩展 SQL vs HQL. 简介. 分布式计算 MapReduce 编程模型 Hadoop Hive. 简介. Hive 系统结构. 简介. 数据流 (in taobao) 数据源: weblog/db/… 数据同步: jdbcdump 报表计算 / 预处理 /ETL : Hive 数据入库: dbloader. Hive QL. 数据类型 Primitive int / bigint / smallint / tinyint boolean

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

Hive 实战


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
    Presentation Transcript
    1. Hive实战 数据平台及产品部 少杰

    2. Agenda • 简介 • Hive QL • Hive扩展 • SQL vs HQL

    3. 简介 • 分布式计算 • MapReduce编程模型 • Hadoop • Hive

    4. 简介 • Hive系统结构

    5. 简介 • 数据流(in taobao) • 数据源:weblog/db/… • 数据同步:jdbcdump • 报表计算/预处理/ETL:Hive • 数据入库:dbloader

    6. Hive QL • 数据类型 • Primitive • int / bigint / smallint / tinyint • boolean • double / float • string • Array • Map • Struct • No precision / length config • No date / datetime type

    7. Hive QL • DDL – create table • CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name [(col_name data_type, ...)] [PARTITIONED BY (col_name data_type, ...)] [ [ROW FORMAT row_format] [STORED AS file_format] | [ WITH SERDEPROPERTIES (...) ] ] [LOCATION hdfs_path]

    8. Hive QL • DDL – create table example CREATE TABLE page_view( viewTime INT, userid BIGINT, page_url STRING, referrer_url STRING, ip STRING COMMENT 'IP Address of the User‘ ) COMMENT 'This is the page view table' PARTITIONED BY(dt STRING, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' STORED AS SEQUENCEFILE;

    9. Hive QL • DML – load data • LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]

    10. Hive QL • DML – insert • INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement • FROM from_statement INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 [INSERT OVERWRITE TABLE tablename2 [PARTITION ...] select_statement2] ... • INSERT OVERWRITE TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement • (HDFS)不支持UPDATE!

    11. Hive QL • DML/DDL – add partition • ALTER TABLE table_name ADD PARTITION (partcol1=val1, partcol2=val2 ...) [LOCATION 'filepath' ]

    12. Hive QL • Query - select • SELECT [ALL | DISTINCT] select_expr, select_expr, ... FROM table_reference [WHERE where_condition] [GROUP BY col_list] [ CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list] ] [LIMIT number] • 不支持exist in子查询

    13. Hive QL • Query - join • join_table: table_reference JOIN table_factor [join_condition] | table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition • table_reference: table_factor | join_table • table_factor: tbl_name [alias] | table_subquery alias | ( table_references ) • join_condition: ON equality_expression ( AND equality_expression )* equality_expression: expression = expression • 等值Join • 合并Join的原则 • NULL值处理

    14. Hive QL • Query - subqueries • SELECT ... FROM (subquery) name ... • select_statement UNION ALL select_statement UNION ALL select_statement ...

    15. Hive扩展 • UDFs • 类别 • UDF - 1:1 • UDAF – N:1 • (UDTF) • Implement UDF • extends UDF / GenericUDF • implement evaluate() function • Implement UDAF • extends UDAF / GenericUDAF • implement • iterate • merge • terminatePartial • terminate

    16. Hive扩展 • Transform • FROM ( FROM src MAP expression (',' expression)* USING 'my_map_script' ( AS colName (',' colName)* )? ( clusterBy? | distributeBy? sortBy? ) src_alias ) REDUCE expression (',' expression)* USING 'my_reduce_script' ( AS colName (',' colName)* )?

    17. Hivevs SQL • 语义 • 无关系约束(第一范式?) • 不支持exist in子查询 • 只支持等值Join • 数据类型

    18. Hive优化器 • Partition Pruning (ppr)分区裁减where(pt=‘’) • Predicate Push down (ppd) • Column Pruning (cp) • Mapjoin transformer

    19. Hive优化 • 数据偏斜 • MapJoin缺点:1内存2小表数*MAP数是否太大 • Group by (distinct) skew • 内存优化 • 驱动表:优化内存,将大表作为驱动表即a join b b为驱动表 • I/O优化 • Map aggregation • MR任务合并 • multi-insert节省两次m/r的扫描 • multi-groupby • multi-distinct