Apache Tajo

What Is Apache Tajo?
● A data warehouse system
● Open source / Apache 2.0 license
● Stores data on HDFS and other storage systems
● For low latency big data queries / ETL
● Supports SQL
● No release since May 2016

Tajo Catalog Storage
● Tajo can store catalog information in
  – Apache Derby (default)
  – MySQL
  – MariaDB
  – In-memory
  – Hive Catalog / HiveMetaStore
● Derby is the default, with its catalog stored under /tmp
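Because the Derby default keeps the catalog under /tmp, a longer-lived cluster would normally point the catalog at one of the other stores. The sketch below renders a Hadoop-style configuration file for a MySQL catalog. The property names are recalled from the Tajo catalog documentation and should be treated as assumptions to verify against your version; the helper `catalog_site_xml`, the user name, password, host and database name are hypothetical placeholders.

```python
import xml.etree.ElementTree as ET


def catalog_site_xml(properties):
    """Render Hadoop-style <configuration> XML from a dict of property names to values."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Property names are an assumption (recalled from the Tajo catalog docs);
# the credentials, host and database name are placeholders.
MYSQL_CATALOG = {
    "tajo.catalog.store.class": "org.apache.tajo.catalog.store.MySQLStore",
    "tajo.catalog.jdbc.connection.id": "tajo_user",
    "tajo.catalog.jdbc.connection.password": "change-me",
    "tajo.catalog.jdbc.uri": "jdbc:mysql://localhost:3306/tajo?createDatabaseIfNotExist=true",
}

if __name__ == "__main__":
    # The output would be saved as conf/catalog-site.xml (file name also an assumption).
    print(catalog_site_xml(MYSQL_CATALOG))
```

Generating the file from a dictionary keeps the four settings in one place, which helps when the same values have to be rolled out to every node.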

Tajo Data Storage
● Tajo can store data in the following locations
  – HDFS
  – Amazon S3
  – OpenStack Swift
  – HBase
  – RDBMS
● It is also possible to register user-defined storage
  – Place the user-defined JAR file in tajo/extlib
  – Copy a modified conf/storage-site.json.template to conf/storage-site.json
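The storage-site.json step can be sketched in Python. This is a minimal sketch, assuming the `storages` / `handler` / `default-format` and `spaces` / `uri` key layout that the Tajo storage documentation describes; the handler class `com.example.MyTablespace`, the storage name `mystore`, the tablespace name `warehouse` and its URI are hypothetical, so compare the result with the storage-site.json.template shipped with your installation.

```python
import json


def build_storage_site(storages, spaces):
    """Return storage-site.json text registering storage handlers and tablespaces.

    `storages` maps a storage name to (handler class, default format);
    `spaces` maps a tablespace name to its URI.
    The JSON key names are an assumption based on the Tajo documentation.
    """
    config = {
        "storages": {
            name: {"handler": handler, "default-format": fmt}
            for name, (handler, fmt) in storages.items()
        },
        "spaces": {name: {"uri": uri} for name, uri in spaces.items()},
    }
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    # Hypothetical handler class (it would live in the JAR placed in tajo/extlib).
    text = build_storage_site(
        storages={"mystore": ("com.example.MyTablespace", "text")},
        spaces={"warehouse": "hdfs://localhost:8020/tajo/warehouse"},
    )
    print(text)  # save the output as conf/storage-site.json
```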

Tajo Shell
● Tajo provides a shell for instance manipulation
  – Issue meta commands, e.g. \l (list databases)
  – Issue HDFS commands
  – Use \set to set session variables
  – Issue \admin administration commands
  – Issue commands interactively or in batch mode
  – Run as a background process

Tajo Cluster Architecture
● A Tajo cluster has
  – One or more TajoMaster servers
  – One or more TajoWorker servers
● TajoMaster coordinates TajoWorkers
● TajoWorkers carry out processing
● More TajoWorkers mean more processing capacity
● Capacity scales linearly

Tajo TajoMaster Architecture
● A TajoMaster process has a
  – QueryCoordinator
    ● Decides whether each query should be executed in a distributed way or immediately in the TajoMaster
  – ResourceTracker
    ● Manages membership of cluster nodes
  – Client Service Provider
    ● Routes client API calls to the proper QueryCoordinator or ResourceTracker

Tajo TajoWorker Architecture
● A TajoWorker process has a
  – NodeResourceManager
    ● Manages the resources of the worker node
  – TaskManager
    ● Launches tasks to the TaskExecutor
    ● Uses a number of threads equal to the number of CPU cores
  – TaskExecutor
    ● Creates TaskContainers for the workload
  – NodeStatusUpdater
    ● Updates the current status when resources change
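The TaskManager sizing rule, one thread per CPU core, can be shown with a small analogy. Tajo itself is written in Java and this is not its code; the function `run_tasks` and the squaring workload are hypothetical and only demonstrate the sizing idea.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def run_tasks(tasks, workers=None):
    """Run zero-argument callables on a thread pool sized to the CPU core count."""
    # os.cpu_count() can return None, so fall back to a single thread.
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(task) for task in tasks]
        # Collect results in submission order, not completion order.
        return [future.result() for future in futures]


if __name__ == "__main__":
    # Four toy tasks; n=n binds each loop value at definition time.
    print(run_tasks([lambda n=n: n * n for n in range(4)]))
```

Tying the pool size to the core count keeps every core busy without oversubscribing the node, which is presumably why the worker adopts the same rule.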

Tajo Table Spaces
● Tajo supports Table Spaces
  – Data may be stored in multiple locations
  – e.g. HDFS, S3, or HBase
  – It might be stored in multiple formats
  – e.g. CSV, Parquet, or ORC
● TableSpaces provide a way to
  – Easily handle data stored on different storage types
  – In various file formats

Tajo Table Spaces
● Multiple tablespaces exist for a data source
● A tablespace contains multiple tables while a table has only one tablespace
● External tables don't have any tablespaces because they have their own storage information
● A database can contain tables of different tablespaces
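The relationships in these bullets can be captured in a toy data model. This is an illustrative sketch only, not Tajo's catalog implementation; the class names, the tablespace names `warehouse` and `hbase1`, and the S3 location are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Table:
    name: str
    # A managed table belongs to exactly one tablespace.
    tablespace: Optional[str] = None
    # An external table has no tablespace and carries its own storage location.
    location: Optional[str] = None

    def __post_init__(self):
        # Exactly one of the two must be set.
        if (self.tablespace is None) == (self.location is None):
            raise ValueError("a table needs either a tablespace or its own location")

    @property
    def is_external(self) -> bool:
        return self.tablespace is None


@dataclass
class Database:
    name: str
    tables: Dict[str, Table] = field(default_factory=dict)

    def add(self, table: Table) -> None:
        self.tables[table.name] = table

    def tablespaces(self) -> Set[str]:
        """Tablespaces used by the managed tables of this database."""
        return {t.tablespace for t in self.tables.values() if not t.is_external}


if __name__ == "__main__":
    db = Database("sales")
    db.add(Table("orders", tablespace="warehouse"))
    db.add(Table("events", tablespace="hbase1"))
    db.add(Table("logs", location="s3://example-bucket/logs"))
    print(sorted(db.tablespaces()))
```

One database ends up spanning two tablespaces plus an external table, which is the situation the last bullet describes.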

Available Books
● See “Big Data Made Easy”
  – Apress Jan 2015
● See “Mastering Apache Spark”
  – Packt Oct 2015
● See “Complete Guide to Open Source Big Data Stack”
  – Apress Jan 2018
● Find the author on Amazon
  – www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
● Connect on LinkedIn
  – www.linkedin.com/in/mike-frampton-38563020

Connect
● Feel free to connect on LinkedIn
  – www.linkedin.com/in/mike-frampton-38563020
● See my open source blog at
  – open-source-systems.blogspot.com/
● I am always interested in
  – New technology
  – Opportunities
  – Technology based issues
  – Big data integration
