This presentation gives an overview of the Apache Tajo project. It explains Tajo architecture in relation to Hadoop/Hive and ETL.

Links for further information and connecting:
http://www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
https://nz.linkedin.com/pub/mike-frampton/20/630/385
https://open-source-systems.blogspot.com/
What Is Apache Tajo?
● A data warehouse system
● Open source / Apache 2.0 license
● Stores data on HDFS and others
● For low latency big data queries / ETL
● Supports SQL
● No release since May 2016
Tajo Catalog Storage
● Tajo can store catalog information in
  – Apache Derby (default)
  – MySQL
  – MariaDB
  – In-memory
  – Hive Catalog / HiveMetaStore
● Derby is the default with storage under /tmp
Tajo Data Storage
● Tajo can store data in the following locations
  – HDFS
  – Amazon S3
  – OpenStack Swift
  – HBase
  – RDBMS
● It is also possible to register user-defined storage
  – Place the user-defined jar file in tajo/extlib
  – Copy conf/storage-site.json.template to conf/storage-site.json and modify the copy
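The registration step above ends in an edited conf/storage-site.json. As a rough sketch only, the Python below generates such a file's text. The key names `spaces`, `uri`, `storages` and `handler` are assumptions about the template's layout and should be checked against conf/storage-site.json.template. The URI and the handler class `com.example.MyTablespace` are hypothetical placeholders.

```python
import json


def build_storage_site(spaces, storages=None):
    """Return the text of a storage-site.json document.

    spaces:   maps a tablespace name to its root URI.
    storages: optionally maps a storage type to a handler class name.
    The JSON key names are assumptions; compare them with the template file.
    """
    doc = {"spaces": {name: {"uri": uri} for name, uri in spaces.items()}}
    if storages:
        doc["storages"] = {
            kind: {"handler": handler} for kind, handler in storages.items()
        }
    return json.dumps(doc, indent=2)


config_text = build_storage_site(
    # Hypothetical tablespace name and HDFS URI.
    spaces={"warehouse": "hdfs://namenode:8020/tajo/warehouse"},
    # Hypothetical storage type and handler class shipped in tajo/extlib.
    storages={"mystore": "com.example.MyTablespace"},
)
print(config_text)
```

The printed text could then be saved as conf/storage-site.json once the key names have been confirmed.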
Tajo Shell
● Tajo provides a shell for instance manipulation
  – Issue meta commands, e.g. \l (list databases)
  – Issue HDFS commands
  – Use \set to set session variables
  – Issue \admin administration commands
  – Issue commands interactively or in batch
  – Run as a background process
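For batch use, the shell can be driven from a script. The sketch below only builds an argument list and does not start the shell. It assumes the shell executable is named `tsql` and accepts `-h` (host), `-p` (port), `-c` (single command) and `-f` (command file). These flags are assumptions and should be verified with `tsql --help` on the installation in use. The port 26002 and the table name `lineitem` are also assumed values.

```python
import shlex


def tsql_command(sql=None, script=None, host=None, port=None, database=None):
    """Build an argument list for the Tajo shell without running it.

    Exactly one of sql (a single statement) or script (a file path) is required.
    The flag names are assumptions; verify them against the shell's help output.
    """
    if (sql is None) == (script is None):
        raise ValueError("pass exactly one of sql or script")
    args = ["tsql"]
    if host is not None:
        args += ["-h", host]
    if port is not None:
        args += ["-p", str(port)]
    args += ["-c", sql] if sql is not None else ["-f", script]
    if database is not None:
        args.append(database)
    return args


# Hypothetical host, port and table name.
cmd = tsql_command(sql="select count(*) from lineitem", host="localhost", port=26002)
print(shlex.join(cmd))  # shell-quoted form, ready to paste into a terminal
```

Passing the list to `subprocess.run(cmd)` would launch the shell; it is left out here so the sketch works without a Tajo installation.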
Tajo Cluster Architecture
● A Tajo cluster has
  – One or more TajoMaster servers
  – One or more TajoWorker servers
● TajoMaster coordinates TajoWorkers
● TajoWorkers carry out processing
● More TajoWorkers mean more processing capacity
● Capacity scales linearly
Tajo TajoMaster Architecture
● A TajoMaster process has a
  – QueryCoordinator
    ● Decides whether each query should be executed in a distributed way or immediately in the TajoMaster
  – ResourceTracker
    ● Manages membership of cluster nodes
  – Client Service Provider
    ● Routes client API calls to the proper QueryCoordinator or ResourceTracker
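The QueryCoordinator's choice between immediate and distributed execution can be pictured with a toy model. The rule used below (a query that reads no stored table runs in the master, everything else is distributed) is a hypothetical stand-in chosen for illustration, not Tajo's actual decision logic. `QueryPlan` and `choose_execution` are invented names.

```python
from dataclasses import dataclass


@dataclass
class QueryPlan:
    """Hypothetical summary of a planned query."""
    sql: str
    scans_table: bool  # True when the query reads at least one stored table


def choose_execution(plan: QueryPlan) -> str:
    """Return "master" for immediate execution, else "distributed".

    Illustrative rule only: a real coordinator would weigh far more detail.
    """
    return "distributed" if plan.scans_table else "master"


for plan in (
    QueryPlan("select 1 + 1", scans_table=False),
    QueryPlan("select count(*) from lineitem", scans_table=True),
):
    print(plan.sql, "->", choose_execution(plan))
```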
Tajo TajoWorker Architecture
● A TajoWorker process has a
  – NodeResourceManager
    ● Manages the resources of the worker node
  – TaskManager
    ● Launches tasks on the TaskExecutor
    ● Uses a number of threads equal to the number of CPU cores
  – TaskExecutor
    ● Creates TaskContainers for the workload
  – NodeStatusUpdater
    ● Updates the current status when resources change
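The "one thread per CPU core" sizing used by the TaskManager is a common pattern. The sketch below shows that pattern with Python's standard thread pool. It illustrates the sizing idea only and is not Tajo's code; `run_tasks` is a hypothetical helper.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def run_tasks(tasks):
    """Run zero-argument callables on a pool with one thread per CPU core.

    Results are returned in submission order.
    """
    workers = os.cpu_count() or 1  # cpu_count() may return None
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(task) for task in tasks]
        return [future.result() for future in futures]


# Eight small tasks; n=n binds each value at definition time.
results = run_tasks([lambda n=n: n * n for n in range(8)])
print(results)
```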
Tajo Table Spaces
● Tajo supports Table Spaces
  – Data may be stored in multiple locations
  – e.g. HDFS, S3, or HBase
  – It may be stored in multiple formats
  – e.g. CSV, Parquet, or ORC
● TableSpaces provide a way to
  – Easily handle data stored on different storage types
  – In various file formats
Tajo Table Spaces
● Multiple tablespaces exist for a data source
● A tablespace contains multiple tables while a table has only one tablespace
● External tables don't have any tablespaces because they have their own storage information
● A database can contain tables of different tablespaces
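The relationships listed above can be captured in a small data model. This is a sketch for illustration, not Tajo's catalog schema. The class names, table names, tablespace names and the S3 location are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Table:
    name: str
    tablespace: Optional[str] = None  # at most one tablespace; None = external table
    location: Optional[str] = None    # external tables carry their own storage info


@dataclass
class Database:
    name: str
    tables: List[Table] = field(default_factory=list)

    def by_tablespace(self) -> Dict[Optional[str], List[str]]:
        """Group table names by tablespace; external tables fall under None."""
        groups: Dict[Optional[str], List[str]] = {}
        for table in self.tables:
            groups.setdefault(table.tablespace, []).append(table.name)
        return groups


# One database holding tables from two tablespaces plus one external table.
db = Database("sales", [
    Table("orders", tablespace="warehouse"),
    Table("customers", tablespace="warehouse"),
    Table("events", tablespace="hbase1"),
    Table("raw_logs", location="s3://bucket/raw_logs"),
])
groups = db.by_tablespace()
print(groups)
```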
Available Books
● See “Big Data Made Easy”
  – Apress Jan 2015
● See “Mastering Apache Spark”
  – Packt Oct 2015
● See “Complete Guide to Open Source Big Data Stack”
  – Apress Jan 2018
● Find the author on Amazon
  – www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
● Connect on LinkedIn
  – www.linkedin.com/in/mike-frampton-38563020
Connect
● Feel free to connect on LinkedIn
  – www.linkedin.com/in/mike-frampton-38563020
● See my open source blog at
  – open-source-systems.blogspot.com/
● I am always interested in
  – New technology
  – Opportunities
  – Technology based issues
  – Big data integration