HCatalog Architecture

Summary -

This topic covers the following sections -

Architecture

HCatalog is a table and storage management layer for Hadoop. It lets users of different data processing tools, such as Pig and MapReduce, read and write data on the grid more easily.

Users can load tables directly with Pig or MapReduce without re-defining the input schemas. HCatalog exposes the tabular data of the Hive metastore to other Hadoop applications. In effect, Apache HCatalog enables non-Hive scripts to access Hive tables.

Users need not worry about where or in what format their data is stored. HCatalog's table abstraction presents users with a relational view of data in the Hadoop Distributed File System (HDFS).

HCatalog can display data stored in RCFile format, text files, or sequence files in a tabular view. It also provides APIs that let external systems access the metadata of these tables.

Architecture -

HCatalog is built on top of the Hive metastore and incorporates Hive's DDL. It provides read and write interfaces for data processing tools such as Pig and MapReduce.

HCatalog uses Hive's command line interface for issuing data definition and metadata exploration commands.

Interfaces -

The HCatalog interface for Pig consists of HCatLoader and HCatStorer. HCatLoader accepts a table to read data from, whether the underlying data is stored in RCFile format, text files, or sequence files. To restrict which partitions are scanned, follow the load statement with a partition filter statement.

HCatStorer accepts a table to write to and, optionally, a specification of partition keys to create a new partition. There is no Hive-specific interface: because HCatalog uses Hive's metastore, Hive can read data in HCatalog directly.
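The load-with-filter and store-with-partition-spec behaviour described above can be pictured with a small toy model. The sketch below is plain Python and does not use the HCatalog or Pig APIs; the names `make_spec`, `store`, `load`, and the `web_logs` table with its `datestamp` and `country` keys are all hypothetical and exist only for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Toy stand-in for a partitioned table (not the HCatalog API): each partition
# is identified by its key/value pairs and holds a list of records.
PartitionSpec = Tuple[Tuple[str, str], ...]
Table = Dict[PartitionSpec, List[dict]]


def make_spec(**keys: str) -> PartitionSpec:
    """Normalise a partition spec so the order of the keys does not matter."""
    return tuple(sorted(keys.items()))


def store(table: Table, records: List[dict], **partition: str) -> None:
    """Storer-like step: write records to the partition named by the spec,
    creating that partition if it does not exist yet."""
    table.setdefault(make_spec(**partition), []).extend(records)


def load(
    table: Table,
    partition_filter: Callable[[Dict[str, str]], bool] = lambda keys: True,
) -> List[dict]:
    """Loader-like step: scan only the partitions accepted by the filter and
    return their records with the partition keys attached as columns."""
    rows: List[dict] = []
    for spec, records in table.items():
        keys = dict(spec)
        if partition_filter(keys):
            rows.extend({**record, **keys} for record in records)
    return rows


web_logs: Table = {}
store(web_logs, [{"url": "/home"}, {"url": "/cart"}], datestamp="20240101", country="US")
store(web_logs, [{"url": "/home"}], datestamp="20240102", country="US")

# Only the partition for 20240101 is scanned; the other one is skipped.
jan_first = load(web_logs, lambda keys: keys["datestamp"] == "20240101")
all_rows = load(web_logs)
```

The point of the filter is that whole partitions are skipped before any record is read, which is why a partition filter is cheaper than filtering rows after loading.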

Data is defined using HCatalog's command line interface (CLI). The HCatalog CLI supports all Hive DDL that does not require MapReduce to execute, allowing users to create, alter, drop tables, etc.

Data Model -

HCatalog provides a relational view of data. Data is stored in tables and these tables can be placed in databases. Tables can be partitioned on one or more keys.

New partitions can be added to a table, and partitions can be dropped from an existing table. Partitioned tables have no partitions at the time of creation. Unpartitioned tables effectively have one default partition that must be created at the time of table creation.

Partitions contain the data records. Partitions are multi-dimensional and not hierarchical. Records are divided into columns. Columns have a name and a datatype. HCatalog supports the same datatypes as Hive.
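As a rough illustration of this data model, the sketch below models databases, tables, columns, and partitions in plain Python. It is a toy under stated assumptions, not HCatalog's implementation: the classes `Column` and `Table`, the methods `add_partition` and `drop_partition`, and the example tables are hypothetical names chosen for this example. Partitions are kept in a flat mapping keyed by the full combination of partition-key values, which is one way to read "multi-dimensional and not hierarchical".

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Column:
    name: str
    datatype: str  # e.g. "string" or "int"


@dataclass
class Table:
    name: str
    columns: List[Column]
    partition_keys: Tuple[str, ...] = ()
    # Flat mapping: one entry per combination of partition-key values.
    partitions: Dict[Tuple[str, ...], List[dict]] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if not self.partition_keys:
            # An unpartitioned table gets its single default partition up front.
            self.partitions[()] = []

    def _key(self, values: Dict[str, str]) -> Tuple[str, ...]:
        # A partition is named by a value for every partition key, no fewer.
        if set(values) != set(self.partition_keys):
            raise ValueError(f"expected values for keys {self.partition_keys}")
        return tuple(values[k] for k in self.partition_keys)

    def add_partition(self, **values: str) -> None:
        self.partitions.setdefault(self._key(values), [])

    def drop_partition(self, **values: str) -> None:
        self.partitions.pop(self._key(values), None)


# A database is modelled as a plain mapping from table name to table.
database: Dict[str, Table] = {}

logs = Table("web_logs", [Column("url", "string")], partition_keys=("datestamp", "country"))
lookup = Table("country_codes", [Column("code", "string"), Column("name", "string")])
database[logs.name] = logs
database[lookup.name] = lookup

partitions_at_creation = len(logs.partitions)   # partitioned table starts empty
default_partitions = len(lookup.partitions)     # unpartitioned table has one

logs.add_partition(datestamp="20240101", country="US")
logs.add_partition(datestamp="20240101", country="IN")
logs.drop_partition(datestamp="20240101", country="US")
```

Because each partition is addressed by the whole tuple of key values, there is no parent partition for `datestamp` alone that would contain per-country children; dropping one combination leaves the others untouched.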