Summary -

In this topic, we covered the following -

HCatalog provides a data transfer API for parallel input and output that does not require MapReduce. The API provides a way to read data from a Hadoop cluster and to write data into a Hadoop cluster.

The API uses a basic storage abstraction of tables and rows, and it is designed to enable integration of external systems with Hadoop. The data transfer API has three important classes -

  • HCatReader – reads data from a Hadoop cluster
  • HCatWriter – writes data into a Hadoop cluster
  • DataTransferFactory – generates reader and writer instances

Let us discuss each one in detail.

HCatReader -

HCatReader is an abstract class internal to HCatalog. It abstracts away the complexities of the underlying system from which the records are retrieved. Reading is a two-step process in which the first step occurs on the master node of an external system.

The second step is done in parallel on multiple slave nodes. Reads are done against a “ReadEntity”. Before starting to read, the user needs to define a ReadEntity from which to read.

This can be done through ReadEntity.Builder. The user can specify a database name, table name, partition, and filter string.

Example:

ReadEntity.Builder builder = new ReadEntity.Builder();
ReadEntity entity = builder.withDatabase("mydb")
.withTable("mytbl").build();

The above code snippet defines a ReadEntity object (“entity”), comprising a table named “mytbl” in a database named “mydb”, which can be used to read all the rows of this table.

Note! The above table must exist in HCatalog prior to the start of this operation.
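If only a subset of the table is needed, the builder can also narrow the read using the partition and filter string options mentioned earlier. The sketch below is a minimal illustration, assuming your HCatalog version's ReadEntity.Builder exposes a withFilter(String) method; the partition column “country” and its value are hypothetical.

// Hypothetical filter: read only rows from the "country=US" partition.
// withFilter(String) is assumed to be available in your HCatalog version.
ReadEntity.Builder builder = new ReadEntity.Builder();
ReadEntity entity = builder.withDatabase("mydb")
.withTable("mytbl")
.withFilter("country = \"US\"")
.build();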

After defining a ReadEntity, the user gets an instance of HCatReader using the ReadEntity and the cluster configuration, as shown below -

HCatReader reader = DataTransferFactory
.getHCatReader(entity, config);
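
Here, config is the cluster configuration used to reach the Hive metastore. The sketch below is a minimal illustration, assuming the configuration is passed as a plain Map of property names to values; the metastore host in the URI is made up for illustration.

// Hypothetical cluster configuration: a map of Hadoop/Hive properties.
// The metastore URI value below is made up for illustration.
Map<String, String> config = new HashMap<String, String>();
config.put("hive.metastore.uris", "thrift://metastore-host:9083");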

The next step is to get a ReaderContext from the reader as follows -

ReaderContext cntxt = reader.prepareRead();

All of the above steps happen on the master node. The master node then serializes the ReaderContext object and sends it to all the slave nodes. Slave nodes use the reader context to read data.
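
On a slave node, the read step typically looks like the following sketch. This is a minimal illustration, assuming a getHCatReader(ReaderContext, int) overload of DataTransferFactory (some HCatalog versions instead take an individual split and a configuration); the slaveNumber variable is a hypothetical identifier telling each slave which portion of the data to read.

// Minimal slave-side sketch: obtain a reader from the serialized
// ReaderContext and iterate over the records it returns.
// "slaveNumber" is hypothetical; the overload used here is an assumption.
HCatReader slaveReader = DataTransferFactory.getHCatReader(cntxt, slaveNumber);
Iterator<HCatRecord> itr = slaveReader.read();
while (itr.hasNext()) {
   HCatRecord record = itr.next();
   // Process the record, for example print it.
   System.out.println(record);
}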

HCatWriter -

The HCatWriter abstraction is internal to HCatalog. It facilitates writing to HCatalog from external systems. Do not try to instantiate it directly.

Instead, use DataTransferFactory. Writing is a two-step process in which the first step happens on the master node. Subsequently, the second step happens in parallel on slave nodes. Writes are done on a “WriteEntity”, which can be built in a fashion similar to reads -

WriteEntity.Builder builder = new WriteEntity.Builder();
WriteEntity entity = builder.withDatabase("mydb")
.withTable("mytbl").build();

The above code creates a WriteEntity object (“entity”) which can be used to write into a table named “mytbl” in the database “mydb”. After creating a WriteEntity, the next step is to get a WriterContext -

HCatWriter writer = DataTransferFactory
.getHCatWriter(entity, config);
WriterContext info = writer.prepareWrite();

All of the above steps happen on the master node. The master node then serializes the WriterContext object and makes it available to all the slaves. On the slave nodes, the user needs to obtain an HCatWriter using the WriterContext, as shown below -

HCatWriter writer = DataTransferFactory
.getHCatWriter(context);

Then, the writer takes an iterator of HCatRecord objects as the argument for the write method -

writer.write(hCatRecordItr);
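
Here, hCatRecordItr is an Iterator<HCatRecord> supplied by the external system. The sketch below shows one hypothetical way to build such an iterator with DefaultHCatRecord and, once every slave has finished writing, how the write is finalized back on the master node. The column values, the two-column table layout, and the masterWriter variable name are assumptions, and the commit/abort calls reflect the usual HCatWriter pattern rather than a guaranteed API for your version.

// Hypothetical slave-side sketch: build an Iterator<HCatRecord> and write it.
// The table is assumed to have two columns; the values are made up.
List<HCatRecord> records = new ArrayList<HCatRecord>();
HCatRecord rec = new DefaultHCatRecord(2); // 2 = number of columns
rec.set(0, 1);            // first column, e.g. an id
rec.set(1, "first row");  // second column, e.g. a value
records.add(rec);
writer.write(records.iterator());

// Back on the master node, after all slaves have finished writing,
// the write is finalized (or rolled back if any slave failed).
// "masterWriter" is the HCatWriter obtained earlier on the master;
// "info" is the WriterContext returned by prepareWrite().
masterWriter.commit(info);
// masterWriter.abort(info); // call instead of commit() on failure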