HCatalog Loader and Storer

Summary

This topic covers the following sections:

Loader and Storer
- HCatLoader
- HCatStorer

The HCatLoader interface is used with Pig scripts to read data from HCatalog-managed tables. The HCatStorer interface is used with Pig scripts to write data to HCatalog-managed tables. No HCatalog-specific setup is required for either interface.

Refer to the Pig tutorial for basic knowledge of Pig.

HCatLoader

HCatLoader is used with Pig scripts to read data from HCatalog-managed tables. HCatLoader is accessed via a Pig load statement.

Syntax in Pig 0.14+:

A = LOAD 'tablename' USING org.apache.hive.hcatalog.pig.HCatLoader();

The table name must be given in single quotes, as in LOAD 'tablename'. If the table is in a non-default database, specify the input as 'dbname.tablename'. The Hive metastore lets users create tables without specifying a database. Tables created this way belong to the 'default' database, and the database name is not required when specifying the table for HCatLoader.

Restrictions apply to the types of columns HCatLoader can read from HCatalog-managed tables. Each Hive column type is mapped to a corresponding Pig type when the table is read. Recent Hive releases can read all Hive primitive data types, while older releases support a smaller set.
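The naming rules above can be sketched as follows. This is a minimal illustration, not a definitive script: the table names web_logs and daily_sales and the database name analytics are hypothetical, and both tables are assumed to exist already in the Hive metastore.

```pig
-- Hypothetical table and database names; both tables must already exist.

-- Table in the 'default' database: the database name can be omitted.
logs = LOAD 'web_logs' USING org.apache.hive.hcatalog.pig.HCatLoader();

-- Table in a non-default database: qualify it as 'dbname.tablename'.
sales = LOAD 'analytics.daily_sales' USING org.apache.hive.hcatalog.pig.HCatLoader();

-- Print the Pig schema that HCatLoader derived from the Hive column types.
DESCRIBE logs;
DESCRIBE sales;
```

DESCRIBE is a convenient way to check how each Hive column type was mapped to a Pig type before writing further transformations. Depending on the installation, Pig may need the HCatalog jars on its classpath, for example by starting it with pig -useHCatalog.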

HCatStorer

HCatStorer is used with Pig scripts to write data to HCatalog-managed tables. HCatStorer is accessed via a Pig store statement.

A = LOAD ...
B = FOREACH A ...
...
...
my_processed_data = ...
STORE my_processed_data INTO 'tablename'
    USING org.apache.hive.hcatalog.pig.HCatStorer();

The table name must be given in single quotes, as in INTO 'tablename'. Both the database and the table must be created before running the Pig script. If the table is in a non-default database, specify the output as 'dbname.tablename'.
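A minimal sketch of storing into a table in a non-default database is shown here. The database analytics, the tables raw_events and clean_events, and the column user_id are all hypothetical. The script assumes that both tables already exist and that the schema of the stored relation is compatible with the target table.

```pig
-- Hypothetical names: the 'analytics' database and both tables are assumed to exist.
raw = LOAD 'analytics.raw_events' USING org.apache.hive.hcatalog.pig.HCatLoader();

-- Keep only rows with a non-null user_id (hypothetical column).
clean = FILTER raw BY user_id IS NOT NULL;

-- HCatStorer does not create the target table; 'analytics.clean_events'
-- must have been created in Hive before this script runs.
STORE clean INTO 'analytics.clean_events' USING org.apache.hive.hcatalog.pig.HCatStorer();
```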

The Hive metastore lets users create tables without specifying a database. Tables created this way belong to the 'default' database, and the database name does not need to be specified in the store statement.

In the USING clause, HCatStorer accepts an optional string argument that represents key/value pairs for the partition. This argument is mandatory when writing to a partitioned table and the partition column is not in the output columns. The values for partition keys should NOT be quoted. If the partition columns are present in the data, they need not be specified as a STORE argument.
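The two partitioning cases can be sketched as follows. All table names and the partition column datestamp are hypothetical. The example assumes that web_data and web_data_by_day are existing tables partitioned by datestamp, and that the staging tables have compatible columns.

```pig
-- Case 1: the partition column is NOT in the data being stored.
-- Hypothetical staging table without a 'datestamp' column.
staged = LOAD 'raw_web_data' USING org.apache.hive.hcatalog.pig.HCatLoader();

-- Pass the partition as key=value; the value itself is not quoted
-- inside the argument string.
STORE staged INTO 'web_data'
    USING org.apache.hive.hcatalog.pig.HCatStorer('datestamp=20110924');

-- Case 2: the partition column IS present in the data being stored.
-- Hypothetical staging table that already carries a 'datestamp' column.
dated = LOAD 'raw_web_data_dated' USING org.apache.hive.hcatalog.pig.HCatLoader();

-- No argument is needed; the partition values are taken from the data.
STORE dated INTO 'web_data_by_day'
    USING org.apache.hive.hcatalog.pig.HCatStorer();
```

In the first case every stored row lands in the single partition named in the argument. In the second case each row is written to the partition that matches its own datestamp value.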