HCatalog Loader and Storer

HCatalog connects Pig scripts to Hive tables. It lets you load data from Hive into Pig and store processed data from Pig back into Hive. The two key components are HCatLoader and HCatStorer, which handle schemas, partitions, and data formats behind the scenes so that you don't have to.

HCatLoader

HCatLoader is used in Pig to load data from Hive tables. It reads the table's schema and storage details from Hive's metadata and exposes the contents to Pig as an ordinary relation. You only need to specify the table name, and Pig fetches the data. The advantage is that you don't need to know where the data is stored or which file format it uses; HCatalog takes care of that for you.

Syntax in Pig 0.14+:

A = LOAD 'my_table' USING org.apache.hive.hcatalog.pig.HCatLoader();

If the table lives in a database other than the default one, use the qualified name 'mydb.my_table' instead of 'my_table'.

Either way, you don't need to write code for parsing input files or declaring a schema; HCatalog takes care of it.
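The table-name rule above can be captured in a small helper when Pig statements are assembled from Java. The following is a minimal sketch: HCatLoadStatement and build are hypothetical names that are not part of Pig or HCatalog, and the helper only produces the Pig Latin string (for example, to pass to PigServer.registerQuery); it does not contact Hive.

```java
import java.util.Objects;

/** Hypothetical helper (not part of Pig or HCatalog) that builds a LOAD statement. */
final class HCatLoadStatement {

    static final String LOADER = "org.apache.hive.hcatalog.pig.HCatLoader";

    /**
     * @param alias    Pig relation name, e.g. "A"
     * @param database Hive database, or null/empty to use the default database
     * @param table    Hive table name
     */
    static String build(String alias, String database, String table) {
        Objects.requireNonNull(alias, "alias");
        Objects.requireNonNull(table, "table");
        // Qualify the table name only when a database is given.
        String qualified = (database == null || database.isEmpty())
                ? table
                : database + "." + table;
        return alias + " = LOAD '" + qualified + "' USING " + LOADER + "();";
    }

    public static void main(String[] args) {
        System.out.println(build("A", null, "my_table"));
        System.out.println(build("A", "mydb", "my_table"));
    }
}
```

Keeping the qualification logic in one place avoids scattering string concatenation across every LOAD statement in a larger program.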

HCatStorer

HCatStorer is the tool used to save data from Pig back into a Hive-managed table. It works like the reverse of HCatLoader: after you've processed the data in Pig, you can use HCatStorer to write it directly into a Hive table.

STORE result INTO 'my_table' USING org.apache.hive.hcatalog.pig.HCatStorer();

This stores the data in the Hive table named my_table, which must already exist in Hive. If the table is in another database (not the default), use the qualified name, such as 'mydb.my_table'.

Once the job completes, the stored data is available to queries from Hive, Pig, or other tools that read Hive tables.

Handling Partitioned Tables

Sometimes, the Hive table you want to use has partitions—for example, it might store sales data by region or date. If the table is partitioned, HCatStorer gives you two ways to handle it. If the Pig output doesn’t include the partition column, you can specify a static partition while storing the data, like this:

STORE result INTO 'sales_data'
USING org.apache.hive.hcatalog.pig.HCatStorer('region=us');

In this case, the data will be saved into the region=us partition. But if the Pig output does include the partition column (like region), you don’t need to specify it in the HCatStorer. It will automatically figure out where each record goes based on the partition field values.
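The two partition-handling modes described above can be illustrated with a short Java sketch. Everything here is hypothetical: PartitionSketch, staticSpec, storeStatement, and route are illustration names, not Pig or HCatalog APIs. The first two methods only assemble the Pig Latin STORE string, and joining several keys with commas is an assumption about the HCatStorer argument format that you should check against your HCatalog version. The route method is a conceptual model of how rows end up in partitions based on a column value; it is not how HCatStorer is implemented.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical illustration; none of these names belong to Pig or HCatalog. */
final class PartitionSketch {

    static final String STORER = "org.apache.hive.hcatalog.pig.HCatStorer";

    /** Joins ordered key/value pairs into a static partition spec such as "region=us". */
    static String staticSpec(Map<String, String> partitionValues) {
        StringJoiner spec = new StringJoiner(",");
        for (Map.Entry<String, String> entry : partitionValues.entrySet()) {
            spec.add(entry.getKey() + "=" + entry.getValue());
        }
        return spec.toString();
    }

    /** Builds a STORE statement; an empty spec means the partition comes from the data. */
    static String storeStatement(String alias, String table, String spec) {
        String argument = spec.isEmpty() ? "" : "'" + spec + "'";
        return "STORE " + alias + " INTO '" + table + "' USING " + STORER + "(" + argument + ");";
    }

    /** Conceptual model of dynamic partitioning: bucket each row by its partition column value. */
    static Map<String, List<String[]>> route(List<String[]> rows, String column, int columnIndex) {
        Map<String, List<String[]>> partitions = new LinkedHashMap<>();
        for (String[] row : rows) {
            String partition = column + "=" + row[columnIndex];
            partitions.computeIfAbsent(partition, key -> new ArrayList<>()).add(row);
        }
        return partitions;
    }

    public static void main(String[] args) {
        // Static partition: the output has no region column, so the spec names it.
        Map<String, String> fixed = new LinkedHashMap<>();
        fixed.put("region", "us");
        System.out.println(storeStatement("result", "sales_data", staticSpec(fixed)));

        // Dynamic partition: the output carries a region column, so no spec is passed.
        System.out.println(storeStatement("result", "sales_data", ""));

        // Made-up sample rows (id, amount, region) to show the routing idea.
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"101", "250.00", "us"});
        rows.add(new String[] {"102", "99.50", "eu"});
        rows.add(new String[] {"103", "12.75", "us"});
        route(rows, "region", 2).forEach((partition, members) ->
                System.out.println(partition + " -> " + members.size() + " row(s)"));
    }
}
```

With the sample rows, two records fall into region=us and one into region=eu, which mirrors the idea that each record lands in the partition named by its own field value.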

Full Example:

Below is a single-file Java program (HCatLoaderStorerExample.java) that demonstrates the two HCatalog Pig adapters, HCatLoader (input) and HCatStorer (output), using the Hive table student_details as input and student_output as output.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.Properties;

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

/**
 * Full demo using HCatLoader and HCatStorer in Pig.
 * - Input Table: student_details
 * - Output Table: student_output (students with age < 20)
 */
public class HCatLoaderStorerExample {

    public static void main(String[] args) throws Exception {

        // Step 1: Create Hive tables and insert data
        prepareHiveTables();

        // Step 2: Run Pig job using HCatLoader and HCatStorer
        runPigJob();

        System.out.println("\n✔ Data processing completed.");
        System.out.println("→ Use 'SELECT * FROM student_output;' "
                         + "in Hive to see results.");
    }

    private static void prepareHiveTables() throws Exception {
        // Load Hive JDBC driver
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Connect to HiveServer2 (adjust host/port as needed).
        // try-with-resources closes the statement and connection even if a query fails.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {

            // Create input table and load demo data
            stmt.execute("DROP TABLE IF EXISTS student_details");
            stmt.execute("CREATE TABLE student_details ("
                       + "student_id INT, student_name STRING, age INT, grade STRING"
                       + ") STORED AS RCFILE");

            stmt.execute("INSERT INTO TABLE student_details VALUES "
                       + "(1, 'Ananya', 19, 'A'),"
                       + "(2, 'Rahul', 21, 'B'),"
                       + "(3, 'Sara', 18, 'A'),"
                       + "(4, 'John', 20, 'C')");

            // Drop output table if it exists
            stmt.execute("DROP TABLE IF EXISTS student_output");

            // Create output table (HCatStorer writes into an existing table)
            stmt.execute("CREATE TABLE student_output ("
                       + "student_id INT, student_name STRING, age INT, grade STRING"
                       + ") STORED AS RCFILE");
        }

        System.out.println("✔ Hive tables created and sample data loaded.");
    }

    private static void runPigJob() throws Exception {
        // Step 1: Set minimal Hadoop & Hive configuration
        Properties properties = new Properties();
        properties.setProperty("fs.defaultFS", "hdfs://localhost:9000");
        properties.setProperty("mapreduce.framework.name", "local");
        properties.setProperty("hive.metastore.uris", "thrift://localhost:9083");

        // Step 2: Start embedded PigServer
        PigServer pigServer = new PigServer(ExecType.MAPREDUCE, properties);

        // Step 3: Register required HCatalog JARs (adjust path based on your setup)
        pigServer.registerJar("/usr/lib/hcatalog/share/hcatalog/hcatalog-core.jar");
        pigServer.registerJar("/usr/lib/hcatalog/share/hcatalog/hcatalog-pig-adapter.jar");

        // Step 4: Load, filter, and store using HCatalog Pig adapter
        pigServer.registerQuery("students = LOAD 'student_details' "
                              + "USING org.apache.hive.hcatalog.pig.HCatLoader();");

        pigServer.registerQuery("young_students = FILTER students BY age < 20;");

        pigServer.registerQuery("STORE young_students INTO 'student_output' "
                              + "USING org.apache.hive.hcatalog.pig.HCatStorer();");

        pigServer.shutdown();
        System.out.println("✔ Pig job executed successfully.");
    }
}

Compile -

javac -classpath ".:<all_required_jars>" HCatLoaderStorerExample.java

Run -

java -classpath ".:<all_required_jars>" HCatLoaderStorerExample

Output - query the output table in Hive:

SELECT * FROM student_output;

You should see:

+-------------+---------------+-----------+-------------+
| student_id  | student_name  | age       | grade       |
+-------------+---------------+-----------+-------------+
| 1           | Ananya        | 19        | A           |
| 3           | Sara          | 18        | A           |
+-------------+---------------+-----------+-------------+