HCatalog Loader and Storer
HCatalog is like a helper that connects Pig scripts to Hive tables. It lets you easily load data from Hive into Pig and store processed data from Pig back into Hive. The two key components for this are HCatLoader and HCatStorer. They handle all the necessary tasks behind the scenes, like managing schemas, partitions, and data formats, so you don't have to worry about any of it.
HCatLoader
HCatLoader is used in Pig to load data from Hive tables. It reads the table’s structure and content using Hive’s metadata and makes it available in Pig as if it’s just another data file. You just need to specify the table name and Pig will fetch the data automatically. The advantage is that you don’t need to worry about where the data is stored or its format—HCatalog takes care of that for you.
Syntax in Pig 0.14+:
A = LOAD 'my_table' USING org.apache.hive.hcatalog.pig.HCatLoader();
If the table is in a specific database, use 'mydb.my_table' instead of just 'my_table'. In other words, even if the table lives in the Hive mydb database rather than the default one, you can still access it from Pig simply by referencing its fully qualified name (mydb.my_table). There is no need to write code for parsing input files or analyzing the schema; HCatalog takes care of it.
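For example, assuming a Hive database named mydb containing a table my_table (both names are illustrative), a short Pig script can load the table and inspect the schema that HCatalog fetches from the Hive metastore:

A = LOAD 'mydb.my_table' USING org.apache.hive.hcatalog.pig.HCatLoader();
DESCRIBE A; -- the schema comes from the Hive metastore, not from the script

Because the schema travels with the table, the same LOAD line keeps working even if the table's storage format changes underneath.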
HCatStorer
HCatStorer is the tool used to save data from Pig back into a Hive-managed table. It works like the reverse of HCatLoader: after you've processed the data in Pig, you can use HCatStorer to write that data directly into a Hive table.
STORE result INTO 'my_table' USING org.apache.hive.hcatalog.pig.HCatStorer();
This will store the data into the Hive table named my_table. If you are using a table in another database (not the default), use something like 'mydb.my_table'.
It ensures the processed data is immediately available for queries from Hive, Pig, or other tools that use Hive tables.
Handling Partitioned Tables
Sometimes, the Hive table you want to use has partitions—for example, it might store sales data by region or date. If the table is partitioned, HCatStorer gives you two ways to handle it. If the Pig output doesn’t include the partition column, you can specify a static partition while storing the data, like this:
STORE result INTO 'sales_data'
USING org.apache.hive.hcatalog.pig.HCatStorer('region=us');
In this case, the data will be saved into the region=us partition. But if the Pig output does include the partition column (like region), you don’t need to specify it in the HCatStorer. It will automatically figure out where each record goes based on the partition field values.
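As a sketch of the dynamic case (the table and column names here are assumptions for illustration), suppose the output relation still carries the region column and a partitioned table sales_summary already exists in Hive. Storing without a partition spec lets HCatStorer route each record by its region value:

sales = LOAD 'sales_data' USING org.apache.hive.hcatalog.pig.HCatLoader();
recent = FILTER sales BY amount > 100; -- region column stays in the relation
-- No partition spec: each record lands in the partition matching its region value
STORE recent INTO 'sales_summary' USING org.apache.hive.hcatalog.pig.HCatStorer();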
Full Example:
Below is a single-file Java program (HCatLoaderStorerExample.java) that demonstrates the two HCatalog Pig adapters, HCatLoader (input) and HCatStorer (output), using the Hive table student_details as input and student_output as output.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.Properties;

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

/**
 * Full demo using HCatLoader and HCatStorer in Pig.
 * - Input table:  student_details
 * - Output table: student_output (students with age < 20)
 */
public class HCatLoaderStorerExample {

    public static void main(String[] args) throws Exception {
        // Step 1: Create Hive tables and insert data
        prepareHiveTables();
        // Step 2: Run Pig job using HCatLoader and HCatStorer
        runPigJob();
        System.out.println("\n✔ Data processing completed.");
        System.out.println("→ Use 'SELECT * FROM student_output;' in Hive to see results.");
    }

    private static void prepareHiveTables() throws Exception {
        // Load Hive JDBC driver
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Connect to HiveServer2 (adjust host/port as needed)
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
        Statement stmt = conn.createStatement();

        // Create input table and load demo data
        stmt.execute("DROP TABLE IF EXISTS student_details");
        stmt.execute("CREATE TABLE student_details ("
                + "student_id INT, student_name STRING, age INT, grade STRING"
                + ") STORED AS RCFILE");
        stmt.execute("INSERT INTO TABLE student_details VALUES "
                + "(1, 'Ananya', 19, 'A'),"
                + "(2, 'Rahul', 21, 'B'),"
                + "(3, 'Sara', 18, 'A'),"
                + "(4, 'John', 20, 'C')");

        // Recreate the output table with the same schema
        stmt.execute("DROP TABLE IF EXISTS student_output");
        stmt.execute("CREATE TABLE student_output ("
                + "student_id INT, student_name STRING, age INT, grade STRING"
                + ") STORED AS RCFILE");

        stmt.close();
        conn.close();
        System.out.println("✔ Hive tables created and sample data loaded.");
    }

    private static void runPigJob() throws Exception {
        // Step 1: Set minimal Hadoop & Hive configuration
        Properties properties = new Properties();
        properties.setProperty("fs.defaultFS", "hdfs://localhost:9000");
        // "local" runs MR jobs with the local job runner; use "yarn" on a cluster
        properties.setProperty("mapreduce.framework.name", "local");
        properties.setProperty("hive.metastore.uris", "thrift://localhost:9083");

        // Step 2: Start embedded PigServer
        PigServer pigServer = new PigServer(ExecType.MAPREDUCE, properties);

        // Step 3: Register required HCatalog JARs (adjust paths based on your setup)
        pigServer.registerJar("/usr/lib/hcatalog/share/hcatalog/hcatalog-core.jar");
        pigServer.registerJar("/usr/lib/hcatalog/share/hcatalog/hcatalog-pig-adapter.jar");

        // Step 4: Load, filter, and store using the HCatalog Pig adapters
        pigServer.registerQuery("students = LOAD 'student_details' "
                + "USING org.apache.hive.hcatalog.pig.HCatLoader();");
        pigServer.registerQuery("young_students = FILTER students BY age < 20;");
        pigServer.registerQuery("STORE young_students INTO 'student_output' "
                + "USING org.apache.hive.hcatalog.pig.HCatStorer();");

        pigServer.shutdown();
        System.out.println("✔ Pig job executed successfully.");
    }
}
Compile -
javac -classpath ".:<all_required_jars>" HCatLoaderStorerExample.java
Run -
java -classpath ".:<all_required_jars>" HCatLoaderStorerExample
Output -
SELECT * FROM student_output;
You should see:
+-------------+---------------+-----------+-------------+
| student_id  | student_name  | age       | grade       |
+-------------+---------------+-----------+-------------+
| 1           | Ananya        | 19        | A           |
| 3           | Sara          | 18        | A           |
+-------------+---------------+-----------+-------------+