Summary -

This topic describes the HCatalog input and output format interfaces -

HCatInputFormat is used to read data from HDFS. HCatOutputFormat is used to write the resulting data back to HDFS after a MapReduce job has processed it.

Let us discuss the input and output format interfaces in detail.

HCatInputFormat -

HCatInputFormat is used with MapReduce jobs to read data from HCatalog-managed tables. It exposes a Hadoop 0.20 MapReduce API for reading data.

The HCatInputFormat API includes the following methods -

  • setInput
  • setOutputSchema
  • getTableSchema

To use HCatInputFormat to read data, first instantiate an InputJobInfo with the necessary information from the table being read and then call setInput with the InputJobInfo.

Use the setOutputSchema method to include a projection schema that specifies the output fields. If no schema is specified, all the columns in the table are returned. Use the getTableSchema method to determine the table schema for a specified input table.
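The effect of a projection schema can be illustrated with a small stand-alone model. This is only a sketch in plain Java and does not use the HCatalog classes. The class ProjectionSketch, the method project, and the sample column names are hypothetical and exist only for illustration. Passing null as the projection models the case where no schema is specified.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-alone model of column projection; NOT the HCatalog API.
final class ProjectionSketch {

    // Returns the values of `row` for the projected columns, in projection order.
    // A null projection models "no schema specified": every column is returned.
    static List<Object> project(List<String> tableColumns,
                                List<Object> row,
                                List<String> projection) {
        if (projection == null) {
            return new ArrayList<>(row);
        }
        List<Object> out = new ArrayList<>();
        for (String column : projection) {
            int index = tableColumns.indexOf(column);
            if (index < 0) {
                throw new IllegalArgumentException("Unknown column: " + column);
            }
            out.add(row.get(index));
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical table with three columns and one row.
        List<String> columns = Arrays.asList("id", "name", "city");
        List<Object> row = Arrays.<Object>asList(1, "Asha", "Pune");

        // Projection onto two columns.
        System.out.println(project(columns, row, Arrays.asList("id", "city")));
        // No projection: all columns are returned.
        System.out.println(project(columns, row, null));
    }
}
```

In a real job the projection is passed to setOutputSchema as a schema object rather than a list of names. The model only shows which fields reach the mapper.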

HCatOutputFormat -

HCatOutputFormat is used with MapReduce jobs to write data to HCatalog-managed tables. It exposes a Hadoop 0.20 MapReduce API for writing data to a table. When a MapReduce job uses HCatOutputFormat to write output, the default OutputFormat configured for the table is used, and the new partition is published to the table after the job completes.

The HCatOutputFormat API includes the following methods -

  • setOutput
  • setSchema
  • getTableSchema

The first call on the HCatOutputFormat must be setOutput.

Any other call made before setOutput throws an exception stating that the output format is not initialized. The schema for the data being written out is specified by the setSchema method. You must call this method and provide the schema of the data you are writing.
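The call-order rule above can be sketched with a small stand-alone model. This is plain Java and does not use the HCatalog classes. The class OutputFormatOrderSketch, the table name web_logs, the use of IllegalStateException, and the exact message text are assumptions made for illustration only.

```java
import java.util.Arrays;
import java.util.List;

// Stand-alone model of the "setOutput first" rule; NOT the HCatalog API.
final class OutputFormatOrderSketch {
    private String table;        // set by setOutput
    private List<String> schema; // set by setSchema

    // Must be the first call: records which table the job writes to.
    void setOutput(String tableName) {
        this.table = tableName;
    }

    // Declares the schema of the data being written.
    void setSchema(List<String> columns) {
        requireInitialized();
        this.schema = columns;
    }

    List<String> getSchema() {
        requireInitialized();
        return schema;
    }

    // Rejects any call made before setOutput.
    private void requireInitialized() {
        if (table == null) {
            throw new IllegalStateException(
                "Output format is not initialized; call setOutput first");
        }
    }

    public static void main(String[] args) {
        OutputFormatOrderSketch format = new OutputFormatOrderSketch();
        try {
            format.setSchema(Arrays.asList("id", "url")); // wrong order
        } catch (IllegalStateException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
        format.setOutput("web_logs");                     // correct first call
        format.setSchema(Arrays.asList("id", "url"));
        System.out.println("Schema accepted: " + format.getSchema());
    }
}
```

The model keeps the check in one private method so that every call other than setOutput fails in the same way when made too early.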