Hive Partitioning

Summary -

In this topic, we described about the below sections -

Partitioning
- Adding a partition
  - Adding a partition while table creation
  - Adding a partition after table creation
- Dropping a partition

Hive organizes the tables into partitions. Partition is a way of dividing a table into coarse-grained parts based on the value of partition column. Partitions are very useful to get the data faster using queries.

Hive has the capability to partition the data to increase the performance.

Partitioned column values divided a table into the segments.
Entire partitions can be ignored at query time.

User should properly create the partitions.

While inserting data user must specify a partition.

At query time, Hive will automatically filter out partitions whenever appropriate.

A table may or may not be partitioned in multiple dimensions.

Tables/ partitions may be further divided into buckets. Buckets are an extra structure to the data that may be used for more efficient queries. Partitions are defined at table creation time using PARTITION BY clause which takes a list of column definitions.

Let us take an example to discuss in detail about the partitions concept. Assume we have a requirement to create student table with the columns student number, name, class for a school.

Adding a partition:

A partition can be added while table creation or can be add to the table already created by using alter. Now let us see how the partition can be defined while creation.

Adding a partition while table creation -

Syntax -

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
  [(col_name data_type [COMMENT 'col_comment'], ...)]
  [COMMENT 'table_comment']
  [PARTITIONED BY (col_name data_type [COMMENT 'col_comment'], ...)]
  [WITH SERDEPROPERTIES ('key1'='value1', 'key2'='value2', ...)]
  [
   [ROW FORMAT row_format] [STORED AS file_format]
  ]
  [LOCATION 'hdfs_path']
  [TBLPROPERTIES ('key1'='value1', 'key2'='value2', ...)]
  [CACHED IN 'pool_name']

Example -

Assume we have a requirement to create student information with the columns student number, name, class, grade for a school.

Std No	Std Name	Std Class	Std Grade
1	Pawan	10	A
2	Srinivas	10	A
3	Sridhar	10	A
4	Kumar	10	B

Hive> CREATE TABLE std_details_i 
(std_no INT, std_name STRING, std_class INT,
 std_grade STRING)
COMMENT ‘Student Information’
PARTITIONED BY (std_class INT);
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ‘\t’
LINES TERMINATED BY ‘\n’
STORED AS TEXTFILE;

Once the above statement successfully executed, the std_db.std_details table created successfully with partition.

Adding a partition after table creation -

Std_details already created without declared any partition column. The partition column can be added by using ALTER TABLE like below.

Syntax:

ALTER TABLE name [PARTITION (partition_spec)]
  SET { FILEFORMAT PARQUET | TEXTFILE | RCFILE | SEQUENCEFILE | AVRO
  | LOCATION 'hdfs_path_of_directory'
  | TBLPROPERTIES (table_properties)
  | SERDEPROPERTIES (serde_properties) }

Example -

Assume we have a requirement to create student information with the columns student number, name, class, grade for a school.

Std No	Std Name	Std Class	Std Grade
1	Pawan	10	A
2	Srinivas	10	A
3	Sridhar	10	A
4	Kumar	10	B

Below command to add the partition to the table already created earlier.

Hive> ALTER TABLE std_details 
 ADD PARTITION (std_class=’1’);

Once the above statement successfully executed, the partition added to std_db.std_details table. Similarly, multiple partitions for each class can be set by using ADD PARTITION.

Dropping a partition:

The following syntax used to drop the partition which is already existed on the table.

Syntax -

ALTER TABLE table_name DROP [IF EXISTS]
 PARTITION partition_spec, 
 PARTITION partition_spec;

Example -

Assume we have a requirement to create student information with the columns student number, name, class, grade for a school.

Std No	Std Name	Std Class	Std Grade
1	Pawan	10	A
2	Srinivas	10	A
3	Sridhar	10	A
4	Kumar	10	B

Below command to drop the partition already created.

Hive> ALTER TABLE std_details DROP [IF EXISTS]
    PARTITION (std_class=’1’);

Once the above statement successfully executed, the partition deleted on std_db.std_details table.