Summary -

This topic covered the following sections:

Why?

People who don’t know Java and want to analyze data in HDFS have difficulty writing the mappers and reducers that the analysis requires. Hive is an ecosystem project built on top of MapReduce that lets those users work with the data without writing Java.

History?

Hive was originally developed by Jeff's team at Facebook. It grew out of the need to manage and analyze the data that Facebook users produce every day. After many trials, the team selected Hadoop for storing and processing this large volume of data.

Hive was created to analyze Facebook's large data sets through queries. Today, Hive is a successful Apache project in the Hadoop ecosystem, used by many organizations to analyze and summarize large volumes of data.

What is Hive?

Hive is a data warehousing framework built on top of Hadoop. It is designed to support summarization, querying, and analysis of large volumes of data.

Hive provides a simple query language, HiveQL, for ad hoc querying of large volumes of data. HiveQL is SQL-based, so users who know SQL can quickly become familiar with writing queries that summarize data. HiveQL also lets MapReduce programmers plug in their own custom mappers and reducers when the analysis needs logic that is hard to express in a query.
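
As a rough illustration of a custom mapper, the sketch below assumes Hive's TRANSFORM clause is used, which streams each row to an external script as a tab-separated line on standard input and reads tab-separated lines back. The column names (user_id, url), the function names, and the host-extraction logic are hypothetical and chosen only for the example.

```python
import sys

def map_line(line):
    """Turn one tab-separated input row into one tab-separated output row."""
    user_id, url = line.rstrip("\n").split("\t")
    # Hypothetical custom logic: keep only the host part of the URL.
    host = url.split("/")[2] if "://" in url else url
    return f"{user_id}\t{host}"

def transform(lines):
    """Apply map_line to every non-empty input line, one output line per row."""
    for line in lines:
        if line.strip():
            yield map_line(line) + "\n"

# When saved as a script for Hive to call, the streaming loop would be:
#     sys.stdout.writelines(transform(sys.stdin))
```

A query along the lines of SELECT TRANSFORM(user_id, url) USING 'python host_mapper.py' AS (user_id, host) FROM page_views would then send each row through the script. The script name and table name here are assumptions, not part of the original text.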

How does Hive work?

Hive consists of the Hive query language and the Hive interpreter. To analyze data, write a query in HiveQL and submit it to the Hive interpreter, which runs on the client machine. Hive converts the query into a MapReduce job and submits that job to the JobTracker.
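
To make the query-to-job conversion more concrete, here is a conceptual sketch in plain Python of how a grouping query such as SELECT page, COUNT(*) FROM page_views GROUP BY page could be split into a map phase and a reduce phase. It is not Hive's actual planner or generated code; the table, its rows, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows of a table page_views(user_id, page).
rows = [("u1", "home"), ("u2", "home"), ("u1", "cart"), ("u3", "home")]

def map_phase(rows):
    """Emit (key, 1) for each row; the GROUP BY column becomes the key."""
    for _user_id, page in rows:
        yield page, 1

def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Aggregate each group's values; COUNT(*) is a sum of the emitted ones."""
    return {key: sum(values) for key, values in groups.items()}

result = reduce_phase(shuffle(map_phase(rows)))
print(result)  # {'home': 3, 'cart': 1}
```

On a real cluster the map and reduce functions run in parallel on many machines, and the framework performs the shuffle step between them.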

Hive Structure?

Each table created in Hive corresponds to a directory in HDFS. Hive adds a schema definition on top of the data stored in HDFS. Hive keeps these table definitions in a relational database, either Derby (single user) or MySQL (shared), depending on the setup.
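
The idea of a table being a directory of files plus a separately stored schema can be sketched in plain Python as follows. This is only an illustration under stated assumptions: a local temporary directory stands in for HDFS, a dictionary stands in for the Derby or MySQL database, the files are assumed to be tab-delimited, and the employees table and its columns are hypothetical.

```python
import os
import tempfile

# Stand-in for the metadata database: table name -> (column name, type) pairs.
metastore = {"employees": [("id", int), ("name", str), ("salary", float)]}

def read_table(warehouse_dir, table):
    """Apply the stored schema to every file in the table's directory."""
    schema = metastore[table]
    table_dir = os.path.join(warehouse_dir, table)
    for filename in sorted(os.listdir(table_dir)):
        with open(os.path.join(table_dir, filename)) as f:
            for line in f:
                fields = line.rstrip("\n").split("\t")
                # The raw text only becomes typed columns at read time.
                yield {name: cast(value)
                       for (name, cast), value in zip(schema, fields)}

# Build a throwaway "warehouse" with one table directory and one data file.
warehouse = tempfile.mkdtemp()
os.makedirs(os.path.join(warehouse, "employees"))
with open(os.path.join(warehouse, "employees", "part-00000"), "w") as f:
    f.write("1\tAsha\t5200.5\n2\tRavi\t4800\n")

records = list(read_table(warehouse, "employees"))
print(records)
```

The data files themselves carry no column names or types; dropping another file into the directory would make its rows part of the table the next time it is read.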

Hive Advantages?

  • Hive allows users to query data without knowing Java or MapReduce.
  • Designed to be easy to use.
  • Scalability.
  • Ability to bring structure to various data formats.

Hive Disadvantages?

  • Hive doesn’t provide real-time queries.
  • Querying even a small amount of data may take minutes, because each query runs as a MapReduce job.

Hive Limitations?

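  • Single-row INSERT is not supported in early Hive releases (INSERT ... VALUES was added in Hive 0.14).
  • UPDATE and DELETE are not supported in early Hive releases (added in Hive 0.14 for transactional tables).
  • Limited number of built-in functions.
  • No DATE or TIMESTAMP data types in early Hive releases (TIMESTAMP was added in Hive 0.8 and DATE in Hive 0.12).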