CBR talks you through the popular data warehouse software.
1. What is Hive?
Apache Hive is data warehouse software built on top of Apache Hadoop that provides query, summarisation and analysis for big data.
Hive has a SQL-like language of its own, Hive Query Language (also known as HiveQL or HQL), whose queries are translated into MapReduce jobs that read the data stored on Hadoop.
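To make the link between a SQL-like query and MapReduce more concrete, here is a minimal Python sketch of how a simple aggregation could conceptually break down into map, shuffle and reduce steps. It is an illustration only and not how Hive's query planner is actually implemented; the table name web_logs, the column page and the helper functions are all hypothetical.

```python
from collections import defaultdict

# Hypothetical HiveQL query that this sketch imitates:
#   SELECT page, COUNT(*) FROM web_logs GROUP BY page;
# (web_logs and page are made-up names used for illustration.)

def map_phase(rows):
    """Emit a (key, 1) pair for each row, keyed by the GROUP BY column."""
    for row in rows:
        yield row["page"], 1

def shuffle(pairs):
    """Group the emitted values by key, as happens between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the values for each key, which gives COUNT(*) per page."""
    return {key: sum(values) for key, values in grouped.items()}

def run_query(rows):
    """Chain the three steps together for a list of row dictionaries."""
    return reduce_phase(shuffle(map_phase(rows)))

if __name__ == "__main__":
    sample = [{"page": "/home"}, {"page": "/about"}, {"page": "/home"}]
    print(run_query(sample))
```

In a real cluster the map and reduce steps would run in parallel across many machines; the sketch runs them in a single process purely to show the shape of the computation.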
Operating as an open source volunteer project since 2008, it has a team of developers, or Hive committers, who contribute to the code and run tests to improve the software.
The software used to be known as Hadoop Hive as it was a subproject of Hadoop, but the committers involved in the volunteer project have been so forthcoming that it has graduated to become a top-level project of its own.
2. Why should you use Hive?
Hortonworks says that a typical use case for Hive is when you need to take large amounts of polystructured data and place it into a structure and view that is easier for business analysts to use.
As well as enabling ad-hoc queries, summarisation and data analysis, HQL can also be extended with custom scalar functions (user-defined functions) which turn multiple rows in databases into
You should not use Hive for real-time queries or row-level updates, as it is not fast enough for them. It is better used for batch jobs over large sets of immutable data, such as web logs.
3. What are the benefits of using Hive?
According to the Apache Hive wiki, HQL's similarity to SQL lets users who are familiar with SQL query the data. At the same time, the language allows programmers who are familiar with the MapReduce framework to plug in their custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language.
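As a rough sketch of what plugging in a custom mapper can look like, Hive's TRANSFORM clause can stream rows through an external script as tab-separated text. The Python script below is an example under stated assumptions: the file name extract_domain.py, the table web_logs and its columns are hypothetical, and the query shown in the comment has not been tested against any particular Hive version.

```python
import io
import sys
from urllib.parse import urlparse

# Hypothetical HiveQL that would call this script (all names are made up):
#   ADD FILE extract_domain.py;
#   SELECT TRANSFORM (user_id, url)
#   USING 'python extract_domain.py'
#   AS (user_id, domain)
#   FROM web_logs;

def map_line(line):
    """Turn one 'user_id<TAB>url' line into 'user_id<TAB>domain'."""
    user_id, url = line.rstrip("\n").split("\t", 1)
    # netloc is empty when the value is not a full URL, so fall back.
    domain = urlparse(url).netloc or "unknown"
    return f"{user_id}\t{domain}"

def run_mapper(stream, out):
    """Apply map_line to every non-blank input line and write the result."""
    for line in stream:
        if line.strip():
            out.write(map_line(line) + "\n")

if __name__ == "__main__":
    # Under Hive this would be run_mapper(sys.stdin, sys.stdout); a small
    # in-memory sample is used here so the sketch runs on its own.
    demo = io.StringIO("42\thttp://example.com/a\n")
    run_mapper(demo, sys.stdout)
```

Keeping the per-line logic in its own function means it can be checked on a laptop with a few sample lines before the script is shipped to a cluster.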
The Hive system is also easily scalable, making it well suited to managing extensible big data sets. Hadoop developer Hortonworks says this means that more commodity machines can be added to the cluster without a corresponding reduction in performance.
It also integrates widely, as familiar JDBC and ODBC drivers allow many applications to pull Hive data easily for reporting, meaning it can be used across a variety of apps.