Apache Hive

Hive
Stable release: 0.13.1 / June 6, 2014
Development status: Active
Written in: Java
Operating system: Cross-platform
License: Apache License 2.0
Website: hive.apache.org
 
This article is about a data warehouse infrastructure. For the Java application framework, see Apache Beehive.

Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.[1] While initially developed by Facebook, Apache Hive is now used and developed by other companies such as Netflix.[2][3] Amazon maintains a software fork of Apache Hive that is included in Amazon Elastic MapReduce on Amazon Web Services.[4]

Features

Apache Hive supports analysis of large datasets stored in Hadoop's HDFS and in compatible file systems such as the Amazon S3 filesystem. It provides an SQL-like language called HiveQL while maintaining full support for MapReduce. To accelerate queries, it provides indexes, including bitmap indexes.[5]
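
The idea behind a bitmap index can be sketched in a few lines of Python. This is a conceptual illustration only, not Hive's implementation: the column data and the helper names `build_bitmap_index` and `matching_rows` are hypothetical. Each distinct column value gets a bitmap with one bit per row, so a conjunction of equality filters reduces to a bitwise AND.

```python
from collections import defaultdict


def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when values[i] == value."""
    index = defaultdict(int)
    for row_id, value in enumerate(values):
        index[value] |= 1 << row_id
    return dict(index)


def matching_rows(bitmap):
    """Return the row ids whose bits are set in the bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical column data, one entry per row.
country = ["US", "DE", "US", "FR", "DE", "US"]
device = ["web", "web", "app", "app", "web", "web"]

country_idx = build_bitmap_index(country)
device_idx = build_bitmap_index(device)

# A filter such as country = 'US' AND device = 'web' becomes a bitwise AND.
hits = matching_rows(country_idx["US"] & device_idx["web"])
print(hits)
```

Bitmaps of this kind tend to pay off for columns with few distinct values, where each bitmap is dense enough to be worth storing.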

By default, Hive stores its metadata in an embedded Apache Derby database; client/server databases such as MySQL can be used instead.[6]
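
As a rough illustration of switching the metastore to MySQL, the Python sketch below generates the relevant part of a `hive-site.xml` file. The `javax.jdo.option.*` property names are the JDO connection settings Hive is generally documented to read, but they should be checked against the documentation for the release in use. The host name, database name, user, and password are placeholders, and the helper `hive_site` is hypothetical.

```python
import xml.etree.ElementTree as ET


def hive_site(properties):
    """Render a dict of settings as Hadoop-style <configuration> XML."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Placeholder connection details; replace with values for a real deployment.
mysql_metastore = {
    "javax.jdo.option.ConnectionURL": "jdbc:mysql://metastore-host:3306/hive_metastore",
    "javax.jdo.option.ConnectionDriverName": "com.mysql.jdbc.Driver",
    "javax.jdo.option.ConnectionUserName": "hive",
    "javax.jdo.option.ConnectionPassword": "change-me",
}

xml_text = hive_site(mysql_metastore)
print(xml_text)
```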

Hive currently supports four file formats: TEXTFILE, SEQUENCEFILE, ORC, and RCFILE.[7][8]

HiveQL

While based on SQL, HiveQL does not strictly follow the full SQL-92 standard. HiveQL offers extensions not in SQL, including multi-table inserts and CREATE TABLE AS SELECT, but offers only basic support for indexes. HiveQL also lacks support for transactions and materialized views, and offers only limited subquery support.[9][10] There are plans for adding support for insert, update, and delete with full ACID functionality.[11]
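
The multi-table insert extension lets a single scan of a source table feed several target tables. The Python sketch below models that behaviour rather than Hive's actual execution: the function `multi_table_insert`, the table names, and the sample rows are all hypothetical, and the HiveQL in the comment shows only the general shape of such a statement.

```python
def multi_table_insert(source_rows, targets):
    """Scan source_rows once and route each row to every target whose predicate matches.

    targets maps a table name to a (predicate, projection) pair.
    """
    outputs = {name: [] for name in targets}
    for row in source_rows:  # single pass over the source
        for name, (predicate, projection) in targets.items():
            if predicate(row):
                outputs[name].append(projection(row))
    return outputs


# Rough shape of the HiveQL being modelled (hypothetical tables):
#   FROM page_views pv
#   INSERT OVERWRITE TABLE us_views
#     SELECT pv.user_id WHERE pv.country = 'US'
#   INSERT OVERWRITE TABLE long_views
#     SELECT pv.user_id, pv.seconds WHERE pv.seconds > 60;
page_views = [
    {"user_id": 1, "country": "US", "seconds": 90},
    {"user_id": 2, "country": "DE", "seconds": 30},
    {"user_id": 3, "country": "US", "seconds": 10},
    {"user_id": 4, "country": "FR", "seconds": 120},
]

result = multi_table_insert(
    page_views,
    {
        "us_views": (lambda r: r["country"] == "US", lambda r: r["user_id"]),
        "long_views": (lambda r: r["seconds"] > 60, lambda r: (r["user_id"], r["seconds"])),
    },
)
print(result)
```

Writing the same result as two separate INSERT ... SELECT statements would read the source table twice, which is the cost the extension avoids.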

Internally, a compiler translates HiveQL statements into a directed acyclic graph of MapReduce jobs, which are submitted to Hadoop for execution.[12]
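
Because the jobs form a directed acyclic graph, any valid execution order must run each job after the jobs whose output it consumes. The sketch below illustrates that ordering constraint with Python's standard `graphlib` module (Python 3.9 or later). The plan and its job names are hypothetical and are not output from the Hive compiler.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each job maps to the set of jobs whose output it reads.
plan = {
    "agg_orders": set(),
    "agg_returns": set(),
    "join": {"agg_orders", "agg_returns"},
    "sort": {"join"},
}


def execution_order(dag):
    """Return one order in which every job runs after all of its dependencies."""
    return list(TopologicalSorter(dag).static_order())


order = execution_order(plan)
print(order)
```

Jobs with no path between them, such as the two aggregations here, could in principle run concurrently; the topological order only fixes what must come before what.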

References

  1. Venner, Jason (2009). Pro Hadoop. Apress. ISBN 978-1-4302-1942-2.
  2. Use Case Study of Hive/Hadoop
  3. OSCON Data 2011, Adrian Cockcroft, "Data Flow at Netflix" on YouTube
  4. Amazon Elastic MapReduce Developer Guide
  5. Working with Students to Improve Indexing in Apache Hive
  6. Lam, Chuck (2010). Hadoop in Action. Manning Publications. ISBN 1-935182-19-6.
  7. Facebook's Petabyte Scale Data Warehouse using Hive and Hadoop
  8. Yongqiang He, Rubao Lee, Yin Huai, Zheng Shao, Namit Jain, Xiaodong Zhang and Zhiwei Xu. "RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems" (PDF).
  9. White, Tom (2010). Hadoop: The Definitive Guide. O'Reilly Media. ISBN 978-1-4493-8973-4.
  10. Hive Language Manual
  11. Implement insert, update, and delete in Hive with full ACID support
  12. Hive A Warehousing Solution Over a MapReduce Framework
