Apache HBase

HBase is an open-source, non-relational, distributed database modeled after Google's Bigtable and written in Java. It began as part of the Apache Software Foundation's Apache Hadoop project, is now a top-level Apache project, and runs on top of HDFS (Hadoop Distributed File System), providing Bigtable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data: small amounts of information scattered within a large collection of empty or unimportant data, such as the 50 largest items in a group of 2 billion records, or the non-zero items that make up less than 0.1% of a huge collection.

Use Apache HBase™ when you need random, real-time read/write access to your big data. The project's goal is to host very large tables (billions of rows by millions of columns) on clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google's Bigtable, as described in "Bigtable: A Distributed Storage System for Structured Data" by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.
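To make "sparse" and "versioned" more concrete, the following is a minimal in-memory sketch of a Bigtable-style data model: a sorted map from row key to column to timestamp to value. It is an illustration only, not HBase code. The class name SparseTableSketch and its methods are hypothetical, and it ignores distribution, persistence, and column families.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Illustrative model only (hypothetical names, not the HBase API).
class SparseTableSketch {
    // row key -> column name -> timestamp (newest first) -> value.
    // Only cells that were actually written occupy space, which is what
    // makes the table "sparse".
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
            new TreeMap<>();

    // Stores one version of a cell.
    void put(String row, String column, long timestamp, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(column,
                c -> new TreeMap<Long, String>(Comparator.<Long>reverseOrder()))
            .put(timestamp, value);
    }

    // Returns the newest version of a cell, or null if it was never written.
    String get(String row, String column) {
        NavigableMap<String, NavigableMap<Long, String>> columns = rows.get(row);
        if (columns == null) {
            return null;
        }
        NavigableMap<Long, String> versions = columns.get(column);
        if (versions == null || versions.isEmpty()) {
            return null;
        }
        return versions.firstEntry().getValue();
    }

    // Returns row keys in [startRow, stopRow) in sorted order, like a range scan.
    List<String> scan(String startRow, String stopRow) {
        return new ArrayList<>(rows.subMap(startRow, true, stopRow, false).keySet());
    }

    public static void main(String[] args) {
        SparseTableSketch table = new SparseTableSketch();
        table.put("user#1", "info:name", 1L, "Ada");
        table.put("user#1", "info:name", 2L, "Ada L.");
        table.put("user#7", "info:email", 1L, "g@example.org");
        System.out.println(table.get("user#1", "info:name"));
        System.out.println(table.scan("user#1", "user#5"));
    }
}
```

Keeping row keys sorted is the design choice that makes range scans cheap in this sketch; a real store additionally splits the sorted key space across servers.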

Features of Apache HBase

• Linear and modular scalability.

• Strictly consistent reads and writes.

• Automatic and configurable sharding of tables.

• Automatic failover support between RegionServers.

• Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables.

• Easy-to-use Java API for client access.

• Block cache and Bloom Filters for real-time queries.

• Query predicate pushdown via server-side Filters.

• Thrift gateway and a RESTful Web service that supports XML, Protobuf, and binary data encoding options.

• Extensible JRuby-based (JIRB) shell.

• Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia, or via JMX.
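The Bloom filter feature listed above rests on a simple idea: a compact bit array can answer "definitely not present" or "possibly present" for a row key, so a read can skip storage files that cannot contain the key. The sketch below shows that general technique using only the Java standard library. It is not HBase's implementation; the class name BloomFilterSketch, the FNV-1a hashing, and the sizing parameters are assumptions chosen for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Generic Bloom filter sketch (hypothetical names, not HBase internals).
class BloomFilterSketch {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    BloomFilterSketch(int numBits, int numHashes) {
        if (numBits <= 0 || numHashes <= 0) {
            throw new IllegalArgumentException("numBits and numHashes must be positive");
        }
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // Records a key by setting one bit per hash function.
    void add(String key) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(key, i));
        }
    }

    // false means the key was never added; true means it may have been added.
    boolean mightContain(String key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }

    // Derives the i-th bit position from two base hashes (double hashing).
    private int index(String key, int i) {
        byte[] data = key.getBytes(StandardCharsets.UTF_8);
        int h1 = fnv1a(data, 0x811C9DC5);
        int h2 = fnv1a(data, 0x9E3779B9);
        return Math.floorMod(h1 + i * h2, numBits);
    }

    // FNV-1a style hash with a configurable starting value.
    private static int fnv1a(byte[] data, int seed) {
        int hash = seed;
        for (byte b : data) {
            hash ^= (b & 0xff);
            hash *= 16777619;
        }
        return hash;
    }

    public static void main(String[] args) {
        BloomFilterSketch filter = new BloomFilterSketch(1024, 3);
        filter.add("row-0001");
        System.out.println(filter.mightContain("row-0001")); // always true for an added key
        System.out.println(filter.mightContain("row-9999")); // usually false, occasionally a false positive
    }
}
```

A filter like this never reports an added key as absent, so skipping a file on a false answer is safe; the cost of a false positive is only an unnecessary file read.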


Contents related to 'Apache HBase'

Apache Hadoop: Apache Hadoop is an open-source software framework for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.