Friday, April 2, 2010

External MySQL Storage Engines for Data Warehousing
















There are three storage engines built and maintained by third-party storage providers that can be used to
meet a few of the most common data warehousing use cases. These are InnoDB, NitroEDB, and
BrightHouse

InnoDB
First is InnoDB, which is a transactional storage engine maintained by Oracle. InnoDB obviously supports
OLTP applications, but it can also serve analytic purposes too and therefore may be applicable to the
OLTP/analytic hybrid use case identified by Gartner.

NitroEDB
Nitrosecurity is a provider of security information management solutions, which are undergirded by a very
powerful database management system. The Nitrosecurity database has been designed with a number of
very unique characteristics so it can support the enormous amount of data volumes and concurrent queries
that come as part of a security management software package.
Nitrosecurity (www.nitrosecurity.com) and MySQL have partnered to produce the NitroEDB storage
engine, which has been created to handle the needs of real-time data warehousing customers. The
NitroEDB engine has the ability to receive huge volumes of inserts while supporting concurrent queries
against the same data at the same time, all with no performance penalty whatsoever. In addition, NitroEDB
contains specialized indexes (N-tree, Microcluster) that are designed to deliver extremely fast response
times for aggregate styled queries (SUM, AVG, etc.) due to the fact that the aggregate information is
actually stored within the index itself.

Brighthouse
The last third party storage engine applicable for MySQL data warehousing is Brighthouse that is
produced by a partnership between MySQL and Infobright (www.infobright.com). At it core, BrightHouse is
a highly compressed column-oriented datastore that incorporates MySQL technology, although the engine
uses its own load and unload utilities rather than MySQL’s mainly because (1) compression and
decompression are done on load and unload and (2) the BrightHouse Knowledge Grid is created on load.
In addition, BrightHouse also uses its own optimizer instead of the MySQL optimizer because the
BrightHouse optimizer knows how to use the information that is stored in its Knowledge Grid to optimize the
execution of queries against the BrightHouse engine.


The Brighthouse engine consists of 4 key layers:


1. BrightHouse is a column-oriented
data store. This means that
instead of the data being stored
row by row, it is stored column by
column. There are many
advantages to column-orientation,
not the least of which is the ability
to do more efficient data
compression because since each
column stored a single data type
(as opposed to rows that typically
contain several data types),
compression can be optimized for
each particular data type. The
data itself within the columns is
stored by 65K item groupings. We
refer to each of these groupings
as Data Packs. The use of Data
Packs improves data compression
and is also critical to how
Infobright resolves complex
queries.


2. Data Pack Nodes (DPNs) contain a set of statistics stored related to the data that is stored and
compressed in each of the Data Packs. There is always a 1 to 1 relationship between Data Packs and DPNs.

3. Knowledge Nodes are a further set of metadata related to Data Packs, columns or table combinations.
The set of these Knowledge Nodes taken together is called the Knowledge Grid.

4. The BrightHouse Optimizer uses the Knowledge Grid to determine the minimum set of Data Packs
which need to be decompressed in order to satisfy a given query. In some cases, the information
contained in the Knowledge Grid is sufficient to resolve the query, in which case nothing is
decompressed.
The Brighthouse storage engine’s design is perfect for the massive amounts of data and queries that are
the standard fare of traditional or historical data warehousing.


No comments: