Tuesday, May 17, 2011

4 MySQL’s Data Warehousing Strategy

4 MySQL’s Data Warehousing Strategy

MySQL’s data warehousing strategy can be simply stated as delivering on these three goals:
• Support the most common data warehousing use cases
• Partner with major business intelligence and data movement/integration vendors
• Offer a highly attractive total cost of ownership for data warehousing installations
Each of these objectives w
ill now be explored in more detail in the sections that follow.


4.1 The Technical Blueprint behind MySQL and Data Warehousing

The technical attributes of MySQL’s data warehousing strategy include four things. First, a commitment to an open source methodology that helps maximize innovation. Second, a solid RDBMS feature set that lends itself to data warehousing. Third, a unique architecture that underpins the MySQL database server and allows it to support each major data warehousing use case through one common interface layer. And fourth, a strategy for scaling data warehouses is supplied that eliminates the mixed workload challenge identified by Gartner.

4.1.1 Open Source equals Maximum Innovation

The superiority of the open source model vs. traditional software development approaches has ceased being a debate, with industry analysts and others recognizing the advantages that come with adopting an open source framework. Having a committed and loyal user base that uses, supports, and extends a piece of software allows much greater innovation to occur
within a product, plus it helps the software evolve and improve much more quickly.
In the case of MySQL, the best of both worlds is found. MySQL is a formal company, owns the rights to the MySQL database server, and internally develops the product. But in addition, a huge community of MySQL users are constantly testing, submitting bug reports, and tendering enhancements (which undergo a rigorous review and approval process) for the server.
Finally, MySQL has a large and growing ISV partner network that also develops software and
improvements for the MySQL server (something that will be demonstrated in greater detail later in this paper)

With three strong avenues of development converging into one product, the end result is greater innovation, faster release cycles, and a product that is able to more quickly meet the exact needs of modern data warehousing applications.

4.1.2 MySQL’s Core Feature Set for Data Warehousing

MySQL contains a solid core feature set that is suitable for all data warehousing use cases. The following are just some of the features in the MySQL database server that help enable data warehousing:

• Data/Index partitioning – available in MySQL 5.1 and higher; supports range, hash, key, list, and composite partitioning. Partition “pruning” is available, which involves MySQL only examining the partitions it needs to satisfy a particular query instead of an entire table or index. Partition management is also supported (ADD PARTITIION, DROP PARTITION, etc.)
• No practical storage limits – for example, 1 tablespace=110TB limit
• Automatic storage management – autogrowth data files, etc.
• ANSI-SQL support for all datatypes – including BLOB and XML
• Built-in Replication – simple and easy to configure
• Main memory tables – keeps all data in-resident in RAM; perfect for dimension tables
• Support for a variety of indexes – B-tree, fulltext, clustered, hash, GIS
• Multiple-configurable data/index caches
• Pre-loading of index data into index caches
• Unique query cache – caches result set + query, not just data and therefore provides near
instantaneous response times for repetitive queries like those used in data warehousing
• Parallel data load – loads multiple files at the same time
• Multi-insert DML – allows array-style processing via normal INSERT commands
• Data compression – provides enormous storage savings
• Read-only tables – protects sensitive data
• Encryption – further protection for sensitive data
• Cost-based optimizer – eliminates need for rule-based query writing
• Wide platform support – no need for special hardware or operating systems

Support for other data warehousing features such as materialized views and other like objects will be provided in upcoming versions of the server.

4.1.3 A Visionary Architecture for Data Warehousing

A number of IT professionals choose MySQL as their data warehousing database
because it offers a new and flexible paradigm of database management. One key technical differentiator between MySQL and other database platforms – whether they are proprietary or open source – is the pluggable storage engine architecture of MySQL.
The MySQL pluggable storage engine architecture allows a database professional to select a specialized storage engine for a particular application need while being completely shielded from the need to manage any specific application coding requirements. The pluggable storage engine architecture provides a standard set of management and support services that are common among all underlying storage engines.

The storage engines themselves are the components of the database server that actually perform actions on the underlying data that is maintained at the physical server level.
















This efficient and modular architecture provides large performance and manageability benefits for those wishing to specifically target a particular application need or data warehousing use case. The technical advantage experienced by the application provider is clearly evident in this respect as unnecessary overhead is avoided when only certain engines are chosen for application use.

For example, the MySQL server has built-in storage engines that are designed for:
• ACID transactional applications
• Non–transactional applications (that can therefore insert and read data faster)
• Main memory operations for very fast read times
• Clustered / High-availability database environments
• Compressing historical data down to a very small footprint
• Referencing Non-DBMS flat files as part of the MySQL database
• And several other uses

In addition, a single database or application can use different storage engines at the same time for maximum effect, with a single command being all that’s needed to change from one engine to another.

Finally, an innovative business can develop its own customized storage engine designed to exactly meet the needs of their particular application.
The storage engine power of MySQL results in the benefit of having many different databases in the same box to choose from, with true ‘write once’ ability in terms of building application code. No other database management system offers a visionary architecture that affords such power and flexibility in designing a data warehousing application like MySQL.

4.1.3.1 Do Storage Engines Make a Difference?

Some may wonder if the different storage engines really make a difference in terms of performance,storage use, etc. As just a simple example, the Archive storage engine of MySQL is designed to efficiently handle large volumes of inserts and compress data down to a small footprint. Below is an example of the difference in INSERT performance that can be had by simply using the Archive storage engine over two other popular MySQL storage engines:


The Archive storage engine has 50% more INSERT throughput over MyISAM (MySQL’s default storage engine) and 255% more than InnoDB (A third party transactional storage engine owned by Oracle).
In addition, an identically structured 11 million row table takes up 1GB of space if InnoDB is used, 795MB if MyISAM is used, but only 148MB if the Archive storage engine is used. As such a simple example shows, storage engines in MySQL can make a difference in data warehousing applications.

Next let’s quickly profile internal MySQL storage engines (those developed internally by MySQL AB) as well as third-party storage engines that can be used for data warehousing.


4.1.4 Internal MySQL Storage Engines for Data Warehousing
Although any MySQL engine can be used for data warehousing, the ones that specifically lend themselves to the data warehousing arena include:
• MyISAM
• Archive
• Memory
• CSV
• Merge
• Federated

4.1.4.1 MyISAM

The MyISAM storage engine is the default engine for MySQL. It offers high-speed query/insert capability, is non-transactional, has table level locking (although it does have a concurrent insert feature that allows inserts to be performed without blocking queries), and has good support for indexes (B-tree, full-text, etc.)
MyISAM is a good general engine for data marts and traditional data warehouses.

4.1.4.2 Archive
The Archive storage engine has already been covered in brief detail. It compresses data by up to 80% and therefore offers good storage savings. In addition, it supplies fast table scans for large tables (>1GB), and offers MVCC and row-level locking. Archive is also unique in that it only allows data to be inserted and read, but never selectively modified (i.e. no UPDATE or DELETE), which makes it good for auditing data or other sensitive information that should not be manipulated in any way.

4.1.4.3 Memory
The Memory engine is what one would expect – an engine that keeps all data in memory at all times. Main memory tables are especially useful in data warehouses for dimension tables that are the object of joins and single table scans, with MySQL memory tables offering very fast response times for both full table scans and index lookups (B-tree and hash indexes are supported).

4.1.4.4 CSV
CSV tables allow flat file data that exists in comma delimited form to be accessed via SQL from within the MySQL server. Data in CSV format can be instantly loaded into MySQL by simply creating a table object that mirrors the format of the CSV flat file, renaming the CSV flat file on the operating system to the name of the MySQL CSV table object, and all the data is mmediately made available to the MySQL server. It doesn’t matter if the file as one billion or 100 billion records, the data is instantaneously available for use. In addition, data in CSV tables can be manipulated via DML inside MySQL or edited outside of the server with file editors (and the proper file privileges).

4.1.4.5 Merge
The Merge engine is how basic data partitioning was accomplished in versions of MySQL below
5.1. A DBA can combine 1-n identical MyISAM tables together to form one table object that can inserted, updated, deleted, and queried. Each underlying MyISAM table can be placed on a separate physical disk drive than the others and can have its own indexes. In MySQL 5.1 and higher, the Merge engine is still available, however DBA’s can also utilize normal data partitioning (horizontal partitioning of rows with range, hash,key, list, and composite being supported).

4.1.4.6 Federated

Finally, the Federated engine allows data warehouse designers to create one logical database out of many different physical database servers. Working in much the same was as Oracle database links or SQL Server Linked Servers do, the federated engine provides the ability to create distributed links to other physical database servers and reference their objects as if they existed on the initial source server.
Other internal storage engines are developed and maintained b
y MySQL (Falcon, Blackhole, etc.), but the above represent the engines that lend themselves best to a data warehousing installation.

4.1.5 External MySQL Storage Engines for Data Warehousing

There are three storage engines built and maintained by third-party storage providers that can be used to meet a few of the most common data warehousing use cases. These are InnoDB, NitroEDB, and BrightHouse.

4.1.5.1 InnoDB

First is InnoDB, which is a transactional storage engine maintained by Oracle. InnoDB obviously supports OLTP applications, but it can also serve analytic purposes too and therefore may be applicable to the OLTP/analytic hybrid use case identified by Gartner.

4.1.5.2 NitroEDB

Nitrosecurity is a provider of security information management solutions, which are undergirded by a very powerful database management system. The Nitrosecurity database has been designed with a number of very unique characteristics so it can support the enormous amount of data volumes and concurrent queries that come as part of a security management software package.

Nitrosecurity (www.nitrosecurity.com) and MySQL have partnered to produce the NitroEDB storage engine, which has been created to handle the needs of real-time data warehousing customers. The NitroEDB engine has the ability to receive huge volumes of inserts while supporting concurrent queries against the same data at the same time, all with no performance penalty whatsoever. In addition, NitroEDB contains specialized indexes (N-tree, Microcluster) that are designed to deliver extremely fast response times for aggregate styled queries (SUM, AVG, etc.) due to the fact that the aggregate information is actually stored within the index itself.

4.1.5.3 Brighthouse
The last third party storage engine applicable for MySQL data warehousing is Brighthouse that is produced by a partnership between MySQL and Infobright (www.infobright.com). At it core, BrightHouse is a highly compressed column-oriented datastore that incorporates MySQL technology, although the engine uses its own load and unload utilities rather than MySQL
’s mainly because
(1) compression and decompression are done on load and unload and
(2) the BrightHouse Knowledge Grid is created on load.
In addition, BrightHouse also uses its own optimizer instead of the MySQL optimizer because the BrightHouse optimizer knows how to use the information that is stored in its Knowledge Grid to optimize the execution of queries against the BrightHouse engine.

The Brighthouse engine consists of 4 key layers:

1. BrightHouse is a column-oriented data store. This means that instead of the data being stored
row by row, it is stored column by column. There are many advantages to column-orientation,
not the least of which is the ability to do more efficient data compression because since each
column stored a single data type (as opposed to rows that typically contain several data types),
compression can be optimized for each particular data type. The data itself within the columns is
stored by 65K item groupings. We refer to each of these groupings as Data Packs
. The use of Data Packs improves data compression and is also critical to how Infobright resolves complex queries.

2. Data Pack Nodes (DPNs) contain a set of statistics stored related to the data that is stored and compressed in each of the Data Packs. There is always a 1 to 1 relationship between Data Packs and DPNs.
3. Knowledge Nodes are a further set of metadata related to Data Packs, columns or table combinations.
The set of these Knowledge Nodes taken together is called the Knowledge Grid.

4. The BrightHouse Optimizer uses the Knowledge Grid to determine the minimum set of Data Packs which need to be decompressed in order to satisfy a given query. In some cases, the information contained in the Knowledge Grid is sufficient to resolve the query, in which case nothing is decompressed.
The Brighthouse storage engine’s design is perfect for the massive amounts of data and queries that are the standard fare of traditional or historical data warehousing.

4.1.6 MySQL Storage Engine Comparison Grid for Data Warehousing
A MySQL storage engine reference guide for data warehousing is provided below for a quick at-a-glance overview of the various internal and external engines offered by MySQL, which includes a brief descript
ion of their characteristics and the data warehouse use case they are best suited for.

No comments: