Tuesday, May 17, 2011

2. A Look at Data Warehouse Workloads

In its 2006 study, Magic Quadrant for Data Warehouse Database Management Systems, Gartner Group identified several data warehousing workloads that typified the use cases found at customer sites actively involved in business intelligence. These workloads were:
• Continuous (near-real-time) data loading — similar to an online transaction processing (OLTP) workload
• Large numbers of standard reports, numbering in the thousands per day, requiring Structured Query Language (SQL) tuning and index creation
• An increasing number of true ad hoc query users with a random, unpredictable use of the data
• Analytics and BI-oriented functionality in OLTP applications
Translating these workloads into practical data warehousing installations yields the following use cases:
1. Small, semi-real-time data marts
2. Continuous, real-time load/query data warehouses
3. Traditional, standard-reporting data warehouses
4. Massive historical data warehouses with ad hoc queries
5. BI/analytics embedded in OLTP applications (an emerging trend)
Let’s examine each of these use cases in detail so that the critical distinctions among them can be understood.

2.1 Examining Data Warehouse Use Cases
The first standard data warehouse use case is the data mart. Data marts are normally characterized by two things: (1) their size and (2) their focus. Data marts tend to be smaller in data volume than data warehouses and narrower in scope, oftentimes holding data only for a particular area of a company, or even a subset of an area. A data mart’s update frequency (how often its data is refreshed by daily/hourly transactional activity) depends on the needs of the business area using its information to make decisions. If key decisions require the most up-to-date data possible, the data mart may be fed in real time; otherwise, daily or weekly refreshes are the norm.
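To make the refresh mechanics concrete, here is a minimal sketch of a nightly data mart load using MySQL's event scheduler (which must be enabled via the event_scheduler system variable). The table and column names, such as sales_mart and sales_staging, are illustrative assumptions, not objects mentioned in this paper:

-- Nightly roll-up of yesterday's transactional activity into the mart
CREATE EVENT IF NOT EXISTS refresh_sales_mart
ON SCHEDULE EVERY 1 DAY
STARTS CURRENT_DATE + INTERVAL 1 DAY
DO
  INSERT INTO sales_mart (sale_date, region, product_id, total_amount)
  SELECT sale_date, region, product_id, SUM(amount)
  FROM sales_staging
  WHERE sale_date = CURRENT_DATE - INTERVAL 1 DAY
  GROUP BY sale_date, region, product_id;

A weekly schedule is just a change to the EVERY clause; a real-time mart would instead be fed by replication or a continuous load process.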
The real-time data warehouse has seen increased popularity in recent years, mainly due to the growing desire to have the most current information possible at hand to navigate and outsmart the competition. The real-time data warehouse's attributes include constant resource contention between incoming data refreshes and queries against the same set of data objects, hourly and daily growth in storage, regular purges of unneeded data, and an audience that can be either narrow or broad in focus depending on the subject areas that reside in the warehouse.
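As a rough illustration of the purge side of this workload, the following sketch removes aged rows in small batches so the purge interferes less with concurrent loads and queries. The table name, timestamp column, and 90-day retention window are assumptions made for the example:

-- Run repeatedly until no rows are affected; small batches keep locks short-lived
DELETE FROM clickstream_facts
WHERE event_time < NOW() - INTERVAL 90 DAY
LIMIT 10000;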

The traditional data warehouse, as its name implies, is the use case most people think of when it comes to data warehousing. Usually sporting very large data volumes, infrequent refresh rates (not hourly and sometimes not daily), and a wide and varied audience, the traditional data warehouse is what most businesses start with (after a data mart) when implementing a data store for business intelligence purposes.

The historical data warehouse is somewhat new in nature and has been born out of relatively recent mandates that require many businesses to keep large amounts of historical information at the ready for government or other business-compliance purposes. The historical data warehouse typically has data volumes that are multiples of those of traditional data warehouses and sees semi-frequent data refreshes, but it usually has less query traffic than data marts, real-time warehouses, or traditional data warehouses.
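One common way to keep multi-year volumes like these manageable is date-range partitioning, so that whole years can be queried, archived, or retired as units. The sketch below assumes MySQL 5.1 or later partitioning support; the table design and retention scheme are illustrative only:

CREATE TABLE order_history (
  order_id   BIGINT        NOT NULL,
  order_date DATE          NOT NULL,
  amount     DECIMAL(12,2),
  PRIMARY KEY (order_id, order_date)   -- must include the partitioning column
) ENGINE=InnoDB
PARTITION BY RANGE (YEAR(order_date)) (
  PARTITION p2008 VALUES LESS THAN (2009),
  PARTITION p2009 VALUES LESS THAN (2010),
  PARTITION p2010 VALUES LESS THAN (2011),
  PARTITION pmax  VALUES LESS THAN MAXVALUE
);

-- Retiring a full year once its retention period ends is then a metadata operation, not a mass DELETE
ALTER TABLE order_history DROP PARTITION p2008;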


The analytic OLTP warehouse application is what some see as a dangerous return to the days when databases used for rapid transactional activity also became the target of resource-intensive analytic queries. The mix of these workloads usually spelled death for applications that needed rapid response times to serve customers demanding quick completion of their requests. The analytic OLTP warehouse is usually a database that back-ends a standard OLTP application but also contains objects that are fed from the transactional objects and designed to support business intelligence queries.
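A minimal sketch of such a design keeps a summary object alongside the transactional table and rolls data into it periodically, so BI queries read the summary rather than scanning the OLTP objects directly. Every object name here is hypothetical:

-- BI-facing summary table fed from the transactional orders table
CREATE TABLE daily_order_summary (
  summary_date DATE          NOT NULL,
  region       VARCHAR(32)   NOT NULL,
  order_count  INT UNSIGNED  NOT NULL,
  total_amount DECIMAL(14,2) NOT NULL,
  PRIMARY KEY (summary_date, region)
) ENGINE=InnoDB;

-- Periodic roll-up of today's transactions; REPLACE overwrites the running totals
REPLACE INTO daily_order_summary (summary_date, region, order_count, total_amount)
SELECT DATE(order_time), region, COUNT(*), SUM(amount)
FROM orders
WHERE order_time >= CURRENT_DATE
GROUP BY DATE(order_time), region;

Confining the analytic queries to the summary objects limits, though it does not eliminate, their contention with the transactional workload.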

There may be other niche data warehouse installations, but the above represent the vast majority of those found in IT organizations.

2.2 Upcoming Challenges for Data Warehousing
Each of the installations above exhibits at least one of the workloads identified by Gartner, and sometimes more than one. It is when a data warehouse installation contains a mixture of workloads, Gartner says, that issues begin to arise in terms of both management and performance. Gartner states:
“The four workload types are creating issues for vendors, more than the actual size of the DW, even manifesting in database sizes less than 1TB. In addition to service-level expectations, the size and duration of "useful" data for each community often differs significantly, forcing every aspect of the DW environment — from input/output (I/O) channel balancing through disk management and into memory and processor allocation — to become involved. During the next three years, mixed workload performance will become the single most important performance issue in data warehousing.”

Addressing the mixture of workloads can result in financial cost issues, as it oftentimes involves splitting data warehouse subject areas across different servers and databases. This is just one reason why IT professionals are now turning to open source solutions to meet their data warehousing needs. But cost is not the only reason for choosing an RDBMS like MySQL for data warehousing. A number of factors come into play that make MySQL an attractive choice in this arena, including a strong and successful company history, a solid core feature set that lends itself to data warehousing, a unique architecture and design that solves the problem of mixing workloads in data warehouses, and a growing partner network of business intelligence software and tools.
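One likely aspect of the architecture referred to here is MySQL's pluggable storage engine design, which lets tables in the same server use engines suited to different workloads. The sketch below is an assumption-laden illustration of that idea; the engine choices and table names are not recommendations taken from this paper:

-- Transactional table on a transactional engine for continuous, OLTP-style loading
CREATE TABLE orders_live (
  order_id    BIGINT PRIMARY KEY,
  customer_id BIGINT,
  amount      DECIMAL(12,2),
  order_time  DATETIME
) ENGINE=InnoDB;

-- Read-mostly reporting table on a different engine within the same server
CREATE TABLE orders_reporting (
  order_day DATE,
  region    VARCHAR(32),
  total     DECIMAL(14,2)
) ENGINE=MyISAM;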
