Data Warehouse Design Best Practices

Author Introduction: Jennifer Marsh is a software developer, programmer, and technology writer who occasionally blogs for Rackspace Hosting.

Data warehousing is an expensive undertaking, but one with a great return on investment if executed properly. It isn't for the faint of heart: the process consolidates several databases in a single data center for higher performance, lower energy use, cost savings, and terabytes (sometimes even petabytes) of data storage space. An effective method of data preparation makes handling and interpreting business data far more streamlined.

When moving to a data warehouse, the architectural design and the storage structure implementation are integral to the performance and scalability of future development projects. Cloud data warehouse hosting offers scalability and reliability that internal warehouses can't match, and it consumes resources only on demand, lowering the company's costs. Because the warehouse is so important, most companies also maintain a backup server for their data. Some use Network Attached Storage (NAS) devices, which keep the data available and easy to access. Even so, data can become corrupted or lost, and when that happens companies turn to NAS or QNAP data recovery specialists who may be able to help recover it.

Maintain Data Integrity

When moving to a new cloud data warehouse, get specifications from business units, so the database administrator can identify the tables required for the new warehouse design. After identifying the important data, create a table design that maintains the data’s integrity and avoids redundancy.

Data redundancy means the same data is stored in several places in the warehouse, so if one copy is updated and another is not, queries can return inconsistent results. A normalized relational design avoids this redundancy and keeps queries efficient.
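The redundancy point can be illustrated with a small normalized schema. This is a minimal sketch using SQLite; the table and column names (customers, invoices) are hypothetical, not taken from any particular warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalized design: the customer's name lives in exactly one row,
# and invoices reference it by key instead of duplicating it.
cur.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE invoices (
    invoice_id  INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    amount      REAL NOT NULL
);
""")
cur.execute("INSERT INTO customers VALUES (1, 'Acme Corp')")
cur.executemany("INSERT INTO invoices VALUES (?, ?, ?)",
                [(10, 1, 250.0), (11, 1, 75.5)])

# A name change is a single-row update; every invoice sees it via the join,
# so no stale copy can linger elsewhere in the warehouse.
cur.execute("UPDATE customers SET name = 'Acme Inc' WHERE customer_id = 1")
rows = cur.execute("""
    SELECT i.invoice_id, c.name
    FROM invoices i JOIN customers c USING (customer_id)
    ORDER BY i.invoice_id
""").fetchall()
print(rows)  # every invoice now reports 'Acme Inc'
```

Had the name been copied into each invoice row, that update would have needed to touch every invoice, and a missed row would silently report the old value.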

Data Quality

Data quality is often the first casualty when developers are tired or the warehouse migration slips past its deadline. Maintaining a clean, efficient data set is essential both for performance and for producing accurate information. When tables aren't normalized or indexed, and primary and foreign keys aren't set, the data warehouse can perform poorly. Ensure that the data and the database structures are properly set up before releasing and deploying the warehouse to the production environment.
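A pre-deployment check for keys and indexes can be sketched as follows. This assumes SQLite (where foreign-key enforcement must be switched on explicitly); the departments/employees schema is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default

conn.executescript("""
CREATE TABLE departments (
    dept_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL
);
CREATE TABLE employees (
    emp_id  INTEGER PRIMARY KEY,
    dept_id INTEGER NOT NULL REFERENCES departments(dept_id),
    hired   TEXT NOT NULL
);
-- Index the column reports will filter and sort on.
CREATE INDEX idx_employees_hired ON employees(hired);
""")
conn.execute("INSERT INTO departments VALUES (1, 'Billing')")
conn.execute("INSERT INTO employees VALUES (100, 1, '2023-01-15')")

# With keys enforced, an orphaned row is rejected outright
# instead of quietly degrading data quality.
try:
    conn.execute("INSERT INTO employees VALUES (101, 99, '2023-02-01')")
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)
```

The point is that constraint enforcement catches bad data at load time, before it reaches reports, rather than after users notice inconsistent numbers.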

Ad-hoc Queries and Nightly Jobs

For reporting, add nightly jobs and ad-hoc queries that run off-hours and store the data in a separate reporting database. Cloud hosting can run these queries in the background while serving up other interfaces such as the company's website and data transformation projects. This reduces the number of queries returning large data sets on the production database.

Nightly jobs query the main production server and load the transformed data into a table or group of tables on a reporting server. In some cases, businesses copy all new data over to a reporting server that stores a copy of the production data. If reports don't need real-time data, a 24-hour delay is acceptable.

For instance, the billing department might need a list of invoices for the previous day. A reporting server with 24-hour-old data is sufficient for that, leaving the production server for queries that truly require real-time data.
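The nightly copy described above can be sketched as a small job. This is a minimal illustration using SQLite's ATTACH to stand in for a separate reporting database; the invoices schema and the `nightly_copy` helper are hypothetical, not a real scheduler.

```python
import sqlite3

# Production database with an invoices table (hypothetical schema).
prod = sqlite3.connect(":memory:")
prod.executescript("""
CREATE TABLE invoices (invoice_id INTEGER PRIMARY KEY, created TEXT, amount REAL);
INSERT INTO invoices VALUES (1, '2023-06-01', 100.0), (2, '2023-06-02', 40.0);
""")

# Separate reporting store; in production this would be another server.
prod.execute("ATTACH DATABASE ':memory:' AS reporting")
prod.execute("""CREATE TABLE reporting.invoices AS
                SELECT * FROM main.invoices WHERE 0""")

def nightly_copy(conn, day):
    """Copy one day's invoices from production into the reporting store."""
    conn.execute("""
        INSERT INTO reporting.invoices
        SELECT * FROM main.invoices WHERE created = ?
    """, (day,))
    conn.commit()

# The off-hours job moves yesterday's rows; report queries then hit
# reporting.invoices and never touch the production table.
nightly_copy(prod, "2023-06-01")
copied = prod.execute("SELECT COUNT(*) FROM reporting.invoices").fetchone()[0]
print(copied)  # 1
```

In a real deployment the job would be driven by a scheduler (cron or equivalent) and would copy to a physically separate server, but the shape of the work is the same.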

Implement Audit Tables

Audit tables keep track of when data is changed, and they're a great way to monitor important data for changes or access. An audit table records when a record is viewed or updated, who accessed the data, and the application from which the data was accessed.

Place an index on the date and time field for sorting and performance. Audit tables can grow by megabytes of storage space; if they grow too large, archive old entries that aren't needed into a separate table or database.
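One common way to populate an audit table for updates is a trigger. This is a minimal SQLite sketch; the accounts/accounts_audit names are hypothetical, and a full implementation would also record the user and application, which a trigger alone cannot see.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE accounts (account_id INTEGER PRIMARY KEY, balance REAL);

-- Audit table: what changed and when.
CREATE TABLE accounts_audit (
    audit_id    INTEGER PRIMARY KEY,
    account_id  INTEGER,
    old_balance REAL,
    new_balance REAL,
    changed_at  TEXT DEFAULT (datetime('now'))
);

-- Index the timestamp column for sorting and for archiving old entries.
CREATE INDEX idx_audit_changed_at ON accounts_audit(changed_at);

-- The trigger records every update automatically, so applications
-- cannot forget to write the audit row.
CREATE TRIGGER trg_accounts_update AFTER UPDATE ON accounts
BEGIN
    INSERT INTO accounts_audit (account_id, old_balance, new_balance)
    VALUES (OLD.account_id, OLD.balance, NEW.balance);
END;
""")
conn.execute("INSERT INTO accounts VALUES (1, 500.0)")
conn.execute("UPDATE accounts SET balance = 450.0 WHERE account_id = 1")

trail = conn.execute(
    "SELECT account_id, old_balance, new_balance FROM accounts_audit"
).fetchall()
print(trail)  # [(1, 500.0, 450.0)]
```

Archiving then becomes a simple range query against the indexed `changed_at` column, moving rows older than the retention window into a separate table.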

Energy Efficiency

Google recently released some of its energy efficiency information for the company’s own data centers. The company creates its own renewable energy and efficiency standards. Instead of keeping the data centers at a low temperature, Google allows temperatures to rise within the center section of the data center and removes the hot air using cooling units. This reduces A/C and energy costs. Some of the basic standard changes pioneered by Google can help new data center administrators create a cost-saving warehouse for the business.

Creating a well-designed data warehouse takes several months of planning and consideration. Although the initial costs seem high, the monthly cost reductions are well worth the money, especially when the company moves its data to a cloud host.


2 Responses to Data Warehouse Design Best Practices

  1. Jim Mello says:

    Good thoughts here. Do you know of institutions that are working collaboratively to form shared but separate data warehouses? Can a consortium take advantage of scale that individual schools can't achieve alone?

  2. The Higher Ed CIO says:

    Jim – The best way to answer your question, I believe, is to note that systems are probably the primary example of a shared data warehouse, followed by consortia running a shared ERP. I would imagine the complications of attempting to build a 'shared'/'separate' EDW would make it impractical, especially if the institutions are not running a shared ERP. I am also trying to remember whether any of the canned EDWs from the ERP providers even support multi-tenancy.
