BUSINESS INSIGHTS

May 13, 2013

Overview of Data Quality Assurance in Data Warehousing

Posted by Shane Risk

As with any IT initiative, proper quality assurance processes can make or break a project.  In data warehousing, there are a number of steps that you can take to make sure your solution is not only successful, but highly trusted and extremely stable.

One of the high points of a data warehousing project is when the data warehouse, the 'new kid on the block', begins to point out data issues in legacy reporting that have been lingering for years.  To me, this is a major milestone that I always celebrate on my projects (and you should do the same).  However, there are a number of data warehouses that struggle to reach this point.  This is likely due to a number of reasons, and one of the major obstacles is poor quality assurance!

Nothing casts doubt on a data warehouse quicker than incorrectly reporting information.  It is critical that data warehousing projects do everything in their power to mitigate this risk.

This article is intended to give an overview of some of the key concepts of quality assurance in data warehousing and business intelligence projects to ensure you don't have to struggle with quality issues on your project.  If you follow the framework that I lay out below, the integrity and stability of your solutions should increase significantly.

Overview of Quality Assurance in Data Warehousing

There are 6 types of testing that must be considered when implementing a data warehouse, as illustrated in the image below.  The four types of testing I will spend most of my time discussing are Unit Testing, System Integration Testing, Data Validation, and User Acceptance testing.  I will revisit each of these in a future post to fully describe each type in turn.

 Quality Assurance in Data Warehousing

Unit Testing

Unit Testing is the process of validating each of the constituent parts of a solution.  Unit Testing is entirely the responsibility of the developer, and MUST be done during development.  There is no effective way to do Unit Testing after the fact, it will only serve to introduce bugs into the data warehouse. 

In a data warehouse/business intelligence solution, the most critical items to unit test are ETL logic, business rules and calculations implemented in the OLAP layer, KPI logic, and individual report validation.  Depending on the solution of course there could be other important items to unit test, but these are the high risk areas. 

Ideally, Unit Testing can be automated since this form of testing is done repeatedly throughout the course of a project.  Consideration will be given to this in future articles.

System Integration Testing

System Integration Testing is intended to confirm the system acts as expected once the constituent parts of the solution are put together.  System Integration Testing is completely dependent on successfully Unit Testing your data warehousing solution first! 

System Integration Testing should accomplish two main objectives.  First you must perform system build testing to ensure that you can successfully build and deploy into your system integration testing environment.  Once deployed and configured, all jobs must be executed and data processed to ensure no issues arise during job execution.  If you have multiple production jobs, you will want to run them all under real world circumstances if possible.

Adopting System Integration Testing into your data warehouse development cycle is a giant step forward, assuming you haven't yet adopted this form of testing.  If you have been using these techniques and are ready to take it a step further, consider creating scripts to stage specific cases in your test data.  These cases should stage tests to check the handling of specific circumstances such as bounds testing, calculation logic, condition processing and any other core data processing logic to really put your system through its paces. 

Data Validation 

Data Validation is the process of testing the data within a data warehouse.  A common way to perform this test is by using an ad hoc query tool (Excel) to retrieve data in a format similar to existing operational reports.  Data that should be validated includes dimension member completeness, base measure accuracy, and business calculations.

If the data ties between the data warehouse and the operational report then the data is valid (unless of course the original report is flawed).  Once a number of reports have been validated, then likely you can rest assured that the data within the warehouse is correct.

It is imperative to note that data validation should be performed by a business representative! It is marginal at best to have the data warehousing team perform data validation on the data within the warehouse.  The data warehouse team can detect some data issues, but the individuals who know the data best are integral to successfully validating the data within the system. 

If you are unable to get commitment from a business representative, then likely you don't have proper backing from the correct stakeholder for your data warehouse project.  This is a huge red flag that will need to be addressed.  Data warehouses without appropriate stakeholder backing and proper business representation are at high risk of failure, and will continue to struggle throughout the course of the project.

User Acceptance Testing

The objective of User Acceptance Testing is to ensure two things.  First, that the data that is being provided to the end user is what is expected.  Second, that the tools provided to the end user meet their expectations.

Ideally if there is a problem with the scope of data being provided, it is best to discover potential issues early on.  It is critical to reconcile end user expectations with the scope of the project at the beginning to reduce the risk of having to rearchitect the data model (and everything built on top of it) once you're ready to promote the data warehouse to production. 

Tool validation is typically much more flexible if there are issues discovered during user acceptance testing.  In fact, I've found it helpful to conduct rapid development sessions where you construct dashboards or reports on the fly with end users.  This is a great way to give them exactly what they want, provide the data warehouse team with immediate feedback, and greatly enhance the chances of successfully implementing the end user tools.

Performance Testing

Performance Testing is a very complex topic that I won't spend much time on here.  Suffice to say that properly validating the performance of your solution under real world conditions is important to satisfying user expectations.  There are a number of factors to consider including data architecture, hardware configuration, system scalability, query complexity, concurrent users, etc.  Look for discussions on these topics to come in future articles.

Regression Testing

A Regression Test is the process of retesting functionality to ensure that future development has not broken anything that was previously known to work.  Each of the different testing categories defined above are subject to Regression Testing.

Since techniques for handling Regression Testing are different for each of the quality assurance categories noted, this topic will be integrated into the future articles about each respective quality assurance category.

Conclusion

The testing framework that I've laid out above has been critical to the success of the projects that I've been involved with.  I've gotten feedback from a number of clients that the data warehouse efforts that I've lead have been extremely stable, requiring minimal effort to keep them running. 

In fact, one of the technical leads that I recently handed over a project to claimed that the solution was boring to maintain because nothing ever goes wrong!  As far as criticism goes, that's about the best feedback you can hope for. 

Its my hope that you will be able to take the concepts that I've presented above and use them on your implementations.  If you do use any of these techniques, please let me know.  I'd love to hear about your experiences.  And, of course I'm always open to feedback, please leave a comment!

New Call-to-action
Shane Risk

About The Author

Shane Risk

Shane is former Solutions Consultant at BlueGranite, certified a certified Scrum Master and has expertise in the areas of data warehousing, business intelligence, Master Data Management, data governance and data quality.