Over the past 5 years, the world of data and Business Intelligence has experienced major and disruptive technological change. Big data, machine learning, data science, deep learning – these terms and technologies are not all new, but they are now at the forefront of discussion for data practitioners.
We’ve seen the emergence of various Business Intelligence tools claiming to have either automated or replaced the need for data modeling. There has also been an explosion, both in terms of general interest and implementation, of new data storage technologies, such as data lakes. In today’s climate, data modeling can be portrayed negatively, as a relic of the past.
For Business Intelligence projects, data modeling usually means dimensional modeling – the approach originated by data warehouse pioneer Ralph Kimball; this will be the heart of our discussion. Rather than rehash the merits of this approach over any other in the data warehousing context, let’s examine its relevancy after recent upheavals in technology. To understand current best practices, we will start with a review of dimensional modeling’s evolution.
Dimensional modeling applies to nearly any data practitioner – from a financial analyst who needs to create an executive dashboard in Power Pivot, to a data architect at a multinational company, to anyone who spends more than 20% of their time manipulating data. If your job involves creating reports, dashboards, or basic forecasts, then dimensional modeling is relevant to you.
Laying an Intuitive & Effective Groundwork
Let’s go back in time and review why dimensional modeling was created in the first place, to discover if these arguments still hold true today. The initial goals were:
- Performance: By denormalizing and simplifying the schema (fewer joins), we were able to obtain better performance, and we were able to better predict the performance of our data warehouse. It was also easier to create aggregate tables on a star or snowflake schema than on a normalized schema.
- Integration: The enterprise bus matrix was designed to integrate various business processes, agnostic to the application that implements them. For instance, a conformed customer dimension allowed finance, marketing, and sales teams to have one common customer reference regardless of the source application.
- Extensibility: Dimensional modeling is modular by nature; many components can and should be re-used. While there was no agile project management 20 years ago, this modular nature, in theory, helped build the data warehouse incrementally and avoid a big bang approach.
- Ease of understanding: The simple structure of the database allowed a non-technical end user (e.g., an accountant or marketing analyst) to easily query the model without wondering whether a relationship was 1-n or n-n, or whether there was a loop in the model.
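To make the performance and ease-of-understanding points concrete, here is a minimal star-schema sketch using Python’s built-in sqlite3 module. All table and column names (fact_sales, dim_customer, dim_date) are hypothetical illustrations, not taken from any particular warehouse:

```python
import sqlite3

# A minimal star schema: one fact table plus two dimensions.
# All names and values below are invented for illustration.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, segment TEXT);
CREATE TABLE dim_date     (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE fact_sales   (customer_key INTEGER, date_key INTEGER, amount REAL);

INSERT INTO dim_customer VALUES (1, 'Acme', 'Retail'), (2, 'Globex', 'Wholesale');
INSERT INTO dim_date     VALUES (20240101, 2024, 1), (20240201, 2024, 2);
INSERT INTO fact_sales   VALUES (1, 20240101, 100.0), (2, 20240101, 250.0),
                                (1, 20240201, 300.0);
""")

# A typical end-user question needs only simple, predictable joins:
rows = con.execute("""
    SELECT c.segment, d.year, SUM(f.amount) AS total_sales
    FROM fact_sales f
    JOIN dim_customer c ON c.customer_key = f.customer_key
    JOIN dim_date     d ON d.date_key     = f.date_key
    GROUP BY c.segment, d.year
    ORDER BY c.segment
""").fetchall()
print(rows)  # [('Retail', 2024, 400.0), ('Wholesale', 2024, 250.0)]
```

Every business question follows the same shape – fact table joined to a handful of dimensions – which is exactly what makes both query performance and end-user comprehension predictable.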
Stable Groundwork Still Key
Fast forward 20 years and the original tenets underlying a dimensional model’s use still hold true today.
Hardware and software have improved dramatically: a multi-month project that required a top-of-the-line server 20 years ago can now be prototyped, with better performance, on a decent laptop with Power BI in less than a week. The emergence of cloud computing also provides broader access to massively parallel databases for a fraction of the price. And today’s big data technologies allow BI practitioners to manipulate quantities of data that were unfathomable when dimensional modeling was created. But while raw performance has improved, end users now expect lower data latency. Most users won’t accept having their finance data refreshed just once a month, as was the norm 15 years ago. A daily refresh, or even several refreshes per day, is now typical – but it is only achievable if the whole reporting infrastructure is optimized, proof that dimensional modeling still plays a role in performance.
The integration capability of the bus matrix was probably one of the most important features of dimensional modeling. However, the theory often failed to deliver in practice. Rotating stakeholders with changing needs often muddied an enterprise’s efforts to make use of its data. Additionally, the wait to add a new business process to the data warehouse often seemed too long – businesses often opted for quicker departmental solutions.
IT departments talked about data integration, while business units wanted to break departmental silos; although everyone agreed that solving these organizational issues was one of the data warehouse’s most important roles, success here was a challenge for most BI teams. Effective data modeling requires bringing IT, marketing, and sales to the same table to decide on common definitions, alongside the skills needed to create the databases, tables, and ETL processes – a combination rarely found in the BI teams of the past, whose members were typically either too technical or too functional.
The creation of hybrid teams – combining data engineers, business experts, and, with increasing frequency, data scientists – is part of today’s answer to the integration challenge. These hybrid teams go by various names within an organization, such as BI Center of Excellence or BI Competency Center. These cross-departmental teams are often derived from the more generic Community of Practice.
Master Data Management (MDM) projects, which use a systematic data integration approach, tackle the other challenges; MDM emphasizes governance and business processes. Integration is still a valid argument in favor of dimensional modeling, but perhaps more so it is an argument for MDM.
The main concepts of dimensional modeling are facts and dimensions. Dimensions are (or should be) designed independently from their source systems. The surrogate key, which identifies each member of a dimension, is independent from the source system; as a result, it can and should be re-used across different business areas. By designing each dimension (and, as a result, the facts) independently, the data warehouse is modular by design. This is still true. The advent of agile project management only confirms that developing a data warehouse in manageable chunks is key to the success of a solution. The extensibility of dimensional modeling is still a relevant, key feature in favor of its use.
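As a sketch of how a surrogate key decouples a conformed dimension from its source systems, the following hypothetical Python class assigns its own keys to customers arriving from different applications. The source-system names and records are invented for illustration:

```python
# Sketch of surrogate-key assignment for a conformed customer dimension.
# Source-system names ("crm", "billing") and records are hypothetical.
class CustomerDimension:
    def __init__(self):
        self._next_key = 1
        self._by_natural_key = {}   # (source_system, source_id) -> surrogate key
        self.rows = {}              # surrogate key -> dimension attributes

    def get_or_create(self, source_system, source_id, attributes):
        natural = (source_system, source_id)
        if natural not in self._by_natural_key:
            key = self._next_key
            self._next_key += 1
            self._by_natural_key[natural] = key
            self.rows[key] = attributes
        return self._by_natural_key[natural]

dim = CustomerDimension()
# The same conformed dimension serves finance and marketing facts alike:
k1 = dim.get_or_create("crm", "C-042", {"name": "Acme"})
k2 = dim.get_or_create("billing", "9001", {"name": "Globex"})
k3 = dim.get_or_create("crm", "C-042", {"name": "Acme"})  # already known
print(k1, k2, k3)  # 1 2 1
```

Because facts reference the surrogate key rather than any application’s identifier, the dimension can be extended to a new business process without reworking existing fact tables.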
Ease of Understanding
The recent dramatic evolution of technology has given rise to an exponentially more complex data landscape, including semi-structured data, text analytics, sensor analytics, and web-related statistics. But if all of this information can’t benefit its end users, it’s useless to the enterprise. And dimensional modeling still offers the best route to making sense of mounds of data.
Data is typically consumed through reports or dashboards built from tables or graphs. A well-built dimensional model can field nearly any related end-user query, in the form of an easily digestible flat table or graph.
Even seasoned report designers benefit from well-designed data models. Calculations are easier to develop, report creation is faster, and reports are more consistent from one developer to another. Dimensional models aren’t just key to dashboards, reports, and simple data analysis – they also benefit data scientists.
Most data scientists spend around 80% of their time wrangling, cleaning, and organizing data to obtain a tidy dataset (Wickham, 2014): one observation per row and one variable per column. This type of structure is extremely easy to obtain from a dimensional model: a simple join between the fact table and the relevant dimensions, an aggregation of the indicators, and you have a tidy tabular dataset.
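A hedged sketch of that workflow, again with Python’s built-in sqlite3 and invented table names: one join plus one aggregation turns a star schema into a tidy, one-observation-per-row table ready for analysis.

```python
import sqlite3

# Hypothetical star schema; all names and values are illustrative only.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE fact_sales  (product_key INTEGER, quantity INTEGER, amount REAL);
INSERT INTO dim_product VALUES (1, 'Hardware'), (2, 'Software');
INSERT INTO fact_sales  VALUES (1, 2, 50.0), (1, 1, 25.0), (2, 3, 300.0);
""")

# One join plus an aggregation yields a tidy dataset:
# one observation (product category) per row, one variable per column.
tidy = con.execute("""
    SELECT p.category,
           SUM(f.quantity) AS total_quantity,
           SUM(f.amount)   AS total_amount
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
print(tidy)  # [('Hardware', 3, 75.0), ('Software', 3, 300.0)]
```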
Cleaned, organized data ensures that data scientists – who are rare and expensive – can focus on actual data science, rather than on engineering tasks that your BI team has already completed.
The real strength of dimensional modeling is its ability to be easily understood and used for a wide range of business problems, regardless of an end user’s technical knowledge. A carefully designed model saves your analysts, report designers, and data scientists countless hours. The time saved from cleaning and organizing data allows them to focus on gaining valuable insight.
Dimensional modeling is not dead; far from it. As the data landscape evolves toward more complexity, dimensional modeling continues to allow more people to access and use the information buried in the mountains of data generated every day. In other words, the question is not whether you should build a dimensional model, but who will create it and when. If you want to learn more about how BlueGranite can help with your data modeling needs, contact us today and we will be happy to answer your questions.