
Sep 11, 2019

Modeling stock-type indicators: headcount, occupancy, and more!

Posted by Paul Poco

I recently worked on a project where I had to calculate a maintenance backlog. A maintenance backlog is the number of tasks that are active (or in a specific status) at a specific point in time. These tasks have a duration of one or more days. This type of indicator falls under the category of semi-additive indicators: they are additive across every dimension except time.

 

|              | Opening Balance | Mon. | Tues. | Wed. | Thu. | Fri. | Sat. | Sun. | Total Week |
|--------------|-----------------|------|-------|------|------|------|------|------|------------|
| Task Backlog | 50              | 52   | 54    | 51   | 46   | 48   | 50   | 51   | 51         |
| New Tasks    | N/A             | 10   | 5     | 7    | 0    | 11   | 2    | 1    | 36         |
| Tasks Closed | N/A             | 8    | 3     | 10   | 5    | 9    | 0    | 0    | 35         |

 

In the example above, the Backlog at the end of the week is not 352 (the sum of each day’s Backlog) but the Backlog amount for the last day. The aggregation chosen for this type of indicator on the time dimension will depend on your requirements: it could be the last value (as in our example), an average, or the maximum or minimum for the given period.

The “New Tasks” and “Tasks Closed” indicators, on the other hand, are purely additive; the amount for the week is the sum of each day’s values.
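To make the last-value behavior concrete, here is a minimal DAX sketch. The names are assumptions for illustration: a hypothetical 'Backlog Snapshot' fact table with one row per open task per day, related to a 'Date' table.

Task Backlog =
CALCULATE (
    -- one row per open task per day, so a row count gives the backlog
    COUNTROWS ( 'Backlog Snapshot' ),
    -- keep only the last date of the period in the filter context
    LASTDATE ( 'Date'[Date] )
)

Evaluated over a week, this returns Sunday’s backlog (51 in the table above) rather than the sum of the seven days.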

While less common than fully additive indicators, stock-type indicators are still widespread, and depending on your specific situation, there are many ways to implement them. The method you choose will hinge on various factors. I will touch on some of the pros and cons of the different methods.

There are several common use cases for stock-type indicators: a company’s headcount, hospital or hotel occupancy, active tickets for a helpdesk service, a maintenance backlog, and, probably the most common, a balance sheet. Proper implementation will improve performance and functionality, and can unlock new analyses for your end users.

While I was working on this project, my first instinct was to implement a SQL Server Analysis Services (SSAS) Multidimensional Expressions (MDX) technique I’ve used many times before. However, after a few tests, I wasn’t satisfied with the result and decided to investigate other techniques to tackle our client’s maintenance backlog.

Build a Trial Model 

For demonstration purposes, let’s work with a fictional healthcare company. This healthcare system receives patients in one of its hospitals on a specific date and releases them a few days later. We will only take inpatients (those who stay more than one day) into account. Our goal is to determine how many beds are occupied and, indirectly, the remaining capacity.

There are three dimensions in this model: 

  • Calendar 
  • Patient 
  • Hospital 

And one fact table: 

  • Occupancy 

The data we’ll use was randomly generated in Excel: each patient is assigned a hospital (Site), an Arrival Date, and a Departure Date.

We want to know how many beds are occupied at any point in time.

Decide on a Data Storage Option

At the very beginning of your project, you’ll have to decide how you’ll store the data in your data warehouse. It is one of the most important decisions you’ll make because it drives everything you later do in your semantic model (Power BI, SSAS Tabular, or Azure Analysis Services). The format in which the transactional system captures data will also be a factor.

There are three data storage options: 

Option 1: Each of your events will have a start date and an end date. If no other measures are added to your fact table, this is a factless fact table. The fact table looks like this:

| Column       | Data Type |
|--------------|-----------|
| Arrival Date | Date      |
| Release Date | Date      |
| PatientId    | Integer   |
| LocationId   | Integer   |

 

Each time a patient is admitted to the hospital, a new line is added to the table, and the Release Date is set to NULL. When the patient is released, the Release Date is populated (assuming you use a relational database for your data warehouse, you would do a merge, an update, or a truncate-and-load). This is called a factless fact table because there are no indicators or measurements, only foreign keys, in the fact table (both date columns are foreign keys to the date dimension).

Depending on your situation, you might also need an opening balance. Having an opening balance equal to zero (no opening balance) is ideal, as it will greatly simplify your model; however, it is not always possible.

If the hospital opened 50 years ago, chances are the last time there weren’t any patients was 50 years ago. It’s also extremely unlikely to have 50 years of daily transactions in the source system. An opening balance requires a snapshot-type fact table, as described below.

It can often be easier to have more history than required and no opening balance, rather than a shorter history and an opening balance. It’s a trade-off between volume and model complexity.  

Option 2: This is the most common approach, the snapshot-type fact table. It’s easy to implement in your ETL, and it’s easy to understand. A snapshot-type fact table is one in which a new line is added every day for each patient who is still hospitalized. It looks like this:

| Column        | Data Type |
|---------------|-----------|
| OccupancyDate | Date      |
| PatientId     | Integer   |
| LocationId    | Integer   |

 

The opening-balance table mentioned in Option 1 is a snapshot-type fact table for one date only, the first date considered in your model. 

Option 3: This option combines both a snapshot at a specific point in time (yearly, for instance) and an event-based fact table.

This solution adds complexity and is only useful in specific use cases. For example, this option is the traditional way to model a balance sheet.  

Pros and Cons of Each Option 

As mentioned earlier, the most commonly encountered approach is Option 2, the snapshot fact table. The main drawback of this approach is that the fact table’s size will grow extremely fast. For example, if you want to calculate the headcount in a company with 10,000 employees on average, and you want 5 years of historical data, you will add 10,000 rows per day to your fact table – that gives you (10,000 * 365 * 5 =) 18,250,000 rows after 5 years.

If you used the first approach, Option 1, the fact table would be (10,000 * 5 =) 50,000 rows after 5 years, assuming your employees change position or quit the company once a year, on average. 

The snapshot fact table (Option 2) is (18,250,000 / 50,000 =) 365 times bigger. On the bright side, as the data is very repetitive, you might get a very good compression ratio on these tables.  

However, the difference in size between Options 1 and 2 is major; as a result, the snapshot approach should only be used for smaller datasets, or if your budget allows you to have a massively parallel processing (MPP)-type database, such as the modern Azure SQL Data Warehouse.

In some cases, you can’t use Option 1, as the data is captured as a snapshot-type fact. 

Data Analysis Expressions (DAX) and Our Models 

After you decide how to store your data, you will need to prepare your indicators for consumption by the reporting layer or for expert analysis. You will typically have three main indicators:

  • Number of arrivals 
  • Number of departures 
  • Occupancy 

The focus of this post is Occupancy. 

With a snapshot-type fact table: 

 

For the purpose of this post, I generated a model based on Excel data: I randomly assigned a start date, an end date, and a hospital to random people to generate the fact table. In order to have a snapshot-type fact, I generated one row in Power Query for each day a patient is hospitalized, as follows:

let
    Source = Occupancy,
    // build the list of days of the stay; the - 1 excludes the departure day
    #"Added Custom" = Table.AddColumn(Source, "Custom", each {Number.From([Arrival Date])..Number.From([Departure Date]) - 1}),
    // one row per day per patient
    #"Expanded Custom" = Table.ExpandListColumn(#"Added Custom", "Custom"),
    // assumed: convert the expanded integers back to dates
    #"Changed Type" = Table.TransformColumnTypes(#"Expanded Custom", {{"Custom", type date}}),
    // assumed: rename the column to match the measures below
    #"Renamed Columns" = Table.RenameColumns(#"Changed Type", {{"Custom", "PresenceDate"}})
in
    #"Renamed Columns"

This M code adds a custom column that contains a list of dates between the Arrival Date and the Departure Date. As you can only generate an interval in M with integers, you first convert the dates to integers and then generate the interval.

{Number.From([Arrival Date])..Number.From([Departure Date])}

Then expand the list column, convert the values back to dates, and you have the snapshot-type fact table. There are other ways to generate this kind of fact table from a Start Date – End Date-type table, but this is fast, efficient, and quite simple. Generating the snapshot in a calculated table is also a good option, as sketched below.
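For reference, a rough DAX sketch of that calculated-table alternative could look like the following (my own illustration, not the client implementation; the - 1 mirrors the M code by excluding the departure day):

Occupancy Snapshot =
SELECTCOLUMNS (
    GENERATE (
        Occupancy,
        -- one row per day of the stay, excluding the departure day
        CALENDAR ( Occupancy[Arrival Date], Occupancy[Departure Date] - 1 )
    ),
    "PresenceDate", [Date],
    "PatientId", Occupancy[PatientId],
    "LocationId", Occupancy[LocationId]
)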

From here we can create the Occupancy indicator: 

Occupancy (Snapshot) =
VAR CurDate =
    MAX ( 'Date'[Date] )
RETURN
    CALCULATE (
        COUNTROWS ( 'Occupancy Snapshot' ),
        FILTER (
            ALL ( 'Occupancy Snapshot'[PresenceDate] ),
            'Occupancy Snapshot'[PresenceDate] = CurDate
        )
    )

This formula counts the number of rows in the current filter context, with the filter on the snapshot date overridden so that only the rows for the last day of the period considered are counted.

Creation is simple and easy, the performance is adequate, and the table can hold massive volume. However, refreshing it will take roughly 365 times longer than with the Option 1 design – assuming the snapshot is stored in your data warehouse and not generated in Power BI.

If the data comes as a snapshot from the source system, calculating the number of arrivals and number of departures can be less straightforward than with other options, as the sketch below suggests. Another drawback is volume, and I’ve often seen a snapshot at a month level rather than at a day level, only to accommodate performance.
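For instance, one possible way (an assumption on my part) to derive arrivals from a snapshot-only fact table is to count the patients present on a given day but absent from the previous day’s snapshot:

Arrivals (Snapshot) =
VAR CurDate =
    MAX ( 'Date'[Date] )
VAR PatientsToday =
    CALCULATETABLE (
        VALUES ( 'Occupancy Snapshot'[PatientId] ),
        ALL ( 'Date' ),
        'Occupancy Snapshot'[PresenceDate] = CurDate
    )
VAR PatientsYesterday =
    CALCULATETABLE (
        VALUES ( 'Occupancy Snapshot'[PatientId] ),
        ALL ( 'Date' ),
        'Occupancy Snapshot'[PresenceDate] = CurDate - 1
    )
RETURN
    -- patients in today's snapshot who were not there yesterday
    COUNTROWS ( EXCEPT ( PatientsToday, PatientsYesterday ) )

With a Start Date – End Date fact table, by contrast, arrivals are a plain row count over the Arrival Date relationship.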

With a date-type (Start Date – End Date) fact table: 

The main differences in this model are the fact table and the two relationships with the Date table; the dimensions are identical. The second relationship (on the Departure Date) is inactive.

Occupancy (Filter) =
VAR CurDate =
    LASTDATE ( 'Date'[Date] )
RETURN
    CALCULATE (
        COUNTROWS ( Occupancy ),
        -- patients who arrived on or before the current date (active relationship on Arrival Date)
        FILTER ( ALL ( 'Date' ), 'Date'[Date] <= CurDate ),
        -- and whose departure is on or after the current date
        FILTER (
            ALL ( Occupancy[Departure Date] ),
            Occupancy[Departure Date] >= CurDate
        )
    )

This formula first stores the current date (from the filter context) in a variable and then counts all the rows in the fact table for which that date falls between the Arrival Date and the Departure Date.

This calculation can be tuned in many ways, depending on your specific use case. It is worth mentioning that the performance depends on the size of your fact table and also on the duration of the events (Duration = Departure Date – Arrival Date). Also, this specific formula only provides the end-of-period value. Slightly different approaches are required to get an average, a minimum, or a maximum over the period considered, as sketched below.
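For example, one possible variant (my own sketch, reusing the measure above) iterates the dates of the period and averages the daily values:

Occupancy (Average) =
-- context transition evaluates Occupancy (Filter) for each single date of the period
AVERAGEX ( VALUES ( 'Date'[Date] ), [Occupancy (Filter)] )

Replacing AVERAGEX with MAXX or MINX gives the maximum or minimum over the period.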

With a Start Date – End Date fact table and inception-to-date indicators:

This was my initial approach, as it works well with SSAS Multidimensional and MDX (thanks to the aggregation mechanism).

The DAX is as follows:

Occupancy (ITD) =
VAR CurDate =
    LASTDATE ( 'Date'[Date] )
-- admissions since the beginning of time (active relationship on Arrival Date)
VAR AdmissionITD =
    CALCULATE (
        COUNTROWS ( 'Occupancy' ),
        FILTER ( ALL ( 'Date' ), 'Date'[Date] <= CurDate )
    )
-- releases since the beginning of time (inactive relationship on Departure Date)
VAR ReleaseITD =
    CALCULATE (
        COUNTROWS ( 'Occupancy' ),
        FILTER ( ALL ( 'Date' ), 'Date'[Date] < CurDate ),
        USERELATIONSHIP ( 'Date'[Date], 'Occupancy'[Departure Date] )
    )
RETURN
    AdmissionITD - ReleaseITD

This formula calculates all the admissions since the beginning of time (in the variable AdmissionITD) and all the releases since the beginning of time (in the variable ReleaseITD); the Occupancy is the difference between the two.

This last formula has two drawbacks: 

1 – As it scans the fact table from the beginning of time twice for each filter context, it is not always the most performant. That being said, the recent availability of aggregate tables in Power BI might solve this performance issue, but I haven’t yet tried it. Maybe in a future blog post.

2 – If the data quality is not on point, the errors will amplify over time. For instance, when a system is first rolled out, there is a period during which end users will make mistakes: forget to close a case, forget to open one, close it too late, etc. Assuming these cases are never closed, the mistakes will carry over to the most recent period.

In the Power BI report below, you can see that the three approaches provide the same result. 

While working on our client’s project, as well as this blog post, I made intensive use of the two following resources – they are worth considering if you need to work on these types of metrics.

Alberto Ferrari’s solution also presents an alternative to simulate a snapshot fact table, using DAX rather than Power Query. 

Mixed Approaches 

You can mix some of these solutions: for instance, use a snapshot fact table as an opening balance every year and a Start Date – End Date fact table during the year. In other words, you reset your “Stock Indicator” calculation yearly. However, this adds complexity and should be used sparingly; a rough sketch follows.
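Assuming a hypothetical yearly 'Occupancy Opening' snapshot table (OpeningDate, PatientId, LocationId, with no active relationship to the Date table), a minimal sketch of the mixed measure could be:

Occupancy (Mixed) =
VAR CurDate =
    LASTDATE ( 'Date'[Date] )
VAR YearStart =
    DATE ( YEAR ( CurDate ), 1, 1 )
-- patients present in the opening snapshot of the current year
VAR Opening =
    CALCULATE (
        COUNTROWS ( 'Occupancy Opening' ),
        'Occupancy Opening'[OpeningDate] = YearStart
    )
-- arrivals between the opening snapshot and the current date
VAR AdmissionsYTD =
    CALCULATE (
        COUNTROWS ( Occupancy ),
        FILTER ( ALL ( 'Date' ), 'Date'[Date] >= YearStart && 'Date'[Date] <= CurDate )
    )
-- releases over the same window (strictly before the current date, as in the ITD measure)
VAR ReleasesYTD =
    CALCULATE (
        COUNTROWS ( Occupancy ),
        FILTER ( ALL ( 'Date' ), 'Date'[Date] >= YearStart && 'Date'[Date] < CurDate ),
        USERELATIONSHIP ( 'Date'[Date], Occupancy[Departure Date] )
    )
RETURN
    Opening + AdmissionsYTD - ReleasesYTD

The yearly opening balance bounds how far back the measure has to scan, at the cost of maintaining the snapshot table.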

The best solution is the simplest solution that solves your problem with adequate performance.  

Conclusion 

Stock-type indicators are slightly more complex to calculate than purely additive indicators. The most important point to consider is that each calculation has its pros and cons, and choosing the “best” approach will depend on these parameters:

  • How is the data captured?  
  • What is the volume (in your data warehouse and in Power BI)? 
  • How long are the events you want to count? 
  • What is the data quality? 

These factors will also strongly influence performance. Your final solution should take each of them into consideration.  

In my experience, Option 2 is often the best approach to calculating stock-type indicators, but becoming comfortable with each approach makes sense, as real-life needs will vary. If you find performance is not adequate with Option 1 – bringing the data into Power BI with a Start Date and an End Date and doing the calculation in DAX – you can create the snapshot in M or DAX, rather than having the snapshot transit over the network.

Looking to maximize your data? Our experts can help. Contact BlueGranite today to learn more. 


About The Author

Paul Poco

Paul has over 15 years of experience with Business Intelligence technologies and implementation. He has worked on all phases of Business Intelligence projects, ranging from pre-project high-level POC development to testing and deployment. Paul's main technical focus is Microsoft’s BI tools, but he also has experience with other tools, including Teradata, Business Objects, Essbase, OBIEE, and Tableau.
