Optimizing your distinct count calculation in Analysis Services
A distinct count (e.g. unique user or visitor counts) provides valuable information, but it is not an easy problem to solve. An example is the grocery "market basket" scenario, where a single visitor purchases various dairy, meat, and vegetable items. While there may be eight (for example) items purchased, these items were all purchased by one person. Understanding that one person purchased many items can be very revealing in the context of business intelligence. Using a web analytics example, if you have a high activity-to-visitor web site, with a lot of traffic relative to the number of visitors (e.g. a financial market site where a set number of users view many web pages to do market research), you have a very "sticky" web site with relatively loyal users. On the other hand, if you have a low activity-to-visitor web site, with many visitors but not much traffic per visitor (e.g. an auto site where many visitors browse many different auto web sites to find the best deal on a car), you have a less "sticky" web site with relatively disloyal visitors. Both business scenarios reveal different insights about the types of visitors to your web site business, allowing you to take actionable business decisions.
Often distinct counts are solved via custom applications and pipelines; more advanced solutions use data mining techniques. To help the cause of business intelligence analysts, the distinct count measure was introduced in Analysis Services 2000 (AS2k). It was originally based on the dimension structure within AS2k and provided fast access to distinct count calculations for small-to-medium volumes of data. Greatly expanded in Analysis Services 2005 (AS2k5), the distinct count measure was redesigned around AS2k5's hierarchy and attribute structure, allowing it to be calculated on much larger volumes of data.
That said, users of Analysis Services will typically want to apply the distinct count measure to enterprise-scale volumes of data (loosely defined as 20 million rows/day or greater). While Analysis Services 2005 provides much better functionality out of the box, including:
· No longer being limited to 3 GB of memory (AS2k could use a maximum of 3 GB of memory regardless of how much memory the server has)
· The 64-bit version of AS2k was never particularly usable or workable (AS2k5 rocks!)
· The attribute/hierarchy model of AS2k5 handles the Cartesian problem (any measure needs to be calculated for every possible dimension member in combination with every other dimension member) in a very different way than a dimensional model does.
there are still some limitations of the distinct count measure. This is due to the fact that:
· Distinct count measures require quite a bit of memory to store information on the property being counted (e.g. a user). This is eased by 64-bit AS2k5 with its vastly larger address space, but there is only so much memory before the server "runs out".
· The attribute/hierarchy design of AS2k5 allows distinct counts to be calculated by referring to the hierarchy or path of your dimension. That is, the hierarchy of a dimension (e.g. Browser > Internet Explorer > Internet Explorer 7.0 > Internet Explorer 7.0 SP1) allows a distinct count to be calculated faster by incrementing the count up the hierarchy.
· The vast majority of the time, distinct counts are calculated at run time rather than at processing time.
The advantage of Analysis Services (and any OLAP system in general) is that it can pre-aggregate quite a bit of data. That way, when a query is sent to the OLAP cube, the information has already been calculated for you. Analysis Services' proprietary data design also allows the engine to quickly calculate metrics that are not pre-calculated. The problem is that with distinct counts it is not simple to pre-aggregate the various combinations of distinct counts by the many dimensions. And even if you are able to pre-aggregate some distinct counts, it is impossible to do this for all dimension members. After all, if you were to perform all of the possible cross-correlations of dimension members with each other, the result set would be far greater than the original data set. Because of these issues, distinct counts are typically calculated at run time, i.e. at the moment the user asks the question.
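The core difficulty can be sketched in a few lines of Python (the fact rows are hypothetical, not from any real cube): additive measures such as row counts can be rolled up by summing sub-totals, but distinct counts cannot, which is why the engine must keep the underlying ID sets or recompute at query time.

```python
# Hypothetical fact rows: (day, visitor_id). The same visitor can appear
# in many rows, both within a day and across days.
facts = [
    ("mon", "alice"), ("mon", "alice"), ("mon", "bob"),
    ("tue", "alice"), ("tue", "carol"),
]

# Additive measures (e.g. row counts) aggregate by summing sub-totals.
rows_per_day = {}
for day, visitor in facts:
    rows_per_day[day] = rows_per_day.get(day, 0) + 1
total_rows = sum(rows_per_day.values())          # 5 -- correct

# Distinct counts are non-additive: summing per-day distinct counts
# double-counts alice, who visited on both days.
visitors_per_day = {}
for day, visitor in facts:
    visitors_per_day.setdefault(day, set()).add(visitor)
naive_total = sum(len(v) for v in visitors_per_day.values())   # 4 -- wrong
true_total = len(set.union(*visitors_per_day.values()))        # 3 -- right
```

This is exactly why a pre-aggregated per-partition distinct count cannot simply be summed at query time the way a sum or count can.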
Understanding these issues allows us to note ways of optimizing the OLAP distinct count. These techniques are applicable to both AS2k and AS2k5, though the terminology used here is for AS2k5. The basic rundown of these distinct count optimization techniques is to:
· Create a separate measure group for each distinct count measure
· Create custom aggregations
· Create a partitioning strategy for your OLAP cube allowing yourself to “distribute” the data.
Create a separate measure group for each distinct count measure
As noted above, a separate measure group should be created containing only the distinct count measure. Create a cube to join the "distinct" measure group with all of the other measure groups. This does mean that you will have multiple copies of what is essentially the same measure group, differing only in the measures they contain (in particular, the distinct count measures). The reasons to do this are that:
· The distinct count measure is architecturally different from other measures such as sum, count, min, or max.
· Given this architectural difference, adding other measures alongside the distinct count will dramatically increase the size of the cube, resulting in slower processing and slower query times.
Create your own custom aggregations
Creating custom aggregations forces the Analysis Services engine to pre-calculate some aggregations while not calculating others. While this can be done for any and all measures, it is particularly effective for distinct count measures. This is because distinct counts are typically the most expensive at the summary level (the highest level of the hierarchy), so creating custom aggregations at that level allows the Analysis Services engine to calculate them at processing time, yielding much faster queries. For example, within the context of web analytics, users will typically query for day, week, and monthly reports. Hence, if one created an OLAP cube for web analytics, one could build custom aggregations for those time periods.
Using the view of the sample Budget cube from AS2k, notice that you can set the Time dimension to aggregate at the "Bottom Level Only" instead of "Standard". In this particular case, this tells the Analysis Services processing engine to design aggregations for the "month" level only instead of taking the standard approach.
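The effect of a custom aggregation can be sketched in Python (the dates and visitor IDs are hypothetical): distinct-visitor sets are built once, at "processing time", for the one level users actually query (month), while any other level falls back to a run-time scan.

```python
from collections import defaultdict

# Hypothetical day-grain fact rows: (date, visitor_id).
facts = [
    ("2006-01-03", "alice"), ("2006-01-15", "bob"),
    ("2006-02-01", "alice"), ("2006-02-09", "carol"),
]

# "Processing time": build the month-level aggregation once, up front.
month_agg = defaultdict(set)
for date, visitor in facts:
    month_agg[date[:7]].add(visitor)          # key on "YYYY-MM"

def distinct_visitors(month=None):
    if month is not None:
        # Month-level queries are answered from the pre-built aggregation.
        return len(month_agg[month])
    # No aggregation was designed at the grand-total level, so the engine
    # must scan the facts at run time -- the expensive path.
    return len({visitor for _, visitor in facts})

distinct_visitors(month="2006-01")   # 2, answered from the aggregation
distinct_visitors()                  # 3, computed by a run-time scan
```

The trade-off mirrors the cube design decision above: you spend processing time and disk on the levels you know will be queried, and accept run-time cost everywhere else.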
Create a partitioning strategy for your OLAP cube
As noted above, creating a partitioning strategy for your OLAP cube lets you "distribute" the data across multiple files, allowing more threads to calculate the distinct counts faster. Typically, for maintenance, processing, and querying purposes, if you have a medium or larger volume of data within your fact tables, you will partition the data by regular intervals. For example, some enterprise-size OLAP cubes are partitioned by day because the vast majority of the queries are done at the day level.
For AS2k, we determined an optimization technique that is quite counter-intuitive. On top of creating partitions by a time period, you create additional partitions by some other dimension, partitioning the data even further (within AS2k5, this implies that you need to use query binding rather than table binding for your partitions). What is counter-intuitive is that you create partitions where the distinct IDs (e.g. unique visitors) are spread out across all of the partitions. That is, you partition by dimension(s) that have the distinct IDs repeated in most or all partitions. This is counter-intuitive because most would try to partition by a dimension that minimizes the distinct ID replication across the partitions, which may in fact be the better choice for AS2k5. Either way, at query time, this design forces the Analysis Services engine to use multiple threads to calculate distinct counts instead of just one thread. That is, you now have distributed query calculations running in parallel. Empirical analysis showed a four- to ten-fold improvement in distinct count queries using this design.
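The mechanics can be sketched in Python (the partition key and fact rows are hypothetical, and the thread pool stands in for the Analysis Services query threads): each partition is scanned by its own worker to produce a partial set of IDs, and the final distinct count is the size of the union of those partial sets.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical fact rows: (product, visitor_id).
facts = [("prod_a", "alice"), ("prod_b", "alice"),
         ("prod_a", "bob"),   ("prod_c", "carol")]

# Partition by product: note that the same visitor (alice) appears in
# more than one partition -- the IDs are deliberately spread out.
partitions = {}
for product, visitor in facts:
    partitions.setdefault(product, []).append(visitor)

def scan(rows):
    # One worker per partition builds that partition's partial ID set.
    return set(rows)

with ThreadPoolExecutor() as pool:
    partial_sets = list(pool.map(scan, partitions.values()))

# The partial counts cannot be summed (alice would be double-counted);
# the final answer is the size of the union of the partial sets.
distinct = len(set().union(*partial_sets))   # 3
```

The parallel scan is where the speed-up comes from; the union at the end is the extra work the engine must do because distinct counts are non-additive.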
Partitioning Strategy Example
Going back to the grocery market basket analysis mentioned above, the question to be answered is "how many customers do I have that have purchased multiple products within a grocery store?" Using this example, the partitioning choices are:
· Partition by Users (minimize ID replication throughout partitions)
· Partition by Products (replicate user IDs throughout as many partitions as possible)
Below is a graphical representation of three people (beige, orange, and blue) purchasing seven different products (a-g). If you use no partitioning strategy, a single thread will go through these 15 "facts" to calculate that there are:
· 3 distinct users
· 1 user that purchased product g and another that purchased product f
· 2 users that purchased b
Note, there is one thread for one partition to analyze the data.
Another option is to partition by users to minimize replication of users within these partitions:
Notice that there are now three threads to help perform this calculation. As you can see, this particular partitioning strategy will be much faster than having no partitioning. Within the context of AS2k5, this may actually be the optimal solution due to the redesign of Analysis Services around an attribute/hierarchy model.
The optimal partition strategy for AS2k (and potentially for AS2k5) is to partition by a dimension in which the users are spread evenly across all partitions.
Referring to the above graphic, now we have seven threads to quickly determine the distinct counts.
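The two strategies can be compared in a short Python sketch. The fact rows below are a hypothetical reconstruction consistent with the example (15 facts, 3 users, products a-g, 2 users buying b, one user buying g and another buying f); both strategies return the same answer, and what differs is how many partitions (and therefore threads) can work on the query in parallel.

```python
# 15 hypothetical (user, product) facts matching the example above.
facts = ([("beige", p) for p in "abcdef"] +
         [("orange", p) for p in "abcdg"] +
         [("blue", p) for p in "acde"])

def count_distinct_users(partition_key):
    # Split the facts into partitions, building one partial user set per
    # partition (one thread per partition in the real engine).
    parts = {}
    for user, product in facts:
        parts.setdefault(partition_key(user, product), set()).add(user)
    # Return (number of partitions, distinct count from the union).
    return len(parts), len(set().union(*parts.values()))

count_distinct_users(lambda u, p: u)   # (3, 3): partition by user
count_distinct_users(lambda u, p: p)   # (7, 3): partition by product
```

Partitioning by user yields three partitions with no user replication; partitioning by product yields seven partitions with users replicated across them, which is the counter-intuitive layout that lets more threads share the query.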
As you can see from the above example, by partitioning so that the distinct users are spread across multiple partitions, there are more threads and less data for each thread to process when calculating the distinct counts. You will need to be careful with your partitioning strategy: you run the risk of pegging the processors if too many threads are running (hence the importance of multi-proc OLAP machines). In the above example, the CPU utilization for a single enterprise-size query by a single user was:
· One partition: 25%
· Three partitions: 50%
· Seven partitions: 75-80%
While these numbers are approximations and the CPU was pegged for only milliseconds, as your system becomes larger and is used by more people, it will be that much more important to take this into account.