Jem’Hadar Warrior: “I am a Jem’Hadar. He is a Vorta.
It is the order of things.”

Captain Sisko: “Do you really want to give up your life for the ‘order of things’?”

Jem’Hadar Warrior: “It is not my life to give up, Captain – and it never was.”

– From Rocks and Shoals, Star Trek DS9, written by Ronald D. Moore


Authors: Denny Lee, Akshai Mirchandani, and Dianne Wood

Earlier in the year (2013), Dianne Wood and I had presented Yahoo! Big Data and Microsoft Bi – Bigger and Better Together and the PASS Business Analytics Conference.  One of the common questions / comments from that session (outside of my overuse of Star Trek references) was section called the “It is the order of things”.

The Importance of Sort Order

The important call out here – within the context of Analysis Services – is that the way the data is sorted can ultimately have a large impact on the ability for the Analysis Services Multidimensional engine to compress the data.  For example, for the fact partition noted in the table below, there is a 43% difference in size when the data was sorted vs. not-sorted.  When you are working with very large Analysis Services cubes, this difference is in the form of hundreds of GBs to terabytes of wasted space.  Because the Analysis Services Multidimensional Storage Engine can be disk I/O intensive, using up more disk will also heavily impact query and processing performance as well.


So while the the theme is sort your data before Analysis Services processes the data, the invariable questions are why is this happening and what can I do about it?

Why is this Happening?

To understand why this is happening we need to go one level deeper into how data is organized within Analysis Services.  Data within an Analysis Services partition is organized in terms of segments or groups of 64K records.  During the processing of a segment, the process of creating dictionary / mapping files and compressing the data is performed.  If the data is not sorted (like the figure below), it becomes more difficult to compress the data.


But if the data is already sorted, it becomes easier to compress the data because similar attributes / rows within the segment can be represented with fewer values.


The basic rule here is that the best compression is obtained by ensuring that the range of values for each attribute (e.g. dimension members for a dimension column of a fact table) is as small as possible for each segment.  If an attribute has a large range of DataIDs (the internal identifier that Analysis Services uses to uniquely identify every single attribute), you would need more bits to encode that attribute per segment.  For example, if each segment contained DataIDs {5, …, 5,000,000 } then you would need ~24bits to encode that range.  But if each segment has a smaller range of DataIDs (e.g. {5, …, 10,000}, {10,000, …, 100,000}, …) then you will require less bits to encode this data (e.g. {5, …, 10,000} need ~14bits).

The other advantage of ordering the data is that if your queries also just happen to be filtered on an attribute that you ordered by, then you can get much better segment elimination. E.g. if you order by Date, then queries that scan for particular dates would simply look at the index of Date and avoid I/Os for segments that simply no longer contain that Date

For more information on Analysis Services segments, please refer to Analysis Services: Performance Implications of the Architecture and/or Microsoft SQL Server 2008 Analysis Services Unleashed.

What can I do about this?

Therefore, based on the internal data structures (i.e. segments), the general rules are:

  1. You get the best compression if the range of values for each attribute is as small as possible in each segment.
  2. Ordering by one attribute can potentially hurt the compression of another attribute – so you have to choose what makes the most sense for your data

In practical terms, you order the fact data coming into Analysis Services Multidimensional by “most repetitive”.  For example, if I have a table of Date, Product, Sales, and Time, then the sort order would be:

  • Order by Date
  • Then by Time (if there are a lot more products sold than time which is typically the case)
  • Then by Product (I.e. There are multiple sales per product)
  • And then Sales

But like all things performance related, the usual answer is “test, test, test!”