Jem’Hadar Warrior: “I am a Jem’Hadar. He is a Vorta.
It is the order of things.”Captain Sisko: “Do you really want to give up your life for the ‘order of things’?”
Jem’Hadar Warrior: “It is not my life to give up, Captain – and it never was.”
Rocks and Shoals, Star Trek Deep Space 9 (written by Ronald D. Moore)
Authors: Denny Lee, Akshai Mirchandani, and Dianne Wood
Earlier in the year (2013), Dianne Wood and I presented Yahoo! Big Data and Microsoft Bi – Bigger and Better Together and the PASS Business Analytics Conference. One of the common comments (outside of my overuse of Star Trek references) was the section, “It is the order of things”.
The Importance of Sort Order
How data is sorted significantly impacts the Analysis Services Multidimensional (SSAS) engine’s ability to compress the data. For example for fact.data
, there is a 43% difference in size when the data was sorted vs. not sorted. When you are working with very large cubes, this difference translates to terabytes of wasted space. Because the SSAS engine can be disk I/O intensive, more disk space impacts query and processing performance as well.

Why is this Happening?
To understand why this is happening, we need to understand how data is organized within Analysis Services. Data within an Analysis Services partition is organized in terms of segments or groups of 64K records. During the processing of a segment, the process of creating dictionary/mapping files and compressing the data is performed. If the data is not sorted (like the figure below), it becomes more difficult to compress the data.

But if the data is already sorted, it becomes easier to compress it. Similar attributes/rows within the segment can be represented with fewer values.

Therefore, the best compression is obtained by ensuring that the range of values for each attribute. For example, the dimension members for a column of a fact table are as small as possible for each segment. If an attribute has a large range of DataIDs, you need more bits to encode that attribute per segment.
DataIDs are the internal identifiers that Analysis Services uses to uniquely identify every single attribute
As noted in the following table, file sizes are smaller when encoding with smaller ranges.
DataID range | Encoding (size) |
{5, …, 5,000,000} | ~24 bits |
{5, …, 10,000}, {10,000, …, 100,000} | ~14 bits |
If you filter on an attribute that you order by, you will get much better segment elimination. For example, if you order by date, then scans by date would look at the index by date. Thus avoiding I/Os for segments that do not contain that date.
For more information on Analysis Services segments, please refer to
- Analysis Services: Performance Implications of the Architecture
- Microsoft SQL Server 2008 Analysis Services Unleashed.
What can I do about this?
Therefore, based on the internal data structures (i.e., segments), the general rules are:
- You get the best compression if the range of values for each attribute is as small as possible in each segment.
- Ordering by one attribute can potentially hurt the compression of another attribute. So you have to choose and experiment to determine what makes the most sense for your data
In practical terms, you order the fact data coming into Analysis Services Multidimensional by “most repetitive.” For example, if I have a table of Date, Product, Sales, and Time, then the sort order would be:
- Order by Date
- Then by Time (if there are a lot more products sold than time, which is typically the case)
- Then by Product (I.e., There are multiple sales per product)
- And then Sales
But like all things performance-related, the usual answer is “test, test, test!”
Enjoy!
Leave a Reply