Jem’Hadar Warrior: “I am a Jem’Hadar. He is a Vorta.
It is the order of things.”
Captain Sisko: “Do you really want to give up your life for the ‘order of things’?”
Jem’Hadar Warrior: “It is not my life to give up, Captain – and it never was.”
— From Rocks and Shoals, Star Trek DS9, written by Ronald D. Moore
—–
Authors: Denny Lee, Akshai Mirchandani, and Dianne Wood
Earlier in the year (2013), Dianne Wood and I presented Yahoo! Big Data and Microsoft BI – Bigger and Better Together at the PASS Business Analytics Conference. One of the common questions/comments from that session (outside of my overuse of Star Trek references) concerned the section called “It is the order of things”.
The Importance of Sort Order
The important callout here – within the context of Analysis Services – is that the way the data is sorted can have a large impact on how well the Analysis Services Multidimensional engine compresses it. For example, for the fact partition noted in the table below, there is a 43% difference in size between the sorted and unsorted data. When you are working with very large Analysis Services cubes, this difference translates into hundreds of GBs to terabytes of wasted space. And because the Analysis Services Multidimensional storage engine can be disk-I/O intensive, using more disk also heavily impacts query and processing performance.
So while the theme is to sort your data before Analysis Services processes it, the inevitable questions are: why is this happening, and what can I do about it?
Why is this Happening?
To understand why this is happening, we need to go one level deeper into how data is organized within Analysis Services. Data within an Analysis Services partition is organized into segments – groups of 64K records. While a segment is processed, Analysis Services creates its dictionary/mapping files and compresses the data. If the data is not sorted (as in the figure below), it becomes more difficult to compress.
But if the data is already sorted, it becomes easier to compress the data because similar attributes / rows within the segment can be represented with fewer values.
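You can see the effect with a few lines of Python. Zlib is not the codec Analysis Services uses, but the same principle applies to any dictionary- or run-based compression scheme: sorted data produces long runs of repeated values that compress far better. A minimal sketch with simulated (hypothetical) fact rows:

```python
import random
import zlib

random.seed(42)

# Simulate one segment of a fact column: 64K rows referencing
# 1,000 distinct dimension members (values are made up for illustration).
rows = [random.randrange(1000) for _ in range(65536)]

def compressed_size(values):
    """Serialize each value as a fixed-width 4-byte field, then deflate."""
    raw = b"".join(v.to_bytes(4, "little") for v in values)
    return len(zlib.compress(raw))

unsorted_size = compressed_size(rows)
sorted_size = compressed_size(sorted(rows))
print(f"unsorted: {unsorted_size:,} bytes  sorted: {sorted_size:,} bytes")
```

The sorted version compresses to a fraction of the unsorted size, because identical dimension keys end up adjacent to one another.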
The basic rule here is that the best compression is obtained by ensuring that the range of values for each attribute (e.g. dimension members for a dimension column of a fact table) is as small as possible within each segment. If an attribute has a large range of DataIDs (the internal identifier that Analysis Services uses to uniquely identify every single attribute member), you need more bits to encode that attribute per segment. For example, if each segment contained DataIDs {5, …, 5,000,000} then you would need ~24 bits to encode that range. But if each segment has a smaller range of DataIDs (e.g. {5, …, 10,000}, {10,000, …, 100,000}, …) then you will require fewer bits to encode the data (e.g. {5, …, 10,000} needs ~14 bits).
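The arithmetic above is easy to check. `bits_needed` below is a hypothetical helper, not an SSAS API; it assumes each value is encoded as an offset from the segment's minimum DataID:

```python
def bits_needed(lo, hi):
    """Bits required to encode any DataID in the inclusive range [lo, hi],
    stored as an offset from the segment minimum."""
    return max(1, (hi - lo).bit_length())

# One wide segment spanning the whole DataID range:
print(bits_needed(5, 5_000_000))     # 23 bits per value

# Sorted data yields segments with much narrower ranges:
print(bits_needed(5, 10_000))        # 14 bits per value
print(bits_needed(10_000, 100_000))  # 17 bits per value
```

Narrowing the per-segment range from ~5,000,000 values to ~10,000 nearly halves the bits needed per value before any further compression is applied.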
The other advantage of ordering the data is that if your queries happen to filter on the attribute you ordered by, you can get much better segment elimination. For example, if you order by Date, then queries that scan for particular dates can simply consult the Date index and avoid I/O for segments that do not contain that Date.
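Segment elimination can be modeled in a few lines. This is a toy model, not the SSAS implementation, assuming each segment records the min and max key it contains:

```python
def segments_to_scan(sorted_keys, target, segment_size=65536):
    """Indices of segments whose [min, max] key range could contain target.

    Elimination only works when the keys really are sorted on disk,
    which is exactly why ordering the incoming fact data matters.
    """
    hits = []
    for i in range(0, len(sorted_keys), segment_size):
        segment = sorted_keys[i:i + segment_size]
        if segment[0] <= target <= segment[-1]:
            hits.append(i // segment_size)
    return hits

# Toy data: 10 "date" keys x 100 rows each, with tiny 250-row segments.
keys = [day for day in range(10) for _ in range(100)]
print(segments_to_scan(keys, 7, segment_size=250))  # [2, 3]
```

With sorted keys, a filter on one date touches only the two segments whose ranges include it; the other segments are skipped without any I/O.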
For more information on Analysis Services segments, please refer to Analysis Services: Performance Implications of the Architecture and/or Microsoft SQL Server 2008 Analysis Services Unleashed.
What can I do about this?
Therefore, based on the internal data structures (i.e. segments), the general rules are:
- You get the best compression if the range of values for each attribute is as small as possible in each segment.
- Ordering by one attribute can potentially hurt the compression of another attribute – so you have to choose what makes the most sense for your data.
In practical terms, you order the fact data coming into Analysis Services Multidimensional from the most repetitive attribute to the least. For example, if I have a table of Date, Product, Sales, and Time, the sort order would be:
- Order by Date
- Then by Time (assuming there are many more products sold than distinct time values, which is typically the case)
- Then by Product (i.e. there are multiple sales per product)
- And then by Sales
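The steps above amount to sorting on a composite key, most-repetitive attribute first. In practice you would push this into the source query's ORDER BY; here is a minimal in-memory sketch with made-up rows:

```python
# Hypothetical fact rows: (Date, Time, Product, Sales). In a real cube
# the ordering happens in the relational source, not in Python.
facts = [
    ("2013-06-02", "10:00", "Widget", 19.99),
    ("2013-06-01", "09:00", "Gadget", 5.00),
    ("2013-06-01", "09:00", "Widget", 12.50),
    ("2013-06-02", "09:00", "Gadget", 7.25),
]

# Composite sort key: Date first, then Time, then Product, then Sales.
facts.sort(key=lambda row: (row[0], row[1], row[2], row[3]))
for row in facts:
    print(row)
```

After the sort, all rows for a given date are contiguous, all rows for a given time within that date are contiguous, and so on down the key, which is what keeps each attribute's per-segment value range small.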
But like all things performance related, the usual answer is “test, test, test!”
Enjoy!
This. Is. Awesome.
Even though we partition by day/month, just by adding a simple ‘order by’ we were able to reduce our size by nearly 25%!
Wow – this is pretty awesome! Glad to hear that it helped! 🙂
Hi Denny, I know I’m 3 years late to the party here, but I have a question regarding this process. I asked it on Stack Overflow with little help. (http://stackoverflow.com/questions/33508553/how-do-i-sort-partition-data-when-using-query-binding-on-an-ssas-cube). How do you actually perform the sorting? Is it a setting in SSAS, or are you implying the sort has to happen on your source data? Because of how SSAS wraps your queries, an ORDER BY is not possible – so is it purely dependent on your clustered index? Thank you in advance if you take the time to come back to this.
No worries – the implication is that you have to sort the data within the source. That said, you can tell SSAS to include an ORDER BY clause by using the “by query” option vs. the “by table” option. By the same token, if your clustered index is already sorting the data in the correct order, that would also do the trick. HTH!
Thank you for getting back to me. This all makes sense, except for one thing. Is there a trick to getting SSAS to issue the ORDER BY clause? I use query binding for many of my partitions, and if you put an order by in the query, it will fail during processing. This is because SSAS makes your query a sub-query and the order by becomes a syntax error. An example is in the SO post I linked above.
Unless you have a trick to getting the ORDER BY in there – I’m assuming the only reliable method is using your relational indexes to force the sort. Thanks again for the input.
Oh sorry, it’s been a while and I just realized that you are right and I was wrong to say that the ORDER BY statement could be included. The two ways you can do it are:
1) Just as you noted, rely on the clustered index so that the query hits it and the data comes back already ordered.
2) With SQL Server as your source, create an indexed view so that the query you send from SSAS returns the data in the desired order.