Denny Lee

Analysis Services Multidimensional: It is the Order of Things

Jem’Hadar Warrior: “I am a Jem’Hadar. He is a Vorta.
It is the order of things.”

Captain Sisko: “Do you really want to give up your life for the ‘order of things’?”

Jem’Hadar Warrior: “It is not my life to give up, Captain – and it never was.”

Rocks and Shoals, Star Trek Deep Space 9 (written by Ronald D. Moore)

Authors: Denny Lee, Akshai Mirchandani, and Dianne Wood

Earlier in the year (2013), Dianne Wood and I presented Yahoo! Big Data and Microsoft Bi – Bigger and Better Together and the PASS Business Analytics Conference.  One of the common comments (outside of my overuse of Star Trek references) was the section, “It is the order of things”.

The Importance of Sort Order

How data is sorted significantly impacts the Analysis Services Multidimensional (SSAS) engine’s ability to compress the data.  For example for fact.data, there is a 43% difference in size when the data was sorted vs. not sorted.  When you are working with very large cubes, this difference translates to terabytes of wasted space.  Because the SSAS engine can be disk I/O intensive, more disk space impacts query and processing performance as well.

Comparing sorted vs. not sorted on the Analysis Services cube file sizes
Comparing sorted vs. not sorted on the Analysis Services cube file sizes

Why is this Happening?

To understand why this is happening, we need to understand how data is organized within Analysis Services.  Data within an Analysis Services partition is organized in terms of segments or groups of 64K records.  During the processing of a segment, the process of creating dictionary/mapping files and compressing the data is performed.  If the data is not sorted (like the figure below), it becomes more difficult to compress the data.

What happens with Analysis Services segments when they are not sorted
What happens with segments when they are not sorted

But if the data is already sorted, it becomes easier to compress it. Similar attributes/rows within the segment can be represented with fewer values.

What happens with segments when they are sorted
What happens with segments when they are sorted

Therefore, the best compression is obtained by ensuring that the range of values for each attribute. For example, the dimension members for a column of a fact table are as small as possible for each segment.  If an attribute has a large range of DataIDs, you need more bits to encode that attribute per segment. 

DataIDs are the internal identifiers that Analysis Services uses to uniquely identify every single attribute

As noted in the following table, file sizes are smaller when encoding with smaller ranges.

DataID rangeEncoding (size)
{5, …, 5,000,000} ~24 bits
{5, …, 10,000}, {10,000, …, 100,000}~14 bits
Encoding size in relation to Analysis Services segment DataID range

If you filter on an attribute that you order by, you will get much better segment elimination. For example, if you order by date, then scans by date would look at the index by date. Thus avoiding I/Os for segments that do not contain that date.

For more information on Analysis Services segments, please refer to

What can I do about this?

Therefore, based on the internal data structures (i.e., segments), the general rules are:

  1. You get the best compression if the range of values for each attribute is as small as possible in each segment.
  2. Ordering by one attribute can potentially hurt the compression of another attribute. So you have to choose and experiment to determine what makes the most sense for your data

In practical terms, you order the fact data coming into Analysis Services Multidimensional by “most repetitive.”  For example, if I have a table of Date, Product, Sales, and Time, then the sort order would be:

  • Order by Date
  • Then by Time (if there are a lot more products sold than time, which is typically the case)
  • Then by Product (I.e., There are multiple sales per product)
  • And then Sales

But like all things performance-related, the usual answer is “test, test, test!”

Enjoy!

6 responses to “Analysis Services Multidimensional: It is the Order of Things”

  1. This. Is. Awesome.

    Even though we partition by day/month, just adding a simple ‘order by’ we were able to reduce our size by nearly 25%!

    1. Wow – this is pretty awesome! Glad to hear that it helped! 🙂

  2. Michael Biasetti Avatar
    Michael Biasetti

    Hi Denny, I know I’m 3 years late to the party here, but I have a question regarding this process. I asked it on Stack Overflow with little help. (http://stackoverflow.com/questions/33508553/how-do-i-sort-partition-data-when-using-query-binding-on-an-ssas-cube). How do you actually perform the sorting? Is it a setting in SSAS, or are you implying the sort has to happen on your source data? Because of how SSAS wraps your queries, an ORDER BY is not possible – so is it purely dependent on your clustered index? Thank you in advance if you take the time to come back to this.

    1. No worries – the implication is that you have to sort the data within the source. Saying this, you can tell SSAS to include an ORDER BY clause by using the “by query” option vs. “By table” option. By the same token, if you clustered index is already sorting the data in the correct order per se, that would also do the trick. HTH!

      1. Michael Biasetti Avatar
        Michael Biasetti

        Thank you for getting back to me. This all makes sense, except for one thing. Is there a trick to getting SSAS to issue the ORDER BY clause? I use query binding for many of my partitions, and if you put an order by in the query, it will fail during processing. This is because SSAS makes your query a sub-query and the order by becomes a syntax error. An example is in the SO post I linked above.

        Unless you have a trick to getting the ORDER BY in there – I’m assuming the only reliable method is using your relational indexes to force the sort. Thanks again for the input.

  3. Oh sorry, its been awhile and I just realized that you are right and I’m wrong for noting the ORDER BY statement could be included. The two ways you can do it are:
    1) Just as you noted, include the clustered index so that way the query is hitting it so the ORDER BY is already included
    2) For SQL Server as your source, created an indexed view so that way it can force the query to be an ORDER BY based on the query that you send from SSAS.

Leave a Reply

Discover more from Denny Lee

Subscribe now to keep reading and get access to the full archive.

Continue reading