Jem’Hadar Warrior: “I am a Jem’Hadar. He is a Vorta.
It is the order of things.”
Captain Sisko: “Do you really want to give up your life for the ‘order of things’?”
Jem’Hadar Warrior: “It is not my life to give up, Captain – and it never was.”
— From Rocks and Shoals, Star Trek DS9, written by Ronald D. Moore
—–
Authors: Denny Lee, Akshai Mirchandani, and Dianne Wood
Earlier in the year (2013), Dianne Wood and I presented Yahoo! Big Data and Microsoft BI – Bigger and Better Together at the PASS Business Analytics Conference. One of the common questions/comments from that session (outside of my overuse of Star Trek references) concerned the section called “It is the order of things”.
The Importance of Sort Order
The important callout here – within the context of Analysis Services – is that the way the data is sorted can have a large impact on how well the Analysis Services Multidimensional engine compresses it. For example, for the fact partition noted in the table below, there is a 43% difference in size between the sorted and unsorted data. When you are working with very large Analysis Services cubes, this difference translates into hundreds of GBs to terabytes of wasted space. And because the Analysis Services Multidimensional storage engine can be disk-I/O intensive, using more disk also heavily impacts query and processing performance.
So while the theme is to sort your data before Analysis Services processes it, the inevitable questions are: why is this happening, and what can I do about it?
Why is this Happening?
To understand why this is happening, we need to go one level deeper into how data is organized within Analysis Services. Data within an Analysis Services partition is organized into segments – groups of 64K records. While a segment is processed, Analysis Services creates its dictionary/mapping files and compresses the data. If the data is not sorted (as in the figure below), it becomes more difficult to compress.
But if the data is already sorted, it becomes easier to compress the data because similar attributes / rows within the segment can be represented with fewer values.
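You can see the effect with a few lines of Python. Zlib is not the codec Analysis Services uses, but the same principle applies to any dictionary- or run-based compression scheme: sorted data produces long runs of repeated values that compress far better. A minimal sketch with simulated (hypothetical) fact rows:

```python
import random
import zlib

random.seed(42)

# Simulate one segment of a fact column: 64K rows referencing
# 1,000 distinct dimension members (values are made up for illustration).
rows = [random.randrange(1000) for _ in range(65536)]

def compressed_size(values):
    """Serialize each value as a fixed-width 4-byte field, then deflate."""
    raw = b"".join(v.to_bytes(4, "little") for v in values)
    return len(zlib.compress(raw))

unsorted_size = compressed_size(rows)
sorted_size = compressed_size(sorted(rows))
print(f"unsorted: {unsorted_size:,} bytes  sorted: {sorted_size:,} bytes")
```

The sorted version compresses to a fraction of the unsorted size, because identical dimension keys end up adjacent to one another.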
The basic rule here is that the best compression is obtained by ensuring that the range of values for each attribute (e.g. dimension members for a dimension column of a fact table) is as small as possible within each segment. If an attribute has a large range of DataIDs (the internal identifier that Analysis Services uses to uniquely identify every single attribute member), you need more bits to encode that attribute per segment. For example, if each segment contained DataIDs {5, …, 5,000,000} then you would need ~24 bits to encode that range. But if each segment has a smaller range of DataIDs (e.g. {5, …, 10,000}, {10,000, …, 100,000}, …) then you will require fewer bits to encode the data (e.g. {5, …, 10,000} needs ~14 bits).
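The arithmetic above is easy to check. `bits_needed` below is a hypothetical helper, not an SSAS API; it assumes each value is encoded as an offset from the segment's minimum DataID:

```python
def bits_needed(lo, hi):
    """Bits required to encode any DataID in the inclusive range [lo, hi],
    stored as an offset from the segment minimum."""
    return max(1, (hi - lo).bit_length())

# One wide segment spanning the whole DataID range:
print(bits_needed(5, 5_000_000))     # 23 bits per value

# Sorted data yields segments with much narrower ranges:
print(bits_needed(5, 10_000))        # 14 bits per value
print(bits_needed(10_000, 100_000))  # 17 bits per value
```

Narrowing the per-segment range from ~5,000,000 values to ~10,000 nearly halves the bits needed per value before any further compression is applied.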
The other advantage of ordering the data is that if your queries happen to filter on the attribute you ordered by, you can get much better segment elimination. For example, if you order by Date, then queries that scan for particular dates can simply consult the Date index and avoid I/O for segments that do not contain that Date.
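Segment elimination can be modeled in a few lines. This is a toy model, not the SSAS implementation, assuming each segment records the min and max key it contains:

```python
def segments_to_scan(sorted_keys, target, segment_size=65536):
    """Indices of segments whose [min, max] key range could contain target.

    Elimination only works when the keys really are sorted on disk,
    which is exactly why ordering the incoming fact data matters.
    """
    hits = []
    for i in range(0, len(sorted_keys), segment_size):
        segment = sorted_keys[i:i + segment_size]
        if segment[0] <= target <= segment[-1]:
            hits.append(i // segment_size)
    return hits

# Toy data: 10 "date" keys x 100 rows each, with tiny 250-row segments.
keys = [day for day in range(10) for _ in range(100)]
print(segments_to_scan(keys, 7, segment_size=250))  # [2, 3]
```

With sorted keys, a filter on one date touches only the two segments whose ranges include it; the other segments are skipped without any I/O.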
For more information on Analysis Services segments, please refer to Analysis Services: Performance Implications of the Architecture and/or Microsoft SQL Server 2008 Analysis Services Unleashed.
What can I do about this?
Therefore, based on the internal data structures (i.e. segments), the general rules are:
- You get the best compression if the range of values for each attribute is as small as possible in each segment.
- Ordering by one attribute can potentially hurt the compression of another attribute – so you have to choose what makes the most sense for your data.
In practical terms, you order the fact data coming into Analysis Services Multidimensional from the most repetitive attribute to the least. For example, if I have a table of Date, Product, Sales, and Time, the sort order would be:
- Order by Date
- Then by Time (assuming there are many more products sold than distinct time values, which is typically the case)
- Then by Product (i.e. there are multiple sales per product)
- And then by Sales
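The steps above amount to sorting on a composite key, most-repetitive attribute first. In practice you would push this into the source query's ORDER BY; here is a minimal in-memory sketch with made-up rows:

```python
# Hypothetical fact rows: (Date, Time, Product, Sales). In a real cube
# the ordering happens in the relational source, not in Python.
facts = [
    ("2013-06-02", "10:00", "Widget", 19.99),
    ("2013-06-01", "09:00", "Gadget", 5.00),
    ("2013-06-01", "09:00", "Widget", 12.50),
    ("2013-06-02", "09:00", "Gadget", 7.25),
]

# Composite sort key: Date first, then Time, then Product, then Sales.
facts.sort(key=lambda row: (row[0], row[1], row[2], row[3]))
for row in facts:
    print(row)
```

After the sort, all rows for a given date are contiguous, all rows for a given time within that date are contiguous, and so on down the key, which is what keeps each attribute's per-segment value range small.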
But like all things performance related, the usual answer is “test, test, test!”
Enjoy!
This. Is. Awesome.
Even though we partition by day/month, just by adding a simple ‘order by’ we were able to reduce our size by nearly 25%!
Wow – this is pretty awesome! Glad to hear that it helped! 🙂
Hi Denny, I know I’m 3 years late to the party here, but I have a question regarding this process. I asked it on Stack Overflow with little help. (http://stackoverflow.com/questions/33508553/how-do-i-sort-partition-data-when-using-query-binding-on-an-ssas-cube). How do you actually perform the sorting? Is it a setting in SSAS, or are you implying the sort has to happen on your source data? Because of how SSAS wraps your queries, an ORDER BY is not possible – so is it purely dependent on your clustered index? Thank you in advance if you take the time to come back to this.
No worries – the implication is that you have to sort the data within the source. That said, you can tell SSAS to include an ORDER BY clause by using the “by query” option vs. the “by table” option. By the same token, if your clustered index is already sorting the data in the correct order, that would also do the trick. HTH!
Thank you for getting back to me. This all makes sense, except for one thing. Is there a trick to getting SSAS to issue the ORDER BY clause? I use query binding for many of my partitions, and if you put an order by in the query, it will fail during processing. This is because SSAS makes your query a sub-query and the order by becomes a syntax error. An example is in the SO post I linked above.
Unless you have a trick to getting the ORDER BY in there – I’m assuming the only reliable method is using your relational indexes to force the sort. Thanks again for the input.
Oh sorry, it’s been a while and I just realized that you are right and I was wrong to say that the ORDER BY statement could be included. The two ways you can do it are:
1) Just as you noted, rely on the clustered index so that the query hits it and the data comes back already ordered.
2) With SQL Server as your source, create an indexed view so that the query you send from SSAS returns the data in the desired order.