Yahoo! 24TB SSAS Big Data Case Study + Slides

In my post from last year, I asked the rhetorical question "What's so BIG about 'Big Data'?" I had the honor of announcing the largest known Analysis Services cube, at 24 TB, whose source is 2 PB of data from a huge Hadoop cluster.

For those who attended the PASS 2011 session Tier-1 BI in the world of Big Data, Thomas Kejser, Kenneth Lieu, and I were honored to discuss the details of this uber-cube. At the time, I promised the case study was only months away…

Alas, it took a little while longer to get the case study out (13 months), but nevertheless I am proud to re-announce (I did tweet it last week) that the Yahoo! 24TB Analysis Services / Big Data case study has been published: Yahoo! Improves Campaign Effectiveness, Boosts Ad Revenue with Big Data Solution.

For those who prefer a graphical view of this case study, embedded below is an excerpt of the Yahoo! TAO case study from the above-mentioned PASS 2011 session.



  1. Great article.

    I guess my only question concerning the 24TB cube: how does Yahoo process the dimensions efficiently? ProcessAdd? No flexible aggs?

    1. Thanks Jesse. In general, the idea is that you run ProcessAdd as your primary mechanism and then do your ProcessUpdate on a weekly or monthly basis. You do your best to set up all dimension data as slowly changing dimensions type-2 (SCD-2) so that it's possible to add new members without needing to worry about flexible aggregations. But as this is virtually impossible to do (business-wise, anyway), running the ProcessUpdate periodically is the best option. HTH!
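    The SCD-2 pattern mentioned in the reply above can be sketched in plain Python. This is a hypothetical illustration (not Yahoo!'s actual ETL, and the row layout is my own invention): instead of updating a dimension row in place, you expire the current row and append a new version, so the dimension table only ever grows with inserts, which is exactly what an incremental ProcessAdd can pick up without invalidating flexible aggregations.

    ```python
    from datetime import date

    # Minimal SCD-2 sketch: each dimension row carries validity dates and a
    # current flag. An attribute change appends a new row instead of
    # updating in place, so downstream incremental loads (e.g. ProcessAdd)
    # only ever see inserts.

    def scd2_upsert(dim_rows, business_key, new_attrs, today=None):
        """Apply a type-2 change for one business key; return the active row."""
        today = today or date.today()
        for row in dim_rows:
            if row["key"] == business_key and row["is_current"]:
                if row["attrs"] == new_attrs:
                    return row              # nothing changed; keep current row
                row["is_current"] = False   # expire the old version...
                row["valid_to"] = today
                break
        new_row = {
            "key": business_key,
            "attrs": dict(new_attrs),
            "valid_from": today,
            "valid_to": None,
            "is_current": True,
        }
        dim_rows.append(new_row)            # ...and insert the new one
        return new_row

    dim = []
    scd2_upsert(dim, "advertiser-42", {"segment": "retail"}, date(2011, 1, 1))
    scd2_upsert(dim, "advertiser-42", {"segment": "travel"}, date(2012, 2, 1))
    print(len(dim))                           # → 2 (two versions of one member)
    print(sum(r["is_current"] for r in dim))  # → 1 (exactly one current row)
    ```

    The point of the insert-only design is that history is preserved as new rows rather than overwrites; it is the overwrites (true updates or deletes) that force a ProcessUpdate and drop flexible aggregations.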

  2. […] The case study of the world's largest known Analysis Services cube is now available. For those who read the BI notes there won't be much news in it, since I believe I have already written up every important technical detail. (See the Big Data topic.) In any case, anyone interested in a business-ready case study can download it here: Yahoo! 24TB SSAS Big Data Case Study + Slides […]

  3. This is awesome stuff!! Mind-blowing work, Denny. If you don't mind, could you tell me more about the cube:

    What is the size of your largest dimension? How much time does a ProcessAdd/ProcessUpdate take on it? How many partitions does the cube have, and what is the average size of an individual partition? For query optimization, do you disable prefetching and change the cache ratio parameter? Do all the queries hit the aggregations? How many partitions does an average query scan? Do you have any query that scans the entire cube at once, and if so, how long does it take? Thanks a lot in advance.

    1. The dimension sizes are in the hundreds of thousands to low millions. We try to minimize the ProcessUpdates because they take a very long time. The ProcessAdd doesn't normally take that long since it involves only thousands of new values. The query optimizations can sometimes be helpful, but as a general statement, we do not need those configurations. As for the queries hitting aggregations: not all of them, which would be quite difficult to achieve, but the aggs are hit most of the time. Queries that hit the entire cube admittedly do take a while, but those are very rare queries, eh?! HTH!

  4. […] storage to compute becomes cost prohibitive.  We have seen some pretty large extremes like the 24TB Yahoo! TAO cube whose source is 2PB in a 14PB Hadoop cluster (where Dave and I first worked together).  While we […]

  5. […] I am proud of my time with the SQL Server team and we had achieved some amazing lofty goals (e.g. Yahoo! 24TB Analysis Services cube), I had been drawn back to my statistical […]

  6. […] When SSASm was originally designed in the mid-to-late 1990s, the per-GB cost of disk was a small fraction of that of memory.  In addition, CPU and OS architectures (32-bit limitations were still prevalent), along with memory economics, limited total memory to a small fraction of what was, as a practical matter, unlimited disk storage.  SSASm was architected with these realities of the time in mind, so while it leverages memory and compression as much as possible, its design is and had to be fundamentally disk-oriented.  While this has potential query performance implications, it also means SSASm can handle huge amounts of data – the largest documented example being 24 TB, compressed from a 2 PB source at Yahoo. […]
