Yahoo! 24TB SSAS Big Data Case Study + Slides

In my post from last year, I asked the rhetorical question “What’s so BIG about ‘Big Data’?” There, I had the honor of announcing the largest known Analysis Services cube – at 24TB – sourced from 2PB of data in a huge Hadoop cluster.

For those who attended the PASS 2011 session Tier-1 BI in the world of Big Data, Thomas Kejser, Kenneth Lieu, and I were honored to discuss the details surrounding this uber-cube. At that time, I had promised the case study was only months away…

Alas, it took a little while longer to get the case study out – 13 months – but nevertheless, I am proud to re-announce (I did tweet it last week) that the Yahoo! 24TB Analysis Services / Big Data case study has been published: Yahoo! Improves Campaign Effectiveness, Boosts Ad Revenue with Big Data Solution.

For those who like a graphical view of this case study, embedded below is an excerpt of the Yahoo! TAO Case Study from the above-mentioned PASS 2011 session.

Enjoy!

8 thoughts on “Yahoo! 24TB SSAS Big Data Case Study + Slides”

  1. Great article.

    I guess my only question concerning the 24TB cube: how does Yahoo! process the dimensions in an efficient way? ProcessAdd? No flexible aggs?

    • Thanks Jesse. In general, the idea is that you run ProcessAdd as your primary mechanism and then run ProcessUpdate on a weekly or monthly basis. You do your best to set up all dimension data as slowly changing dimensions type 2 (SCD-2) so that it is possible to add new members without needing to worry about flexible aggregations. But as this is virtually impossible to do (business-wise, anyway), running ProcessUpdate periodically is the best option; a minimal XMLA sketch of this pattern appears after the comments. HTH!

  2. Pingback: Case study of Yahoo!’s 24-terabyte OLAP cube - Kővári Attila’s professional blog - TechNetKlub

  3. This is awesome stuff!! Mind-blowing work, Denny. If you don’t mind, could you tell me more about the cube:

    What is the size of your largest dimension, and how much time does it take to run ProcessAdd/ProcessUpdate on it? How many partitions does the cube have, and what is the average size of an individual partition? For query optimization, do you disable prefetching and change the cache ratio parameter? Do all the queries hit the aggregations? How many partitions does an average query scan? Do you have any query that scans the entire cube at once, and if so, how much time does that query take? Thanks a lot in advance.

    • The dimension sizes are in the hundreds of thousands to low millions of members. We try to minimize the ProcessUpdates because they take a very long time. The ProcessAdd doesn’t normally take that long since it is only adding thousands of new values. The query optimizations can sometimes be helpful, but as a general statement, we do not need those configurations. As for the queries hitting aggregations – not all of them do, as that would be quite difficult to achieve, but the aggs are hit most of the time. Queries that hit the entire cube admittedly do take a while, but those are very rare queries, eh?! HTH!

  4. Pingback: Why all this interest in Spark? | Denny Lee

  5. Pingback: Yahoo! 24TB SSAS Cube – Big Data Case Study + Slides | Clint Huijbers' Blog

  6. Pingback: To Spark … and Beyond! | Denny Lee
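For readers curious about the processing pattern described in the comment replies above, below is a minimal XMLA sketch of the daily ProcessAdd / periodic ProcessUpdate approach. The object IDs (TAOCube, Dim Campaign) are hypothetical placeholders, not the actual Yahoo! cube’s object names.

```xml
<!-- Daily incremental processing (hypothetical object IDs).
     ProcessAdd appends newly arrived dimension members and does not
     invalidate existing flexible aggregations. -->
<Batch xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Parallel>
    <Process>
      <Object>
        <DatabaseID>TAOCube</DatabaseID>
        <DimensionID>Dim Campaign</DimensionID>
      </Object>
      <Type>ProcessAdd</Type>
    </Process>
  </Parallel>
</Batch>

<!-- Weekly/monthly catch-up (hypothetical object IDs).
     ProcessUpdate picks up members that changed in place, but it drops
     any flexible aggregations built on the changed attributes. -->
<Batch xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Parallel>
    <Process>
      <Object>
        <DatabaseID>TAOCube</DatabaseID>
        <DimensionID>Dim Campaign</DimensionID>
      </Object>
      <Type>ProcessUpdate</Type>
    </Process>
  </Parallel>
</Batch>
```

After the periodic ProcessUpdate, a ProcessIndexes on the affected measure-group partitions rebuilds the dropped flexible aggregations, which is one reason modeling dimensions as SCD-2 (so ProcessAdd suffices day to day) keeps the maintenance window small.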
