Slide 3
Agenda
Yahoo! Business Case for Hadoop and BI
Big Data, Fast Queries
Big Data / BI Themes
Get the Hardware Balance Right
Partitioning, Partitioning, Partitioning
Keep it Simple
It is the order of things
Slide 4
Yahoo! TAO Business Challenge
Yahoo! manages a powerful, scalable advertising exchange that includes publishers and advertisers
Slide 5
Yahoo! TAO Business Challenge
Advertisers want to get the best bang for their buck by reaching their targeted audiences effectively and efficiently
Slide 6
Yahoo! TAO Business Challenge
Yahoo! needs visibility into how consumers are responding to ads along many dimensions (web sites, creatives, time of day, user segments such as gender, age, and location) to make the exchange work as efficiently and effectively as possible
Slide 7
Yahoo! TAO Technical Requirements
Visitors to Yahoo! branded sites: 680,000,000
Ad impressions: 3,500,000,000 per day
Rows loaded: 464,000,000,000 per quarter
Refresh frequency: hourly
Average query time: <10 seconds
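A back-of-envelope check of what the requirements above imply for sustained load rate. A minimal sketch; the quarterly row count is from the slide, while the ~90-day quarter length is my assumption.

```python
# Sketch: sustained ingest rate implied by the slide's figures.
rows_per_quarter = 464_000_000_000      # rows loaded per quarter (slide)
days_per_quarter = 90                   # assumption: ~90 days in a quarter

rows_per_day = rows_per_quarter / days_per_quarter
sustained_rows_per_sec = rows_per_day / (24 * 3600)

print(f"{rows_per_day:,.0f} rows/day")
print(f"{sustained_rows_per_sec:,.0f} rows/sec sustained")
```

That works out to roughly 5 billion rows per day, or on the order of 60K rows/sec sustained around the clock, which frames the driver-throughput and parallel-loading numbers later in the deck.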
Slide 8
Yahoo! TAO Platform Architecture
How did we load so much so quickly?
[Architecture diagram, left to right:]
Data Aggregation & ETL: Hadoop, 2 PB cluster, producing 135 GB/day compressed
Data Archive & Staging: Oracle 11G RAC, 1.2 TB/day; File 1 ... File N map to Partition 1 ... Partition N
BI Server: SQL Server Analysis Services 2008 R2, 24 TB cube per quarter
Slide 9
Yahoo! TAO Platform Architecture
Queries at the "speed of thought"
BI Query Servers: SQL Server Analysis Services 2008 R2; 24 TB cube per quarter holding 464B rows of event-level data (42 dimensions, 296 attributes, 278 measures)
Ad hoc query/visualization: Tableau Desktop 7
Optimization application: custom J2EE app
Average query times: 2 seconds and 5 seconds for the two query front ends
Slide 10
Yahoo! TAO Return on Investment
For campaigns optimized using TAO, advertisers spent more with Yahoo! than before
For campaigns optimized using TAO, eCPMs (revenue) increased
Slide 11
Yahoo! TAO Return on Investment
Yahoo! TAO exposed customer segment performance to campaign managers and advertisers for the first time! No longer "flying audience blind"
Slide 12
Yahoo! TAO Future Direction
Increase segments by 3x
Increase data size and the cartesian product of dimensions
No longer doing distinct count
Built frequency reports and sampling to deliver this due to the inherent complexity!
Current Challenges
Hadoop to SSAS cube load path (more later)
External access to cubes
More disk due to the need for more IO
Slide 15
Extracting the Data
File generation: Hadoop jobs create many files that are exported to disk in tabular format
File staging: files are pushed to a staging folder for relational DB access
Oracle external tables: generate external tables that point to the staged files
No need to import the data
Processing is slow
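The external-table step above can be sketched as DDL generation over the staged files. This is a hypothetical sketch: the table name, column list, directory object, and file names are all illustrative, not from the deck.

```python
# Sketch: emit Oracle external-table DDL for a set of staged tabular dump files.
def external_table_ddl(table, columns, directory, files):
    cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns)
    file_list = ", ".join(f"'{f}'" for f in files)
    return (
        f"CREATE TABLE {table} (\n  {cols}\n)\n"
        "ORGANIZATION EXTERNAL (\n"
        "  TYPE ORACLE_LOADER\n"
        f"  DEFAULT DIRECTORY {directory}\n"
        "  ACCESS PARAMETERS (FIELDS TERMINATED BY '\\t')\n"
        f"  LOCATION ({file_list})\n"
        ") REJECT LIMIT UNLIMITED"
    )

# Illustrative names only; the real schema is not shown in the deck.
ddl = external_table_ddl(
    "tao_impressions_ext",
    [("event_ts", "DATE"), ("site_id", "NUMBER"), ("impressions", "NUMBER")],
    "TAO_STAGE_DIR",
    ["file_1.tsv", "file_2.tsv"],
)
print(ddl)
```

The appeal is exactly what the slide says: no import step, since Oracle reads the staged files in place; the cost is slower query-time processing against those files.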
Slide 16
AS on Oracle Case
Throughput loading SQL Server Analysis Services from Oracle SQL, by driver:
Oracle OLEDB: 10K rows/sec
SSIS connector: 20K rows/sec
Fastest measured path: 100K rows/sec
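To see why the driver choice matters at this scale, the throughput figures above can be turned into wall-clock load times. A sketch under stated assumptions: the per-driver rates are from the slide, while the daily row count (quarterly rows over ~90 days) and the 90-thread figure (from the design notes later in the deck) are my framing.

```python
# Sketch: wall-clock time to load one day's rows at each driver's throughput.
rows_per_day = 464_000_000_000 / 90     # est. daily rows from the quarterly figure

rates = {
    "Oracle OLEDB": 10_000,             # rows/sec (slide)
    "SSIS connector": 20_000,           # rows/sec (slide)
    "fastest path": 100_000,            # rows/sec (slide)
}

for name, rows_per_sec in rates.items():
    single_thread_hours = rows_per_day / rows_per_sec / 3600
    with_90_threads = single_thread_hours / 90   # parallel loading, per the notes
    print(f"{name}: {single_thread_hours:,.0f} h single-threaded, "
          f"{with_90_threads:.1f} h across 90 threads")
```

Even the fastest single connection needs about 14 hours for a day of data, which is why the architecture leans so heavily on parallel partition loads.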
Slide 19
Partitions
Yahoo Example: "Fast" Oracle Load
• Data is streamed into Oracle via files
• To get max processing, 30 threads are fired because all T (temp) partitions are processed concurrently
• Super fast data loads
• Problem is that it requires constant merging of partitions
Files are streamed in as they become available (10/10/10 T360772, T360773, ... T361645), loaded into matching Oracle 10g temp partitions and SSAS partitions, then merged into a single 10/10/10 partition.
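The streaming-load pattern above can be sketched as a thread pool over incoming temp partitions followed by a merge. A minimal sketch: the partition names echo the slide, but the load function is a stand-in for the real Oracle INSERT work.

```python
# Sketch: 30 concurrent temp-partition loads, then the merge the slide warns about.
from concurrent.futures import ThreadPoolExecutor

# T-partitions arrive as files become available (names as on the slide).
incoming = [f"T{360772 + i}" for i in range(6)]

def load_temp_partition(name):
    # Stand-in: a real load would bulk-insert the staged file into
    # Oracle temp partition `name`.
    return name

with ThreadPoolExecutor(max_workers=30) as pool:    # 30 threads, per the slide
    loaded = list(pool.map(load_temp_partition, incoming))

# The recurring cost: every batch of T-partitions must be folded back
# into the single day-level partition.
day_partition = {"10/10/10": sorted(loaded)}
print(day_partition)
```

The load itself parallelizes cleanly; the merge step is the serial bottleneck, which motivates the fixed hourly partitions on the next slide.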
Slide 20
Partitions: Directly Merging Partitions
• The new model allows for fixed hourly partitions (10/10/10 00:00 through 10/10/10 23:00)
• No more streaming data, but with hourly partitions you cannot have as many threads for fast data loads, unless...
• You process multiple cubes or measure groups in parallel
The Oracle 10g hourly partitions map directly to matching hourly SSAS partitions in each measure group (Segments, Activities, Uniques).
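The hourly-partition model above recovers parallelism by multiplying partitions across measure groups. A small sketch of the resulting work units, using the group names and hours shown on the slide:

```python
# Sketch: fixed hourly partitions per measure group = the schedulable work units.
from itertools import product

measure_groups = ["Segments", "Activities", "Uniques"]      # from the slide
hourly = [f"10/10/10 {h:02d}:00" for h in range(24)]        # one partition per hour

work_units = list(product(measure_groups, hourly))
print(len(work_units))   # 3 groups x 24 hours = 72 units to process in parallel
```

With only one cube and one measure group you would have at most 24 concurrent loads; spreading the same hours across three measure groups triples the available parallelism without any partition merging.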
Slide 22
It is the order of things
"I am a Jem'Hadar. He is a Vorta. It is the order of things."
"Do you really want to give up your life for the 'order of things'?"
"It is not my life to give up, Captain – and it never was."
Rocks and Shoals, Deep Space Nine
Written by Ronald D. Moore
Slide 26
Back to Cube Dimensions
Running ProcessUpdate takes a long time to run because all of the fact partitions are re-indexed!
Minimize the likelihood by building SCD-2 dimensions
Composite key based on lowest-level unique values to represent the row
Sometimes an identity key can be just as effective, though hashing requires mapping or lookup tables
Create a surrogate key (SK) to allow for SCD-2 dimensions
The key is to keep the memory footprint of the SK small
Composite(Composite) or Hash(Composite) is good for dimensions loaded from the fact, BUT do not expect Type-2 behavior for fact-based dimensions
Important to call out restatement based on current data (high cost associated with keeping versioned history of dimension tables)
Like the NYSE, the Yahoo! ad network behaves as an exchange for display advertising
Advertisers are the buyers
Publishers (web sites) are the sellers (Yahoo! is one of the publishers)
Yahoo! needs to create the most efficient exchange possible
Performance display advertising requires that we can:
Identify the target audience for a campaign
Monitor how they behave across a number of different dimensions
Huge opportunity for optimization but difficult given the large number of discrete dimensions
The number of ad performance factors (i.e. dimensions) and the number of ad impressions per day is huge
Yahoo! branded sites attract 680 million unique visitors worldwide
3.5B performance display ad impressions served on Yahoo! exchange per day
Large many to many relationships (consumers can be a member of more than one segment)
Each consumer is a member of an average of 10 segments – explodes the data by 10x
161B rows per quarter for impression data
203B rows per quarter for segment data (compressed but # of rows processed is really 10x = 2 trillion)
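The 10x fan-out described above is worth checking with a line of arithmetic; the figures are the ones in the notes.

```python
# Sketch: segment-membership fan-out from the notes above.
stored_segment_rows = 203_000_000_000   # compressed segment rows per quarter
fan_out = 10                            # avg segments per consumer (notes)

effective_rows = stored_segment_rows * fan_out
print(f"{effective_rows:,}")            # ~2 trillion rows effectively processed
```

So while only 203B segment rows are stored, query processing effectively touches about 2 trillion row combinations per quarter, which is what makes "speed of thought" latency hard.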
Given the number of permutations, query performance needs to be speed of thought or the system is useless
Traditional ROLAP is too slow
Hundreds of dimensions, attributes and metrics create complexity
Need integration with good visualization tools to find relevant trends and performance improvement opportunities
Data needs to be fresh (from ad impression to query in less than 24 hours) or opportunities are lost
Display ad campaigns have very short timeframes (< 2 weeks)
Key design concepts are:
Use standard, off the shelf parts
Loosely coupled components (using a pull architecture)
Centralize data aggregation on grid using Hadoop
Leverage Oracle’s external table feature to make data available to SSAS with minimal latency
One-to-one match of SSAS partitions to Oracle partitions, so no aggregation is needed and partition pruning is enabled (30+ trillion rows in Oracle tables)
Maximize parallel loading (90+ threads loading in parallel)
Separate cube building from cube querying
Improvements in HW/Design
9h -> 2.5h: Change in HW: IBM x3560 M3 256GB RAM, 48 cores; EMC Clariion SAN
2.5h -> 1.25h: Use of Data Direct / Attunity drivers
Cube is complex due to nature of the ad business
Need to provide an “anything by anything” query environment to find the optimization opportunities
If queries aren’t fast, we lose the value
Need to update the cube continuously given that there’s limited time to optimize a display ad campaign (data needs to be updated 4x day at minimum)
Used SSAS aggregations extensively – cut down on Hadoop aggregations dramatically
Only 8 fact tables loaded (4 areas, 1 detail, 1 aggregate)
As opposed to an existing ROLAP application at Yahoo! that requires 3,600 facts (aggregate) tables
Doubled the eCPM (revenue) by allowing our campaign managers to “tune” campaign targeting and creatives
Drove increase in spend from advertisers since they got better performance by advertising through Yahoo!
IMPORTANT: Sorting is required for both the source and the cube partition queries.
Haven't used UBO (usage-based optimization) yet due to the 2005 issues
Creates own spreadsheet (above) to hand-make aggregations
Extremely difficult to make/explain aggs
Analysis: once you split, compare how long ProcessData takes vs. ProcessIndexes
to determine whether aggregation creation is the issue or not