SlideShare a Scribd company logo
1 of 28
April 10-12, Chicago, IL
Yahoo!, Big Data, and
Microsoft BI: Bigger and
Better Together
Dianne Cantwell and Denny Lee
April 10-12, Chicago, IL
Please silence
cell phones
3
Agenda
Yahoo! Business Case for Hadoop and BI
Big Data, Fast Queries
Big Data / BI Themes
Get the Hardware Balance Right
Partitioning, Partitioning, Partitioning
Keep it Simple
It is the order of things
4
Yahoo! manages a
powerful scalable
advertising exchange
that includes publishers
and advertisers
Yahoo! TAO Business Challenge
5
Advertisers want to get
the best bang for their
buck by reaching their
targeted audiences
effectively and efficiently
Yahoo! TAO Business Challenge
6
Yahoo! needs visibility into how consumers
are responding to ads along many
dimensions: web sites, creatives, time of
day, user segments (e.g. gender, age,
location) to make the exchange work as
efficiently and effectively as possible
Yahoo! TAO Business Challenge
7
Yahoo! TAO Technical Requirements
680,000,000Visitors to Yahoo! Branded sites:
Ad Impressions: 3,500,000,000(perday)
Refresh Frequency: Hourly
464,000,000,000(perqtr)
Rows Loaded:
Average Query Time: <10 seconds
8
Yahoo! TAO Platform Architecture
How did we load so much so quickly?
Data Archive & Staging
Oracle 11G RAC
File 1
File 2
File N
Partition 1
Partition 2
Partition N
Partition 1
Partition 2
Partition N
24TB
Cube
/qtr
1.2TB
/day
135GB/day
compressed
2PB
cluster
Data Aggregation & ETL
Hadoop
BI Server
SQL Server Analysis
Services 2008 R2
9
BI Query Servers
SQL Server Analysis
Services 2008 R2
24TB
Cube
/qtr
Adhoc Query/Visualization
Tableau Desktop 7
Optimization Application
Custom J2EE App
Yahoo! TAO Platform Architecture
Queries at the “speed of thought”
464B rows of
event level data
/qtr
• Dimensions: 42
• Attributes: 296
• Measures: 278
Avg Query Time:
2 secs
Avg Query Time:
5 secs
10
Yahoo! TAO Return on Investment
For campaigns
optimized using TAO,
advertisers spent
more with Yahoo! than
before
For campaigns
optimized using TAO,
more eCPMs
(revenue)!
11
Yahoo! TAO Return on Investment
Yahoo! TAO exposed customer segment
performance to campaign managers and
advertisers for the first time! No longer
“flying audience blind”
12
Yahoo! TAO Future Direction
Increase Segments by 3x
Increase data size and cartesian
No longer doing distinct count
Built frequency reports and sampling to deliver this due to the inherent complexity!
Current Challenge
Hadoop to SSAS cube (more later)
External access to cubes
More disk due to need for more IO
13
Big Data Analytics Challenges
Cube
F
14
Get the data out!
15
Extracting the data
File Generation
Hadoop jobs create many files that are exported / dumped to disk in tabular format
File Staging
Files are propped to a staging folder for relational dB access
Oracle External Tables
Generate external tables that point to the staged files
No need to import the data
Processing is slow
16
AS on Oracle Case
Oracle OLEDB
10K rows/sec
100K
rows/sec
SSIS Connector
20K rows/sec
Oracle Analysis Services
Oracle SQL Analysis Services
17
Passthrough Query to Linked Server
http://msdn.microsoft.com/en-us/library/jj710329.aspx
18
Partitioning,
Partitioning,
Partitioning
19
PartitionsPartitions
Yahoo Example – “Fast” Oracle Load
• Data is streamed in to Oracle to files
• To get max processing, 30 threads are fired because all T (temp) partitions are
processed concurrently
• Super fast data loads
• Problem is that it requires constant merging of partitions
Files are streamed in
as they become
available
10/10/10 T360772
10/10/10 T360773
…
10/10/10 T361645
10/10/10 T360772
Oracle 10g
10/10/10 T360773
10/10/10 T361645
…
10/10/10 T360772
10/10/10 T360773
10/10/10 T361645
…
SSAS
10/10/10
Merge
20
Partitions – Directly Merging
Partitions
10/10/10 00:00
Oracle 10g
10/10/10 01:00
10/10/10 23:00
…
• New model allows for set hourly partitions
• No more streaming data but with hourly partitions, cannot have as many threads for
fast data loads, unless…
• Process multiple cubes or measure groups in parallel
Partitions
10/10/10 00:00
10/10/10 01:00
10/10/10 23:00
…
SSAS
Segments
10/10/10 00:00
10/10/10 01:00
10/10/10 23:00
…
Activities
10/10/10 00:00
10/10/10 01:00
10/10/10 23:00
…
Uniques
21
It is the order of things
22
It is the order of things
“I am a Jem'Hadar. He is a Vorta.
It is the order of things."
"Do you really want to give up
your life for the 'order of things'?"
"It is not my life to give up, Captain
– and it never was.”
Rocks and Shoals,
Deep Space Nine
Written by Ronald D. Moore
23
Segments and the importance of sort order
Data File Sorted Not Sorted % Diff
fact.data 195,708,592 344,502,968 43.19%
agg.rigid.data 106,825,677 106,825,677 0.00%
dim1.dim2.fact.map 17,332,729 32,989,946 47.46%
dim1.dim3.fact.map 16,923,276 32,222,813 47.48%
dim1.dim4.fact.map 6,079,396 12,286,978 50.52%
dim5.dim6.fact.map 2,630,888 6,057,334 56.57%
dim1.dim7.fact.map 1,809,725 3,904,004 53.64%
dim8.dim9.fact.map 1,592,886 3,793,452 58.01%
dim1.dim10.fact.map 1,419,255 3,108,248 54.34%
dim8.dim11.fact.map 1,301,221 3,042,638 57.23%
dim1.dim12.fact.map 2,949,432 2,949,432 0.00%
dim1.dim13.fact.map 2,934,836 2,934,836 0.00%
dimA.dimA.fact.map 1,101,552 2,716,289 59.45%
dim8.dimB.fact.map 961,332 2,451,956 60.79%
dim1.dimC.fact.map 1,027,305 2,323,906 55.79%
dim8.dim8.fact.map 1,592,886 2,308,232 30.99%
dimA.dimD.fact.map 851,095 2,170,962 60.80%
Not Sorted
Sorted
24
Across the Eighth Dimension!
How do you associate dimensions with
Star Trek Into Darkness?
Cube
25
26
Back to cube dimensions
Running ProcessUpdate
Takes a long time to run because all of the fact partitions are re-indexed!
Minimize likelihood by building SCD-2 dimensions
Composite Key based on lowest level unique values to represent row
Sometimes identity can be just as effective though hashing requires mapping or lookuptables
Create SK to allow for SCD-2 dimensions
Key is that we keep the memory space of the SK small
Composite(Composite) or Hash(Composite) is good for dimensions loaded from fact BUT do
not expect Type-2 for fact-based dimensions
Important to call out restatement based on current data (high cost associated with keeping
versioned history of dimension tables)
27
Let’s aggregate it up
April 10-12, Chicago, IL
Thank you!
Diamond Sponsor

More Related Content

Similar to Yahoo!, Big Data, and Microsoft BI: Bigger and Better Together

Next generation databases july2010
Next generation databases july2010Next generation databases july2010
Next generation databases july2010
Guy Harrison
 
Couchbase Overview Nov 2013
Couchbase Overview Nov 2013Couchbase Overview Nov 2013
Couchbase Overview Nov 2013
Jeff Harris
 
Petascale Analytics - The World of Big Data Requires Big Analytics
Petascale Analytics - The World of Big Data Requires Big AnalyticsPetascale Analytics - The World of Big Data Requires Big Analytics
Petascale Analytics - The World of Big Data Requires Big Analytics
Heiko Joerg Schick
 

Similar to Yahoo!, Big Data, and Microsoft BI: Bigger and Better Together (20)

Five database trends - updated April 2015
Five database trends - updated April 2015Five database trends - updated April 2015
Five database trends - updated April 2015
 
Next generation databases july2010
Next generation databases july2010Next generation databases july2010
Next generation databases july2010
 
20th Athens Big Data Meetup - 1st Talk - Druid: the open source, performant, ...
20th Athens Big Data Meetup - 1st Talk - Druid: the open source, performant, ...20th Athens Big Data Meetup - 1st Talk - Druid: the open source, performant, ...
20th Athens Big Data Meetup - 1st Talk - Druid: the open source, performant, ...
 
Data Culture Series - Keynote & Panel - Reading - 12th May 2015
Data Culture Series  - Keynote & Panel - Reading - 12th May 2015Data Culture Series  - Keynote & Panel - Reading - 12th May 2015
Data Culture Series - Keynote & Panel - Reading - 12th May 2015
 
Introduction to Azure DocumentDB
Introduction to Azure DocumentDBIntroduction to Azure DocumentDB
Introduction to Azure DocumentDB
 
Red hatpartner2013edb futureofdatabase
Red hatpartner2013edb futureofdatabaseRed hatpartner2013edb futureofdatabase
Red hatpartner2013edb futureofdatabase
 
SQLCAT: Tier-1 BI in the World of Big Data
SQLCAT: Tier-1 BI in the World of Big DataSQLCAT: Tier-1 BI in the World of Big Data
SQLCAT: Tier-1 BI in the World of Big Data
 
Couchbase Overview Nov 2013
Couchbase Overview Nov 2013Couchbase Overview Nov 2013
Couchbase Overview Nov 2013
 
OSA Con 2022 - Scaling your Pandas Analytics with Modin - Doris Lee - Ponder.pdf
OSA Con 2022 - Scaling your Pandas Analytics with Modin - Doris Lee - Ponder.pdfOSA Con 2022 - Scaling your Pandas Analytics with Modin - Doris Lee - Ponder.pdf
OSA Con 2022 - Scaling your Pandas Analytics with Modin - Doris Lee - Ponder.pdf
 
Our Hero Flash eBook
Our Hero Flash eBookOur Hero Flash eBook
Our Hero Flash eBook
 
Database revolution opening webcast 01 18-12
Database revolution opening webcast 01 18-12Database revolution opening webcast 01 18-12
Database revolution opening webcast 01 18-12
 
Database Revolution - Exploratory Webcast
Database Revolution - Exploratory WebcastDatabase Revolution - Exploratory Webcast
Database Revolution - Exploratory Webcast
 
Big Data Basic Concepts | Presented in 2014
Big Data Basic Concepts  | Presented in 2014Big Data Basic Concepts  | Presented in 2014
Big Data Basic Concepts | Presented in 2014
 
August meetup - All about Apache Druid
August meetup - All about Apache Druid August meetup - All about Apache Druid
August meetup - All about Apache Druid
 
Graph Database Use Cases - StampedeCon 2015
Graph Database Use Cases - StampedeCon 2015Graph Database Use Cases - StampedeCon 2015
Graph Database Use Cases - StampedeCon 2015
 
Graph database Use Cases
Graph database Use CasesGraph database Use Cases
Graph database Use Cases
 
Introduction to Big Data & Hadoop
Introduction to Big Data & HadoopIntroduction to Big Data & Hadoop
Introduction to Big Data & Hadoop
 
Petascale Analytics - The World of Big Data Requires Big Analytics
Petascale Analytics - The World of Big Data Requires Big AnalyticsPetascale Analytics - The World of Big Data Requires Big Analytics
Petascale Analytics - The World of Big Data Requires Big Analytics
 
The paradox of big data - dataiku / oxalide APEROTECH
The paradox of big data - dataiku / oxalide APEROTECHThe paradox of big data - dataiku / oxalide APEROTECH
The paradox of big data - dataiku / oxalide APEROTECH
 
Evolution of the DBA to Data Platform Administrator/Specialist
Evolution of the DBA to Data Platform Administrator/SpecialistEvolution of the DBA to Data Platform Administrator/Specialist
Evolution of the DBA to Data Platform Administrator/Specialist
 

More from Denny Lee

More from Denny Lee (20)

Azure Cosmos DB: Globally Distributed Multi-Model Database Service
Azure Cosmos DB: Globally Distributed Multi-Model Database ServiceAzure Cosmos DB: Globally Distributed Multi-Model Database Service
Azure Cosmos DB: Globally Distributed Multi-Model Database Service
 
Spark to DocumentDB connector
Spark to DocumentDB connectorSpark to DocumentDB connector
Spark to DocumentDB connector
 
SQL Server Integration Services Best Practices
SQL Server Integration Services Best PracticesSQL Server Integration Services Best Practices
SQL Server Integration Services Best Practices
 
SQL Server Reporting Services: IT Best Practices
SQL Server Reporting Services: IT Best PracticesSQL Server Reporting Services: IT Best Practices
SQL Server Reporting Services: IT Best Practices
 
Introduction to Microsoft's Big Data Platform and Hadoop Primer
Introduction to Microsoft's Big Data Platform and Hadoop PrimerIntroduction to Microsoft's Big Data Platform and Hadoop Primer
Introduction to Microsoft's Big Data Platform and Hadoop Primer
 
Differential Privacy Case Studies (CMU-MSR Mindswap on Privacy 2007)
Differential Privacy Case Studies (CMU-MSR Mindswap on Privacy 2007)Differential Privacy Case Studies (CMU-MSR Mindswap on Privacy 2007)
Differential Privacy Case Studies (CMU-MSR Mindswap on Privacy 2007)
 
SQL Server Reporting Services Disaster Recovery webinar
SQL Server Reporting Services Disaster Recovery webinarSQL Server Reporting Services Disaster Recovery webinar
SQL Server Reporting Services Disaster Recovery webinar
 
Building and Deploying Large Scale SSRS using Lessons Learned from Customer D...
Building and Deploying Large Scale SSRS using Lessons Learned from Customer D...Building and Deploying Large Scale SSRS using Lessons Learned from Customer D...
Building and Deploying Large Scale SSRS using Lessons Learned from Customer D...
 
Designing, Building, and Maintaining Large Cubes using Lessons Learned
Designing, Building, and Maintaining Large Cubes using Lessons LearnedDesigning, Building, and Maintaining Large Cubes using Lessons Learned
Designing, Building, and Maintaining Large Cubes using Lessons Learned
 
SQLCAT - Data and Admin Security
SQLCAT - Data and Admin SecuritySQLCAT - Data and Admin Security
SQLCAT - Data and Admin Security
 
SQLCAT: Addressing Security and Compliance Issues with SQL Server 2008
SQLCAT: Addressing Security and Compliance Issues with SQL Server 2008SQLCAT: Addressing Security and Compliance Issues with SQL Server 2008
SQLCAT: Addressing Security and Compliance Issues with SQL Server 2008
 
SQLCAT: A Preview to PowerPivot Server Best Practices
SQLCAT: A Preview to PowerPivot Server Best PracticesSQLCAT: A Preview to PowerPivot Server Best Practices
SQLCAT: A Preview to PowerPivot Server Best Practices
 
Deploying and Managing PowerPivot for SharePoint
Deploying and Managing PowerPivot for SharePointDeploying and Managing PowerPivot for SharePoint
Deploying and Managing PowerPivot for SharePoint
 
Big Data, Bigger Brains
Big Data, Bigger BrainsBig Data, Bigger Brains
Big Data, Bigger Brains
 
Jump Start into Apache Spark (Seattle Spark Meetup)
Jump Start into Apache Spark (Seattle Spark Meetup)Jump Start into Apache Spark (Seattle Spark Meetup)
Jump Start into Apache Spark (Seattle Spark Meetup)
 
How Concur uses Big Data to get you to Tableau Conference On Time
How Concur uses Big Data to get you to Tableau Conference On TimeHow Concur uses Big Data to get you to Tableau Conference On Time
How Concur uses Big Data to get you to Tableau Conference On Time
 
SQL Server Reporting Services Disaster Recovery Webinar
SQL Server Reporting Services Disaster Recovery WebinarSQL Server Reporting Services Disaster Recovery Webinar
SQL Server Reporting Services Disaster Recovery Webinar
 
Ensuring compliance of patient data with big data and bi [bdii 301-m] - (4078)
Ensuring compliance of patient data with big data and bi [bdii 301-m] - (4078)Ensuring compliance of patient data with big data and bi [bdii 301-m] - (4078)
Ensuring compliance of patient data with big data and bi [bdii 301-m] - (4078)
 
SQL Server Reporting Services: IT Best Practices
SQL Server Reporting Services: IT Best PracticesSQL Server Reporting Services: IT Best Practices
SQL Server Reporting Services: IT Best Practices
 
Building SSRS 2008 large scale solutions
Building SSRS 2008 large scale solutionsBuilding SSRS 2008 large scale solutions
Building SSRS 2008 large scale solutions
 

Recently uploaded

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 

Recently uploaded (20)

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 

Yahoo!, Big Data, and Microsoft BI: Bigger and Better Together

  • 1. April 10-12, Chicago, IL Yahoo!, Big Data, and Microsoft BI: Bigger and Better Together Dianne Cantwell and Denny Lee
  • 2. April 10-12, Chicago, IL Please silence cell phones
  • 3. 3 Agenda Yahoo! Business Case for Hadoop and BI Big Data, Fast Queries Big Data / BI Themes Get the Hardware Balance Right Partitioning, Partitioning, Partitioning Keep it Simple It is the order of things
  • 4. 4 Yahoo! manages a powerful scalable advertising exchange that includes publishers and advertisers Yahoo! TAO Business Challenge
  • 5. 5 Advertisers want to get the best bang for their buck by reaching their targeted audiences effectively and efficiently Yahoo! TAO Business Challenge
  • 6. 6 Yahoo! needs visibility into how consumers are responding to ads along many dimensions: web sites, creatives, time of day, user segments (e.g. gender, age, location) to make the exchange work as efficiently and effectively as possible Yahoo! TAO Business Challenge
  • 7. 7 Yahoo! TAO Technical Requirements 680,000,000Visitors to Yahoo! Branded sites: Ad Impressions: 3,500,000,000(perday) Refresh Frequency: Hourly 464,000,000,000(perqtr) Rows Loaded: Average Query Time: <10 seconds
  • 8. 8 Yahoo! TAO Platform Architecture How did we load so much so quickly? Data Archive & Staging Oracle 11G RAC File 1 File 2 File N Partition 1 Partition 2 Partition N Partition 1 Partition 2 Partition N 24TB Cube /qtr 1.2TB /day 135GB/day compressed 2PB cluster Data Aggregation & ETL Hadoop BI Server SQL Server Analysis Services 2008 R2
  • 9. 9 BI Query Servers SQL Server Analysis Services 2008 R2 24TB Cube /qtr Adhoc Query/Visualization Tableau Desktop 7 Optimization Application Custom J2EE App Yahoo! TAO Platform Architecture Queries at the “speed of thought” 464B rows of event level data /qtr • Dimensions: 42 • Attributes: 296 • Measures: 278 Avg Query Time: 2 secs Avg Query Time: 5 secs
  • 10. 10 Yahoo! TAO Return on Investment For campaigns optimized using TAO, advertisers spent more with Yahoo! than before For campaigns optimized using TAO, more eCPMs (revenue)!
  • 11. 11 Yahoo! TAO Return on Investment Yahoo! TAO exposed customer segment performance to campaign managers and advertisers for the first time! No longer “flying audience blind”
  • 12. 12 Yahoo! TAO Future Direction Increase Segments by 3x Increase data size and cartesian No longer doing distinct count Built frequency reports and sampling to deliver this due to the inherent complexity! Current Challenge Hadoop to SSAS cube (more later) External access to cubes More disk due to need for more IO
  • 13. 13 Big Data Analytics Challenges Cube F
  • 15. 15 Extracting the data File Generation Hadoop jobs create many files that are exported / dumped to disk in tabular format File Staging Files are propped to a staging folder for relational dB access Oracle External Tables Generate external tables that point to the staged files No need to import the data Processing is slow
  • 16. 16 AS on Oracle Case Oracle OLEDB 10K rows/sec 100K rows/sec SSIS Connector 20K rows/sec Oracle Analysis Services Oracle SQL Analysis Services
  • 17. 17 Passthrough Query to Linked Server http://msdn.microsoft.com/en-us/library/jj710329.aspx
  • 19. 19 PartitionsPartitions Yahoo Example – “Fast” Oracle Load • Data is streamed in to Oracle to files • To get max processing, 30 threads are fired because all T (temp) partitions are processed concurrently • Super fast data loads • Problem is that it requires constant merging of partitions Files are streamed in as they become available 10/10/10 T360772 10/10/10 T360773 … 10/10/10 T361645 10/10/10 T360772 Oracle 10g 10/10/10 T360773 10/10/10 T361645 … 10/10/10 T360772 10/10/10 T360773 10/10/10 T361645 … SSAS 10/10/10 Merge
  • 20. 20 Partitions – Directly Merging Partitions 10/10/10 00:00 Oracle 10g 10/10/10 01:00 10/10/10 23:00 … • New model allows for set hourly partitions • No more streaming data but with hourly partitions, cannot have as many threads for fast data loads, unless… • Process multiple cubes or measure groups in parallel Partitions 10/10/10 00:00 10/10/10 01:00 10/10/10 23:00 … SSAS Segments 10/10/10 00:00 10/10/10 01:00 10/10/10 23:00 … Activities 10/10/10 00:00 10/10/10 01:00 10/10/10 23:00 … Uniques
  • 21. 21 It is the order of things
  • 22. 22 It is the order of things “I am a Jem'Hadar. He is a Vorta. It is the order of things." "Do you really want to give up your life for the 'order of things'?" "It is not my life to give up, Captain – and it never was.” Rocks and Shoals, Deep Space Nine Written by Ronald D. Moore
  • 23. 23 Segments and the importance of sort order Data File Sorted Not Sorted % Diff fact.data 195,708,592 344,502,968 43.19% agg.rigid.data 106,825,677 106,825,677 0.00% dim1.dim2.fact.map 17,332,729 32,989,946 47.46% dim1.dim3.fact.map 16,923,276 32,222,813 47.48% dim1.dim4.fact.map 6,079,396 12,286,978 50.52% dim5.dim6.fact.map 2,630,888 6,057,334 56.57% dim1.dim7.fact.map 1,809,725 3,904,004 53.64% dim8.dim9.fact.map 1,592,886 3,793,452 58.01% dim1.dim10.fact.map 1,419,255 3,108,248 54.34% dim8.dim11.fact.map 1,301,221 3,042,638 57.23% dim1.dim12.fact.map 2,949,432 2,949,432 0.00% dim1.dim13.fact.map 2,934,836 2,934,836 0.00% dimA.dimA.fact.map 1,101,552 2,716,289 59.45% dim8.dimB.fact.map 961,332 2,451,956 60.79% dim1.dimC.fact.map 1,027,305 2,323,906 55.79% dim8.dim8.fact.map 1,592,886 2,308,232 30.99% dimA.dimD.fact.map 851,095 2,170,962 60.80% Not Sorted Sorted
  • 24. 24 Across the Eighth Dimension! How do you associate dimensions with Star Trek Into Darkness? Cube
  • 25. 25
  • 26. 26 Back to cube dimensions Running ProcessUpdate Takes a long time to run because all of the fact partitions are re-indexed! Minimize likelihood by building SCD-2 dimensions Composite Key based on lowest level unique values to represent row Sometimes identity can be just as effective though hashing requires mapping or lookuptables Create SK to allow for SCD-2 dimensions Key is that we keep the memory space of the SK small Composite(Composite) or Hash(Composite) is good for dimensions loaded from fact BUT do not expect Type-2 for fact-based dimensions Important to call out restatement based on current data (high cost associated with keeping versioned history of dimension tables)
  • 28. April 10-12, Chicago, IL Thank you! Diamond Sponsor

Editor's Notes

  1. Like the NYSE, the Yahoo! ad network behaves like an exchange for display advertising Advertisers are the buyers Publishers (web sites) are the sellers (Yahoo! is one of the publishers) Yahoo! needs to create the most efficient exchange as possible
  2. Performance display advertiser requires that we can: Identify the target audience for a campaign Monitor how they behave across a number of different dimensions
  3. Huge opportunity for optimization but difficult given the large number of discrete dimensions
  4. The number of ad performance factors (i.e. dimensions) and the number of ad impressions per day is huge Yahoo! branded sites attract 680 million unique visitors worldwide 3.5B performance display ad impressions served on Yahoo! exchange per day Large many to many relationships (consumers can be a member of more than one segment) Each consumer is a member of an average of 10 segments – explodes the data by 10x 161B rows per quarter for impression data 203B rows per quarter for segment data (compressed but # of rows processed is really 10x = 2 trillion) Given the number of permutations, query performance needs to be speed of thought or the system is useless Traditional ROLAP is too slow Hundred of dimensions, attributes and metrics create complexity Need integration with good visualization tools to find relevant trends and performance improvement opportunities Data needs to be fresh (from ad impression to query in less than 24 hours) or opportunities are lost Display ad campaigns have very short timeframes (< 2 weeks)
  5. Key design concepts are: Use standard, off the shelf parts Loosely coupled components (using a pull architecture) Centralize data aggregation on grid using Hadoop Leverage Oracle’s external table feature to make data available to SSAS with minimal latency One to one match of SASS partitions to Oracle partitions so not aggregation needed & partition pruning enabled (30+ trillion rows in Oracle tables) Maximize parallel loading (90+ threads loading in parallel) Separate cube building from cube querying Improvements in HW/Design 9h -> 2.5h: Change in HW: IBM x3560 M3 256GB RAM, 48 cores; EMC Clariion SAN 2.5h -> 1.25h: Use of Data Direct / Attunity drivers
  6. Cube is complex due to nature of the ad business Need to provide an “anything by anything” query environment to find the optimization opportunities If queries aren’t fast, we lose the value Need to update the cube continuously given that there’s limited time to optimize a display ad campaign (data needs to be updated 4x day at minimum) Used SASS aggregations extensively – cut down on Hadoop aggregations dramatically Only 8 fact tables loaded (4 areas, 1 detail, 1 aggregate) As opposed to an existing ROLAP application at Yahoo! that requires 3,600 facts (aggregate) tables
  7. Doubled the eCPM (revenue) by allowing our campaign managers to “tune” campaign targeting and creatives Drove increase in spend from advertisers since they got better performance by advertising through Yahoo!
  8. IMPORTANT: Sorting is require for both the source and the cube partition queries.
  9. Haven’t used UBO yet due to the 2005 issues Creates own spreadsheet (above) to hand-make aggregations Extremely difficult to make/explain aggs Analysis: once you split; how long is ProcessData v.s ProcessIndexes To determine if aggregation creation is the issue or not