
As part of the excitement of the Strata Conference this week, Microsoft has been talking about Big Data and Hadoop.  It started off with Dave Campbell’s question: Do we have the tools we need to navigate the New World of Data?  Some of the tooling call outs specific to Microsoft include references to PowerPivot, Power View, and the Hadoop JavaScript framework (Hadoop JavaScript – Microsoft’s VB shift for Big Data).

As noted in GigaOM’s article Microsoft’s Hadoop play is shaping up, and it includes Excel, the great call out is the plan:

to make Hadoop data analyzable via both a JavaScript framework and Microsoft Excel, meaning many millions of developers and business users will be able to work with Hadoop data using their favorite tools.


Big Data for Everyone!

The title of the Microsoft BI blog post says it the best: Big Data for Everyone: Using Microsoft’s Familiar BI Tools with Hadoop – it’s about helping make Big Data accessible to everyone by use of one of the most popular and powerful BI tools – Excel.

So what does accessible to everyone mean – in the BI sense?  It’s about being able to go from this (which is a pretty nice view of a Hive query against Hadoop in the Hadoop on Azure Hive Console)

image

and getting it into Excel or PowerPivot.

The most important call out here is that you can use PowerPivot and Excel to merge data sets not just from Hadoop, but also bring in data sets from SQL Server, SQL Azure, PDW, Oracle, Teradata, Reports, Atom feeds, Text files, other Excel files, and via ODBC – all within Excel! (thanks @sqlgal for that reminder!)

From here users can manipulate the data using Excel macros and PowerPivot DAX language respectively.  Below is a screenshot of data extracted from Hive and placed into PowerPivot for Excel.

image

But even cooler – data visualization wise – once your PowerPivot for Excel workbook is uploaded to SharePoint 2010 with SQL Server 2012, you can create an interactive Power View report.

image

For more information on how to get PowerPivot and Power View to connect to Hadoop (in this case, it’s Hadoop on Azure but conceptually they are the same), please reference the links below:


So what’s so Big about Big Data?

As noted in the post What’s so Big about Big Data?, Big Data is important because of the sheer amount of machine generated data that needs to be made sense of.

As noted by Alexander Stojanovic (@stojanovic), the Founder and General Manager of Hadoop on Windows and Azure:

It’s not just your “Big Data” problems, it’s about your BIG “Data Problems”


To learn more, check out my 24HOP (24 Hours of PASS) session:

Tier-1 BI in the Age of Bees and Elephants

In this age of Big Data, data volumes grow ever larger while the technical problems and business scenarios become more complex.  This session provides concrete examples of how these can be solved. Highlighted will be the use of Big Data technologies including Hadoop (elephants) and Hive (bees) with Analysis Services.  Customer examples including Klout and Yahoo! (with their 24TB cube) will highlight both the complexities and solutions to these problems.

Making this real, a great case study is the one at Klout, which includes a great blog post: Big Data, Bigger Brains.  Below is a link to Bruno Aziza (@brunoaziza) and Dave Mariani’s (@dmariani) YouTube video on how Klout Leverages Hadoop and Microsoft BI Technologies To Manage Big Data.

Enjoy!

Disclaimer: This blog post (like other blog posts on dennyglee.com) is written by the author Denny Lee. I am a Microsoft employee but the opinions below are my own. I have been working with the Isotope team (code name for Hadoop on Windows and Hadoop on Azure) since its inception while part of the SQL Customer Advisory Team.

One of the cool things about the Hadoop on Azure CTP is its Interactive JavaScript Console – it allows users to query and visualize data on top of HDFS using a JavaScript framework.  For example, below is a pie chart visualization within a browser generated by the Interactive JavaScript console using the graph.pie function.

image

Why is this important and cool at the same time?  As one can note with amazing projects like node.js, JavaScript is being seen by many as its own first class programming / application language: The rise of Node.js: JavaScript graduates to the server.

In the realm of Big Data, Hadoop on Azure is showcasing the ability to use JavaScript to create MapReduce jobs as well as interact with Pig and Hive from a browser.  This opens up the possibility of a new path for the many JavaScript developers to jump onboard into the world of Big Data.  Hence my opinion:

Hadoop JavaScript – Microsoft’s VB shift for Big Data

Just like Microsoft brought VB developers into the Enterprise by COM and later .NET (so VB forms creators could become Enterprise application developers back in the 90s), Hadoop JavaScript is a way to help bring JavaScript developers into the world of Hadoop using their own already powerful skillset.
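For readers who have not written a MapReduce job before, here is a minimal sketch of the word count pattern that such jobs follow. It is plain Python rather than the Hadoop on Azure JavaScript API (which is not reproduced here), and the function names map_phase, shuffle, and reduce_phase are my own illustrative choices; on a real cluster the framework runs the shuffle step for you across many nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key (done by the framework on a real cluster)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in-process over a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data for everyone", "Big Data bigger brains"]))
```

The point of the split is that map_phase and reduce_phase only ever see a small piece of the data, which is what lets the framework spread them across many machines.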

And yes, the JavaScript layer is something Microsoft intends to give back to the Apache community.  Check out the jira HADOOP-8079: Proposal for enhancements to Hadoop for Windows Server and Windows Azure development and runtime environments.

To learn more about Hadoop + JavaScript, check out the links below as well as the Introduction to the Hadoop on Azure Interactive JavaScript Console video.

As well, if you’re going to the 2012 Strata Conference in Santa Clara, check out Asad Khan’s session: Hadoop + JavaScript: what we learned.

Enjoy!


The post Connecting PowerPivot to Hadoop on Azure – Self Service BI to Big Data in the Cloud provided the step-by-step details on how to connect PowerPivot to your Hadoop on Azure cluster.   And while this is really powerful, one of the great features of SQL Server 2012 is Power View (formerly known as Project Crescent).  With Power View, the SQL Server BI stack extends the concept of Self Service BI (PowerPivot) to Self Service Reporting.

image

Above is a screenshot of the Power View Mobile Hive Sample that is built on top of the PowerPivot workbook created in the Connecting PowerPivot to Hadoop on Azure blog post.  This time using a different medium, the steps to create a Power View report with a Hadoop on Azure source can be seen in the YouTube video below.

Power View Report to Hadoop on Azure

Enjoy!


Braised Lamb Shank in Cumin and Star Anise

Photo credit goes to Steph L. on Yelp

Vancouver is regularly ranked among the most liveable cities (for a while, it was #1) – the temperate environment, beautiful scenery, and general aura of the city are just amazing.  Just as awesome is the city’s quality and diversity of great food.

Amazing Malaysian Cuisine

One of those places is most certainly Banana Leaf Malaysian Cuisine.  As of this post, they have four restaurants in the Vancouver area – I personally have gone to both the one on Denman and the one in Kits – both are amazingly good.

Normally, I’m one to try all sorts of food – I rather pride myself on being more than willing to eat all sorts of things that people would find … weird.  In this, I am definitely Chinese.

As noted in the movie “The Corruptor” (not the best movie even with the great actors Chow Yun-Fat and Mark Wahlberg)

You wanna be Chinese, you gotta eat the nasty stuff

Fortunately, this is certainly NOT the case for Banana Leaf – aromatic, bold, delectable, and flavourful (not salty – and yes, I did spell flavour the Canadian way) are the words that come to mind.

While there are many great dishes – my personal call outs are:

Braised Lamb Shank in Cumin & Star Anise: Just as the menu describes it, the lamb shank is so tender it literally falls off the bone.  Amazingly good with just the right amount of spices to bring out the flavour of the lamb instead of overpowering it.  The lamb shank is cooked just right to ensure that the meat is amazingly tender (overcook lamb, and you’ve got yourself one awesome piece of rubber).  If you order nothing else, this IS the dish you order. Period.

Pineapple Fried Rice with Seafood & Chicken: I almost never, ever, ever, ever, ever, …, ever order fried rice from an Asian restaurant…ever.  Often it’s too much soy sauce, too much MSG, adding ketchup (seriously WTF, fried rice made with ketchup!!!), …. etc.  This is seriously not the case here.  Served in a scooped out pineapple – the fried rice is lightly sweet with solid heapings of seafood and chicken.  Quite filling – in a good way!


And the final call out of course goes to dessert.  I certainly have what one would call a “sweet tooth”.  And the Pisang Goreng is my current dessert choice du jour.  Crispy fried banana, vanilla ice cream, crushed peanuts, and gula melaka (which I found out from Wikipedia is palm sugar) – what’s not to love!

So the next time you’re in Vancouver and you’re up for some good Malaysian cuisine – or just good food in general – check it out.  I leave you with the wisdom of George Bernard Shaw:

There is no love sincerer than the love of food.

Dorky attempts at geek Shakespeare aside, as the volume, complexity, and variability of your data systems increase in … entropy …, this becomes a fundamental question in whether one scales up or scales out their data problem.

Apologies in advance for the nerdy chemistry references – which start with this picture of Dr. Arthur Grosser (more later).

As noted in the previous post Scale Up or Scale Out your Data Problems? A Space Analogy, the decision to scale up or scale out your data problem is a key facet in your Big Data problem.  But just as important as the ability to distribute the data across commoditized hardware, another key facet is the movement of data.

Latencies (i.e. slower performance) are introduced when you need to move data from one location to another.  Within the data world, you can solve this either by making the data faster to move (e.g. compression, delta transfer, faster connectivity, etc.) or by designing a system that reduces the need to move the data in the first place (i.e. moving compute to data rather than data to compute).
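To put rough numbers on the cost of moving data, here is a back-of-the-envelope sketch. The 1 TB data size, 1 Gbps link, and 4x compression ratio are assumptions I picked for illustration, and the calculation ignores protocol overhead and disk speed.

```python
def transfer_seconds(size_bytes, link_bits_per_second, compression_ratio=1.0):
    """Seconds needed to push size_bytes over a link, ignoring protocol overhead."""
    bytes_on_wire = size_bytes / compression_ratio
    return bytes_on_wire * 8 / link_bits_per_second


TB = 10**12    # decimal terabyte, in bytes
GBPS = 10**9   # one gigabit per second, in bits per second

# Assumed scenario: 1 TB of data over a 1 Gbps link.
raw = transfer_seconds(1 * TB, 1 * GBPS)
# Same data, assuming it compresses 4:1 before it goes on the wire.
compressed = transfer_seconds(1 * TB, 1 * GBPS, compression_ratio=4.0)

if __name__ == "__main__":
    print(f"uncompressed: {raw / 3600:.1f} hours")
    print(f"compressed:   {compressed / 3600:.1f} hours")
```

Under these assumptions the uncompressed copy takes 8,000 seconds (a bit over two hours), which is why avoiding the move altogether is so attractive once the data gets big.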

Scaling Up the Problem / Moving Data to Compute

To help describe the problem, the diagram below is a representation of a scale up traditional RDBMS.  The silver database boxes on the left represent the database servers (each with blue platters representing local disks), the box with 9 blue platters represents a disk array (e.g. SAN, DAS, etc.), the blue arrows represent fiber channel connections (between the server and disk array), and the green arrows represent the network connectivity.

image

In an optimized scale up RDBMS, we often will set up DAS or SANs to quickly transfer data from the disk array to the RDBMS server or compute node (often allocating the local disk for the compute node to hold temp/backup/cache files).  This setup works great provided that you can ensure low latencies.

image

And this is where things can get complicated, because if you were to lose disks on the array and/or fiber channel connectivity to the disk array – the RDBMS would go offline.    But as described in the above diagram, perhaps you set up active clustering so the secondary RDBMS can take over.

image

Yet, if you were to lose network connectivity (e.g. the secondary RDBMS is not aware the primary is offline) or lose fiber channel connectivity, you would also lose the secondary.

The Importance of ACID

It is important to note that many RDBMS systems have features or designs that work around these problems.  But to ensure availability and redundancy, it often requires more expensive hardware to work around the problematic network and disk failure points.

As well, this is not to say that RDBMSs are a bad design – they are designed with ACID in mind – atomicity, consistency, isolation, and durability – to guarantee the reliability and robustness of database transactions (for more info, check out the Wikipedia entry: ACID).
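As a small illustration of the "A" in ACID, here is a sketch using Python's built-in sqlite3 module. The accounts table, the names, and the transfer helper are all hypothetical; the point is only that the two updates inside the transaction either both apply or both roll back.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move amount between accounts atomically: both updates apply, or neither."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


# Hypothetical in-memory database with two accounts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

if __name__ == "__main__":
    print(transfer(conn, "alice", "bob", 30), balances(conn))
    print(transfer(conn, "alice", "bob", 500), balances(conn))
```

The second transfer overdraws the account, so the whole transaction is rolled back and neither balance changes.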

Scaling Out the Problem / Moving Compute to Data

In a scale out or distributed solution, the idea is to have many commodity servers; there are many points of failure but there are also many paths for success.

image

Key to a distributed system is that as data comes in (the blue file icon on the right represents data such as web logs), the data is distributed and replicated in chunks to many nodes within the cluster.  In the case of Hadoop, files are broken into 64MB / 128MB chunks and each of these chunks is placed into three different locations (if you set the replication factor to 3).
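The chunking and replication described above can be sketched in a few lines. This is a simplification and not how HDFS actually chooses locations (real placement is rack-aware); the 200 MB file, the five node names, and the round-robin placement are assumptions for illustration.

```python
def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return the size of each block; the last block may be smaller."""
    full, rest = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full + ([rest] if rest else [])


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Round-robin is a stand-in for the real placement policy.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


# Assumed example: a 200 MB file on a five node cluster.
blocks = split_into_blocks(200)
placement = place_replicas(len(blocks), ["n1", "n2", "n3", "n4", "n5"])
raw_storage_mb = sum(blocks) * 3  # every block is stored three times

if __name__ == "__main__":
    print(blocks)
    for block, replicas in placement.items():
        print(block, replicas)
    print(raw_storage_mb, "MB of raw disk used")
```

With these numbers the file becomes three 64 MB blocks plus one 8 MB block, and the cluster spends 600 MB of raw disk to store 200 MB of data.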

image

While you are using more disk space to replicate the data, now that you have placed the data into the system, you have ensured redundancy by replicating the data within it.

image

What is great about these types of distributed systems is that they are designed right from the beginning to handle failures, from latency issues caused by disk or network connectivity problems to outright losing a node.  In the above diagram, a user is requesting data, but there is a loss to some disks and some network connections.

image

Nevertheless, there are other nodes that do have network connectivity and the data has been replicated so it is available.    Systems that are designed to scale out and distribute like Hadoop can ensure availability of the data and will complete the query just as long as the data exists (it may take longer if nodes are lost, but the query will be completed).
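Here is a minimal sketch of why the query can still complete: a read succeeds as long as every block has at least one replica on a node that is still reachable. The placement map and node names are made up for illustration.

```python
def readable_blocks(placement, failed_nodes):
    """Map each block to the replicas that are still reachable."""
    failed = set(failed_nodes)
    return {
        block: [node for node in replicas if node not in failed]
        for block, replicas in placement.items()
    }


def can_complete(placement, failed_nodes):
    """A read succeeds if every block has at least one surviving replica."""
    return all(readable_blocks(placement, failed_nodes).values())


# Hypothetical placement of three blocks, each replicated three times.
placement = {
    0: ["n1", "n2", "n3"],
    1: ["n2", "n3", "n4"],
    2: ["n3", "n4", "n5"],
}

if __name__ == "__main__":
    print(can_complete(placement, ["n1", "n2"]))        # two nodes lost
    print(can_complete(placement, ["n1", "n2", "n3"]))  # all replicas of block 0 lost
```

Losing two nodes leaves every block readable, while losing all three nodes that hold block 0 means the data no longer exists anywhere, which is the one case the system cannot work around.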

The importance of BASE

By using many commodity boxes, you distribute and replicate your data to multiple systems.  But as there are many moving parts, distributed systems like these cannot ensure the reliability and robustness of database transactions.  Instead, they fall under the domain of eventual consistency where over a period of time (i.e. eventually) the data within the entire system will be consistent (e.g. all data modifications will be replicated throughout the cluster).  This concept is also known as BASE (as opposed to ACID) – Basically Available, Soft State, Eventually Consistent.  For more information, check out the Wikipedia reference: Eventual Consistency.
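To show what "eventually" means in practice, here is a toy sketch of three replicas converging. It is a generic illustration of eventual consistency, not a description of how Hadoop itself replicates data; the Replica class, the version numbers, and the last-writer-wins rule are all assumptions.

```python
class Replica:
    """A toy replica that stores (version, value) pairs per key."""

    def __init__(self, name):
        self.name = name
        self.store = {}

    def write(self, key, value, version):
        current_version, _ = self.store.get(key, (0, None))
        if version > current_version:  # last-writer-wins by version number
            self.store[key] = (version, value)

    def read(self, key):
        return self.store.get(key, (0, None))[1]


def sync(a, b):
    """One anti-entropy exchange: each side adopts the other's newer versions."""
    for key, (version, value) in list(a.store.items()):
        b.write(key, value, version)
    for key, (version, value) in list(b.store.items()):
        a.write(key, value, version)


replicas = [Replica("r1"), Replica("r2"), Replica("r3")]
replicas[0].write("score", 42, version=1)        # the write lands on one replica
stale = [r.read("score") for r in replicas]      # the others have not seen it yet

sync(replicas[0], replicas[1])
sync(replicas[1], replicas[2])
converged = [r.read("score") for r in replicas]  # now every replica agrees

if __name__ == "__main__":
    print(stale)
    print(converged)
```

Between the write and the last sync, a reader can get a different answer depending on which replica it asks; that window is the price paid for availability.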

Discussion

Similar to the post Scale Up or Scale Out your Data Problems? A Space Analogy, choosing whether ACID or BASE works for you is not a matter of which one to use – but which one to use when.  For example, as noted in the post What’s so BIG about “Big Data”?, the Yahoo! Analysis Services cube is 24 TB (certainly a case of moving data to compute with my obsession on random IO with SSAS) and the source of this cube is 2PB of data from a huge Hadoop cluster (moving compute to data).

Yahoo Hadoop to Cube

Each one has its own set of issues – scaling out increases the complexity of maintaining so many nodes, scaling up becomes more expensive to ensure availability and reliability, etc.   It will be important to understand the pros/cons of each type – often it will be a combination of these two.   Another great example can be seen in Dave Mariani (@dmariani)’s post: Big Data, Bigger Brains at Klout’s blog.

ACID and BASE each have their own set of problems; the good news is that mixing them together often neutralizes the problems.

Okay, what’s with the picture of Dr. Arthur Grosser?

Oh, Dr. Arthur Grosser is an actor whose filmography includes Assassin’s Creed II, Splinter Cell, and the 90s TV show Urban Angel. But more importantly – to me anyway – he was my chemistry professor at McGill University. He was a great professor able to balance deep academic research and learning with making chemistry fun and entertaining. He also showed me (and I think many other students) that dorky and nerdy could still be cool.

A big shout out to Brad Sarsfield (@bradoop) for creating these great How-To videos for Hadoop on Azure.

 

How To: Upload Data and Use the WordCount Sample with Hadoop Services for Windows Azure (video)

 

 

Run the Pi Estimator Sample on Hadoop on Windows Azure (video)

As I am writing more about Big Data, I’ve been asked whether we need to have traditional relational or cube systems now that we have Big Data / NoSQL / Hadoop.  My response is to note that these are different systems that serve different purposes even though both are used to better understand data.

But before we dive into the specifics surrounding relational databases compared to Hadoop / Big Data, we need to first talk about the differences between solving a data problem by scaling it up or scaling it out.

One way to understand the difference is to use a space analogy.  (If you’re part of the SQL Twitter community, you’ll notice a prevalence of NASA and Space tweets)

Scale Up Space Analogy

Antennae Galaxies

The amazing image above is The Antennae Galaxies / NGC 4038-4039 from the Hubble Telescope – click on the image to see the full extra-large image, it is magnificent.

Database

In terms of space analogies, the Hubble Telescope is analogous to a scale up technology such as your traditional relational database.

  • It often utilizes non-commodity hardware (or if it is commodity, it’s enterprise commodity hardware).
  • Specialized equipment was designed and built for the Hubble telescope.  While not as astronomic (!) in price, enterprise database performance requires more specialized (and expensive) hardware.
  • The Hubble telescope in itself is a single point of failure in that if we were to lose the telescope (or a lens), we would lose the ability to get all of these amazing images.

This is NOT to say scale up is a bad thing, after all we get these amazing images from the Hubble telescope precisely because we (well, NASA scientists) have focused on creating Hubble with non-commodity specialized hardware.  Following the same analogy, relational database systems (or cube systems) also benefit from this scale up approach because it becomes possible to provide query results quickly.

Scale Out Space Analogy

Galaxies to SETI

The flow of the above image is my representation of the Search for Extra Terrestrial Intelligence (SETI) project.  SETI itself has the ultimate example of scale out distribution with their SETI@Home project.  The search for extra terrestrial intelligence:

  • Begins with the search of radio waves throughout the galaxies.  The image above is that of the Giant Galactic Nebula NGC 3603 (also from the Hubble telescope)
  • Those radio waves are detected by radio observatories located in various locations on the planet.
  • All of this data is stored and initially crunched by super computers like the Barcelona Supercomputing Center.
  • Yet a lot of this crunching is done by BOINC-based projects like SETI@Home – i.e. making use of screensavers on one’s home machine.

That is, a sizeable chunk of the search for radio waves in the astronomical heavens is being done by home computer screensavers – around 5.2 Million participants delivering 769 teraFLOPS (11/14/2009) of processing power!

The key facets of commoditized distributed computing are then:

  • The problems can be broken down into small enough chunks so they can be distributed and calculated locally.
  • The system is designed – right from the beginning – to engage with hundreds or thousands of machines.
  • The system can easily handle many points of failure transparently.  Auto-replication is one of the ways to prevent this from being a problem.

distro

BOINC projects like SETI@Home are designed to handle the problems associated with distributed computing – network latency, loss of connectivity, tracking tasks, tracking jobs, etc.  The fundamental requirement is the ability to break down the problem into small enough chunks so data can be easily transferred, processed, and transferred back – while keeping track of all of those chunks to ensure that data processing has been completed.
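The chunk-and-track idea can be sketched as follows. This is not BOINC's actual scheduler; the function names, the retry limit, and the deliberately flaky worker are all hypothetical, and everything runs in one process purely to show the bookkeeping.

```python
def split_work(data, chunk_size):
    """Break the input into numbered chunks small enough to ship to a worker."""
    return {
        i // chunk_size: data[i:i + chunk_size]
        for i in range(0, len(data), chunk_size)
    }


def run_distributed(data, chunk_size, worker, max_attempts=3):
    """Hand out chunks, retry the ones that fail, and combine the results."""
    pending = split_work(data, chunk_size)
    attempts = {chunk_id: 0 for chunk_id in pending}
    results = {}
    while pending:
        for chunk_id in list(pending):
            attempts[chunk_id] += 1
            try:
                results[chunk_id] = worker(chunk_id, pending[chunk_id])
                del pending[chunk_id]  # done: stop tracking this chunk
            except ConnectionError:
                if attempts[chunk_id] >= max_attempts:
                    raise  # give up after too many lost connections
    return sum(results.values()), attempts


def make_flaky_worker(fail_once):
    """Hypothetical worker that drops its first attempt at the given chunk ids."""
    seen = set()

    def worker(chunk_id, chunk):
        if chunk_id in fail_once and chunk_id not in seen:
            seen.add(chunk_id)
            raise ConnectionError(f"lost contact while processing chunk {chunk_id}")
        return sum(chunk)

    return worker


total, attempts = run_distributed(list(range(1, 11)), 3, make_flaky_worker({1}))

if __name__ == "__main__":
    print(total, attempts)
```

Chunk 1 is lost on its first attempt and simply gets handed out again, so the final total is still correct; the caller never needs to know a failure happened.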

Elephant

Bringing this back to data systems, Hadoop and distributed data systems are able to take the problem and distribute it across tens / hundreds / thousands of machines with ease.  This is because they were designed with the idea of distributed processing in the first place (e.g. replication, fault tolerance, task restart-ability, etc.).

Discussion

There are many more concepts that need to be covered when we really dive into relational databases compared to Hadoop / Big Data systems.  But the fundamental to start with is that of “scale up” and “scale out”.

As you can see with the Hubble Telescope / SETI projects analogy – both are important and both solve their respective problems in different ways.  This doesn’t mean one is right and one is wrong – this is really more about the adage of “Use the right tool for the right problem”.   After all, it would be really hard to get the amazing images from the Hubble telescope by using hundreds or thousands of smaller commodity telescopes from Costco.   Nor would it be possible for a single powerful telescope to examine all of the radio waves from the observatories on Earth.

So when it comes to scaling up or scaling out for data problems – it’s not about which one to use, it’s about which one to use when.
