
A big shout out to Brad Sarsfield (@bradoop) for creating these great How-To videos for Hadoop on Azure.

 

How To: Upload Data and Use the WordCount Sample with Hadoop Services for Windows Azure (video)

 

 

Run the Pi Estimator Sample on Hadoop on Windows Azure (video)

As I am writing more about Big Data, I’ve been asked whether we still need traditional relational or cube systems now that we have Big Data / NoSQL / Hadoop.  My response is that these are different systems that serve different purposes, even though both are used to better understand data.

But before we dive into the specifics surrounding relational databases compared to Hadoop / Big Data, we need to first talk about the difference between solving a data problem by scaling up and solving it by scaling out.

One way to understand the difference is to use a space analogy.  (If you’re part of the SQL Twitter community, you’ll notice a prevalence of NASA and Space tweets)

Scale Up Space Analogy

Antennae Galaxies

The amazing image above is The Antennae Galaxies / NGC 4038-4039 from the Hubble Telescope – click on the image to see the full extra-large image, it is magnificent.

Database

In terms of space analogies, the Hubble Telescope is analogous to a scale up technology such as your traditional relational database.

  • It often utilizes non-commodity hardware (or if it is commodity, it’s enterprise commodity hardware).
  • Specialized equipment was designed and built for the Hubble telescope.  While not as astronomic (!) in price, enterprise database performance requires more specialized (and expensive) hardware.
  • The Hubble telescope in itself is a single point of failure in that if we were to lose the telescope (or a lens), we would lose the ability to get all of these amazing images.

This is NOT to say scale up is a bad thing, after all we get these amazing images from the Hubble telescope precisely because we (well, NASA scientists) have focused on creating Hubble with non-commodity specialized hardware.  Following the same analogy, relational database systems (or cube systems) also benefit from this scale up approach because it becomes possible to provide query results quickly.

Scale Out Space Analogy

Galaxies to SETI

The flow of the above image is my representation of the Search for Extraterrestrial Intelligence (SETI) project.  SETI itself has the ultimate example of scale out distribution with its SETI@Home project.  The search for extraterrestrial intelligence:

  • Begins with the search for radio waves throughout the galaxies.  The image above is that of the Giant Galactic Nebula NGC 3603 (also from the Hubble telescope)
  • Those radio waves are detected by radio observatories located in various locations on the planet.
  • All of this data is stored and initially crunched by supercomputers like the Barcelona Supercomputing Center.
  • Yet a lot of this crunching is done by BOINC-based projects like SETI@Home – i.e. making use of screensavers on one’s home machine.

That is, a sizeable chunk of the search for radio waves in the astronomical heavens is being done by home computer screensavers – around 5.2 million participants contributing roughly 769 teraFLOPS of processing power (as of 11/14/2009)!

The key facets of commoditized distributed computing are then:

  • The problems can be broken down into small enough chunks so they can be distributed and calculated locally.
  • The system is designed – right from the beginning – to engage with hundreds or thousands of machines.
  • The system can easily handle many points of failure transparently.  Auto-replication is one of the ways to prevent this from being a problem.

distro

BOINC projects like SETI@Home are designed to handle the problems associated with distributed computing – network latency, loss of connectivity, tracking tasks, tracking jobs, etc. The fundamental requirement is the ability to break the problem down into small enough chunks that data can be easily transferred, processed, and transferred back – while keeping track of all of those chunks to ensure that data processing has been completed.
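To make the "break it into chunks, hand them out, keep track of them" idea a bit more concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions: the function names (split_into_chunks, process_chunk, run_distributed) are made up, a local thread pool stands in for the remote machines, and summing numbers stands in for the real work. It is not how BOINC or Hadoop are actually implemented.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def split_into_chunks(data, chunk_size):
    """Break the problem into pieces small enough to hand out."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def process_chunk(chunk):
    """Stand-in for the work a remote machine would do (hypothetical)."""
    return sum(chunk)


def run_distributed(data, chunk_size=4, workers=3, max_attempts=3):
    """Hand chunks to workers, track each one, and retry chunks that fail."""
    chunks = split_into_chunks(data, chunk_size)
    results = {}                                   # chunk id -> result
    attempts = {i: 0 for i in range(len(chunks))}  # chunk id -> tries so far
    pending = set(attempts)                        # chunks not yet completed

    with ThreadPoolExecutor(max_workers=workers) as pool:
        while pending:
            futures = {pool.submit(process_chunk, chunks[i]): i for i in pending}
            for future in as_completed(futures):
                i = futures[future]
                attempts[i] += 1
                try:
                    results[i] = future.result()
                    pending.discard(i)             # this chunk is done
                except Exception:
                    if attempts[i] >= max_attempts:
                        raise                      # give up on this chunk

    # Combine the per-chunk results once every chunk has reported back.
    return sum(results[i] for i in range(len(chunks)))
```

The part worth noticing is the bookkeeping: the job only finishes when the pending set is empty, so a lost or failed chunk simply gets handed out again instead of silently disappearing.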

Elephant

Bringing this back to data systems, Hadoop and distributed data systems are able to take the problem and distribute it across tens / hundreds / thousands of machines with ease.  This is because they were designed with the idea of distributed processing in the first place (e.g. replication, fault tolerance, task restart-ability, etc.).

Discussion

There are many more concepts that need to be covered when we really dive into relational databases compared to Hadoop / Big Data systems.  But the fundamental concept to start with is that of “scale up” versus “scale out”.

As you can see with the Hubble Telescope / SETI projects analogy – both are important and both solve their respective problems in different ways.  This doesn’t mean one is right and one is wrong – this is really more about the adage of “Use the right tool for the right problem”.   After all, it would be really hard to get the amazing images from the Hubble telescope by using hundreds or thousands of smaller commodity telescopes from Costco.   Nor would it be possible for a single powerful telescope to examine all of the radio waves from the observatories on Earth.

So when it comes to scaling up or scaling out for data problems – it’s not about which one to use, it’s about which one to use when.

Sunny Sunday: Tofino

TofinoIsolatedBeach

This is a picture of an isolated beach in the wonderful town of Tofino (yes, I actually took it!).  Located on the west coast of Vancouver Island – if you are a surfer, camper, hiker, or just plain old nature lover – this is a beautiful place to hang out.  Drive up from Victoria to Nanaimo (yes, of the famed Nanaimo bars) and then cut through Vancouver Island to its west coast – it is a wonderfully scenic drive (this from a person who doesn’t like driving).

For more information on Tofino, check out Tofino’s Wikipedia page – and check out Bing’s images of Tofino.  Oh, and if you go there – I highly suggest the local seafood joint, The Schooner Restaurant.

 

 

About “Sunny Sunday”: The Sunny Sunday blog posts are photos from various travel and/or outdoor (hiking) trips.

PASS 2011 Keynote Isotope

“I caught a fish thiiiiis biiig”

– On stage with Ted Kummert during the PASS 2011 Keynote on Big Data (thanks to Karen Lopez @datachick for the pic)

During the PASS 2011 Keynote (back in October 2011), I had the honor to demo Hadoop on Windows / Azure.   One of the key showcases during that presentation was to show how to connect PowerPivot to Hadoop on Windows.  In this post, I show the steps on how to connect PowerPivot to Hadoop on Azure.

Pre-requisites


Configuration Steps

1) Reference the following steps from How To Connect Excel to Hadoop on Azure via HiveODBC

The steps to follow are:

  • Install the HiveODBC Driver (we will configure the DSN later)
  • Steps 1 – 3 from Using the Excel Hive Add-In to open the ports in Hadoop on Azure

image
2) Create a Hive ODBC Data Source > File DSN

Here, we will go about creating a File DSN Hive ODBC Data Source.

Thanks to Andrew Brust (@andrewbrust), the better way to make a connection from PowerPivot to Hadoop on Azure is to create a File DSN.  This allows the full connection string to be stored directly within the PowerPivot workbook instead of relying on an existing DSN.

To do this:

  • Go to the ODBC Data Sources Administrator and click on the File DSN tab.

image

  • Click Add, choose HIVE, click Next, click Browse to choose a location for the file, then click Finish.

image

  • Open the File DSN you just created and click Configure.  In the ODBC Hive Setup dialog, configure the host (e.g. [clustername].cloudapp.net) and authentication information (the username is the one you specified when you created the cluster).

image
3) Connect PowerPivot to Hadoop on Azure via the HiveODBC File DSN

  • Open up the PowerPivot ribbon and click on the Get External Data from Other Sources.

image

  • From the Table Import Wizard, click on the Others (OLEDB/ODBC) and click Next.

image

  • From here, click Build. In the Data Link Properties dialog, click on Provider, and ensure the Microsoft OLEDB Provider for ODBC Drivers is selected. Click Next.
  • In the Data Link Properties dialog, choose “Use connection string”, and click Build and choose the File DSN you had created from Step #2.  Enter in the password to your Hadoop on Azure cluster.  Click OK.

image

  • The Data Link Properties now contains a connection string to the Hadoop on Azure cluster.

image

Note, after this dialog, verify that the password has been entered into the connection string that has been built into the Table Import Wizard.  In the screenshot, the blue arrow points to a missing PWD=<password> clause.  If the password isn’t specified, make sure to add it back in.
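If you want to sanity check a connection string outside of the wizard, the idea is simply "is there a non-empty PWD clause, and if not, add one". Here is a small Python sketch of that check. Treat it as illustrative only: the helper names are made up, the sample strings in the usage note are invented, and the parser assumes a plain KEY=value;KEY=value layout (it does not handle values wrapped in braces that contain semicolons).

```python
def parse_connection_string(conn_str):
    """Split 'KEY=value;KEY=value' into an ordered list of (key, value) pairs."""
    pairs = []
    for part in conn_str.split(";"):
        if not part.strip():
            continue  # ignore empty segments such as a trailing ';'
        key, _, value = part.partition("=")
        pairs.append((key.strip(), value.strip()))
    return pairs


def ensure_password(conn_str, password):
    """Return conn_str unchanged if it has a non-empty PWD clause, else add one."""
    pairs = parse_connection_string(conn_str)
    has_pwd = any(key.upper() == "PWD" and value for key, value in pairs)
    if has_pwd:
        return conn_str
    # Drop any empty PWD clause, then append the real one at the end.
    pairs = [(key, value) for key, value in pairs if key.upper() != "PWD"]
    pairs.append(("PWD", password))
    return ";".join(f"{key}={value}" for key, value in pairs)
```

For example, calling ensure_password("FILEDSN=hive.dsn;UID=example_user", "s3cret") appends a PWD clause to the end of the string, while a string that already carries a password comes back untouched.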

image

  • Click OK, click Next.  From here you will get the Table Import Wizard and we are back to the usual PowerPivot steps.
  • Click on “Select from a list of tables and views to choose the data to import”

image

  • Choose your table (e.g. hivesampletable) and import the data in.

image

It looks like a lot of steps but once you get into the flow of things, it’s actually a relatively easy flow.

Enjoy!

Warehouse 13


I love Pittsburgh, they put fries on nachos here.

– Pete Lattimer, Warehouse 13

For those unfamiliar with the reference, Warehouse 13 is an awesome Syfy show…and Costco is a warehouse store – yeah, weak connection here.

Yet another themed blog series

Starting with the recent Foodie Friday blog post (Foodie Friday: Taiwanese dessert 芋圓), figured I should add another non-geek themed blog post series – Travel Tuesday – probably every week or two, eh?!

Okay…so what’s this about Costco?

Yeah, right – so back to the title of this post – what are the top 5 reasons to go to Costco in Taiwan (from a US ex-pat)?  The background is that my family and I are hanging out in Taiwan for the next few months – so lots of weird tips and foodie tips from Taiwan over the next few months, eh?!

5) Well, it’s Costco after all!

Where else can I buy enough toilet paper to survive the next ice age?  Or have such easy returns?  Or have actually good customer service?  But what’s great is that this is the same here in Taiwan too!

It’s still a huge warehouse with tonnes of the stuff that us ex-pats recognize but also plenty of stuff that’s made for the local market such as great Korean pears, Japanese quality fruit, etc.

4) OMG – Spacious Parking!!

If you drive around in Taiwan – almost anywhere in Taiwan that isn’t a highway – you are absolutely surrounded by a million scooters.  And if you don’t get a migraine from avoiding running over any of the scooters, parking in Taiwan…well, parking just sucks.  Small parking spaces, vehicles parking in places that just …aren’t parking spaces (e.g. in the middle of the road), …, ugh!

And at Costco – that’s just nice.  Wide lanes so two cars can actually comfortably fit, plenty of spots, and most importantly – spacious parking spots so I can actually park, get out of the car, and not worry that another car will trap me from getting into the car.

3) “Reminds me” of Seattle

And as many of you know, I’m based out of Seattle… and so is Costco! Costco’s home office is in Issaquah (suburb of Seattle). Costco’s Kirkland brand name is in homage to their original headquarters which was in Kirkland, WA (another suburb of Seattle).

So whenever we’re missing home – we just head off to Costco and we’re good to go!  Sort of reminiscent of Garbage’s Only Happy When it Rains!

2) The ability to bulk order scotch

Yeah, that’s just cool!  ‘nuff said!

But the primary reason you want to go to Costco when in Taiwan (from a US ex-pat):

1) Toilet Seat Covers!

I don’t think I need (nor do you want me) to explain this one!

Confucius says: Man who stand on toilet is high on pot!

026_23A

When you hike up to Rattlesnake Ridge, you get a nice 270° view of the Rattlesnake Ridge area.  For more info, check out: http://www.wta.org/go-hiking/hikes/rattle-snake-ledge

About “Sunny Sunday”: The Sunny Sunday blog posts are photos from various travel and/or outdoor (hiking) trips.

The posting Setup Azure Blob Store for Hadoop on Azure CTP provides a quick way to upload files to your Azure Blob storage account and connect Hadoop on Azure CTP to it.  Now that you have done that, one of the first things you may want to do is to interact with the data.

To do this, let’s create a Hive table within Hadoop on Azure CTP that is connected to the files you uploaded to your Azure Blob storage account and query it.  We will be referencing the scenario noted at: Hadoop on Azure Scenario: Query a web log via HiveQL

The tasks we will be performing are:

  1. Setup Azure Blob Store for Hadoop on Azure CTP
  2. Create a Hive table referencing the files in the Azure Blob Storage account
  3. Execute a simple query

1) Setup Azure Blob Store for Hadoop on Azure CTP

To do this, please refer to Setup Azure Blob Store for Hadoop on Azure CTP

.

2) Create a Hive table referencing the files in the Azure Blob Storage account

Following the Hadoop on Azure Scenario: Query a web log via HiveQL scenario

  • Go to the Hadoop on Azure Interactive Hive Console
  • Create a Hive table using the statement below

CREATE EXTERNAL TABLE weblog_sample_asv (
evtdate STRING,
evttime STRING,
svrsitename STRING,
svrip STRING,
csmethod STRING,
csuristem STRING,
csuriquery STRING,
svrport INT,
csusername STRING,
cip STRING,
UserAgent STRING,
Referer STRING,
scstatus STRING,
scsubstatus STRING,
scwin32status STRING,
scbytes STRING,
csbytes STRING,
timetaken STRING
)
COMMENT 'This is a web log sample ASV'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '32'
STORED AS TEXTFILE
LOCATION 'asv://weblog/sample';

Note that the only difference between the original HiveQL script (which goes to HDFS) and the one that goes to the Azure Blob storage is the highlighted LOCATION statement using the asv protocol.

NOTE: As noted in Setup Azure Blob Store for Hadoop on Azure CTP, we are using the protocol of asv://<container>/<folder> so that it’s possible for Hadoop to view any and all files uploaded to the sample folder.
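To see how an asv://<container>/<folder> location breaks down, here is a tiny Python sketch that splits one into its container and folder parts. The function name split_asv_uri is made up for illustration, and this is only a way to reason about the URI layout; it is not part of Hadoop on Azure.

```python
from urllib.parse import urlparse


def split_asv_uri(uri):
    """Split 'asv://<container>/<folder>' into (container, folder)."""
    parsed = urlparse(uri)
    if parsed.scheme != "asv":
        raise ValueError(f"expected an asv:// URI, got: {uri!r}")
    container = parsed.netloc         # the part right after asv://
    folder = parsed.path.strip("/")   # everything after the container
    if not container:
        raise ValueError(f"no container found in: {uri!r}")
    return container, folder
```

With the location used in the table above, split_asv_uri("asv://weblog/sample") gives back the weblog container and the sample folder.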

image

 

3) Execute a simple query

Now that you have created a Hive EXTERNAL table that points to the files located in the weblog/sample folder of your Azure Blob storage account, you can now query it.

The screenshot below shows the result of:

select * from weblog_sample_asv limit 10;
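If you are curious what the table definition plus this query roughly amount to, here is a Python sketch that splits each log line on a space and returns the first ten rows. A few assumptions to call out: I am reading the '32' delimiter as the decimal ASCII code for a space, the sample log line in the usage note is invented, and the helper names are made up. This is a mental model, not how Hive executes the query.

```python
from itertools import islice

# Column names copied from the CREATE EXTERNAL TABLE statement.
COLUMNS = [
    "evtdate", "evttime", "svrsitename", "svrip", "csmethod", "csuristem",
    "csuriquery", "svrport", "csusername", "cip", "UserAgent", "Referer",
    "scstatus", "scsubstatus", "scwin32status", "scbytes", "csbytes",
    "timetaken",
]


def parse_line(line):
    """Split one space-delimited log line into a column -> value dict."""
    fields = line.rstrip("\n").split(chr(32))  # chr(32) is a space
    row = dict(zip(COLUMNS, fields))
    # svrport is declared INT in the table; everything else stays a string.
    try:
        row["svrport"] = int(row["svrport"])
    except (KeyError, ValueError):
        row["svrport"] = None
    return row


def select_limit(lines, limit=10):
    """Rough stand-in for: select * from weblog_sample_asv limit <limit>."""
    return list(islice((parse_line(line) for line in lines), limit))
```

For example, feeding it a made-up line such as "2012-01-01 01:02:03 W3SVC1 10.0.0.1 GET /default.htm - 80 - 10.0.0.2 Mozilla/5.0 - 200 0 0 1024 256 15" yields a row whose csmethod is GET and whose svrport is the integer 80.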

image

One of the cool ways to run Hadoop on Azure is to have it connect to Azure Blob storage via your Windows Azure Storage account.  To set up your Azure storage account, please refer to http://windows.azure.com. The tasks below will allow you to set up your Hadoop on Azure CTP account to connect to an existing Azure Blob Storage account using the asv protocol.  For example, within Hadoop, you normally would get a listing of files within HDFS using the command line interface:

hadoop fs -ls /

In the case of accessing files within Azure Blob storage, you can run the command:

hadoop fs -ls asv://<container>/<folder>

The basic steps are:

  1. Obtain the Azure Blobstore Storage Account Name and Access Key.
  2. Set up ASV connection between Hadoop on Azure CTP and your Windows Azure Blob Storage account.
  3. Upload files to your Azure Blob Storage account

1) Obtain the Azure Blobstore Storage Account Name and Access Key

Access your Azure Blobstore Storage account through the Windows Azure Platform dashboard via http://windows.azure.com/.  From here, the navigation path is [Hosted Services, Storage Accounts & CDN] (bottom left) –> [Storage Accounts] (mid-top left).

  • The blobstore account name is the Storage Account under the subscription as noted within the middle pane.  In this case, I have a storage account called isocatstore.
  • To get the access key, click on the [View] button on the properties right pane after clicking on the storage account in question.

image

 

2) Set up ASV connection between Hadoop on Azure CTP and your Windows Azure Blob Storage account.

From the Hadoop on Azure CTP portal page, click on the [Manage Data] tile.  From here, click on the [Set up ASV] button on the right.

Manage Data

From here, you can supply the credentials of your Azure Blob Storage account that you had obtained in Step 1.

image

Click on [Save Settings] and you are good to go.

 

3) Upload files to your Azure Blob Storage account

A great way to upload files to your Azure Blob Storage account is to use CloudXplorer – you can download it from here: http://clumsyleaf.com/products/cloudxplorer

NOTE: When you upload the files, please make sure to place them within a folder within a container of your blobstore account.  It is important to do this so that Hadoop will be able to list all of the files within the folder instead of you needing to access each file individually (which is what would happen if you placed the files directly within the container).
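One way to picture why the folder matters: blob names in a container are flat strings, and a "folder" is really just a shared name prefix ending in a slash. The Python sketch below shows listing by prefix. The function name list_folder and the blob names in the usage note are made up for illustration, and this only models the idea rather than calling any Azure API.

```python
def list_folder(blob_names, folder):
    """Return the blob names that live under the given folder prefix."""
    prefix = folder.strip("/") + "/"
    return sorted(name for name in blob_names if name.startswith(prefix))
```

For example, given the names sample/a.log, sample/b.log, and root.log, asking for the sample folder returns the first two, while root.log (placed directly in the container) is not part of any folder listing.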

From CloudXplorer, you can quickly create a container and a folder; in this case, I had created the weblog container and the sample folder.

image

Using the intuitive UI, copy your files from your local box to the Azure Blob Storage account.

By doing it in this fashion, you will be able to get a listing of your files from the Hadoop command line interface using the command:

hadoop fs -ls asv://weblog/sample

As well, from the Hadoop on Azure JavaScript Interface, you can view a listing of files using the command

#ls asv://weblog/sample

image

I realized from my tweets and Facebook posts (thanks Facebook Timeline) that I often go outside of the realm of being a nerd and display all sorts of passion for travel, outdoor stuff, and of course food!

So one of the things I’m going to start adding to my repertoire of blog posts will be Foodie Friday.  Every Friday (well, maybe every 2nd Friday), I’ll blog about some awesome food that I like and that you may want to try!

Taiwanese “Small Eats” and Night Markets

So for my first post, let’s talk about 芋圓 – also known as taro circles.  In Taiwanese – it’s pronounced like ong-yi.  Taiwan is famous for many of its 小吃 – or “small eats”.  Instead of eating a large portion of one particular item, the key is to have many small portions of A LOT of different varieties.   Walk down any of the many 夜市 – night markets – in almost any Taiwanese city – and there are tons of small vendors that make just a few items – specializing in them – for years.  It’s the old adage of the old Taiwanese cook who has used the same iron pot for 30 years to make fried rice.  But the fried rice tastes better because the pot has many different layers of the same fried rice made the same way for the last 30 years.

Taro Circles or 芋圓

So back to the 芋圓 – basically it’s a simple dessert typically made with multi-colored taro circles, herbal black jelly, tapioca, green beans and/or red beans on top of crushed ice swimming in sugar / glucose water.

Yet, it is deliciously yummy and very light – very typical of many Taiwanese desserts.

There are many locations in Taiwan and you pretty much find it at any night market and there are usually plenty of stands on every major street.  Typical of many Taiwanese foods, people will have different takes and preferences – and debate adamantly about which stand serves better 芋圓 than another.  There could be three stands all next to each other – and you’ll see the vast majority of people lined up at just one of those stands just because it was the first, or has the better red bean, or better peanuts, or better taro, etc. etc.

The thought has occurred to me many, many times as I visit my favorite place in 斗六 (Doulieu, in central Taiwan about 40 minutes southwest of 台中, Taichung) – called 青麥芋圓 – to create one in Seattle.  I wonder how well it would do …

P.S. if you ever are in 斗六, do check out 青麥芋圓 and order the 芋圓 #4 and #5.  Enjoy!

Doctor Who Amys Choice


Funny how you can say something in your head and it sounds fine

– Doctor Who (Matt Smith in Amy’s Choice)


Recently I had posted the wiki article: Hadoop on Azure Scenario: Query a web log via HiveQL.

It described how to analyze a sample web log using HiveQL on the Hadoop on Azure CTP (HadoopOnAzure.com) using

  • Interactive Hive console,
  • Interactive JavaScript console,
  • Secure FTP using curl to transfer data to HDFS
  • Creating an EXTERNAL table against compressed log files
  • Executing some simple HiveQL queries.

HiveQL weblog

Screenshot of the Hadoop on Azure Interactive Hive Console executing a HiveQL query.

A bunch of people had followed up with me directly on the article with the question:

Gee – do I have to do all of these steps to do work with Hadoop on Azure?!

The quick answer is: No, the purpose of the wiki article was to showcase how one can interact with Hadoop on Azure – which is very different from the traditional command line interface (CLI).  For a great blog post on how the design of Hadoop on Azure was conceived, jump over to Dave Vronay’s blog post: The Design of the Portal for HadoopOnAzure.com.

The wiki entry is a “scenarios” article – so that you can try all of the different functionality available to you on the Hadoop on Azure portal – like using the interactive consoles, running HiveQL queries, etc.

image

Screenshot of the Hadoop on Azure portal where you interact with live tiles to open ports, remote desktop in, and use an interactive JavaScript or Hive console to run your queries.

Most importantly – because of the cool design, try out the scenarios not only on your desktop/laptop but also on your mobile devices, eh?!

Enjoy!
