Quick Tip for extracting SQL Server data to Hive

While I have documented various techniques to transfer data from Hadoop to SQL Server / Analysis Services (e.g. How Klout changed the landscape of social media with Hadoop and BI Slides Updated, SQL Server Analysis Services to Hive, etc.), this post calls out the reverse – how to quickly extract SQL Server data to Hadoop / Hive.   This is a common scenario where SQL Server is being used as your transactional store and you want to push data to some other repository for analysis where you are mashing together semi-structured and structured data.

How to minimize impact on SQL Server

Analogous to the above linked documents, you could use a JDBC driver to connect and extract this data from SQL Server (other techniques may involve SQOOP, HiveODBC, etc.).  But if your security policies allow for this, a fast way to extract data from your SQL Server with minimal impact is to BCP the data out.
In many scenarios, the data in your SQL Server is already partitioned by date, so you can run a BCP queryout statement that utilizes the partition key so that the bulk copy query will only hit the specific daily partition – i.e. no table scans, limited to no partition scans, minimizing IO, etc.

But what field terminator?

But one of the key problems is what field terminator should you use to extract out the data.  Traditional field terminators are tabs (\t), commas (,), or pipes (|) – yet with the types of events, messages, and logs stored these days the previously mentioned field terminators will likely be in the data extract as well.   For example, if a Hive external table is built against comma-delimited files with the fields themselves containing commas, your schema will get messed up.

By default, Hive itself uses the ^A (01) as its field terminator; if you do not specify a field terminator when creating a Hive table, the file(s) Hive generates it will terminate fields using ^A.  So if you could create bulk extract from SQL Server that is ^A-delimited, you’re set!

So how do I create a ^A-delimited file via BCP from SQL Server

To create a ^A-delimited file via BCP from SQL Server, the syntax is as follows:

bcp “select * from [table] with (no lock) where tx_date >= ’12/26/2013 00:00:00′ and tx_date < ’12/27/2013 00:00:00′” queryout test.log -S $servername -T -c -t 0x01 -a 65535 -b 100000

As noted in the BCP syntax above, the -t (field terminator parameter) has the value of 0x01 which corresponds to ^A (01) in Windows.

By bulk exporting your data in this fashion, you can now copy / put your data into your Hadoop cluster in Hive’s default field terminator.


One Comment

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s