In the process of building a set of SSIS packages that would process an immense amount of data, below are some learnings that have been determined in the process. The context of these tips is to build a set of packages that would bulk insert many files into a SQL database.
For Each Loop
In order to bulk insert many files from the file system to a database, one of the nice options included within SSIS is the inclusion of the “Foreach Loop” Container. An obvious selection would be to use the “Foreach File Enumerator”. After all, it allows you to “link” to a directory, loop through all of the files within that folder, and perform whatever process (e.g. bulk insert the data in those files) that needs to be performed.
This particular enumerator may have some issues when new files are constantly being added to the folder being executed against. Note, I do emphasize may here because some of the empirical tests executed were inconclusive on this last statement. Note, if you want to use the ForEach enumerator anyways, please do refer to Jamie Thomson’s blog at: http://blogs.conchango.com/jamiethomson/archive/2006/10/13/SSIS_3A00_-Expressions-on-ForEach-enumerators.aspx (Thanks to Jamie for correcting this blog as well! )
To resolve this, one should use the “Foreach from Variable Enumerator” which will then allow you to dynamically note which file location at run-time and add a checkpoint if there are files streamlining into that file location. Saying this below is a partial screenshot of a SSIS package of how to bulk insert multiples files into a SQL database.
File Loop Package
As you can see from the graphic below, the basic objects needed for this package are the “Get Item FileList” task, Foreach Loop container, Update BulkFile Connection task, Bulk Insert Task, the two connections noted within the “Connection Managers”
and the four global variables noted in the graphic below
The reason for using global variables is because this allows one to either input the values of these variables at run-time and/or obtain these values from a metadata repository (e.g. registry, database, configuration file, etc.). This will allow a degree of flexibility especially when running the system on an enterprise environment and/or mix environments.
The four global variables needed for this package are:
ConnString: This string variable allows one to enter the connection string of the SQL database to be connected. Note, the connection within the connection manager itself may already be connected to your desired database and then you may not need this variable. But, just in case you want the ability to do this, you can create a global variable which contains the connection string text within it. Later in this article you will see how to modify the connection manager to use this variable.
FileDir: This string variable contains the directory location containing the files one wants to process. Because this is variable, one can build an additional script task (not included in this article) which can obtain the file directory from metadata.
FileList: This is object variable which will be used to store the list of files found within the specified directory. The above noted “Get Item FileList” task list (more details below) performs the task of populating this variable and the “ForEach Loop” container will extract the list of files from this variable.
FileName: This string variable is used by the “Foreach Loop” container. As you can see from the Foreach Loop dialog below in the Foreach Loop Container section, notice the Variable User::FileList.
Changing Connection Manager via Expressions
As noted above, the ConnString variable was created so one can dynamically create a connection string at run-time. Note, an important thing about using global variable connection strings is that it is important to provide a valid database connection string right at run-time (e.g. via command line parameter and/or configuration). If the connection string is invalid (blank, etc.), SSIS will automatically validate the connection to an invalid database and thus the package will immediately fail.
Presuming you are comfortable with these constraints, then you can update the properties DbConnection (within the Connection Managers) by adding an “Expressions property” as per above. Specifically, the “ConnectionString Expressions property” can be pointed to a global variable (in this case, the ConnString variable) such that this database connection can be dynamically generated at run-time.
Get Item FileList
As alluded to above, the “Get Item FileList” script task will perform the task of retrieving the file list from the directory specified in the “FileDir” global variable. It will then populate the “FileList” object global variable with the full list of files. By initially setting the ScriptTask of “ReadOnlyVariables” with “FileDir” and “ReadWriteVariables” with “FileList”, the Visual Basic .NET script code-below will perform the task of updating the “FileList” global variable.
Public Class ScriptMain
Private listForEnumerator As ArrayList
Public Sub Main()
Dim directoryName As String
Dim localFiles() As String
Dim localFile As String
Dim fileFilter As String
‘ Looking for files with the naming convention of "MyFiles*.csv"
fileFilter = "MyFiles*.csv"
‘ Create an array list that will contain the files.
listForEnumerator = New ArrayList()
‘ Obtain directory name that contains the files from the
‘ global variable "FileDir"
directoryName = Dts.Variables.Item("FileDir").Value.ToString()
‘ Create the array of local files which allows you to filter for
‘ file names of only "MyFiles*.csv" as specified in the above
localFiles = Directory.GetFiles(directoryName, fileFilter)
‘ Additional logic can be added in the for loop below
‘ e.g. based on ModifiedDateTime, additional name filters, etc.
For Each localFile In localFiles
’ Transfer the array of files to the "FileList" object variable
Dts.Variables("FileList").Value = listForEnumerator
Dts.TaskResult = Dts.Results.Success
The above code has comments that explains the code. Note, some of this code has been copied from the SSIS samples that can be found at http://msdn.microsoft.com/.
Foreach Loop Container
As noted in the beginning of this article, the Foreach Loop Container should be using the “Foreach From Variable Enumerator” pointing to the “FileList” global variable. Within the “Variable Mappings” of the “Foreach Loop Editor”, set the Variable to point to the “User::FileName” variable with an Index of 0. Therefore, the tasks within this container (“Update BulkFile Connection” task and the “Bulk Insert task”) will reference the “FileName” global variable and loop through the entire “FileList” array until complete.
Update BulkFile Connection
For the database connection (DbConnection) within the connection manager, to dynamically change the connection string we used a global variable and an expression to evaluate that global variable. But for the BulkFile connection, we are using some code instead. The BulkFile connection is a file connection to a specific file. But the problem with trying to process many files is that you will need to update the BulkFikle connection to be constantly updated so that it is pointing to a new file.
Since the “Update BulkFile Connection” script task is within the “Foreach Loop” container, this script task will have access to the “FileName” global variable that will be updated with a new file name for each loop. Therefore, the script task will need to make use of it, and then update the BulkFile connection manager to point to this new file.
To do this, make sure to:
1) The "Update BulkFile Connection" script task has Script -> ReadOnlyVaribles pointing to the "FileName" global variable.
2) The VB.NET Script code within this script task is quite simple and noted below
Public Class ScriptMain
Public Sub Main()
‘ Set BulkFile connection to gvBulkFile global variable
Dts.Connections.Item("BulkFile").ConnectionString = Dts.Variables.Item("FileName").Value.ToString()
Dts.TaskResult = Dts.Results.Success
For every loop within the “Foreach Loop” container, the “FileName” global variable is updated from the “FileList” object global variable with a new file name. This file name is then passed on to the above VB.NET Script code which will then update the “BulkFile” connection to point to this new file.
Bulk Insert Task
The final task within the “ForEach Loop” container is to actually perform the task of bulk inserting the data from the file to the database. To set it up, create a bulk insert task with the configuration noted below.
As you can see:
Destination Connection: This points to the database (in this case) which is already specified by the DbConnection connection manager. The Destination Table itself can be pre-specified as per above (test.dbo.CsvData) or you can use an expression to point to some global variable which contains the table name.
Format: In the case of this example, I’m using a CSV formatted file hence the format choices made. Saying this, like any bulk insert task you can specify TSV, binary, and/or your own format file to bulk insert the data.
Source Connection: This is pointing to the BulkFile connection manager which was manipulated just the task before (Update BulkFile Connection) to point to the new file.
As you can see within this article, above are some pointers on how to use the connection managers, expressions, and script code to bulk insert many files into a database. A key for any enterprise or diverse production environment is that the use of global variables will also allow you to use configuration files, metadata, and/or command line parameters to manipulate the processing at run-time.