A blog about SQL Server, SSIS, C# and whatever else I happen to be dealing with in my professional life.

Find ramblings

Tuesday, November 29, 2011

The worst query I ever wrote

I work in an unusual environment---not everything is documented. I know, that sounds too unlikely to be true but it is. There was data in the database that needed to be reported on but not through the supplied UI. In this case, the specifics was the data was stored in Tfs_DefaultCollection and somewhat reportable through Visual Studio Team Explorer but the users wanted queries and reports that weren't supported through the API.

Enter my big mouth saying it's a database I can find a way to make it work

After randomly picking tables and scrolling through values, I needed a smarter approach. Smartness for me meant using the system meta data to find any column that could hold my data (right data type and minimum length). I then build and execute a query looking for an example of the search key in that column. Below is an example of a search for the word change. What I like about this query, is it ought to be fairly portable/database agnostic as I'm using the INFORMATION_SCHEMA views to identify the candidate schemas, tables and columns.


SET NOCOUNT ON;

DECLARE
    -- actual query 
    @query nvarchar(max)
    -- templated query
,   @template nvarchar(max)
    -- string value we wish to find
    -- Length of 50 is arbitrary
,   @searchKey varchar(50);

SET
    @searchKey = 'change';

-- Return column & schema/table combo anywhere
-- there is data in the column that starts like the
-- search key.
SELECT
    @template = N'
    SELECT TOP 1 
        ''<COLUMN_NAME/>'' AS cname
    , ''<SCHEMA/>.<TABLE_NAME/>'' AS tname 
    FROM 
        <SCHEMA/>.<TABLE_NAME/> T 
    WHERE 
        T.<COLUMN_NAME/> LIKE ''<TARGET/>%''';

SELECT
    @template = REPLACE(@template, '<TARGET/>', @searchKey);

DECLARE 
    CSR CURSOR
FOR
-- Iterate through all the columns that are
-- character data types and are at least
-- as long as the search key
SELECT
    ISC.TABLE_SCHEMA
,   ISC.TABLE_NAME
,   ISC.COLUMN_NAME
FROM
    INFORMATION_SCHEMA.COLUMNS ISC
WHERE
    -- filter out tables/views I know I can skip
    ISC.TABLE_NAME not like 'x%'
    AND
    (
        ISC.DATA_TYPE IN ('char', 'nchar', 'varchar', 'nvarchar')
        AND ISC.CHARACTER_MAXIMUM_LENGTH >= LEN(@searchKey)
    );

-- Cursor variables for capturing candidate schemas, tables and columns
DECLARE
    @table_schema sysname
,   @table_name sysname
,   @column_name sysname;

DECLARE
    @RESULTS TABLE
(
    column_name sysname
,   table_schema nvarchar(500) NOT NULL
);

OPEN
    CSR;

FETCH NEXT
FROM
    CSR
INTO
    @table_schema
,   @table_name
,   @column_name;

WHILE (@@FETCH_STATUS = 0)
BEGIN
    -- stub in actual names, make 'em safe via quotename function
    SET @query = REPLACE(@template, '<SCHEMA/>', quotename(@table_schema));
    SET @query = REPLACE(@query, '<TABLE_NAME/>', quotename(@table_name));
    SET @query = REPLACE(@query, '<COLUMN_NAME/>', quotename(@column_name));

    BEGIN TRY
        --PRINT @query
        
        -- Dump results into a table variable
        INSERT INTO
            @RESULTS
        EXECUTE(@query);
        
    END TRY
    BEGIN CATCH
        -- print failing query
        PRINT @query;
        
    END CATCH
    FETCH NEXT
    FROM
        CSR
    INTO
        @table_schema
    ,   @table_name
    ,   @column_name;
END
CLOSE CSR;
DEALLOCATE CSR;

-- Show all the columns and fully qualified tables
-- that contained the value
SELECT 
    R.column_name
,   R.table_schema
,   'SELECT T.* FROM ' + R.table_schema + ' T WHERE T.' + R.column_name + ' LIKE  ''' + @searchKey + '%''' AS explore_query
FROM 
    @results R;

I can run the above against Tfs_DefaultCollection.

column_name table_schema explore_query
[ChildItem] [dbo].[tbl_Version] SELECT T.* FROM [dbo].[tbl_Version] T WHERE T.[ChildItem] LIKE  'change%'
[ProjectName] [dbo].[tbl_Project] SELECT T.* FROM [dbo].[tbl_Project] T WHERE T.[ProjectName] LIKE  'change%'
[ChildItem] [dbo].[tbl_VersionedItem] SELECT T.* FROM [dbo].[tbl_VersionedItem] T WHERE T.[ChildItem] LIKE  'change%'
And easily switch over to Tfs_Warehouse to try and track those values over there
column_name table_schema explore_query
[ChangesetTitle] [dbo].[DimChangeset] SELECT T.* FROM [dbo].[DimChangeset] T WHERE T.[ChangesetTitle] LIKE  'change%'
[FileName] [dbo].[DimFile] SELECT T.* FROM [dbo].[DimFile] T WHERE T.[FileName] LIKE  'change%'
[ProjectNodeName] [dbo].[DimTeamProject] SELECT T.* FROM [dbo].[DimTeamProject] T WHERE T.[ProjectNodeName] LIKE  'change%'

Warnings, caveats, etc

While I'd think it's blindingly obvious, never run code from the interblag. Beyond that, this code is not going to be fast by any stretch of the imagination. It could lock a table for the read. It might bog down your disk infrastructure, especially if you're querying against a very large table. And it's going to do it over and over again so buyer beware.

Monday, November 28, 2011

Filter list in SSIS

I saw the question asked, how can I split/filter a data flow based on an array-like data source.. I thought that sounded like a decent question so I investigated some alternate approaches.

Problem definition

Given an iterable/sequence, what are the options for using that reference data as a filter in SSIS? To test this, I created a reference list of US States and their FIPS code as well as an indicator of their allegiance in the Civil War. There's no significance to the data beyond it was a publicly available dataset. I threw against that reference set 60 numbers (monotonically increasing values) to how it could be accomplished. As there are only 50 states and not all were in existence during the Civil War, I expect my data to be split into four buckets: Union, Confederate, Undeclared (valid FIPS code but not in existence), Unknown (invalid code).

Package Setup

I created 3 sets of variables (Confederate, Union and Undeclared) in 2 flavors (String and Object).
variables

I have a script that runs that populates them.

public void Main()
{
    // User::ConfederateList,User::ConfederatesStates,User::UndeclaredList,User::UndeclaredStates,User::UnionList,User::UnionStates
    List<int> Confederates = new List<int> { 01, 05, 12, 13, 22, 28, 37, 45, 47, 48, 51 };
    List<int> Union = new List<int> { 06, 09, 10, 17, 18, 19, 20, 21, 23, 24, 25, 26, 27, 29, 32, 33, 34, 36, 39, 41, 42, 44, 50, 54, 55 };
    List<int> Undeclared = new List<int> { 02, 04, 08, 15, 16, 30, 31, 35, 38, 40, 46, 49, 53, 56 };

    this.Dts.Variables["ConfederateList"].Value = Confederates;
    this.Dts.Variables["ConfederatesStates"].Value = string.Format(":{0}:", string.Join(":", Confederates.ConvertAll<string>(delegate(int i) { return i.ToString(); }).ToArray()));

    this.Dts.Variables["UnionList"].Value = Union;
    this.Dts.Variables["UnionStates"].Value = string.Format(":{0}:", string.Join(":", Union.ConvertAll<string>(delegate(int i) { return i.ToString(); }).ToArray()));

    this.Dts.Variables["UndeclaredList"].Value = Undeclared;
    this.Dts.Variables["UndeclaredStates"].Value = string.Format(":{0}:", string.Join(":", Undeclared.ConvertAll<string>(delegate(int i) { return i.ToString(); }).ToArray()));

    Dts.TaskResult = (int)ScriptResults.Success;
 }

Once this script has completed, these variables will look like
locals window

I also use this script as the source in my Data Flow tasks.

public override void CreateNewOutputRows()
{
    for (int i = 0; i < 60; i++)
    {
        Output0Buffer.AddRow();
        Output0Buffer.StateCode = i;
    }
}

Conditional split

The Conditional Split transformation and the expression language itself does not offer any lookup type operation so a complex object like an array or List is out of the question. However, I was able to elicit a lookup-like functionality by using the FINDSTRING backwards. Normally, I think of using it as FINDSTRING(MyColumn, "SomeValue", 1). That is, if SomeValue exists anywhere in MyColumn, the return value would be the one-based position where the value is found, zero for no match.

However, if you think of it opposite, you could use the reference value as the first argument and the current value in the second. Let's put some values to put this thinking in perspective. Missouri's code is 29. The set of Union states is 6, 9, 10, 17, 18, 19, 20, 21, 23, 24, 25, 26, 27, 29, 32, 33, 34, 36, 39, 41, 42, 44, 50, 54, 55. If I could find a way to represent the Union states set in a string format, I could use FINDSTRING on it. The first problem you'll run into though, is California aka 6. 6 will match 6 but it will also match 26 (Michigan), 36 (New York). We'd luck out in that scenario as they're both Union states. Idaho, 16, would end up being tagged as Union state even though they would not achieve statehood until 25 years after the Civil War. Same story with 46---South Dakota and 56---Wyoming, they were not around as states to cast their lot in the war. 60 too would be recognized as a Union state.

For my data set, I am restricting the FIPS State set to only the 50 states currently recognized. The standard itself recognizes many more entities than just the 50 though so 60 is assigned to American Samoa. 60 should be kicked out to the Unknown bucket based on our input set. Clearly we need to do something more than search for our term, we need to give our search the concept of a word boundary. Using a regular expression, it'd be as easy as using \b. Here, we'd need to introduce our own artificial word boundaries. Chose a value that should never appear in your source data. I chose to delimit everything with a colon. Thus, the above set would be represented as :6:9:10:17:18:19:20:21:23:24:25:26:27:29:32:33:34:36:39:41:42:44:50:54:55: I would then need to modify my input value by concatenating a leading and trailing colon. (FINDSTRING(@[User::UnionStates],":" + (DT_WSTR,2)StateCode + ":",1)) > 0

I chose to create derived columns before the conditional split but the logic could just as easily been performed within the Conditional split itself.

(FINDSTRING(@[User::ConfederatesStates],":" + (DT_WSTR,2)StateCode + ":",1)) > 0
(FINDSTRING(@[User::UnionStates],":" + (DT_WSTR,2)StateCode + ":",1)) > 0
(FINDSTRING(@[User::UndeclaredStates],":" + (DT_WSTR,2)StateCode + ":",1)) > 0
Conditional split

Complex objects

Conditional split works good enough, I suppose but it smells. Maintaining that logic is not going to be pleasant and there's no good way to know if your delimiter character suddenly becomes part of your source data. If it's purely numeric, then yes, you can figure that out but if the task is to find strings in a set of strings, that can get tricky. What would be nice is to use a something cleaner like set based logic for finding membership.

To make this work, we need to use the script transformation but possibly in a way you've never used it.
List object asynchronous script task

This seems simple enough, remarkably like our conditional split approach. The major difference though is this script task is asynchronous. This means the data in our buffers are going to be copied (expensive) from the input buffer to the corresponding output buffer. I also needed to create the various output buffers and define the shape of the data. That's time consuming and nothing I'd care to maintain.

The code is straight forward. Create 3 List variables for the class, instantiate them in the PreExecute and then use the value of the current row in the Contains method of the class objects. Once we find a match, create a row on the appropriate buffer and fill it with data.

private List<int> Confederates;
private List<int> Union;
private List<int> Undeclared;

public override void PreExecute()
{
    base.PreExecute();
    this.Confederates = this.Variables.ConfederateList as List<int>;
    this.Union = this.Variables.UnionList as List<int>;
    this.Undeclared = this.Variables.UndeclaredList as List<int>;
}

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    if (this.Confederates.Contains(Row.StateCode))
    {
        ConfederatesBuffer.AddRow();
        ConfederatesBuffer.StateCode = Row.StateCode;
    }
    else if (this.Union.Contains(Row.StateCode))
    {
        UnionBuffer.AddRow();
        UnionBuffer.StateCode = Row.StateCode;
    }
    else if (this.Undeclared.Contains(Row.StateCode))
    {
        UndeclaredBuffer.AddRow();
        UndeclaredBuffer.StateCode = Row.StateCode;
    }
    else
    {
        UnknownBuffer.AddRow();
        UnknownBuffer.SateCode = Row.StateCode;
    }
}

As an afterthought, we could make this identical in execution by offloading the conditional split logic to the native component and simply used a script task in synchronous fashion, much like we did for the Conditional Split's Derived Column transformation "Assign values". This would actually be a better approach all around in terms of resource usage, processing speed, maintenance, etc.

Cached Connection Manager

I must confess, until recently, I never thought about CCM. I cut my teeth on SQL Server 2005's SSIS so some of the "new" features aren't as ingrained in my problem solving toolbox like they should be. CCM is one I've definitely overlooked but am coming around to appreciating it's utility. This approach will have us load the available values into a cached connection manager which we will then leverage in the subsequent data flow to use the native Lookup task to determine values.

This post runs long as it is so I won't show the whole code but there are 250 lines of this code enumerating all the states and their affiliation. When executed, that data is sent to a cache connection manager for later consumption.

Output0Buffer.AddRow();
Output0Buffer.StateName = "Alabama";
Output0Buffer.StateAbbreviation = "AL";
Output0Buffer.StateCode = 01;
Output0Buffer.SideDeclared = "C";

Output0Buffer.AddRow();
Output0Buffer.StateName = "Alaska";
Output0Buffer.StateAbbreviation = "AK";
Output0Buffer.StateCode = 02;
Output0Buffer.SideDeclared_IsNull = true;

I use the Lookup task to retrieve the SideDeclared value. I "redirect rows to no match output" to my Unknown bucket. The found rows are sorted into their appropriate bucket with a conditional split.
Cached Connection Manager Lookup

Again, for consistency I could have also sent the unknown rows down to the Conditional Split (Ignore failure option) and done all the redirects there. My reason for not doing so here is laziness. I saw no need to write the conditional split logic when the native component already offered it.

Performance

I scaled up to 60 million rows and used the modulus operator on the (%67) on the source script for sending data to ensure we were sending consistent batches to all the options. I added row counts to all the output so the optimizer didn't factor branches out and ran it five times from the command-line (all in 32 bit mode). As a last minute test, I'm rerunning the processing in 64 bit mode.
Conditional Split list
This is the approach described above abusing FINDSTRING.
Smarter Script
This is the synchronous version of my script working against the List objects
Script task
This is the asynchronous version of my script working against the List objects
Use lookups
This uses the Cache Connection Manager
SourceRowsAverage task duration (ms)σ task durationAverage throughput (rows / ms)σ througputMode
Conditional Split list60,000,000337,80097688.0819249.7932bit
Conditional Split list60,000,000254,5004500235.834.1764bit
Smarter Script60,000,000458,60011,038.12130.913.1432bit
Smarter Script60,000,000429,50030,500140.419.9764bit
Script task60,000,000471,80022,301.57127.465.9932bit
Script task60,000,000466,50019,500128.845.3964bit
Use lookups60,000,000485,80014246.4123.613.6632bit
Use lookups60,000,000457,50011500131.233.364bit

Conclusion

I am shocked on the performance results. For this problem domain, the FINDSTRING approach was the most efficient, by a considerable margin. The other three approaches consistently averaged a throughput of within 7 rows per millisecond of each other. I did find it interesting that the standard deviation of the FINDSTRING approach fluctuated so much. While this box is older and slower, there was not a considerable amount of activity going on during the package executions.

64 vs 32 bit

With a sample size of 2, all the 64 bit tasks saw an increase in throughput over their 32 bit counterparts. FINDSTRING method showed the most improvement, it's now nearly 100 rows/ms faster than the next component. Interestingly enough, the asynchronous script task saw a middling 1 row/ms increase compared to the nearly 10 row/ms increase on the other two tasks. I'll collect more data and see if the pattern holds true.

My package is available on my google site for your own evaluations.

Tuesday, November 22, 2011

SSIS Package Extract from MSDB

A question came up on StackOverflow of how someone can retrieve their packages from the MSDB. In this case, their local development went south and they only had the packages in production. Let this be an object lesson in using version control.

To extract packages from the MSDB, you must first identify where in the msdb they exist. For that, you can query sysssispackagefolders and sysssispackages or you can just use my query SSIS Package Query

Armed with that query, the column of interest is the PackagePath column. Couple that with dtutil and you have an extract-o-matic for package recovery.

The base form of an extract from MSDB on localhost to the current folder in the file system would look like.

dtutil /sourceserver localhost /SQL "Package" /copy file;.\Package.dtsx

Extract-o-matic

Run this query in Text mode (ctrl-T) This query generates a series of dtutil calls which in turn extracts SSIS packages from a server.

;
WITH FOLDERS AS
(
    -- Capture root node
    SELECT
        cast(PF.foldername AS varchar(max)) AS FolderPath
    ,   PF.folderid
    ,   PF.parentfolderid
    ,   PF.foldername
    FROM
        msdb.dbo.sysssispackagefolders PF
    WHERE
        PF.parentfolderid IS NULL

    -- build recursive hierarchy
    UNION ALL
    SELECT
        cast(F.FolderPath + '\' + PF.foldername AS varchar(max)) AS FolderPath
    ,   PF.folderid
    ,   PF.parentfolderid
    ,   PF.foldername
    FROM
        msdb.dbo.sysssispackagefolders PF
        INNER JOIN
            FOLDERS F
            ON F.folderid = PF.parentfolderid
)
,   PACKAGES AS
(
    -- pull information about stored SSIS packages
    SELECT
        P.name AS PackageName
    ,   P.id AS PackageId
    ,   P.description as PackageDescription
    ,   P.folderid
    ,   P.packageFormat
    ,   P.packageType
    ,   P.vermajor
    ,   P.verminor
    ,   P.verbuild
    ,   suser_sname(P.ownersid) AS ownername
    FROM
        msdb.dbo.sysssispackages P
)
SELECT 
    -- assumes default instance and localhost
    -- use serverproperty('servername') and serverproperty('instancename') 
    -- if you need to really make this generic
    'dtutil /sourceserver ' + @@SERVERNAME + ' /SQL "'+ F.FolderPath + '\' + P.PackageName + '" /copy file;".\' + P.PackageName +'.dtsx"' AS cmd
FROM 
    FOLDERS F
    INNER JOIN
        PACKAGES P
        ON P.folderid = F.folderid
-- uncomment this if you want to filter out the 
-- native Data Collector packages
-- WHERE
--     F.FolderPath <> '\Data Collector'

For me, that generated the following code.

dtutil /sourceserver localhost /SQL "\Package" /copy file;.\Package.dtsx
dtutil /sourceserver localhost /SQL "\Data Collector\PerfCountersCollect" /copy file;.\PerfCountersCollect.dtsx
dtutil /sourceserver localhost /SQL "\Data Collector\PerfCountersUpload" /copy file;.\PerfCountersUpload.dtsx
dtutil /sourceserver localhost /SQL "\Data Collector\QueryActivityCollect" /copy file;.\QueryActivityCollect.dtsx
dtutil /sourceserver localhost /SQL "\Data Collector\QueryActivityUpload" /copy file;.\QueryActivityUpload.dtsx
dtutil /sourceserver localhost /SQL "\Data Collector\SqlTraceCollect" /copy file;.\SqlTraceCollect.dtsx
dtutil /sourceserver localhost /SQL "\Data Collector\SqlTraceUpload" /copy file;.\SqlTraceUpload.dtsx
dtutil /sourceserver localhost /SQL "\Data Collector\TSQLQueryCollect" /copy file;.\TSQLQueryCollect.dtsx
dtutil /sourceserver localhost /SQL "\Data Collector\TSQLQueryUpload" /copy file;.\TSQLQueryUpload.dtsx
Hooray, my bacon is saved and now I can put my packages into version control like a good developer.

Friday, November 11, 2011

PowerShell manipulation of SSIS packages

One of the developers has been hand-coding some very repetitive SSIS packages. They perform a full replication of data out of an Informix database into our SQL Server. For the final database, he had about 50 tables that needed to be processed. I convinced him that we can make his life and maintenance better by creating lots of little packages vs the behemoth package with umpteen data flows in it.

I built out a template package, PolSumAgentCountyDir.dtsx, that is an Execute SQL Task, wired to a Data Flow. The trick is the expressions. I have one variable, base_table that drives the operation of the package. There are variables that build the delete statement, source query and destination table. If I change the value of base_table from foo to bar, the OLE DB datasource in the package is going to get very angry because I've swapped out the meta data on it. Back in the DTS days, I could have passed a list to the process and it'd have been able to make it work but this is SSIS and that doesn't fly

The original approach was we were going to split the list of tables to replicate and meet in the middle. Each person would open the template, update the value of base_table, save the package with a name based on the table it's replicating (along with the SSIS's package's Name property) and then fix the meta data. Dull work, but doable.

If I were clever, know what I'd do? I'd spend an hour watching StarTrek FY Edition and writing some PowerShell. Calling the Function below with a fully qualified path to my template package, the table name and the English name of the table (ls_pol_sum => LivestockPolicySummary) results it instantiating an instance of the SSIS object model, loading the template package and programmatically manipulating it and saving it out with a new name. Dead simple but until you try it, or see it, you won't believe how so. I spent more time defining the hash table with table names to English than I did the code for manipulating objects. I'm sure some of that's due to my familiarity with the object model from my PowerShell SSIS deployment and maintenance script.

[Reflection.Assembly]::LoadWithPartialName("Microsoft.SQLServer.ManagedDTS") | out-null

# Utility script for making clones of a template SSIS package
# and modifying the variable (base_table) that drives operation
# of the package
# Created by Bill Fellows
# 2011-11-10

Function ModifyTemplate
{
    param
    (
        [string]$template,
        [string]$tableName,
        [string]$PackageSuffix
    )
    
    $events = $null
    $userName = $null
    $password = $null

    try
    {
        # open template package
        # save as package suffix
        # replace AgentCountyDir in template with suffix
        $newName = $template.Replace("AgentCountyDir", $PackageSuffix)
        $app = New-Object Microsoft.SqlServer.Dts.Runtime.Application
        $package = $app.LoadPackage($template, $null)

        # modify the table variable
        $package.Variables["base_table"].Value = $tableName
        $package.Name = $package.Name.Replace("AgentCountyDir", $PackageSuffix)
        $app.SaveToXml($newName, $package, $events)
    
    }
    catch
    {
        Write-Error ([string]::Format("Failed to do something. Table {0} on server {1}", $tableName, $PackageSuffix))
        Write-Error $_ | fl * -Force
        exit(0)
    }

}

Who loves PowerShell? This guy

Friday, November 4, 2011

Find all the tables without a clustered index

We were having performance issues at work and I got to take a quick look at it. Performing multiple table scans against a heap is not good. When that heap is 6+ million rows, not good becomes atrocious. Using a pair of queries, I put together the following.

Run this on any server and it'll spit back a report listing all the tables per database that have no clustered index and their associated row counts. Don't worry, the row count query is lightning fast as it isn't physically counting rows. Instead it's using a DMV so SQL Server 2005+ only please


DECLARE @database_name sysname  
DECLARE
    @REPORT TABLE
(
    server_name sysname
,   database_name sysname
,   schemaname sysname
,   table_name sysname
,   row_counts bigint
)
    

DECLARE Csr CURSOR FORWARD_ONLY STATIC FOR  
SELECT DB.[name] 
FROM master.dbo.sysdatabases DB
WHERE DB.[name] NOT IN ('tempdb'
,'master'
,'model'
,'msdb'
,'ReportServer'
,'ReportServerTempDB'
)

DECLARE @queryTemplate nvarchar(max) = N'
USE <DB_NAME/>;
WITH OF_INTEREST AS
(
    SELECT DISTINCT 
        @@servername AS server_name
        , DB_NAME(DB_ID()) AS database_name
        , schema_name(T.schema_id) AS schemaname
        , OBJECT_NAME(I.object_id) AS table_name
    FROM SYS.INDEXES I
    INNER JOIN
        sys.tables T
        ON T.object_id = I.object_id
    WHERE INDEX_ID = 0
    AND OBJECTPROPERTY(I.object_id,''IsUserTable'') = 1
)
, RC AS
(
    SELECT
        s.[Name] as [SchemaName]
    ,   t.[name] as [TableName]
    ,   SUM(p.rows) as [RowCounts]
    FROM 
        sys.schemas s
        LEFT JOIN 
            sys.tables t
            ON s.schema_id = t.schema_id
        LEFT JOIN 
            sys.partitions p
            ON t.object_id = p.object_id
        LEFT JOIN  
            sys.allocation_units a
            ON p.partition_id = a.container_id
    WHERE 
        p.index_id  in(0,1) -- 0 heap table , 1 table with clustered index
        AND p.rows is not null
        AND a.type = 1  -- row-data only , not LOB
    GROUP BY 
        s.[Name]
    ,   t.[name]
)
SELECT
OI.*
, RC.RowCounts
FROM
OF_INTEREST OI
INNER JOIN
    RC
    ON RC.schemaname = OI.schemaname
    AND RC.tablename = OI.table_name
    
'
, @query nvarchar(max)


OPEN Csr  
FETCH NEXT FROM Csr INTO @database_name  
WHILE @@FETCH_STATUS = 0  
BEGIN

    SELECT @query = replace(@queryTemplate, '<DB_NAME/>', @database_name)
    BEGIN TRY
        INSERT INTO
            @REPORT
        EXECUTE (@query)
    END TRY
    BEGIN CATCH
        SELECT
            ERROR_NUMBER()AS error_number --returns the number of the error.
        ,   ERROR_SEVERITY() AS error_severity --returns the severity.
        ,   ERROR_STATE()AS error_state  --returns the error state number.
        ,   ERROR_PROCEDURE() AS error_procedure --returns the name of the stored procedure or trigger where the error occurred.
        ,   ERROR_LINE() AS error_line --returns the line number inside the routine that caused the error.
        ,   ERROR_MESSAGE() AS error_message --returns the complete text of the error message. The text includes the values supplied for any substitutable parameters, such as lengths, object names, or times.
    END CATCH

    FETCH NEXT FROM Csr INTO @database_name  
END  

CLOSE Csr  
DEALLOCATE Csr
;

SELECT * FROM @REPORT R

Thanks to Pinal Dave for the query to identify heaps and James Beresford for the fast row count query.

Wednesday, November 2, 2011

Using Excel in an SSIS lookup transformation

There was a question on StackOverflow: Excel Source as Lookup Transformation Connection asking how one can use Excel as a source in a lookup. As I worked on the answer, I figured it'd be a good thing to post here so I'm copying and pasting my work. The important thing is that this approach isn't restricted to just the lookup task. If you are having trouble using Excel in SSIS as a normal data source because it thinks the data types have changed, access it via the OLE DB driver.

As you have correctly discerned, you cannot use the Excel Connection Manager in a Lookup task, it only accepts the OLE DB connection manager. The "normal" Excel connection isn't an OLE DB type. The trick then, is to use an OLE DB connection manager with an Excel spreadsheet.

For this example, I have a spreadsheet with state codes and their full name and my source data only has the abbreviations flowing through. This walk through will wire up a lookup task against Excel to retrieve those values.

Keep these caveats in mind: Lookups are case sensitive, regardless of whether the source database (or file in this case) is case sensitive. Excel strings are always going to be interpreted as unicode/nvarchar.

Source Data

Given a simple file sitting at C:\tmp\LookupReference.xlsx that looks like the following
Simple key value pairs in Excel

Connection Manager

We must first establish an OLE DB Connection Manager. Instead of the default "Native OLE DB\SQL Server Native Client 10.0" change that to "Native OLE DB\Microsoft Office 12.0 Access Database Engine OLE DB Provider". In the Server or file name, locate the source file. Clicking Test Connection at this point will result in an error.

OLE DB Connection Manager Connection Tab

Here comes the "magic." Click the All tab and scroll to the top. In Extended Properties, add the indicated string "Excel 12.0;HDR=YES; IMEX=1;" This tells the provider that we are going to use Excel 12.0 (.xlsx format) with a header row and the IMEX 1 tells the driver there will be intermixed data.

OLE DB Connection Manager All Tab

Your package sould now look something like this. A connection manager with extended properties set and assumes a preexisting data flow

Control Flow

Data Flow

To simplify matters, I have a script source that generates 3 rows of data with state codes MO, KS and NE and sends them down the pipeline. Your source will obviously be different but the concept will remain the same. Sample code provided in the annotation.

In your lookup transformation, you will need to write a query against the spreadsheet. It's similar to a normal database query except your table is going to be Sheet1$ unless you have a named range in which your table would be MyRange Note the $ is required when referencing a sheet. Using the sample spreadsheet above, my query would be

SELECT
    S.StateCode
,   S.StateName
FROM 
    `Sheet1$` S

I map my Script task column StateCode to the reference query's StateCode column and check the StateName field as I want to add that to my data flow.

Data Flow

I've put a Data viewer after the Lookup task to verify my lookups worked

Data Viewer

Everything works and we're all happy.

Old Excel

If you are using a .xls file, you need to make the following changes. - In your Connection Manager, instead of the Office 12 provider, select the "Native OLE DB\Microsoft Jet 4.0 OLE DB Provider" - The Extended Properties become "EXCEL 8.0;HDR=Yes; IMEX=1;"

Notes for 2008+

A tip of the hat to Matt Masson (Blog|Twitter) for reminding me of the Cached Connection Manager. In particular, he points out that "you get weird behavior if you have multiple lookups against the same file." So, if you are on 2008+, look at using the CCM.

Cached Connection Manager Sequence Container

Before you need the cached data, load it into the cache like so

Cached Connection Manager Data Flow

The lookup basically remains the same. In the General tab, change the source to the cache connection manager and in the Connection tab, change to the CCM.