World of Whatever

A blog about SQL Server, SSIS, C# and whatever else I happen to dealing with in my professional life.

Find ramblings

Loading...

Monday, January 30, 2012

EzAPI Overview

I started looking at EzAPI for programmatically creating SSIS packages and I think it's an efficient product to work with. I have built SSIS packages purely through code before and up to a point, it's fun. After that, swimming in the ocean with a bad case of road rash is to be preferred.

Prerequisites

Pre-reqs are simple, you need to have something that can compile code and the EzAPI assembly.
  • Visual Studio 2008/Visual Studio 2010/SharpDevelop
  • SQL Server 2008 SSIS libraries (C:\Program Files (x86)\Microsoft SQL Server\100\SDK\Assemblies). This might be optional, I have not tried, but it would depend on your usage.
  • EzAPI library. Start with the installer but you'll most likely want to step up to the source code if the base functionality doesn't scratch your itch

Getting Started with EzAPI

Create a new project, I went with Console application as it's my standby. Anything that "runs" will work.
From EzAPI
Inside that project, add a reference to the following DLLs from the .NET tab.
From EzAPI
  • Microsoft.SQLServer.ManagedDTS
  • Microsoft.SQLServer.DTSRuntimeWrap
  • EzAPI
  • Where are my references?

    If you don't see those DLL's listed, look at the message that says this list is filtered. No problem, just change the target framework from 4.0 Client profile to the normal one.
    From EzAPI
    Right-click on the project and select Properties. Change the Target framework from ".NET Framework 4 Client Profile" to ".NET Framework 4".
    No, really, where are they?
    As you may be able to tell, I started writing this post using VS2010 as I'm trying to make it my default VS instance now that Denali/Engine of the Devil/SQL 2012 is coming. The only problem is, with 2010 they've changed the way Visual Studio handles references. I gave up trying to make heads or tails out of it and just went and added the Assemblies via the Browse tab. It's not idea but I no longer care, I just want it resolved. I went in and manually added
    • c:\Program Files (x86)\Microsoft SQL Server\100\SDK\Assemblies\Microsoft.SQLServer.ManagedDTS.dll
    • c:\Program Files (x86)\Microsoft SQL Server\100\SDK\Assemblies\Microsoft.SQLServer.DTSRuntimeWrap.dll
    • C:\Program Files (x86)\Microsoft SQL Server\100\Tools\Samples\EzAPI\EzAPI.dll
    If you are not a 64 bit OS, then omit the " (x86)" path in this series. These two articles cover a bit of what they were doing with VS2010 and references but that will be my adventure for another evening.
  • http://blogs.msdn.com/b/jason_howell/archive/2010/08/18/visual-studio-2010-solution-build-process-give-a-warning-about-indirect-dependency-on-the-net-framework-assembly-due-to-ssis-references.aspx
  • http://msdn.microsoft.com/en-us/library/wkze6zky.aspx
  • On with the show

    Code

    I'm using C# here but any .NET language will suffice. For your first taste of the EzAPI, we'll start dirt simple and create a package that has nothing in it. Yes, be still your beating heart. We'll assign two properties as the package level and save to a file. The only thing that is required is to add the EzAPI namespace to our project (Microsoft.SqlServer.SSIS.EzAPI) and then start referencing a bevy of Ez* named classes.

    On line 36 we declare a variable of type EzPackage and instantiate it on line 38. Lines 41 and 42 we assign values to the properties and on 43 we save the object to a file on disk. Once the code has executed, add the resulting package into a .dtproj and you have created your first package without using BIDS/SSDT.

       1:  //-----------------------------------------------------------------------
       2:  // <copyright file="Driver.cs" company="billfellows.net">
       3:  //     I mention copyright so StyleCop is happy.
       4:  // </copyright>
       5:  //-----------------------------------------------------------------------
       6:  namespace EzAPIRecipies
       7:  {
       8:      using System;
       9:      using System.Collections.Generic;
      10:      using System.Linq;
      11:      using System.Text;
      12:      using Microsoft.SqlServer.Dts;
      13:      using Microsoft.SqlServer.SSIS.EzAPI;
      14:      
      15:      /// <summary>
      16:      /// Driver class to demonstrate using EzAPI to build SSIS packages
      17:      /// </summary>
      18:      public class Driver
      19:      {
      20:          /// <summary>
      21:          /// This is the GO button.
      22:          /// </summary>
      23:          /// <param name="args">Command line arguments</param>
      24:          public static void Main(string[] args)
      25:          {
      26:              MakeSimpleEzAPIPackage();
      27:          }
      28:   
      29:          /// <summary>
      30:          /// Build the most basic of SSIS packages. Assign a name and description to the package and
      31:          /// nothing else. 
      32:          /// </summary>
      33:          public static void MakeSimpleEzAPIPackage()
      34:          {
      35:              string outputFile = string.Empty;
      36:              EzPackage ezPackage = null;
      37:              outputFile = @"C:\sandbox\SSISHackAndSlash2008\SSISHackAndSlash2008\SimpleEzAPIPackage.dtsx";
      38:              ezPackage = new EzPackage();
      39:   
      40:              // Assigning a Description leads to a value of <EzName></EzName>Content here
      41:              ezPackage.Description = "I was built using EzAPI";
      42:              ezPackage.Name = "SimpleEzAPIPackage";
      43:              ezPackage.SaveToFile(outputFile);
      44:          }
      45:      }
      46:  }
    One thing you can observe is the values written to those properties. Name has SimpleEzAPIPackage but look at the Description, that's not what we assigned. There are quirks in the library and this is one of them.
    From EzAPI

    As a contrast, the following code represents building the same package using the shipped assemblies. There's little difference between the two sets of code, beyond of course the second correctly assigns the Description...

            /// <summary>
            /// This is the classic approach to building a package
            /// </summary>
            public static void MakeSimplePackage()
            {
                string outputFile = string.Empty;
                Package package = null;
                Application app = null;
    
                outputFile = @"C:\sandbox\SSISHackAndSlash2008\SSISHackAndSlash2008\SimplePackage.dtsx";
                package = new Package();
                app = new Application();
    
                package.Description = "I was built using the straight .NET api";
                package.Name = "SimplePackage";
                app.SaveToXml(outputFile, package, null);
            }
    Finally, you could have simply right-clicked on the project, selected Add new package, clicked Properties and assigned these values but everyone knows that route.

    Convinced of the cost savings yet? Of course not, tune in to the rest of the series to see cost benefit of using EzAPI.

    Tuesday, November 29, 2011

    The worst query I ever wrote

    I work in an unusual environment---not everything is documented. I know, that sounds too unlikely to be true but it is. There was data in the database that needed to be reported on but not through the supplied UI. In this case, the specifics was the data was stored in Tfs_DefaultCollection and somewhat reportable through Visual Studio Team Explorer but the users wanted queries and reports that weren't supported through the API.

    Enter my big mouth saying it's a database I can find a way to make it work

    After randomly picking tables and scrolling through values, I needed a smarter approach. Smartness for me meant using the system meta data to find any column that could hold my data (right data type and minimum length). I then build and execute a query looking for an example of the search key in that column. Below is an example of a search for the word change. What I like about this query, is it ought to be fairly portable/database agnostic as I'm using the INFORMATION_SCHEMA views to identify the candidate schemas, tables and columns.

    
    SET NOCOUNT ON;
    
    DECLARE
        -- actual query 
        @query nvarchar(max)
        -- templated query
    ,   @template nvarchar(max)
        -- string value we wish to find
        -- Length of 50 is arbitrary
    ,   @searchKey varchar(50);
    
    SET
        @searchKey = 'change';
    
    -- Return column & schema/table combo anywhere
    -- there is data in the column that starts like the
    -- search key.
    SELECT
        @template = N'
        SELECT TOP 1 
            ''<COLUMN_NAME/>'' AS cname
        , ''<SCHEMA/>.<TABLE_NAME/>'' AS tname 
        FROM 
            <SCHEMA/>.<TABLE_NAME/> T 
        WHERE 
            T.<COLUMN_NAME/> LIKE ''<TARGET/>%''';
    
    SELECT
        @template = REPLACE(@template, '<TARGET/>', @searchKey);
    
    DECLARE 
        CSR CURSOR
    FOR
    -- Iterate through all the columns that are
    -- character data types and are at least
    -- as long as the search key
    SELECT
        ISC.TABLE_SCHEMA
    ,   ISC.TABLE_NAME
    ,   ISC.COLUMN_NAME
    FROM
        INFORMATION_SCHEMA.COLUMNS ISC
    WHERE
        -- filter out tables/views I know I can skip
        ISC.TABLE_NAME not like 'x%'
        AND
        (
            ISC.DATA_TYPE IN ('char', 'nchar', 'varchar', 'nvarchar')
            AND ISC.CHARACTER_MAXIMUM_LENGTH >= LEN(@searchKey)
        );
    
    -- Cursor variables for capturing candidate schemas, tables and columns
    DECLARE
        @table_schema sysname
    ,   @table_name sysname
    ,   @column_name sysname;
    
    DECLARE
        @RESULTS TABLE
    (
        column_name sysname
    ,   table_schema nvarchar(500) NOT NULL
    );
    
    OPEN
        CSR;
    
    FETCH NEXT
    FROM
        CSR
    INTO
        @table_schema
    ,   @table_name
    ,   @column_name;
    
    WHILE (@@FETCH_STATUS = 0)
    BEGIN
        -- stub in actual names, make 'em safe via quotename function
        SET @query = REPLACE(@template, '<SCHEMA/>', quotename(@table_schema));
        SET @query = REPLACE(@query, '<TABLE_NAME/>', quotename(@table_name));
        SET @query = REPLACE(@query, '<COLUMN_NAME/>', quotename(@column_name));
    
        BEGIN TRY
            --PRINT @query
            
            -- Dump results into a table variable
            INSERT INTO
                @RESULTS
            EXECUTE(@query);
            
        END TRY
        BEGIN CATCH
            -- print failing query
            PRINT @query;
            
        END CATCH
        FETCH NEXT
        FROM
            CSR
        INTO
            @table_schema
        ,   @table_name
        ,   @column_name;
    END
    CLOSE CSR;
    DEALLOCATE CSR;
    
    -- Show all the columns and fully qualified tables
    -- that contained the value
    SELECT 
        R.column_name
    ,   R.table_schema
    ,   'SELECT T.* FROM ' + R.table_schema + ' T WHERE T.' + R.column_name + ' LIKE  ''' + @searchKey + '%''' AS explore_query
    FROM 
        @results R;
    

    I can run the above against Tfs_DefaultCollection.

    column_name table_schema explore_query
    [ChildItem] [dbo].[tbl_Version] SELECT T.* FROM [dbo].[tbl_Version] T WHERE T.[ChildItem] LIKE  'change%'
    [ProjectName] [dbo].[tbl_Project] SELECT T.* FROM [dbo].[tbl_Project] T WHERE T.[ProjectName] LIKE  'change%'
    [ChildItem] [dbo].[tbl_VersionedItem] SELECT T.* FROM [dbo].[tbl_VersionedItem] T WHERE T.[ChildItem] LIKE  'change%'
    
    And easily switch over to Tfs_Warehouse to try and track those values over there
    column_name table_schema explore_query
    [ChangesetTitle] [dbo].[DimChangeset] SELECT T.* FROM [dbo].[DimChangeset] T WHERE T.[ChangesetTitle] LIKE  'change%'
    [FileName] [dbo].[DimFile] SELECT T.* FROM [dbo].[DimFile] T WHERE T.[FileName] LIKE  'change%'
    [ProjectNodeName] [dbo].[DimTeamProject] SELECT T.* FROM [dbo].[DimTeamProject] T WHERE T.[ProjectNodeName] LIKE  'change%'
    

    Warnings, caveats, etc

    While I'd think it's blindingly obvious, never run code from the interblag. Beyond that, this code is not going to be fast by any stretch of the imagination. It could lock a table for the read. It might bog down your disk infrastructure, especially if you're querying against a very large table. And it's going to do it over and over again so buyer beware.

    Monday, November 28, 2011

    Filter list in SSIS

    I saw the question asked, how can I split/filter a data flow based on an array-like data source.. I thought that sounded like a decent question so I investigated some alternate approaches.

    Problem definition

    Given an iterable/sequence, what are the options for using that reference data as a filter in SSIS? To test this, I created a reference list of US States and their FIPS code as well as an indicator of their allegiance in the Civil War. There's no significance to the data beyond it was a publicly available dataset. I threw against that reference set 60 numbers (monotonically increasing values) to how it could be accomplished. As there are only 50 states and not all were in existence during the Civil War, I expect my data to be split into four buckets: Union, Confederate, Undeclared (valid FIPS code but not in existence), Unknown (invalid code).

    Package Setup

    I created 3 sets of variables (Confederate, Union and Undeclared) in 2 flavors (String and Object).
    variables

    I have a script that runs that populates them.

    public void Main()
    {
        // User::ConfederateList,User::ConfederatesStates,User::UndeclaredList,User::UndeclaredStates,User::UnionList,User::UnionStates
        List<int> Confederates = new List<int> { 01, 05, 12, 13, 22, 28, 37, 45, 47, 48, 51 };
        List<int> Union = new List<int> { 06, 09, 10, 17, 18, 19, 20, 21, 23, 24, 25, 26, 27, 29, 32, 33, 34, 36, 39, 41, 42, 44, 50, 54, 55 };
        List<int> Undeclared = new List<int> { 02, 04, 08, 15, 16, 30, 31, 35, 38, 40, 46, 49, 53, 56 };
    
        this.Dts.Variables["ConfederateList"].Value = Confederates;
        this.Dts.Variables["ConfederatesStates"].Value = string.Format(":{0}:", string.Join(":", Confederates.ConvertAll<string>(delegate(int i) { return i.ToString(); }).ToArray()));
    
        this.Dts.Variables["UnionList"].Value = Union;
        this.Dts.Variables["UnionStates"].Value = string.Format(":{0}:", string.Join(":", Union.ConvertAll<string>(delegate(int i) { return i.ToString(); }).ToArray()));
    
        this.Dts.Variables["UndeclaredList"].Value = Undeclared;
        this.Dts.Variables["UndeclaredStates"].Value = string.Format(":{0}:", string.Join(":", Undeclared.ConvertAll<string>(delegate(int i) { return i.ToString(); }).ToArray()));
    
        Dts.TaskResult = (int)ScriptResults.Success;
     }
    

    Once this script has completed, these variables will look like
    locals window

    I also use this script as the source in my Data Flow tasks.

    public override void CreateNewOutputRows()
    {
        for (int i = 0; i < 60; i++)
        {
            Output0Buffer.AddRow();
            Output0Buffer.StateCode = i;
        }
    }
    

    Conditional split

    The Conditional Split transformation and the expression language itself does not offer any lookup type operation so a complex object like an array or List is out of the question. However, I was able to elicit a lookup-like functionality by using the FINDSTRING backwards. Normally, I think of using it as FINDSTRING(MyColumn, "SomeValue", 1). That is, if SomeValue exists anywhere in MyColumn, the return value would be the one-based position where the value is found, zero for no match.

    However, if you think of it opposite, you could use the reference value as the first argument and the current value in the second. Let's put some values to put this thinking in perspective. Missouri's code is 29. The set of Union states is 6, 9, 10, 17, 18, 19, 20, 21, 23, 24, 25, 26, 27, 29, 32, 33, 34, 36, 39, 41, 42, 44, 50, 54, 55. If I could find a way to represent the Union states set in a string format, I could use FINDSTRING on it. The first problem you'll run into though, is California aka 6. 6 will match 6 but it will also match 26 (Michigan), 36 (New York). We'd luck out in that scenario as they're both Union states. Idaho, 16, would end up being tagged as Union state even though they would not achieve statehood until 25 years after the Civil War. Same story with 46---South Dakota and 56---Wyoming, they were not around as states to cast their lot in the war. 60 too would be recognized as a Union state.

    For my data set, I am restricting the FIPS State set to only the 50 states currently recognized. The standard itself recognizes many more entities than just the 50 though so 60 is assigned to American Samoa. 60 should be kicked out to the Unknown bucket based on our input set. Clearly we need to do something more than search for our term, we need to give our search the concept of a word boundary. Using a regular expression, it'd be as easy as using \b. Here, we'd need to introduce our own artificial word boundaries. Chose a value that should never appear in your source data. I chose to delimit everything with a colon. Thus, the above set would be represented as :6:9:10:17:18:19:20:21:23:24:25:26:27:29:32:33:34:36:39:41:42:44:50:54:55: I would then need to modify my input value by concatenating a leading and trailing colon. (FINDSTRING(@[User::UnionStates],":" + (DT_WSTR,2)StateCode + ":",1)) > 0

    I chose to create derived columns before the conditional split but the logic could just as easily been performed within the Conditional split itself.

    (FINDSTRING(@[User::ConfederatesStates],":" + (DT_WSTR,2)StateCode + ":",1)) > 0
    (FINDSTRING(@[User::UnionStates],":" + (DT_WSTR,2)StateCode + ":",1)) > 0
    (FINDSTRING(@[User::UndeclaredStates],":" + (DT_WSTR,2)StateCode + ":",1)) > 0
    Conditional split

    Complex objects

    Conditional split works good enough, I suppose but it smells. Maintaining that logic is not going to be pleasant and there's no good way to know if your delimiter character suddenly becomes part of your source data. If it's purely numeric, then yes, you can figure that out but if the task is to find strings in a set of strings, that can get tricky. What would be nice is to use a something cleaner like set based logic for finding membership.

    To make this work, we need to use the script transformation but possibly in a way you've never used it.
    List object asynchronous script task

    This seems simple enough, remarkably like our conditional split approach. The major difference though is this script task is asynchronous. This means the data in our buffers are going to be copied (expensive) from the input buffer to the corresponding output buffer. I also needed to create the various output buffers and define the shape of the data. That's time consuming and nothing I'd care to maintain.

    The code is straight forward. Create 3 List variables for the class, instantiate them in the PreExecute and then use the value of the current row in the Contains method of the class objects. Once we find a match, create a row on the appropriate buffer and fill it with data.

    private List<int> Confederates;
    private List<int> Union;
    private List<int> Undeclared;
    
    public override void PreExecute()
    {
        base.PreExecute();
        this.Confederates = this.Variables.ConfederateList as List<int>;
        this.Union = this.Variables.UnionList as List<int>;
        this.Undeclared = this.Variables.UndeclaredList as List<int>;
    }
    
    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        if (this.Confederates.Contains(Row.StateCode))
        {
            ConfederatesBuffer.AddRow();
            ConfederatesBuffer.StateCode = Row.StateCode;
        }
        else if (this.Union.Contains(Row.StateCode))
        {
            UnionBuffer.AddRow();
            UnionBuffer.StateCode = Row.StateCode;
        }
        else if (this.Undeclared.Contains(Row.StateCode))
        {
            UndeclaredBuffer.AddRow();
            UndeclaredBuffer.StateCode = Row.StateCode;
        }
        else
        {
            UnknownBuffer.AddRow();
            UnknownBuffer.SateCode = Row.StateCode;
        }
    }
    

    As an afterthought, we could make this identical in execution by offloading the conditional split logic to the native component and simply used a script task in synchronous fashion, much like we did for the Conditional Split's Derived Column transformation "Assign values". This would actually be a better approach all around in terms of resource usage, processing speed, maintenance, etc.

    Cached Connection Manager

    I must confess, until recently, I never thought about CCM. I cut my teeth on SQL Server 2005's SSIS so some of the "new" features aren't as ingrained in my problem solving toolbox like they should be. CCM is one I've definitely overlooked but am coming around to appreciating it's utility. This approach will have us load the available values into a cached connection manager which we will then leverage in the subsequent data flow to use the native Lookup task to determine values.

    This post runs long as it is so I won't show the whole code but there are 250 lines of this code enumerating all the states and their affiliation. When executed, that data is sent to a cache connection manager for later consumption.

    Output0Buffer.AddRow();
    Output0Buffer.StateName = "Alabama";
    Output0Buffer.StateAbbreviation = "AL";
    Output0Buffer.StateCode = 01;
    Output0Buffer.SideDeclared = "C";
    
    Output0Buffer.AddRow();
    Output0Buffer.StateName = "Alaska";
    Output0Buffer.StateAbbreviation = "AK";
    Output0Buffer.StateCode = 02;
    Output0Buffer.SideDeclared_IsNull = true;
    

    I use the Lookup task to retrieve the SideDeclared value. I "redirect rows to no match output" to my Unknown bucket. The found rows are sorted into their appropriate bucket with a conditional split.
    Cached Connection Manager Lookup

    Again, for consistency I could have also sent the unknown rows down to the Conditional Split (Ignore failure option) and done all the redirects there. My reason for not doing so here is laziness. I saw no need to write the conditional split logic when the native component already offered it.

    Performance

    I scaled up to 60 million rows and used the modulus operator on the (%67) on the source script for sending data to ensure we were sending consistent batches to all the options. I added row counts to all the output so the optimizer didn't factor branches out and ran it five times from the command-line (all in 32 bit mode). As a last minute test, I'm rerunning the processing in 64 bit mode.
    Conditional Split list
    This is the approach described above abusing FINDSTRING.
    Smarter Script
    This is the synchronous version of my script working against the List objects
    Script task
    This is the asynchronous version of my script working against the List objects
    Use lookups
    This uses the Cache Connection Manager
    SourceRowsAverage task duration (ms)σ task durationAverage throughput (rows / ms)σ througputMode
    Conditional Split list60,000,000337,80097688.0819249.7932bit
    Conditional Split list60,000,000254,5004500235.834.1764bit
    Smarter Script60,000,000458,60011,038.12130.913.1432bit
    Smarter Script60,000,000429,50030,500140.419.9764bit
    Script task60,000,000471,80022,301.57127.465.9932bit
    Script task60,000,000466,50019,500128.845.3964bit
    Use lookups60,000,000485,80014246.4123.613.6632bit
    Use lookups60,000,000457,50011500131.233.364bit

    Conclusion

    I am shocked on the performance results. For this problem domain, the FINDSTRING approach was the most efficient, by a considerable margin. The other three approaches consistently averaged a throughput of within 7 rows per millisecond of each other. I did find it interesting that the standard deviation of the FINDSTRING approach fluctuated so much. While this box is older and slower, there was not a considerable amount of activity going on during the package executions.

    64 vs 32 bit

    With a sample size of 2, all the 64 bit tasks saw an increase in throughput over their 32 bit counterparts. FINDSTRING method showed the most improvement, it's now nearly 100 rows/ms faster than the next component. Interestingly enough, the asynchronous script task saw a middling 1 row/ms increase compared to the nearly 10 row/ms increase on the other two tasks. I'll collect more data and see if the pattern holds true.

    My package is available on my google site for your own evaluations.

    Tuesday, November 22, 2011

    SSIS Package Extract from MSDB

    A question came up on StackOverflow of how someone can retrieve their packages from the MSDB. In this case, their local development went south and they only had the packages in production. Let this be an object lesson in using version control.

    To extract packages from the MSDB, you must first identify where in the msdb they exist. For that, you can query sysssispackagefolders and sysssispackages or you can just use my query SSIS Package Query

    Armed with that query, the column of interest is the PackagePath column. Couple that with dtutil and you have an extract-o-matic for package recovery.

    The base form of an extract from MSDB on localhost to the current folder in the file system would look like.

    dtutil /sourceserver localhost /SQL "Package" /copy file;.\Package.dtsx

    Extract-o-matic

    Run this query in Text mode (ctrl-T) This query generates a series of dtutil calls which in turn extracts SSIS packages from a server.

    ;
    WITH FOLDERS AS
    (
        -- Capture root node
        SELECT
            cast(PF.foldername AS varchar(max)) AS FolderPath
        ,   PF.folderid
        ,   PF.parentfolderid
        ,   PF.foldername
        FROM
            msdb.dbo.sysssispackagefolders PF
        WHERE
            PF.parentfolderid IS NULL
    
        -- build recursive hierarchy
        UNION ALL
        SELECT
            cast(F.FolderPath + '\' + PF.foldername AS varchar(max)) AS FolderPath
        ,   PF.folderid
        ,   PF.parentfolderid
        ,   PF.foldername
        FROM
            msdb.dbo.sysssispackagefolders PF
            INNER JOIN
                FOLDERS F
                ON F.folderid = PF.parentfolderid
    )
    ,   PACKAGES AS
    (
        -- pull information about stored SSIS packages
        SELECT
            P.name AS PackageName
        ,   P.id AS PackageId
        ,   P.description as PackageDescription
        ,   P.folderid
        ,   P.packageFormat
        ,   P.packageType
        ,   P.vermajor
        ,   P.verminor
        ,   P.verbuild
        ,   suser_sname(P.ownersid) AS ownername
        FROM
            msdb.dbo.sysssispackages P
    )
    SELECT 
        -- assumes default instance and localhost
        -- use serverproperty('servername') and serverproperty('instancename') 
        -- if you need to really make this generic
        'dtutil /sourceserver localhost /SQL "'+ F.FolderPath + '\' + P.PackageName + '" /copy file;.\' + P.PackageName +'.dtsx' AS cmd
    FROM 
        FOLDERS F
        INNER JOIN
            PACKAGES P
            ON P.folderid = F.folderid
    -- uncomment this if you want to filter out the 
    -- native Data Collector packages
    -- WHERE
    --     F.FolderPath <> '\Data Collector'

    For me, that generated the following code.

    dtutil /sourceserver localhost /SQL "\Package" /copy file;.\Package.dtsx
    dtutil /sourceserver localhost /SQL "\Data Collector\PerfCountersCollect" /copy file;.\PerfCountersCollect.dtsx
    dtutil /sourceserver localhost /SQL "\Data Collector\PerfCountersUpload" /copy file;.\PerfCountersUpload.dtsx
    dtutil /sourceserver localhost /SQL "\Data Collector\QueryActivityCollect" /copy file;.\QueryActivityCollect.dtsx
    dtutil /sourceserver localhost /SQL "\Data Collector\QueryActivityUpload" /copy file;.\QueryActivityUpload.dtsx
    dtutil /sourceserver localhost /SQL "\Data Collector\SqlTraceCollect" /copy file;.\SqlTraceCollect.dtsx
    dtutil /sourceserver localhost /SQL "\Data Collector\SqlTraceUpload" /copy file;.\SqlTraceUpload.dtsx
    dtutil /sourceserver localhost /SQL "\Data Collector\TSQLQueryCollect" /copy file;.\TSQLQueryCollect.dtsx
    dtutil /sourceserver localhost /SQL "\Data Collector\TSQLQueryUpload" /copy file;.\TSQLQueryUpload.dtsx
    
    Hooray, my bacon is saved and now I can put my packages into version control like a good developer.

    Friday, November 11, 2011

    PowerShell manipulation of SSIS packages

    One of the developers has been hand-coding some very repetitive SSIS packages. They perform a full replication of data out of an Informix database into our SQL Server. For the final database, he had about 50 tables that needed to be processed. I convinced him that we can make his life and maintenance better by creating lots of little packages vs the behemoth package with umpteen data flows in it.

    I built out a template package, PolSumAgentCountyDir.dtsx, that is an Execute SQL Task, wired to a Data Flow. The trick is the expressions. I have one variable, base_table that drives the operation of the package. There are variables that build the delete statement, source query and destination table. If I change the value of base_table from foo to bar, the OLE DB datasource in the package is going to get very angry because I've swapped out the meta data on it. Back in the DTS days, I could have passed a list to the process and it'd have been able to make it work but this is SSIS and that doesn't fly

    The original approach was we were going to split the list of tables to replicate and meet in the middle. Each person would open the template, update the value of base_table, save the package with a name based on the table it's replicating (along with the SSIS's package's Name property) and then fix the meta data. Dull work, but doable.

    If I were clever, know what I'd do? I'd spend an hour watching StarTrek FY Edition and writing some PowerShell. Calling the Function below with a fully qualified path to my template package, the table name and the English name of the table (ls_pol_sum => LivestockPolicySummary) results it instantiating an instance of the SSIS object model, loading the template package and programmatically manipulating it and saving it out with a new name. Dead simple but until you try it, or see it, you won't believe how so. I spent more time defining the hash table with table names to English than I did the code for manipulating objects. I'm sure some of that's due to my familiarity with the object model from my PowerShell SSIS deployment and maintenance script.

    [Reflection.Assembly]::LoadWithPartialName("Microsoft.SQLServer.ManagedDTS") | out-null
    
    # Utility script for making clones of a template SSIS package
    # and modifying the variable (base_table) that drives operation
    # of the package
    # Created by Bill Fellows
    # 2011-11-10
    
    Function ModifyTemplate
    {
        param
        (
            [string]$template,
            [string]$tableName,
            [string]$PackageSuffix
        )
        
        $events = $null
        $userName = $null
        $password = $null
    
        try
        {
            # open template package
            # save as package suffix
            # replace AgentCountyDir in template with suffix
            $newName = $template.Replace("AgentCountyDir", $PackageSuffix)
            $app = New-Object Microsoft.SqlServer.Dts.Runtime.Application
            $package = $app.LoadPackage($template, $null)
    
            # modify the table variable
            $package.Variables["base_table"].Value = $tableName
            $package.Name = $package.Name.Replace("AgentCountyDir", $PackageSuffix)
            $app.SaveToXml($newName, $package, $events)
        
        }
        catch
        {
            Write-Error ([string]::Format("Failed to do something. Table {0} on server {1}", $tableName, $PackageSuffix))
            Write-Error $_ | fl * -Force
            exit(0)
        }
    
    }
    

    Who loves PowerShell? This guy

    Friday, November 4, 2011

    Find all the tables without a clustered index

    We were having performance issues at work and I got to take a quick look at it. Performing multiple table scans against a heap is not good. When that heap is 6+ million rows, not good becomes atrocious. Using a pair of queries, I put together the following.

    Run this on any server and it'll spit back a report listing all the tables per database that have no clustered index and their associated row counts. Don't worry, the row count query is lightning fast as it isn't physically counting rows. Instead it's using a DMV so SQL Server 2005+ only please

    
    DECLARE @database_name sysname  
    DECLARE
        @REPORT TABLE
    (
        server_name sysname
    ,   database_name sysname
    ,   schemaname sysname
    ,   table_name sysname
    ,   row_counts bigint
    )
        
    
    DECLARE Csr CURSOR FORWARD_ONLY STATIC FOR  
    SELECT DB.[name] 
    FROM master.dbo.sysdatabases DB
    WHERE DB.[name] NOT IN ('tempdb'
    ,'master'
    ,'model'
    ,'msdb'
    ,'ReportServer'
    ,'ReportServerTempDB'
    )
    
    DECLARE @queryTemplate nvarchar(max) = N'
    USE <DB_NAME/>;
    WITH OF_INTEREST AS
    (
        SELECT DISTINCT 
            @@servername AS server_name
            , DB_NAME(DB_ID()) AS database_name
            , schema_name(T.schema_id) AS schemaname
            , OBJECT_NAME(I.object_id) AS table_name
        FROM SYS.INDEXES I
        INNER JOIN
            sys.tables T
            ON T.object_id = I.object_id
        WHERE INDEX_ID = 0
        AND OBJECTPROPERTY(I.object_id,''IsUserTable'') = 1
    )
    , RC AS
    (
        SELECT
            s.[Name] as [SchemaName]
        ,   t.[name] as [TableName]
        ,   SUM(p.rows) as [RowCounts]
        FROM 
            sys.schemas s
            LEFT JOIN 
                sys.tables t
                ON s.schema_id = t.schema_id
            LEFT JOIN 
                sys.partitions p
                ON t.object_id = p.object_id
            LEFT JOIN  
                sys.allocation_units a
                ON p.partition_id = a.container_id
        WHERE 
            p.index_id  in(0,1) -- 0 heap table , 1 table with clustered index
            AND p.rows is not null
            AND a.type = 1  -- row-data only , not LOB
        GROUP BY 
            s.[Name]
        ,   t.[name]
    )
    SELECT
    OI.*
    , RC.RowCounts
    FROM
    OF_INTEREST OI
    INNER JOIN
        RC
        ON RC.schemaname = OI.schemaname
        AND RC.tablename = OI.table_name
        
    '
    , @query nvarchar(max)
    
    
    OPEN Csr  
    FETCH NEXT FROM Csr INTO @database_name  
    WHILE @@FETCH_STATUS = 0  
    BEGIN
    
        SELECT @query = replace(@queryTemplate, '<DB_NAME/>', @database_name)
        BEGIN TRY
            INSERT INTO
                @REPORT
            EXECUTE (@query)
        END TRY
        BEGIN CATCH
            SELECT
                ERROR_NUMBER()AS error_number --returns the number of the error.
            ,   ERROR_SEVERITY() AS error_severity --returns the severity.
            ,   ERROR_STATE()AS error_state  --returns the error state number.
            ,   ERROR_PROCEDURE() AS error_procedure --returns the name of the stored procedure or trigger where the error occurred.
            ,   ERROR_LINE() AS error_line --returns the line number inside the routine that caused the error.
            ,   ERROR_MESSAGE() AS error_message --returns the complete text of the error message. The text includes the values supplied for any substitutable parameters, such as lengths, object names, or times.
        END CATCH
    
        FETCH NEXT FROM Csr INTO @database_name  
    END  
    
    CLOSE Csr  
    DEALLOCATE Csr
    ;
    
    SELECT * FROM @REPORT R
    
    

    Thanks to Pinal Dave for the query to identify heaps and James Beresford for the fast row count query.

    Wednesday, November 2, 2011

    Using Excel in an SSIS lookup transformation

    There was a question on StackOverflow: Excel Source as Lookup Transformation Connection asking how one can use Excel as a source in a lookup. As I worked on the answer, I figured it'd be a good thing to post here so I'm copying and pasting my work. The important thing is that this approach isn't restricted to just the lookup task. If you are having trouble using Excel in SSIS as a normal data source because it thinks the data types have changed, access it via the OLE DB driver.

    As you have correctly discerned, you cannot use the Excel Connection Manager in a Lookup task, it only accepts the OLE DB connection manager. The "normal" Excel connection isn't an OLE DB type. The trick then, is to use an OLE DB connection manager with an Excel spreadsheet.

    For this example, I have a spreadsheet with state codes and their full name and my source data only has the abbreviations flowing through. This walk through will wire up a lookup task against Excel to retrieve those values.

    Keep these caveats in mind: Lookups are case sensitive, regardless of whether the source database (or file in this case) is case sensitive. Excel strings are always going to be interpreted as unicode/nvarchar.

    Source Data

    Given a simple file sitting at C:\tmp\LookupReference.xlsx that looks like the following
    Simple key value pairs in Excel

    Connection Manager

    We must first establish an OLE DB Connection Manager. Instead of the default "Native OLE DB\SQL Server Native Client 10.0" change that to "Native OLE DB\Microsoft Office 12.0 Access Database Engine OLE DB Provider". In the Server or file name, locate the source file. Clicking Test Connection at this point will result in an error.

    OLE DB Connection Manager Connection Tab

    Here comes the "magic." Click the All tab and scroll to the top. In Extended Properties, add the indicated string "Excel 12.0;HDR=YES; IMEX=1;" This tells the provider that we are going to use Excel 12.0 (.xlsx format) with a header row and the IMEX 1 tells the driver there will be intermixed data.

    OLE DB Connection Manager All Tab

    Your package sould now look something like this. A connection manager with extended properties set and assumes a preexisting data flow

    Control Flow

    Data Flow

    To simplify matters, I have a script source that generates 3 rows of data with state codes MO, KS and NE and sends them down the pipeline. Your source will obviously be different but the concept will remain the same. Sample code provided in the annotation.

    In your lookup transformation, you will need to write a query against the spreadsheet. It's similar to a normal database query except your table is going to be Sheet1$ unless you have a named range in which your table would be MyRange Note the $ is required when referencing a sheet. Using the sample spreadsheet above, my query would be

    SELECT
        S.StateCode
    ,   S.StateName
    FROM 
        `Sheet1$` S
    

    I map my Script task column StateCode to the reference query's StateCode column and check the StateName field as I want to add that to my data flow.

    Data Flow

    I've put a Data viewer after the Lookup task to verify my lookups worked

    Data Viewer

    Everything works and we're all happy.

    Old Excel

    If you are using a .xls file, you need to make the following changes. - In your Connection Manager, instead of the Office 12 provider, select the "Native OLE DB\Microsoft Jet 4.0 OLE DB Provider" - The Extended Properties become "EXCEL 8.0;HDR=Yes; IMEX=1;"

    Notes for 2008+

    A tip of the hat to Matt Masson (Blog|Twitter) for reminding me of the Cached Connection Manager. In particular, he points out that "you get weird behavior if you have multiple lookups against the same file." So, if you are on 2008+, look at using the CCM.

    Cached Connection Manager Sequence Container

    Before you need the cached data, load it into the cache like so

    Cached Connection Manager Data Flow

    The lookup basically remains the same. In the General tab, change the source to the cache connection manager and in the Connection tab, change to the CCM.

    Blog Archive