A blog about SQL Server, SSIS, C# and whatever else I happen to be dealing with in my professional life.

Find ramblings

Wednesday, December 22, 2021

Extracting queries from SSIS packages

Extracting queries from SSIS packages

We received an abomination of an SSIS package from a third party. It was a way to write a package that I don't think I would have suggested.

Our job was to rewrite this into something more manageable and it appears Azure Data Factory will be the winner. Before we can do that, we need to document what the existing package is doing (the vendor has supplied the incremental load logic) so we can replicate it but in a more economical form. It appears to have the pattern (squint really hard at the picture) Execute SQL Task -> Execute SQL Task -> Sequence container => many data flows -> Data Flow -> Execute SQL Task. The Data Flow Task is named after the table being loaded. An ODBC source with a expression based query, named "ODBC Source 1" wired to an OLE DB Destination, named "OLE DB Destination". How would you do it, especially given that there are 236 Data Flow Tasks embedded in a single container?

Biml it!

As with so many things SSIS-related, Biml is the answer. Install BimlExpress and reverse engineer that dtsx package into Biml. I'll add a blank BimlScript file that I called Inspector.biml

Let's look at a sample DataFlow task

            <Dataflow Name="ATableName">
              <Expressions>
                <Expression ExternalProperty="[ODBC Source 1].[SqlCommand]">"SELECT * FROM dbo.ATableName where modifiedutc > '" +(DT_WSTR, 30)@[User::LastModified] + "'  AND modifiedutc <= '" + (DT_WSTR, 30)@[User::MostRecent] + "'"</Expression>
              </Expressions>
              <Transformations>
                <OdbcSource Name="ODBC Source 1" Connection="Source2">
                  <DirectInput>SELECT * FROM dbo.ATableName where modifiedutc > '0'  AND modifiedutc <= '0'</DirectInput>
                </OdbcSource>
                <DataConversion Name="Data Conversion">
                  <Columns>
                    <Column SourceColumn="id" TargetColumn="Copy of id" DataType="AnsiString" Length="255" CodePage="1252" />
                  </Columns>
                </DataConversion>
                <OleDbCommand Name="Delete Old Rows from ATableName" ConnectionName="Destination2">
                  <Parameters>
                    <Parameter SourceColumn="ye_id" TargetColumn="Param_0" DataType="AnsiStringFixedLength" Length="255" CodePage="1252" />
                    <Parameter SourceColumn="modifiedutc" TargetColumn="Param_1" DataType="Int64" />
                  </Parameters>
                  <DirectInput>delete from dbo.ATableName where ye_id = ? and modifiedutc  < ?</DirectInput>
                </OleDbCommand>
                <ConditionalSplit Name="Conditional Split">
                  <OutputPaths>
                    <OutputPath Name="Case 1">
                      <Expression>ISNULL(ye_id)</Expression>
                    </OutputPath>
                  </OutputPaths>
                </ConditionalSplit>
                <OleDbDestination Name="OLE DB Destination" ConnectionName="Destination2">
                  <ExternalTableOutput Table=""dbob"."ATableName"" />
                </OleDbDestination>
              </Transformations>
            </Dataflow>

All I want to do is find all the Data Flow Tasks in the sequence containers. I need to generate a key value pair of TableName and the source query. I could dive into the Transformations layer and find the ODBC source and extract the DirectInput node from the OdbcSource and then parse the table name from the OleDbDestination's ExternalTableOutput but look, I can "cheat" here. Everything I need is at the outer Data Flow Task level. The Name of the DataFlow is my table name and since it's ODBC and the source component doesn't support a direct Expression on it, it's defined at the Data Flow Level. That makes this Biml easy.

  
<#

// A dictionary of TableName and the Expression
Dictionary<string, string> incremental = new Dictionary<string, string>();

foreach (AstPackageNode p in this.RootNode.Packages)
{
    // Loop through the Sequence Container
    foreach (var c in p.Tasks.OfType<AstContainerTaskNode>()/**/)
    {
        foreach (AstDataflowTaskNode t in c.Tasks.OfType<AstDataflowTaskNode>())
        {
            if (p.Name == "Postgres_to_MSSQL_Incremental")
            {
                incremental[t.Name] = t.Expressions[0].Expression;
            }
        }
    }
}

WriteLine("<!--");
foreach (KeyValuePair<string, string> k in incremental)
{
    // WriteLine("<!-- {0}: {1} -->", k.Key, k.Value);
    WriteLine("{0}|{1}", k.Key, k.Value.Replace("\n", " "));
}
WriteLine("-->");

#>
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
</Biml>
  

I define a dictionary that will hold the Table and the associated Expression. The first foreach loop specifies that I want to enumerate through all the packages that have been reverse engineered. If you're using BimlExpress, you'll need to shift click on all the source Biml files as well as the Inspector package.

The next foreach enumerator looks at all the elements in the Control Flow task and we're going to filter it to just things that are of type AstContainerTaskNode, aka a Container. That "OfType" filter syntax is very handy to focus on only the type of item you're looking for. Linq is so powerful, I love it.

The innermost foreach enumerator uses the OfType again to filter tasks to only those that are DataFlows, AstDataFlowTaskNode. The if statement ensures I'm only working on the Incremental package (they supplied an initial load as well). Finally, I add the Task's Name to the key of the dictionary and the value becomes the first Expression. Again, I can cheat here because the vendor package was very consistent, which is amazing for an SSIS package that's been hand crafted. That source package was 40.5 MB and had been saved 422 times according to the VersionBuild number. Kudos to them for quality control.

Once the looping is complete, all that is left is to emit the information so I can use it elsewhere. Thus the final foreach loop. I'm working in BimlStudio so comments are emitted and I'm going to take advantage of that by simply writing the key/value pair with a Pipe delimiter and then copy/paste the output into a CSV. If you're working in BimlExpress, I'd just write directly to a file with a System.IO.Text.WriteAllLines (name approximate) but this was just a "quick and dirty get it done" task and corresponding blog post to show that Biml and metadata programming are still relevant.

Eagle eyed viewers will note that I am missing the single DataFlow task after the Container. My partner also notice it and so if you need to also look for any data flow tasks at the Package level, I added this loop after the Seqence Container loop.

    foreach (AstDataflowTaskNode t in p.Tasks.OfType<AstDataflowTaskNode>())
    {
        if (p.Name == "Postgres_to_MSSQL_Incremental")
        {
            incremental[t.Name] = t.Expressions[0].Expression;
        }
    }

Wrapup

I'm a slow/infrequent blogger and this post took me 50 minutes. I think after we were done scratching our heads at the source packages, it took less time to write the script and generate the list of tables and associated queries than this blog post took.

No comments: