A blog about SQL Server, SSIS, C# and whatever else I happen to be dealing with in my professional life.

Find ramblings

Monday, November 28, 2011

Filter list in SSIS

I saw the question asked, how can I split/filter a data flow based on an array-like data source.. I thought that sounded like a decent question so I investigated some alternate approaches.

Problem definition

Given an iterable/sequence, what are the options for using that reference data as a filter in SSIS? To test this, I created a reference list of US States and their FIPS code as well as an indicator of their allegiance in the Civil War. There's no significance to the data beyond it was a publicly available dataset. I threw against that reference set 60 numbers (monotonically increasing values) to how it could be accomplished. As there are only 50 states and not all were in existence during the Civil War, I expect my data to be split into four buckets: Union, Confederate, Undeclared (valid FIPS code but not in existence), Unknown (invalid code).

Package Setup

I created 3 sets of variables (Confederate, Union and Undeclared) in 2 flavors (String and Object).

I have a script that runs that populates them.

public void Main()
    // User::ConfederateList,User::ConfederatesStates,User::UndeclaredList,User::UndeclaredStates,User::UnionList,User::UnionStates
    List<int> Confederates = new List<int> { 01, 05, 12, 13, 22, 28, 37, 45, 47, 48, 51 };
    List<int> Union = new List<int> { 06, 09, 10, 17, 18, 19, 20, 21, 23, 24, 25, 26, 27, 29, 32, 33, 34, 36, 39, 41, 42, 44, 50, 54, 55 };
    List<int> Undeclared = new List<int> { 02, 04, 08, 15, 16, 30, 31, 35, 38, 40, 46, 49, 53, 56 };

    this.Dts.Variables["ConfederateList"].Value = Confederates;
    this.Dts.Variables["ConfederatesStates"].Value = string.Format(":{0}:", string.Join(":", Confederates.ConvertAll<string>(delegate(int i) { return i.ToString(); }).ToArray()));

    this.Dts.Variables["UnionList"].Value = Union;
    this.Dts.Variables["UnionStates"].Value = string.Format(":{0}:", string.Join(":", Union.ConvertAll<string>(delegate(int i) { return i.ToString(); }).ToArray()));

    this.Dts.Variables["UndeclaredList"].Value = Undeclared;
    this.Dts.Variables["UndeclaredStates"].Value = string.Format(":{0}:", string.Join(":", Undeclared.ConvertAll<string>(delegate(int i) { return i.ToString(); }).ToArray()));

    Dts.TaskResult = (int)ScriptResults.Success;

Once this script has completed, these variables will look like
locals window

I also use this script as the source in my Data Flow tasks.

public override void CreateNewOutputRows()
    for (int i = 0; i < 60; i++)
        Output0Buffer.StateCode = i;

Conditional split

The Conditional Split transformation and the expression language itself does not offer any lookup type operation so a complex object like an array or List is out of the question. However, I was able to elicit a lookup-like functionality by using the FINDSTRING backwards. Normally, I think of using it as FINDSTRING(MyColumn, "SomeValue", 1). That is, if SomeValue exists anywhere in MyColumn, the return value would be the one-based position where the value is found, zero for no match.

However, if you think of it opposite, you could use the reference value as the first argument and the current value in the second. Let's put some values to put this thinking in perspective. Missouri's code is 29. The set of Union states is 6, 9, 10, 17, 18, 19, 20, 21, 23, 24, 25, 26, 27, 29, 32, 33, 34, 36, 39, 41, 42, 44, 50, 54, 55. If I could find a way to represent the Union states set in a string format, I could use FINDSTRING on it. The first problem you'll run into though, is California aka 6. 6 will match 6 but it will also match 26 (Michigan), 36 (New York). We'd luck out in that scenario as they're both Union states. Idaho, 16, would end up being tagged as Union state even though they would not achieve statehood until 25 years after the Civil War. Same story with 46---South Dakota and 56---Wyoming, they were not around as states to cast their lot in the war. 60 too would be recognized as a Union state.

For my data set, I am restricting the FIPS State set to only the 50 states currently recognized. The standard itself recognizes many more entities than just the 50 though so 60 is assigned to American Samoa. 60 should be kicked out to the Unknown bucket based on our input set. Clearly we need to do something more than search for our term, we need to give our search the concept of a word boundary. Using a regular expression, it'd be as easy as using \b. Here, we'd need to introduce our own artificial word boundaries. Chose a value that should never appear in your source data. I chose to delimit everything with a colon. Thus, the above set would be represented as :6:9:10:17:18:19:20:21:23:24:25:26:27:29:32:33:34:36:39:41:42:44:50:54:55: I would then need to modify my input value by concatenating a leading and trailing colon. (FINDSTRING(@[User::UnionStates],":" + (DT_WSTR,2)StateCode + ":",1)) > 0

I chose to create derived columns before the conditional split but the logic could just as easily been performed within the Conditional split itself.

(FINDSTRING(@[User::ConfederatesStates],":" + (DT_WSTR,2)StateCode + ":",1)) > 0
(FINDSTRING(@[User::UnionStates],":" + (DT_WSTR,2)StateCode + ":",1)) > 0
(FINDSTRING(@[User::UndeclaredStates],":" + (DT_WSTR,2)StateCode + ":",1)) > 0
Conditional split

Complex objects

Conditional split works good enough, I suppose but it smells. Maintaining that logic is not going to be pleasant and there's no good way to know if your delimiter character suddenly becomes part of your source data. If it's purely numeric, then yes, you can figure that out but if the task is to find strings in a set of strings, that can get tricky. What would be nice is to use a something cleaner like set based logic for finding membership.

To make this work, we need to use the script transformation but possibly in a way you've never used it.
List object asynchronous script task

This seems simple enough, remarkably like our conditional split approach. The major difference though is this script task is asynchronous. This means the data in our buffers are going to be copied (expensive) from the input buffer to the corresponding output buffer. I also needed to create the various output buffers and define the shape of the data. That's time consuming and nothing I'd care to maintain.

The code is straight forward. Create 3 List variables for the class, instantiate them in the PreExecute and then use the value of the current row in the Contains method of the class objects. Once we find a match, create a row on the appropriate buffer and fill it with data.

private List<int> Confederates;
private List<int> Union;
private List<int> Undeclared;

public override void PreExecute()
    this.Confederates = this.Variables.ConfederateList as List<int>;
    this.Union = this.Variables.UnionList as List<int>;
    this.Undeclared = this.Variables.UndeclaredList as List<int>;

public override void Input0_ProcessInputRow(Input0Buffer Row)
    if (this.Confederates.Contains(Row.StateCode))
        ConfederatesBuffer.StateCode = Row.StateCode;
    else if (this.Union.Contains(Row.StateCode))
        UnionBuffer.StateCode = Row.StateCode;
    else if (this.Undeclared.Contains(Row.StateCode))
        UndeclaredBuffer.StateCode = Row.StateCode;
        UnknownBuffer.SateCode = Row.StateCode;

As an afterthought, we could make this identical in execution by offloading the conditional split logic to the native component and simply used a script task in synchronous fashion, much like we did for the Conditional Split's Derived Column transformation "Assign values". This would actually be a better approach all around in terms of resource usage, processing speed, maintenance, etc.

Cached Connection Manager

I must confess, until recently, I never thought about CCM. I cut my teeth on SQL Server 2005's SSIS so some of the "new" features aren't as ingrained in my problem solving toolbox like they should be. CCM is one I've definitely overlooked but am coming around to appreciating it's utility. This approach will have us load the available values into a cached connection manager which we will then leverage in the subsequent data flow to use the native Lookup task to determine values.

This post runs long as it is so I won't show the whole code but there are 250 lines of this code enumerating all the states and their affiliation. When executed, that data is sent to a cache connection manager for later consumption.

Output0Buffer.StateName = "Alabama";
Output0Buffer.StateAbbreviation = "AL";
Output0Buffer.StateCode = 01;
Output0Buffer.SideDeclared = "C";

Output0Buffer.StateName = "Alaska";
Output0Buffer.StateAbbreviation = "AK";
Output0Buffer.StateCode = 02;
Output0Buffer.SideDeclared_IsNull = true;

I use the Lookup task to retrieve the SideDeclared value. I "redirect rows to no match output" to my Unknown bucket. The found rows are sorted into their appropriate bucket with a conditional split.
Cached Connection Manager Lookup

Again, for consistency I could have also sent the unknown rows down to the Conditional Split (Ignore failure option) and done all the redirects there. My reason for not doing so here is laziness. I saw no need to write the conditional split logic when the native component already offered it.


I scaled up to 60 million rows and used the modulus operator on the (%67) on the source script for sending data to ensure we were sending consistent batches to all the options. I added row counts to all the output so the optimizer didn't factor branches out and ran it five times from the command-line (all in 32 bit mode). As a last minute test, I'm rerunning the processing in 64 bit mode.
Conditional Split list
This is the approach described above abusing FINDSTRING.
Smarter Script
This is the synchronous version of my script working against the List objects
Script task
This is the asynchronous version of my script working against the List objects
Use lookups
This uses the Cache Connection Manager
SourceRowsAverage task duration (ms)σ task durationAverage throughput (rows / ms)σ througputMode
Conditional Split list60,000,000337,80097688.0819249.7932bit
Conditional Split list60,000,000254,5004500235.834.1764bit
Smarter Script60,000,000458,60011,038.12130.913.1432bit
Smarter Script60,000,000429,50030,500140.419.9764bit
Script task60,000,000471,80022,301.57127.465.9932bit
Script task60,000,000466,50019,500128.845.3964bit
Use lookups60,000,000485,80014246.4123.613.6632bit
Use lookups60,000,000457,50011500131.233.364bit


I am shocked on the performance results. For this problem domain, the FINDSTRING approach was the most efficient, by a considerable margin. The other three approaches consistently averaged a throughput of within 7 rows per millisecond of each other. I did find it interesting that the standard deviation of the FINDSTRING approach fluctuated so much. While this box is older and slower, there was not a considerable amount of activity going on during the package executions.

64 vs 32 bit

With a sample size of 2, all the 64 bit tasks saw an increase in throughput over their 32 bit counterparts. FINDSTRING method showed the most improvement, it's now nearly 100 rows/ms faster than the next component. Interestingly enough, the asynchronous script task saw a middling 1 row/ms increase compared to the nearly 10 row/ms increase on the other two tasks. I'll collect more data and see if the pattern holds true.

My package is available on my google site for your own evaluations.

1 comment:

Christopher Harsch said...

I ran these and got a slightly different results. I dont know if it is hardware difference or what.

I took the list implementation and used sorted dictionary instead, and was able to get a small improvement in speed.

265 226000 22339 Script task Sorted Dictionaries
259 231000 22337 List implementation
235 255000 22345 Use lookups
224 267000 22341 Smarter Script
213 281000 22335 Conditional Split list