I saw a question asked: how can I split/filter a data flow based on an array-like data source? That sounded like a decent question, so I investigated some alternate approaches.
Problem definition
Given an iterable/sequence, what are the options for using that reference data as a filter in SSIS? To test this, I created a reference list of US states, their FIPS codes, and an indicator of their allegiance in the Civil War. There's no significance to the data beyond its being a publicly available dataset. I then threw 60 numbers (monotonically increasing values) against that reference set to see how the split could be accomplished. As there are only 50 states and not all were in existence during the Civil War, I expect my data to be split into four buckets: Union, Confederate, Undeclared (valid FIPS code but the state was not yet in existence), and Unknown (invalid code).
Package Setup
I created 3 sets of variables (Confederate, Union and Undeclared) in 2 flavors (String and Object).
I have a script that runs that populates them.
public void Main()
{
    // Populates User::ConfederateList, User::ConfederatesStates, User::UndeclaredList,
    // User::UndeclaredStates, User::UnionList and User::UnionStates
    List<int> Confederates = new List<int> { 01, 05, 12, 13, 22, 28, 37, 45, 47, 48, 51 };
    List<int> Union = new List<int> { 06, 09, 10, 17, 18, 19, 20, 21, 23, 24, 25, 26, 27, 29, 32, 33, 34, 36, 39, 41, 42, 44, 50, 54, 55 };
    List<int> Undeclared = new List<int> { 02, 04, 08, 15, 16, 30, 31, 35, 38, 40, 46, 49, 53, 56 };

    this.Dts.Variables["ConfederateList"].Value = Confederates;
    this.Dts.Variables["ConfederatesStates"].Value = string.Format(":{0}:", string.Join(":", Confederates.ConvertAll<string>(delegate(int i) { return i.ToString(); }).ToArray()));

    this.Dts.Variables["UnionList"].Value = Union;
    this.Dts.Variables["UnionStates"].Value = string.Format(":{0}:", string.Join(":", Union.ConvertAll<string>(delegate(int i) { return i.ToString(); }).ToArray()));

    this.Dts.Variables["UndeclaredList"].Value = Undeclared;
    this.Dts.Variables["UndeclaredStates"].Value = string.Format(":{0}:", string.Join(":", Undeclared.ConvertAll<string>(delegate(int i) { return i.ToString(); }).ToArray()));

    Dts.TaskResult = (int)ScriptResults.Success;
}
Once this script has completed, the Object variables hold List&lt;int&gt; instances and the String variables hold colon-delimited lists of the same codes (the leading zeros disappear because the values are integers).
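For reference, here is a standalone sketch of the same string-formatting logic outside of SSIS. The Delimit helper is my own name, not part of the package; it uses the string.Join overload that accepts integers directly instead of the ConvertAll dance.

```csharp
using System;
using System.Collections.Generic;

class DelimitDemo
{
    // Same result as the string.Format/ConvertAll approach in the package script:
    // wrap the whole list in colons so every value gets an artificial word boundary.
    static string Delimit(List<int> codes) =>
        ":" + string.Join(":", codes) + ":";

    static void Main()
    {
        var confederates = new List<int> { 01, 05, 12, 13, 22, 28, 37, 45, 47, 48, 51 };
        // Leading zeros vanish because these are int literals, not strings.
        Console.WriteLine(Delimit(confederates));
        // :1:5:12:13:22:28:37:45:47:48:51:
    }
}
```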
I also use this script as the source in my Data Flow tasks.
public override void CreateNewOutputRows()
{
    for (int i = 0; i < 60; i++)
    {
        Output0Buffer.AddRow();
        Output0Buffer.StateCode = i;
    }
}
Conditional split
The Conditional Split transformation and the expression language itself do not offer any lookup-type operation, so a complex object like an array or List cannot be consumed directly. What the expression language does offer is FINDSTRING(MyColumn, "SomeValue", 1). That is, if SomeValue exists anywhere in MyColumn, the return value is the one-based position where the value is found, zero for no match.
However, if you flip that thinking around, you can use the reference value as the first argument and the current value as the second. Let's put some numbers to this to put it in perspective. Missouri's code is 29. The set of Union states is 6, 9, 10, 17, 18, 19, 20, 21, 23, 24, 25, 26, 27, 29, 32, 33, 34, 36, 39, 41, 42, 44, 50, 54, 55. If I could find a way to represent the Union states set as a string, I could use FINDSTRING against it. The first problem you'll run into, though, is California, aka 6. "6" will match 6, but it will also match inside 26 (Michigan) and 36 (New York). We'd luck out in those cases as they're all Union states. Idaho, 16, however, would end up being tagged as a Union state even though it would not achieve statehood until 25 years after the Civil War. Same story with 46 (South Dakota) and 56 (Wyoming); they were not around as states to cast their lot in the war. 60, too, would be recognized as a Union state.
For my data set, I am restricting the FIPS state set to only the 50 states currently recognized. The standard itself recognizes many more entities than just the 50, though, so 60 is assigned to American Samoa. 60 should be kicked out to the Unknown bucket based on our input set. Clearly we need to do something more than search for our term; we need to give our search the concept of a word boundary. With a regular expression, it'd be as easy as using \b, but FINDSTRING has no such concept. Here, we need to introduce our own artificial word boundaries. Choose a value that should never appear in your source data; I chose to delimit everything with a colon. Thus, the above set would be represented as :6:9:10:17:18:19:20:21:23:24:25:26:27:29:32:33:34:36:39:41:42:44:50:54:55: and I would then need to modify my input value by concatenating a leading and trailing colon:

(FINDSTRING(@[User::UnionStates],":" + (DT_WSTR,2)StateCode + ":",1)) > 0
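To see the word-boundary fix outside of SSIS, here is a small C# sketch that mimics the relevant FINDSTRING semantics with IndexOf. The FindString helper is my stand-in for illustration, not an SSIS API.

```csharp
using System;

class WordBoundaryDemo
{
    // Mimics the FINDSTRING behavior we care about:
    // one-based position of the first match, zero when absent.
    static int FindString(string haystack, string needle) =>
        haystack.IndexOf(needle, StringComparison.Ordinal) + 1;

    static void Main()
    {
        string delimited = ":6:9:10:17:26:36:";

        // Without boundaries, "6" also lurks inside 26 and 36.
        Console.WriteLine(FindString("6 9 10 17 26 36", "6"));  // 1

        // With colon boundaries, Idaho (16) correctly misses...
        Console.WriteLine(FindString(delimited, ":16:"));       // 0

        // ...and Michigan (26) matches only at its own slot.
        Console.WriteLine(FindString(delimited, ":26:"));       // 11
    }
}
```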
I chose to create derived columns before the Conditional Split, but the logic could just as easily have been performed within the Conditional Split itself.
(FINDSTRING(@[User::ConfederatesStates],":" + (DT_WSTR,2)StateCode + ":",1)) > 0
(FINDSTRING(@[User::UnionStates],":" + (DT_WSTR,2)StateCode + ":",1)) > 0
(FINDSTRING(@[User::UndeclaredStates],":" + (DT_WSTR,2)StateCode + ":",1)) > 0
Complex objects
The Conditional Split works well enough, I suppose, but it smells. Maintaining that logic is not going to be pleasant, and there's no good way to know if your delimiter character suddenly becomes part of your source data. If the data is purely numeric, then yes, you can reason that out, but if the task is to find strings within a set of strings, that can get tricky. What would be nice is something cleaner, like set-based logic, for testing membership.
To make this work, we need to use the Script transformation, but possibly in a way you've never used it before.
This seems simple enough, remarkably like our Conditional Split approach. The major difference, though, is that this script component is asynchronous. This means the data in our buffers is going to be copied (expensive) from the input buffer to the corresponding output buffer. I also needed to create the various output buffers and define the shape of the data. That's time consuming and nothing I'd care to maintain.
The code is straightforward: create three List&lt;int&gt; members, populate them from the package variables in PreExecute, and test membership with Contains in Input0_ProcessInputRow. As an afterthought, we could make this identical in execution by offloading the split logic to the native Conditional Split component and simply using a script component in synchronous fashion to assign values, much like we did with the Derived Column transformation in the Conditional Split approach. That would actually be a better approach all around in terms of resource usage, processing speed, maintenance, etc.
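A sketch of that synchronous variant is below. The Row class is a stand-in for the SSIS-generated buffer type so the sketch runs on its own, and the Affiliation column name is my invention; in the real package you would add it as a synchronous output column and let a downstream Conditional Split route on it. I've also swapped List for HashSet, which gives O(1) membership tests instead of a linear scan.

```csharp
using System;
using System.Collections.Generic;

// Stand-in for the SSIS-generated buffer class, just to make the sketch runnable.
class Row { public int StateCode; public string Affiliation; }

class SynchronousSketch
{
    static readonly HashSet<int> Confederates = new HashSet<int> { 1, 5, 12, 13, 22, 28, 37, 45, 47, 48, 51 };
    static readonly HashSet<int> Union = new HashSet<int> { 6, 9, 10, 17, 18, 19, 20, 21, 23, 24, 25, 26, 27, 29, 32, 33, 34, 36, 39, 41, 42, 44, 50, 54, 55 };
    static readonly HashSet<int> Undeclared = new HashSet<int> { 2, 4, 8, 15, 16, 30, 31, 35, 38, 40, 46, 49, 53, 56 };

    // Mirrors Input0_ProcessInputRow in a synchronous component: tag the row in
    // place; no buffer copies, no output buffers to define and maintain.
    static void ProcessRow(Row row)
    {
        if (Confederates.Contains(row.StateCode)) row.Affiliation = "C";
        else if (Union.Contains(row.StateCode)) row.Affiliation = "U";
        else if (Undeclared.Contains(row.StateCode)) row.Affiliation = "N";
        else row.Affiliation = "?";
    }

    static void Main()
    {
        var missouri = new Row { StateCode = 29 };
        ProcessRow(missouri);
        Console.WriteLine(missouri.Affiliation); // U

        var samoa = new Row { StateCode = 60 };
        ProcessRow(samoa);
        Console.WriteLine(samoa.Affiliation);    // ?
    }
}
```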
I must confess, until recently, I never thought about the Cache Connection Manager (CCM). I cut my teeth on SQL Server 2005's SSIS, so some of the "newer" features aren't as ingrained in my problem-solving toolbox as they should be. The CCM is one I've definitely overlooked but am coming around to appreciating for its utility. This approach has us load the available values into a cache connection manager, which we then leverage in the subsequent data flow with the native Lookup transformation to determine values.
This post runs long as it is, so I won't show the whole script, but there are 250 lines of code like this enumerating all the states and their affiliations. When executed, that data is sent to a cache connection manager for later consumption.
I use the Lookup task to retrieve the SideDeclared value. I "redirect rows to no match output" to my Unknown bucket. The found rows are sorted into their appropriate bucket with a conditional split.
Again, for consistency I could have also sent the unknown rows down to the Conditional Split (Ignore failure option) and done all the redirects there. My reason for not doing so here is laziness. I saw no need to write the conditional split logic when the native component already offered it.
I am shocked at the performance results. For this problem domain, the FINDSTRING approach was the most efficient, by a considerable margin. The other three approaches consistently averaged throughputs within 7 rows per millisecond of each other. I did find it interesting that the standard deviation of the FINDSTRING approach fluctuated so much. While this box is older and slower, there was not a considerable amount of activity going on during the package executions.
private List<int> Confederates;
private List<int> Union;
private List<int> Undeclared;
public override void PreExecute()
{
base.PreExecute();
this.Confederates = this.Variables.ConfederateList as List<int>;
this.Union = this.Variables.UnionList as List<int>;
this.Undeclared = this.Variables.UndeclaredList as List<int>;
}
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
if (this.Confederates.Contains(Row.StateCode))
{
ConfederatesBuffer.AddRow();
ConfederatesBuffer.StateCode = Row.StateCode;
}
else if (this.Union.Contains(Row.StateCode))
{
UnionBuffer.AddRow();
UnionBuffer.StateCode = Row.StateCode;
}
else if (this.Undeclared.Contains(Row.StateCode))
{
UndeclaredBuffer.AddRow();
UndeclaredBuffer.StateCode = Row.StateCode;
}
else
{
UnknownBuffer.AddRow();
UnknownBuffer.StateCode = Row.StateCode;
}
}
Cached Connection Manager
Output0Buffer.AddRow();
Output0Buffer.StateName = "Alabama";
Output0Buffer.StateAbbreviation = "AL";
Output0Buffer.StateCode = 01;
Output0Buffer.SideDeclared = "C";
Output0Buffer.AddRow();
Output0Buffer.StateName = "Alaska";
Output0Buffer.StateAbbreviation = "AK";
Output0Buffer.StateCode = 02;
Output0Buffer.SideDeclared_IsNull = true;
Performance
I scaled up to 60 million rows and used the modulus operator (% 67) in the source script so that consistent batches of data were sent to all the options. I added row counts to all the outputs so the engine didn't factor branches out, and ran each package five times from the command line (all in 32 bit mode). As a last minute test, I reran the processing in 64 bit mode.
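The modified source script presumably looked something like this (the row count and modulus come from the description above; the exact loop is my reconstruction):

```csharp
public override void CreateNewOutputRows()
{
    // 60 million rows, cycling through 0..66 so every bucket,
    // including Unknown, receives a consistent share of the data.
    for (int i = 0; i < 60000000; i++)
    {
        Output0Buffer.AddRow();
        Output0Buffer.StateCode = i % 67;
    }
}
```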
Source                  Rows        Avg task duration (ms)  σ task duration  Avg throughput (rows/ms)  σ throughput  Mode
Conditional Split list  60,000,000  337,800                 97,688.08        192.00                    49.79         32bit
Conditional Split list  60,000,000  254,500                  4,500.00        235.83                     4.17         64bit
Smarter Script          60,000,000  458,600                 11,038.12        130.91                     3.14         32bit
Smarter Script          60,000,000  429,500                 30,500.00        140.41                     9.97         64bit
Script task             60,000,000  471,800                 22,301.57        127.46                     5.99         32bit
Script task             60,000,000  466,500                 19,500.00        128.84                     5.39         64bit
Use lookups             60,000,000  485,800                 14,246.40        123.61                     3.66         32bit
Use lookups             60,000,000  457,500                 11,500.00        131.23                     3.30         64bit

Conclusion
64 vs 32 bit
With a sample size of 2, all the 64 bit executions saw an increase in throughput over their 32 bit counterparts. The FINDSTRING method showed the most improvement; it is now nearly 100 rows/ms faster than the next component. Interestingly enough, the asynchronous script task saw a middling 1 row/ms increase compared to the nearly 10 row/ms increase of the other two script-based approaches. I'll collect more data and see if the pattern holds true.
My package is available on my Google site for your own evaluations.
1 comment:
I ran these and got slightly different results. I don't know if it is a hardware difference or what.
I took the list implementation and used sorted dictionary instead, and was able to get a small improvement in speed.
265 226000 22339 Script task Sorted Dictionaries
259 231000 22337 List implementation
235 255000 22345 Use lookups
224 267000 22341 Smarter Script
213 281000 22335 Conditional Split list