World of Whatever

Databricks sparksql concat is not your SQL Server concat

2023-09-26T08:00:00.058-05:00

One of these is not like the other...

The concat function is super handy in the database world but be aware that the SQL Server one is way better because it solves two problems. It combines everything into a string and it does not require NULL checking. In the before times, one had to down cast to a n/var/char type as well as check for NULL before appending strings via the plus sign.

In Databricks CONCAT WILL ONLY TAKE CARE OF CASTING TO THE STRING TYPE. NULLS WILL CONTINUE TO BITE YOU IN THE BUTTOCKS.

Given the following example query, we generate two rows in a derived table where the col2 value is either true (boolean 1) or NULL. In the LEFT JOIN LATERAL, which is the Databricks equivalent of OUTER APPLY, I concat the 3 columns together with a pipe as separator and behold, my decidedly different results from a SQL Server expectation: the hkey for the second row is NULL, where SQL Server's CONCAT would have returned 2||C.

SELECT
*
FROM
(
SELECT * FROM VALUES (1, true, 'B')
UNION ALL SELECT * FROM VALUES (2, NULL, 'C')
)AS X(col1, col2, col3)
LEFT JOIN LATERAL
(
  SELECT concat(X.col1, '|', X.col2, '|', X.col3)
)HK(hkey);

What do you do? You get to wrap every nullable column with a coalesce call. Except, coalesce requires the same datatypes (mostly) so a naive implementation of

SELECT concat(X.col1, '|', coalesce(X.col2, ''), '|', X.col3)

will result in the following error
AnalysisException: [DATATYPE_MISMATCH.DATA_DIFF_TYPES] Cannot resolve "coalesce(outer(X.col2), )" due to data type mismatch: Input to `coalesce` should all be the same type, but it's ("BOOLEAN" or "STRING")

Instead, one needs to do something along the lines of

SELECT
*
FROM
(
SELECT * FROM VALUES (1, true, 'B')
UNION ALL SELECT * FROM VALUES (2, NULL, 'C')
)AS X(col1, col2, col3)
LEFT JOIN LATERAL
(
  SELECT concat(X.col1, '|', coalesce(concat(X.col2, ''),''), '|', X.col3)
)HK(hkey);

At least I can automate this pattern with the information_schema.columns
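As a rough sketch of that automation, the snippet below builds the null-safe concat expression from a list of column names. The list is hard-coded here; in practice it would come from a query against information_schema.columns. The function name, the X alias default and the choice to wrap every column (rather than only the nullable ones) are my assumptions, not something from the query above.

```python
def null_safe_concat(columns, alias="X", sep="|"):
    """Build a Spark SQL concat() call that turns NULL columns into ''."""
    # concat(col, '') casts any type to string; coalesce then swaps NULL for ''.
    wrapped = [f"coalesce(concat({alias}.{col}, ''), '')" for col in columns]
    # Interleave the quoted separator between the wrapped columns.
    return "concat(" + f", '{sep}', ".join(wrapped) + ")"


# Hypothetical column list; normally read from information_schema.columns.
expr = null_safe_concat(["col1", "col2", "col3"])
print(expr)
```

Wrapping every column costs a few extra function calls but means the generator does not need to know which columns are nullable.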

Databricks sparksql escaping quote/tick

2023-09-14T08:00:00.003-05:00

If I had to embed a single quote in a query in TSQL, I would double it. In SparkSQL, I escape it with a backslash, like a classic C style string. The following shows how one would write a query that generates the queries to find the row counts across all tables in SQL Server or Unity Catalog. Although for SQL Server, you're better off just querying the partitions meta table as it's waaaaay faster.

TSQL

SELECT CONCAT('SELECT COUNT(1) AS rc, ''', T.table_name, ''' AS table_name FROM dev.silver.', T.table_name, '') AS rcQ FROM dev.information_schema.tables AS T WHERE T.table_schema = 'silver'

Databricks unity catalog

SELECT CONCAT('SELECT COUNT(1) AS rc, \'', T.table_name, '\' AS table_name FROM dev.silver.', T.table_name, '') AS rcQ FROM dev.information_schema.tables AS T WHERE T.table_schema = 'silver'
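If the query text is being assembled outside the database, the two escaping rules can be captured in a pair of small helpers. This is a minimal sketch; the function names are made up for illustration and it assumes Spark's default backslash-escaped string literals.

```python
def tsql_literal(text):
    # T-SQL: double any embedded single quote.
    return "'" + text.replace("'", "''") + "'"


def sparksql_literal(text):
    # Spark SQL: backslash-escape backslashes first, then single quotes.
    return "'" + text.replace("\\", "\\\\").replace("'", "\\'") + "'"


print(tsql_literal("O'Brien"))      # 'O''Brien'
print(sparksql_literal("O'Brien"))  # 'O\'Brien'
```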

ADF, DB2 and unexpected token query error

2023-09-12T08:00:00.052-05:00

In yet another exciting episode of head to desk, I have a beautiful DB2 query. It runs just fine in a tool like DBeaver.

SELECT 
    STR.CPY_CD
,   STR.STR_CD
,   STR.DFLT_FAC_NBR -- I think this is division number
,   current date || 'T' || VARCHAR_FORMAT(CURRENT TIMESTAMP, 'HH24:MI:SS')  AS CurrentDateTime
FROM 
    FOO.CUSTMASTER AS STR
FOR FETCH ONLY WITH UR

Punch it into ADF and it fails, repeatedly, with this ever-so-helpful error message. What is going on here?

Operation on target Copy StoreDivision to ADLS failed: Failure happened on 'Source' side. ErrorCode=DB2DriverRunFailed,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Error thrown from driver. Sql code: '-104',Source=Microsoft.Connectors.Db2,''Type=Microsoft.HostIntegration.DrdaClient.DrdaException,Message=An unexpected token "END-OF-STATEMENT , STR.DFLT_FAC_NBR " was found following "". Expected tokens may include: "". SQLSTATE=42601 SQLCODE=-104,Source=Microsoft.HostIntegration.Drda.Requester,'

After pasting the query into a text editor and examining each character in hex mode, no, no there are no "weird" unprintable characters hiding in my query. I mean, the only thing weird about the fourth line is that I have a comment there. But that syntax is totally valid for DB2...

Oh gentle reader, I'm sure there's a page somewhere in the documentation that indicates carriage returns/line feeds are stripped from queries when submitted, at least for Db2 connectors, but I'd really like to stop running into surprises with Azure Data Factory. At least, that's my best guess as to what is happening because BEHOLD, this query works

SELECT 
    STR.CPY_CD
,   STR.STR_CD
,   STR.DFLT_FAC_NBR /* I think this is division number */
,   current date || 'T' || VARCHAR_FORMAT(CURRENT TIMESTAMP, 'HH24:MI:SS')  AS CurrentDateTime
FROM 
    FOO.CUSTMASTER AS STR
FOR FETCH ONLY WITH UR

Why do I think the CR/LFs are stripped? Because if I run this query in a tool like DBeaver, it fails and you won't believe what the error message is

SELECT     STR.CPY_CD,   STR.STR_CD,   STR.DFLT_FAC_NBR -- I think this is division number

SQL Error [42601]: An unexpected token "END-OF-STATEMENT" was found following ", STR.DFLT_FAC_NBR". Expected tokens may include: "".. SQLCODE=-104, SQLSTATE=42601, DRIVER=4.31.10

So, the moral of the story? Always use the /**/ comment syntax in your queries. Not just in ADF but everywhere because you never know when formatting is going to get eaten.
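To make the failure mode concrete, here is a small sketch in Python. flatten() mimics a client that swallows line breaks (my guess at what ADF does, as above), and line_to_block_comments() rewrites -- comments into /* */ before the query is submitted. Both function names are hypothetical and the quote tracking is deliberately naive, so treat it as an illustration rather than a SQL parser.

```python
def line_to_block_comments(sql):
    """Rewrite `-- comment` as `/* comment */` so the query survives flattening."""
    fixed = []
    for line in sql.splitlines():
        in_string = False
        for i, ch in enumerate(line):
            if ch == "'":
                in_string = not in_string  # naive: a doubled '' toggles twice
            elif not in_string and line.startswith("--", i):
                comment = line[i + 2:].strip().replace("*/", "* /")
                line = f"{line[:i]}/* {comment} */"
                break
        fixed.append(line)
    return "\n".join(fixed)


def flatten(sql):
    # Mimic a client that swallows CR/LF: everything ends up on one line.
    return " ".join(part.strip() for part in sql.splitlines())


query = """SELECT
    STR.CPY_CD
,   STR.DFLT_FAC_NBR -- I think this is division number
FROM
    FOO.CUSTMASTER AS STR"""

flat = flatten(query)                           # FROM now sits behind the --
safe = flatten(line_to_block_comments(query))   # FROM survives
print(flat)
print(safe)
```

In the first printed line everything after the double dash, including the FROM clause, is part of the comment. In the second the comment is closed before FROM.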

Difference between SparkSQL and TSQL casts

2023-09-11T08:00:00.001-05:00

Yet another thing that has bitten me working in SparkSQL in Databricks---this time it's data types.

In SQL Server, a tinyint ranges from 0 to 255, which allows for 256 total values. If you attempt to cast a value that doesn't fit in that range, you're going to raise an error.

SELECT 256 AS x, CAST(256 AS tinyint) AS boom

Msg 220, Level 16, State 2, Line 1
Arithmetic overflow error for data type tinyint, value = 256.

The range for a tinyint is -128 to 127 in SparkSQL - still 256 total values. Docs call it out as well

ByteType: Represents 1-byte signed integer numbers. The range of numbers is from -128 to 127

SELECT CAST('128' AS tinyint) AS WhereIsTheBoom, CAST(128 AS tinyint) As WhatIsThisNonsense

Here I select the value 128 as both a string and a number. I honestly have no idea how to interpret these results. A cast from string behaves more like a TRY_CAST but numeric overflows just cycle?

Yeah, the cycle seems to be the thing as SELECT CAST(129 as tinyint) AS Negative127 is -127.
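The cycling is consistent with plain two's-complement wraparound: keep the low 8 bits and reinterpret them as a signed byte. The sketch below models that arithmetic in Python; it is my reading of the observed results, not a statement about Spark internals, and a cluster running with ANSI mode enabled may raise an error instead of wrapping.

```python
def to_signed_byte(n):
    # Keep the low 8 bits, then reinterpret them as a two's-complement value.
    return ((n + 128) % 256) - 128


for n in (127, 128, 129, 256):
    print(n, "->", to_signed_byte(n))
```

The value 129 comes out as -127, matching the query above, and 128 comes out as -128.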

SparkSQL Databricks [INVALID_USAGE_OF_STAR_OR_REGEX] Invalid usage of '*' in expression `alias`.

2023-09-08T08:00:00.001-05:00

SparkSQL Databricks Error in SQL statement: AnalysisException: [INVALID_USAGE_OF_STAR_OR_REGEX] Invalid usage of '*' in expression `alias`.

Hi, it's me. I'm the problem

Dear self, when you are writing new-to-you sparksql, assuming the TSQL construct you know just doesn't translate, and you receive the following error
Error in SQL statement: AnalysisException: [INVALID_USAGE_OF_STAR_OR_REGEX] Invalid usage of '*' in expression `alias`
take a good look at the syntax because I bet you've doubled up the FROM, again!
SELECT * FROM FROM uc.schema.table AS X WHERE X.col1 = 0

Make that
SELECT * FROM uc.schema.table AS X WHERE X.col1 = 0

ADF ForEach Activity

2023-09-06T08:00:00.002-05:00

This occasional series of posts I'm calling Notes from the field is primarily for my benefit as I don't work with Azure Data Factory with any regularity and have no muscle memory in it. And the number of sharp edges and non-intuitive behaviors baked into the product is quite frustrating for me. If you derive value from it, all the better.

The Azure Data Factory ForEach Activity shreds some enumerable thing but here I want to reference a Lookup to ForEach.

When I click on the ForEach Activity, in the "Items" section, I click "Add dynamic content" and am presented with these wonderful choices (given that my canvas consists of a Lookup Activity named "LKP_ReferenceTables" and the ForEach wired as a successor):

My options are

LKP_ReferenceTables: LKP_ReferenceTables activity output
LKP_ReferenceTables: LKP_ReferenceTables pipeline return value
LKP_ReferenceTables count: Count of the rows
LKP_ReferenceTables value array: Array of row data

and if I pick "activity output" the dynamic expression is
@activity('LKP_ReferenceTables').output
WILLIAM! DO NOT CLICK OK, yet. If you accept the default code, you're going to get a lovely error in the form of
Operation on target FE_ReferenceTables failed: The function 'length' expects its parameter to be an array or a string. The provided value is of type 'Object'
You need to add the ".value" property to the expression to get the actual stuff you want
@activity('LKP_ReferenceTables').output.value

Orrrr, I pick Array of row data and that generates the expected @activity('LKP_ReferenceTables').output.value Le sigh
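For future me, the difference between the two expressions is easier to see with the Lookup output written out. The sketch below models it as a Python dictionary with a count and a value array; the table names are invented and foreach_items() is only a stand-in for the check that the ForEach performs on its Items.

```python
# Hypothetical Lookup output with "First row only" unchecked.
lookup_output = {
    "count": 2,
    "value": [
        {"TableName": "RefTableA"},
        {"TableName": "RefTableB"},
    ],
}


def foreach_items(items):
    """Mimic the ForEach check: Items must be an array (or string), not an object."""
    if not isinstance(items, (list, str)):
        raise TypeError(
            f"'length' expects an array or a string, got {type(items).__name__}"
        )
    return list(items)


# @activity('LKP_ReferenceTables').output        -> the whole object, rejected
# @activity('LKP_ReferenceTables').output.value  -> the array of rows, accepted
rows = foreach_items(lookup_output["value"])
print(len(rows), lookup_output["count"])
```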

Formatting a date in SQL Server with the Z indicator

2023-05-03T08:00:00.010-05:00

It seems so easy, I was building json in SQL Server and the date format for the API specified it needed to have 3 millisecond digits and the zulu timezone signifier. Easy peasy, lemon squeezy, that is ISO8601 with time zone Z format code 127

SELECT CONVERT(char(24), GETDATE(), 127) AS waitAMinute;

Running that query yields something like 2023-05-02T10:47:18.850 Almost there but where's my Z? Hmmm, maybe it's because I need to put this into UTC?

SELECT CONVERT(char(24), GETUTCDATE(), 127) AS SwingAndAMiss;

Running that query yields something like 2023-05-02T15:47:18.850 It's in UTC but still no timezone indicator. I guess I can try an explicit conversion to a datetimeoffset and then convert to 127.

SELECT CONVERT(char(24), CAST(GETUTCDATE() AS datetimeoffset) , 127) AS ThisIsGettingRidiculous , CAST(GETUTCDATE() AS datetimeoffset) AS ControlValue;

Once again, that query yields something like 2023-05-02T15:47:18.850 and I can confirm the ControlValue aka unformatted looks like 2023-05-02 15:47:850.7300000 +00:00 We have timezone info, just not the way I need it.

Back to the documentation, let's read those pesky footnotes.

8 Only supported when casting from character data to datetime or smalldatetime. When casting character data representing only date or only time components to the datetime or smalldatetime data types, the unspecified time component is set to 00:00:00.000, and the unspecified date component is set to 1900-01-01.

9 Use the optional time zone indicator Z to make it easier to map XML datetime values that have time zone information to SQL Server datetime values that have no time zone. Z indicates time zone at UTC-0. The HH:MM offset, in the + or - direction, indicates other time zones. For example: 2022-12-12T23:45:12-08:00.

Those don't apply to me...Oh, wait, they do. The Z notation is only used for converting stringified dates into native datetime types. There is no cast and convert style code to output a formatted ISO 8601 date with the Z indicator.

So what do you do? Much of the internet just proposes using string concatenation to append the Z onto the string and move on. And that's what a rational developer would do but I am not one of those people.

Solution

If you want to get a well formatted ISO8601 with the time zone Z indicator, the one-stop-shop in SQL Server will be the FORMAT function because you can do anything there!

-- HH is the 24-hour clock; hh would render 15:47 as 03:47
SELECT
    FORMAT(CAST(D.val AS datetimeoffset), 'yyyy-MM-ddTHH:mm:ss.fffZ') AS WinnerWinnerChickenDinner
,   FORMAT(CAST(D.val AS datetime2(3)), 'yyyy-MM-ddTHH:mm:ss.fffZ') AS OrThis
FROM
(
    VALUES ('2023-05-02T15:47:18.850Z')
)D(val);

The final thing to note, the return type of FORMAT is different. It defaults to nvarchar(4000) whereas lame string concatenation yields us the right length (24) but the wrong type as concatenation changed our char to varchar. If we were storing this to a table, I'd add a final explicit cast, in either case, to be char(24). There are no unicode values to worry about nor will it ever be shorter than 24 characters.

SELECT
    DEDFRS.name
,   DEDFRS.system_type_name
FROM
    sys.dm_exec_describe_first_result_set
    (
        N'SELECT FORMAT(CAST(D.val AS datetimeoffset), ''yyyy-MM-ddTHH:mm:ss.fffZ'') AS LookMaUnicode , CONVERT(char(23), CAST(D.val AS datetimeoffset), 127) + ''Z'' AS StillVarChar FROM ( VALUES (''2023-05-02T15:47:18.850Z'') )D(val);'
    ,   N''
    ,   1
    ) AS DEDFRS;

Filed under, "I blogged about it, hopefully I'll remember the solution"
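If whatever consumes the json is handy with Python, the target format is easy to cross-check outside SQL Server. This is a minimal sketch with a made-up helper name; it assumes a timezone-aware datetime and truncates (rather than rounds) to milliseconds.

```python
from datetime import datetime, timezone


def to_iso8601_z(moment):
    """Format an aware datetime as yyyy-MM-ddTHH:mm:ss.fffZ (UTC, 3 ms digits)."""
    utc = moment.astimezone(timezone.utc)
    millis = utc.microsecond // 1000  # truncate microseconds to milliseconds
    return f"{utc:%Y-%m-%dT%H:%M:%S}.{millis:03d}Z"


stamp = to_iso8601_z(datetime(2023, 5, 2, 15, 47, 18, 850000, tzinfo=timezone.utc))
print(stamp, len(stamp))  # 2023-05-02T15:47:18.850Z 24
```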

Counting a character in a column

2022-12-22T08:00:00.007-06:00

I ran into an issue today that I wanted to write about so that maybe I remember the solution. We ran into a case where the source data in a column had an unprintable character. In this case, it was a line feed character, which is ASCII value 10, and they had 7 instances in this one row. "How did that get in there? Surely that's an edge case and we can just ignore it," and dear reader, I've been around long enough to know that this is likely a systemic situation. To count the number of line feeds in a single row, for a single column, I can just copy the value into Notepad++ or the like, display all characters, and simply count.

Now, let's count how many line feeds are in a table with 23.5 million rows by hand - Any takers? Exactly. The trick to this solution is that we're going to make use of the REPLACE function to substitute the empty string for all of the values we want to count. We'll then compare the difference in string lengths between the original and the final.

SELECT
    *
FROM
(
    -- Generate our data
    SELECT
        CONCAT('100', CHAR(10), 'E', CHAR(10), 'Main', CHAR(10), 'St', CHAR(10), 'W', CHAR(10), 'Ste', CHAR(10), '357', CHAR(10) ) AS InputString
) D0
CROSS APPLY
(
    -- What is the difference in string length?
    SELECT
        LEN(D0.InputString) - LEN(REPLACE(D0.InputString, CHAR(10), '')) AS LFCount
)D1;

This solution is elegant in that you only require one pass through a table to figure out the number. Plus, if you want to do the same computation in other languages that may not have a count function or equivalent available for strings, it likely supports a REPLACE operation.
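To show how portable the trick is, here is the same length-difference computation in Python against the same sample value. Python happens to have a built-in str.count, so this is purely an illustration of the REPLACE approach; the function name is mine.

```python
def count_char(text, target):
    # Removing every occurrence shrinks the string by exactly the number removed.
    return len(text) - len(text.replace(target, ""))


address = "100\nE\nMain\nSt\nW\nSte\n357\n"
print(count_char(address, "\n"))  # 7
```

Note this only holds as written for a single-character target; for a longer search string you would divide the difference by its length.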

Is there any other way?

Sure but none of them seem to perform as well.

STRING_SPLIT

Assuming you're on SQL Server 2016+, the second parameter to STRING_SPLIT will be the character to split on. Intuitively, I think this might be a more obvious solution. What do we want to do? Count how many times a character exists in a field. If we break the field up into multiple rows and then count how many rows were generated, Bob's your uncle!

SELECT
    *
FROM
(
    SELECT
        CONCAT('100', CHAR(10), 'E', CHAR(10), 'Main', CHAR(10), 'St', CHAR(10), 'W', CHAR(10), 'Ste', CHAR(10), '357', CHAR(10) ) AS InputString
) D0
CROSS APPLY
(
    SELECT
        -- N delimiters yield N + 1 pieces, so subtract 1 to get the delimiter count
        COUNT_BIG(SS.value) - 1 AS LFCount
    FROM
        STRING_SPLIT(D0.InputString, CHAR(10)) AS SS
)D1;

What I don't like is the performance. Running those two queries against my original table, it took about 30 seconds to compute the REPLACE's results and 70 seconds for the STRING_SPLIT.

Number table

I'm not going to go find my notes on how to use a number table to perform the same split operation as string_split but I can already tell you, it won't perform as well as string_split and we've already covered that one isn't going to cut it.

I want to count spaces or I'm dealing with unicode data

Fun, but documented, twist --- LEN is not going to count trailing white space. And while I'm a dumb 'murican, unicode length is different: DATALENGTH returns bytes rather than characters, so you should probably use DATALENGTH instead, but then divide by 2 for nchar/nvarchar data.

SELECT
    *
FROM
(
    -- Generate our data
    -- We are now using char(32), aka space, as our delimiter
    -- and tacking on an extra 100 spaces at the end
    -- and we set our type to be nchar
    SELECT
        CONCAT(CAST('100' AS nchar(3)), CHAR(32), 'E', CHAR(32), 'Main', CHAR(32), 'St', CHAR(32), 'W', CHAR(32), 'Ste', CHAR(32), '357', CHAR(32), space(100) ) AS InputString
) D0
CROSS APPLY
(
    SELECT LEN(D0.InputString) AS IncorrectLength
    ,   DATALENGTH(D0.InputString)/2 AS CorrectLength
)D01
CROSS APPLY
(
    -- What is the difference in string length?
    SELECT
        (DATALENGTH(D0.InputString) - DATALENGTH(REPLACE(D0.InputString, CHAR(32), '')))/2 AS SpaceCount
    ,   (LEN(D0.InputString) - LEN(REPLACE(D0.InputString, CHAR(32), ''))) AS SpaceCountWrong
)D1;
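The arithmetic behind the wrong and right answers can be checked with a small Python model. sql_len() is a hypothetical stand-in for LEN that ignores trailing spaces, and Python's own len() plays the role of DATALENGTH divided by 2.

```python
def sql_len(text):
    # SQL Server's LEN ignores trailing spaces; model that with rstrip.
    return len(text.rstrip(" "))


# Seven single spaces as delimiters plus 100 trailing spaces, as in the query.
value = "100 E Main St W Ste 357 " + " " * 100

wrong = sql_len(value) - sql_len(value.replace(" ", ""))
right = len(value) - len(value.replace(" ", ""))
print(wrong, right)  # 6 107
```

The LEN-style count misses the 101 trailing spaces, which is exactly the gap between 6 and 107.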

Can you think of any other ways to crack this nut?

GENERATE_SERIES is a persnickety little function

2022-11-22T08:00:00.018-06:00

New in SQL Server 2022 is the GENERATE_SERIES table valued function. It's a handy function to generate a series of numbers. Once you understand why a number table is handy, you'll find lots of uses for it but this is not the blog post to cover it.

I wanted to generate a time series using the new hotness and step 1 is to get a list of the time offsets I'll add to my starting value. The following query will generate 19566 rows (0 through 19565), whose values I then use as the second argument to a DATEADD call.

SELECT *, DATEADD(SECOND, Works.Value, DATETIME2FROMPARTS(2022,11,22,0,0,0,0,1)) FROM GENERATE_SERIES(0, 19565, 1) AS Works;

Perfect, now I have a row for each second from midnight to approximately 5.5 hours later. What if my duration needs to vary because I'm going to compute these ranges for a number of different scenarios? I should make that 19565 into a variable and let's overengineer this by making it a bigint.

-- Assume the duration might be "big"
-- TODO: @duration = DATEDIFF(SECOND, Start, Stop)
DECLARE @duration bigint = 19565;
SELECT *, DATEADD(SECOND, LolWorks.Value, DATETIME2FROMPARTS(2022,11,22,0,0,0,0,1)) FROM GENERATE_SERIES(0, @duration, 1) AS LolWorks;

In the immortal words of Press Your Luck, WHAMMIE WHAMMIE WHAMMIE!

Msg 5373, Level 16, State 1, Line 4
All the input parameters should be of the same type. Supported types are tinyint, smallint, int, bigint, decimal and numeric.
Msg 206, Level 16, State 2, Line 4
Operand type clash: int is incompatible with void type
Msg 206, Level 16, State 2, Line 4
Operand type clash: bigint is incompatible with void type
Msg 206, Level 16, State 2, Line 4
Operand type clash: int is incompatible with void type

The fix is easy - just as the error message says, all THREE operands to GENERATE_SERIES must be the same data type. In this case, it required upsizing to bigint. In the technical writers' defense (defence), they call it out in the documentation but I grew up only reading documentation after I failed to natively grasp how to make something work.

The data type for stop must match the data type for start.

SELECT * FROM GENERATE_SERIES(CAST(1 AS bigint), @duration, CAST(1 AS bigint)) AS Fixed;

Key take away

Unlike any other situation I can think of off the top of my head, GENERATE_SERIES won't automatically promote/upsize/expand data types.
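As a sanity check on the first query in this post, the same one-row-per-second series can be sketched in Python. The function name is made up; the start date and the 19565 second duration come from the queries above.

```python
from datetime import datetime, timedelta


def second_series(start, duration_seconds):
    # One timestamp per second, inclusive of both endpoints like GENERATE_SERIES.
    return [start + timedelta(seconds=i) for i in range(duration_seconds + 1)]


series = second_series(datetime(2022, 11, 22), 19565)
print(len(series), series[-1])  # 19566 2022-11-22 05:26:05
```

Python integers have no fixed width, so there is nothing like the int versus bigint clash to trip over here.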

ADF and MySql.Data.MySqlClient.MySqlException,Message=Got a packet bigger than 'max_allowed_packet'

2022-01-27T08:00:00.001-06:00

Azure Data Factory, ADF, and exception MySql.Data.MySqlClient.MySqlException,Message=Got a packet bigger than 'max_allowed_packet'

My StackOverflow developer profile specifies "I'd prefer to not work with" and honestly, the only thing I don't want to deal with is MySQL. I don't like Visual Basic or Access or plenty of other things but good grief, I find working with MySQL to be an absolute cesspit after every other RDBMS I've worked with. Which brings me to an overdue client project, consolidating various MySQL instances to a single reporting server. They have a standard schema on all the boxes (no really, that was my biggest fear but they're good at ensuring the nearly 200 sites have the exact same point release of code) and I needed to bring it to a single server so it can be fed into reports.

It seemed like a great fit for Azure Data Factory but I kept getting an error dealing with some packet size issue. What do I know about packet sizes?

Error details
Error code: 2200
Failure type: User configuration issue
Details: 'Type=MySql.Data.MySqlClient.MySqlException,Message=Got a packet bigger than 'max_allowed_packet' bytes,Source=MySqlConnector,''Type=MySql.Data.MySqlClient.MySqlException,Message=Got a packet bigger than 'max_allowed_packet' bytes,Source=MySqlConnector,'

What's the internet got to say about all this? I checked the setting on a server that worked and one that didn't SHOW VARIABLES LIKE 'max_allowed_packet'; but they both listed 4194304 (bytes).

Beyond changing configuration settings, and no guarantee that solves the issue, the idea of inconsistent table definition sounded promising. But no dice. I tried making all the fields nullable, but to no avail. I ran the mysqldump utility from the commandline to see if I could reproduce the packet issue. Nothing.

After a lot of frustration, I looked hard at the custom integration logs. Before I move data, I copy over the source information_schema.tables for the database and store the TABLE_ROWS for each table. That number is approximately the number of rows in the table. In the copy activity itself, I log the actual rows transferred and that's when I noticed something. The largest set of data from a single source was 6k rows. ALL OF THE HOSTS THAT GENERATED THE MAX PACKET EXCEPTION HAD MORE ROWS THAN 6K.

Well, what if the issue is one of volume? That's easy enough to test. I put a LIMIT 1000; on the query and pointed ADF at the server that never transferred data for that table. It worked. Sonofa. Ok, LIMIT 5000; Worked. Removed the limit - Failed, got a packet bigger than 'max_allowed_packet'

The error is not being generated from the source as I assumed. Normally, ADF says whether it's the source or the sink that caused the error. The exception makes sense if the error is on the sink. "You're sending too much data in one shot" would be a more useful error.

How do we fix it

The default Write Batch Size for a Copy Activity is 10,000. I dropped the size to 5000 and ran through all the troublesome hosts. Of the 51 hosts that would never transfer the suspect table, every.single.one.worked.
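Here is a back-of-the-envelope sketch of why a smaller batch could help. It assumes each write batch travels to the sink as a single packet, which is my guess and not something I have confirmed. The 700 byte row size and the 6,500 row table are invented purely for illustration; only the 4194304 byte limit and the two batch sizes come from this post.

```python
MAX_ALLOWED_PACKET = 4_194_304  # bytes, from SHOW VARIABLES on the servers
ASSUMED_ROW_BYTES = 700         # hypothetical average row size
ASSUMED_ROW_COUNT = 6_500       # hypothetical table just over the 6k mark


def batches(row_count, write_batch_size):
    """Split row_count rows into batch sizes the way a batched writer would."""
    full, rest = divmod(row_count, write_batch_size)
    return [write_batch_size] * full + ([rest] if rest else [])


for size in (10_000, 5_000):
    largest = max(batches(ASSUMED_ROW_COUNT, size))
    fits = largest * ASSUMED_ROW_BYTES <= MAX_ALLOWED_PACKET
    print(size, largest, fits)
```

With these made-up numbers, a batch size of 10,000 sends all 6,500 rows at once and blows past the limit, while 5,000 keeps the largest batch under it.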

SSIS Azure Feature Pack and the Flexible File components

2022-01-06T08:00:00.001-06:00

The Azure Feature Pack for SSIS is something I had not worked with before today. I have a client that wants to use the Flexible File Task/Flexible File Source/Flexible File Destination but they were having issues. The Flexible File tools allow you to work with Azure Blob storage. We were dealing with ADLS Gen2 but the feature pack can work with classic blob storage as well. In my hubris, I said no problem, I know SSIS. Dear reader, I did not know as much as I thought I did...

Our scenario was simple. We had a root folder datalake and subfolders of Raw. And into that we needed to land and then consume files. Easy peasy. The Flexible File Destination allows us to write. The Flexible File Source allows us to read and we can configure a Foreach File enumerator to use the "Foreach Data Lake Storage Gen2 File Enumerator" to interact with the file system. Everything is the same as Windows except we use forward slashes instead of backslashes for path separators.

I started with the Flexible File Source after I manually created a CSV and uploaded it to data lake. One of the things I wasn't sure about was path syntax - do I need to specify the leading slash, do I need to specify the trailing slash, etc. I think I determined it didn't matter, it'd figure out whether folder path needs a trailing slash if it wasn't specified.

As I was testing things, changing that value and the value in the Flat File Destination each time was going to be annoying so I wanted to see what my options were for using Expressions. Expressions are the secret sauce to making your SSIS packages graceful. The development teams took the same approach they have with the ADO.NET components in that there are no expressions on the components themselves. Instead, if you want to make the Flexible File Source/Flexible File Destination dynamic at all, you're going to have to look at the Data Flow Task's Expressions and then configure there.

Here I used an SSIS package variable @[User::ADLSPath] so I could easily switch out /datalake, datalake, datalake/raw, datalake/Raw/, /datalake/raw/, and evaluate case sensitivity, path manipulation, etc.

F5

Transfer data error : System.IO.FileNotFoundException: Could not load file or assembly 'Microsoft.Azure.Storage.Common, Version=11.1.3.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35' or one of its dependencies. The system cannot find the file specified. File name: 'Microsoft.Azure.Storage.Common, Version=11.1.3.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35'

I tried it again, same issue. I flipped from 64 to 32 bit (I had installed both sets of the Feature Pack). I reread the install requirements document. I looked through github bugs. I contemplated asking a question on StackOverflow. I ended up trying to reproduce the error on my desktop instead of the client's VM. Same bloody issue! Finally, I said to heck with it, maybe it was more strongly worded, and I'll get this assembly and install to the GAC. Maybe something went wonky with the installs on both machines.

How do I install something into the global assembly cache?

StackOverflow answer

On your file system, you might have a utility called "gacutil.exe". From an administrative command prompt, I'd type "CD \" and then "dir /s /b gacutil.exe". That should provide a list of all the instances of gacutil on your machine. Pick one, I don't care which. If there's a space in the path name, then you'll need to wrap it with double quotes.

"C:\Program Files (x86)\Microsoft SDKs\Windows\v10.0A\bin\NETFX 4.8 Tools\x64\gacutil.exe" /?

That would bring up the help text for using the gacutil that lives at that path. If the double quotes were not there, you'd get an error message stating
'C:\Program' is not recognized as an internal or external command, operable program or batch file.

How do I get Microsoft.Azure.Storage.Common.dll?

I hope you comment below and tell me an easier way because I am not a nuget master. Visual Studio has a nuget package manager in it. However, it only works in the context of a solution or project and the project type must support packages. If you have a single project in your solution and that project is an SSIS project, attempting to use the nuget package manager will result in an error
Operation Failed. No projects supported by NuGet in the solution.
Well fiddlesticks, I guess we have to give up.

Or, add a script task to the SSIS package and then click Edit Script. In the new instance of Visual Studio, guess what - it thinks it supports NuGet. It does not, when you close the editor, the packages go away and when the script runs, it does not have the brains to get the assemblies back from NuGet land. So, don't.close.the.editor. yet. In the NuGet Package manager, on the Browse tab, type in Microsoft.Azure.Storage.Common and look at that - Deprecated. Last stable version 11.2.3. But this error indicates the component expects 11.1.3.0 so in the Version, scroll all the way back and find it. Click the check box and click Install button. Yes, you agree to the other packages as well as the MIT license.

At this point, in the volatile/temporary file storage on your computer, you have an on disk representation of your empty script Task with a NuGet package reference. We need to find that location, e.g. C:\Users\bfellows\AppData\Local\Temp\Vsta\SSIS_ST150\VstacIlBNvWB__0KemyB8zd1UMw\ Copy the desired DLLs out into a safe spot (because if I have to do it here, I'll likely have to do it on the server and the other three development VMs) and then use the gacutil to install them.

Right click on Solution VstaProjects and choose Open in Terminal. The assembly we're looking for will be located at .\packages\Azure.Storage.Common.12.9.0\lib\netstandard2.0\Azure.Storage.Common.dll. Assume I copied it to C:\temp

"C:\Program Files (x86)\Microsoft SDKs\Windows\v10.0A\bin\NETFX 4.8 Tools\x64\gacutil.exe" -if C:\temp\Azure.Storage.Common.dll

That will force install the dll to the GAC. Now when I run SSIS, we'll see whether that has resolved our error.

Will the real error please stand up

It did resolve our error, but not. The error that was reported, missing assembly was Mickey Mouse, mate. Spurious. Not genuine.

Look what error decided to show up now that it could error out "better"
Transfer data error : Microsoft.DataTransfer.Common.Shared.HybridDeliveryException: ADLS Gen2 operation failed for: Operation returned an invalid status code 'BadRequest'. Account: 'datalakedev'. FileSystem: 'datalake'. ErrorCode: 'TlsVersionNotPermitted'. Message: 'The TLS version of the connection is not permitted on this storage account.'

If I go to my storage account, under Settings, Configuration, there I can change Minimum TLS version from 1.2 to 1.1. Oh, except that still isn't kosher - same error. 1.0 it is and lo and behold, I have data transfer. The root cause is not a missing assembly, it is a red herring error message that could only be resolved by adding the assembly to the global assembly cache.

Rant

How in the hell would a normal person make the connection between "Could not load file or assembly" and Oh, I need to change the TLS? What's really galling is the fact that when I used the Flexible File Source for my data flow, I specified a file on blob storage and SSIS was able to connect and read that file because it identified the associated metadata. It has two columns and here are the data types (defaulted to string but who cares, that's consistent with flat file source). BUT IT PICKED UP THE METADATA. IT COULD TALK TO AZURE BLOB STORAGE EVEN THOUGH IT ONLY ALLOWED 1.2! And yet, when it came time to run, it could not talk on the same channel. Can you see how I lost a large portion of my day trying to decipher this foolishness?

By the way, knowing that the root cause is a mismatch between the TLS SSIS/my computer is using and the default on the storage account, let's go back to the writeup for the Azure Feature Pack

Use TLS 1.2
The TLS version used by Azure Feature Pack follows system .NET Framework settings. To use TLS 1.2, add a REG_DWORD value named SchUseStrongCrypto with data 1 under the following two registry keys.
HKEY_LOCAL_MACHINE\SOFTWARE\WOW6432Node\Microsoft\.NETFramework\v4.0.30319
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\.NETFramework\v4.0.30319

So, sometimes an error message isn't the real error message but you have to clear away all the garbage between you and the source of the error to figure out what it really is.

Extracting queries from SSIS packages

2021-12-22T12:00:00.189-06:00

We received an abomination of an SSIS package from a third party. It was a way to write a package that I don't think I would have suggested.

Our job was to rewrite this into something more manageable and it appears Azure Data Factory will be the winner. Before we can do that, we need to document what the existing package is doing (the vendor has supplied the incremental load logic) so we can replicate it but in a more economical form. It appears to have the pattern (squint really hard at the picture) Execute SQL Task -> Execute SQL Task -> Sequence container => many data flows -> Data Flow -> Execute SQL Task. The Data Flow Task is named after the table being loaded. Each one has an ODBC source with an expression-based query, named "ODBC Source 1", wired to an OLE DB Destination, named "OLE DB Destination". How would you do it, especially given that there are 236 Data Flow Tasks embedded in a single container?

Biml it!

As with so many things SSIS-related, Biml is the answer. Install BimlExpress and reverse engineer that dtsx package into Biml. I'll add a blank BimlScript file that I called Inspector.biml

Let's look at a sample DataFlow task

            <Dataflow Name="ATableName">
              <Expressions>
                <Expression ExternalProperty="[ODBC Source 1].[SqlCommand]">"SELECT * FROM dbo.ATableName where modifiedutc > '" +(DT_WSTR, 30)@[User::LastModified] + "'  AND modifiedutc <= '" + (DT_WSTR, 30)@[User::MostRecent] + "'"</Expression>
              </Expressions>
              <Transformations>
                <OdbcSource Name="ODBC Source 1" Connection="Source2">
                  <DirectInput>SELECT * FROM dbo.ATableName where modifiedutc > '0'  AND modifiedutc <= '0'</DirectInput>
                </OdbcSource>
                <DataConversion Name="Data Conversion">
                  <Columns>
                    <Column SourceColumn="id" TargetColumn="Copy of id" DataType="AnsiString" Length="255" CodePage="1252" />
                  </Columns>
                </DataConversion>
                <OleDbCommand Name="Delete Old Rows from ATableName" ConnectionName="Destination2">
                  <Parameters>
                    <Parameter SourceColumn="ye_id" TargetColumn="Param_0" DataType="AnsiStringFixedLength" Length="255" CodePage="1252" />
                    <Parameter SourceColumn="modifiedutc" TargetColumn="Param_1" DataType="Int64" />
                  </Parameters>
                  <DirectInput>delete from dbo.ATableName where ye_id = ? and modifiedutc  < ?</DirectInput>
                </OleDbCommand>
                <ConditionalSplit Name="Conditional Split">
                  <OutputPaths>
                    <OutputPath Name="Case 1">
                      <Expression>ISNULL(ye_id)</Expression>
                    </OutputPath>
                  </OutputPaths>
                </ConditionalSplit>
                <OleDbDestination Name="OLE DB Destination" ConnectionName="Destination2">
                  <ExternalTableOutput Table="&quot;dbob&quot;.&quot;ATableName&quot;" />
                </OleDbDestination>
              </Transformations>
            </Dataflow>

All I want to do is find all the Data Flow Tasks in the sequence containers. I need to generate a key value pair of TableName and the source query. I could dive into the Transformations layer and find the ODBC source and extract the DirectInput node from the OdbcSource and then parse the table name from the OleDbDestination's ExternalTableOutput but look, I can "cheat" here. Everything I need is at the outer Data Flow Task level. The Name of the DataFlow is my table name and since it's ODBC and the source component doesn't support a direct Expression on it, it's defined at the Data Flow Level. That makes this Biml easy.

  
<#

// A dictionary of TableName and the Expression
Dictionary<string, string> incremental = new Dictionary<string, string>();

foreach (AstPackageNode p in this.RootNode.Packages)
{
    // Loop through the Sequence Container
    foreach (var c in p.Tasks.OfType<AstContainerTaskNode>())
    {
        foreach (AstDataflowTaskNode t in c.Tasks.OfType<AstDataflowTaskNode>())
        {
            if (p.Name == "Postgres_to_MSSQL_Incremental")
            {
                incremental[t.Name] = t.Expressions[0].Expression;
            }
        }
    }
}

WriteLine("<!--");
foreach (KeyValuePair<string, string> k in incremental)
{
    // WriteLine("<!-- {0}: {1} -->", k.Key, k.Value);
    WriteLine("{0}|{1}", k.Key, k.Value.Replace("\n", " "));
}
WriteLine("-->");

#>
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
</Biml>

I define a dictionary that will hold the Table and the associated Expression. The first foreach loop specifies that I want to enumerate through all the packages that have been reverse engineered. If you're using BimlExpress, you'll need to shift click on all the source Biml files as well as the Inspector package.

The next foreach enumerator looks at all the tasks in the package's Control Flow and we're going to filter it to just things that are of type AstContainerTaskNode, aka a Container. That "OfType" filter syntax is very handy to focus on only the type of item you're looking for. LINQ is so powerful, I love it.

The innermost foreach enumerator uses the OfType again to filter tasks to only those that are DataFlows, AstDataflowTaskNode. The if statement ensures I'm only working on the Incremental package (they supplied an initial load as well). Finally, I add the Task's Name to the key of the dictionary and the value becomes the first Expression. Again, I can cheat here because the vendor package was very consistent, which is amazing for an SSIS package that's been hand crafted. That source package was 40.5 MB and had been saved 422 times according to the VersionBuild number. Kudos to them for quality control.

Once the looping is complete, all that is left is to emit the information so I can use it elsewhere. Thus the final foreach loop. I'm working in BimlStudio so comments are emitted and I'm going to take advantage of that by simply writing the key/value pair with a pipe delimiter and then copy/paste the output into a CSV. If you're working in BimlExpress, I'd just write directly to a file with System.IO.File.WriteAllLines but this was just a "quick and dirty get it done" task and corresponding blog post to show that Biml and metadata programming are still relevant.

Eagle eyed viewers will note that I am missing the single DataFlow task after the Container. My partner also noticed it, so if you need to also look for any data flow tasks at the Package level, I added this loop after the Sequence Container loop.

    foreach (AstDataflowTaskNode t in p.Tasks.OfType<AstDataflowTaskNode>())
    {
        if (p.Name == "Postgres_to_MSSQL_Incremental")
        {
            incremental[t.Name] = t.Expressions[0].Expression;
        }
    }
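The copy/paste step from the emitted comment into a CSV can also be scripted. Here's a rough sketch, assuming you have saved the emitted lines (the TableName|Expression pairs plus the comment markers) somewhere you can read them. The function name pipe_lines_to_csv and the column headers are made up for illustration.

```python
import csv
import io


def pipe_lines_to_csv(lines):
    """Turn 'TableName|Expression' lines into CSV text with a header row.

    pipe_lines_to_csv is a hypothetical helper, it is not part of Biml.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["TableName", "Expression"])
    for line in lines:
        line = line.strip()
        # Skip blanks and the XML comment markers the script wraps around the data
        if not line or line in ("<!--", "-->"):
            continue
        # Split on the first pipe only, in case the query itself contains one
        table_name, expression = line.split("|", 1)
        writer.writerow([table_name, expression])
    return out.getvalue()
```

Letting the csv module do the writing means the double quotes inside the SSIS expressions are escaped properly, which is the part that tends to go wrong with a manual paste.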

Wrapup

I'm a slow/infrequent blogger and this post took me 50 minutes. I think after we were done scratching our heads at the source packages, it took less time to write the script and generate the list of tables and associated queries than this blog post took.

ETL pattern for API source

2021-11-17T08:00:00.003-06:00

ETL pattern for API source

The direction for software as a service providers is to provide APIs to access their data instead of structured file exports. Which is a pity, as every SaaS system requires a bespoke data extract solution. I inherited a solution that had an adverse pattern I'd like to talk about.

The solution pulls data from advertising and social media sites (Google Analytics, Twitter, Facebook, etc) and does processing to make it ready for reporting. The approach here works, but there are some challenges that you can run into.

Metering - Providers generally restrict you to only consuming so much over time (where so much and time are highly dependent on the source). Google Analytics, depending on product, rejects your connections after so many calls. Twitter, also depending on their maddening, inconsistent set of APIs (v1 vs v2), endpoints, and product (free standard, paid for premium or enterprise), will throttle you based on consumption.
Data availability - You have no idea whether the data you pulled today will be available tomorrow. We had pulled 5 years of data out of Google Analytics that contained a variety of dimensions, two of which were ga:userAgeBracket and ga:userGender. In talking to our client, they wanted just one more data element added to the mix. We made the change and boom goes the dynamite: "Some data in this report may have been removed when a threshold was applied." That error message means that you're requesting a level of granularity that could de-anonymize users. Ok, fine, we rolled back the change but no, that's no longer a valid combination, ever! And we ran into a situation where some of the data just wasn't available pre-2020. Yes, a month earlier the same code had pulled 6 years' worth of data but no more.
Oops - Something happened when we created the data for reporting (data merge introduced duplicates, client wanted a different format, etc) and now we need to do it again, except instead of the month allocated, we have a week to fix all this up. Which bumps into the Metering and Data Availability points. Ouch town, population you.

Preferred pattern

What I inherited wasn't bad, it just hadn't taken those possible pain points into consideration. In a classic data warehouse, you have a raw zone with immutable source sitting somewhere on cheap storage. The same lesson applies here. When you pull from an API, land that data to disk in some self-defining format: JSON/XML/CSV, I don't care.

Write your process so that it is able to consume that source file and get the exact same results as the source data pull.

  
import json
import os


def get_data(source_date):
  """Get the data for a given date.
  :param source_date: An ISO 8601 formatted date aka yyyy-MM-dd
  :return: A dictionary of data

  """
  source_file = '/dbfs/mnt/datalake/raw/entity/data_provider_{0}.json'.format(source_date)
  raw_data = {}
  if os.path.exists(source_file):
    # Reuse the results of a previous pull
    with open(source_file, 'r', encoding='utf-8') as f:
      raw_data = json.load(f)
  else:
    # analytics is assumed to be an authenticated Google Analytics
    # reporting service object that was created elsewhere
    raw_data = analytics.reports().batchGet(body='json-goes-here').execute()
    # Land the raw response on disk before anything downstream touches it
    with open(source_file, 'w', encoding='utf-8') as f:
      json.dump(raw_data, f, ensure_ascii=False)

  return raw_data

This simple method is responsible for getting my data by date. If the file exists in my file system, then I will reuse the results of a previous run to satisfy the data request. Otherwise, we make a call to the API and before we finish, we write the results to disk so that we can be ready in case Oops happens downstream of the call.

Using this approach, we were able to reprocess a year's worth of cached data in about 10 minutes compared to about 4.5 hours of data trickling out of the source API.
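To make the pattern easy to try without a Google account, here's a self-contained sketch of the same cache-or-fetch idea. It is not the code we ran: the names get_cached, fetch and cache_dir are assumptions for illustration, and fetch stands in for whatever API call your provider requires.

```python
import json
import os


def get_cached(source_date, fetch, cache_dir):
    """Return the data for a date, calling fetch only when no raw file exists.

    fetch is any callable that takes the date and returns JSON-serializable
    data. It is a stand-in for the real API call, which is not shown here.
    """
    source_file = os.path.join(
        cache_dir, "data_provider_{0}.json".format(source_date)
    )
    if os.path.exists(source_file):
        # Replay the raw response from an earlier run
        with open(source_file, "r", encoding="utf-8") as f:
            return json.load(f)

    raw_data = fetch(source_date)
    # Land the raw response before any downstream processing happens
    with open(source_file, "w", encoding="utf-8") as f:
        json.dump(raw_data, f, ensure_ascii=False)
    return raw_data
```

Calling get_cached twice for the same date only hits fetch the first time. Passing the fetch function in as a parameter also makes the reprocessing path testable without burning any of your metered API calls.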

Including a local python module

2021-02-09T08:00:00.020-06:00

Including a local python module

As we saw in reusing your python code, you can create a python file, a module, that contains our core business logic and then re-use that file. This post is going to talk about how to make that happen.

What happens when you import antigravity? The python interpreter is going to check all the places it knows to find modules. It's going to check the install location for a module, it's going to check if you defined PYTHONHOME/PYTHONPATH environment variables, and you can hint to your heart's desire where to find files. If I import a real module, import pprint, I can access the __file__ property which will tell me where it found the module. In my case it was C:\Python38\lib\pprint.py

Python is happy to tell you what the current search path is

import sys
print(sys.path)

On my machine, that displays an array of ['', 'C:\\Python38\\python38.zip', 'C:\\Python38\\DLLs', 'C:\\Python38\\lib', 'C:\\Python38', 'C:\\Python38\\lib\\site-packages']. If I want reusable_code.py to be callable by another module, then it needs to exist in one of those locations. That first entry of a blank path is the current directory so as long as the module I need is in the same folder, we're golden! I have added i_use_resuable.py to the same folder as the above

# This module lives in the same folder as our reusable_code.py file
import reusable_code


def main():
    c = reusable_code.Configuration()
    print(c.get_modify_date())


if __name__ == '__main__':
    main()

Executing that, I get the expected timestamp - the exact same experience as running our corporate file but now I can focus on using the business logic instead of writing it. We're going to need something more though if we're going to get this reusable code into our pyspark cluster.
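If the module lives somewhere other than the current folder, one way to "hint" at its location is to add that folder to sys.path before importing. Here's a minimal sketch. The helper name import_from_folder is made up for illustration, and mutating sys.path like this is a stopgap rather than a substitute for packaging.

```python
import importlib
import sys


def import_from_folder(module_name, folder):
    """Import a module by name after adding its folder to the search path.

    import_from_folder is a hypothetical helper for illustration only.
    """
    if folder not in sys.path:
        # Put the folder first so it wins over same-named modules elsewhere
        sys.path.insert(0, folder)
    # Make sure the import system notices files created after startup
    importlib.invalidate_caches()
    return importlib.import_module(module_name)
```

Something like import_from_folder("reusable_code", r"C:\some\other\folder") would then behave the same as the plain import above, and the returned module's __file__ attribute confirms where it was found. The folder path in that call is just a placeholder.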

In the next post, we'll learn how to package our business module up into something we can install instead of just assuming where the file is.

As always, the code is on my GitHub repo

Reusing your python code

2021-02-08T08:00:00.080-06:00

Reusing your python code

I learned python in 2003 and used it for all the ETL work I was doing. It was beautiful and I would happily wax poetic to any programmer friends about the language and how they should be learning it. It turns out, my advocacy was just 15+ years too early. I recently had a client reach out to engage me to work on their Databricks project. No, gentle reader, I don't know much of anything about Databricks. But I do know about working with data, python programming (for which I was already updating my mental model to 3.0) and pandas. Yes, pandas is not what we do in Databricks but the concepts are similar.

One of the early observations is that they had dozens of notebooks with copy and paste code across them. Copy and paste code in a metadata driven solution isn't an evil but when you're hand crafting boilerplate code artifacts, you're going to sneak a code mutation in there. So, let's look at how we can avoid this with code re-use.

Let's assume we have an important business process that needs to be consistent across our infrastructure. In this case, it's a modification date which is used as part of our partition strategy. This code nugget is spread across all those notebooks: datetime.now().strftime("%Y-%m-%dT%H:%M:%SZ"). When processing starts up, we set a timestamp so that all activities accrue under that same timestamp. It's a common pattern across data processing. How could we do this better?

In classic python programming, we would abstract that logic away in a reusable library. In this example, I have created a module (file) named reusable_code.py. In it, I created a class named Configuration and it exposes a method get_modify_date

# reusable_code.py is a python module that simulates our desire to
# consolidate our corporate business logic into a re-usable entity

from datetime import datetime

class Configuration():
    """A very important class that provides a standardized approach for our company"""
    def __init__(self):
        # The modify date drives very important business functionality
        # so let's be consistent in how it is defined (ISO 8601)
        # 2021-02-07T17:58:20Z
        self.__modify_date__ = datetime.now().strftime("%Y-%m-%dT%H:%M:%SZ")

    def get_modify_date(self):
        """Provide a standard interface for accessing the modify date"""
        return self.__modify_date__


def main():
    c = Configuration()
    print(c.get_modify_date())

if __name__ == "__main__":
    main()

Usage is simple: I create an instance of my class, c, which causes the constructor/initializer to fire and set the modify date for the life of that object. Calling the get_modify_date method results in an ISO 8601 date being emitted

2021-02-07T17:58:20Z
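One wrinkle worth knowing: the trailing Z in ISO 8601 denotes UTC, while datetime.now() returns local time. If your clusters and laptops sit in different time zones, a variant that asks for UTC explicitly might be safer. This is a sketch under that assumption, not the class from the repository. The name UtcConfiguration and the optional now parameter are mine, added so the value can be pinned in a test.

```python
from datetime import datetime, timezone


class UtcConfiguration:
    """Hypothetical variant of Configuration that always stamps in UTC."""

    def __init__(self, now=None):
        # Allow a fixed moment to be injected, otherwise take the current UTC time.
        # A naive datetime passed in here is treated as local time by astimezone.
        moment = now if now is not None else datetime.now(timezone.utc)
        self._modify_date = moment.astimezone(timezone.utc).strftime(
            "%Y-%m-%dT%H:%M:%SZ"
        )

    def get_modify_date(self):
        """Provide a standard interface for accessing the modify date"""
        return self._modify_date
```

The interface is unchanged, so callers that only use get_modify_date would not notice the difference beyond the timestamps lining up across machines.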

At this point, I hope you have an understanding of how we can make a reusable widget. Think about your business processes that you need to encapsulate into reusable components and tomorrow we'll review using existing python modules in new files. After that, we'll cover converting this module into a wheel. And then we'll walk through installing it to a Databricks cluster and using it from a notebook. Sound good?

All of this code is available on my github repository -> 2021-02-08_PythonReusableCode

Making a delimited list

2020-02-20T12:47:00.001-06:00

Making a delimited list

There are various ways to concatenate values together. A common approach I see is that people will add a delimiter and then the value and loop until they finish. Then they take away the first delimiter. Prepending a delimiter is generally easier to code than appending one and skipping it for the final element of a list, or adding it and then removing the final delimiter from the resulting string.

Gentle reader, there is a better way. And has been for quite some time but if you weren't looking for it, you might not know it exists. In .NET, it's String.Join

Look at the following code, what would you rather write? The Avoid this block or simply use the libraries?

using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;
using System.Text;
                    
public class Program
{
    public static void Main()
    {
        // generate a list of numbers
        List<int> data = (Enumerable.Range(0, 10)).ToList();

        string delimiter = ",";
        // Avoid this, unless you know you need to do it for a specific reason
        {
            StringBuilder sb = new StringBuilder();
            foreach(var i in data)
            {
                sb.Append(delimiter);
                sb.Append(i);
            }
            
            // Convert the string builder to a string and then strip the first
            // character out of it
            string final = sb.ToString().Substring(delimiter.Length);
            
            Console.WriteLine(final);
        }
        
        // Make a comma delimited list
        Console.WriteLine(String.Join(delimiter, data));
        
        // What if we want to do something, like put each element in an XML tag?
        Console.WriteLine(String.Join(string.Empty, data.Select(x => string.Format("{0}", x)).ToList()));

    }
}

Output of running the above

0,1,2,3,4,5,6,7,8,9
0,1,2,3,4,5,6,7,8,9
0123456789

Gist for .net

But Bill, I use Python. The syntax changes but the concept remains the same. delimiter.join(data). Whatever language you use, there's probably an equivalent method. Look for it and use it. Don't write your own implementation.
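For the Python folks, here's a small sketch of the same idea. The one gotcha compared to String.Join is that str.join only accepts strings, so numbers need converting first. The function name make_delimited_list is just for illustration.

```python
def make_delimited_list(values, delimiter=","):
    """Join any iterable of values into one delimited string."""
    # str.join raises TypeError on non-strings, so convert each element first
    return delimiter.join(str(value) for value in values)
```

make_delimited_list(range(10)) gives the same 0,1,2,3,4,5,6,7,8,9 as the .NET version, with no leading or trailing delimiter to clean up.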

Was there a better way to have done this? Let me know.

Generating characters in TSQL

2019-11-19T14:00:00.001-06:00

Generating characters in TSQL

I had to do a thing^* and it involved generating "codes" as numbers were too hard for people. So, if you have need to convert an arbitrary number into characters, this is your lucky day/post.

Background

As ~~I get longer in the tooth~~ programming becomes more accessible, I find that people might not have been exposed to the underpinnings of how things used to work. Strings were just a bunch of characters put together and a character was a subset of the Latin alphabet shoved into 128 characters (0 to 127). The characters below 32 were referred to as the non-printable characters or control characters. Things from 32 on up are what you see on a US keyboard. There was a time, if you bought a programming book, it would have an ASCII table somewhere in the reference. Capital A is character 65, Capital Z is character 90 (65/A + 25 characters later). In TSQL, the CHAR function takes a number and gives you the ASCII character for the value so SELECT CHAR(66) AS B; will generate a capital B.

The mod or modulus function will return the remainder after division. Modding a value is a handy way to constrain a value between 0 and an upper threshold. In this case, if I modded any number by 26 (because there are 26 characters in the English alphabet), I'll get 0 to 25 as my result.

Knowing that the modulus function will give me 0 to 25 and knowing that my target character range starts at 65, I could use the previous expression to turn any number into a capital letter like SELECT CHAR((2147483625 % 26) + 65) AS StillB;. Breaking that apart, we do the modulus, %, which gives us the value of 1, which we then add to the starting offset (65) to get 66, a capital B.
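If you want to check that arithmetic outside of SQL Server, here's the same calculation sketched in Python, where chr plays the role of CHAR. The function and constant names are made up for illustration.

```python
ALPHABET_SIZE = 26      # letters in the English alphabet
ASCII_CAPITAL_A = 65    # ordinal position of a capital A


def number_to_letter(number):
    """Map any non-negative integer onto a capital letter A to Z."""
    # The modulus constrains the value to 0..25, the offset shifts it to 65..90
    return chr(number % ALPHABET_SIZE + ASCII_CAPITAL_A)
```

number_to_letter(2147483625) works out to chr(1 + 65), which is a capital B, matching the StillB query above.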

Rolling all that together, here's a quick little tester to see what we can then do with it.

SELECT
    D.rn
,   ASCII_ORD.ord_value
,   ASCII_ORD.replicate_count
    -- CHAR converts a number to a character
,   CHAR(ASCII_ORD.ord_value) AS ord_value_as_character
    -- REPLICATE repeats a string N times
,   REPLICATE(CHAR(ASCII_ORD.ord_value), ASCII_ORD.replicate_count) AS RepeatedCharacter
    -- CONCAT is a null- and type-safe approach for string building (requires 2012+)
,   CONCAT(CHAR(ASCII_ORD.ord_value), ASCII_ORD.replicate_count) AS ConcatenatedCharacter
FROM
(
    -- Generate 0 to N-1 rows
    SELECT TOP (300)
        ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) -1
    FROM
        sys.all_columns AS AC
)D(rn)
CROSS APPLY
(
    -- There are 26 characters in the English alphabet
    -- 65 is the ASCII ordinal position of a capital A
    SELECT
        D.rn % 26 + 65
    ,   D.rn / 26 + 1
) ASCII_ORD(ord_value, replicate_count)
ORDER BY
    D.rn
;

Ultimately, it was decided that using a combination of character and digits (ConcatenatedCharacter) might be more user friendly than purely a repeated character approach. Neither of which will help you when you're in the 2 billion range like our sample input of 2147483625
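To see the two code styles side by side without a database handy, here's a sketch of the CROSS APPLY arithmetic in Python. The function name make_codes is an assumption, and integer division (//) stands in for TSQL's division of two integers.

```python
def make_codes(rn):
    """Return (RepeatedCharacter, ConcatenatedCharacter) for a row number."""
    ord_value = rn % 26 + 65         # 65..90, i.e. A..Z
    replicate_count = rn // 26 + 1   # goes up by one every 26 rows
    character = chr(ord_value)
    repeated = character * replicate_count
    concatenated = "{0}{1}".format(character, replicate_count)
    return repeated, concatenated
```

Row 0 yields A and A1, row 27 yields BB and B2. The repeated form grows by one character every 26 values, which is why neither style stays friendly once the inputs get large.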

Key takeaways

Don't confuse the CHAR function with the char data type. Similar but different

That's why books always had ASCII tables in them

Modulus function can generate a bounded set of numbers

Older developers might know some weird tricks/trivia

Even older developers will scoff at memorized ASCII tables in favor of EBCDIC tables

Using Newtonsoft.Json with Biml

2019-09-05T10:41:00.000-05:00

Using Newtonsoft.Json with Biml

Twitter provided an opportunity for a quick blog post

#sqlhelp #biml I would have the metadata in a Json structure. How would you parse the json in the C# BIML Script? I was thinking use Newtonsoft.Json but I don't know how to add the reference to it

Adding external assemblies is a snap but here I'll show how to use the Newtonsoft Json library to parse a Json based metadata structure and then use that in our Biml.

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
<#
// Given the following structure

///{
///  "packages": [
///    "p1",
///    "p2",
///    "p3"
///  ]
///}

// Assume the json file is located as specified
string sourceFile = @"C:\ssisdata\trivial.json";

// Read the data into a string variable
string json = System.IO.File.ReadAllText(sourceFile);
 
// Deserialize the json into a dictionary of strings (packages) and a list of strings (p1, p2, p3)
Dictionary<string, List<string>> metadata = JsonConvert.DeserializeObject<Dictionary<string, List<string>>>(json);
#>
<Packages>
<#
// Shred the dictionary for our values
foreach (string item in metadata["packages"])
{
    //WriteLine(String.Format("<!-- {0} -->", item));
#>
    <Package Name="<#=item #>" />
<#
}
#> 
</Packages>
</Biml>

<#@ import namespace="Newtonsoft.Json" #>
<#* Assuming we have GAC'ed the assembly *#>
<#@ assembly name= "Newtonsoft.Json.dll" #>

The gist is also posted in case I mangled the hand crafted html entities above.

Also, not covered is GAC'ing the assembly but you can use an explicit path to your DLL name="C:\where\did\I\put\this\Newtonsoft.Json.dll"
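If it helps to see the shape of the metadata without BimlScript in the way, here's the same "shred the dictionary" step sketched in Python using only the standard library. The function name packages_from_json is an assumption for illustration. It produces the Package elements as plain strings rather than real Biml objects.

```python
import json
from xml.sax.saxutils import quoteattr


def packages_from_json(json_text):
    """Return one <Package /> element string per name in the packages list."""
    # The structure is a dictionary with a single key, packages, holding a list
    metadata = json.loads(json_text)
    # quoteattr wraps the value in quotes and escapes any XML-special characters
    return [
        "<Package Name={0} />".format(quoteattr(name))
        for name in metadata["packages"]
    ]
```

Feeding it the trivial.json content from the comment block yields three elements, one each for p1, p2 and p3, which mirrors what the foreach loop emits inside the Packages collection.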

Biml in Azure aka CallBimlScriptContent

2019-02-27T08:00:00.000-06:00

CallBimlScriptContent was introduced with the migration from Mist to BimlStudio. Why is this cool? You do not have to use files sitting on your computer as the source for your Biml. As long as you can reconstitute the Biml contents into a string, you can store your scripts wherever you'd like. If you want them in a database, that's great. Store them in the cloud? Knock yourself out.

As a consultant, the latter is rather compelling. Maybe I'm only licensing my clients to use accelerators during our engagement. If I leave files on the file system after I roll off, or they image my computer and accidentally collect them, I am David fighting Goliath. CallBimlScriptContent is a means to protect myself and my IP. Let's look at a trivial example. I set a C# string with an empty Package tag (hooray for doubling up my double quotes). Within my Packages collection, I invoke CallBimlScriptContent passing in my Biml content.

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
<#
// Do something here to populate myBimlFile
string myBimlFile = @"<Package Name=""ABC"" />";
#>    
    <Packages>
        <#=CallBimlScriptContent(myBimlFile)#>
    </Packages>
</Biml>

The rendered Biml for the above would look like

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Packages>
    <Package Name="ABC" />
  </Packages>
</Biml>

It's super that it works, but that's not convenient. Like, at all! Plus, good luck trying to embed any complexity in that string.

So, let's try something a little more complex. Conceptually, imagine we have two Biml Scripts we might choose to call inc_Package_00.biml and inc_Package_10.biml

inc_Package_00.biml

<Package Name="ABC" />

inc_Package_10.biml

<#@ property name="packageName" type="string" #>
<Package Name="<#=packageName #>" />

Our original code could then look like

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
<#
// Do something here to populate myBimlFile
string myBimlFile =System.IO.File.ReadAllText(@"C:\tmp\inc_Package_00.biml");
#>    
    <Packages>
        <#=CallBimlScriptContent(myBimlFile, "Package_00")#>
    </Packages>
</Biml>

Do you need to pass parameters? It's no different than what you're used to doing

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
<#
// Do something here to populate myBimlFile
string myBimlFile =System.IO.File.ReadAllText(@"C:\tmp\inc_Package_10.biml");
#>    
    <Packages>
        <#=CallBimlScriptContent(myBimlFile, "Package_10")#>
    </Packages>
</Biml>

In the next post, I'll show you how to use reference data stored in tables or Azure as your BimlScript Content. Stay tuned!

My github repository

2018-12-20T13:06:00.001-06:00

In preparation for my talk at the Kansas City SQL Server User Group this afternoon, I am putting this post here so people can get the materials easily.

Lightning Talks

SQL Server Agent Job Sort Order

2018-10-26T08:00:00.000-05:00

SQL Server Agent Job Sort Order

Today's post could also be titled "I have no idea what is happening here." We have an agent job, "Job - Do Stuff". We then created a few hundred jobs (templates for the win) all named like "Job - Do XYZ" where XYZ is a mainframe module identifier. When I'm scrolling through the list of jobs, it takes a few passes for my eye to find Do Stuff between DASD and DURR. I didn't want to change the leading portion of my job but I wanted my job to be sorted first. I open an ASCII table and find a useful character that sorts before the dash. Ah, asterisk, ASCII 42 comes before dash, ASCII 45.

Well, that was unexpected. In my reproduction here, the job names will take the form of the literal string "JOB " (trailing space there). I then use a single ASCII character as separator. I use another string literal "CHAR(" and then I display the ASCII ordinal value and for completeness, I close the parenthesis. Thus, JOB * CHAR(42) and JOB - CHAR(45). Sorting ascending alphabetically, which I assumed under the covers would compare each character by its ASCII value, should put JOB * CHAR(42) on top.
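That expectation is plain ordinal comparison, and it is easy to demonstrate outside of SQL Server. Python compares strings code point by code point, so this little sketch shows the order I had in my head. The function name ordinal_sort is just a label for illustration.

```python
def ordinal_sort(names):
    """Sort strings by comparing characters on their code point values."""
    # Python's default str ordering is ordinal, no collation rules are applied
    return sorted(names)
```

With the two job names from above, the asterisk (42) lands ahead of the dash (45), so JOB * CHAR(42) comes back first.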

That ain't the way it's being sorted in SSMS though. Let's figure out "is this an application issue or a database issue?" Jobs are stored in the database msdb in a table called sysjobs in the dbo schema. Let's start there.

SELECT
    S.name AS JobName
FROM
    msdb.dbo.sysjobs AS S
WHERE
    S.name LIKE 'JOB%'
ORDER BY
    S.name;

Huh.

Ok, so what goes into sorting? Collations


SELECT
    S.name AS SchemaName
,   T.name AS TableName
,   C.name AS ColumnName
,   T2.name AS DataTypeName
,   C.collation_name AS ColumnCollationName
,   T2.collation_name AS TypeCollationName
FROM
    msdb.sys.schemas AS S
    INNER JOIN
        msdb.sys.tables AS T
        ON T.schema_id = S.schema_id
    INNER JOIN
        msdb.sys.columns AS C
        ON C.object_id = T.object_id
    INNER JOIN
        msdb.sys.types AS T2
        ON T2.user_type_id = C.user_type_id
WHERE
    S.name = 'dbo'
    AND T.name = 'sysjobs'
    AND C.name = 'name';

The name column for dbo.sysjobs is of data type sysname which uses the collation of "SQL_Latin1_General_CP1_CI_AS". If it's the collation causing the "weird" sort, then we should be able to reproduce it, right?

SELECT *
FROM
(
    VALUES
        ('JOB - CHAR(45)' COLLATE SQL_Latin1_General_CP1_CI_AS)
    ,   ('JOB * CHAR(42)' COLLATE SQL_Latin1_General_CP1_CI_AS)
) D(jobName)
ORDER BY
    D.jobName COLLATE SQL_Latin1_General_CP1_CI_AS;

Nope, not the collation since this returns in the expected sort order.

At this point, I wasted a lot of time going down rabbit holes, because my reproduction was not verbatim. I neglected to prefix my strings with an N, thus leaving them as ASCII strings, not Unicode strings.

SELECT *
FROM
(
    VALUES
        (N'JOB - CHAR(45)' COLLATE SQL_Latin1_General_CP1_CI_AS)
    ,   (N'JOB * CHAR(42)' COLLATE SQL_Latin1_General_CP1_CI_AS)
) D(jobName)
ORDER BY
    D.jobName COLLATE SQL_Latin1_General_CP1_CI_AS;

Running that, we get the same sort from sysjobs. At this point, I remember something about unicode sorting being different than old school dictionary sort like I was expecting. And after finding this answer on collations I'm happy simply setting my quest aside and stepping away from the keyboard.

Oh, but if you want to see what the glorious sort order is for characters in the printable range (32 to 127), my script is below. Technically, 127 is a cheat since it's the DELETE but I include it because of where it sorts.

Make the jobs

This script has two templates in it: @MischiefManaged deletes a job and @Template creates a job. I query against sys.all_columns to get a sequential set of 96 numbers, offset so they run from 32 to 127. I use that number and the CONCAT function (requires 2012+) plus the CHAR function to translate the number into the corresponding ASCII character. It will print out "JOB ' CHAR(39)" once complete because I'm lazy: I didn't escape the single quote, so that one statement fails and the CATCH block prints the job name.

DECLARE
    @Template nvarchar(max) = N'
use msdb;
IF EXISTS (SELECT * FROM dbo.sysjobs AS S WHERE S.name = ''<JobName/>'')
BEGIN
    EXECUTE dbo.sp_delete_job @job_name = ''<JobName/>'';
END
EXECUTE dbo.sp_add_job
    @job_name = N''<JobName/>''
,   @enabled = 1
,   @notify_level_eventlog = 0
,   @notify_level_email = 2
,   @notify_level_page = 2
,   @delete_level = 0
,   @category_name = N''[Uncategorized (Local)]'';

EXECUTE dbo.sp_add_jobserver
    @job_name = N''<JobName/>''
,   @server_name = @@SERVERNAME;

EXEC dbo.sp_add_jobstep
    @job_name = N''<JobName/>''
,   @step_name = N''MinimumViableJob''
,   @step_id = 1
,   @cmdexec_success_code = 0
,   @on_success_action = 2
,   @on_fail_action = 2
,   @retry_attempts = 0
,   @retry_interval = 0
,   @os_run_priority = 0
,   @subsystem = N''TSQL''
,   @command = N''SELECT 1''
,   @database_name = N''msdb''
,   @flags = 0;

EXEC dbo.sp_update_job
    @job_name = N''<JobName/>''
,   @start_step_id = 1;
'
,   @MischiefManaged nvarchar(4000) = N'
use msdb;
IF EXISTS (SELECT * FROM dbo.sysjobs AS S WHERE S.name = ''<JobName/>'')
BEGIN
    EXECUTE dbo.sp_delete_job @job_name = ''<JobName/>'';
END'
,   @Token sysname = '<JobName/>'
,   @JobName sysname
,   @Query nvarchar(max);

DECLARE
    CSR CURSOR
FAST_FORWARD
FOR
SELECT
    J.jobName
FROM
(
    SELECT TOP (127-31)
        31 + (ROW_NUMBER() OVER (ORDER BY (SELECT NULL))) AS rn
    FROM sys.all_columns AS AC
) D(rn)
    CROSS APPLY
    (
        SELECT
            CONCAT('JOB ', CHAR(D.rn), ' CHAR(', D.rn, ')')
    )J(jobName)

OPEN CSR;
FETCH NEXT FROM CSR INTO @JobName;

WHILE @@FETCH_STATUS = 0
BEGIN
    BEGIN TRY  
        SET @Query = REPLACE(@Template, @Token, @JobName);
        ---- Uncomment the following to clean up our jobs
        --SET @Query = REPLACE(@MischiefManaged, @Token, @JobName);
        EXECUTE sys.sp_executesql @Query, N'';
    END TRY
    BEGIN CATCH
        PRINT @JobName;
    END CATCH
    FETCH NEXT FROM CSR INTO @JobName;
END
CLOSE CSR;
DEALLOCATE CSR;

At this point, you can refresh the Jobs list in SSMS and the result is this job sort.

Once you're satisfied with how things look, uncomment this line SET @Query = REPLACE(@MischiefManaged, @Token, @JobName); and rerun the script. All will be cleaned up.
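If you'd like to preview the job names the cursor query generates without touching msdb, here's a sketch of the same CONCAT logic in Python. The function names are made up for illustration. The second one picks out the names containing a single quote, the character that would need doubling up before being dropped into the template.

```python
def make_job_names(first=32, last=127):
    """Build 'JOB <char> CHAR(<ordinal>)' for each ordinal in the range."""
    return [
        "JOB {0} CHAR({1})".format(chr(n), n)
        for n in range(first, last + 1)
    ]


def names_needing_escape(names):
    """Return the names that contain a single quote."""
    return [name for name in names if "'" in name]
```

That gives 96 names, and the only one flagged is JOB ' CHAR(39), the same one the script prints from its CATCH block.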

Let's just chalk sorting up there with timezones, ok? Sounds easy but isn't. If you know more than me, please explain away in the comments section and share your knowledge.

Polling in SQL Agent

2018-10-18T08:00:00.000-05:00

Polling in SQL Agent

A fun question over on StackOverflow asked about using SQL Agent with SSIS to poll for a file's existence. As the comments indicate, there's a non-zero startup time associated with SSIS (it must validate the metadata associated to the sources and destinations), but there is a faster, lighter weight alternative. Putting together a host of TSQL ingredients, including undocumented extended stored procedures, the following recipe could be used as a SQL Agent job step.

If you copy and paste the following query into your favorite instance of SQL Server, it will execute for one minute and it will complete by printing the words "Naughty, naughty".

SET NOCOUNT ON;
-- http://www.patrickkeisler.com/2012/11/how-to-use-xpdirtree-to-list-all-files.html
DECLARE
    -- Don't do stupid things like adding spaces into folder names
    @sourceFolder varchar(260) = 'C:\ssisdata\Input'
    -- Have to use SQL matching rules, not DOS/SSIS
,   @fileMask sysname = 'SourceData%.txt'
    -- how long to wait between polling
,   @SleepInSeconds int = 5
    -- Don't exceed 24 hours aka 86400 seconds
,   @MaxTimerDurationInSeconds int = (3600 * 0) + (60 * 1) + 0
    -- parameter for xp_dirtree 0 => all subfolder levels; 1 => top folder only
,   @depth int = 1
    -- parameter for xp_dirtree 0 => directory only; 1 => directory and files
,   @collectFile int = 1
,   @RC bigint = 0;

-- Create a table variable to capture the results of our directory command
DECLARE
    @DirectoryTree table
(
    id int IDENTITY(1, 1)
,   subdirectory nvarchar(512)
,   depth int
,   isFile bit
);

-- Use our sleep in seconds time to generate a delay time string
DECLARE
    -- Hours, minutes within the hour, seconds within the minute
    @delayTime char(10) = CONVERT(char(10), TIMEFROMPARTS(@SleepInSeconds / 3600, (@SleepInSeconds / 60) % 60, @SleepInSeconds % 60, 0, 0), 108)
,   @stopDateTime datetime2(0) = DATEADD(SECOND, @MaxTimerDurationInSeconds, CURRENT_TIMESTAMP);

-- Force creation of the folder
EXECUTE dbo.xp_create_subdir @sourceFolder;

-- Load the results of our directory
INSERT INTO
    @DirectoryTree
(
    subdirectory
,   depth
,   isFile
)
EXECUTE dbo.xp_dirtree
    @sourceFolder
,   @depth
,   @collectFile;

-- Prime the pump
SELECT
    @RC = COUNT_BIG(1)
FROM
    @DirectoryTree AS DT
WHERE
    DT.isFile = 1
    AND DT.subdirectory LIKE @fileMask;

WHILE @RC = 0 AND @stopDateTime > CURRENT_TIMESTAMP
BEGIN

    -- Load the results of our directory
    INSERT INTO
        @DirectoryTree
    (
        subdirectory
    ,   depth
    ,   isFile
    )
    EXECUTE dbo.xp_dirtree
        @sourceFolder
    ,   @depth
    ,   @collectFile;

    -- Test for file existence
    SELECT
        @RC = COUNT_BIG(1)
    FROM
        @DirectoryTree AS DT
    WHERE
        DT.isFile = 1
        AND DT.subdirectory LIKE @fileMask;

    IF @RC = 0
    BEGIN
        -- Put our process to sleep for a period of time
        WAITFOR DELAY @delayTime;
    END
END

-- at this point, we have either exited due to file found or time expired
IF @RC > 0
BEGIN
    -- Take action when file was found
    PRINT 'Go run SSIS or something';
END
ELSE
BEGIN
    -- Take action for file not delivered in expected timeframe
    PRINT 'Naughty, naughty';
END

Rerun the above query and, while it is polling, fire the following query in a separate window. Assuming you have xp_cmdshell enabled, it will create a file matching the expected pattern. This time, instead of "Naughty, naughty", the polling query will print out "Go run SSIS or something".

DECLARE
    @sourceFolder varchar(260) = 'C:\ssisdata\Input'
,   @fileMask sysname = REPLACE('SourceData%.txt', '%', CONVERT(char(10), CURRENT_TIMESTAMP, 120))
DECLARE
    @command varchar(1000) = 'echo > ' + @sourceFolder + '\' + @fileMask;

-- If you get this error
--Msg 15281, Level 16, State 1, Procedure sys.xp_cmdshell, Line 1 [Batch Start Line 0]
--SQL Server blocked access to procedure 'sys.xp_cmdshell' of component 'xp_cmdshell' because this component is turned off as part of the security configuration for this server. A system administrator can enable the use of 'xp_cmdshell' by using sp_configure. For more information about enabling 'xp_cmdshell', search for 'xp_cmdshell' in SQL Server Books Online.
--
-- Run this
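-- (xp_cmdshell is an advanced option, so expose those first)
--EXECUTE sys.sp_configure 'show advanced options', 1;
--RECONFIGURE;
--EXECUTE sys.sp_configure 'xp_cmdshell', 1;
--RECONFIGURE;
--GO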
EXECUTE sys.xp_cmdshell @command;

Once you're satisfied with how that works, now what? I'd likely set up a step 2 which is the actual running of the SSIS package (instead of printing a message). What about the condition that a file wasn't found? I'd likely use THROW/RAISERROR or just an old-fashioned divide by zero to force the first job step to fail, and then specify a reasonable number of @retry_attempts and @retry_interval on that step.
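
One detail of the polling script that is easy to check by hand is the hh:mm:ss string handed to WAITFOR DELAY. The following is only a rough Python sketch of that arithmetic, not part of the job step itself; the function name to_delay_string is made up for illustration. It shows why the minutes component has to wrap at the hour boundary once the sleep interval reaches an hour or more.

```python
def to_delay_string(total_seconds: int) -> str:
    """Format a second count as hh:mm:ss, the shape WAITFOR DELAY expects."""
    # The script's own comment caps the timer at 24 hours (86400 seconds)
    if not 0 <= total_seconds < 86400:
        raise ValueError("expected a value between 0 and 86399 seconds")
    hours = total_seconds // 3600
    minutes = (total_seconds // 60) % 60  # wrap minutes at the hour boundary
    seconds = total_seconds % 60
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


print(to_delay_string(5))     # the 5 second sleep used in the script
print(to_delay_string(3725))  # 1 hour, 2 minutes, 5 seconds
```

Without the wrap, a 3725 second sleep would ask for 62 minutes, which is not a valid time part.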

Biml Excel Data Source without Excel

2018-10-09T08:00:00.000-05:00

Biml Excel Meta Data Source without Excel

In the previous post, Reading Excel files without Excel, I showed some simple code to consume Excel without having Excel installed on your machine. How/Why would I use this - well look at the sample spreadsheet. That could be used quite nicely by a business analyst to generate SSIS packages. In fact, it is being used by a very savvy business analyst at one of my clients' shadow IT groups to identify the source data they'd like brought into their data mart. They are translating their mainframe data extracts into SQL equivalents and specifying where the data should land.

This is exciting for me as this team gets their data and knows the business problems they need to solve - they just didn't have all the tools to do so. They are supplying the data domain expertise and we are generating consistent packages that adhere to corporate standards (as well as defining the scheduling, alerting, etc). It's a good match.

My resources are quite simple: Excel Spreadsheet containing meta data, a driver program and a package template.

The template is your standard truncate and reload pattern with the target table being specified by a parameter. The client validates data by running processes in parallel, so the existing mainframe process delivers data to the Billing table while ours delivers to a Billing_NEW table. Once they accept the new process, the target table becomes Billing and the NEW table is dropped. I decided the most native SSIS route would be to specify the target table as a parameter. We originally had a boolean parameter indicating whether we were loading the new table or the production one, but that was more logic and overhead than just specifying which table to load. I force their queries to be dirty reads as some of these queries can be rather messy.

<#@ template designerbimlpath="/Biml/Packages" #>
<#@ property name="schemaName" type="string" #>
<#@ property name="tableName" type="string" #>
<#@ property name="parameterName" type="string" #>
<#@ property name="sourceQuery" type="string" #>
<#@ property name="sourceConnectionName" type="string" #>
<#@ property name="targetConnectionName" type="string" #>
<#@ property name="businessFriendlyName" type="string" #>
<#
string packageName = string.Format("{0}_Load_{1}{2}", targetConnectionName.ToUpper(), businessFriendlyName, "");
CustomOutput.PackageName = packageName;
#>
        <Package Name="<#= packageName #>" ConstraintMode="Linear">
            <Parameters>
                <Parameter Name="TargetTableName" DataType="String"><#= tableName #></Parameter>
            </Parameters>
            <Variables>
                <Variable Name="SchemaName" DataType="String"><#= schemaName#></Variable>
                <Variable Name="TableName" DataType="String" EvaluateAsExpression="true"><#= parameterName #></Variable>
                <Variable Name="QualifiedTableName" DataType="String" EvaluateAsExpression="true">&quot;[&quot; +   @[User::SchemaName] + &quot;].[&quot; + @[User::TableName]+ &quot;]&quot;</Variable>
                <Variable Name="QueryTruncate" DataType="String" EvaluateAsExpression="true">"TRUNCATE TABLE " + @[User::QualifiedTableName] + ";"</Variable>
            </Variables>
            <Tasks>
                <ExecuteSQL Name="SQL Truncate Target" ConnectionName="<#= targetConnectionName #>">
                    <VariableInput VariableName="User.QueryTruncate" />
                </ExecuteSQL>
                <Dataflow Name="DFT Load <#= businessFriendlyName #>">
                    <Transformations>
                        <OleDbSource ConnectionName="<#= sourceConnectionName #>" Name="OLESRC Query ">
                            <DirectInput><![CDATA[SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
<#= sourceQuery#>
]]>
                            </DirectInput>
                        </OleDbSource>
                        <OleDbDestination Name="OLEDST <#= schemaName #>_<#= tableName#>" ConnectionName="<#= targetConnectionName #>">
                            <TableFromVariableOutput VariableName="User.QualifiedTableName" />
                        </OleDbDestination>
                    </Transformations>
                </Dataflow>
            </Tasks>
        </Package>

My ProjectDriver.biml file is fairly straightforward. In line 1, I provide a relative path to my EPPlus.dll. The ..\ indicates I would find the assembly one folder up - two folders actually, since I have a single copy in my base Visual Studio folder. Line 2 specifies we need to bring in the OfficeOpenXml library. In line 5, I create a variable that will hold the metadata for my solution.

Line 6 is kind of interesting. I let the template determine what the package name should be based on the supplied metadata. Rather than having to perform that logic twice, it'd be nice to keep track of what packages have been created. Not only nice, it'll be required since we're using the Project Deployment Model! Line 19 is where we actually stamp out a specific package, and look at that second parameter: out customOutput. That is the mechanism for our template to send information back to the caller. In our case, we'll add the package name to our ever-growing list of packages. In line 28, we then run back through our list of packages and build out the project's definition.

And that's about it. We've already talked about the GetExcelDriverData method. The GetDriverData method provides a simple abstraction between where I actually get metadata and how the packages are built. You can see a commented out reference to a GetStaticDriverData method which I used during development to test boundary conditions. Who knows, maybe I will pull from Azure Tables next...

<#@ assembly name= "..\..\EPPlus.dll" #>
<#@ import namespace="OfficeOpenXml" #>
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
<#
    Dictionary<string, List<string>> dasData = new Dictionary<string, List<string>>();
    List<string> packageList = new List<string>();
    string templateName = "inc_TruncAndReloadPackageParameter.biml";
    string projectName = "SHADOW_IT_DataSnapshot";

    // Get our meta data    
    dasData = GetDriverData();
#>
    <Packages>
<#

    dynamic customOutput;
    foreach(var key in dasData.Keys)
    {
        WriteLine(CallBimlScriptWithOutput(templateName, out customOutput, dasData[key][0], dasData[key][1], dasData[key][2], dasData[key][3], dasData[key][4], dasData[key][5], dasData[key][6]));
        packageList.Add(customOutput.PackageName);
    }
#>
    </Packages>
    <Projects>
        <PackageProject Name="<#= projectName #>">
            <Packages>
<#
        foreach(var key in packageList)
        {
#>
            <Package PackageName="<#= key #>" />
<#
        }
#>            
            </Packages>
            <Connections>
                <Connection ConnectionName="WWI_DB" />
                <Connection ConnectionName="WWI_DW" />
            </Connections>
        </PackageProject>
    </Projects>
    <Connections>
        <OleDbConnection Name="WWI_DB" ConnectionString="Data Source=.\DEV2017;Initial Catalog=WWI_DW;Provider=SQLNCLI11.1;Integrated Security=SSPI;Application Intent=READONLY;ConnectionTimeout=0;" CreateInProject="true" />
        <OleDbConnection Name="WWI_DW" ConnectionString="Data Source=.\DEV2017;Initial Catalog=WWI_DB;Provider=SQLNCLI11.1;Integrated Security=SSPI;Application Intent=READONLY;" CreateInProject="true" />
    </Connections>
</Biml>

<#+

    /// Get data from Excel worksheet
    public Dictionary<string, List<string>> GetExcelDriverData(string sourceFile)
    {
        Dictionary<string, List<string>> d = new Dictionary<string, List<string>>();
        System.IO.FileInfo fi = new System.IO.FileInfo(sourceFile);
        using (ExcelPackage ep = new ExcelPackage(fi))
        {
            ExcelWorkbook wb = ep.Workbook;
            ExcelWorksheet ws = wb.Worksheets.First();
            if (ws != null)
            {
                // 1 based array to 7, inclusive
                for (int i = ws.Dimension.Start.Row+1; i < ws.Dimension.End.Row+1; i++)
                {
                    List<string> row = new List<string>() { ws.Cells[i, 1].Value.ToString()
                    ,   ws.Cells[i, 2].Value.ToString()
                    ,   ws.Cells[i, 3].Value.ToString()
                    ,   ws.Cells[i, 4].Value.ToString()
                    ,   ws.Cells[i, 5].Value.ToString()
                    ,   ws.Cells[i, 6].Value.ToString()
                    ,   ws.Cells[i, 7].Value.ToString()
                    };
                    
                    d[ws.Cells[i, 7].Value.ToString()] = row;
                }
            }
        }
        
        return d;
    }

    public Dictionary<string, List<string>> GetDriverData()
    {
        string sourceFile= @"C:\Users\billinkc\Documents\ShadowIt_DataSnap.xlsx";
        return GetExcelDriverData(sourceFile);
        //return GetStaticDriverData();
    }
#>

And that's how we can use EPPlus to consume metadata stored in Excel to generate many packages with Biml. Let me know if this helps or if you have questions about how to get this running. It's good stuff, I can't get enough of it.

Reading Excel files without Excel

2018-10-08T08:00:00.000-05:00

Reading Excel files without Excel

A common problem working with Excel data is Excel itself. Working with it programmatically requires an installation of Office, and the resulting license cost, and once everything is set, you're still working with COM objects, which present their own set of challenges. If only there was a better way.

Enter, the better way - EPPlus. This is an open source library that wraps the OpenXml library which allows you to simply reference a DLL. No more installation hassles, no more licensing (LGPL) expense, just a simple reference you can package with your solutions.

Let's look at an example. Here's a simple spreadsheet with a header row and a row's worth of data.

For each row, after the header, I'll read the 7 columns into a list and then, since I assume the last column, BusinessFriendlyName, is unique, I'll use that as the key for my return dictionary.

using System.Collections.Generic;
using System.Linq;
using OfficeOpenXml;

...

    /// Get data from Excel worksheet
    public Dictionary<string, List<string>> GetExcelDriverData(string sourceFile)
    {
        Dictionary<string, List<string>> d = new Dictionary<string, List<string>>();
        System.IO.FileInfo fi = new System.IO.FileInfo(sourceFile);
        using (ExcelPackage ep = new ExcelPackage(fi))
        {
            ExcelWorkbook wb = ep.Workbook;
            ExcelWorksheet ws = wb.Worksheets.First();
            if (ws != null)
            {
                // 1 based array to 7, inclusive
                for (int i = ws.Dimension.Start.Row+1; i < ws.Dimension.End.Row+1; i++)
                {
                    List<string> row = new List<string>() { ws.Cells[i, 1].Value.ToString()
                    ,   ws.Cells[i, 2].Value.ToString()
                    ,   ws.Cells[i, 3].Value.ToString()
                    ,   ws.Cells[i, 4].Value.ToString()
                    ,   ws.Cells[i, 5].Value.ToString()
                    ,   ws.Cells[i, 6].Value.ToString()
                    ,   ws.Cells[i, 7].Value.ToString()
                    };
                    
                    d[ws.Cells[i, 7].Value.ToString()] = row;
                }
            }
        }
        
        return d;
    }

It's as easy as that. There are plenty of more clever implementations out there but I wanted to demonstrate a quick and easy method to read Excel from your .NET code.
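
If you'd like to see just the shape of the result without EPPlus in the picture, here's a minimal Python sketch of the same idea: skip the header, stringify the 7 columns, and key each row by its last column. The function name rows_to_driver_data and the sample row are made up for illustration, and the worksheet is stood in for by a plain list of lists, so treat this as a sketch of the logic rather than a replacement for the C# above.

```python
from typing import Dict, List, Optional, Sequence


def rows_to_driver_data(
    rows: Sequence[Sequence[Optional[object]]], column_count: int = 7
) -> Dict[str, List[str]]:
    """Key every data row by its last column, skipping the header row."""
    driver: Dict[str, List[str]] = {}
    for raw in rows[1:]:  # rows[0] is the header
        # Pad short rows so every entry has exactly column_count values
        padded = list(raw[:column_count]) + [None] * (column_count - len(raw))
        values = ["" if cell is None else str(cell) for cell in padded]
        driver[values[-1]] = values  # last column assumed unique
    return driver


# Hypothetical worksheet contents: a header plus one row of metadata
sheet = [
    ["SchemaName", "TableName", "ParameterName", "SourceQuery",
     "SourceConnectionName", "TargetConnectionName", "BusinessFriendlyName"],
    ["dbo", "Billing", "@[$Package::TargetTableName]", "SELECT 1 AS col1",
     "WWI_DB", "WWI_DW", "Billing"],
]
driver_data = rows_to_driver_data(sheet)
print(list(driver_data.keys()))
```

Unlike calling ToString() on a cell's Value directly, this sketch turns an empty cell into an empty string instead of failing, which may or may not be what you want for required metadata.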

A date dimension for SQL Server

2018-08-14T08:00:00.000-05:00

A date dimension for SQL Server

The most common table you will find in a data warehouse will be the date dimension. There is no "right" implementation beyond what the customer needs to solve their business problem. I'm posting a date dimension for SQL Server that I generally find useful as a starting point in the hopes that I quit losing it. Perhaps you'll find it useful or can use the approach to build one more tailored to your environment.

As the comments indicate, this will create: a DW schema, a table named DimDate and then populate the date dimension from 1900-01-01 to 2079-06-06 endpoints inclusive. I also patch in 9999-12-31 as a well known "unknown" date value. Sure, it's odd to have an incomplete year - this is your opportunity to tune the supplied code ;)

-- At the conclusion of this script, there will be
-- A schema named DW
-- A table named DW.DimDate
-- DW.DimDate will be populated with all the days between 1900-01-01 and 2079-06-06 (inclusive)
--   and the sentinel date of 9999-12-31

IF NOT EXISTS
(
    SELECT * FROM sys.schemas AS S WHERE S.name = 'DW'
)
BEGIN
    EXECUTE('CREATE SCHEMA DW AUTHORIZATION dbo;');
END
GO
IF NOT EXISTS
(
    SELECT * FROM sys.schemas AS S INNER JOIN sys.tables AS T ON T.schema_id = S.schema_id
    WHERE S.name = 'DW' AND T.name = 'DimDate'
)
BEGIN
    CREATE TABLE DW.DimDate
    (
        DateSK int NOT NULL
    ,   FullDate date NOT NULL
    ,   CalendarYear int NOT NULL
    ,   CalendarYearText char(4) NOT NULL
    ,   CalendarMonth int NOT NULL
    ,   CalendarMonthText varchar(12) NOT NULL
    ,   CalendarDay int NOT NULL
    ,   CalendarDayText char(2) NOT NULL
    ,   CONSTRAINT PK_DW_DimDate
            PRIMARY KEY CLUSTERED
            (
                DateSK ASC
            )
            WITH (ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, DATA_COMPRESSION = PAGE)
    ,   CONSTRAINT UQ_DW_DimDate UNIQUE (FullDate)
    );
END
GO
WITH 
    -- Define the start and the terminal value
    BOOKENDS(FirstDate, LastDate) AS (SELECT DATEFROMPARTS(1900,1,1), DATEFROMPARTS(9999,12,31))
    -- itzik ben gan rapid number generator
    -- Builds 65537 rows. Need more - follow the pattern
    --  Need fewer rows, add a top below
,    T0 AS 
(
    -- 2
    SELECT 1 AS n
    UNION ALL SELECT 1
)
,    T1 AS
(
    -- 2^2 => 4 
    SELECT 1 AS n
    FROM
        T0
        CROSS APPLY T0 AS TX
)
,    T2 AS 
(
    -- 4^2 => 16
    SELECT 1 AS n
    FROM
        T1
        CROSS APPLY T1 AS TX
)
,    T3 AS 
(
    -- 16^2 => 256
    SELECT 1 AS n
    FROM
        T2
        CROSS APPLY T2 AS TX
)
,    T4 AS
(
    -- 256^2 => 65536
    -- or approx 179 years
    SELECT 1 AS n
    FROM
        T3
        CROSS APPLY T3 AS TX
)
,    T5 AS
(
    -- 65536^2 => 4,294,967,296 (a spare level; NUMBERS below only reads T4)
    SELECT 1 AS n
    FROM
        T4
        CROSS APPLY T4 AS TX
)
    -- Assume we now have enough numbers for our purpose
,    NUMBERS AS
(
    -- Add a SELECT TOP (N) here if you need fewer rows
    SELECT
        CAST(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS int) -1 AS number
    FROM
        T4
    UNION 
    -- Build End of time date
    -- Get an N value of 2958463 for
    -- 9999-12-31 assuming start date of 1900-01-01
    SELECT
        ABS(DATEDIFF(DAY, BE.LastDate, BE.FirstDate))
    FROM
        BOOKENDS AS BE
)
, DATES AS
(
SELECT
    PARTS.DateSk
,   FD.FullDate
,   PARTS.CalendarYear
,   PARTS.CalendarYearText
,   PARTS.CalendarMonth
,   PARTS.CalendarMonthText
,   PARTS.CalendarDay
,   PARTS.CalendarDayText
FROM
    NUMBERS AS N
    CROSS APPLY
    (
        SELECT
            DATEADD(DAY, N.number, BE.FirstDate) AS FullDate
        FROM
            BOOKENDS AS BE
    )FD
    CROSS APPLY
    (
        SELECT
            CAST(CONVERT(char(8), FD.FullDate, 112) AS int) AS DateSk
        ,   DATEPART(YEAR, FD.FullDate) AS [CalendarYear] 
        ,   DATENAME(YEAR, FD.FullDate) AS [CalendarYearText]
        ,   DATEPART(MONTH, FD.FullDate) AS [CalendarMonth]
        ,   DATENAME(MONTH, FD.FullDate) AS [CalendarMonthText]
        ,   DATEPART(DAY, FD.FullDate)  AS [CalendarDay]
        ,   DATENAME(DAY, FD.FullDate) AS [CalendarDayText]

    )PARTS
)
INSERT INTO
    DW.DimDate
(
    DateSK
,   FullDate
,   CalendarYear
,   CalendarYearText
,   CalendarMonth
,   CalendarMonthText
,   CalendarDay
,   CalendarDayText
)
SELECT
    D.DateSk
,   D.FullDate
,   D.CalendarYear
,   D.CalendarYearText
,   D.CalendarMonth
,   D.CalendarMonthText
,   D.CalendarDay
,   D.CalendarDayText
FROM
    DATES AS D
WHERE NOT EXISTS
(
    SELECT * FROM DW.DimDate AS DD
    WHERE DD.DateSK = D.DateSk
);
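
The numbers baked into the script (row counts per CTE level, the 2079-06-06 end date, the 2958463 offset for the sentinel) can be sanity checked outside of SQL Server. What follows is a small Python sketch of that arithmetic only; the variable and function names are my own and nothing here touches a database.

```python
from datetime import date, timedelta

first_date = date(1900, 1, 1)
sentinel = date(9999, 12, 31)

# Each CTE level cross joins the previous level to itself,
# which squares the row count: T0 through T4
rows = 2
levels = [rows]
for _ in range(4):
    rows *= rows
    levels.append(rows)

# NUMBERS is zero based, so the last generated offset is rows - 1
last_generated = first_date + timedelta(days=levels[-1] - 1)

# Offset that the second half of the UNION adds for the sentinel date
sentinel_offset = (sentinel - first_date).days


def date_sk(d: date) -> int:
    """yyyymmdd as an integer, like CAST(CONVERT(char(8), d, 112) AS int)."""
    return d.year * 10000 + d.month * 100 + d.day


print(levels)
print(last_generated, date_sk(last_generated))
print(sentinel_offset, date_sk(sentinel))
```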