World of Whatever

A blog about SQL Server, SSIS, C# and whatever else I happen to be dealing with in my professional life.

Find ramblings

Wednesday, May 3, 2023

Formatting a date in SQL Server with the Z indicator

Formatting a date in SQL Server with the Z indicator

It seems so easy, I was building json in SQL Server and the date format for the API specified it needed to have 3 millsecond digits and the zulu timezone signifier. Easy peasy, lemon squeezey, that is ISO8601 with time zone Z format code 127

SELECT CONVERT(char(24), GETDATE(), 127) AS waitAMinute; Running that query yields something like 2023-05-02T10:47:18.850 Almost there but where's my Z? Hmmm, maybe it's because I need to put this into UTC? SELECT CONVERT(char(24), GETUTCDATE(), 127) AS SwingAndAMiss;

Running that query yields something like 2023-05-02T15:47:18.850 It's in UTC but still no timezone indicator. I guess I can try an explict conversion to a datetimezone and then convert to 127.

SELECT CONVERT(char(24), CAST(GETUTCDATE() AS datetimeoffset) , 127) AS ThisIsGettingRidiculous , CAST(GETUTCDATE() AS datetimeoffset) AS ControlValue;

Once again, that query yields something like 2023-05-02T15:47:18.850 and I can confirm the ControlValue aka unformatted looks like 2023-05-02 15:47:850.7300000 +00:00 We have timezone info, just not the way I need it.

Back to the documentation, let's ready those pesky footnotes.

8 Only supported when casting from character data to datetime or smalldatetime. When casting character data representing only date or only time components to the datetime or smalldatetime data types, the unspecified time component is set to 00:00:00.000, and the unspecified date component is set to 1900-01-01. 9 Use the optional time zone indicator Z to make it easier to map XML datetime values that have time zone information to SQL Server datetime values that have no time zone. Z indicates time zone at UTC-0. The HH:MM offset, in the + or - direction, indicates other time zones. For example: 2022-12-12T23:45:12-08:00.

Those don't appply to me...Oh, wait, they do. In the Z notation is only used for converting stringified dates into native datetime types. There is no cast and convert style code to output a formated ISO 8601 date with the Z indicator.

So what do you do? Much of the internet just proposes using string concatenation to append the Z onto the string and move on. And that's what a rational developer would do but I am not one of those people.


If you want to get a well formatted ISO8601 with the time zone Z indicator, the one-stop-shop in SQL Server will be the FORMAT function because you can do anything there!

SELECT FORMAT(CAST(D.val AS datetimeoffset), 'yyyy-MM-ddThh:mm:ss.fffZ') AS WinnerWinnerChickenDinner , FORMAT(CAST(D.val AS datetime2(3)), 'yyyy-MM-ddThh:mm:ss.fffZ') AS OrThis FROM ( VALUES ('2023-05-02T15:47:18.850Z') )D(val);

The final thing to note, the return type of FORMAT is different. It defaults to nvarchar(4000) whereas lame string concatenation yields us the right length (24) but the wrong type as concatenation changed our char to varchar. If we were storing this to a table, I'd add a final explicit cast, in either case, to be char(24). There's no unicode values to worry about nor will it ever be shorter than 24 characters.

SELECT, DEDFRS.system_type_name FROM sys.dm_exec_describe_first_result_set ( N'SELECT FORMAT(CAST(D.val AS datetimeoffset), ''yyyy-MM-ddThh:mm:ss.fffZ'') AS LookMaUnicode , CONVERT(char(23), CAST(D.val AS datetimeoffset), 127) + ''Z'' AS StillVarChar FROM ( VALUES (''2023-05-02T15:47:18.850Z'') )D(val);' , N'' , 1) AS DEDFRS;

Filed under, "I blogged about it, hopefully I'll remember the solution"

Thursday, December 22, 2022

Counting a character in a column

Counting a character in a column

I ran into an issue today that I wanted to write about so that maybe I remember the solution. We ran into a case where the source data in a column had an unprintable character. In this case, it was a line feed character, which is ASCII value 10, and they had 7 instances in this one row. "How did that get in there? Surely that's an edge case and we can just ignore it," and dear reader, I've been around long enough to know that this is likely a systemic situation. To count the number of line feeds in a single row, for a single column, I can just copy the value into NotePad++ or the like, display all characters, and simpely count.

A screenshot of a text editor. It displays 7 lines of information, each with a black LF at the end

Now, let's count how many line feeds are in a table with 23.5 million rows by hand - Any takers? Exactly. The trick to this solution is that we're going to make use of the REPLACE function to substitute the empty string for all of the values we want to count. We'll then compare the difference in string lengths between the original and the final.

    -- Generate our data
        CONCAT('100', CHAR(10), 'E', CHAR(10), 'Main', CHAR(10), 'St', CHAR(10), 'W', CHAR(10), 'Ste', CHAR(10), '357', CHAR(10) ) AS InputString
) D0
    -- What is the difference in string length?
        LEN(D0.InputString) - LEN(REPLACE(D0.InputString, CHAR(10), '')) AS LFCount

This solution is elegant in that you only require one pass through a table to figure out the number. Plus, if you want to do the same computation in other languages that may not have a count function or equivalent available for strings, it likely supports a REPLACE operation.

Is there any other way?

Sure but none of them seem to perform as well.


Assuming you're on SQL Server 2014?+ the second parater to STRING_SPLIT will be the character to split. Intuitively, I think this might be a more obvious solution. What do we want to do? Count how many times a character exists in a field. If we break the field up into multiple rows and then count how many rows were generated, Bob's your uncle!

        CONCAT('100', CHAR(10), 'E', CHAR(10), 'Main', CHAR(10), 'St', CHAR(10), 'W', CHAR(10), 'Ste', CHAR(10), '357', CHAR(10) ) AS InputString
) D0
        COUNT_BIG(SS.value) AS LFCount
        STRING_SPLIT(D0.InputString, CHAR(10)) AS SS

What I don't like is the performance. Running those two queries against my original table, it took about 30 seconds to compute the REPLACE's results and 70 seconds for the STRING_SPLIT.

Number table

I'm not going to go find my notes on how to use a number table to perform the same split operation as string_split but I can already tell you, it won't perform as well as string_split and we've already covered that one isn't going to cut it.

I want to count spaces or I'm dealing with unicode data

Fun, but documented, twist --- LEN is not going to count trailing white space. And while I'm a dumb 'murican, unicode length is different so you should probably use DATALENGTH instead, but then divide by 2.

    -- Generate our data
    -- We are now using char(32), aka space, as our delimiter
    -- and tacking on an extra 100 at the end
    -- and we set our type to be nchar
        CONCAT(CAST('100' AS nchar(3)), CHAR(32), 'E', CHAR(32), 'Main', CHAR(32), 'St', CHAR(32), 'W', CHAR(32), 'Ste', CHAR(32), '357', CHAR(32), space(100) ) AS InputString
) D0
    SELECT LEN(D0.InputString) AS IncorrectLength
    ,   DATALENGTH(D0.InputString)/2 AS CorrectLength
    -- What is the difference in string length?
        (DATALENGTH(D0.InputString) - DATALENGTH(REPLACE(D0.InputString, CHAR(32), '')))/2 AS SpaceCount
    ,   (LEN(D0.InputString) - LEN(REPLACE(D0.InputString, CHAR(32), ''))) AS SpaceCountWrong

Can you think of any other ways to crack this nut?

Tuesday, November 22, 2022

GENERATE_SERIES is a persnickety little function

GENERATE_SERIES is a persnickety little function

New in SQL Server 2022 is the GENERATE_TIMESERIES table valued function. It's a handy function to generate a series of numbers. Once you understand why a number table is handy, you'll find lots of uses for it but this it not the blog post to cover it.

I wanted to generate a time series using the new hotness and step 1 is to get a list of the timeoffsets I'll add to my starting value. The following query will generate 16565 rows, which I'd then use as the second argument to a DATEADD call.

SELECT *, DATEADD(SECOND, Works.Value, DATETIME2FROMPARTS(2022,11,22,0,0,0,0,1)) FROM GENERATE_SERIES(0, 19565, 1) AS Works;

Perfect, now I have a row for each second from midnight to approximately 5.5 hours later. What if my duration need to vary because I'm going to compute these ranges for a number of different scenarios? I should make that 19565 into a variable and let's overengineer this by making it a bigint.

-- Assume the duration might be "big"
-- TODO: @duration = DATEDIFF(SECOND, Start, Stop)
DECLARE @duration bigint = 19565;
SELECT *, DATEADD(SECOND, LolWorks.Value, DATETIME2FROMPARTS(2022,11,22,0,0,0,0,1)) FROM GENERATE_SERIES(0, @duration, 1) AS LolWorks;

In the immortal words of Press Your Luck, WHAMMIE WHAMMIE WHAMMIE!

Msg 5373, Level 16, State 1, Line 4
All the input parameters should be of the same type. Supported types are tinyint, smallint, int, bigint, decimal and numeric.
Msg 206, Level 16, State 2, Line 4
Operand type clash: int is incompatible with void type
Msg 206, Level 16, State 2, Line 4
Operand type clash: bigint is incompatible with void type
Msg 206, Level 16, State 2, Line 4
Operand type clash: int is incompatible with void type

The fix is easy - just as the error message says, all THREE operands to GENERATE_SERIES must be the same data type. In this case, it required upsizing to bigint. In the technical writers defense (defence), they call it out in the documentation but I grew up only reading documentation after I failed to natively grasp how to make something work. The data type for stop must match the data type for start. SELECT * FROM GENERATE_SERIES(CAST(1 AS bigint), @duration, CAST(1 AS bigint)) AS Fails;

Key take away

Unlike any other situation I can think of off the top of my head, GENERATE_SERIES won't automatically promote/upsize/expand data types.

Thursday, January 27, 2022

ADF and MySql.Data.MySqlClient.MySqlException,Message=Got a packet bigger than 'max_allowed_packet'

Azure Data Factory, ADF, and exception MySql.Data.MySqlClient.MySqlException,Message=Got a packet bigger than 'max_allowed_packet'

My StackOverflow developer profile specifies "I'd prefer to not work with" and honestly, the only thing I don't want to deal with is MySQL. I don't like Visual Basic or Access or plenty of other things but good grief, I find working MySQL to be an absolute cesspit after every other RDBMS I've worked with. Which brings me to an overdue client project, consolidating various MySQL instances to a single reporting server. They have a standard schema on all the boxes (no really, that was my biggest fear but they're good at ensuring the nearly 200 sites have the exact same point release of code) and I needed to bring it to a single server so it can be fed into reports.

It seemed like a great fit for Azure Data Factory but I kept getting an error dealing with some packet size issue. What do I know about packet sizes? Error details Error code 2200 Failure type User configuration issue Details 'Type=MySql.Data.MySqlClient.MySqlException,Message=Got a packet bigger than 'max_allowed_packet' bytes,Source=MySqlConnector,''Type=MySql.Data.MySqlClient.MySqlException,Message=Got a packet bigger than 'max_allowed_packet' bytes,Source=MySqlConnector,'

What's the internet got to say about all this? I checked the setting on a server that worked and one that didn't SHOW VARIABLES LIKE 'max_allowed_packet'; but they both listed 4194304 (bytes).

Beyond changing configuration settings, and no guarantee that solves the issue, the idea of inconsistent table definition sounded promising. But no dice. I tried making all the fields nullable, but to no avail. I ran the mysqldump utility from the commandline to see if I could reproduce the packet issue. Nothing.

After a lot of frustration, I looked hard at the custom integration logs. Before I move data, I copy over the source information_schema.tables for the database and store the TABLE_ROWS for each table. That number is a approximately the number of rows in the table. In the copy activity itself, I log the actual rows transferred and that's when I noticed something. The largest set of data from a single source was 6k rows. ALL OF THE HOSTS THAT GENERATED THE MAX PACKET EXCEPTION HAD MORE ROWS THAN 6K.

Well, what if the issue is one of volume? That's easy enough to test. I put a LIMIT 1000; on the query and pointed ADF at the server that never transferred data for that table. It worked. Sonofa. Ok, LIMIT 5000; Worked. Removed the limit - Failed, got a packet bigger than 'max_allowed_packet'

The error is not being generated from the source as I assumed. Normally, ADF says whether it's the source or the sink that caused the error. The exception makes sense if the error is on the sink. "You're sending too much data in one shot" would be a more useful error.

How do we fix it

The default Write Batch Size for a Copy Activity is 10,000. I dropped the size to 5000 and ran through all the troublesome hosts. Of the 51 hosts that would never transfer the suspect table,

Thursday, January 6, 2022

SSIS Azure Feature Pack and the Flexible File components

SSIS Azure Feature Pack and the Flexible File components

The Azure Feature Pack for SSIS is something I had not worked with before today. I have a client that wants to use the Flexible File Task/Flexible File Source/Flexible File Destination but they were having issues. The Flexible File tools allow you to work with Azure Blob storage. We were dealing with ADLS Gen2 but the feature pack can work with classic blob storage as well. In my hubris, I said no problem, I know SSIS. Dear reader, I did not know as much as I thought I did...

Our scenario was simple. We had a root folder datalake and subfolders of Raw. And into that we were needed to land and then consume files. Easy peasy. The Flexible File Destination allows us to write. The Flexible File Source allows us to read and we can configure a Foreach File enumerator to use the "Foreach Data Lake Storage Gen2 File Enumerator" to interact with the file system. Everything is the same as Windows except we use forward slashes instead of backslashes for path separators.

I started with the Flexible File Source after I manually created a CSV and uploaded it to data lake. One of the things I wasn't sure about was path syntax - do I need to specify the leading slash, do I need to specify the trailing slash, etc. I think I determined it didn't matter, it'd figure out whether folder path needs a trailing slash if it wasn't specified.

As I was testing things, changing that value and the value in the Flat File Destination each time was going to be annoying so I wanted to see what my options were for using Expresssions. Expressions are the secret sauce to making your SSIS packages graceful. The development teams took the same approach they have with the ADO.NET components in that there are no expressions on the components themselves. Instead, if you want to make the Flexible File Source/Flexible File Destination dynamic at all, you're going to have to look at the Data Flow Tasks Expressions and then configure there.

Here I used an SSIS package variable @[User::ADLSPath] so I could easily switch out /datalake, datalake, datalake/raw, datalake/Raw/, /datalake/raw/, and evaluate case sensitivity, path manipulation, etc.


Transfer data error : System.IO.FileNotFoundException: Could not load file or assembly 'Microsoft.Azure.Storage.Common, Version=, Culture=neutral, PublicKeyToken=31bf3856ad364e35' or one of its dependencies. The system cannot find the file specified. File name: 'Microsoft.Azure.Storage.Common, Version=, Culture=neutral, PublicKeyToken=31bf3856ad364e35'

I tried it again, same issue. I flipped from 64 to 32 bit (I had installed both sets of the Feature Pack). I reread the install requirements document. I looked through github bugs. I contemplated asking a question on StackOverflow. I ended up trying to reproduce the error on my desktop instead of the client's VM. Same bloody issue! Finally, I said to heck with it, maybe it more strongly worded, and I'll get this assembly and install to the GAC. Maybe something went wonky with the installs on both machines.

How do I install something into the global assembly cache?

StackOverflow answer On your file system, you might have a utility called "gacutil.exe" From a administrative command prompt, I'd type "CD \" and then "dir /s /b gacutil.exe" That should provide a list of all the instances of gacutil on your machine. Pick one, I don't care which. If there's a space in the path name, then you'll need to wrap it with double quotes. "C:\Program Files (x86)\Microsoft SDKs\Windows\v10.0A\bin\NETFX 4.8 Tools\x64\gacutil.exe" /? That would bring up the help text for using the gacutil that lives at that path. If the double quotes were not there, you'd get an error message stating
'C:\Program' is not recognized as an internal or external command, operable program or batch file.

How do I get Microsoft.Azure.Storage.Common.dll?

I hope you comment below and tell me an easier way because I am not a nuget master. Visual Studio has a nuget package manager in it. However, it only works in the contet of a solution or project and the project type must support packages. If you have a single project in your solution and that project is an SSIS project, attempting to use the nuget package manager will result in an error
Operation Failed. No projects supported by NuGet in the solution. Well fiddlesticks, I guess we have to give up.

Or, add a script task to the SSIS package and then click Edit Script. In the new instance of Visual Studio, guess what - it thinks it supports NuGet. It does not, when you close the editor, the packages go away and when the script runs, it does not have the brains to get the assemblies back from NuGet land. So, don't.close.the.editor. yet. In the NuGet Package manager, on the Browse tab, type in Microsoft.Azure.Storage.Common and look at that - Deprecated. Last stable version 11.2.3. But this error indicates the component expects so in the Version, scroll all the way back and find it. Click the check box and click Install button. Yes, you agree to the other packages as well as the MIT license.

At this point, in the volatile/temporary file storage on your computer, you have an on disk representation of your empty script Task with a NuGet package reference. We need to find that location, e.g. C:\Users\bfellows\AppData\Local\Temp\Vsta\SSIS_ST150\VstacIlBNvWB__0KemyB8zd1UMw\ Copy the desired DLLs out into a safe spot (because if I have to do it here, I'll likely have to do it on the server and the other three development VMs) and then use the gacutil to install them.

Right click on Solution VstaProjects and choose Open in Terminal. The assembly we're looking for will be located at .\packages\Azure.Storage.Common.12.9.0\lib\netstandard2.0\Azure.Storage.Common.dll. Assume I copied it to C:\temp

"C:\Program Files (x86)\Microsoft SDKs\Windows\v10.0A\bin\NETFX 4.8 Tools\x64\gacutil.exe" -if C:\temp\Azure.Storage.Common.dll will force install the dll to the GAC. Now when I run SSIS, we'll see whether that has resolved our error.

Will the real error please stand up

It did resolve our error, but not. The error that was reported, missing assembly was Mickey Mouse, mate. Spurious. Not genuine.

Look what error decided to show up now that it could error out "better"
Transfer data error : Microsoft.DataTransfer.Common.Shared.HybridDeliveryException: ADLS Gen2 operation failed for: Operation returned an invalid status code 'BadRequest'. Account: 'datalakedev'. FileSystem: 'datalake'. ErrorCode: 'TlsVersionNotPermitted'. Message: 'The TLS version of the connection is not permitted on this storage account.'

If I go to my storage account, under Settings, Configuration, there I can change Minimum TLS version from 1.2 to 1.1. Oh, except that still isn't kosher - same error. 1.0 it is and lo and behold, I have data transfer. The root cause is not a missing assembly, it is a red herring error message that could only be resolved by adding the assembly to the global assembly cache.


How in the hell would a normal person make the connection between "Could not load file or assembly" and Oh, I need to change the TLS? What's really galling is the fact that when I used the Flexible File Source for my data flow, I specified a file on blob storage and SSIS was able to connect and read that file because it identified the associated metadata. I has two columns and here are the data types (defaulted to string but who cares, that's consistent with flat file source). BUT IT PICKED UP THE METADATA. IT COULD TALK TO AZURE BLOB STORAGE EVEN THOUGH IT ONLY ALLOWED 1.2! And yet, when it came time to run, it could not talk on the same channel. Can you see how I lost a large portion of my day trying to decipher this foolishness?

By the way, knowing that the root cause it a mismatch between the TLS SSIS/my computer is using and the default on the storage account, let's go back to the writeup for the Azure Feature Pack

Use TLS 1.2 The TLS version used by Azure Feature Pack follows system .NET Framework settings. To use TLS 1.2, add a REG_DWORD value named SchUseStrongCrypto with data 1 under the following two registry keys. HKEY_LOCAL_MACHINE\SOFTWARE\WOW6432Node\Microsoft\.NETFramework\v4.0.30319 HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\.NETFramework\v4.0.30319

So, sometimes an error message isn't the real error message but you have to clear away all the garbage between you and the source of the error to figure out what it really is.

Wednesday, December 22, 2021

Extracting queries from SSIS packages

Extracting queries from SSIS packages

We received an abomination of an SSIS package from a third party. It was a way to write a package that I don't think I would have suggested.

Our job was to rewrite this into something more manageable and it appears Azure Data Factory will be the winner. Before we can do that, we need to document what the existing package is doing (the vendor has supplied the incremental load logic) so we can replicate it but in a more economical form. It appears to have the pattern (squint really hard at the picture) Execute SQL Task -> Execute SQL Task -> Sequence container => many data flows -> Data Flow -> Execute SQL Task. The Data Flow Task is named after the table being loaded. An ODBC source with a expression based query, named "ODBC Source 1" wired to an OLE DB Destination, named "OLE DB Destination". How would you do it, especially given that there are 236 Data Flow Tasks embedded in a single container?

Biml it!

As with so many things SSIS-related, Biml is the answer. Install BimlExpress and reverse engineer that dtsx package into Biml. I'll add a blank BimlScript file that I called Inspector.biml

Let's look at a sample DataFlow task

            <Dataflow Name="ATableName">
                <Expression ExternalProperty="[ODBC Source 1].[SqlCommand]">"SELECT * FROM dbo.ATableName where modifiedutc > '" +(DT_WSTR, 30)@[User::LastModified] + "'  AND modifiedutc <= '" + (DT_WSTR, 30)@[User::MostRecent] + "'"</Expression>
                <OdbcSource Name="ODBC Source 1" Connection="Source2">
                  <DirectInput>SELECT * FROM dbo.ATableName where modifiedutc > '0'  AND modifiedutc <= '0'</DirectInput>
                <DataConversion Name="Data Conversion">
                    <Column SourceColumn="id" TargetColumn="Copy of id" DataType="AnsiString" Length="255" CodePage="1252" />
                <OleDbCommand Name="Delete Old Rows from ATableName" ConnectionName="Destination2">
                    <Parameter SourceColumn="ye_id" TargetColumn="Param_0" DataType="AnsiStringFixedLength" Length="255" CodePage="1252" />
                    <Parameter SourceColumn="modifiedutc" TargetColumn="Param_1" DataType="Int64" />
                  <DirectInput>delete from dbo.ATableName where ye_id = ? and modifiedutc  < ?</DirectInput>
                <ConditionalSplit Name="Conditional Split">
                    <OutputPath Name="Case 1">
                <OleDbDestination Name="OLE DB Destination" ConnectionName="Destination2">
                  <ExternalTableOutput Table=""dbob"."ATableName"" />

All I want to do is find all the Data Flow Tasks in the sequence containers. I need to generate a key value pair of TableName and the source query. I could dive into the Transformations layer and find the ODBC source and extract the DirectInput node from the OdbcSource and then parse the table name from the OleDbDestination's ExternalTableOutput but look, I can "cheat" here. Everything I need is at the outer Data Flow Task level. The Name of the DataFlow is my table name and since it's ODBC and the source component doesn't support a direct Expression on it, it's defined at the Data Flow Level. That makes this Biml easy.


// A dictionary of TableName and the Expression
Dictionary<string, string> incremental = new Dictionary<string, string>();

foreach (AstPackageNode p in this.RootNode.Packages)
    // Loop through the Sequence Container
    foreach (var c in p.Tasks.OfType<AstContainerTaskNode>()/**/)
        foreach (AstDataflowTaskNode t in c.Tasks.OfType<AstDataflowTaskNode>())
            if (p.Name == "Postgres_to_MSSQL_Incremental")
                incremental[t.Name] = t.Expressions[0].Expression;

foreach (KeyValuePair<string, string> k in incremental)
    // WriteLine("<!-- {0}: {1} -->", k.Key, k.Value);
    WriteLine("{0}|{1}", k.Key, k.Value.Replace("\n", " "));

<Biml xmlns="">

I define a dictionary that will hold the Table and the associated Expression. The first foreach loop specifies that I want to enumerate through all the packages that have been reverse engineered. If you're using BimlExpress, you'll need to shift click on all the source Biml files as well as the Inspector package.

The next foreach enumerator looks at all the elements in the Control Flow task and we're going to filter it to just things that are of type AstContainerTaskNode, aka a Container. That "OfType" filter syntax is very handy to focus on only the type of item you're looking for. Linq is so powerful, I love it.

The innermost foreach enumerator uses the OfType again to filter tasks to only those that are DataFlows, AstDataFlowTaskNode. The if statement ensures I'm only working on the Incremental package (they supplied an initial load as well). Finally, I add the Task's Name to the key of the dictionary and the value becomes the first Expression. Again, I can cheat here because the vendor package was very consistent, which is amazing for an SSIS package that's been hand crafted. That source package was 40.5 MB and had been saved 422 times according to the VersionBuild number. Kudos to them for quality control.

Once the looping is complete, all that is left is to emit the information so I can use it elsewhere. Thus the final foreach loop. I'm working in BimlStudio so comments are emitted and I'm going to take advantage of that by simply writing the key/value pair with a Pipe delimiter and then copy/paste the output into a CSV. If you're working in BimlExpress, I'd just write directly to a file with a System.IO.Text.WriteAllLines (name approximate) but this was just a "quick and dirty get it done" task and corresponding blog post to show that Biml and metadata programming are still relevant.

Eagle eyed viewers will note that I am missing the single DataFlow task after the Container. My partner also notice it and so if you need to also look for any data flow tasks at the Package level, I added this loop after the Seqence Container loop.

    foreach (AstDataflowTaskNode t in p.Tasks.OfType<AstDataflowTaskNode>())
        if (p.Name == "Postgres_to_MSSQL_Incremental")
            incremental[t.Name] = t.Expressions[0].Expression;


I'm a slow/infrequent blogger and this post took me 50 minutes. I think after we were done scratching our heads at the source packages, it took less time to write the script and generate the list of tables and associated queries than this blog post took.

Wednesday, November 17, 2021

ETL pattern for API source

ETL pattern for API source

The direction for software as a service providers is to provide APIs to access their data instead of structured file exports. Which is a pity, as every SaaS system requires a bespoke data extract solution. I inheireted a solution that had an adverse pattern I'd like to talk about.

The solution pulls data from advertising and social media sites (Google Analytics, Twitter, Facebook, etc) and does processing to make it ready for reporting. The approach here works, but there are some challenges that you can run into.
  • Metering - Providers generally restrict you to only consuming so much over time (where so much and time are highly dependent on the source). Google Analytics, depending on product, rejects your connections after so many calls. Twitter, also depending on their maddening, inconsistent set of APIs (v1 vs v2), endpoints, product (free standard, paid for premium or enterprise) will throttle you based on consumption
  • Data availability - you have no idea whether the data you pulled today will be available tomorrow. We had pulled 5 years of data out of Google Analytics that contained a variety of dimensions, two of which were ga:userAgeBracket and ga:userGender. In talking to our client, they wanted just one more data elemented added to the mix. We made the change and boom goes the dynamite: Some data in this report may have been removed when a threshold was applied. That error message means that you're requesting a level of granularity that could de-anonymize users. Ok, fine, we rolled back the change but No, that's no longer a valid combination, ever! And we ran into a situation were some of the data just wasn't availble pre-2020. Yes, a month earlier the same code had pulled 6 years worth of data but no more.
  • Oops - Something happened when we created the data for reporting (data merge introduced duplicates, client wanted a differen format, etc) and now we need to do it again, except instead of the month allocated, we have a week to fix all this up. Which bumps into the Metering and Data Availability points. Ouch town, population you.

Preferred pattern

What I inheireted wasn't bad, it just hadn't taken those possible pain points into consideration. In a classic data warehouse, you have a raw zone with immutable source sitting somewhere on cheap storage. The same lesson applies here. When you pull from an API, land that data to disk in some self defining format, json/xml/csv don't care.

Write your process so that it is able to consume that source file and get the exact same results as the source data pull.

def get_data(source_date):
  """Get a data for a given date.
  :param source_date: An ISO 8601 formatted date aka yyy-MM-dd
  :return: A dictionary of data
  source_file = '/dbfs/mnt/datalake/raw/entity/data_provider_{0}.json'.format(source_date)
  raw_data = {}
  if os.path.exists(source_file):
    with open(source_file, 'r', encoding='utf-8') as f:
      raw_data = json.load(f)
      raw_data = analytics.reports().batchGet(body='json-goes-here').execute()
      with open(google_file, 'w', encoding='utf-8') as f:
        json.dump(raw_data, f, ensure_ascii=False)
  return raw_data

This simple method is responsible for getting my data by date. If the file exists in my file system, then I will reuse the results of a previous run to satisfy the data request. Otherwise, we make a call to the API and before we finish, we write the results to disk so that we can be ready in case Ooops happens downstream of the call.

Using this approach, we were able to reprocess a years worth of cached data in about 10 minutes compared to about 4.5 hours of data trickling out of the source API.