Tuesday, September 26, 2023

Databricks sparksql concat is not your SQL Server concat

One of these is not like the other...

The concat function is super handy in the database world but be aware that the SQL Server one is way better because it solves two problems. It combines everything into a string and it does not require NULL checking. In the before times, one had to down cast to a n/var/char type as well as check for NULL before appending strings via the plus sign.

In Databricks CONCAT WILL ONLY TAKE CARE OF CASTING TO THE STRING TYPE. NULLS WILL CONTINUE TO BITE YOU IN THE BUTTOCKS.

Given the following example query, we generate two rows in a derived table where the col2 value is either true (boolean 1) or NULL. In the LEFT JOIN LATERAL, which is the Databricks CROSS APPLY equivalent, I concat the 3 columns together with a pipe as separator and behold, my decidedly different results from a SQL Server expectation.

SELECT
*
FROM
(
SELECT * FROM VALUES (1, true, 'B')
UNION ALL SELECT * FROM VALUES (2, NULL, 'C')
)AS X(col1, col2, col3)
LEFT JOIN LATERAL
(
  SELECT concat(X.col1, '|', X.col2, '|', X.col3)
)HK(hkey);

What do you do? You get to wrap every nullable column with a coalesce call. Except, coalesce requires the same datatypes (mostly) so a naive implmentation of

SELECT concat(X.col1, '|', coalesce(X.col2, ''), '|', X.col3)

will result in the following error
AnalysisException: [DATATYPE_MISMATCH.DATA_DIFF_TYPES] Cannot resolve "coalesce(outer(X.col2), )" due to data type mismatch: Input to `coalesce` should all be the same type, but it's ("BOOLEAN" or "STRING")

Instead, one needs to do something along the lines of

SELECT
*
FROM
(
SELECT * FROM VALUES (1, true, 'B')
UNION ALL SELECT * FROM VALUES (2, NULL, 'C')
)AS X(col1, col2, col3)
LEFT JOIN LATERAL
(
  SELECT concat(X.col1, '|', coalesce(concat(X.col2, ''),''), '|', X.col3)
)HK(hkey);

At least I can automate this pattern with the information_schema.columns

Thursday, September 14, 2023

Databricks sparksql escaping quote/tick

If I had to embed a single quote in a query in TSQL, I would double it. In SparkSQL, I escape it like a classic C style string. So, the following shows how one would generate a query that is a query to find the row counts across all tables in SQL or unity catalog. Although for SQL, you're better off just querying the partitions meta table as it's waaaaay faster.

TSQL

SELECT CONCAT('SELECT COUNT(1) AS rc, ''', T.table_name, ''' AS table_name FROM dev.silver.', T.table_name, '') AS rcQ FROM dev.information_schema.tables AS T WHERE T.table_schema = 'silver'

Databricks unity catalog

SELECT CONCAT('SELECT COUNT(1) AS rc, \'', T.table_name, '\' AS table_name FROM dev.silver.', T.table_name, '') AS rcQ FROM dev.information_schema.tables AS T WHERE T.table_schema = 'silver'

Tuesday, September 12, 2023

ADF, DB2 and unexpected token query error

In yet another exciting episode of head to desk, I have a beautiful DB2 query. It runs just fine in a tool like DBeaver.

SELECT 
    STR.CPY_CD
,   STR.STR_CD
,   STR.DFLT_FAC_NBR -- I think this is division number
,   current date || 'T' || VARCHAR_FORMAT(CURRENT TIMESTAMP, 'HH24:MI:SS')  AS CurrentDateTime
FROM 
    FOO.CUSTMASTER AS STR
FOR FETCH ONLY WITH UR

Punch it into ADF and it fails, repeatedly with this ever-so-helpful error message. What is going on here?

Operation on target Copy StoreDivision to ADLS failed: Failure happened on 'Source' side. ErrorCode=DB2DriverRunFailed,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Error thrown from driver. Sql code: '-104',Source=Microsoft.Connectors.Db2,''Type=Microsoft.HostIntegration.DrdaClient.DrdaException,Message=An unexpected token "END-OF-STATEMENT , STR.DFLT_FAC_NBR " was found following "". Expected tokens may include: "". SQLSTATE=42601 SQLCODE=-104,Source=Microsoft.HostIntegration.Drda.Requester,'

After pasting the query into a text editor and examining each character in hex mode, no, no there are no "weird" unprintable characters hiding in my query. I mean, the only thing weird about the fourth line is that I have a comment there. But that syntax is totally valid for DB2...

Oh gentle reader, I'm sure there's a page somewhere in documentation that indicates carriage returns/line feeds are stripped from queries when subsmitted at least for Db2 connectors but I'd really like to stop running into surprises with Azure Data Factory. At least, that's my best guess as to what is happeneing because BEHOLD, this query works

SELECT 
    STR.CPY_CD
,   STR.STR_CD
,   STR.DFLT_FAC_NBR /* I think this is division number */
,   current date || 'T' || VARCHAR_FORMAT(CURRENT TIMESTAMP, 'HH24:MI:SS')  AS CurrentDateTime
FROM 
    FOO.CUSTMASTER AS STR
FOR FETCH ONLY WITH UR

Why do I think the CR/LFs are stripped? Because if I run this query in a tool like DBeaver, it fails and you won't believe what the error message is

SELECT     STR.CPY_CD,   STR.STR_CD,   STR.DFLT_FAC_NBR -- I think this is division number

SQL Error [42601]: An unexpected token "END-OF-STATEMENT" was found following ", STR.DFLT_FAC_NBR". Expected tokens may include: "".. SQLCODE=-104, SQLSTATE=42601, DRIVER=4.31.10

So, the moral of the story? Always use the /**/ comment syntax in your queries. Not just in ADF but everywhere because you never know when formatting is going to get eaten.

Monday, September 11, 2023

Difference between SparkSQL and TSQL casts

Yet another thing that has bitten me working in SparkSQL in Databricks---this time it's data types.

In SQL Server, a tinyint ranges from 0 to 255 but both of them allow for 256 total values. If you attempt to cast a value that doesn't fit in that range, you're going to raise an error.

SELECT 256 AS x, CAST(256 AS tinyint) AS boom

Msg 220, Level 16, State 2, Line 1
Arithmetic overflow error for data type tinyint, value = 256.

The range for a tinyint is -128 to 127 in SparkSQL - still 256 total values. Docs call it out as well ByteType: Represents 1-byte signed integer numbers. The range of numbers is from -128 to 127 SELECT CAST('128' AS tinyint) AS WhereIsTheBoom, CAST(128 AS tinyint) As WhatIsThisNonsense Here I select the value 128 as both a string and a number. I honestly have no idea how to interpret these results. A cast from string behaves more like a TRY_CAST but numeric overflows just cycle?

Yeah, the cycle seems to be the thing as SELECT CAST(129 as tinyint) AS Negative127 is -127.

Friday, September 8, 2023

SparkSQL Databricks [INVALID_USAGE_OF_STAR_OR_REGEX] Invalid usage of '*' in expression `alias`.

SparkSQL Databricks Error in SQL statement: AnalysisException: [INVALID_USAGE_OF_STAR_OR_REGEX] Invalid usage of '*' in expression `alias`.

Hi, it's me. I'm the problem

Dear self, when you recieve the following error Error in SQL statement: AnalysisException: [INVALID_USAGE_OF_STAR_OR_REGEX] Invalid usage of '*' in expression `alias` writing new-to-you sparksql and assuming the TSQL construct you know works just doesn't translate, take a good look at the syntax because I bet you've doubled up the FROM, again!
SELECT * FROM FROM uc.schema.table AS X WHERE X.col1 = 0

Make that
SELECT * FROM uc.schema.table AS X WHERE X.col1 = 0

Wednesday, September 6, 2023

ADF ForEach Activity

This occasional series of posts I'm calling Notes from the field is primarily for my benefit as I don't work with Azure Data Factory with any regularity and have no muscle memory in it. And the amount of sharp edges and non-intuitive baked in the product is quite frustrating for me. If you derive value from it, all the better.

The Azure Data Factory ForEach Activity shreds some enumerable thing but here I want to reference a Lookup to ForEach.

When I click on the ForEach Activity, in the "Items" section. I click "Add dynamic content" and am presented with these wonderful choices (given that my canvas consists of a Lookup Activity named "LKP_ReferenceTables" and the ForEach wired as a successor:

My options are

LKP_ReferenceTables: LKP_ReferenceTables activity output
LKP_ReferenceTables: LKP_ReferenceTables pipeline return value
LKP_ReferenceTables count: Count of the rows
LKP_ReferenceTables value array: Array of row data

and if I pick "activity output" the dyanmic expression is @activity('LKP_ReferenceTables').output WILLIAM! DO NOT CLICK OK, yet. If you accept the default code, you're going to get a lovely error in the form of Operation on target FE_ReferenceTables failed: The function 'length' expects its parameter to be an array or a string. The provided value is of type 'Object' You need to add the ".value" property to the expression to get the actual stuff you want @activity('LKP_ReferenceTables').output.value

Orrrr, I pick Array of row data and that generates the expected @activity('LKP_ReferenceTables').output.value Le sigh

Wednesday, May 3, 2023

Formatting a date in SQL Server with the Z indicator

It seems so easy, I was building json in SQL Server and the date format for the API specified it needed to have 3 millsecond digits and the zulu timezone signifier. Easy peasy, lemon squeezey, that is ISO8601 with time zone Z format code 127

SELECT CONVERT(char(24), GETDATE(), 127) AS waitAMinute; Running that query yields something like 2023-05-02T10:47:18.850 Almost there but where's my Z? Hmmm, maybe it's because I need to put this into UTC? SELECT CONVERT(char(24), GETUTCDATE(), 127) AS SwingAndAMiss;

Running that query yields something like 2023-05-02T15:47:18.850 It's in UTC but still no timezone indicator. I guess I can try an explict conversion to a datetimezone and then convert to 127.

SELECT CONVERT(char(24), CAST(GETUTCDATE() AS datetimeoffset) , 127) AS ThisIsGettingRidiculous , CAST(GETUTCDATE() AS datetimeoffset) AS ControlValue;

Once again, that query yields something like 2023-05-02T15:47:18.850 and I can confirm the ControlValue aka unformatted looks like 2023-05-02 15:47:850.7300000 +00:00 We have timezone info, just not the way I need it.

Back to the documentation, let's ready those pesky footnotes.

8 Only supported when casting from character data to datetime or smalldatetime. When casting character data representing only date or only time components to the datetime or smalldatetime data types, the unspecified time component is set to 00:00:00.000, and the unspecified date component is set to 1900-01-01. 9 Use the optional time zone indicator Z to make it easier to map XML datetime values that have time zone information to SQL Server datetime values that have no time zone. Z indicates time zone at UTC-0. The HH:MM offset, in the + or - direction, indicates other time zones. For example: 2022-12-12T23:45:12-08:00.

Those don't appply to me...Oh, wait, they do. In the Z notation is only used for converting stringified dates into native datetime types. There is no cast and convert style code to output a formated ISO 8601 date with the Z indicator.

So what do you do? Much of the internet just proposes using string concatenation to append the Z onto the string and move on. And that's what a rational developer would do but I am not one of those people.

Solution

If you want to get a well formatted ISO8601 with the time zone Z indicator, the one-stop-shop in SQL Server will be the FORMAT function because you can do anything there!

SELECT FORMAT(CAST(D.val AS datetimeoffset), 'yyyy-MM-ddThh:mm:ss.fffZ') AS WinnerWinnerChickenDinner , FORMAT(CAST(D.val AS datetime2(3)), 'yyyy-MM-ddThh:mm:ss.fffZ') AS OrThis FROM ( VALUES ('2023-05-02T15:47:18.850Z') )D(val);

The final thing to note, the return type of FORMAT is different. It defaults to nvarchar(4000) whereas lame string concatenation yields us the right length (24) but the wrong type as concatenation changed our char to varchar. If we were storing this to a table, I'd add a final explicit cast, in either case, to be char(24). There's no unicode values to worry about nor will it ever be shorter than 24 characters.

SELECT DEDFRS.name, DEDFRS.system_type_name FROM sys.dm_exec_describe_first_result_set ( N'SELECT FORMAT(CAST(D.val AS datetimeoffset), ''yyyy-MM-ddThh:mm:ss.fffZ'') AS LookMaUnicode , CONVERT(char(23), CAST(D.val AS datetimeoffset), 127) + ''Z'' AS StillVarChar FROM ( VALUES (''2023-05-02T15:47:18.850Z'') )D(val);' , N'' , 1) AS DEDFRS;

Filed under, "I blogged about it, hopefully I'll remember the solution"

Pages

World of Whatever

Find ramblings

Tuesday, September 26, 2023

Databricks sparksql concat is not your SQL Server concat

Databricks sparksql concat is not your SQL Server concat

Thursday, September 14, 2023

Databricks sparksql escaping quote/tick

TSQL

Databricks unity catalog

Tuesday, September 12, 2023

ADF, DB2 and unexpected token query error

ADF, DB2 and unexpected token query error

Monday, September 11, 2023

Difference between SparkSQL and TSQL casts

Friday, September 8, 2023

SparkSQL Databricks [INVALID_USAGE_OF_STAR_OR_REGEX] Invalid usage of '*' in expression `alias`.

SparkSQL Databricks Error in SQL statement: AnalysisException: [INVALID_USAGE_OF_STAR_OR_REGEX] Invalid usage of '*' in expression `alias`.

Wednesday, September 6, 2023

ADF ForEach Activity

ADF ForEach Activity

Wednesday, May 3, 2023

Formatting a date in SQL Server with the Z indicator

Formatting a date in SQL Server with the Z indicator

Solution