A blog about SQL Server, SSIS, C# and whatever else I happen to be dealing with in my professional life.

Find ramblings

Tuesday, September 26, 2023

Databricks sparksql concat is not your SQL Server concat

Databricks sparksql concat is not your SQL Server concat

One of these is not like the other...

The concat function is super handy in the database world but be aware that the SQL Server one is way better because it solves two problems. It combines everything into a string and it does not require NULL checking. In the before times, one had to down cast to a n/var/char type as well as check for NULL before appending strings via the plus sign.

In Databricks CONCAT WILL ONLY TAKE CARE OF CASTING TO THE STRING TYPE. NULLS WILL CONTINUE TO BITE YOU IN THE BUTTOCKS.

Given the following example query, we generate two rows in a derived table where the col2 value is either true (boolean 1) or NULL. In the LEFT JOIN LATERAL, which is the Databricks CROSS APPLY equivalent, I concat the 3 columns together with a pipe as separator and behold, my decidedly different results from a SQL Server expectation.

SELECT
*
FROM
(
SELECT * FROM VALUES (1, true, 'B')
UNION ALL SELECT * FROM VALUES (2, NULL, 'C')
)AS X(col1, col2, col3)
LEFT JOIN LATERAL
(
  SELECT concat(X.col1, '|', X.col2, '|', X.col3)
)HK(hkey);

What do you do? You get to wrap every nullable column with a coalesce call. Except, coalesce requires the same datatypes (mostly) so a naive implmentation of

SELECT concat(X.col1, '|', coalesce(X.col2, ''), '|', X.col3)
will result in the following error
AnalysisException: [DATATYPE_MISMATCH.DATA_DIFF_TYPES] Cannot resolve "coalesce(outer(X.col2), )" due to data type mismatch: Input to `coalesce` should all be the same type, but it's ("BOOLEAN" or "STRING")

Instead, one needs to do something along the lines of

SELECT
*
FROM
(
SELECT * FROM VALUES (1, true, 'B')
UNION ALL SELECT * FROM VALUES (2, NULL, 'C')
)AS X(col1, col2, col3)
LEFT JOIN LATERAL
(
  SELECT concat(X.col1, '|', coalesce(concat(X.col2, ''),''), '|', X.col3)
)HK(hkey);

At least I can automate this pattern with the information_schema.columns


No comments: