World of Whatever

A blog about SQL Server, SSIS, C# and whatever else I happen to be dealing with in my professional life.

Tuesday, August 14, 2018

A date dimension for SQL Server

The most common table you will find in a data warehouse will be the date dimension. There is no "right" implementation beyond what the customer needs to solve their business problem. I'm posting a date dimension for SQL Server that I generally find useful as a starting point in the hopes that I quit losing it. Perhaps you'll find it useful or can use the approach to build one more tailored to your environment.

As the comments indicate, this script creates a DW schema and a DW.DimDate table, then populates the date dimension from 1900-01-01 through 2079-06-06, endpoints inclusive. I also patch in 9999-12-31 as a well-known "unknown" date value. Sure, it's odd to have an incomplete year - this is your opportunity to tune the supplied code ;)

-- At the conclusion of this script, there will be
-- A schema named DW
-- A table named DW.DimDate
-- DW.DimDate will be populated with all the days between 1900-01-01 and 2079-06-06 (inclusive)
--   and the sentinel date of 9999-12-31

IF NOT EXISTS
(
    SELECT * FROM sys.schemas AS S WHERE S.name = 'DW'
)
BEGIN
    EXECUTE('CREATE SCHEMA DW AUTHORIZATION dbo;');
END
GO
IF NOT EXISTS
(
    SELECT * FROM sys.schemas AS S INNER JOIN sys.tables AS T ON T.schema_id = S.schema_id
    WHERE S.name = 'DW' AND T.name = 'DimDate'
)
BEGIN
    CREATE TABLE DW.DimDate
    (
        DateSK int NOT NULL
    ,   FullDate date NOT NULL
    ,   CalendarYear int NOT NULL
    ,   CalendarYearText char(4) NOT NULL
    ,   CalendarMonth int NOT NULL
    ,   CalendarMonthText varchar(12) NOT NULL
    ,   CalendarDay int NOT NULL
    ,   CalendarDayText char(2) NOT NULL
    ,   CONSTRAINT PK_DW_DimDate
            PRIMARY KEY CLUSTERED
            (
                DateSK ASC
            )
            WITH (ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, DATA_COMPRESSION = PAGE)
    ,   CONSTRAINT UQ_DW_DimDate UNIQUE (FullDate)
    );
END
GO
WITH 
    -- Define the start and the terminal value
    BOOKENDS(FirstDate, LastDate) AS (SELECT DATEFROMPARTS(1900,1,1), DATEFROMPARTS(9999,12,31))
    -- Itzik Ben-Gan style rapid number generator
    -- Builds 65537 rows. Need more? Follow the pattern.
    -- Need fewer rows? Add a TOP below.
,    T0 AS 
(
    -- 2
    SELECT 1 AS n
    UNION ALL SELECT 1
)
,    T1 AS
(
    -- 2^2 => 4 
    SELECT 1 AS n
    FROM
        T0
        CROSS APPLY T0 AS TX
)
,    T2 AS 
(
    -- 4^2 => 16
    SELECT 1 AS n
    FROM
        T1
        CROSS APPLY T1 AS TX
)
,    T3 AS 
(
    -- 16^2 => 256
    SELECT 1 AS n
    FROM
        T2
        CROSS APPLY T2 AS TX
)
,    T4 AS
(
    -- 256^2 => 65536
    -- or approximately 179 years of days
    SELECT 1 AS n
    FROM
        T3
        CROSS APPLY T3 AS TX
)
,    T5 AS
(
    -- 65536^2 => 4294967296, basically infinity for our purposes
    SELECT 1 AS n
    FROM
        T4
        CROSS APPLY T4 AS TX
)
    -- Assume we now have enough numbers for our purpose
,    NUMBERS AS
(
    -- Add a SELECT TOP (N) here if you need fewer rows
    SELECT
        CAST(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS int) -1 AS number
    FROM
        T4
    UNION 
    -- Build End of time date
    -- Get an N value of 2958463 for
    -- 9999-12-31 assuming start date of 1900-01-01
    SELECT
        ABS(DATEDIFF(DAY, BE.LastDate, BE.FirstDate))
    FROM
        BOOKENDS AS BE
)
, DATES AS
(
SELECT
    PARTS.DateSk
,   FD.FullDate
,   PARTS.CalendarYear
,   PARTS.CalendarYearText
,   PARTS.CalendarMonth
,   PARTS.CalendarMonthText
,   PARTS.CalendarDay
,   PARTS.CalendarDayText
FROM
    NUMBERS AS N
    CROSS APPLY
    (
        SELECT
            DATEADD(DAY, N.number, BE.FirstDate) AS FullDate
        FROM
            BOOKENDS AS BE
    )FD
    CROSS APPLY
    (
        SELECT
            CAST(CONVERT(char(8), FD.FullDate, 112) AS int) AS DateSk
        ,   DATEPART(YEAR, FD.FullDate) AS [CalendarYear] 
        ,   DATENAME(YEAR, FD.FullDate) AS [CalendarYearText]
        ,   DATEPART(MONTH, FD.FullDate) AS [CalendarMonth]
        ,   DATENAME(MONTH, FD.FullDate) AS [CalendarMonthText]
        ,   DATEPART(DAY, FD.FullDate)  AS [CalendarDay]
        ,   DATENAME(DAY, FD.FullDate) AS [CalendarDayText]

    )PARTS
)
INSERT INTO
    DW.DimDate
(
    DateSK
,   FullDate
,   CalendarYear
,   CalendarYearText
,   CalendarMonth
,   CalendarMonthText
,   CalendarDay
,   CalendarDayText
)
SELECT
    D.DateSk
,   D.FullDate
,   D.CalendarYear
,   D.CalendarYearText
,   D.CalendarMonth
,   D.CalendarMonthText
,   D.CalendarDay
,   D.CalendarDayText
FROM
    DATES AS D
WHERE NOT EXISTS
(
    SELECT * FROM DW.DimDate AS DD
    WHERE DD.DateSK = D.DateSk
);
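As a quick sanity check after the load (not part of the original script), the endpoints and row count should come back as 1900-01-01, 9999-12-31 and 65537:

SELECT
    MIN(DD.FullDate) AS FirstDate
,   MAX(DD.FullDate) AS LastDate
,   COUNT_BIG(1) AS TotalRows
FROM
    DW.DimDate AS DD;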

Thursday, April 5, 2018

Sort SQL Server tables into similarly sized buckets

You need to do something to all of the tables in SQL Server. That something can be anything: reindex/reorg, export the data, perform some other maintenance - it really doesn't matter. What does matter is that you'd like to get it done sooner rather than later. If time were no consideration, you'd likely just do one table at a time until you've done them all. Sometimes, though, a maximum degree of parallelism of one is less than ideal. You're paying for more than one processor core; you might as well use them. The devil in splitting a workload out is ensuring the tasks are well balanced. When I'm staging data in SSIS, I often use row count as an approximation for time cost. It's not perfect - a million row table that is 430 columns wide might actually take longer than a 250 million row key-value table.

A sincere tip of the hat to Daniel Hutmacher (b|t) for his answer on this StackExchange post. He has some great logic for sorting tables into approximately equally sized bins, and it performs reasonably well.

SET NOCOUNT ON;
DECLARE
    @bucketCount tinyint = 6;

IF OBJECT_ID('tempdb..#work') IS NOT NULL
BEGIN
    DROP TABLE #work;
END

CREATE TABLE #work (
    _row    int IDENTITY(1, 1) NOT NULL,
    [SchemaName] sysname,
    [TableName] sysname,
    [RowsCounted]  bigint NOT NULL,
    GroupNumber     int NOT NULL,
    moved   tinyint NOT NULL,
    PRIMARY KEY CLUSTERED ([RowsCounted], _row)
);

WITH cte AS (
SELECT B.RowsCounted
,   B.SchemaName
,   B.TableName
    FROM
    (
        SELECT
            s.[Name] as [SchemaName]
        ,   t.[name] as [TableName]
        ,   SUM(p.rows) as [RowsCounted]
        FROM
            sys.schemas s
            LEFT OUTER JOIN 
                sys.tables t
                ON s.schema_id = t.schema_id
            LEFT OUTER JOIN 
                sys.partitions p
                ON t.object_id = p.object_id
            LEFT OUTER JOIN  
                sys.allocation_units a
                ON p.partition_id = a.container_id
        WHERE
            p.index_id IN (0,1)
            AND p.rows IS NOT NULL
            AND a.type = 1
        GROUP BY 
            s.[Name]
        ,   t.[name]
    ) B
)

INSERT INTO #work ([RowsCounted], SchemaName, TableName, GroupNumber, moved)
SELECT [RowsCounted], SchemaName, TableName, ROW_NUMBER() OVER (ORDER BY [RowsCounted]) % @bucketCount AS GroupNumber, 0
FROM cte;


WHILE (@@ROWCOUNT!=0)
WITH cte AS
(
    SELECT
        *
    ,   SUM(RowsCounted) OVER (PARTITION BY GroupNumber) - SUM(RowsCounted) OVER (PARTITION BY (SELECT NULL)) / @bucketCount AS _GroupNumberoffset
    FROM
        #work
)
UPDATE
    w
SET
    w.GroupNumber = (CASE w._row
                 WHEN x._pos_row THEN x._neg_GroupNumber
                 ELSE x._pos_GroupNumber
             END
            )
,   w.moved = w.moved + 1
FROM
    #work AS w
    INNER JOIN
    (
        SELECT TOP 1
            pos._row AS _pos_row
        ,   pos.GroupNumber AS _pos_GroupNumber
        ,   neg._row AS _neg_row
        ,   neg.GroupNumber AS _neg_GroupNumber
        FROM
            cte AS pos
            INNER JOIN
                cte AS neg
                ON pos._GroupNumberoffset > 0
                   AND neg._GroupNumberoffset < 0
                   AND
            --- To prevent infinite recursion:
            pos.moved < @bucketCount
                   AND neg.moved < @bucketCount
        WHERE --- must improve positive side's offset:
            ABS(pos._GroupNumberoffset - pos.RowsCounted + neg.RowsCounted) <= pos._GroupNumberoffset
            AND
            --- must improve negative side's offset:
            ABS(neg._GroupNumberoffset - neg.RowsCounted + pos.RowsCounted) <= ABS(neg._GroupNumberoffset)
        --- Largest changes first:
        ORDER BY
            ABS(pos.RowsCounted - neg.RowsCounted) DESC
    ) AS x
    ON w._row IN
       (
           x._pos_row
       ,   x._neg_row
       );

Now what? Let's look at the results. Run this against AdventureWorks and AdventureWorksDW.

SELECT
    W.GroupNumber
,   COUNT_BIG(1) AS TotalTables
,   SUM(W.RowsCounted) AS GroupTotalRows
FROM
    #work AS W
GROUP BY
    W.GroupNumber
ORDER BY
    W.GroupNumber;


SELECT
    W.GroupNumber
,   W.SchemaName
,   W.TableName
,   W.RowsCounted
,   COUNT_BIG(1) OVER (PARTITION BY W.GroupNumber ORDER BY (SELECT NULL)) AS TotalTables
,   SUM(W.RowsCounted) OVER (PARTITION BY W.GroupNumber ORDER BY (SELECT NULL)) AS GroupTotalRows
FROM
    #work AS W
ORDER BY
    W.GroupNumber;

For AdventureWorks (2014), I get a nice distribution across my 6 groups: 12 to 13 tables in each bucket and a group total row count between 125,777 and 128,003. That's less than 2% variance between the high and low - I'll take it.

If you rerun for AdventureWorksDW, it's a little more interesting. Our 6 groups are again filled with 5 to 6 tables but this time, group 1 is heavily skewed by the fact that FactProductInventory accounts for 73% of all the rows in the entire database. The other 5 tables in the group are the five smallest tables in the database.

I then ran this against our data warehouse-like environment. We had 1,206 tables in there holding 3,283,983,766 rows (roughly 3.3 billion). The query went from instantaneous to about 15 minutes, but now I've got a starting point for bucketing my tables into similarly sized groups.
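For what it's worth, here's a minimal sketch of how I'd consume the buckets: hand each worker a group number between 0 and @bucketCount - 1 and let it work through its own list. The @GroupNumber variable here is purely illustrative; plug in however your process doles out work.

DECLARE @GroupNumber int = 0;

SELECT
    W.SchemaName
,   W.TableName
,   W.RowsCounted
FROM
    #work AS W
WHERE
    W.GroupNumber = @GroupNumber
ORDER BY
    W.RowsCounted DESC;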

What do you think? How do you plan to use this? Do you have a different approach for figuring this out? I looked at R but without knowing what this activity is called, I couldn't find a function to perform the calculations.

Monday, March 12, 2018

2018 MVP Summit retrospective

Another year of the MVP Summit is in the bag and, as always, I have months' worth of learning I'm excited to do.

Thank you

I'd like to extend a hearty thank you to Microsoft and the various teams for hosting us. I can't imagine the sheer number of hours spent in preparation, the actual time not-spent-working-on-technology-X, much less the expense of caffeinating, feeding, lodging, and transporting us.

What I'm excited about

Stream Analytics

We have a high-performance messaging system (60M messages per day, averaging 130ms throughput) that allows us to expose mainframe data as a SQL Server database for analytics. The devil with Service Broker is that there's no built-in monitoring. We have a clever dashboard built on the Power BI streaming dataset source that provides an at-a-glance health check for data processing. What we need, though, is something that can drive action based on changes. The September changes in Stream Analytics look like the perfect fit. They allow us to detect not just hard limits (we've violated our 3 second SLA) but also the squishier signals: a background process just woke up and swamped us with a million rows in the past three minutes, or processing time is trending upwards and someone needs to figure out why.

SQL Graph improvements

While we are not yet using graph features, I can see opportunities for them with our client, so I want to build some proof-of-concept models.

Cosmos DB

Alongside the Stream Analytics improvements, perhaps we need to feed the change data into Cosmos and then leverage the Change Feed support to push to analytics processing. And just generally, I need to invest some time in Apache Spark. I also learned that I don't need to discover all the patterns for lambda architecture as it's already out there with a handy URL to boot.

Cognitive Services

OK, picking up information about this was just to scratch a very silly itch, but I was impressed by how easy it was from the web interface. I have bird feeders, and even though most seed will state that squirrels are not interested in it, that's a downright lie.

Don't mind me, I'm just a fuzzy bird

I want a camera pointed at my bird feeder and if a squirrel shows, I want to know about it. I used about a dozen pictures of my bird feeders with and without my nemesis to train the model and then fed back assorted photos to see how smart it was. Except for an image of a squirrel hiding in shadow, it was able to give me high confidence readings on what was featured in the photo. Here we can see that my dog is neither a bird nor a squirrel.
Not a squirrel, just a lazy dog

I'm so excited to get these bots built out: one for the Raspberry Pi to detect presence at the feeder and an Azure-based recognizer for friend versus foe. Once that's done, the next phase will be to identify specific bird species. And then tie it to feed type and feeder style (tray/platform versus house versus tube) and time of day and ... yes, lots of fun permutations that are easily available without having to learn all the computer vision and statistics. Feel free to give it a whirl at https://customvision.ai

SQLOps studio

This is the new cross-platform SQL Server Management Studio replacement - sort of. It's not designed to do everything SSMS does; instead, the vision is to solve the most needed problems and, with the open source model, let the community patch in their own solutions. I'm excited to put together a better reporting interface for the SSISDB. Something you can actually copy text out of - how crazy is that?

Azure Data Lake Analytics

It had been a year since I had worked with ADLA/U-SQL, so it was good to get back into the language and environment. I need to get on a project that actually uses the technology, though, to really cement my knowledge.

What I learned

In October of 2016, I launched Sterling Data Consulting as my company. I sub under a good friend and it's been an adventure running a business, but I don't feel like I'm really running a business since I have no other business. One of my TODOs at the conference was to talk to other small shop owners to see if I could discover their "secret sauce." While I got assorted feedback, the two I want to send a special thank you to are John Sterrett of Procure SQL and Tim Radney. Their advice ranged from the straightforward ("I don't know what you do," "are you for hire?") to thoughts on lead acquisition and my lack of a vision for sales.

Tim was also my roommate and it was great just getting to know him. We traded Boy Scout leader stories, and he had excellent ideas for High Adventure fundraisers since that's something our troop is looking to do next year. For being a year younger than me, he sure had a lot more wisdom on the things I don't do or don't do well. You should check him out at the Atlanta SQL Saturday and attend his precon on common SQL Server mistakes and how to avoid them.

Photos

Bellevue is less scenic than Seattle, but the sunshine and warmth on Tuesday made for some nice photos of the treehouses. Yes, the Microsoft campus has adult-sized treehouses. How cool is that?

Friday, March 2, 2018

Python pandas repeating character tester

At one of our clients, we are profiling data. They have a mainframe that's been running for so long, they no longer have SMEs for their data. We've been able to leverage Service Broker to provide a real-time (under 3 seconds) remote file store for their data. It's pretty cool, but now they are trying to do something with the data, so we need to understand what the data looks like. We're using a mix of TSQL and python to understand nullability, value variances, etc. One of the "interesting" things we've discovered is that they loved placeholder values. Everyone knows a date of 88888888 is a placeholder for the actual date, which they'll get two steps later in the workflow. Except sometimes we use 99999999, because the eights are the placeholder for the time.

Initially, we were just searching for one sentinel value, then two values, until we saw the bigger pattern of "repeated values probably mean something." For us, this matters because we then need to discard those rows when testing data type suitability. 88888888 isn't a valid date, so our logic might determine that the column is best served by a numeric data type. Unless we exclude the eights value, in which case we get a 100% match rate on the column's ability to be converted to a date.
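Since the profiling is a mix of TSQL and python, here's a rough T-SQL equivalent of the same repeated-character test (the sample values below are made up for illustration): compare the value to its first character replicated out to the value's length.

SELECT
    D.SourceColumn
,   CASE
        -- NULLs fall through to 0, mirroring the python ternary below
        WHEN D.SourceColumn = REPLICATE(LEFT(D.SourceColumn, 1), LEN(D.SourceColumn)) THEN 1
        ELSE 0
    END AS IsRepeated
FROM
(
    VALUES
        ('88888888')
    ,   ('99999999')
    ,   ('20180302')
    ,   (NULL)
)D(SourceColumn);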

How can we determine if a string is nothing but repeated values in python? There's a very clever test from StackOverflow:

source == source[0] * len(source)

I would read that as "is the source variable exactly equal to the first character of source repeated for the length of source?"

And that was good, until we hit a NULL (None in python-speak). We then took advantage of the ternary equivalent in python to make it

(source == source[0] * len(source)) if source else False

Enter Pandas (series)

Truth is a funny thing in a pandas Series. Really, it is: "The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()." We were trying to apply the above function as we were doing everything else:

# Column-wise operations like this work fine
df.MyColumn.str.len()
# but this will fail magnificently with the "truth value is ambiguous" error
(df.MyColumn == df.MyColumn[0] * len(df.MyColumn)) if df.MyColumn else False

It took me a while since I hadn't really used the pandas library beyond running what my coworker had done. What I needed to do was get a row context in which to apply the true/false calculation. As it stands, the Series wants to try to aggregate the booleans or something like that. And it makes sense from a SQL perspective: you can't really apply aggregates to bit fields (beyond COUNT).

So, what's the solution? As always, finding it is a matter of searching for the exact thing you're looking for. In this case, apply was the keyword.

df.MyColumn.apply(lambda source: (source == source[0] * len(source)) if source else False)

Full code you can play with would be

import pandas
import pprint

def isRepeated(src):
    return (src == src[0] * len(src)) if src else False
    
df = pandas.DataFrame({"MyCol":pandas.Series(['AB', 'BC', 'BB', None])})

pprint.pprint(df)

print()
# What rows have the same character in all of them?

pprint.pprint(df.MyCol.apply(lambda source:(source == source[0] * len(source)) if source else False))
#If you'd like to avoid the anonymous function...
pprint.pprint(df.MyCol.apply(isRepeated))

In short, python is glorious and I'm happy to be writing in it again ;)


Friday, February 23, 2018

Pop Quiz - REPLACE in SQL Server

It's amazing the things I've run into with SQL Server this week that I never noticed before. In today's pop quiz, let's look at REPLACE.

DECLARE
    @Repro table
(
    SourceColumn varchar(30)
);

INSERT INTO 
    @Repro
(
    SourceColumn
)
SELECT
    D.SourceColumn
FROM
(
    VALUES 
        ('None')
    ,   ('ABC')
    ,   ('BCD')
    ,   ('DEF')
)D(SourceColumn);

SELECT
    R.SourceColumn
,   REPLACE(R.SourceColumn, 'None', NULL) AS wat
FROM
    @Repro AS R;

In the preceding example, I load 4 rows into a table variable and call the REPLACE function on the column. Why? Because some numbskull front-end developer entered None instead of a NULL for a non-existent value. No problem, I will simply replace all None values with NULL. So, what's the value of the wat column?

Well, if you're one of those people who reads instruction manuals before attempting anything, you'd have seen "Returns NULL if any one of the arguments is NULL." Otherwise, you're like me, thinking "maybe I put the arguments in the wrong order." Nope: REPLACE(R.SourceColumn, 'None', '') AS EmptyString works. So what the heck? Guess I'll actually read the manual... No, it's fine, I can just use NULLIF to turn the empty strings into a NULL: NULLIF(REPLACE(R.SourceColumn, 'None', ''), '') AS EmptyStringToNull
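Putting those attempts side by side in one query (same @Repro table variable and column aliases as above):

SELECT
    R.SourceColumn
,   REPLACE(R.SourceColumn, 'None', NULL) AS wat
,   REPLACE(R.SourceColumn, 'None', '') AS EmptyString
,   NULLIF(REPLACE(R.SourceColumn, 'None', ''), '') AS EmptyStringToNull
FROM
    @Repro AS R;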

Much better: replace all my instances of None with an empty string and then convert anything that is an empty string to NULL. Wait, what? You know what would be better? Skipping the REPLACE call altogether.

SELECT
    R.SourceColumn
,   NULLIF(R.SourceColumn, 'None') AS MuchBetter
FROM
    @Repro AS R;

Moral of the story and/or quiz: once you have a working solution, rubber duck out your approach to see if there's an opportunity for improvement (only after having committed the working version to source control).

Thursday, February 22, 2018

Altering table types, part 2

Altering table types - a compatibility guide

In yesterday's post, I altered a table type. Pray I don't alter them further. What else is incompatible with an integer column? It's just morbid curiosity at this point, as I don't recall ever having seen this after working with SQL Server for 18 years. Side note: dang, I'm old.

How best to answer the question? By interrogating the sys.types catalog view and throwing ALTER operations against the wall to see what does and doesn't stick.

DECLARE
    @Results table
(
    TypeName sysname, Failed bit, ErrorMessage nvarchar(4000)
);

DECLARE
    @DoOver nvarchar(4000) = N'DROP TABLE IF EXISTS dbo.IntToTime;
CREATE TABLE dbo.IntToTime (CREATE_TIME int);'
,   @alter nvarchar(4000) = N'ALTER TABLE dbo.IntToTime ALTER COLUMN CREATE_TIME @type'
,   @query nvarchar(4000) = NULL
,   @typeName sysname = 'datetime';

DECLARE
    CSR CURSOR
FORWARD_ONLY
FOR
SELECT 
    T.name
FROM
    sys.types AS T
WHERE
    T.is_user_defined = 0

OPEN CSR;
FETCH NEXT FROM CSR INTO @typeName
WHILE @@FETCH_STATUS = 0
BEGIN
    BEGIN TRY   
        EXECUTE sys.sp_executesql @DoOver, N'';
        SELECT @query = REPLACE(@alter, N'@type', @typeName);
        EXECUTE sys.sp_executesql @query, N'';
        
        INSERT INTO
            @Results
        (
            TypeName
        ,   Failed
        ,   ErrorMessage
        )
        SELECT @typeName, CAST(0 AS bit), ERROR_MESSAGE();
    END TRY
    BEGIN CATCH
        INSERT INTO
            @Results
        (
            TypeName
        ,   Failed
        ,   ErrorMessage
        )
        SELECT @typeName, CAST(1 AS bit), ERROR_MESSAGE()
    END CATCH
    FETCH NEXT FROM CSR INTO @typeName
END
CLOSE CSR;
DEALLOCATE CSR;

SELECT
*
FROM
    @Results AS R
ORDER BY
    2,1;
TypeName          Failed  ErrorMessage
bigint            0
binary            0
bit               0
char              0
datetime          0
decimal           0
float             0
int               0
money             0
nchar             0
numeric           0
nvarchar          0
real              0
smalldatetime     0
smallint          0
smallmoney        0
sql_variant       0
sysname           0
tinyint           0
varbinary         0
varchar           0
date              1       Operand type clash: int is incompatible with date
datetime2         1       Operand type clash: int is incompatible with datetime2
datetimeoffset    1       Operand type clash: int is incompatible with datetimeoffset
geography         1       Operand type clash: int is incompatible with geography
geometry          1       Operand type clash: int is incompatible with geometry
hierarchyid       1       Operand type clash: int is incompatible with hierarchyid
image             1       Operand type clash: int is incompatible with image
ntext             1       Operand type clash: int is incompatible with ntext
text              1       Operand type clash: int is incompatible with text
time              1       Operand type clash: int is incompatible with time
timestamp         1       Cannot alter column 'CREATE_TIME' to be data type timestamp.
uniqueidentifier  1       Operand type clash: int is incompatible with uniqueidentifier
xml               1       Operand type clash: int is incompatible with xml
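
If you only want to see the incompatible types, a filtered variant of that final SELECT (run in the same batch, since @Results is a table variable) gets straight to the failures:

SELECT
    R.TypeName
,   R.ErrorMessage
FROM
    @Results AS R
WHERE
    R.Failed = 1
ORDER BY
    R.TypeName;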

Wednesday, February 21, 2018

Pop quiz - altering column types

Pop quiz

Given the following DDL

CREATE TABLE dbo.IntToTime
(
    CREATE_TIME int
);

What will be the result of issuing the following command?

ALTER TABLE dbo.IntToTime ALTER COLUMN CREATE_TIME time NULL;

Clearly, if I'm asking, it's not what you might expect. How can an empty table not allow you to change data types? Well, it seems time and datetime2 are special cases, as they'll raise errors of the form

Msg 206, Level 16, State 2, Line 47 Operand type clash: int is incompatible with time

If you're in this situation and need to get the type converted, you'll need to make two hops: one to varchar and then to time.

ALTER TABLE dbo.IntToTime ALTER COLUMN CREATE_TIME varchar(10) NULL;
ALTER TABLE dbo.IntToTime ALTER COLUMN CREATE_TIME time NULL;
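
For completeness, the datetime2 case behaves the same way when starting from a fresh int column. A minimal repro of the error described above:

DROP TABLE IF EXISTS dbo.IntToTime;
CREATE TABLE dbo.IntToTime
(
    CREATE_TIME int
);

-- Raises Msg 206: Operand type clash: int is incompatible with datetime2
ALTER TABLE dbo.IntToTime ALTER COLUMN CREATE_TIME datetime2(7) NULL;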