World of Whatever

A blog about SQL Server, SSIS, C# and whatever else I happen to be dealing with in my professional life.

Thursday, October 18, 2018

Polling in SQL Agent

Polling in SQL Agent

A fun question over on StackOverflow asked about using SQL Agent with SSIS to poll for a file's existence. As the comments indicate, there's a non-zero startup time associated with SSIS (it must validate the metadata associated with the sources and destinations), but there is a faster, lighter-weight alternative. Putting together a host of TSQL ingredients, including undocumented extended stored procedures, the following recipe could be used as a SQL Agent job step.

If you copy and paste the following query into your favorite instance of SQL Server, it will run for one minute and then complete by printing the words "Naughty, naughty".

SET NOCOUNT ON;
-- http://www.patrickkeisler.com/2012/11/how-to-use-xpdirtree-to-list-all-files.html
DECLARE
    -- Don't do stupid things like adding spaces into folder names
    @sourceFolder varchar(260) = 'C:\ssisdata\Input'
    -- Have to use SQL matching rules, not DOS/SSIS
,   @fileMask sysname = 'SourceData%.txt'
    -- how long to wait between polling
,   @SleepInSeconds int = 5
    -- Don't exceed 24 hours aka 86400 seconds
,   @MaxTimerDurationInSeconds int = (3600 * 0) + (60 * 1) + 0
    -- parameter for xp_dirtree 0 => top folder only; 1 => subfolders
,   @depth int = 1
    -- parameter for xp_dirtree 0 => directory only; 1 => directory and files
,   @collectFile int = 1
,   @RC bigint = 0;

-- Create a table variable to capture the results of our directory command
DECLARE
    @DirectoryTree table
(
    id int IDENTITY(1, 1)
,   subdirectory nvarchar(512)
,   depth int
,   isFile bit
);

-- Use our sleep in seconds time to generate a delay time string
DECLARE
    @delayTime char(10) = CONVERT(char(10), TIMEFROMPARTS(@SleepInSeconds / 3600, (@SleepInSeconds / 60) % 60, @SleepInSeconds % 60, 0, 0), 108)
,   @stopDateTime datetime2(0) = DATEADD(SECOND, @MaxTimerDurationInSeconds, CURRENT_TIMESTAMP);

-- Force creation of the folder
EXECUTE dbo.xp_create_subdir @sourceFolder;

-- Load the results of our directory
INSERT INTO
    @DirectoryTree
(
    subdirectory
,   depth
,   isFile
)
EXECUTE dbo.xp_dirtree
    @sourceFolder
,   @depth
,   @collectFile;

-- Prime the pump
SELECT
    @RC = COUNT_BIG(1)
FROM
    @DirectoryTree AS DT
WHERE
    DT.isFile = 1
    AND DT.subdirectory LIKE @fileMask;

WHILE @RC = 0 AND @stopDateTime > CURRENT_TIMESTAMP
BEGIN

    -- Load the results of our directory
    INSERT INTO
        @DirectoryTree
    (
        subdirectory
    ,   depth
    ,   isFile
    )
    EXECUTE dbo.xp_dirtree
        @sourceFolder
    ,   @depth
    ,   @collectFile;

    -- Test for file existence
    SELECT
        @RC = COUNT_BIG(1)
    FROM
        @DirectoryTree AS DT
    WHERE
        DT.isFile = 1
        AND DT.subdirectory LIKE @fileMask;

    IF @RC = 0
    BEGIN
        -- Put our process to sleep for a period of time
        WAITFOR DELAY @delayTime;
    END
END

-- at this point, we have either exited due to file found or time expired
IF @RC > 0
BEGIN
    -- Take action when file was found
    PRINT 'Go run SSIS or something';
END
ELSE
BEGIN
    -- Take action for file not delivered in expected timeframe
    PRINT 'Naughty, naughty';
END

Rerun the above query and then, in a separate window (assuming you have xp_cmdshell enabled), fire the following query to create a file matching the expected pattern. This time, the polling query will print "Go run SSIS or something" instead.

DECLARE
    @sourceFolder varchar(260) = 'C:\ssisdata\Input'
,   @fileMask sysname = REPLACE('SourceData%.txt', '%', CONVERT(char(10), CURRENT_TIMESTAMP, 120))
DECLARE
    @command varchar(1000) = 'echo > ' + @sourceFolder + '\' + @fileMask;

-- If you get this error
--Msg 15281, Level 16, State 1, Procedure sys.xp_cmdshell, Line 1 [Batch Start Line 0]
--SQL Server blocked access to procedure 'sys.xp_cmdshell' of component 'xp_cmdshell' because this component is turned off as part of the security configuration for this server. A system administrator can enable the use of 'xp_cmdshell' by using sp_configure. For more information about enabling 'xp_cmdshell', search for 'xp_cmdshell' in SQL Server Books Online.
--
-- Run this (xp_cmdshell is an advanced option, so 'show advanced options' must be enabled first)
--EXECUTE sys.sp_configure 'show advanced options', 1;
--GO
--RECONFIGURE;
--GO
--EXECUTE sys.sp_configure 'xp_cmdshell', 1;
--GO
--RECONFIGURE;
--GO
EXECUTE sys.xp_cmdshell @command;

Once you're satisfied with how that works, now what? I'd likely set up a step 2 that does the actual running of the SSIS package (instead of printing a message). What about the case where a file wasn't found? I'd likely use THROW/RAISERROR, or just old-fashioned divide-by-zero, to force the first job step to fail, and then specify a reasonable @retry_attempts and @retry_interval on that step, roughly as sketched below.
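
As a minimal sketch of that idea (the job and step names below are hypothetical, and 50000 is just the first user-definable error number), the final IF/ELSE block could become something like this:

-- Fail the job step when no file arrived so SQL Agent's retry settings can take over
IF @RC > 0
BEGIN
    PRINT 'Go run SSIS or something';
END
ELSE
BEGIN
    THROW 50000, N'Expected file did not arrive within the polling window.', 1;
END

and the retry behavior then lives on the job step itself, for example:

-- Hypothetical job and step names; adjust to your own Agent job
EXECUTE msdb.dbo.sp_update_jobstep
    @job_name = N'Poll for SourceData file'
,   @step_id = 1
,   @retry_attempts = 3     -- rerun the polling step up to 3 more times
,   @retry_interval = 5;    -- minutes to wait between retries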

Tuesday, October 9, 2018

Biml Excel Data Source without Excel

Biml Excel Meta Data Source without Excel

In the previous post, Reading Excel files without Excel, I showed some simple code to consume Excel without having Excel installed on your machine. How (and why) would I use this? Well, look at the sample spreadsheet. That could be used quite nicely by a business analyst to generate SSIS packages. In fact, it is being used by a very savvy business analyst in one of my clients' shadow IT groups to identify the source data they'd like brought into their data mart. They are translating their mainframe data extracts into SQL equivalents and specifying where the data should land.

This is exciting for me as this team gets their data and knows the business problems they need to solve - they just didn't have all the tools to do so. They are supplying the data domain expertise and we are generating consistent packages that adhere to corporate standards (as well as defining the scheduling, alerting, etc). It's a good match.

My resources are quite simple: an Excel spreadsheet containing metadata, a driver program, and a package template.

The template is your standard truncate-and-reload pattern, with the target table specified by a parameter. The client validates data by running processes in parallel, so the existing mainframe process delivers data to the Billing table while ours delivers to a Billing_NEW table. Once they accept the new process, the target table becomes Billing and the NEW table is dropped. I decided the most native SSIS route would be to specify the target table as a parameter. We originally had a boolean parameter indicating whether we were loading the new table or the production one, but that was more logic and overhead than just specifying which table to load. I force their queries to be dirty reads as some of these queries can be rather messy.

<#@ template designerbimlpath="/Biml/Packages" #>
<#@ property name="schemaName" type="string" #>
<#@ property name="tableName" type="string" #>
<#@ property name="parameterName" type="string" #>
<#@ property name="sourceQuery" type="string" #>
<#@ property name="sourceConnectionName" type="string" #>
<#@ property name="targetConnectionName" type="string" #>
<#@ property name="businessFriendlyName" type="string" #>
<#
string packageName = string.Format("{0}_Load_{1}{2}", targetConnectionName.ToUpper(), businessFriendlyName, "");
CustomOutput.PackageName = packageName;
#>
        <Package Name="<#= packageName #>" ConstraintMode="Linear">
            <Parameters>
                <Parameter Name="TargetTableName" DataType="String"><#= tableName #></Parameter>
            </Parameters>
            <Variables>
                <Variable Name="SchemaName" DataType="String"><#= schemaName#></Variable>
                <Variable Name="TableName" DataType="String" EvaluateAsExpression="true"><#= parameterName #></Variable>
                <Variable Name="QualifiedTableName" DataType="String" EvaluateAsExpression="true">&quot;[&quot; +   @[User::SchemaName] + &quot;].[&quot; + @[User::TableName]+ &quot;]&quot;</Variable>
                <Variable Name="QueryTruncate" DataType="String" EvaluateAsExpression="true">"TRUNCATE TABLE " + @[User::QualifiedTableName] + ";"</Variable>
            </Variables>
            <Tasks>
                <ExecuteSQL Name="SQL Truncate Target" ConnectionName="<#= targetConnectionName #>">
                    <VariableInput VariableName="User.QueryTruncate" />
                </ExecuteSQL>
                <Dataflow Name="DFT Load <#= businessFriendlyName #>">
                    <Transformations>
                        <OleDbSource ConnectionName="<#= sourceConnectionName #>" Name="OLESRC Query">
                            <DirectInput><![CDATA[SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
<#= sourceQuery#>
]]>
                            </DirectInput>
                        </OleDbSource>
                        <OleDbDestination Name="OLEDST <#= schemaName #>_<#= tableName #>" ConnectionName="<#= targetConnectionName #>">
                            <TableFromVariableOutput VariableName="User.QualifiedTableName" />
                        </OleDbDestination>
                    </Transformations>
                </Dataflow>
            </Tasks>
        </Package>

My ProjectDriver.biml file is fairly straightforward. In line 1 I provide a relative path to my EPPlus.dll; the ..\ indicates the assembly is one folder up - two folders up, actually, since I keep a single copy in my base Visual Studio folder. Line 2 specifies that we need to bring in the OfficeOpenXml library. In line 5 I create a variable that will hold the metadata for my solution. Line 6 is kind of interesting: I let the template determine what the package name should be based on the supplied metadata, and rather than performing that logic twice, it'd be nice to keep track of what packages have been created. Not only nice - it'll be required, since we're using the Project Deployment Model! Line 19 is where we actually stamp out a specific package; look at the second parameter, out customOutput. That is the mechanism for our template to send information back to the caller - in our case, we add the package name to our ever-growing list of packages. In line 28, we then run back through our list of packages and build out the project's definition. And that's about it. We've already talked about the GetExcelDriverData method. The GetDriverData method provides a simple abstraction between where I actually get the metadata and how the packages are built. You can see a commented-out reference to a GetStaticDriverData method, which I used during development to test boundary conditions. Who knows, maybe I will pull from Azure Tables next...

<#@ assembly name= "..\..\EPPlus.dll" #>
<#@ import namespace="OfficeOpenXml" #>
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
<#
    Dictionary<string, List<string>> dasData = new Dictionary<string, List<string>>();
    List<string> packageList = new List<string>();
    string templateName = "inc_TruncAndReloadPackageParameter.biml";
    string projectName = "SHADOW_IT_DataSnapshot";

    // Get our meta data    
    dasData = GetDriverData();
#>
    <Packages>
<#

    dynamic customOutput;
    foreach(var key in dasData.Keys)
    {
        WriteLine(CallBimlScriptWithOutput(templateName, out customOutput, dasData[key][0], dasData[key][1], dasData[key][2], dasData[key][3], dasData[key][4], dasData[key][5], dasData[key][6]));
        packageList.Add(customOutput.PackageName);
    }
#>
    </Packages>
    <Projects>
        <PackageProject Name="<#= projectName #>">
            <Packages>
<#
        foreach(var key in packageList)
        {
#>
            <Package PackageName="<#= key #>" />
<#
        }
#>            
            </Packages>
            <Connections>
                <Connection ConnectionName="WWI_DB" />
                <Connection ConnectionName="WWI_DW" />
            </Connections>
        </PackageProject>
    </Projects>
    <Connections>
        <OleDbConnection Name="WWI_DB" ConnectionString="Data Source=.\DEV2017;Initial Catalog=WWI_DW;Provider=SQLNCLI11.1;Integrated Security=SSPI;Application Intent=READONLY;ConnectionTimeout=0;" CreateInProject="true" />
        <OleDbConnection Name="WWI_DW" ConnectionString="Data Source=.\DEV2017;Initial Catalog=WWI_DB;Provider=SQLNCLI11.1;Integrated Security=SSPI;Application Intent=READONLY;" CreateInProject="true" />
    </Connections>
</Biml>

<#+

    /// Get data from Excel worksheet
    public Dictionary<string, List<string>> GetExcelDriverData(string sourceFile)
    {
        Dictionary<string, List<string>> d = new Dictionary<string, List<string>>();
        System.IO.FileInfo fi = new System.IO.FileInfo(sourceFile);
        using (ExcelPackage ep = new ExcelPackage(fi))
        {
            ExcelWorkbook wb = ep.Workbook;
            ExcelWorksheet ws = wb.Worksheets.First();
            if (ws != null)
            {
                // Rows are 1-based; skip the header row and read through End.Row, pulling columns 1-7
                for (int i = ws.Dimension.Start.Row+1; i < ws.Dimension.End.Row+1; i++)
                {
                    List<string> row = new List<string>() { ws.Cells[i, 1].Value.ToString()
                    ,   ws.Cells[i, 2].Value.ToString()
                    ,   ws.Cells[i, 3].Value.ToString()
                    ,   ws.Cells[i, 4].Value.ToString()
                    ,   ws.Cells[i, 5].Value.ToString()
                    ,   ws.Cells[i, 6].Value.ToString()
                    ,   ws.Cells[i, 7].Value.ToString()
                    };
                    
                    d[ws.Cells[i, 7].Value.ToString()] = row;
                }
            }
        }
        
        return d;
    }

    public Dictionary<string, List<string>> GetDriverData()
    {
        string sourceFile= @"C:\Users\billinkc\Documents\ShadowIt_DataSnap.xlsx";
        return GetExcelDriverData(sourceFile);
        //return GetStaticDriverData();
    }
#>

And that's how we can use EPPlus to consume metadata stored in Excel to generate many packages with Biml. Let me know if this helps or if you have questions about how to get this running. It's good stuff, I can't get enough of it.

Monday, October 8, 2018

Reading Excel files without Excel

Reading Excel files without Excel

A common problem working with Excel data is Excel itself. Working with it programmatically requires an installation of Office, and the resulting license cost, and once everything is set, you're still working with COM objects, which present their own set of challenges. If only there was a better way.

Enter the better way: EPPlus. This is an open source library that wraps the OpenXml library, which means you simply reference a DLL. No more installation hassles, no more licensing (LGPL) expense, just a simple reference you can package with your solutions.

Let's look at an example. Here's a simple spreadsheet with a header row and a row's worth of data.

For each row, after the header, I'll read the 7 columns into a list and then, since I assume the last column, BusinessFriendlyName, is unique, I'll use that as the key for my return dictionary.

using System.Collections.Generic;
using System.Linq;
using OfficeOpenXml;
...
    /// Get data from Excel worksheet
    public Dictionary<string, List<string>> GetExcelDriverData(string sourceFile)
    {
        Dictionary<string, List<string>> d = new Dictionary<string, List<string>>();
        System.IO.FileInfo fi = new System.IO.FileInfo(sourceFile);
        using (ExcelPackage ep = new ExcelPackage(fi))
        {
            ExcelWorkbook wb = ep.Workbook;
            ExcelWorksheet ws = wb.Worksheets.First();
            if (ws != null)
            {
                // Rows are 1-based; skip the header row and read through End.Row, pulling columns 1-7
                for (int i = ws.Dimension.Start.Row+1; i < ws.Dimension.End.Row+1; i++)
                {
                    List<string> row = new List<string>() { ws.Cells[i, 1].Value.ToString()
                    ,   ws.Cells[i, 2].Value.ToString()
                    ,   ws.Cells[i, 3].Value.ToString()
                    ,   ws.Cells[i, 4].Value.ToString()
                    ,   ws.Cells[i, 5].Value.ToString()
                    ,   ws.Cells[i, 6].Value.ToString()
                    ,   ws.Cells[i, 7].Value.ToString()
                    };
                    
                    d[ws.Cells[i, 7].Value.ToString()] = row;
                }
            }
        }
        
        return d;
    }

It's as easy as that. There are plenty of more clever implementations out there but I wanted to demonstrate a quick and easy method to read Excel from your .NET code.

Tuesday, August 14, 2018

A date dimension for SQL Server

A date dimension for SQL Server

The most common table you will find in a data warehouse is the date dimension. There is no "right" implementation beyond what the customer needs to solve their business problem. I'm posting a date dimension for SQL Server that I generally find useful as a starting point in the hopes that I quit losing it. Perhaps you'll find it useful or can use the approach to build one more tailored to your environment.

As the comments indicate, this will create a DW schema and a table named DW.DimDate, and then populate the date dimension from 1900-01-01 to 2079-06-06, endpoints inclusive. I also patch in 9999-12-31 as a well-known "unknown" date value. Sure, it's odd to have an incomplete year - this is your opportunity to tune the supplied code ;)

-- At the conclusion of this script, there will be
-- A schema named DW
-- A table named DW.DimDate
-- DW.DimDate will be populated with all the days between 1900-01-01 and 2079-06-06 (inclusive)
--   and the sentinel date of 9999-12-31

IF NOT EXISTS
(
    SELECT * FROM sys.schemas AS S WHERE S.name = 'DW'
)
BEGIN
    EXECUTE('CREATE SCHEMA DW AUTHORIZATION dbo;');
END
GO
IF NOT EXISTS
(
    SELECT * FROM sys.schemas AS S INNER JOIN sys.tables AS T ON T.schema_id = S.schema_id
    WHERE S.name = 'DW' AND T.name = 'DimDate'
)
BEGIN
    CREATE TABLE DW.DimDate
    (
        DateSK int NOT NULL
    ,   FullDate date NOT NULL
    ,   CalendarYear int NOT NULL
    ,   CalendarYearText char(4) NOT NULL
    ,   CalendarMonth int NOT NULL
    ,   CalendarMonthText varchar(12) NOT NULL
    ,   CalendarDay int NOT NULL
    ,   CalendarDayText char(2) NOT NULL
    ,   CONSTRAINT PK_DW_DimDate
            PRIMARY KEY CLUSTERED
            (
                DateSK ASC
            )
            WITH (ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, DATA_COMPRESSION = PAGE)
    ,   CONSTRAINT UQ_DW_DimDate UNIQUE (FullDate)
    );
END
GO
WITH 
    -- Define the start and the terminal value
    BOOKENDS(FirstDate, LastDate) AS (SELECT DATEFROMPARTS(1900,1,1), DATEFROMPARTS(9999,12,31))
    -- Itzik Ben-Gan rapid number generator
    -- T4 below builds 65536 rows (the sentinel union brings it to 65537). Need more - follow the pattern
    --  Need fewer rows, add a top below
,    T0 AS 
(
    -- 2
    SELECT 1 AS n
    UNION ALL SELECT 1
)
,    T1 AS
(
    -- 2*2 => 4
    SELECT 1 AS n
    FROM
        T0
        CROSS APPLY T0 AS TX
)
,    T2 AS 
(
    -- 4*4 => 16
    SELECT 1 AS n
    FROM
        T1
        CROSS APPLY T1 AS TX
)
,    T3 AS 
(
    -- 16*16 => 256
    SELECT 1 AS n
    FROM
        T2
        CROSS APPLY T2 AS TX
)
,    T4 AS
(
    -- 256*256 => 65536
    -- or approx 179 years
    SELECT 1 AS n
    FROM
        T3
        CROSS APPLY T3 AS TX
)
,    T5 AS
(
    -- 65536*65536 => over 4 billion, basically infinity
    SELECT 1 AS n
    FROM
        T4
        CROSS APPLY T4 AS TX
)
    -- Assume we now have enough numbers for our purpose
,    NUMBERS AS
(
    -- Add a SELECT TOP (N) here if you need fewer rows
    SELECT
        CAST(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS int) -1 AS number
    FROM
        T4
    UNION 
    -- Build End of time date
    -- Get an N value of 2958463 for
    -- 9999-12-31 assuming start date of 1900-01-01
    SELECT
        ABS(DATEDIFF(DAY, BE.LastDate, BE.FirstDate))
    FROM
        BOOKENDS AS BE
)
, DATES AS
(
SELECT
    PARTS.DateSk
,   FD.FullDate
,   PARTS.CalendarYear
,   PARTS.CalendarYearText
,   PARTS.CalendarMonth
,   PARTS.CalendarMonthText
,   PARTS.CalendarDay
,   PARTS.CalendarDayText
FROM
    NUMBERS AS N
    CROSS APPLY
    (
        SELECT
            DATEADD(DAY, N.number, BE.FirstDate) AS FullDate
        FROM
            BOOKENDS AS BE
    )FD
    CROSS APPLY
    (
        SELECT
            CAST(CONVERT(char(8), FD.FullDate, 112) AS int) AS DateSk
        ,   DATEPART(YEAR, FD.FullDate) AS [CalendarYear] 
        ,   DATENAME(YEAR, FD.FullDate) AS [CalendarYearText]
        ,   DATEPART(MONTH, FD.FullDate) AS [CalendarMonth]
        ,   DATENAME(MONTH, FD.FullDate) AS [CalendarMonthText]
        ,   DATEPART(DAY, FD.FullDate)  AS [CalendarDay]
        ,   DATENAME(DAY, FD.FullDate) AS [CalendarDayText]

    )PARTS
)
INSERT INTO
    DW.DimDate
(
    DateSK
,   FullDate
,   CalendarYear
,   CalendarYearText
,   CalendarMonth
,   CalendarMonthText
,   CalendarDay
,   CalendarDayText
)
SELECT
    D.DateSk
,   D.FullDate
,   D.CalendarYear
,   D.CalendarYearText
,   D.CalendarMonth
,   D.CalendarMonthText
,   D.CalendarDay
,   D.CalendarDayText
FROM
    DATES AS D
WHERE NOT EXISTS
(
    SELECT * FROM DW.DimDate AS DD
    WHERE DD.DateSK = D.DateSk
);
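
As a quick sanity check (assuming the defaults above were left alone), the populated table should hold 65537 rows: the 65536 contiguous days plus the sentinel date.

-- Verify endpoints and row count of the freshly loaded dimension
SELECT
    COUNT_BIG(1)     AS TotalRows     -- expect 65537 (65536 contiguous days plus the sentinel)
,   MIN(DD.FullDate) AS FirstDate     -- expect 1900-01-01
,   MAX(DD.FullDate) AS SentinelDate  -- expect 9999-12-31
FROM
    DW.DimDate AS DD;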

Thursday, April 5, 2018

Sort SQL Server tables into similarly sized buckets

Sort SQL Server Tables into similarly sized buckets

You need to do something to all of the tables in SQL Server. That something can be anything: reindex/reorg, export the data, perform some other maintenance - it really doesn't matter. What does matter is that you'd like to get it done sooner rather than later. If time is no consideration, then you'd likely just do one table at a time until you've done them all. Sometimes, though, a maximum degree of parallelization of one is less than ideal. You're paying for more than one processor core; you might as well use them. The devil in splitting a workload out is ensuring the tasks are well balanced. When I'm staging data in SSIS, I often use row count as an approximation for time cost. It's not perfect - a million-row table 430 columns wide might actually take longer than a 250 million row key-value table.

A sincere tip of the hat to Daniel Hutmacher (b|t) for his answer on this StackExchange post. He has some great logic for sorting tables into approximately equally sized bins, and it performs reasonably well.

SET NOCOUNT ON;
DECLARE
    @bucketCount tinyint = 6;

IF OBJECT_ID('tempdb..#work') IS NOT NULL
BEGIN
    DROP TABLE #work;
END

CREATE TABLE #work (
    _row    int IDENTITY(1, 1) NOT NULL,
    [SchemaName] sysname,
    [TableName] sysname,
    [RowsCounted]  bigint NOT NULL,
    GroupNumber     int NOT NULL,
    moved   tinyint NOT NULL,
    PRIMARY KEY CLUSTERED ([RowsCounted], _row)
);

WITH cte AS (
SELECT B.RowsCounted
,   B.SchemaName
,   B.TableName
    FROM
    (
        SELECT
            s.[Name] as [SchemaName]
        ,   t.[name] as [TableName]
        ,   SUM(p.rows) as [RowsCounted]
        FROM
            sys.schemas s
            LEFT OUTER JOIN 
                sys.tables t
                ON s.schema_id = t.schema_id
            LEFT OUTER JOIN 
                sys.partitions p
                ON t.object_id = p.object_id
            LEFT OUTER JOIN  
                sys.allocation_units a
                ON p.partition_id = a.container_id
        WHERE
            p.index_id IN (0,1)
            AND p.rows IS NOT NULL
            AND a.type = 1
        GROUP BY 
            s.[Name]
        ,   t.[name]
    ) B
)

INSERT INTO #work ([RowsCounted], SchemaName, TableName, GroupNumber, moved)
SELECT [RowsCounted], SchemaName, TableName, ROW_NUMBER() OVER (ORDER BY [RowsCounted]) % @bucketCount AS GroupNumber, 0
FROM cte;


WHILE (@@ROWCOUNT!=0)
WITH cte AS
(
    SELECT
        *
    ,   SUM(RowsCounted) OVER (PARTITION BY GroupNumber) - SUM(RowsCounted) OVER (PARTITION BY (SELECT NULL)) / @bucketCount AS _GroupNumberoffset
    FROM
        #work
)
UPDATE
    w
SET
    w.GroupNumber = (CASE w._row
                 WHEN x._pos_row THEN x._neg_GroupNumber
                 ELSE x._pos_GroupNumber
             END
            )
,   w.moved = w.moved + 1
FROM
    #work AS w
    INNER JOIN
    (
        SELECT TOP 1
            pos._row AS _pos_row
        ,   pos.GroupNumber AS _pos_GroupNumber
        ,   neg._row AS _neg_row
        ,   neg.GroupNumber AS _neg_GroupNumber
        FROM
            cte AS pos
            INNER JOIN
                cte AS neg
                ON pos._GroupNumberoffset > 0
                   AND neg._GroupNumberoffset < 0
                   AND
            --- To prevent infinite recursion:
            pos.moved < @bucketCount
                   AND neg.moved < @bucketCount
        WHERE --- must improve positive side's offset:
            ABS(pos._GroupNumberoffset - pos.RowsCounted + neg.RowsCounted) <= pos._GroupNumberoffset
            AND
            --- must improve negative side's offset:
            ABS(neg._GroupNumberoffset - neg.RowsCounted + pos.RowsCounted) <= ABS(neg._GroupNumberoffset)
        --- Largest changes first:
        ORDER BY
            ABS(pos.RowsCounted - neg.RowsCounted) DESC
    ) AS x
    ON w._row IN
       (
           x._pos_row
       ,   x._neg_row
       );

Now what? Let's look at the results. Run this against AdventureWorks and AdventureWorksDW.

SELECT
    W.GroupNumber
,   COUNT_BIG(1) AS TotalTables
,   SUM(W.RowsCounted) AS GroupTotalRows
FROM
    #work AS W
GROUP BY
    W.GroupNumber
ORDER BY
    W.GroupNumber;


SELECT
    W.GroupNumber
,   W.SchemaName
,   W.TableName
,   W.RowsCounted
,   COUNT_BIG(1) OVER (PARTITION BY W.GroupNumber ORDER BY (SELECT NULL)) AS TotalTables
,   SUM(W.RowsCounted) OVER (PARTITION BY W.GroupNumber ORDER BY (SELECT NULL)) AS GroupTotalRows
FROM
    #work AS W
ORDER BY
    W.GroupNumber;

For AdventureWorks (2014), I get a nice distribution across my 6 groups. 12 to 13 tables in each bucket and a total row count between 125777 and 128003. That's less than 2% variance between the high and low - I'll take it.

If you rerun for AdventureWorksDW, it's a little more interesting. Our 6 groups are again filled with 5 to 6 tables but this time, group 1 is heavily skewed by the fact that FactProductInventory accounts for 73% of all the rows in the entire database. The other 5 tables in the group are the five smallest tables in the database.

I then ran this against our data warehouse-like environment. We had 1206 tables in there for 3,283,983,766 rows (roughly 3.3 billion). The query went from instantaneous to about 15 minutes, but now I've got a starting point for bucketing my tables into similarly sized groups.

What do you think? How do you plan to use this? Do you have a different approach for figuring this out? I looked at R but without knowing what this activity is called, I couldn't find a function to perform the calculations.

Monday, March 12, 2018

2018 MVP Summit retrospective

2018 MVP Summit retrospective

Another year of the MVP Summit is in the bag and as always, I have months worth of learning I'm excited to do.

Thank you

I'd like to extend a hearty thank you to Microsoft and the various teams for hosting us. I can't imagine the sheer number of hours spent in preparation, the actual time not-spent-working-on-technology-X, much less the expense of caffeinating, feeding, lodging, and transporting us.

What I'm excited about

Stream Analytics

We have a high performance messaging system (60M messages per day, averaging 130ms throughput) that allows us to expose mainframe data as a SQL Server database for analytics. The devil with Service Broker is that there's no built-in monitoring. We have a clever dashboard built on the Power BI streaming dataset source that provides an at-a-glance health check for data processing. What we need, though, is something that can drive action based on changes. The September changes in Stream Analytics look like the perfect fit. They allow us to detect not just hard limits (we've violated our 3 second SLA) but also squishier signals: a background process just woke up and swamped us with a million rows in the past three minutes, or our processing time is trending upwards and someone needs to figure out why.

SQL Graph improvements

While we are not yet using graph features, I can see opportunities for it with our client that I want to build some proof of concept models.

Cosmos DB

Alongside the Stream Analytics improvements, perhaps we need to feed the change data into Cosmos and then leverage the Change Feed support to push to analytics processing. And just generally, I need to invest some time in Apache Spark. I also learned that I don't need to discover all the patterns for lambda architecture as it's already out there with a handy URL to boot.

Cognitive Services

Ok, while picking up information about this was just to scratch a very silly itch, I was impressed by how easy it was from the web interface. I have bird feeders, and even though most seed will state that squirrels are not interested in it, that's a downright lie.

Don't mind me, I'm just a fuzzy bird

I want a camera pointed at my bird feeder and if a squirrel shows, I want to know about it. I used about a dozen pictures of my bird feeders with and without my nemesis to train the model and then fed back assorted photos to see how smart it was. Except for an image of a squirrel hiding in shadow, it was able to give me high confidence readings on what was featured in the photo. Here we can see that my dog is neither a bird nor a squirrel.
Not a squirrel, just a lazy dog

I'm so excited to get these bots built out: one for the Raspberry Pi to detect presence at the feeder, and then an Azure-based recognizer for friend versus foe. Once that's done, the next phase will be to identify specific bird species. And then tie it to feed type and feeder style (tray/platform versus house versus tube) and time of day and ... yes, lots of fun permutations that are easily available without having to learn all the computer vision and statistics. Feel free to give it a whirl at https://customvision.ai

SQLOps studio

This is the new cross platform SQL Server Management Studio replacement - sort of. It's not designed to do everything SSMS does but instead the vision is to solve the most needed problems and with the open source model, the community can patch in their own solutions. I'm excited to put together a better reporting interface for the SSISDB. Something that you can actually copy text out of - how crazy is that?

Azure Data Lake Analytics

It had been a year since I had worked through some of the ADLA/USQL so it was good to get back into the language and environment. I need to get on a project that is actually using the technology though to really cement my knowledge.

What I learned

In October of 2016, I launched Sterling Data Consulting as my company. I sub under a good friend and it's been an adventure running a business, but I don't feel like I'm really running a business since I have no other business. One of my TODOs at the conference was to talk to other small shop owners to see if I could discover their "secret sauce." While I got assorted feedback, the two I want to send a special thank you to are John Sterrett of Procure SQL and Tim Radney. Their advice ranged from the straightforward ("I don't know what you do", "are you for hire?") to thoughts on lead acquisition and my lack of vision for sales.

Tim was also my roommate and it was great just getting to know him. We traded Boy Scout leader stories and he had excellent ideas for High Adventure fundraisers since that's something our troop is looking to do next year. For being a year younger than me, he sure had a lot more wisdom on the things I don't do or don't do well. You should check him out at the Atlanta SQL Saturday and attend his precon on Common SQL Server mistakes and how to avoid them.

Photos

Bellevue is less scenic than Seattle but the sunshine and warmth on Tuesday made for some nice photos of the treehouses. Yes, the Microsoft Campus has adult sized treehouses in it. How cool is that?

Friday, March 2, 2018

Python pandas repeating character tester

Python pandas repeating character tester

At one of our clients, we are data profiling. They have a mainframe that's been running for so long, they no longer have SMEs for their data. We've been able to leverage Service Broker to provide a real-time (under 3 seconds) remote file store for their data. It's pretty cool, but now they are trying to do something with the data, so we need to understand what the data looks like. We're using a mix of TSQL and python to understand nullability, value variances, etc. One of the "interesting" things we've discovered is that they loved placeholder values. Everyone knows a date of 88888888 is a placeholder for the actual date, which they'll get two steps later in the workflow. Except sometimes they use 99999999, because the eights are the placeholder for the time.

Initially, we were just searching for one sentinel value, then two values, until we saw the bigger pattern of "repeated values probably mean something." For us, this matters because we then need to discard those rows when testing data type suitability. 88888888 isn't a valid date, so our logic might determine that column is best served by a numeric data type. Unless we exclude the eights value, in which case we get a 100% match rate on the column's ability to be converted to a date.
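
On the TSQL side of that check, the idea is roughly the sketch below: count how many values survive TRY_CONVERT once the repeated-digit placeholders have been set aside. The table and column names here are made up purely for illustration.

-- Hypothetical staging table and column, for illustration only
SELECT
    COUNT_BIG(1) AS TotalRows
,   COUNT_BIG(TRY_CONVERT(date, ME.ServiceDate, 112)) AS ConvertibleRows  -- style 112 = yyyymmdd
FROM
    dbo.MainframeExtract AS ME
WHERE
    -- Exclude the repeated-digit placeholder values before judging the column
    ME.ServiceDate NOT IN ('00000000', '88888888', '99999999');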

How can we determine if a string is nothing but repeated values in python? There's a very clever test from StackOverflow:

source == source[0] * len(source)

I would read that as "is the source variable exactly equal to the first character of source repeated for the length of source?"

And that was good, until we hit a NULL (None in python-speak). We then took advantage of the ternary equivalent in python to make it

(source == source[0] * len(source)) if source else False

Enter Pandas (series)

Truth is a funny thing in a Pandas Series. Really, it is: "The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()." We were trying to apply the above function as we were doing everything else:

df.MyColumn.str.len()
# this will fail magnificently
(df.MyColumn == df.MyColumn[0] * len(df.MyColumn)) if df.MyColumn else False

It took me a while since I hadn't really used the pandas library beyond running what my coworker had done. What I needed to do was get a row context to apply the calculation for true/false. As it stands, the Series wants to try and aggregate the booleans or something like that. And it makes sense from a SQL perspective: you can't really apply aggregates to bit fields (beyond COUNT).

So, what's the solution? As always, the answer is usually named exactly like the thing you're trying to do. In this case, apply was the keyword.

df.MyColumn.apply(lambda source: (source == source[0] * len(source)) if source else False)

Full code you can play with would be:

import pandas
import pprint

def isRepeated(src):
    return (src == src[0] * len(src)) if src else False
    
df = pandas.DataFrame({"MyCol":pandas.Series(['AB', 'BC', 'BB', None])})

pprint.pprint(df)

print()
# What rows have the same character in all of them?

pprint.pprint(df.MyCol.apply(lambda source:(source == source[0] * len(source)) if source else False))
#If you'd like to avoid the anonymous function...
pprint.pprint(df.MyCol.apply(isRepeated))

In short, python is glorious and I'm happy to be writing in it again ;)