A blog about SQL Server, SSIS, C# and whatever else I happen to be dealing with in my professional life.

Find ramblings

Loading...

Thursday, April 28, 2011

Speaking at Lincoln SSUG

A week from now, I'll be speaking at the inaugural meeting of the Lincoln SQL Server User Group. There are plenty of self-deprecating comments I could make, but I'll leave that to my fine friends in the Lincoln area. Instead, I'll be sincere and thank them for the opportunity and do my best to deliver a great presentation. I was hoping to revisit my "45 new language features in 45 minutes" presentation from SQL Saturday 66 but there were requests for an SSIS talk and that's definitely an area I've been wanting to get out and talk on. My original title was lacking "SSIS Do's and Don't's: A guide to loving ETL or a handbook to hell." One of the best teachers I had at college, Dr. Bogan, said "Never underestimate the power of a catchy title." I liked the guidebook concept and thought of one of the more famous guidebooks is easily the Kama Sutra. Thus, the "Kama Sutra of SSIS: A guide to loving ETL" was born.

I'm sure everyone has their own way of doing things. Whether it's a new project, presentation or anything where I have lots of stuff floating about in my head in no discernable order, I fire up FreeMind and just start jotting things down. It doesn't really lend itself to long developed ideas like I tend to box myself into when I fire up my text editor to write. It's a great tool, free (in both senses) and works on most any operating system out there.

This is an export of where I was 15 days ago in terms of my SSIS brain dump.

This is what my mind map looks like as of today. I think that I've found a logical enough ordering for delivering the content I want to talk about. I chose to keep the branches intact and copy the salient points out into the Presentation branch (lower left corner)

I want to thank John Danley, Twitter, for the opportunity and for the all the folks that have gotten the Lincoln area's SSUG organized. I hope we can get a good speaker exchange set up in the near future. Once Brad McGehee (Blog|Twitter) gets the Springfield SSUG up and running, we could get some serious cross-pollination going.

Wednesday, April 27, 2011

Active Directory SSIS Data Source

The authoritative source for email addresses in my organization is Active Directory. Pretty much everything else comes from Payroll. When we rolled out the corporate credit card vendor, we had to send them a list of people, their unique identifier (clock number in our case) and their email address. We already had a process that pulled data from the payroll system so we just needed a way to tie in the AD information. We managed to get an existing attribute of AD filled out for all our people, employeeNumber, and that allowed us to build out the following package.

My initial approach was to use a linked server to AD from within SQL Server and my god, I'd rather write complex queries against mysql or Access than go through that again. I eventually ran into issues with certain datatypes not being able to be returned. I've since lost the link but I found a writeup that talked about fields like memberof being of type i8 which means you're humped. I ran into pain with doing any sort of a complex where clause as well. The i8 datatype will be an issue whether you use query from SSIS or SSMS. At any rate, I wanted to document what an approach could be for querying data from Active Directory.

Known issues, from least to most severe if you're going to copy/paste my solution.

  • Speed-the group lookup script is dog slow
  • MaxPageSize aka number of rows = 1000-How Active Directory Searches Work
  • Nested Groups-the script only pulls back the first degree children
With those disclaimers out of the way, the way I designed this package was to dump AD information to files that would be consumed by other processes. In the event something went wrong, I'd be able to have a known state to debug from instead of worrying about a LAN admin made a change while my process was running. The important thing I saw it was getting a list of all the people out and all the groups they belonged to as we had upcoming projects that needed clock number and email address but also needed group membership information.

Create a Connection Manager for ADO.NET . Use the ".Net Providers for OleDb\OLE DB Provider for Microsoft Directory Services" For me, it was the second, non-default, folder. For Server or file name: use your domain controller. It should look something like this.

From World of Whatever

Grab a data reader source and wire it up to your new connection manager. This query will pull back all the items in AD that are Users (sAMAccountType = 805306368)

SELECT
    distinguishedName
,   mail
,   samaccountname
,   Name
,   employeeNumber
,   objectSid
,   userAccountControl
,   givenName
,   middleName
,   sn
FROM
    'LDAP://DC=domain,DC=net'
WHERE
    sAMAccountType = 805306368
ORDER BY
    sAMAccountName ASC

I use the Raw File Destination (binary) to hold all the Active Directory information because they are faster and don't require the hassle of defining connection managers. Data coming out of AD is either integers or unicode text streams. I know my data well enough to know unicode isn't a concern. Converting to strings, I used a length of 28 for objectSid, 64 for sn, givenName, middleName, 255 for samaccountname and Name and 256 for distinguishedName and mail. It's been a few years since I wrote the package so I can't recall the why on those lengths. Your mileage may vary.

From World of Whatever

IsActive - this one bit me in the butt. For the userAccountControl a value of 512 is an active AD account, 514 is a disabled account. I have that in documentation from another team. Unfortunately, that's not quite accurate. That field is really a bitmask. A year or two after we implemented my package, we learned that inactive people were showing up in feeds which was impossible, we filter them (514s) out. I then learned that a 514 is really just a 512 (NORMAL_ACCOUNT) ORed with 2 (ACCOUNTDISABLE). An automated process had been implemented that created accounts with expired passwords or some such nonsense which had a net result of disabled accounts reading as 546 (regular became 544 - Account Enabled - Require user to change password at first logon).

The group lookup is an asynchronous script that writes a row to the output buffer for each group they are a member of.

Imports System
Imports System.Data
Imports System.Collections.Generic
Imports System.DirectoryServices
Imports System.Math
Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
Imports Microsoft.SqlServer.Dts.Runtime.Wrapper

Public Class ScriptMain
    Inherits UserComponent

    Dim User As DirectoryEntry

    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
        Try
            'MsgBox("LDAP://" + Row.distinguishedName)
            User = New DirectoryEntry("LDAP://" + Row.distinguishedName)
            With GroupUserOutputBuffer
                For Each group As String In User.Properties("memberOf")
                    GroupUserOutputBuffer.AddRow()
                    GroupUserOutputBuffer.sAMAccountName = Row.sAMAccountName
                    GroupUserOutputBuffer.employeeNumber = Row.employeeNumber
                    GroupUserOutputBuffer.distinguishedName = Row.distinguishedName
                    GroupUserOutputBuffer.memberOf = group
                Next
            End With
        Catch ex As Exception
            'ErrorOutputBuffer.AddRow()
            'ErrorOutputBuffer.groupdistinguishedName = Row.
        End Try
    End Sub

End Class

Below is an example of the Users and groups file if you want to visualize it

Users.raw
sAMAccountNameuserAccountControlemployeeNumberdistinguishedNameisActivemailnamefirstNamemiddleNamelastName
AAchoo512123456789CN=Angela Achoo,OU=SALES,OU=Home,DC=domain,DC=netTrueAAchoo@domain.netAngela AchooAngelaAchoo
Groups.raw
sAMAccountNameemployeeNumberdistinguishedNamememberOf
AAchoo123456789CN=Angela Achoo,OU=SALES,OU=Home,DC=domain,DC=netCN=Mobile Users,OU=Home,DC=domain,DC=net
AAchoo123456789CN=Angela Achoo,OU=SALES,OU=Home,DC=domain,DC=netCN=REPORT_VIEWER,OU=Prod,OU=ApplicationGroups,DC=domain,DC=net
AAchoo123456789CN=Angela Achoo,OU=SALES,OU=Home,DC=domain,DC=netCN=REPORT_VIEWER2,OU=Prod,OU=ApplicationGroups,DC=domain,DC=net
AAchoo123456789CN=Angela Achoo,OU=SALES,OU=Home,DC=domain,DC=netCN=REPORT_ACCESS,OU=Prod,OU=ApplicationGroups,DC=domain,DC=net
AAchoo123456789CN=Angela Achoo,OU=SALES,OU=Home,DC=domain,DC=netCN=Feed_Exclustion,OU=Prod,OU=ApplicationGroups,DC=domain,DC=net
AAchoo123456789CN=Angela Achoo,OU=SALES,OU=Home,DC=domain,DC=netCN=National_Accounts,OU=SALES,OU=Home,DC=domain,DC=net
AAchoo123456789CN=Angela Achoo,OU=SALES,OU=Home,DC=domain,DC=netCN=Interclass Exchanges,OU=Resource Mailboxes/Distribution Lists,OU=CUST,OU=Home,DC=domain,DC=net
AAchoo123456789CN=Angela Achoo,OU=SALES,OU=Home,DC=domain,DC=netCN=Remote_Users,OU=APPS_GROUPS,OU=Home,DC=domain,DC=net
AAchoo123456789CN=Angela Achoo,OU=SALES,OU=Home,DC=domain,DC=netCN=Primary SMTP Group,OU=SALES,OU=Home,DC=domain,DC=net
AAchoo123456789CN=Angela Achoo,OU=SALES,OU=Home,DC=domain,DC=netCN=Corporate Card,OU=ACCT,OU=Home,DC=domain,DC=in,DC=net

References

Monday, April 25, 2011

The SQLTeam bump

This site generally receives about 50 page views a day which I'm tickled with. People seem to read my content and it tends to rank well with Google. Imagine my surprise Tuesday morning when I look at my Stats and see this spike

Bots gone wild? Nope, I managed to get two articles linked to from the SQLTeam "Recent SQL Server Links". My Null vs empty string storage cost and Find tables without primary key received a boatload of pageviews. Much thanks to Bill Graziano for the traffic and to any new visitors, welcome and I hope you find some useful content here.

Thursday, April 14, 2011

Null vs empty string storage cost

Mindy (@sqlgirl) tweeted this 20 million rows. varchar(100) col. All NULLs versus all Empty String values '' -- any difference in storage/perf?. The size aspect of the question sounded like a quick and easy thing to figure out.

Using the awesome fast number generator from SQL PASS 2009, I quickly filled 4 new tables with 20M rows and compared sizes. Using SQL Server 2008 R2, my results are below

CREATE DATABASE sizeCompare
GO
USE sizeCompare
GO
CREATE FUNCTION
    dbo.GenerateNumbers
(
    @n as bigint
)
RETURNS TABLE
RETURN
    WITH L0 AS
    (
        SELECT
            0 AS C
        UNION ALL
        SELECT
            0
    )
    , L1 AS
    (
        SELECT
            0 AS c
        FROM
            L0 AS A
            CROSS JOIN L0 AS B
    )
    , L2 AS
    (
        SELECT
            0 AS c
        FROM
            L1 AS A
            CROSS JOIN L1 AS B
    )
    , L3 AS
    (
        SELECT
            0 AS c
        FROM
            L2 AS A
            CROSS JOIN L2 AS B
    )
    , L4 AS
    (
        SELECT
            0 AS c
        FROM
            L3 AS A
            CROSS JOIN L3 AS B
    )
    , L5 AS
    (
        SELECT
            0 AS c
        FROM
            L4 AS A
            CROSS JOIN L4 AS B
    )
    , NUMS AS
    (
        SELECT
            ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS number
        FROM
            L5
    )
    SELECT top (@n)
        number
    FROM
        NUMS
    ORDER BY
        number
GO

CREATE TABLE DBO.Empties
(
row_id int NOT NULL PRIMARY KEY
,    dumb_column varchar(100) NULL
)


CREATE TABLE DBO.Nulls
(
row_id int NOT NULL PRIMARY KEY
,    dumb_column varchar(100) NULL
)


CREATE TABLE DBO.SparseEmpties
(
row_id int NOT NULL PRIMARY KEY
,    dumb_column varchar(100) SPARSE NULL
)

CREATE TABLE DBO.SparseNulls
(
row_id int NOT NULL PRIMARY KEY
,    dumb_column varchar(100) SPARSE NULL
)

-----------------------------------------------
-- 7:28 seconds for the next 2 queries
-----------------------------------------------
INSERT INTO
    dbo.Empties
(
    row_id
,    dumb_column
)
SELECT T.number, '' FROM dbo.GenerateNumbers(20000000) T

INSERT INTO
    dbo.Nulls
(
    row_id
,    dumb_column
)
SELECT T.number, NULL FROM dbo.GenerateNumbers(20000000) T

GO

-----------------------------------------------
-- 3:30 seconds for the next 2 queries
-----------------------------------------------

INSERT INTO
    dbo.SparseEmpties
SELECT * FROM dbo.Empties

INSERT INTO
    dbo.SparseNulls
SELECT * FROM dbo.Nulls

-----------------------------------------------
-- How big is it?
-- http://www.sqlservercentral.com/blogs/jeffrey_yao/archive/tags/DMV/default.aspx
-----------------------------------------------

select name=object_schema_name(object_id) + '.' + object_name(object_id)
, rows=sum(case when index_id < 2 then row_count else 0 end)
, reserved_kb=8*sum(reserved_page_count)
, data_kb=8*sum( case 
     when index_id<2 then in_row_data_page_count + lob_used_page_count + row_overflow_used_page_count 
     else lob_used_page_count + row_overflow_used_page_count 
    end )
, index_kb=8*(sum(used_page_count) 
    - sum( case 
           when index_id<2 then in_row_data_page_count + lob_used_page_count + row_overflow_used_page_count 
        else lob_used_page_count + row_overflow_used_page_count 
        end )
     )    
, unused_kb=8*sum(reserved_page_count-used_page_count)
from sys.dm_db_partition_stats
where object_id > 1024
group by object_id
order by 
rows desc
-- data_kb desc
-- reserved_kb desc
-- data_kb desc
-- index_kb desc
-- unsed_kb desc

namerowsreserved_kbdata_kbindex_kbunused_kb
dbo.Empties2000000025824825724096840
dbo.Nulls2000000025824825724096840
dbo.SparseEmpties20000000497288495360184880
dbo.SparseNulls2000000025824825724096840
My initial conclusion is that "Empty string vs NULL for varchar types has no impact on storage cost*. That only applies for the above example though, your mileage my vary. Use of sparse columns surprised me. I would have assumed it would have taken less space than the other two non-sparse tables.

That doesn't mean developer's get off without having to clean up what they're storing. The increased complexity in logic as well as performance impacts (index, what index?) of code rifled with NULLIF(T.dumb_column) IS NULL or COALESCE(T.dumb_column, '') <> '' is reason enough to not store empty strings in your database. That'll be a post for another day however.

References

Using Sparse Columns

Wednesday, April 13, 2011

Find tables without primary key

Normally, I would expect this query to return zero rows as a table without a primary key is fairly useless from my point of view. However, we have a few databases with warts like that and I had to get a list of those tables back to the people that design said tables as they didn't know... I digress.

Query to find tables without primary key constraint

In theory, this query should work against any database that implements that ANSI standard set of INFORMATION_SCHEMA tables. In practice, I can only say this works for SQL Server 2005 through SQL Denali, Engine of the Devil, CTP1.
SELECT DISTINCT
    T.TABLE_CATALOG
,   T.TABLE_SCHEMA
,   T.TABLE_NAME
FROM
    INFORMATION_SCHEMA.TABLES T
    LEFT OUTER JOIN
        INFORMATION_SCHEMA.TABLE_CONSTRAINTS TC
        ON TC.TABLE_NAME = T.TABLE_NAME
            AND TC.TABLE_SCHEMA = T.TABLE_SCHEMA
            AND TC.CONSTRAINT_TYPE = 'PRIMARY KEY'
WHERE
    TC.TABLE_NAME IS NULL
    AND T.TABLE_TYPE = 'BASE TABLE'

Monday, April 11, 2011

Log SSIS Version number

Tony: Check it. Make sure it's the right thing.
Léon: I trust you.
Tony: One thing has nothin' to do with the other - remember that Léon.
The Professional

Where I work, security is such that I can't tell you what version of SSIS packages are actually deployed in production. I have implementation memos directing what should be out there but beyond having my DBAs run queries against msdb.dbo.sysdtspackages, I don't know what is actually out there. Furthermore, it might be handy to have a record of what version of a package at a point in time.

It's a trivial script task to add into every package that echos out the VersionBuild of the package. On my todo list, was to turn that simple task into a component so developers can simply drag a task onto the canvas and be done. This would give me the ability to look through our log history and attempt to correlate changes to specific gains and losses in performance.

Scott Stauffer (blog|twitter) recently proposed this #SSIS Tip - expose build version in a package design window by making the main business logic container's name an expression! handy?. He's thinking design-time whereas I'm thinking run-time but still, the need is there in both environments to know what's the frequency (of change) Kenneth? Until his followup tweet, I had never thought about using an expression on a task name. I mean, who cares what a task is called as long as it's not "Data Flow Task 1?" His tweet pointed out that you can use an expression to set the name. Brilliant!

The implementation is dead simple. Expression is

"Version - " +  (DT_WSTR, 10)  @[System::VersionBuild]
C#
        public void Main()
        {
            int versionBuild = -1;
            versionBuild = (Int32)Dts.Variables["System::VersionBuild"].Value;
            int informationCode = -1;
            string subComponent = string.Empty;
            string description = string.Empty;
            string helpFile = string.Empty;
            int helpContext = -1;
            bool fireAgain = true;

            subComponent = "C# Script task";
            description = string.Format("Version => {0}", versionBuild);
            
            Dts.Events.FireInformation(informationCode, subComponent, description, helpFile, helpContext, ref fireAgain);

            Dts.TaskResult = (int)ScriptResults.Success;
        }

And here's the proof

Interested in playing? LogVersion.dtsx is available on my google site

Monday, April 4, 2011

Meme Monday 2011-04-04

To quote Thomas LaRock (blog|twitter) "Welcome to the first Meme Monday! Today's meme is 'Write a SQL blog post in 11 words or less'".  I am now tagged courtesy of Gabriel Villa (blog|twitter) and Chris Shaw (blog | twitter) (#sqlsat66 represent!). I am going simple with

Script everything.