A blog about SQL Server, SSIS, C# and whatever else I happen to be dealing with in my professional life.

Find ramblings

Thursday, April 14, 2011

Null vs empty string storage cost

Mindy (@sqlgirl) tweeted this 20 million rows. varchar(100) col. All NULLs versus all Empty String values '' -- any difference in storage/perf?. The size aspect of the question sounded like a quick and easy thing to figure out.

Using the awesome fast number generator from SQL PASS 2009, I quickly filled 4 new tables with 20M rows and compared sizes. Using SQL Server 2008 R2, my results are below

CREATE DATABASE sizeCompare
GO
USE sizeCompare
GO
CREATE FUNCTION
    dbo.GenerateNumbers
(
    @n as bigint
)
RETURNS TABLE
RETURN
    WITH L0 AS
    (
        SELECT
            0 AS C
        UNION ALL
        SELECT
            0
    )
    , L1 AS
    (
        SELECT
            0 AS c
        FROM
            L0 AS A
            CROSS JOIN L0 AS B
    )
    , L2 AS
    (
        SELECT
            0 AS c
        FROM
            L1 AS A
            CROSS JOIN L1 AS B
    )
    , L3 AS
    (
        SELECT
            0 AS c
        FROM
            L2 AS A
            CROSS JOIN L2 AS B
    )
    , L4 AS
    (
        SELECT
            0 AS c
        FROM
            L3 AS A
            CROSS JOIN L3 AS B
    )
    , L5 AS
    (
        SELECT
            0 AS c
        FROM
            L4 AS A
            CROSS JOIN L4 AS B
    )
    , NUMS AS
    (
        SELECT
            ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS number
        FROM
            L5
    )
    SELECT top (@n)
        number
    FROM
        NUMS
    ORDER BY
        number
GO

CREATE TABLE DBO.Empties
(
row_id int NOT NULL PRIMARY KEY
,    dumb_column varchar(100) NULL
)


CREATE TABLE DBO.Nulls
(
row_id int NOT NULL PRIMARY KEY
,    dumb_column varchar(100) NULL
)


CREATE TABLE DBO.SparseEmpties
(
row_id int NOT NULL PRIMARY KEY
,    dumb_column varchar(100) SPARSE NULL
)

CREATE TABLE DBO.SparseNulls
(
row_id int NOT NULL PRIMARY KEY
,    dumb_column varchar(100) SPARSE NULL
)

-----------------------------------------------
-- 7:28 seconds for the next 2 queries
-----------------------------------------------
INSERT INTO
    dbo.Empties
(
    row_id
,    dumb_column
)
SELECT T.number, '' FROM dbo.GenerateNumbers(20000000) T

INSERT INTO
    dbo.Nulls
(
    row_id
,    dumb_column
)
SELECT T.number, NULL FROM dbo.GenerateNumbers(20000000) T

GO

-----------------------------------------------
-- 3:30 seconds for the next 2 queries
-----------------------------------------------

INSERT INTO
    dbo.SparseEmpties
SELECT * FROM dbo.Empties

INSERT INTO
    dbo.SparseNulls
SELECT * FROM dbo.Nulls

-----------------------------------------------
-- How big is it?
-- http://www.sqlservercentral.com/blogs/jeffrey_yao/archive/tags/DMV/default.aspx
-----------------------------------------------

select name=object_schema_name(object_id) + '.' + object_name(object_id)
, rows=sum(case when index_id < 2 then row_count else 0 end)
, reserved_kb=8*sum(reserved_page_count)
, data_kb=8*sum( case 
     when index_id<2 then in_row_data_page_count + lob_used_page_count + row_overflow_used_page_count 
     else lob_used_page_count + row_overflow_used_page_count 
    end )
, index_kb=8*(sum(used_page_count) 
    - sum( case 
           when index_id<2 then in_row_data_page_count + lob_used_page_count + row_overflow_used_page_count 
        else lob_used_page_count + row_overflow_used_page_count 
        end )
     )    
, unused_kb=8*sum(reserved_page_count-used_page_count)
from sys.dm_db_partition_stats
where object_id > 1024
group by object_id
order by 
rows desc
-- data_kb desc
-- reserved_kb desc
-- data_kb desc
-- index_kb desc
-- unsed_kb desc

namerowsreserved_kbdata_kbindex_kbunused_kb
dbo.Empties2000000025824825724096840
dbo.Nulls2000000025824825724096840
dbo.SparseEmpties20000000497288495360184880
dbo.SparseNulls2000000025824825724096840
My initial conclusion is that "Empty string vs NULL for varchar types has no impact on storage cost*. That only applies for the above example though, your mileage my vary. Use of sparse columns surprised me. I would have assumed it would have taken less space than the other two non-sparse tables.

That doesn't mean developer's get off without having to clean up what they're storing. The increased complexity in logic as well as performance impacts (index, what index?) of code rifled with NULLIF(T.dumb_column) IS NULL or COALESCE(T.dumb_column, '') <> '' is reason enough to not store empty strings in your database. That'll be a post for another day however.

References

Using Sparse Columns

No comments: