Sort SQL Server Tables into similarly sized buckets
You need to do something to all of the tables in SQL Server. That something can be anything: reindex/reorg, export the data, perform some other maintenance---it really doesn't matter. What does matter is that you'd like to get it done sooner rather than later. If time were no consideration, you'd likely just do one table at a time until you've done them all. But sometimes a maximum degree of parallelism of one is less than ideal: you're paying for more than one processor core, so you might as well use them. The devil in splitting out a workload is ensuring the tasks are well balanced. When I'm staging data in SSIS, I often use a row count as an approximation for time cost. It's not perfect (a million row table that's 430 columns wide might actually take longer than a 250 million row key-value table), but it's a serviceable proxy.
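If you just want the cost proxy itself, something along these lines (a minimal sketch against the standard catalog views, nothing specific to my environment) pulls a row count per table:

SELECT
    s.name AS SchemaName
,   t.name AS TableName
,   SUM(p.rows) AS RowsCounted
FROM
    sys.schemas AS s
    INNER JOIN sys.tables AS t
        ON s.schema_id = t.schema_id
    INNER JOIN sys.partitions AS p
        ON t.object_id = p.object_id
WHERE
    p.index_id IN (0, 1)    -- heap or clustered index only, so each row is counted once
GROUP BY
    s.name
,   t.name
ORDER BY
    RowsCounted DESC;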
A sincere tip of the hat to Daniel Hutmacher (b|t) for his answer on this StackExchange post. He has some great logic for sorting tables into approximately equally sized bins, and it performs reasonably well.
SET NOCOUNT ON;
DECLARE @bucketCount tinyint = 6;

IF OBJECT_ID('tempdb..#work') IS NOT NULL
BEGIN
    DROP TABLE #work;
END

CREATE TABLE #work
(
    _row int IDENTITY(1, 1) NOT NULL
,   [SchemaName] sysname
,   [TableName] sysname
,   [RowsCounted] bigint NOT NULL
,   GroupNumber int NOT NULL
,   moved tinyint NOT NULL
,   PRIMARY KEY CLUSTERED ([RowsCounted], _row)
);

--- Seed the work table: every table's row count, dealt round-robin into @bucketCount groups
WITH cte AS
(
    SELECT
        B.RowsCounted
    ,   B.SchemaName
    ,   B.TableName
    FROM
    (
        SELECT
            s.[Name] AS [SchemaName]
        ,   t.[name] AS [TableName]
        ,   SUM(p.rows) AS [RowsCounted]
        FROM
            sys.schemas s
            LEFT OUTER JOIN sys.tables t
                ON s.schema_id = t.schema_id
            LEFT OUTER JOIN sys.partitions p
                ON t.object_id = p.object_id
            LEFT OUTER JOIN sys.allocation_units a
                ON p.partition_id = a.container_id
        WHERE
            p.index_id IN (0, 1)
            AND p.rows IS NOT NULL
            AND a.type = 1
        GROUP BY
            s.[Name]
        ,   t.[name]
    ) B
)
INSERT INTO
    #work ([RowsCounted], SchemaName, TableName, GroupNumber, moved)
SELECT
    [RowsCounted]
,   SchemaName
,   TableName
,   ROW_NUMBER() OVER (ORDER BY [RowsCounted]) % @bucketCount AS GroupNumber
,   0
FROM
    cte;

--- Iteratively swap tables between over- and under-weight groups until no
--- swap improves the balance (or a table has already moved @bucketCount times)
WHILE (@@ROWCOUNT != 0)
WITH cte AS
(
    SELECT
        *
    ,   SUM(RowsCounted) OVER (PARTITION BY GroupNumber)
        - SUM(RowsCounted) OVER (PARTITION BY (SELECT NULL)) / @bucketCount AS _GroupNumberoffset
    FROM
        #work
)
UPDATE
    w
SET
    w.GroupNumber = (CASE w._row WHEN x._pos_row THEN x._neg_GroupNumber ELSE x._pos_GroupNumber END)
,   w.moved = w.moved + 1
FROM
    #work AS w
    INNER JOIN
    (
        SELECT TOP 1
            pos._row AS _pos_row
        ,   pos.GroupNumber AS _pos_GroupNumber
        ,   neg._row AS _neg_row
        ,   neg.GroupNumber AS _neg_GroupNumber
        FROM
            cte AS pos
            INNER JOIN cte AS neg
                ON pos._GroupNumberoffset > 0
                AND neg._GroupNumberoffset < 0
                AND
                --- To prevent infinite recursion:
                pos.moved < @bucketCount
                AND neg.moved < @bucketCount
        WHERE
            --- must improve positive side's offset:
            ABS(pos._GroupNumberoffset - pos.RowsCounted + neg.RowsCounted) <= pos._GroupNumberoffset
            AND
            --- must improve negative side's offset:
            ABS(neg._GroupNumberoffset - neg.RowsCounted + pos.RowsCounted) <= ABS(neg._GroupNumberoffset)
        --- Largest changes first:
        ORDER BY
            ABS(pos.RowsCounted - neg.RowsCounted) DESC
    ) AS x
    ON w._row IN (x._pos_row, x._neg_row);
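Once the buckets exist, each parallel worker only needs to know its group number. Here's a hypothetical consumer (the @myGroup variable and the REORGANIZE command are my own illustration, not part of the original solution) that lists one group's tables along with a maintenance statement for each:

--- Sketch: each worker (SSIS package, Agent job, etc.) handles one GroupNumber
DECLARE @myGroup int = 0;   -- assumption: one worker per group, 0 through @bucketCount - 1

SELECT
    W.SchemaName
,   W.TableName
,   N'ALTER INDEX ALL ON '
        + QUOTENAME(W.SchemaName) + N'.' + QUOTENAME(W.TableName)
        + N' REORGANIZE;' AS MaintenanceCommand
FROM
    #work AS W
WHERE
    W.GroupNumber = @myGroup
ORDER BY
    W.RowsCounted DESC;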
Now what? Let's look at the results by running the following against AdventureWorks and AdventureWorksDW.
SELECT
    W.GroupNumber
,   COUNT_BIG(1) AS TotalTables
,   SUM(W.RowsCounted) AS GroupTotalRows
FROM
    #work AS W
GROUP BY
    W.GroupNumber
ORDER BY
    W.GroupNumber;

SELECT
    W.GroupNumber
,   W.SchemaName
,   W.TableName
,   W.RowsCounted
,   COUNT_BIG(1) OVER (PARTITION BY W.GroupNumber ORDER BY (SELECT NULL)) AS TotalTables
,   SUM(W.RowsCounted) OVER (PARTITION BY W.GroupNumber ORDER BY (SELECT NULL)) AS GroupTotalRows
FROM
    #work AS W
ORDER BY
    W.GroupNumber;
For AdventureWorks (2014), I get a nice distribution across my 6 groups: 12 to 13 tables in each bucket and a group total row count between 125,777 and 128,003. That's less than 2% variance between the high and low. I'll take it.
Rerunning against AdventureWorksDW is a little more interesting. The 6 groups are again filled with 5 to 6 tables each, but this time group 1 is heavily skewed by the fact that FactProductInventory accounts for 73% of all the rows in the entire database. The other 5 tables in that group are the five smallest tables in the database.
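If you want to see where that skew comes from, a quick query over the same #work table (my addition, not part of the bucketing logic) shows each table's share of the database's total rows:

--- Sketch: flag outlier tables by their percentage of the total row count
SELECT
    W.SchemaName
,   W.TableName
,   W.RowsCounted
,   CONVERT(decimal(5, 2),
        100.0 * W.RowsCounted / SUM(W.RowsCounted) OVER ()) AS PercentOfTotalRows
FROM
    #work AS W
ORDER BY
    W.RowsCounted DESC;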
I then ran this against our data warehouse-like environment. We had 1,206 tables in there totaling 3,283,983,766 rows (roughly 3.3 billion). The query went from instantaneous to about 15 minutes, but now I've got a starting point for bucketing my tables into similarly sized groups.
What do you think? How do you plan to use this? Do you have a different approach for figuring this out? I looked at R, but without knowing what this activity is called, I couldn't find a function to perform the calculations.