A blog about SQL Server, SSIS, C# and whatever else I happen to be dealing with in my professional life.

Find ramblings

Wednesday, August 5, 2015

Tracking bad queries aka Finally an Extended Event that is useful

Dessert first a.k.a. how useful would this be?

Imagine running a query and finding out all the bad queries that have been submitted. Like this

The client restored a database but neglected to fix permissions for those first three. Look at that fourth one though! I only enabled this trace on the 3rd. Any takers on how many months that error has been swallowed by their application framework? Exactly. This is definitely something I'm adding into my utility belt. Read on for how to set up Extended Events to do just this.
We now resume your regularly scheduled blog post, already in progress.

Finally, an extended event that was useful.

I'm sure that's heresy among the pure DBAs but I'm not a DBA. I'm just a developer, among many other things. At my last client, one of the developers was hell bent on building procedures which called procedures that built code to call more procedures - dynamic code to build statements to an untenable nesting level. Which was cool in theory but unfortunately they had no way to debug it when it built bad queries. Which it did. Over and over again. Thousands per second. It could have been marketed as Errors as a Service. The real problem though was that their bad code took several hours to for them to recreate the error scenario. What we needed was a way to log the statements across all those levels of nested stored procedures and the values they sent.

Enter Kendra Little and this "magic" script. It's magic because it was the missing link for me finally "getting" XEvents. I remember asking at SQL Saturday #91 (Omaha) back in 2011 "where would I, as a developer, use extended events?" Four years later, I had a scenario where I needed it.

I had meant to blog about this in April when I first used it but then shiny squirrels and this was lost in the TODO list. Fast forward to today. At my current client, we've built a new website and the database to manage part of their business. Most of the website runs off stored procedures but part of it is SQL embedded in the web pages. As much as the developer would like to just make the account the database owner, I'd rather not give up on security. I created a custom database role and added the service account into and granted it execute permissions to the schema. All is well and good except for those pesky direct queries we couldn't turn into procedures. When those queries hit, the "database is down." At least, that's the depth of the email I get. You know what would be handy here, that same bad code tracker!

Implementation

My implementation is to create an extended event that logs to an asynchronous file (first, do no harm). Where my approach differs from Kendra's is parsing all those files is dog slow - at least for the Errors As A Service scenario. So instead, I create a heap to hold the parsed bits I care about as well as the raw XML from the extended event. That way if I determine I wanted some other attribute extracted, I'd still have the original data available. Oh, and finally, I clean up after myself which is important when you're a consultant. You might roll off an engagement and someone forgets to watch the drive and suddenly, you've filled a drive with log data.

You'll notice I create the table in tempdb. Basically, I don't care if I lose that table, it's just for my monitoring purposes and in fact, putting it in temp assures me that at some point, it'll get wiped and recreated. If I were blessed to be on a 2014+ platform, I'd look hard at whether a memory optimized table might serve the same need as long as it's declared with a durability of SCHEMA_ONLY. I don't think LOB types are supported there yet, but I suspect it's a matter of time before they are.

This idempotent script will create our table, dbo.ExtendedEventErrorCapture, if it does not exist. It will create an extended event session, what_queries_are_failing, where we capture any error above severity 10 and stream it out to the file on F:\XEventSessions. If you don't have an F drive with that folder, you need to either update the script or talk to your storage vendor as they will happily supply one. Finally, we turn on the trace/session.

USE tempdb;

--------------------------------------------------------------------------------
-- Create the table to store out the XE data
--------------------------------------------------------------------------------
IF NOT EXISTS
(
    SELECT
        *
    FROM
        sys.schemas AS S
        INNER JOIN 
            sys.tables AS T
        ON T.schema_id = S.schema_id
    WHERE
        T.name = 'ExtendedEventErrorCapture'
        AND S.name = 'dbo'
)
BEGIN
    CREATE TABLE 
        dbo.ExtendedEventErrorCapture
    (
        err_timestamp datetime2(7) NULL
    ,   err_severity bigint NULL
    ,   err_number bigint NULL
    ,   err_message nvarchar(512) NULL
    ,   database_id int NULL
    ,   sql_text nvarchar(MAX) NULL
    ,   event_data xml NULL
    ,   session_name sysname
    ,   username nvarchar(512)
    )
    -- This is only vaild for a Develper/Enterprise edition license
    WITH (DATA_COMPRESSION = PAGE);
END

--------------------------------------------------------------------------------
-- Create the extended event to keep track of bad sql queries
--------------------------------------------------------------------------------
IF NOT EXISTS
(
    SELECT
        *
    FROM
        sys.server_event_sessions AS SES
    WHERE
        SES.name = N'what_queries_are_failing'
)
BEGIN
    --Create an extended event session
    CREATE EVENT SESSION
        what_queries_are_failing
    ON SERVER
    ADD EVENT sqlserver.error_reported
    (
        ACTION (sqlserver.sql_text, sqlserver.tsql_stack, sqlserver.database_id, sqlserver.username)
        WHERE ([severity]> 10)
    )
    ADD TARGET package0.asynchronous_file_target
    (set filename = 'F:\XEventSessions\what_queries_are_failing.xel' ,
        metadatafile = 'F:\XEventSessions\what_queries_are_failing.xem',
        max_file_size = 512,
        increment = 16,
        max_rollover_files = 5)
    WITH (MAX_DISPATCH_LATENCY = 5SECONDS);

END

--------------------------------------------------------------------------------
-- Turn on the extended event
--------------------------------------------------------------------------------
IF NOT EXISTS
(
    -- When a XE is active, then there is an entry
    -- in sys.dm_xe_sessions
    SELECT
        *
    FROM
        sys.dm_xe_sessions AS DXS
        INNER JOIN 
            sys.server_event_sessions AS SES
            ON SES.name = DXS.name
    WHERE
        SES.name = N'what_queries_are_failing'
)
BEGIN
    -- Start the session
    ALTER EVENT SESSION what_queries_are_failing
    ON SERVER STATE = START;
END

--------------------------------------------------------------------------------
-- Wait for errors
-- PROFIT!
--------------------------------------------------------------------------------

That's not so bad. Sure, it looks like a lot of code but you've got 1-2 lines of code to change (page compression and where to log to).

Where's the beef?

Once you think you've captured enough data, I turn off the session and parse the files into tables. As I am not a demon robot, I am grateful to Kendra for writing the XQuery for me to parse the meaningful data out of the trace files. I am a developer, but not that kind of developer.

USE tempdb;

--------------------------------------------------------------------------------
-- Turn off our extended event
--------------------------------------------------------------------------------
IF EXISTS
(
    -- When a XE is active, then there is an entry
    -- in sys.dm_xe_sessions
    SELECT
        *
    FROM
        sys.dm_xe_sessions AS DXS
        INNER JOIN 
            sys.server_event_sessions AS SES
            ON SES.name = DXS.name
    WHERE
        SES.name = N'what_queries_are_failing'
)
BEGIN
    -- Start the session
    ALTER EVENT SESSION what_queries_are_failing
    ON SERVER STATE = STOP;
END

--------------------------------------------------------------------------------
-- Extract data from our XE
--------------------------------------------------------------------------------
;
WITH events_cte AS
(
    SELECT
        DATEADD(mi,
        DATEDIFF(mi, GETUTCDATE(), CURRENT_TIMESTAMP),
        xevents.event_data.value('(event/@timestamp)[1]', 'datetime2')) AS [err_timestamp],
        xevents.event_data.value('(event/data[@name="severity"]/value)[1]', 'bigint') AS [err_severity],
        xevents.event_data.value('(event/data[@name="error_number"]/value)[1]', 'bigint') AS [err_number],
        xevents.event_data.value('(event/data[@name="message"]/value)[1]', 'nvarchar(512)') AS [err_message],
        xevents.event_data.value('(event/action[@name="database_id"]/value)[1]', 'int') AS [database_id],
        xevents.event_data.value('(event/action[@name="sql_text"]/value)[1]', 'nvarchar(max)') AS [sql_text],
        xevents.event_data,
        xevents.event_data.value('(event/action[@name="username"]/value)[1]', 'nvarchar(512)') AS [username],
        'what_queries_are_failing' AS session_name
    FROM sys.fn_xe_file_target_read_file
    (
        'F:\XEventSessions\what_queries_are_failing*.xel'
    ,   'F:\XEventSessions\what_queries_are_failing*.xem'
    ,   NULL
    ,   NULL) AS fxe
    CROSS APPLY (SELECT CAST(event_data as XML) AS event_data) AS xevents
)
INSERT INTO
    dbo.ExtendedEventErrorCapture
(
    err_timestamp
,   err_severity
,   err_number
,   err_message
,   database_id
,   sql_text
,   event_data
,   session_name
,   username
)
SELECT
    E.err_timestamp
,   E.err_severity
,   E.err_number
,   E.err_message
,   E.database_id
,   E.sql_text
,   E.event_data
,   E.session_name
,   E.username
FROM
    events_cte AS E;

Mischief managed

Clean up, clean up, everybody clean up! As our final step, since we're good citizens of this database and we remove our extended event and once that's done, we'll leverage xp_cmdshell to delete the tracefiles.

USE tempdb;

--------------------------------------------------------------------------------
-- Get rid our extended event
--------------------------------------------------------------------------------
IF EXISTS
(
    SELECT
        *
    FROM
        sys.server_event_sessions AS SES
    WHERE
        SES.name = N'what_queries_are_failing'
)
BEGIN
    -- Clean up your session from the server
    DROP EVENT SESSION what_queries_are_failing ON SERVER;
END

--------------------------------------------------------------------------------
-- Get rid our extended event files only if the XE is turned off
-- or no longer exists
--------------------------------------------------------------------------------
IF NOT EXISTS
(
    SELECT
        1
    FROM
        sys.dm_xe_sessions AS DXS
        INNER JOIN 
            sys.server_event_sessions AS SES
            ON SES.name = DXS.name
    WHERE
        SES.name = N'what_queries_are_failing'

    UNION ALL
    SELECT
        1
    FROM
        sys.server_event_sessions AS SES
    WHERE
        SES.name = N'what_queries_are_failing'

)
BEGIN
    -- Assumes you've turned on xp_cmdshell
    EXECUTE sys.xp_cmdshell'del F:\XEventSessions\what_queries_are_failing*.xe*';
END

Now what?

Query our table and find out what's going on. I use a query like the following to help me identify what errors we're getting, whether they're transient, etc.
SELECT
    EEEC.err_message
,   MIN(EEEC.err_timestamp) AS FirstSeen
,   MAX(EEEC.err_timestamp) AS LastSeen
,   COUNT_BIG(EEEC.err_timestamp) AS ErrorCount
FROM
    tempdb.dbo.ExtendedEventErrorCapture AS EEEC
GROUP BY
    EEEC.err_message
ORDER BY
    COUNT_BIG(1) DESC;

You've already seen the results but I've very happy to have found a use case for XE in my domain. In the next post, I'll wrap this implementation up with a nice little bow. Or a SQL Agent job. Definitely one of the two.

1 comment:

Paul Timmerman said...

Love this type of XE. We actually leave an "error trap" XE very similar to this running on all of our instances, all the time. With no noticeable overhead, it has paid huge dividends in troubleshooting current and historical issues.