Wednesday, March 12, 2008

SQL Server Profiler – a lesson learned the hard way

Background: I was recently at the site of a client who was concerned that the CPU utilization had shot up in the last few months without any apparent reason. Since this was a node on an Active/Active cluster and was supposed to be the failover node for their more important node they had moved quickly and upgraded from a 64-bit dual processor to a 64-bit quad. The CPU utilization had dropped from over 70% to somewhere in the 30-35% range however the concern was still there – they wanted to find out what caused the trouble in the first place.

Trouble starts: After spending a bit of time understanding the environment, looking at the trends and documenting the current status I decided to start a trace on the effected node and save a few minutes of data on a SQL Server table. I was feeling pretty good about it until I clicked on “start” and the trace fired away. I looked at the CPU utilization and I noticed, to my utter surprise that it dropped significantly and it was staying down – I did not like that and stopped the trace immediately (it must have been running for about 65 seconds).

There, now I had another puzzle on my hand – was that a simple coincidence that as soon as I started the trace it so happened that the “demand” on the server dropped, maybe a great number of users just logged off and went to eat lunch or was it something else?! Soon after a very concerned user came buy and was telling the resident IT person in charge that this third party application had gone haywire – it had started assigning case numbers from the beginning of time creating duplicate entries in the system etc…

Now I am not going to go into how totally unacceptable it is for a professional software application to behave that way… regardless, it was my trace that somehow caused this trouble that ended up taking hours of manual labor to clean up.

So, now I am stuck, I am looking at all kinds of reports but without current trace data I can’t see what’s going on and after what happened I don’t dare start another trace.

Solution: After I ran out of alternative options I decided to give it another try – this time I was not going to run a trace from SQL Server Profiler on my remote client machine but instead a server side trace. Furthermore, instead of storing the trace data on SQL Server I directed it to file on another drive. I got the script ready and keeping my fingers crossed executed it while watching the CPU utilization – this time I was pleasantly surprised to see that nothing seemed to change, the trace was not affecting the CPU utilization at all – I also had someone monitoring that third party application to make sure that it wasn’t going haywire again – everything seemed good.

I let the trace run for a whole hour. I downloaded the trace files (a couple of gigs of data) and uploaded all the data in a SQL Server database on my local machine. Now things were a lot easier – it did not take more than a few minutes to identify the query that was causing all the trouble - a trigger from that third party had failed and a couple of tables had ballooned to hundreds of thousands of rows when they were supposed to have no more than something like 50 rows! Trigger was fixed and CPU utilization went down to 10% and it is staying at that level.

Lesson: Overall it was a great success but not without some bruising on the way. The lesson learned – never run a client side trace on a sensitive production system it can bring the server to its knees.

Related Info: In case you don’t have much experience and are wondering how to run a server side trace here is a quick guide:

  • Start SQL Server Profiler on your local machine;
  • Click on “New Trace” and connect to a non-sensitive server – if you have an instance of SQL Server on your local machine all the better – just connect to that one;
  • Define your trace;
  • Start the trace;
  • Stop the trace;
  • Go to File / Export / Script Trace Definition / For SQL Server [2005 / 2000];
  • Open the just saved script and change the path/file where you want your real trace data to go. Also, make sure all the other parameters are what you want them to be.
  • When you are ready to start the trace on the “real” server simply execute the script against the target server.
  • Note down the trace id so that you can stop and delete it when you are done.

Lastly, check out our products a lot of them are free - see the left panel here.

No comments: