I love going deep: digging into memory dumps, gigabyte-sized ProcMon logs, pages of API traces and so on. That being said, there is a case for the old-fashioned, classic troubleshooting methods that apply in pretty much any industry and technology. The great thing about classic troubleshooting methods is that, without even knowing the technology or product, you can often make more progress than with a random attack on mountains of data. (OK, it doesn’t help that I’m no Raymond Chen on WinDbg; I have no choice but to start at the basics.)
I have seen many such examples throughout my career, but I’m writing this because today I saw another great one.
A large organization had Outlook crashing all over the place. It was unstable. It froze. It took 5 minutes to send an email. It was a consistently reproducible problem. Microsoft was engaged in a support case, as was the anti-virus vendor, since anti-virus was suspected as a potential cause. Microsoft support asked for tests to be run. These are valuable tests, and ones I would run myself. However, this is where they started:
1) Reproduce problem 5 times, 1 minute apart, each time running ADPlus -hang to generate 5 user mode dumps.
2) Reproduce problem, generate a FULL memory dump on the client.
3) Reproduce problem, generate a ProcMon log.
All great steps in troubleshooting application hangs; I don’t doubt that. So these tests were run on 3 machines, and 16 GB of logs were produced to transfer to Microsoft support. Two days later (possibly overwhelmed by all the data), Microsoft support asked for even more tests, running additional tracing tools. Again, perfectly valid for identifying application hangs.
However, during this period I managed to get involved, and I’m proud to say that despite Microsoft having a week’s head start on me, I was able to very rapidly identify the root cause and resolve the issue without examining any logs or memory dumps. In fact, total troubleshooting time was probably about 15 minutes of testing.
My conversation with a technical resource at the customer site went something like this…
When did this start happening?
Since they migrated from GroupWise to Outlook.
How frequently does this occur?
Every time they send an email.
What version of Outlook / OS are you using?
Office XP + Office 2007, Windows XP SP3. We think the crashes might be caused by the different Office versions.
Have you tried running Outlook without add-ins?
Can you please run Outlook /safe?
OK. Just a moment…
5 minutes later…
OMG! The problem doesn’t occur anymore.
What add-ins do you have?
Please enable one at a time to work out which add-in is causing the problem. Make a table like this:
| Add-In #1 | Add-In #2 | Add-In #3 | Issue Occurs? |
Quick look at this table and you see Add-in #1 is the culprit. OK…
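For anyone who wants to keep the bookkeeping honest, the one-at-a-time test can be sketched in a few lines of Python. This is only an illustration, not anything that was run at the customer site: `run_isolation_tests`, `find_culprits`, `format_table` and the `issue_occurs` callback are hypothetical names, and the callback simply stands in for a person enabling a single add-in, restarting Outlook and sending a test email.

```python
from typing import Callable, Dict, List

RESULT_COLUMN = "Issue Occurs?"


def run_isolation_tests(
    addins: List[str], issue_occurs: Callable[[str], bool]
) -> List[Dict[str, bool]]:
    """Build one table row per test, with exactly one add-in enabled per row."""
    rows = []
    for enabled in addins:
        # Only the add-in under test is switched on; all the others stay off.
        row = {name: name == enabled for name in addins}
        # issue_occurs is a stand-in for reproducing the problem by hand.
        row[RESULT_COLUMN] = issue_occurs(enabled)
        rows.append(row)
    return rows


def find_culprits(rows: List[Dict[str, bool]]) -> List[str]:
    """Return the add-ins that were enabled in rows where the issue occurred."""
    culprits = []
    for row in rows:
        if row[RESULT_COLUMN]:
            culprits.extend(
                name for name, on in row.items() if name != RESULT_COLUMN and on
            )
    return culprits


def format_table(rows: List[Dict[str, bool]]) -> str:
    """Render the rows as a simple pipe-delimited Yes/No table."""
    if not rows:
        return ""
    headers = list(rows[0].keys())
    lines = ["| " + " | ".join(headers) + " |"]
    for row in rows:
        cells = ["Yes" if row[h] else "No" for h in headers]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


if __name__ == "__main__":
    addins = ["Add-In #1", "Add-In #2", "Add-In #3"]
    # Simulated outcome for the example: only Add-In #1 reproduces the problem.
    rows = run_isolation_tests(addins, lambda name: name == "Add-In #1")
    print(format_table(rows))
    print("Suspect:", find_culprits(rows))
```

The point of the table is that each row changes exactly one variable, so a single "Yes" in the last column points straight at the add-in that was enabled in that row.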
What version of add-in #1 do you have?
A quick Google search later, I found that this version of the add-in was not supported on Outlook 2007, and that the latest version was 9.2.
Contacted the vendor: the customer is entitled to a free client upgrade. Upgraded 3 users to test and the problem instantly disappeared. All without touching a log file…
Moral of the story: advanced troubleshooting techniques are great, but don’t use them as a replacement for the basics. Check the basics first. What’s changed? When did it start happening? How many users are affected? Does it happen at home, in the office, etc.? What version of the software? Anything in the event log? Does it happen on another machine? Any 3rd-party add-ins?
EDIT: 3 weeks later Microsoft came back and confirmed what I found.