No matter how many test cases you create or how many code reviews you conduct, things can and will go wrong. When they do, it's good to have a game plan. Solving problems can be frustrating and quite challenging. No two problems are alike, so following a recipe is often not enough. Problem resolution requires critical thinking and some creativity. That said, I hope the steps outlined in this article provide a good starting point.
The first step is to identify the problem. That seems fairly obvious, but you'd be surprised how easy it is to get caught up in the excitement and panic. It's important to understand that the Support team may not have the same understanding of a process as a developer. For instance, Support may report that a file was not delivered to the expected location. This can lead to a wild goose chase where you suspect that the FTP server is down, or that there is a permission problem. In reality, the process may have failed well before the delivery step, and the file was never even created.
Application logs are a good place to start when trying to figure out what went wrong. Many problems are fairly obvious and can be corrected easily. For the not-so-obvious ones, the logs can still provide clues as to the area of code that last executed before the problem occurred, either through a stack trace or by analyzing the last log message. The latter is only useful if the code was properly instrumented, with special care taken over how detailed the messages are and which severity levels they use. Depending on the nature of the problem, system logs can also provide insight into what was going on at the time of the failure.
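As a sketch of that kind of instrumentation (the class name, method, and file name here are hypothetical), `java.util.logging` lets you attach a severity level to each message, so the last message in the log tells you how far the process got:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class FeedProcessor {
    private static final Logger LOG = Logger.getLogger(FeedProcessor.class.getName());

    // Hypothetical delivery step; the point is where the log calls sit
    // and which severity level each one carries.
    static boolean deliver(String file) {
        LOG.log(Level.INFO, "Starting delivery of {0}", file);
        if (file == null || file.isEmpty()) {
            // SEVERE marks a failure an operator must act on.
            LOG.log(Level.SEVERE, "No file was created; aborting delivery");
            return false;
        }
        // FINE is for verbose diagnostics, normally filtered out in production.
        LOG.log(Level.FINE, "Delivery details: target={0}", file);
        LOG.log(Level.INFO, "Delivered {0}", file);
        return true;
    }

    public static void main(String[] args) {
        deliver("daily_report.csv");
    }
}
```

If the last line in the log is "Starting delivery" with no matching "Delivered", you know exactly which span of code to inspect.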
Re-creating the Problem
It's extremely important to understand and reproduce the problem. Being able to trigger it consistently, at will, provides the following benefits:
The logs should give some clue as to the nature of the problem, and understanding the type of problem will help narrow down the approach. For instance, did the process throw an exception? Did it run out of memory? Did it hang? Did it complete, but very slowly, or did it time out?
Exceptions

These are typically the easiest problems to fix, as they are accompanied by a stack trace. The application logs are the most useful tool. Common causes are:
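A stack trace is only in the logs if the code put it there. A common mistake is to log only `e.getMessage()`, which throws the trace away. A minimal sketch (the class and method names are hypothetical) of preserving it with `java.util.logging`:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class ParseStep {
    private static final Logger LOG = Logger.getLogger(ParseStep.class.getName());

    // Hypothetical parse step: on failure, the exception object itself is
    // passed to the logger, which writes the full stack trace to the log.
    static int parseQuantity(String raw) {
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            // log(Level, String, Throwable) keeps the stack trace;
            // logging only e.getMessage() would lose it.
            LOG.log(Level.SEVERE, "Bad quantity value: " + raw, e);
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseQuantity(" 42 "));
        System.out.println(parseQuantity("oops"));
    }
}
```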
Hung or Slow Processes

If the logs stop unexpectedly, or if the process has stopped responding, it could be running very slowly or hung. Here are some things to consider:
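For a JVM process, a thread dump (for example via the `jstack` tool) shows what every thread was doing at the moment of the hang. The same information is available programmatically through `ThreadMXBean`, which can also report deadlocks the JVM has detected. A minimal sketch:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class HangCheck {
    // Returns the IDs of threads the JVM knows are deadlocked
    // (findDeadlockedThreads returns null when there is no deadlock).
    static long[] findDeadlocks() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.findDeadlockedThreads();
        return ids == null ? new long[0] : ids;
    }

    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // A rough in-process equivalent of a jstack thread dump:
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            System.out.println(info.getThreadName() + " : " + info.getThreadState());
        }
        System.out.println("Deadlocked threads: " + findDeadlocks().length);
    }
}
```

Threads stuck in `BLOCKED` or `WAITING` states on the same lock across several consecutive dumps are a strong sign of a hang rather than slow progress.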
Out of Memory

If you are encountering out-of-memory errors, the first thing you need to determine is whether you have a leak or your JVM heap space setting is simply too small. (IT and developers often joke that developers are on a quest to use up all available RAM, while IT's goal is to run an enterprise on 8 bytes.) Joking aside, IT does a good job of keeping us in check. It is easy to just try doubling the heap space. But is this really solving the problem, or just delaying it? Here are some steps to follow that will help track down the problem:
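One quick way to tell a leak from an undersized heap is to watch usage over time: a leak climbs without bound, while a too-small heap plateaus near the `-Xmx` limit under normal load. A minimal sketch of reading those numbers from inside the process (the class name is hypothetical; the `Runtime` calls and JVM flags are standard):

```java
public class HeapReport {
    // Current heap usage in MB; sample this periodically and watch the trend.
    static long usedMb() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
    }

    // The ceiling the JVM will grow to, set with -Xmx.
    static long maxMb() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    public static void main(String[] args) {
        System.out.printf("heap used: %d MB of %d MB max%n", usedMb(), maxMb());
        // For a post-mortem, start the JVM with -XX:+HeapDumpOnOutOfMemoryError
        // so a heap dump is written when the error occurs, then inspect it offline.
    }
}
```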
Concurrency Issues

To be honest, this topic deserves an article of its own, but I will do my best to provide some brief insight. If the problem occurs intermittently, or if it's dependent on events from another process, there is a good chance you are dealing with a "timing" issue. To confirm this, switch your application to single-threaded mode. If the problem goes away, you have a concurrency issue. Here are some things to look for:
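The classic example of such a timing issue is an unsynchronized read-modify-write. The contrived sketch below (names are hypothetical) runs the same increment through a plain `int` and an `AtomicInteger`; under contention the plain counter can lose updates while the atomic one never does:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CounterRace {
    static int unsafeCount = 0;                        // racy: ++ is read-modify-write
    static final AtomicInteger safeCount = new AtomicInteger();

    static void run(int threads, int perThread) throws InterruptedException {
        Thread[] pool = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            pool[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    unsafeCount++;                     // two threads can read the same
                                                       // value and both write value + 1
                    safeCount.incrementAndGet();       // atomic, never loses an update
                }
            });
            pool[i].start();
        }
        for (Thread t : pool) t.join();
    }

    public static void main(String[] args) throws InterruptedException {
        run(8, 100_000);
        System.out.println("unsafe: " + unsafeCount + "  safe: " + safeCount.get());
    }
}
```

Note that the unsafe counter may come out correct on any given run; that intermittency is exactly what makes these bugs hard, and why forcing a single thread is such a useful diagnostic.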
When problems occur, stress levels rise, and panic sets in. It is important to be patient and understanding. Clear and concise communication goes a long way in making sure people feel comfortable.
Everyone makes mistakes, but we can do our best to avoid them by learning from them and being proactive. Research best practices such as: