ActiveDev Inc.
  • Home
  • ADCLOUD
    • Getting Started
    • Developing a Microservice
    • Reference
  • Services
  • About
  • Contact
  • Articles

Problem solving

3/5/2018

 
No matter how many test cases you create or how many code reviews you conduct, things can and will go wrong. When they do, it's good to have a game plan. Solving problems can be frustrating and quite challenging. No two are alike, so following a recipe is often not enough. Problem resolution requires critical thinking and some creativity. Having said that, I hope the steps outlined in this article can provide a good starting point.
Identification
The first step is to identify the problem. It seems like a fairly obvious first step, but you'd be surprised. It's quite easy to get caught up in the excitement and panic. It's important to understand that the Support team may not have the same understanding of a process as a developer. For instance, Support may report that a file was not delivered to the expected location. This may lead to a wild goose chase where you suspect that the FTP server is down, or that there is a permission problem. In reality, it could be that the process failed well before the delivery and the file was never even created.
  • Understand the flow of the process.
  • Check each step.
  • Look at log files.
  • Make sure the problem has been accurately depicted.
Check the Logs
Application logs are a good place to start when trying to figure out what went wrong. Many problems are fairly obvious and can easily be corrected. For the not so obvious, the logs can still provide clues as to the area of code that last executed before the problem occurred. This can be done either by looking at stack traces or by analyzing the last log message. The latter is only useful if the code was properly instrumented, with special care taken over how detailed the messages are and what severity level they carry. Depending on the nature of the problem, system logs can also provide insight as to what was going on at the time of the failure.
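As a rough illustration of what "properly instrumented" can look like, here is a minimal sketch using java.util.logging. The DeliveryStep class, its deliver method, and the file names are hypothetical and stand in for whatever step your process performs. The point is only that each stage logs at a deliberate severity level, so the last message tells you how far the code got.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical file-delivery step, instrumented so the log shows how far it got.
class DeliveryStep {
    private static final Logger LOG = Logger.getLogger(DeliveryStep.class.getName());

    // Returns true if the (simulated) delivery succeeded, false if it failed.
    static boolean deliver(String fileName, String targetDir) {
        // INFO: coarse milestones that should always be visible.
        LOG.log(Level.INFO, "Starting delivery of {0} to {1}", new Object[] {fileName, targetDir});
        try {
            if (fileName == null || fileName.isEmpty()) {
                throw new IllegalArgumentException("file name is missing");
            }
            // FINE: detail that is only needed while diagnosing a problem.
            LOG.log(Level.FINE, "Validated input for {0}", fileName);
            // ... the real transfer would happen here ...
            LOG.log(Level.INFO, "Delivered {0}", fileName);
            return true;
        } catch (RuntimeException e) {
            // SEVERE: include the exception so the stack trace lands in the log.
            LOG.log(Level.SEVERE, "Delivery failed for file " + fileName, e);
            return false;
        }
    }
}
```

With logging like this, a missing "Delivered" message points at the transfer itself, while a SEVERE entry with a stack trace points at the input.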
Re-creating the Problem
It's extremely important to understand and reproduce the problem. Being able to reproduce it consistently and at will provides the following benefits:
  • Developers can run the code through a debugger.
  • If the problem is only reproducible in production, developers can add debug messages that provide more information as to what is happening.
  • Developers and QA can accurately test a fix, which removes any guesswork.
Of course, recreating a problem can sometimes be easier said than done. Try to isolate where the problem is occurring. Run the system bare bones and slowly start adding components until the problem arises.
Problem Type
The logs should give some clue as to the nature of the problem. Understanding the type of problem will help narrow down the approach. For instance, did the process throw an exception? Did it run out of memory? Did it hang? Did it complete, but very slowly, or did it time out?
Application Exception
These are typically the easiest problems to fix as they are accompanied by a stack trace. The application logs are the most useful tool. Common causes are:
  • Defective code.
  • Deployment issues, such as a misconfiguration.
  • Bad data.
Application Hung
If the logs stop unexpectedly, or if the process is still present but not responding, it could be running very slowly or hung. Here are some things to consider:
  • Did the log file stop in the middle of a message? Make sure the system didn't run out of disk space.
  • Was the last message in the log about a call to an external resource? The process could be hung on a TCP/IP call, a deadlock on a database, exhausted system resources, thread or object pool contention, file locks, etc.
  • Take several Java thread dumps at three second intervals. Comparing them will give you an indication of where the threads are stuck and whether they are hung or stuck in a loop.
  • Take a look at system resources to see if there is a bottleneck. If the code is stuck on an external resource, then check its utilization as well.
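The dump step above can also be done from inside the JVM. The following is a minimal sketch, assuming you are able to run code in the affected process. ThreadSnapshot is a hypothetical helper name, while ThreadMXBean is the standard java.lang.management API. From outside the process, the jstack tool that ships with the JDK produces similar output.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper that captures a text snapshot of all live threads.
class ThreadSnapshot {
    // Lists every live thread with its state and stack frames,
    // followed by the number of deadlocked threads the JVM detects.
    static String capture() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        StringBuilder out = new StringBuilder();
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            out.append('"').append(info.getThreadName()).append("\" ")
               .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append('\n');
            }
        }
        long[] deadlocked = threads.findDeadlockedThreads(); // null when none are found
        out.append("Deadlocked threads: ")
           .append(deadlocked == null ? 0 : deadlocked.length).append('\n');
        return out.toString();
    }

    public static void main(String[] args) throws InterruptedException {
        // Three snapshots, three seconds apart, so they can be compared.
        for (int i = 0; i < 3; i++) {
            System.out.println(capture());
            if (i < 2) {
                Thread.sleep(3000);
            }
        }
    }
}
```

A thread that shows the same stack in every snapshot is probably blocked. One whose frames change but never leave the same method is more likely stuck in a loop.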
Out of Memory
If you are encountering out of memory exceptions, the first thing you need to determine is whether you have a leak or whether your JVM heap space setting is too small. (IT and developers often joke that developers are on a quest to use up all available RAM, while IT's goal is to run an enterprise on 8 bytes.) Joking aside, IT does a good job of keeping us in check. It is easy to just try doubling the heap space, but is this really solving the problem, or just delaying it? Here are some steps to follow that will help track down the problem:
  • Generate and analyze heap dumps to determine where possible leaks are. (I will try to cover this in more detail in a future article.)
  • Consider how much data you cache. Does your heap allow enough space to accommodate it?
  • How much stack space are you using? Do you have a lot of deeply recursive function calls?
  • How is garbage being collected? Is the collector thrashing? (This one can also show up in the slow response time scenario.)
  • Are you creating and destroying a lot of objects? Is your garbage collector keeping up?
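Before reaching for a heap dump, a quick look at heap usage and garbage collection counters can help answer the questions above. This is a minimal sketch using the standard java.lang.management beans. HeapReport is a hypothetical class name, and the numbers it prints are only a starting point, not a leak diagnosis. On HotSpot JVMs, starting the process with -XX:+HeapDumpOnOutOfMemoryError is one way to get a heap dump at the moment of failure.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper that summarizes heap usage and GC activity as text.
class HeapReport {
    private static final long MB = 1024L * 1024L;

    static String summarize() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        StringBuilder out = new StringBuilder();
        out.append("heap used=").append(heap.getUsed() / MB).append("MB max=");
        // getMax() returns -1 when the maximum is undefined.
        if (heap.getMax() < 0) {
            out.append("undefined");
        } else {
            out.append(heap.getMax() / MB).append("MB");
        }
        out.append('\n');
        // One line per collector: how often it ran and the total time spent.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            out.append(gc.getName())
               .append(": collections=").append(gc.getCollectionCount())
               .append(" timeMs=").append(gc.getCollectionTime())
               .append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(summarize());
    }
}
```

Logging this summary periodically gives a trend. Used heap that keeps climbing after collections suggests a leak, while collection counts and times that grow quickly suggest the collector is struggling to keep up.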
Concurrency
To be honest, this topic should be an article of its own, but I will do my best to provide some brief insight. If the problem occurs intermittently, or if it's dependent on events from another process, there is a good chance you are dealing with a "timing" issue. To confirm this, switch your application to run single threaded. If the problem goes away, you have a concurrency issue. Here are some things to look for:
  • Are you sharing state between threads? If so, make sure any modifications are protected by locks or synchronized blocks.
  • Look for places where you should be using volatile, Atomic classes, or concurrent collection classes.
  • Try to have your threads work with immutable objects.
  • Clone mutable objects such as arrays where possible, rather than sharing them.
  • If you are using third party tools, make sure they are thread safe.
  • Keep in mind that not all Java classes are thread safe. Date, for example, is mutable and not thread safe.
  • Isolate the problem by synchronizing large blocks of code and progressively making the blocks smaller until the problem reappears.
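To make the shared state point concrete, here is a small sketch that increments two counters from several threads: a plain int with no protection and an AtomicInteger. CounterDemo and its field names are made up for illustration. The unprotected counter may or may not lose updates on any given run, which is exactly what makes this kind of bug intermittent.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo of unprotected versus atomic shared state.
class CounterDemo {
    static int unsafeCount = 0;                                  // shared, unprotected
    static final AtomicInteger safeCount = new AtomicInteger();  // shared, atomic

    // Starts `threads` workers that each increment both counters `perThread` times,
    // then waits for all of them to finish.
    static void run(int threads, int perThread) {
        unsafeCount = 0;
        safeCount.set(0);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    unsafeCount++;               // read-modify-write, not atomic
                    safeCount.incrementAndGet(); // atomic increment
                }
            });
            workers[t].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
    }

    public static void main(String[] args) {
        run(4, 100_000);
        System.out.println("expected=400000 unsafe=" + unsafeCount + " safe=" + safeCount.get());
    }
}
```

The atomic counter always ends at the expected total. The plain counter can come up short because two threads may read the same value before either writes it back.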
Communication
When problems occur, stress levels rise, and panic sets in. It is important to be patient and understanding. Clear and concise communication goes a long way in making sure people feel comfortable.
  • Establish regular intervals for status reports.
  • Status reports should provide a list of what has been done and what is remaining with ETAs.
  • Don't make excuses. At this point, the only thing a client wants to know is what is being done to resolve the problem and how long it will take.
  • Avoid the dreaded all-day conference calls. If a client insists on them, assign a point of contact who can keep them up to date. Diagnosing problems can be very intense, and developers will require blocks of uninterrupted time.
Proactive
Everyone makes mistakes, but we can do our best to avoid them by trying to learn from them and being proactive. Research best practices such as:
  • Coding standards
  • Design patterns
  • Code reviews
  • Continuous integration with frequent and automated builds
  • Enforced thresholds for unit test code coverage
  • Defensive programming
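As one small example of the last item, defensive programming often means rejecting bad input at the boundary and never sharing mutable internals. The sketch below is only an illustration. DeliveryRequest and its fields are hypothetical, and the same idea applies to any value object in your own code.

```java
import java.util.Objects;

// Hypothetical immutable value object that validates its input up front.
final class DeliveryRequest {
    private final String fileName;
    private final byte[] payload;

    DeliveryRequest(String fileName, byte[] payload) {
        // Fail fast with a clear message instead of failing later, far from the cause.
        this.fileName = Objects.requireNonNull(fileName, "fileName must not be null");
        if (fileName.isEmpty()) {
            throw new IllegalArgumentException("fileName must not be empty");
        }
        // Defensive copy: later changes to the caller's array cannot affect this object.
        this.payload = Objects.requireNonNull(payload, "payload must not be null").clone();
    }

    String fileName() {
        return fileName;
    }

    // Return a copy so callers cannot modify the internal array.
    byte[] payload() {
        return payload.clone();
    }
}
```

Because the object cannot change after construction, it is also safe to hand to multiple threads.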
