Portal postmortem using java thread dumps to diagnose application failures
Download
1 / 29

Portal Postmortem Using Java Thread Dumps to Diagnose Application Failures - PowerPoint PPT Presentation


  • 89 Views
  • Uploaded on

Portal Postmortem Using Java Thread Dumps to Diagnose Application Failures. Joseph Dane University of Hawaii. © 2005 University of Hawaii. Overview. Motivation Concepts Examples Resources. Why? When? Who?. Why: because you’ve got a flaky application, and you don’t know the cause

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Portal Postmortem Using Java Thread Dumps to Diagnose Application Failures' - odeda


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Portal postmortem using java thread dumps to diagnose application failures l.jpg

Portal PostmortemUsing Java Thread Dumps to Diagnose Application Failures

Joseph Dane

University of Hawaii

© 2005 University of Hawaii


Overview l.jpg
Overview

  • Motivation

  • Concepts

  • Examples

  • Resources


Why when who l.jpg
Why? When? Who?

  • Why: because you’ve got a flaky application, and you don’t know the cause

    • This is your only choice (except, perhaps …)

  • When: now. Or possibly later.

    • You do need a running JVM

    • If you’ve got a dead JVM, you might have a core

  • Who: java developers

    • requires some familiarity with java

    • threading, locks, etc. in particular

    • but, no guru level knowledge required


General principles l.jpg
General Principles

  • The goal is not to solve problems, but to add to the pool of knowledge about the problem

  • Some sysadmin skills required/useful

    • Finding PIDs, sending signals, munging text

  • For some problems (e.g. G.C.), this will all be a complete waste of time

    • For other (fairly rare) problems, this will immediately point out the problem, e.g. deadlocks

  • “Postmortem”, in that we can (must, in fact) wait until a problem presents


What can we learn l.jpg
What can we learn?

  • Resource contention

    • network (database, RSS feed, etc.)

    • java locks

    • CPU

  • Not so good for

    • memory leaks


Terminology l.jpg
Terminology

  • “thread” -- a single execution path through a program

  • “method” -- a java function call

  • “stack” -- the context of the currently called methods/functions

  • “frame” -- a single method call, or a single line in a stack

  • “up/down” -- “up” is toward the TOS, “down” is toward the start of the program

  • “TOS” -- “top of stack”, or the currently executing method

  • “monitor” -- a java lock

    • only one thread can hold a given lock at a time


The stack l.jpg
The Stack

public void A() {

Z();

B();

}

public void B() {

C();

}

public void C() {

}

public void Z() {}

In words: A calls Z, then B, which calls C. C is at the TOS, and is the currently executing method.

Z was called previously by A, but is not currently executing, so is no longer on the stack.

C

B

A


The thread dump l.jpg
The Thread Dump

  • A snapshot of all activity in the program at a given instant

  • All JVMs can produce them, in different ways and formats

  • A text file, in various formats

    • All include a stack trace per live thread.

    • Some include explicit lock information.

  • Generating a thread dump

    • UNIX: send the QUIT signal to the JVM

      • Linux: send to the root of the process tree

      • Mac OS: apparently some flakiness in 1.4 JVM signals

      • TIP: use -XX:+JavaMonitorsInStackTrace on Sun 1.3.1

    • Windows: CTRL-Break

      • Need to arrange for the service to interact with the desktop.

  • Redirect output to a file.

  • jstack (tiger)

    • Not available for windows? Can also show native frames.

  • Generating the dump is usually safe.


A stack trace l.jpg
A Stack Trace

"tcpConnection-8443-0" daemon prio=5 tid=0x005469d0 nid=0x19b2800 in Object.wait() [f1112000..f1112ac0]

at java.lang.Object.wait(Native Method)

- waiting on <0x650c01f0> (a java.lang.Object)

at com.caucho.server.TcpServer.accept(TcpServer.java:648)

- locked <0x650c01f0> (a java.lang.Object)

at com.caucho.server.TcpConnection.accept(TcpConnection.java:211)

at com.caucho.server.TcpConnection.run(TcpConnection.java:132)

at java.lang.Thread.run(Thread.java:552)


The thread dump pictorially l.jpg
The Thread Dump, Pictorially

Time

T1

T2

T3

T4

T5

T6


Deadlock example l.jpg
Deadlock example

Found one Java-level deadlock:

=============================

"T2":

waiting to lock monitor 0x0180ed00 (object 0x2a563978, a java.lang.String),

which is held by "T1"

"T1":

waiting to lock monitor 0x0180ecdc (object 0x2a5639a8, a java.lang.String),

which is held by "T2"

Java stack information for the threads listed above:

===================================================

"T2":

at hawaii.jasig.DL$Locker.run(DL.java:47)

- waiting to lock <0x2a563978> (a java.lang.String)

- locked <0x2a5639a8> (a java.lang.String)

at java.lang.Thread.run(Thread.java:613)

"T1":

at hawaii.jasig.DL$Locker.run(DL.java:47)

- waiting to lock <0x2a5639a8> (a java.lang.String)

- locked <0x2a563978> (a java.lang.String)

at java.lang.Thread.run(Thread.java:613)


The tool l.jpg
The tool

  • Far too many threads for human consumption.

  • Written in a good text processing language.

  • Main concept: group related threads together.

    • Simplest definition of “related” lexical identity of stack frame.

  • Keep it Simple

    • Work on the tool ahead of time, verify that it works as expected.

  • Basic process: learn, hypothesize (I.e. guess), test, repeat.


Sample tool output l.jpg
Sample tool output

1: at com.pipeline.email.engine.ExternalSessionIMAP4.close(ExternalSessionIMAP4.java:1920)

1: at java.lang.Object.clone(Native Method)

1: at org.apache.xml.utils.TreeWalker.startNode(TreeWalker.java:274)

2: at com.pipeline.email.engine.ExternalSessionIMAP4.close(ExternalSessionIMAP4.java:1914)

2: at com.iplanet.server.http.servlet.NSServletSession.writePrivate(Native Method)

3: at org.jasig.portal.UserInstance.renderState(UserInstance.java:282)

5: at com.pipeline.email.engine.ExternalSessionIMAP4.getPersonalNamespace(ExternalSessionIMAP4.java:243)

6: at java.net.PlainSocketImpl.socketAccept(Native Method)

14: at com.pipeline.email.engine.ExternalSessionIMAP4.getJMCacheSubfolders(ExternalSessionIMAP4.java:599)

21: at java.lang.Thread.sleep(Native Method)

47: at java.net.SocketInputStream.socketRead(Native Method)

262: at java.lang.Object.wait(Native Method)

total of 365 threads


Step 0 preparation l.jpg
Step 0 -- Preparation

  • Monitoring

    • Manual, e.g. “top” “ps”, etc.

    • Automatic, e.g. MRTG

    • Learn about normal behavior.

      • Number of threads? Distribution?

  • Response plan

    • Notifications. Who gets called?

    • Gather wide expertise. What might be changed? A new switch or firewall? Has registration started?

  • Analysis plan

    • Access to the source is important, probably essential

    • Organize. New directory for each incident, each dump in a separate file. Keep notes as you work.


Step 1 something s wrong l.jpg
Step 1 -- Something’s Wrong!

  • Generate the dump. Maybe a couple of dumps, but one is probably OK for now.

  • Maybe clean the dump file.

    • depends on how smart your tool is

  • Check the system’s vital signs.

    • Running out of memory?

    • CPU bound? Idle CPU often means a concurrency problem, or an unavailable resource.


Step 2 what s happening l.jpg
Step 2 -- What’s happening?

  • 10/27/2004. production portal becomes non-responsive

  • First, try looking at the tops of the stacks. See anything odd?

[EXAMPLE 1] (2004102701)

1: at com.pipeline.grouptools.lib.GTSession.initialize(GTSession.java:203)

2: at java.util.HashMap.put(HashMap.java:324)

6: at java.net.PlainSocketImpl.socketAccept(Native Method)

20: at java.lang.Thread.sleep(Native Method)

47: at java.net.SocketInputStream.socketRead(Native Method)

133: at java.lang.Object.wait(Native Method)

235: at netscape.ldap.util.ConnectionPool.getConnFromPool(ConnectionPool.java:208)

total of 444 threads


Slide18 l.jpg

[EXAMPLE 2] (ex2)

3: at java.net.PlainSocketImpl.socketAccept(Native Method)

6: at java.lang.Thread.sleep(Native Method)

16: at java.net.SocketInputStream.socketRead0(Native Method)

33: at java.lang.Object.wait(Native Method)

total of 58 threads

  • Often the tops are not that useful

  • Lots of things call Object.wait. Best to ignore this for now.

  • 16 Threads insocketRead0. Try looking down the stack

interesting traces ...

16 interesting traces

17 frames down ...

10: at edu.hawaii.portal.channels.jasig.ReallySlowChannel.

setRuntimeData(ReallySlowChannel.java:47)

total of 10 threads


Slide19 l.jpg

Thread 2

Thread 3

Thread 4

Thread 1

edu.hawaii.portal.channels.jasig.ReallySlowChannel.setRuntimeData


Slide20 l.jpg

Thread 1

Thread 2

Thread 3

Thread 4

edu.hawaii.portal.channels.jasig.ReallySlowChannel.setRuntimeData


Reality is usually more complicated l.jpg
Reality is usually more complicated …

[EXAMPLE 3] (2004120101)

6: at java.net.PlainSocketImpl.socketAccept(Native Method)

9: at com.example.foo.Bar(Bar.java:559)

9: at com.iplanet.server.http.servlet.NSServletSession.ostream_flush(Native

Method)

9: at org.jasig.portal.UserInstance.renderState(UserInstance.java:267)

15: at java.lang.Thread.sleep(Native Method)

36: at org.jasig.portal.UserInstanceManager.getUserInstance(UserInstanceMana

ger.java:76)

72: at com.example.foo.Foo.setFoo(Foo.java:596)

99: at java.net.SocketInputStream.socketRead(Native Method)

139: at com.example.foo.Foo.getFoo(Foo.java:534)

142: at com.iplanet.server.http.session.IWSHttpSession.removeAllObjects(IWSHt

tpSession.java:705)

232: at java.lang.Object.wait(Native Method)

total of 788 threads


When tos isn t enough l.jpg
When TOS isn’t enough

  • Sometimes you need to identify threads which pass through certain methods .

    • “pass through” meaning “currently executing”

  • Call these methods (and threads) “interesting”.

  • Some threads may be blocked in the interesting methods. Some may have made it through to other methods.


Interesting example l.jpg
“interesting” example

public class Example {

private Object aLock = new Object();

public void foo() {

synchronized (aLock) {

bar();

}

}

private void bar() {

// calls other methods …

}

}


Slide24 l.jpg

Let’s call removeAllObjects “interesting” and see what we get …

interesting traces ...

212 interesting traces

interesting TOS ...

70: at com.example.Foo.setUserFoo(Foo.java:596)

142: at com.iplanet.server.http.session.IWSHttpSession.removeAllObjects(IWSHttpSession.java:705)

  • 212 threads made it into or past removeAllObjects.

  • of these, 142 are blocked inside that method (why? dunno) and 70 have made it past, and are now blocked in setUserFoo

  • Time to check the source …


Slide25 l.jpg

setUserFoo():

synchronized( aLock )

{

}

getUserFoo():

synchronized( aLock )

{

}

_getUserFoo():

synchronized( aLock )

{

}


Slide26 l.jpg

1: at java.net.SocketInputStream.socketRead(Native Method)

9: at com.example.foo.Foo.getUserFoo(Foo.java:559)

72: at com.example.foo.Foo.setUserFoo(Foo.java:596)

139: at com.example.foo.Foo.getUserFrob(Foo.java:534)

total of 221 threads

Bingo! One thread in socketRead! Find that thread!


Slide27 l.jpg

nid=0xc9

suspended

[0x8dc77000..0x8dc79668]

at java.net.SocketInputStream.socketRead(Native Method)

at java.net.SocketInputStream.read(SocketInputStream.java:85)

at oracle.net.ns.Packet.receive(Unknown Source)

at oracle.net.ns.NetInputStream.getNextPacket(Unknown Source)

at oracle.net.ns.NetInputStream.read(Unknown Source)

at oracle.net.ns.NetInputStream.read(Unknown Source)

at oracle.net.ns.NetInputStream.read(Unknown Source)

at oracle.jdbc.ttc7.MAREngine.unmarshalUB1(MAREngine.java:726)

at oracle.jdbc.ttc7.MAREngine.unmarshalSB1(MAREngine.java:698)

at oracle.jdbc.ttc7.Oclose.receive(Oclose.java:105)

at oracle.jdbc.ttc7.TTC7Protocol.close(TTC7Protocol.java:533)

at oracle.jdbc.driver.OracleStatement.close(OracleStatement.java:639)

TIP: Learn to recognize the unexceptional


Resources l.jpg
Resources

  • jstack, jconsole

  • BEA analysis tool.

    • uses XSLT to produce graphs, analysis

    • http://dev.bea.com/products/wlplatform81/articles/thread_dumps.jsp

  • sample code

    • http://www2.hawaii.edu/~jdane/jasig


Finally l.jpg
Finally

  • If you haven’t considered this method (as a “real” method), you should.

  • Success rate: 20/50/30

  • don’t neglect QC


ad