Thursday, April 08, 2010

Stacktrace forencis

Many a times you would find the stacktrace ( through pstack or stacktrace files from production) supplied for finding the root cause. There's lot much information you can extract from stack-trace, here’s a summary from my experience in debugging core files.

  1. The faulty stack frame

You could pin point the faulty stack frame by looking at the backtrace file, you would see that all the stack-frames (barring one) are under wait (by executing blocking syscall e.g lwp_park or nanosleep et. al). As you know only one context would be in execution and you can identify it by looking at the last call.

Below is an extract of one such faulty stack frame

----------------- lwp# 9 / thread# 9 --------------------

fe241dd0 waitid (0, 22ac, f2372470, 3)

fe234bfc waitpid (22ac, f23725c4, 0, 0, f237261c, fe162b00) + 60

fe22828c system (f237275c, fe272034, 20000, 1, fe26a2b4, f237261c) + 2ec

fe54094c void MyApplication::generateStackTraceFile(MyModule::AppMgrId,long,const char*,bool) (f23728ac, 41c, 20fbc8, f23728ac, fe703c18, fe755050) + 42c

fe5404a4 void MyApplication::fatalSignal(int) (0, fe755050, 211f88, fe758db4, 388, 0) + 5c

fe240a14 __sighndlr (b, 0, f2372c90, fe540448, 0, 1) + c

fe235a30 call_user_handler (b, 0, 0, 100, fd3e1800, f2372c90) + 3b8

fe224220 fileno (0, 7c, ff11f728, ff11d3f4, 2881c, ff11efc4)

0004036c bool FileSender::sendPoll(unsigned short,MyAppEnum::RedEnum,int,unsigned short,unsigned long) (1, 8000, d213b, f2376ddc, 1a1400, 0) + db8

00059504 void FileSender::sendScPoll() (128430, 2affc227, 102024, 16e454, f237886c, 15f000) + f4

......
......

fe54d134 void*BaseThread::execute(void*) (604a08, 3b8, fe755050, 0, 108, 0) + 68

fe2408e8 _lwp_start (0, 0, 0, 0, 0, 0)

----------------- lwp# 10 / thread# 10 --------------------

  1. Identifying the last user-space function call

You could see from the stack trace that the last user space API looks like MyApplication::generateStackTraceFile, which is correct, but this is not the probable faulty function responsible for the crash, as this method is generating the trace-back file, this means that the ‘fatal signal’ was raised somewhere down the stack. So, we look down at the stack and see that ‘sendPoll’ is the previous user-space function call. ‘fineno’, call_uesr_handler, __sighndlr are either kernel space or libc based api’s, so we wouldn’t bother about them.

  1. Reaching the last statement within the function to pin-point the probable cause of crash.

This is a little trick part. One straight forward way is to attach a debugger to the core, and feel lucky that the core is ‘not corrupt' and you see the line.no. displayed using ‘where’ command, many a times the core is not available and all you have is a stacktrace file from production. If such a case the following information would be useful.

0004036c bool FileSender::sendPoll(unsigned short,MyAppEnum::RedEnum,int,unsigned short,unsigned long) (1, 8000, d213b, f2376ddc, 1a1400, 0) + db8

We pick the function call FileSender::sendPoll and see that there’s an offset db8 mentioned in the stack-trace, this means that the last instruction that caused this process to crash was at an offset of db8 bytes from the starting of the function FileSender::sendPoll

Next steps is to reach at this offset. Unfortunately, we can’t look at the C++ code to determine the bytes offset. What we need to do is compile the corresponding source file to generate an assembly code. On solaris we use SUN’s ‘CC’ compiler, and the corresponding compiler option to generate the assembly code is -S

  1. Generating assembly code file for a given source file

Use your compiler's 'generate assembly' switch to generate the code. Something like

bash-3.00$ pwd

/home/skp/cmp/obj/......./

bash-3.00$ ls -l FileSender.s

-rw-rw-r-- 1 skp staff 6640175 Mar 10 12:56 FileSender.s

  1. Moving to the offset with-in the function

Now that we have the assembled code, we can go the required offset. I’m listing down the assembled code. The code generated through –S option would have mangled symbols, to de-mangle them you could use c++filt as below

bash-3.00$ c++filt FileSender.s > FileSender_dem.s

! SUBROUTINE bool FileSender::sendPoll(unsigned short,MyAppEnum::RedEnum,int,unsigned short,unsigned long)

!

! OFFSET SOURCE LINE LABEL INSTRUCTION

......

......

bool FileSender::sendPoll(unsigned short,MyAppEnum::RedEnum,int,unsigned short,unsigned long):

/* 000000 2167 */ sethi %hi(0x4c00),%g1

………………….. .

…………………..

…………………..

Now keep moving down till we move to an offset of db8, meaning thereby, that we’ll have to reach an address of 0x0000 + 0x0db8 = 0x0db8

Following line gives the corresponding offset listing

/* 0x0db8 2286 */ call pclose ! params = %o0 ! Result =

Which shows that the last line caused this process to generate ‘fatal signal’ was pclose. The above line shows that the offset was 0x0db8 and corresponding line number within the C++ file is 2286

  1. Reviewing the equivalent code

2282 FILE* fd=popen(command,"w");

2283 if (fd == NULL){

2284 Logger.logMsg(MyAppTrace::ERROR,"Fail to execute ");//check for log, if you are luck you might see this log

2285 }

2286 pclose(fd); // this might result in SIGSEGV if fd is NULL

  1. Which signal might have caused the crash.

Again we revert back to the stacktrace, it shows

fe235a30 call_user_handler (b, 0, 0, 100, fd3e1800, f2372c90) + 3b8

as one of the calls. A little googling shows that this is a kernel space api that calls the ‘callback’ signal handler and its first argument is the signal number. Which is b, meaning 11 (SIGSEGV)

bash-3.00$ kill -l

1) SIGHUP 2) SIGINT 3) SIGQUIT 4) SIGILL

5) SIGTRAP 6) SIGABRT 7) SIGEMT 8) SIGFPE

9) SIGKILL 10) SIGBUS 11) SIGSEGV 12) SIGSYS


This confirms our debugging and we can provide the corresponding fix.

Hope this is helpful to many how are into debugging stufff.