Debugging Performance Issues

Sirikata is large, complicated, and uses multiple languages, so debugging performance issues can be tricky. The following describes the available tools, how to use them, and which is best suited to different situations.

External Tools

OProfile

If you're looking for hotspots in C++ code, this is the suggested tool. It works by having the kernel periodically trigger sampling, collecting stacktraces from Sirikata (and other applications). You run opcontrol to enable sampling, run the application, and then turn off opcontrol. A separate program, opreport, will report the collected data. See the OProfile documentation for details.

A simple wrapper script for whichever binary you're targeting can be very handy:

   #!/bin/bash
   
   # Configure the profiler before starting it: skip kernel samples
   # (--no-vmlinux) and collect callgraphs up to 6 frames deep.
   sudo opcontrol --no-vmlinux
   sudo opcontrol --callgraph=6
   
   # Clear any previously collected samples and start sampling.
   sudo opcontrol --reset
   sudo opcontrol --start
   
   ./cppoh_d "$@"
   
   # Flush the collected samples and shut the daemon down.
   sudo opcontrol --dump
   sudo opcontrol --stop
   sudo opcontrol --shutdown

When you run opreport, you'll need to filter out non-Sirikata modules (because OProfile examines the entire system). Make sure you include Sirikata's libraries and plugins as well as the binary you're examining!
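For example, a report restricted to particular images might look something like the following (the paths are illustrative; add your Sirikata libraries and plugins as additional comma-separated entries in the image: list):

   # Flat, symbol-level report restricted to the object host binary.
   opreport -l image:/home/me/sirikata/build/cmake/cppoh_d
   
   # Callgraph report (uses the --callgraph samples collected above).
   opreport --callgraph image:/home/me/sirikata/build/cmake/cppoh_d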

Google Perftools

Google Perftools is already included as we also use it for Debugging Memory Issues. To use it, set the CMake variable GOOGLE_PERFTOOLS_SETTING=CPU and recompile. When you execute the program you want to examine, set the environment variable CPUPROFILE to a location to save profiling data to, e.g.,

   > env CPUPROFILE=/tmp/cppoh.prof /home/me/sirikata/build/cmake/cppoh_d

Then run pprof to analyze the output.
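For example, to get a plain-text breakdown using the same paths as the example above:

   > pprof --text /home/me/sirikata/build/cmake/cppoh_d /tmp/cppoh.prof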

See the Google Perftools page for more details, and specifically the CPU profiling instructions.

Emerson

In principle, V8's internal profiler can be used, which should give information about both JavaScript and V8 CPU usage. You can use the --v8-flags setting or change the value directly in JSObjectScriptManager.cpp. However, so far we've only been able to get information about the C++ side.
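If you want to experiment, a starting point might be something like the following. Note that --prof is V8's own flag for its built-in sampling profiler (it writes a v8.log that V8's tick-processor scripts can analyze); exactly how the --v8-flags setting forwards its value to V8 is an assumption here, so check JSObjectScriptManager.cpp if it doesn't take effect:

   > ./cppoh_d --v8-flags=--prof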

GNU Profiler

You can build Sirikata with GNU Profiler (gprof) settings by setting CMAKE_BUILD_TYPE=Profile. gprof instruments the code at compile time and samples it as it runs, and it reports hotspot information in a manner similar to OProfile. However, it's more intrusive and seems to have a higher impact on performance, so you should generally use it only when you can't get the information you need from OProfile, or to double-check OProfile's results.
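A typical session from your CMake build directory might look like this (the directory and binary name match the examples above; the instrumented binary writes a gmon.out file in the working directory when it exits normally, which gprof then analyzes):

   cmake -DCMAKE_BUILD_TYPE=Profile .
   make
   ./cppoh_d                               # writes gmon.out on exit
   gprof ./cppoh_d gmon.out > profile.txt  # human-readable report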

See the GNU Profiler guide for more details.

Sirikata Tools

Coarse-Grained Profiling

For a few parts of the code where performance issues might be expected, especially where they could result in long-running event handlers, TimeProfilers are manually added around event handlers. This is much lighter weight than a profiler like gprof and minimally invasive, so it is useful for making sure your results aren't skewed by the profiler itself.

To enable profilers, add --profile=true to the command line (this should work for all binaries). The statistics will be reported at exit:

   [PROFILER] INFO: Profiler report: Object Host
   [PROFILER] INFO: Stage: Session Manager -- Avg: 2us Min: 2us Max:7us Sum: 38us  Its: 13
   [PROFILER] INFO: Stage: Handle Read Network -- Avg: 10us Min: 5us Max:1.045ms Sum: 125.492ms  Its: 12018
   [PROFILER] INFO: Stage: Handle Server Message -- Avg: 402us Min: 23us Max:4.346ms Sum: 1.8317s  Its: 4550
   [PROFILER] INFO: Stage: TimeSync -- Avg: 81us Min: 31us Max:173us Sum: 326us  Its: 4
   [PROFILER] INFO: Stage: OrphanLocUpdateManager -- Avg: 8us Min: 8us Max:9us Sum: 17us  Its: 2

This data covers the entire run, so be aware that the averages can hide variation or spikes.

You can add your own 'Stage' by using TimeProfiler::addStage. All Contexts have a TimeProfiler in Context::profiler. A few classes (e.g. PollingService) will automatically add a Stage if you provide a name for it in their constructor. In fact, PollingService is a good example of how to add your own TimeProfiler Stage.
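A minimal sketch of adding a stage by hand might look like the following. The Stage interface shown here (a pointer returned by addStage, with started()/finished() markers around the timed work) is an assumption based on how PollingService uses its stage, so check TimeProfiler in the source for the exact names:

   // Hypothetical sketch -- the Stage method names are assumptions.
   class MyService {
   public:
       MyService(Context* ctx)
        : mStage(ctx->profiler->addStage("My Event Handler")) // register a named stage
       {}
   
       void handleEvent() {
           mStage->started();   // assumed: mark the start of the timed region
           // ... do the work you want accounted to this stage ...
           mStage->finished();  // assumed: mark the end; the duration is recorded
       }
   
   private:
       TimeProfiler::Stage* mStage; // stage registered with the Context's TimeProfiler
   };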

Fine-grained Event Loop Profiling

Because everything runs within a single event loop (or a small number of them), you can see the results of a performance issue in seemingly unrelated areas. We provide some fine-grained profiling for the event loops to help track these issues down.

Quick start

Build with the CMake variable SIRIKATA_TRACK_EVENT_QUEUES enabled, run until you can observe the performance problem, and then call Sirikata::Network::IOService::reportAllStats() (e.g. from gdb) to dump per-service and per-strand queue statistics. Look for strands with high recent latency or many outstanding handlers. The rest of this section explains what is tracked and how to read the report.

Details

The core event loop classes are essentially just queues of event handlers to be executed. Strands provide serialization of related events that are unsafe to execute simultaneously. When you're seeing a performance issue that can't be attributed to the code that exhibits bad behavior, you may be seeing the queues getting backed up.

But since everything goes through these queues, it's difficult to track down the cause. It could be long-running event handlers clogging up the queue. Or it could be a lot of event handlers all dumped on the queue at once; individually they don't take long, but when handled in order they can block the queue for many seconds.

To determine what's going through the event queues and break it down by strand, you can tag each event when posting to a strand or to the main event queue. For example, this code from the Ogre plugin identifies the target method being called:

   void OgreSystem::onCreateProxy(ProxyObjectPtr p)
   {
       simStrand->post(
           std::tr1::bind(&OgreSystem::iOnCreateProxy,this,
               livenessToken(),p, false),
           "OgreSystem::iOnCreateProxy"
       );
   }

Tags are const char* (not Strings!) so that a quick hash table lookup can be used to record information as the event handler passes through the queue. We also track some timing information. However, these steps are only performed if the CMake variable SIRIKATA_TRACK_EVENT_QUEUES is enabled, so this costs essentially nothing in a normal build.

To report this information, you need to invoke the static method Sirikata::Network::IOService::reportAllStats. If you're in gdb you can do this with the call command:

   (gdb) call Sirikata::Network::IOService::reportAllStats()

This prints a simple report over all IOServices and their child IOStrands in the system. For example, you might get something like this (a shortened, but representative sample):

   [IOSERVICE] INFO: =======================================================
   [IOSERVICE] INFO: 'Object Host' IOService Statistics
   [IOSERVICE] INFO:   Timers: 148 with 895us recent latency
   [IOSERVICE] INFO:   Event handlers: 0 with 1.384ms recent latency
   [IOSERVICE] INFO:       (IOStrands) (148)
   [IOSERVICE] INFO: -------------------------------------------------------
   [IOSERVICE] INFO:   Child 'Object Host Main'
   [IOSERVICE] INFO:     Timers: 148 with 926us recent latency
   [IOSERVICE] INFO:     Event handlers: 0 with 211us recent latency
   [IOSERVICE] INFO:       Stream<EndPointType>::sendKeepAlive (95)
   [IOSERVICE] INFO:       Stream<EndPointType>::serviceStreamNoReturn (47)
   [IOSERVICE] INFO:       Connection<EndPointType>::checkIfAlive (5)
   [IOSERVICE] INFO:       Connection<EndPointType>::serviceConnectionNoReturn (1)
   [IOSERVICE] INFO: -------------------------------------------------------
   [IOSERVICE] INFO:   Child 'SessionManager'
   [IOSERVICE] INFO:     Timers: 0 with 0us recent latency
   [IOSERVICE] INFO:     Event handlers: 0 with 0us recent latency
   [IOSERVICE] INFO: -------------------------------------------------------
   [IOSERVICE] INFO:   Child 'EmersonScript a8fd93ea-01fb-4f63-b59e-9501df5788ba'
   [IOSERVICE] INFO:     Timers: 0 with 0us recent latency
   [IOSERVICE] INFO:     Event handlers: 0 with 1.73ms recent latency

(Note that this isn't actually from a trace with a performance issue, so there are no obvious issues like large latency numbers or too many outstanding event handlers.)

The top level is an entire IOService and each large sub-item is an IOStrand. They both track latency and queue depth, for both timed and immediate post()s. Note that since IOStrands work through the parent IOService, the IOService numbers include the IOStrand numbers. The latency is only measured over a window of recent event handlers (currently 100).

Below these statistics is a list of the tags of handlers currently in the queue, ordered by count. In this example, only (IOStrands) is listed under the IOService, which means that only handlers posted through IOStrands are currently enqueued. Within the child 'Object Host Main', we can see that only some networking handlers are in the queue. In this case, Stream<EndPointType>::sendKeepAlive is high on the list, but it is actually a timer that waits a long time; it will sit in the queue for a long while without blocking anything, so we can safely ignore it. We could then move on to the next item and check whether its latency, combined with the number of outstanding instances in the queue, is sufficient to cause a performance issue.

Generally you'll want to wait until you can directly observe the performance issue (e.g. frame rate drops or lag), or use the debugger to break when you can detect it programmatically, and then print this information. By looking at what's in the queue, you can probably track down the source of the delay. A good approach is to identify the strand or service with high recent latency and then look at the handlers with the most copies in the queue.

If you see (NULL) in the output, that's because someone didn't tag that callback. You can easily find these missing tags (and commit fixes back) by removing the default tag values from the methods in IOService and IOStrand and recompiling: every untagged call site will then fail to compile.
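As a hypothetical illustration (the real declarations live in IOService and IOStrand and their signatures may differ), the change amounts to deleting the default argument so the compiler flags every untagged call site:

   // Hypothetical signature sketch -- check IOStrand.hpp / IOService.hpp
   // for the real declarations before making this change.
   
   // Before: untagged posts silently fall back to a NULL tag.
   void post(const std::tr1::function<void()>& handler, const char* tag = NULL);
   
   // After: every call site must pass a tag or it fails to compile,
   // pointing you at each callback that still needs one.
   void post(const std::tr1::function<void()>& handler, const char* tag);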

Finally, beware that these statistics only capture information from code which uses these classes. Most code should be using them, but anything which works directly with the underlying boost::asio::io_service will not be accurately represented.