Introducing jdiscript
Charles Nutter's recent post on browsing memory with JRuby and the Java Debugger Interface reminded me of my own little project for the JDI, jdiscript ("helping you write elegant scripts for a more civilized debugger"... I'll work on that). I've spent some spare time over the last week or so dusting it off to get it from the simple handler class I've used for one-off tasks in the past to something more generally useful and enjoyable. It still has a ways to go, but I thought I would go ahead and share the basic features and show a quick script that hints at how powerful JDI scripting can be.
What you get from the JDI is a set of tools for working with VMs. A JDI program can start a new VM, attach to a running VM, be attached to by a VM, or be spawned from a VM when an exception occurs. Once you have a hook on one or more running VMs, you can then begin inspecting and controlling them. Inspection occurs via Mirrors representing the remote VM's live objects, classes, threads, and the VM itself, while control can be exerted by manipulating values via a Mirror or by requesting notification of particular events and handling them how you'd like.
The JDI is very powerful, but it can be a bit clunky for writing short, quick scripts, which is where jdiscript comes in. Launching or attaching to a VM is mostly boilerplate, so jdiscript provides utility classes that take care of that for you. While the JDI allows you to easily request event notifications, the process of matching those notifications to code that will do something useful in response is left to the programmer, so jdiscript provides an event loop and a standard way to dispatch events to handlers. There are also some nice minor features in jdiscript, like redefining JDI's EventRequest classes to be chainable. Most of this functionality is exposed through the top-level JDIScript class, which allows you to cut out a bit of code to access common operations. And of course, all of this is intended for use from a higher-level JVM language such as Groovy or JRuby.
So let's get started with an example. One of the more kickass events that we can track with the JDI is the MonitorContendedEnterEvent, which fires whenever a thread tries to enter a monitor that's already held by another thread. Uncontended locks in Java are basically free these days, but contention can still be a performance killer in multithreaded apps, so it's nice to be able to find these spots without a lot of headache so we can eliminate them as far as possible.
Here's a groovy script that prints out a stack trace for any contending threads, and prints another notice when the thread ends up entering the monitor:
package org.jdiscript
import org.jdiscript.handlers.
import org.jdiscript.util.VMSocketAttacher
import com.sun.jdi.
VirtualMachine vm = new VMSocketAttacher(12345).attach()
JDIScript j = new JDIScript(vm)
j.monitorContendedEnterRequest({
long timestamp = System.currentTimeMillis()
println "${timestamp}: Contended enter for ${it.monitor()} by ${it.thread()}"
it.thread().frames().each { println " " + it }
} as OnMonitorContendedEnter).enable()
j.monitorContendedEnteredRequest({
long timestamp = System.currentTimeMillis()
println "${timestamp}: Contended entered for ${it.monitor()} by ${it.thread()}"
} as OnMonitorContendedEntered).enable()
j.run({ println "Got StartEvent" } as OnVMStart)
println "Shutting down"
Let's fire this up against a Cassandra instance to see it in action. I picked Cassandra because it's a production-quality high-concurrency project that's still dirt easy to get up and running from source, and it already includes a python script for stress testing (you can grab a copy of the 0.6.4 release, the version used for this post, to try it for yourself).
To start, copy bin/cassandra to bin/cassandra-dbg and add '-agentlib:jdwp=transport=dt_socket,server=y,address=12345,suspend=y' to the exec command that starts the server, so that the JVM will wait for the debugging script to attach before it starts. Once it's up and waiting, fire off the groovy script. You should see the start event print out, followed by a bunch of stack traces. Most of these are just ReferenceQueue activity, and any others we don't really care about right now (it's just startup, so as long as there's no deadlocking, we're probably fine with a bit more contention).
Once Cassandra is up and stable, fire up contrib/py_stress/stress.py (you'll need to have thrift installed and the Cassandra python bindings generated). Here's what comes up for my run:
1281044297880: Contended enter for instance of sun.security.provider.Sun(id=1783) by instance of java.lang.Thread(name='ROW-MUTATION-STAGE:11', id=1784)
java.security.Provider:663 in thread instance of java.lang.Thread(name='ROW-MUTATION-STAGE:11', id=1784)
sun.security.jca.ProviderList:314 in thread instance of java.lang.Thread(name='ROW-MUTATION-STAGE:11', id=1784)
sun.security.jca.GetInstance:140 in thread instance of java.lang.Thread(name='ROW-MUTATION-STAGE:11', id=1784)
java.security.Security:659 in thread instance of java.lang.Thread(name='ROW-MUTATION-STAGE:11', id=1784)
java.security.MessageDigest:129 in thread instance of java.lang.Thread(name='ROW-MUTATION-STAGE:11', id=1784)
org.apache.cassandra.utils.FBUtilities:239 in thread instance of java.lang.Thread(name='ROW-MUTATION-STAGE:11', id=1784)
org.apache.cassandra.utils.FBUtilities:229 in thread instance of java.lang.Thread(name='ROW-MUTATION-STAGE:11', id=1784)
org.apache.cassandra.dht.RandomPartitioner:118 in thread instance of java.lang.Thread(name='ROW-MUTATION-STAGE:11', id=1784)
org.apache.cassandra.dht.RandomPartitioner:44 in thread instance of java.lang.Thread(name='ROW-MUTATION-STAGE:11', id=1784)
org.apache.cassandra.db.Memtable:124 in thread instance of java.lang.Thread(name='ROW-MUTATION-STAGE:11', id=1784)
org.apache.cassandra.db.Memtable:116 in thread instance of java.lang.Thread(name='ROW-MUTATION-STAGE:11', id=1784)
org.apache.cassandra.db.ColumnFamilyStore:434 in thread instance of java.lang.Thread(name='ROW-MUTATION-STAGE:11', id=1784)
org.apache.cassandra.db.Table:407 in thread instance of java.lang.Thread(name='ROW-MUTATION-STAGE:11', id=1784)
org.apache.cassandra.db.RowMutation:200 in thread instance of java.lang.Thread(name='ROW-MUTATION-STAGE:11', id=1784)
org.apache.cassandra.service.StorageProxy$2:138 in thread instance of java.lang.Thread(name='ROW-MUTATION-STAGE:11', id=1784)
org.apache.cassandra.utils.WrappedRunnable:30 in thread instance of java.lang.Thread(name='ROW-MUTATION-STAGE:11', id=1784)
java.util.concurrent.ThreadPoolExecutor$Worker:886 in thread instance of java.lang.Thread(name='ROW-MUTATION-STAGE:11', id=1784)
java.util.concurrent.ThreadPoolExecutor$Worker:908 in thread instance of java.lang.Thread(name='ROW-MUTATION-STAGE:11', id=1784)
java.lang.Thread:637 in thread instance of java.lang.Thread(name='ROW-MUTATION-STAGE:11', id=1784)
(Subsequent identical traces elided)
1281044297896: Contended enter for instance of sun.security.provider.Sun(id=1783) by instance of java.lang.Thread(name='ROW-MUTATION-STAGE:13', id=1785)
1281044297903: Contended enter for instance of sun.security.provider.Sun(id=1783) by instance of java.lang.Thread(name='ROW-MUTATION-STAGE:12', id=1786)
1281044297912: Contended enter for instance of sun.security.provider.Sun(id=1783) by instance of java.lang.Thread(name='ROW-MUTATION-STAGE:15', id=1787)
1281044297927: Contended enter for instance of sun.security.provider.Sun(id=1783) by instance of java.lang.Thread(name='ROW-MUTATION-STAGE:16', id=1788)
1281044297935: Contended enter for instance of sun.security.provider.Sun(id=1783) by instance of java.lang.Thread(name='ROW-MUTATION-STAGE:17', id=1789)
1281044297940: Contended enter for instance of sun.security.provider.Sun(id=1783) by instance of java.lang.Thread(name='ROW-MUTATION-STAGE:14', id=1790)
1281044297945: Contended enter for instance of sun.security.provider.Sun(id=1783) by instance of java.lang.Thread(name='ROW-MUTATION-STAGE:18', id=1791)
Pretty quickly, 8 threads back up waiting on a synchronized method deep down in the creation of a new MessageDigest which will be used to get an md5 hash of the row key. This is pretty easy to "fix" (though honestly it probably never caused anyone a problem): just preallocate some md5 MessageDigest instances and stuff them in threadlocals.
So there we go. A few lines of code, and we found an avoidable contention point without opening a profiler or even knowing anything ahead of time about the underlying codebase. Awesome.
I'll post more on troubleshooting with JDI in the future, including more examples. In the meantime, I'd love to get some feedback on jdiscript; feel free to email me (jfager at gmail), create an issue on github, or join jdiscript's new group. I've also thrown up a wiki page for collecting jdiscripts, so if you put together something useful, please share it.