Dating Visually

Matt Gattis tweeted a quiz earlier tonight: 10 girls and 10 guys in a group. Sally dated 5 of the guys, Bob dated 2 of the girls. What’s the probability that Bob dated Sally? Think about it for a bit, then read on.

Kui Tang has a nice write-up of the solution over on his blog, but I thought I’d bang out a quick alternate explanation for those of us who like to visualize our probabilities: imagine a 10×10 grid of cells, the x axis corresponding to the men and the y axis to the women, with each cell either on or off depending on whether the x,y pair has been on a date. Count up all the unique grid configurations that have Sally going on 5 dates and Bob going on 2. That’s your denominator. Your numerator is then the number of these unique grids that have Sally matched with Bob. These are huge numbers, but every possible configuration of the non-Bob/non-Sally cells repeats for every unique Bob/Sally configuration, so those cells neatly cancel out.

The math given in Kui’s post is the same thing expressed with counting formulas, but I think picturing the problem as a set of unique grid layouts helps give a better intuitive understanding of what’s going on. It’s hard to accidentally overcount, for instance, because it’s clear that the visual equivalent to (10 2) * (10 5) counts the Bob+Sally cell too many times. It also erases the questions about the cases where only one other woman has dated at all or 2 other women have dated 10 guys, because it’s clear they’ve been taken into account as part of the massive number of states that cancel out when you do the tally.
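
For anyone who wants to check the tally by machine, here is a rough sketch of the grid count in Java. It assumes, as the grid picture does, that every grid consistent with the two clues is equally likely; the class and method names are made up for illustration. The shared Bob/Sally cell is either on (Sally has 4 more dates among the other 9 guys, Bob has 1 more among the other 9 girls) or off (5 and 2), and the remaining 9×9 block contributes the same factor to both cases.

```java
import java.math.BigInteger;

/** Counts date grids for the quiz, assuming every valid grid is equally likely. */
final class DatingGrid {
    /** Binomial coefficient C(n, k), computed exactly. */
    static BigInteger choose(int n, int k) {
        BigInteger result = BigInteger.ONE;
        for (int i = 1; i <= k; i++) {
            // Multiply before dividing: the running product is always divisible by i.
            result = result.multiply(BigInteger.valueOf(n - k + i)).divide(BigInteger.valueOf(i));
        }
        return result;
    }

    /** Grids with 5 cells on in Sally's row and 2 in Bob's column, with the shared cell as given. */
    static BigInteger grids(boolean bobDatedSally) {
        int sallyOthers = bobDatedSally ? 4 : 5; // Sally's dates among the 9 other guys
        int bobOthers = bobDatedSally ? 1 : 2;   // Bob's dates among the 9 other girls
        BigInteger rest = BigInteger.valueOf(2).pow(81); // the 9x9 cells touching neither of them
        return choose(9, sallyOthers).multiply(choose(9, bobOthers)).multiply(rest);
    }

    /** Probability that Bob dated Sally, as a reduced fraction {numerator, denominator}. */
    static BigInteger[] probability() {
        BigInteger numerator = grids(true);
        BigInteger denominator = grids(true).add(grids(false));
        BigInteger g = numerator.gcd(denominator);
        return new BigInteger[] { numerator.divide(g), denominator.divide(g) };
    }

    public static void main(String[] args) {
        BigInteger[] p = probability();
        System.out.println(p[0] + "/" + p[1]); // prints 1/5
    }
}
```

The 2^81 factor appears in both the numerator and the denominator, which is the "neatly cancel out" step in code form.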

Color Credulity

Color is a new photo sharing app that builds social networks based on proximity. You take a picture with the app, and it turns around and starts grouping you with and sharing photos from other people nearby who have done the same. Sounds kind of dumb, right? Why would I want to see photos from nearby strangers?

Well, Sequoia thinks there’s something there, and has put $41 million into the company before it’s really even launched (thanks to a killer pitch deck). “Not since Google” have they seen this. Given that “this” currently refers to an app that I can’t even get to work on my phone, I’m left hoping that there’s a lot more going on here.

So what could that be? I’m going to put on my magic hat of credulity now, and describe what I (yes, I, random internet wantrepreneur) would be willing to bet $41 million on in this space.

Color is being run by Bill Nguyen, who sold Onebox for $850M in 2000, Lala to Apple for over $80M in 2009, and (at least until 11:41am today) spent time at AdGent. I’m not going to say that his presence means Color will be successful, but I do take it as a pretty good sign that there’s no possible way their actual business story is “Color shows you photos taken by people in the same room and then money pours out”.

From the TechCrunch writeup:

Color is also making use of every phone sensor it can access. The application was demoed to me in the basement of Color’s office — where there was no cell signal or GPS reception. But the app still managed to work normally, automatically placing the people who were sitting around me in the same group. It does this using a variety of tricks: it uses the camera to check for lighting conditions, and even uses the phone’s microphone to ‘listen’ to the ambient surroundings. If two phones are capturing similar audio, then they’re probably close to each other.

Remember The Dark Knight, when Batman hacked into everyone’s cellphone and streamed back sonar data to build a cohesive picture of what was happening everywhere in the city? That sounds awfully similar to what’s going on here – photo, GPS, and audio streams feeding back to Color in such a way that they can build a real-time model of where all their users are, who they’re with, and what’s happening around them.

With that kind of technology, who cares what their frontend does? Based on the quality of the first release of the phone apps, they’re clearly not sweating it too much. Whatever hook they try to snag users with is just a way to get that datastream, so they should ride whatever wave is currently popular. This week it’s Instagram and Path, so, sure, do that. Next week it’s going to be something else, so next week they’ll shift their apps towards that, or if they really can’t figure out how to get traction, they’ll release an API and let others do it for them. It doesn’t matter how that data comes in, as long as it comes in.

The web is training advertisers how to most effectively work with real-time data (tracking cookies, ad auctions, sentiment analysis, Twitter monitoring, all of that). How many companies work on this? How much money is being spent on these efforts, and how much is being made? There’s already one $190B company in this space on the web; the startup that can bring the same sort of tools into the real world might actually have a shot at becoming another.

Facebook, Foursquare, Yelp, Gowalla, Brightkite, Loopt, and everyone else with check-in functionality are already going for this. The biggest differences with Color seem to be that they want check-ins to be implicit byproducts of actions users have other motivations for (you’re not trying to get a free soda, you’re taking a picture to, uh, show to strangers in the same room), and that they’re handling far more inputs than just location.

These differences are both potentially huge. Other services risk crossing a mental line where explicitly checking in feels like work done for compensation (which is bad, which is why Foursquare is set up as a game), whereas this is an attempt to keep motivation purely social. And using multimedia opens up the door for all kinds of data points – facial recognition to keep track of people who aren’t actively using their product, brand recognition to note logos on clothes or labels on bottles, song recognition to track what music people are actually listening to – that advertisers would pay through the nose for.

Taking the credulity hat back off, even though I really do think the potential business models could make a ton of money, I’m equally convinced that this initial attempt at getting users isn’t going to make it very far. With $41M in the bank, though, they’ve got plenty of room to fail.

Update: Bill Nguyen confirms all of this almost point-by-point in an interview with Business Insider:

Photo sharing is not our mission. We think it’s cool and we think it’s fun, but we’re a data mining company.

A Year of Cooking

It’s amazing how long you can go without basic life skills. Pretty much all the food that went into my body from college until last year came from a can, a box, or a restaurant, and either tasted bad, actively tried to kill me, or drained my bank account (or all three). The first step is acknowledging that you have a problem. The second step is learning to cook. The third step, apparently, is writing a post about what you learn over the course of a year of step two.

From learning to cook, I’ve saved hundreds of dollars, eaten much better, and picked up a new skill that I might actually be able to use in a post-apocalyptic setting (my first!). This is all basic stuff, but if you’re starting from zero like I was, it may be helpful.

- Get a decent chef’s knife. I picked up an OXO for under $20, and I love it. I’ve used other people’s nicer knives since getting this one, and there is a difference, but starting with a decent knife for cheap means you get to practice knife skills and maintenance without caring too much when you drop it on the floor a half-inch from your toe. Related: picking up hot things without putting sharp objects down first is not advised.

- Epicurious is awesome. Probably 80% of the recipes I tried this year came from this site. The recipes are usually well written and you can find ones from all over the difficulty spectrum (from quick-and-easy to spend-your-Saturday-hating-yourself-for-sucking-before-finally-just-ordering-a-pizza).

- When I first started, I thought salt was something that primarily goes on at the table so everyone gets as much or as little as they want. This is dead wrong, especially for meat. If you’re cooking steak, chicken, or pork, get some kosher salt (big flakes), sprinkle it on generously, and let it sit for a bit at room temperature before throwing it into the pan or oven. This locks in the moisture, and if you do it right, you shouldn’t really notice a salty flavor, it should just taste better. In college, I had a tradition of cooking myself a steak whenever I finished a big project, and I always wondered why it never tasted as good as what you get at a restaurant. It was the salt thing.

- When you’re chopping things up, make the results the same size so they cook at the same rate.

- If you’re frying, sauteing, or grilling chicken or pork, make sure you use cuts that are thin enough or that you can finish cooking in the oven. I’ve started butterflying chicken breasts before throwing them on the pan, and the difference is stark. A full breast takes too long to cook through and will either burn on the outside or dry out before it’s done, whereas the half breast stays moist and picks up a nice brown color while cooking in a much shorter time.

- I’ve been amazed at how many recipes want an onion. Learn how to chop one and save time and tears.

- Keep the pan hot. Every time you add something cool to the pan, you cool the pan off (stupid physics), so right when doing so, ramp up the burner and then taper it back to where it was as the food heats up to where you want it. I’ve only internalized this one in the last couple of months, after repeatedly banging my head on needing twice as much time as a recipe suggests. Keep things hot, and they cook faster. Similarly, it takes a lot less time to boil water if you put a lid on the pot. Weird how that works, right?

- Get to know by heart how long different basics take to cook, and how long things can sit when they’re done, so you only panic appropriately when everything else isn’t finished yet. Rice takes 20 minutes and can sit for a while, while steamed veggies take 10 and can’t. Chicken dries out quickly, while steak can rest for a spell. If you turn the heat down a notch, you can keep onions sauteing for a good while, but not so much for garlic, and not at all for peppers. That sort of thing.

ZestCash is not Good.

A couple of days ago, TechCrunch featured a favorable story about a new startup called ZestCash, which provides an online lending alternative to existing payday loans (I’m not going to link to them directly, you can get to them on your own easily enough). The story regurgitates ZestCash’s copy about the evils of the existing payday loan industry, including numbers highlighting just how usurious the sector is. What it fails to mention is ZestCash’s own rates, which run between 242% and 462% APR at the time that I’m writing this.

To put that into perspective, consumer advocates regularly warn against the abusive nature of the ~30% APRs charged by many credit cards. The Center for Responsible Lending, which is frequently mentioned on ZestCash’s website at the time of this writing, supports a 36% annual interest rate cap. To make that point absolutely clear: ZestCash *repeatedly* cites a consumer advocacy group in making the case that they’re a responsible lender, and then turns around and charges rates up to more than 12x those advocated for by that same group.

Beyond the ridiculously high rates, the entire site is filled with disingenuous copy that seems designed to make unsavvy consumers feel smart and responsible for using ZestCash. They claim in big letters on the front page that ZestCash is “up to 50% cheaper than a payday loan”, but you have to click two links deep to find the explanation for where that number comes from, at which point you learn that 50% is over a payday loan that has been “rolled over” 7 times. They have an entire page dedicated to trying to convince you that APR doesn’t really matter. They make a big deal about the fact that they don’t have a lot of extra fees, but the fees they do have are massive: a 30% ‘origination fee’ on every loan, and a $35 late fee per missed payment on top of whatever overdraft fees your bank charges. They make a big deal of the fact that they clearly disclose their terms, but they’re required to do so by federal law. Almost every sentence on their website makes me tremble with rage for one reason or another.

The worst part about all of this is that their marketing message seems to have worked, at least this early on. I learned about them from a tweet by an entrepreneur I admire, which said he liked how ZestCash was trying to do payday loans in a “don’t be evil” way (he seemed to back off this when their rates were pointed out). A Twitter search right now shows an overwhelmingly positive reaction, and the coverage of the service from major tech and business sites has been mostly positive as well. What gives? Do people really trust the press release they get from a company that much? Do they not go to the front page of the service and click around at least a little bit? Are Douglas Merrill and Shawn Budde big enough players that nobody’s willing to criticize them? Are their investors? I was similarly confused by the positive reaction to Betterment, a service that launched a little while ago that appears to try to convince consumers that ETFs are just savings accounts with higher returns, but this really takes it to a whole new level.

(ed. note: Betterment is no longer pushing the marketing approach mentioned above as aggressively as when they launched. Thanks, KW)

Rank-o-Matic Week 6

Busy day yesterday, so I only managed to tweet it, but Rank-o-Matic Week 6 rankings are up! New this week: I’ve always found myself clicking around the rankings to check the records of a given team’s opponents. In an effort to reduce that a bit, I’ve added some at-a-glance info. Now wins against teams with winning records are denoted with a W+, and wins against teams for which that game was the only loss are shown with a W1.

With that information emphasized, it’s easier to understand why LSU, who has a tendency to win in a dirt-ugly fashion, is #1 in the current rankings: they’ve beaten 4 winning teams, including handing West Virginia their only loss of the season so far. Ugly or not, on the field, they’re walking away with Ws against some of the highest-quality competition in college football.

Compare this to the AP poll: a #2 ranking for an Oregon team with only one quality win against Stanford and a resume otherwise made up of teams in the bottom half of the league or not in the league at all, and an undefeated LSU behind an Alabama team that just got spanked by South Carolina. AP polls like this week’s are exactly why I put the Rank-o-Matic together. I’m sick of seeing petty regional and personal biases matter more than what happens on the field.
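
The W+ and W1 markers described above boil down to a small rule, sketched here in Java. This is not the Rank-o-Matic's actual code. It assumes that "winning record" means more wins than losses, that the opponent's record already includes the game being labeled, and that W1 takes precedence when both markers would apply. The class and method names are hypothetical.

```java
/** Illustrative labeling of a win by the quality of the beaten opponent. */
final class WinLabel {
    /**
     * Label for a win over an opponent with the given season record.
     * The record is assumed to include the loss being labeled.
     */
    static String label(int opponentWins, int opponentLosses) {
        if (opponentLosses == 1) {
            return "W1"; // this game was the opponent's only loss
        }
        if (opponentWins > opponentLosses) {
            return "W+"; // opponent still has a winning record (assumed: wins > losses)
        }
        return "W";
    }

    public static void main(String[] args) {
        System.out.println(label(5, 1)); // W1
        System.out.println(label(4, 2)); // W+
        System.out.println(label(2, 4)); // W
    }
}
```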

Rank-o-Matic Week 5

Your week 5 is now complete, as the latest Rank-o-Matic rankings are up. This week adds a zeitgeist summary and tweaks the formula to give teams a bit less credit for close wins. Enjoy!

Rank-o-Matic Week 4

Rank-o-Matic week 4 rankings are up. The big change this week is that I’ve decided that the experiment to stop special-casing non-division schools is a failure. Instead, I’m arbitrarily deciding on 0.05 for a win and -0.95 for a loss, on the idea that a 12 win team with a non-division game should have a razor-thin edge over an otherwise identical 11 win team, and that any team that loses a non-division game should have no real shot at a high ranking. I’ve applied the change retroactively, so if you browse back to previous weeks, you’ll see the reports as they would have been had I been using this scoring system all season.

In the works for the coming weeks are a Zeitgeist view showing biggest movers and conference overviews, and a more principled way of rewarding road wins and penalizing home losses. And as always, big thanks to James Howell for collecting and hosting the score data I use to build the ranking. Enjoy!

The Rank-o-Matic is Back

After a year-long hiatus, I’m happy to announce that the Rank-o-Matic is back! Deep down in your heart, you’ve always yearned to know what my laptop thinks of the current college football season, and now, once again, you can. New features this time around include full-precision summation (thanks to Raymond Hettinger) and inter-weekly comparisons showing each game’s change in value and each team’s change in rank order. I’ve also temporarily stopped special-casing games against non-Division IA schools, a tweak I’ll be monitoring as the season progresses.
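
For the curious, here is roughly what full-precision summation buys you, sketched in Java. The post doesn't show the Rank-o-Matic's own summation code, so this is a stand-in technique with the same goal rather than the recipe it uses: every double is converted to its exact decimal value with BigDecimal, the values are added without rounding, and the total is rounded once at the end. The class and method names are made up for illustration.

```java
import java.math.BigDecimal;
import java.util.Arrays;

/** Exact summation of doubles via BigDecimal (finite inputs only). */
final class ExactSum {
    /** Adds the values exactly and rounds once when converting back to double. */
    static double sum(double... values) {
        BigDecimal total = BigDecimal.ZERO;
        for (double v : values) {
            // new BigDecimal(double) captures the exact binary value of v;
            // it throws NumberFormatException for NaN or infinities.
            total = total.add(new BigDecimal(v));
        }
        return total.doubleValue();
    }

    /** Ordinary left-to-right floating-point addition, rounding at every step. */
    static double naiveSum(double... values) {
        double total = 0.0;
        for (double v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        double[] tenths = new double[10];
        Arrays.fill(tenths, 0.1);
        System.out.println(naiveSum(tenths)); // prints 0.9999999999999999
        System.out.println(sum(tenths));      // prints 1.0
    }
}
```

When a ranking is built by adding up many small per-game values, accumulated rounding like this can end up deciding the order of two otherwise tied teams, which is presumably why it is worth avoiding.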

Big thanks again to James Howell, who keeps an awesome historical index of college football seasons, and whose current season listing I use as the source for the Rank-o-Matic. He’s also got a ranking of his own.

Questions or comments can be sent to me at jfager -at- gmail. Enjoy!

Steven Skiena rapping on combinatorial search in the Algorithm Design Manual:

[Chess] has inspired many combinatorial problems of independent interest. The combinatorial explosion was first recognized with the legend that the inventor of chess demanded as payment one grain of rice for the first square of the board, and twice as much for the (i + 1)st square than the ith square. The king was astonished to learn he had to cough up 2^65 – 1 = 36,893,488,147,419,103,231 grains of rice. In beheading the inventor, the wise king first established pruning as a technique for dealing with combinatorial explosion.
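
A quick sanity check of the arithmetic, as a sketch in Java (the class and method names are made up). It sums the doubling sequence square by square with BigInteger, since the total overflows a signed 64-bit long. Starting from a single grain, 64 squares come to 2^64 – 1; the figure in the quote matches 2^65 – 1, which is about twice that. Either way, the king is not paying.

```java
import java.math.BigInteger;

/** Adds up the grains on a chessboard where each square doubles the previous one. */
final class RiceGrains {
    /** Total grains when the first square holds `first` grains and each later square doubles. */
    static BigInteger total(long first, int squares) {
        BigInteger total = BigInteger.ZERO;
        BigInteger onSquare = BigInteger.valueOf(first);
        for (int i = 0; i < squares; i++) {
            total = total.add(onSquare);
            onSquare = onSquare.shiftLeft(1); // twice as much on the next square
        }
        return total;
    }

    public static void main(String[] args) {
        // One grain on the first square, 64 squares: 2^64 - 1
        System.out.println(total(1, 64)); // prints 18446744073709551615
        // The figure given in the quote: 2^65 - 1
        System.out.println(BigInteger.ONE.shiftLeft(65).subtract(BigInteger.ONE)); // prints 36893488147419103231
    }
}
```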

Introducing jdiscript

Charles Nutter’s recent post on browsing memory with JRuby and the Java Debugger Interface reminded me of my own little project for the JDI, jdiscript (“helping you write elegant scripts for a more civilized debugger”… I’ll work on that). I’ve spent some spare time over the last week or so dusting it off to get it from the simple handler class I’ve used for one-off tasks in the past to something more generally useful and enjoyable. It still has a ways to go, but I thought I would go ahead and share the basic features and show a quick script that hints at how powerful JDI scripting can be.

What you get from the JDI is a set of tools for working with VMs. A JDI program can start a new VM, attach to a running VM, be attached to by a VM, or be spawned from a VM when an exception occurs. Once you have a hook on one or more running VMs, you can then begin inspecting and controlling them. Inspection occurs via Mirrors representing the remote VM’s live objects, classes, threads, and the VM itself, while control can be exerted by manipulating values via a Mirror or by requesting notification of particular events and handling them how you’d like.

The JDI is very powerful, but it can be a bit clunky for writing short, quick scripts, which is where jdiscript comes in. Launching or attaching to a VM is mostly boilerplate, so jdiscript provides utility classes that take care of that for you. While the JDI allows you to easily request event notifications, the process of matching those notifications to code that will do something useful in response is left to the programmer, so jdiscript provides an event loop and a standard way to dispatch events to handlers. There are also some nice minor features in jdiscript, like redefining JDI’s EventRequest classes to be chainable. Most of this functionality is exposed through the top-level JDIScript class, which allows you to cut out a bit of code to access common operations. And of course, all of this is intended for use from a higher-level JVM language such as Groovy or JRuby.
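
To give a feel for the boilerplate being hidden, here is a sketch of attaching to a VM over a socket with the plain JDI, in Java. This is not jdiscript's implementation, just the standard connector dance as I understand it; the RawAttach class and its methods are made up for illustration, while the connector name and the "hostname" and "port" argument keys are the ones the JDK's socket attaching connector uses.

```java
import com.sun.jdi.Bootstrap;
import com.sun.jdi.VirtualMachine;
import com.sun.jdi.connect.AttachingConnector;
import com.sun.jdi.connect.Connector;
import java.util.Map;

/** Attaches to a debuggee JVM over a socket using only the JDI. */
final class RawAttach {
    /** Finds the JDK's socket attaching connector by name. */
    static AttachingConnector socketConnector() {
        for (AttachingConnector c : Bootstrap.virtualMachineManager().attachingConnectors()) {
            if (c.name().equals("com.sun.jdi.SocketAttach")) {
                return c;
            }
        }
        throw new IllegalStateException("No socket attaching connector available");
    }

    /** Attaches to a JVM that is listening for a debugger on host:port. */
    static VirtualMachine attach(String host, int port) throws Exception {
        AttachingConnector connector = socketConnector();
        Map<String, Connector.Argument> args = connector.defaultArguments();
        args.get("hostname").setValue(host);
        args.get("port").setValue(Integer.toString(port));
        return connector.attach(args);
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            // No port given: just show which arguments the connector expects.
            System.out.println(socketConnector().defaultArguments().keySet());
            return;
        }
        VirtualMachine vm = attach("localhost", Integer.parseInt(args[0]));
        System.out.println("Attached to " + vm.name());
        vm.dispose(); // detach and let the target VM keep running
    }
}
```

And that only gets you a VirtualMachine. Pulling events off its queue and routing them to the right handler is still left to you, which is the other half of what jdiscript takes care of.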

So let’s get started with an example. One of the more kickass events that we can track with the JDI is the MonitorContendedEnterEvent, which fires whenever a thread tries to enter a monitor that’s already held by another thread. Uncontended locks in Java are basically free these days, but contention can still be a performance killer in multithreaded apps, so it’s nice to be able to find these spots without a lot of headache so we can eliminate them as far as possible.

Here’s a Groovy script that prints out a stack trace for any contending threads, and prints another notice when the thread ends up entering the monitor:

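package org.jdiscript

import org.jdiscript.handlers.*
import org.jdiscript.util.VMSocketAttacher
import com.sun.jdi.*

// Attach to a JVM that is listening for a debugger on port 12345
VirtualMachine vm = new VMSocketAttacher(12345).attach()
JDIScript j = new JDIScript(vm)

// Fires when a thread tries to enter a monitor held by another thread
j.monitorContendedEnterRequest({
	long timestamp = System.currentTimeMillis()
	println "${timestamp}: Contended enter for ${it.monitor()} by ${it.thread()}"
	it.thread().frames().each { println "   " + it }
} as OnMonitorContendedEnter).enable()

// Fires when the waiting thread finally gets into the monitor
j.monitorContendedEnteredRequest({
	long timestamp = System.currentTimeMillis()
	println "${timestamp}: Contended entered for ${it.monitor()} by ${it.thread()}"
} as OnMonitorContendedEntered).enable()

// Start the event loop, with a handler for the VM start event
j.run({ println "Got StartEvent" } as OnVMStart)

println "Shutting down"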

Let’s fire this up against a Cassandra instance to see it in action. I picked Cassandra because it’s a production-quality high-concurrency project that’s still dirt easy to get up and running from source, and it already includes a Python script for stress testing (you can grab a copy of the 0.6.4 release, the version used for this post, to try it for yourself).

To start, copy bin/cassandra to bin/cassandra-dbg and add ‘-agentlib:jdwp=transport=dt_socket,server=y,address=12345,suspend=y’ to the exec command that starts the server, so that the JVM will wait for the debugging script to attach before it starts. Once it’s up and waiting, fire off the Groovy script. You should see the start event print out, followed by a bunch of stack traces. Most of these are just ReferenceQueue activity, and any others we don’t really care about right now (it’s just startup, so as long as there’s no deadlocking, we’re probably fine with a bit more contention).

Once Cassandra is up and stable, fire up contrib/py_stress/ (you’ll need to have Thrift installed and the Cassandra Python bindings generated). Here’s what comes up for my run:

1281044297880: Contended enter for instance of by instance of java.lang.Thread(name='ROW-MUTATION-STAGE:11', id=1784)
    in thread instance of java.lang.Thread(name='ROW-MUTATION-STAGE:11', id=1784)
    in thread instance of java.lang.Thread(name='ROW-MUTATION-STAGE:11', id=1784)
    in thread instance of java.lang.Thread(name='ROW-MUTATION-STAGE:11', id=1784)
    in thread instance of java.lang.Thread(name='ROW-MUTATION-STAGE:11', id=1784)
    in thread instance of java.lang.Thread(name='ROW-MUTATION-STAGE:11', id=1784)
   org.apache.cassandra.utils.FBUtilities:239 in thread instance of java.lang.Thread(name='ROW-MUTATION-STAGE:11', id=1784)
   org.apache.cassandra.utils.FBUtilities:229 in thread instance of java.lang.Thread(name='ROW-MUTATION-STAGE:11', id=1784)
   org.apache.cassandra.dht.RandomPartitioner:118 in thread instance of java.lang.Thread(name='ROW-MUTATION-STAGE:11', id=1784)
   org.apache.cassandra.dht.RandomPartitioner:44 in thread instance of java.lang.Thread(name='ROW-MUTATION-STAGE:11', id=1784)
   org.apache.cassandra.db.Memtable:124 in thread instance of java.lang.Thread(name='ROW-MUTATION-STAGE:11', id=1784)
   org.apache.cassandra.db.Memtable:116 in thread instance of java.lang.Thread(name='ROW-MUTATION-STAGE:11', id=1784)
   org.apache.cassandra.db.ColumnFamilyStore:434 in thread instance of java.lang.Thread(name='ROW-MUTATION-STAGE:11', id=1784)
   org.apache.cassandra.db.Table:407 in thread instance of java.lang.Thread(name='ROW-MUTATION-STAGE:11', id=1784)
   org.apache.cassandra.db.RowMutation:200 in thread instance of java.lang.Thread(name='ROW-MUTATION-STAGE:11', id=1784)
   org.apache.cassandra.service.StorageProxy$2:138 in thread instance of java.lang.Thread(name='ROW-MUTATION-STAGE:11', id=1784)
   org.apache.cassandra.utils.WrappedRunnable:30 in thread instance of java.lang.Thread(name='ROW-MUTATION-STAGE:11', id=1784)
   java.util.concurrent.ThreadPoolExecutor$Worker:886 in thread instance of java.lang.Thread(name='ROW-MUTATION-STAGE:11', id=1784)
   java.util.concurrent.ThreadPoolExecutor$Worker:908 in thread instance of java.lang.Thread(name='ROW-MUTATION-STAGE:11', id=1784)
   java.lang.Thread:637 in thread instance of java.lang.Thread(name='ROW-MUTATION-STAGE:11', id=1784)
(Subsequent identical traces elided)
1281044297896: Contended enter for instance of by instance of java.lang.Thread(name='ROW-MUTATION-STAGE:13', id=1785)
1281044297903: Contended enter for instance of by instance of java.lang.Thread(name='ROW-MUTATION-STAGE:12', id=1786)
1281044297912: Contended enter for instance of by instance of java.lang.Thread(name='ROW-MUTATION-STAGE:15', id=1787)
1281044297927: Contended enter for instance of by instance of java.lang.Thread(name='ROW-MUTATION-STAGE:16', id=1788)
1281044297935: Contended enter for instance of by instance of java.lang.Thread(name='ROW-MUTATION-STAGE:17', id=1789)
1281044297940: Contended enter for instance of by instance of java.lang.Thread(name='ROW-MUTATION-STAGE:14', id=1790)
1281044297945: Contended enter for instance of by instance of java.lang.Thread(name='ROW-MUTATION-STAGE:18', id=1791)

Pretty quickly, 8 threads back up waiting on a synchronized method deep down in the creation of a new MessageDigest which will be used to get an md5 hash of the row key. This is pretty easy to “fix” (though honestly it probably never caused anyone a problem): just preallocate some md5 MessageDigest instances and stuff them in threadlocals.
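
Here’s a minimal sketch of that fix in Java. It is not Cassandra’s actual code; the Md5Hasher class is hypothetical, and it uses ThreadLocal.withInitial, which arrived in a later Java version than the one this post was written against. Rather than preallocating up front, each thread lazily creates its own MessageDigest on first use and reuses it afterwards, so the synchronized lookup happens at most once per thread instead of once per row key.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Per-thread MD5 digests, avoiding a contended MessageDigest lookup on every hash. */
final class Md5Hasher {
    // One MessageDigest per thread, created on that thread's first call to md5().
    private static final ThreadLocal<MessageDigest> MD5 = ThreadLocal.withInitial(() -> {
        try {
            return MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    });

    /** Hashes the data with the calling thread's own digest instance. */
    static byte[] md5(byte[] data) {
        MessageDigest digest = MD5.get();
        digest.reset(); // clear any state left over from an earlier, interrupted use
        return digest.digest(data);
    }

    /** Formats a 16-byte digest as 32 lowercase hex characters. */
    static String hex(byte[] bytes) {
        return String.format("%032x", new BigInteger(1, bytes));
    }

    public static void main(String[] args) {
        byte[] key = "abc".getBytes(StandardCharsets.UTF_8);
        System.out.println(hex(md5(key))); // prints 900150983cd24fb0d6963f7d28e17f72
    }
}
```

MessageDigest instances aren’t safe to share between threads, which is why the instance lives in a ThreadLocal rather than a plain static field.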

So there we go. A few lines of code, and we found an avoidable contention point without opening a profiler or even knowing anything ahead of time about the underlying codebase. Awesome.

I’ll post more on troubleshooting with JDI in the future, including more examples. In the meantime, I’d love to get some feedback on jdiscript; feel free to email me (jfager at gmail), create an issue on GitHub, or join jdiscript’s new group. I’ve also thrown up a wiki page for collecting jdiscripts, so if you put together something useful, please share it.
