On alerts.

I’ve screwed up.

Here’s one of my inboxes on an average morning:

That’s a mistake on my part, but from what I’m starting to gather, it’s common to err on the side of “too much information” rather than “too little information”. That’s what I did, and that’s what I’m going to have to fix.

Especially for early-stage startups, you want as much information as possible. User-facing analytics like Google Analytics and web server logs only tell you part of the story.

I want and need better detail on how user traffic maps to server load, what the average time on site looks like, and what kind of average response times we’re seeing. And that’s just the ops side of things. On the product side, I want even more insight into how people are using my product.

Here’s the mistake I made: I’ve got both actionable alerts and metric information coming into the same inbox. Oops.

Moving forward, I’m going to change how I do alerting and information gathering.

Actionable alerts from things like Munin, Nagios, or Airbrake, which tell me things like “a server is down, on fire, or otherwise indisposed”, will still come to email.
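To make the split concrete, here’s a rough sketch of the routing rule I have in mind. Everything in it is made up for illustration: the `Event` shape, the severity names in `ACTIONABLE`, and the two sink callbacks are my assumptions, not anything Munin, Nagios, or Airbrake actually hand you.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical severity names; real tools each use their own vocabulary.
ACTIONABLE = {"critical", "down", "error"}


@dataclass
class Event:
    source: str    # which tool or service produced the event
    severity: str  # free-form severity label
    message: str


def is_actionable(event: Event) -> bool:
    """True if a human needs to do something about this event right now."""
    return event.severity.lower() in ACTIONABLE


def route(event: Event,
          send_email: Callable[[Event], None],
          store_metric: Callable[[Event], None]) -> str:
    """Send actionable events to email and everything else to the metrics store."""
    if is_actionable(event):
        send_email(event)
        return "email"
    store_metric(event)
    return "metrics"


# Stand-in sinks: plain lists instead of a real mailer or metrics backend.
inbox, metrics = [], []
route(Event("nagios", "critical", "web01 is down"), inbox.append, metrics.append)
route(Event("app", "info", "avg response time 120ms"), inbox.append, metrics.append)
```

With this rule only the “web01 is down” event lands in `inbox`; the response-time sample goes to `metrics`, where it can wait until I actually want to look at it.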

Information gathering is going to go somewhere else. I haven’t quite locked this down yet, but I might wind up writing something custom. I’m looking for something centralized that I can run reports and queries against, and that’s fast enough on the client side that it won’t slow anything down. Graylog2 is currently leading the pack (thanks to bear454 for the recommendation!), but I’m open to alternatives.
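If I do go with Graylog2, the client side could be as small as the sketch below: build a GELF message (a JSON document) and fire it over UDP, so the app never waits on the logging server. Treat the details as assumptions on my part: the `build_gelf` and `send_gelf` helpers are hypothetical names, the server address assumes a GELF UDP input on the default port 12201 on localhost, and the `"1.1"` version string may need to change to match whatever your Graylog2 install expects.

```python
import json
import socket
import time
import zlib


def build_gelf(short_message, host, level=6, **fields):
    """Build a GELF payload as a dict. Custom fields get a leading underscore."""
    payload = {
        "version": "1.1",          # assumption: adjust to your server's GELF version
        "host": host,
        "short_message": short_message,
        "timestamp": time.time(),
        "level": level,            # syslog-style level, 6 = informational
    }
    for key, value in fields.items():
        if key == "id":
            # "_id" is reserved in GELF, so refuse it rather than send it.
            raise ValueError("'id' is not allowed as a custom field")
        payload["_" + key] = value
    return payload


def send_gelf(payload, server=("127.0.0.1", 12201)):
    """Fire-and-forget: compress the JSON and send it as one UDP datagram."""
    data = zlib.compress(json.dumps(payload).encode("utf-8"))
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(data, server)
    finally:
        sock.close()


# Example metric sample; call send_gelf(sample) once an input is listening.
sample = build_gelf("avg response time", "web01", response_ms=120)
```

UDP is the point here: a dropped metric now and then is fine, while a page that blocks on logging is not. This sketch doesn’t chunk large messages, so it only suits small payloads that fit in a single datagram.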