<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

  <title><![CDATA[blog dot lusis]]></title>
  <link href="http://lusis.github.com/atom.xml" rel="self"/>
  <link href="http://lusis.github.com/"/>
  <updated>2013-05-13T22:59:21-04:00</updated>
  <id>http://lusis.github.com/</id>
  <author>
    <name><![CDATA[John E. Vincent]]></name>
    
  </author>
  <generator uri="http://octopress.org/">Octopress</generator>

  
  <entry>
    <title type="html"><![CDATA[Smart Clients]]></title>
    <link href="http://lusis.github.com/blog/2013/05/13/smart-clients/"/>
    <updated>2013-05-13T21:17:00-04:00</updated>
    <id>http://lusis.github.com/blog/2013/05/13/smart-clients</id>
    <content type="html"><![CDATA[<p>Currently <a href="http://ricon.io/east.html">RICON|EAST</a> is going on in NYC. <a href="https://twitter.com/tsantero">Tom Santero</a> and the whole Basho crew is doing an awesome job if the content available via the live stream and twitters is to be believed.</p>

<!--more-->


<p><em>(please note this is my first blog entry post-D. As such and because I&#8217;ve yet to talk to any legal folks, I should state that this does not represent any opinion or policy of Dell)</em></p>

<p>One thing that caught my eye/ear when I could listen/watch was an excellent presentation by <a href="https://twitter.com/seancribbs">Sean Cribbs</a>. Sean holds a special place in my hero worship pantheon for a few reasons:</p>

<ul>
<li>I first heard about Riak on one of the first episodes of the ChangeLog show (along with <a href="https://twitter.com/argv0">Andy Gross</a>). It and the whole NoSQL thing made sense then</li>
<li>Sean is a pretty fucking down to earth person. He graciously drove down to one of our local meetups. Uber friendly and an awesome advocate for Basho and Riak.</li>
<li>He&#8217;s a really awesome presenter and if you&#8217;ve never had the privilege of seeing him (live stream or in person), he rocks the mic.</li>
</ul>


<p>So in Sean&#8217;s presentation he&#8217;s talking about some changes to the Ruby client library for Riak. Many of the changes make the Ruby library a proper smart client. Read <a href="https://github.com/basho/riak-ruby-client/wiki/Connecting-to-Riak">this wiki</a> entry under <strong>Connecting to Clusters</strong> for some of the features. It&#8217;s awesome (especially the transport-related failure handling).</p>

<h1>Client libraries in general</h1>

<p>I want to say a bit about client libraries. Regardless of what they talk to (though I&#8217;ll be talking specifically about database client libraries), this is something many companies get wrong.</p>

<p>Everyone knows I&#8217;m not the biggest MongoDB/10Gen fan in the world. I won&#8217;t go into detail about the technical reasons behind that. Many others have done a much more eloquent dive into that topic.
As much as it&#8217;s easy to make fun of MongoDB as being a marketing-driven database, they did get one thing right. They owned their client driver availability. Not only did they own and maintain all the drivers but they largely had the same API across the various languages.</p>

<p>Then again, they had to. Other databases/applications offer a REST-ish interface over HTTP (or plain-text interface like Redis) so they can punt a bit. Got a libcurl port for your language? You&#8217;re set. MongoDB has its own protocol and that god-forsaken BSON shitshow.</p>

<p>One of the benefits, however, of a plain-text or HTTP-based protocol is that it&#8217;s a pattern we can grok as operators and developers. We load balance webservers. We speak to third-party APIs. It&#8217;s not the most EFFICIENT but it&#8217;s a known quantity. It&#8217;s also, as I said, REALLY fucking easy to add support to your language of choice. No need to FFI some C library or make a binary extension. Any language worth its salt has HTTP client support in stdlib (even if it&#8217;s as big a pile of dog squeeze as net/http). Again, most languages also have libcurl support for something better.</p>

<h1>Back to smart clients</h1>

<p>I largely dislike applications that require smart clients to get the full benefit. As an operations person, I&#8217;m USUALLY using a dynamic language like python, ruby or perl to access the system as opposed to directly from the application. This was my biggest gripe with ZooKeeper (as I&#8217;ve said many times in the past). It&#8217;s also been one of my points of contention with Datomic. If you aren&#8217;t on a JVM language, you&#8217;re shit out of luck for now. Yes, JRuby makes this billions of times easier for those of us using Ruby but Jython is still not where it needs to be for modern Python.</p>

<p>I also had this problem with Voldemort. Disclaimer: this is 2-3 year old data from running Voldemort in production. AFAIK, it&#8217;s still the case. For the sake of this discussion, we&#8217;re going to ignore data opacity. At the time, the only way to fully access the data in and maintain a Voldemort cluster was from the JVM. I ended up writing quite a bit of JRuby wrapper around StoreClient just to see the data we had in Voldemort.</p>

<p>Riak (and as another example, ElasticSearch) is nice in this regard. It&#8217;s HTTP. I can curl it from a shell script. I can use the Ruby library Basho is maintaining. If I&#8217;m using a language without &#8216;official&#8217; support, I can write my own. All the metadata is largely attached to http headers and even monitoring is done via the <code>/ping</code> and <code>/stats</code> urls. Something I didn&#8217;t realize until today (thanks to <a href="https://twitter.com/b6n">Benjamin Black</a>) is that the stats interface actually exposes stuff I had previously glossed over including cluster topology. This is where the meat of the discussion on twitter today happened.</p>
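<p>To make the &#8220;it&#8217;s just HTTP&#8221; point concrete, here is a minimal Python sketch using only the standard library. The <code>/ping</code> and <code>/stats</code> paths are the ones mentioned above; the <code>ring_members</code> key and the function names are assumptions for illustration, so check what your own cluster actually returns.</p>

```python
import json
import urllib.request


def riak_is_up(base_url, timeout=2.0):
    """Return True if GET {base_url}/ping answers with HTTP 200."""
    try:
        with urllib.request.urlopen(base_url + "/ping", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers connection refused, timeouts and HTTP errors
        return False


def fetch_stats(base_url, timeout=2.0):
    """Fetch and decode the JSON document served at {base_url}/stats."""
    with urllib.request.urlopen(base_url + "/stats", timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def cluster_members(stats):
    """Pull ring membership out of a decoded /stats document.

    "ring_members" is an assumed key name; verify it against real output.
    """
    return sorted(stats.get("ring_members", []))
```

<p>Something like <code>cluster_members(fetch_stats(base_url))</code> is roughly what a topology-aware client would build on, and the same two URLs are all a monitoring check needs.</p>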

<h1>Operational Happiness</h1>

<p>My original statements on this discussion related to using haproxy in front of your Riak cluster. There are several reasons I prefer this but first, a quick sidebar.</p>

<h2>Seed Nodes</h2>

<p>I&#8217;ve had some minor operational experience with Cassandra (nothing to write home about) but one of the things that always bothered me was the idea of &#8216;seed nodes&#8217;. Let me be clear that Cassandra and Riak are pretty much the only two datastores I&#8217;d feel comfortable using these days (with the nod going to Riak) in any sort of scalable environment. Postgres has earned its way back into my graces but MySQL can <em>insert Louis C.K. euphemism here</em>.</p>

<p>My problem with seed nodes is the idea that I have special nodes. These nodes have to be hard-coded in a config file somewhere or discovered by some other method. I could store them as DNS lookups but now I&#8217;ve got to deal with TTLs on DNS. And I&#8217;ve got to deal with the fact that DNS doesn&#8217;t actually care if the host I&#8217;ve been given is actually alive or not.</p>

<p>I could store this information in ZooKeeper but what if I don&#8217;t actually have native ZK support in that database? I&#8217;ve got to write something that populates ZK when a new node is available and it&#8217;s not actually a live check. I still have to test that host first. Yes you should do that anyway but it&#8217;s a valid point.</p>

<p>So if I&#8217;m storing seed nodes as DNS names in a config file, I can never change those names without either rolling out new code or configs. That might require a restart somewhere. If I&#8217;m clever, I could probably make that an administrative hook in my application (think JMX) where I can fiddle the seed list. I can poll a config file for changes. I can do a lot of things but none of them are &#8220;optimal&#8221; to me.</p>
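<p>The &#8220;I still have to test that host first&#8221; problem looks roughly like this in Python. This is only a sketch: the function names are made up for illustration, and a plain TCP connect is the crudest possible live check, not something any particular datastore prescribes.</p>

```python
import socket


def tcp_alive(host, port, timeout=1.0):
    """Live check: True only if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # DNS failure, refused connection, timeout
        return False


def live_seeds(candidates, probe=tcp_alive):
    """Keep only the (host, port) pairs that pass the probe, in order."""
    return [(host, port) for host, port in candidates if probe(host, port)]
```

<p>Wherever the candidate list comes from (config file, DNS, ZooKeeper), every client ends up carrying some version of this filtering around with it.</p>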

<h2>Back to haproxy</h2>

<p>My preferred method of using Riak is to stick everything behind haproxy. There are several reasons for this but here are a few (note we&#8217;re going to assume use of CM at this point):</p>

<ul>
<li>Operationally, haproxy nodes are easier to manage than application configs (depending on the application).</li>
<li>Many times, at different companies, we&#8217;ve had to roll our own layer on top of a client library (or even write our own due to licensing issues). Load-balancing is not necessarily a core application developer competency (and it&#8217;s bitten me in the ass before).</li>
<li>From an operational perspective, I can bring nodes in and out of service in haproxy for maintenance without needing to inform the client or have it waste cycles with stale node detection. It simply will never talk to an out of service backend.</li>
</ul>


<p>Basically the haproxy crew has already done all the work in load balancing HTTP connections intelligently and they&#8217;re pretty damn good at it. I love my developers and I know the folks writing the various libraries I use are smart people but again it goes back to core competency. Note that Basho gets a nod here, however, in that I&#8217;m pretty sure that, what with writing webmachine, they grok http pretty well.</p>
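<p>Since we&#8217;re assuming CM anyway, bringing nodes in and out of service boils down to re-rendering a backend stanza. Here is a hedged Python sketch of that rendering step; the backend name, node names and the function itself are hypothetical, and port 8098 assumes Riak&#8217;s default HTTP listener.</p>

```python
def render_riak_backend(nodes, out_of_service=()):
    """Render an haproxy backend stanza for a list of (name, host) pairs.

    Nodes named in out_of_service get haproxy's "disabled" keyword, so
    clients are never routed to them during maintenance.
    """
    lines = [
        "backend riak_http",
        "    balance roundrobin",
        "    option httpchk GET /ping",  # health check against Riak's ping URL
    ]
    for name, host in nodes:
        line = "    server {} {}:8098 check".format(name, host)
        if name in out_of_service:
            line += " disabled"
        lines.append(line)
    return "\n".join(lines) + "\n"
```

<p>In practice this would be a template in your CM tool of choice rather than a Python function, but the shape is the same: the application only ever knows about haproxy, and the node list lives in one place.</p>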

<p>I&#8217;ve even gone the approach of using haproxy on every system that needs to connect to some backend locally. This totally eliminates the idea of fixed points in the network for connections (at the expense of having to deal with a bit of drift between CM runs). If I wanted to, I could even make multiple riak clusters transparent behind haproxy (though I can only think of a few REALLY specific use cases for that).</p>

<h2>Trade offs</h2>

<p>Yes there are trade-offs to this approach. As Ben pointed out, the Riak stats interface is really powerful from a topology perspective. A REALLY smart riak client can discover data layout and make deeply intelligent decisions on that.</p>

<p>With other applications like ElasticSearch I can actually become a full fledged cluster member as a non-data node and actually offload some of the work of the cluster for scatter/gather type operations using the Java library.</p>

<p>With haproxy, I don&#8217;t get those types of benefits.</p>

<h2>What I do get</h2>

<p>Outside of the maintainability of haproxy (which is, again, subjective) I get one benefit I <strong>CAN&#8217;T</strong> get with smart clients that is, unfortunately, necessary with many customers - a more narrow network allowance.</p>

<p>Enterprises are interesting beasts and still think in terms of traditional tiered application stacks. Nothing wrong with that and it does have SOME security benefits but many deployments I&#8217;ve been involved with have official policies that &#8216;data tiers&#8217; (whatever the hell those are) must be protected in a dedicated network behind an additional firewall. So here&#8217;s Riak, a &#8216;data tier&#8217;, that we have to have with as small an ingress as possible. That rules out smart clients of the topology-aware variety. So we stick haproxy in the mix. We tell them to use a load balancer. Some folks use an internal F5 HA pair with some sort of VRRP. We&#8217;ll set up haproxy + keepalived or some other combination depending.</p>

<h1>TMTOWTDI</h1>

<p>One of the things that Riak allows is this level of flexibility. I can use HTTP and haproxy. I can use HTTP and a smart client. I can use protobufs and haproxy or a smart client. It&#8217;s really that flexible. I happen to prefer the haproxy approach for reasons I&#8217;ve already mentioned but I totally grok that some folks want a more intelligent client approach. Some folks would argue that there&#8217;s a right way and a wrong way but I don&#8217;t see it like that. What I see is a datastore that, just like it lets me control the consistency levels I want, lets me control HOW I access that data.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Future of Noah]]></title>
    <link href="http://lusis.github.com/blog/2013/01/20/future-of-noah/"/>
    <updated>2013-01-20T21:15:00-05:00</updated>
    <id>http://lusis.github.com/blog/2013/01/20/future-of-noah</id>
    <content type="html"><![CDATA[<p>This is probably the most difficult blog post I&#8217;ve had to write. What&#8217;s worse is I&#8217;ve been sitting on it for months.</p>

<!-- more -->


<p>When I started Noah a few years ago, I had a head full of steam. I had some grand ideas but was trying to keep things realistic. I simply wanted a simple REST-ish interface for stashing nuggets of information between systems and a flexible way to notify interested parties when that information changed.</p>

<p>It started as a <a href="https://raw.github.com/lusis/Noah/8a2e193c043ab30cce17d7ada25ef33b72baa73e/doc/noah-mindmap-original.png">mindmap</a> I made lying in bed one night. It was my first serious project and I had no idea what I was getting into. If you&#8217;re curious, you can read quite a bit of my initial braindumps on the <a href="https://github.com/lusis/Noah/wiki">wiki under &#8216;General Thoughts&#8217;</a>. I watched every day as more and more people started following the project.</p>

<p>It was a game changer for me in many ways. Working on Noah was fun and it was rewarding in more ways than one. But real life gets in the way sometimes.</p>

<h1>On stewardship</h1>

<p>One of the things I&#8217;ve learned over the past few years is that for opensource to REALLY thrive, it can&#8217;t be a one-person show. I&#8217;ve been involved with opensource for most of my 17+ year career. You&#8217;d think I would have learned that lesson before now.</p>

<p>Stewardship is a hard thing. Our arrogance and pride make us want to keep things close to our chest.</p>

<ul>
<li>&#8220;I just want to get to a 1.0 release&#8221;</li>
<li>&#8220;Things are too in flux right now. It wouldn&#8217;t be fair to bring others in&#8221;</li>
<li>&#8220;I don&#8217;t quite trust anyone else with it yet&#8221;</li>
<li>&#8220;Let me just get this ONE part of the API in place first..&#8221;</li>
</ul>


<p>These are all things I said to myself.</p>

<p>What really changed my mind was a few things. Being involved in the Padrino project. Seeing the Fog community grow after Wesley started allowing committers. Seeing Jordan trust me enough to make me a logstash committer before his daughter was born. The biggest trigger was actually one of my own projects - the chef logstash cookbook.</p>

<p>Bryan Berry (FSM bless him) pestered the hell out of me about getting some changes merged in. He was making necessary changes and fixes. He was evolving it to make it more flexible beyond my own use case. I don&#8217;t recall if he asked to be a committer but I gave it to him. The pull request queue drained and he added more than I ever had time for. Not long after, I added Chris Lundquist. Those two have been running it since then really.</p>

<p>I think back to when I got added to the committers for Padrino. It was a rush. It was amazing and scary. Above all it was the encouragement I needed. How dare I deny someone else that same opportunity.</p>

<p>Making that first pull request is hard. To have it accepted is a feeling I&#8217;ll keep with me for a long time. I can only hope that some project I create some day will give someone that same confidence and feeling.</p>

<h1>So what about Noah</h1>

<p>Noah is in the same place Logstash was. I&#8217;m not using it and that&#8217;s really hurting it more than anything. It&#8217;s time to let someone who IS using it take control. I care too much about it to watch it die on the vine. I still believe in what it was designed to do and every single day I get emails asking me if it&#8217;s still alive because it&#8217;s a perfect fit for what someone needs. The same stuff is STILL coming up on various mailing lists and Noah is a perfect fit. There are companies actively using it even in its current unloved state. Those folks have a vested interest in it.</p>

<p>When I added Chris and Bryan to the cookbook, I sent them an email with what my vision was for the cookbook. I can&#8217;t find that email now but I recall it only had two real requirements:</p>

<ul>
<li>Out of the box, it would work on a single system with no additional configuration (i.e. add the cookbook to a run_list and logstash would work automatically)</li>
<li>A user never had to modify the cookbook to change anything related to roles (i.e. allow the attributes to drive search for discovering your indexer - hence all the role stuff in the attrs now)</li>
</ul>


<p>I need to do the same thing for Noah and see where it leads.</p>

<h1>Dat list</h1>

<p>This list isn&#8217;t comprehensive but I think it hits the key points.</p>

<h2>Simple</h2>

<p>Noah should be simple to interact with. It was born out of frustration with trying to interact with ZooKeeper. Nothing is more simple than being able to use <code>curl</code> IMHO. I can use Noah in shell scripts and I can use it in Java (we had a Spring Configurator at VA that talked to Noah. It was awesome). You should always be able to use <code>curl</code> to interact with Noah. I wish I could find it now but someone once brought up Noah on the ZK mailing list. This led to various rants about how it didn&#8217;t do consensus and a bunch of other stuff that ZK did. One of the Yahoo guys (I wish I could remember who) said something in favor of Noah that stuck with me:</p>

<p><em>Interfaces matter</em></p>

<p>I know I&#8217;m on the right track here because Rackspace just built a product that provides an HTTP interface to ZK. Oh and it does callbacks.</p>

<h2>Friendly to both sysadmins and developers</h2>

<p>Simplicity plays into this but I wanted Noah to be the tool that solved some friction between the people who write the code and the people who run the code. Configuration is all over the place in any modern stack. Configuration management has come into its own. People are using it but you still see disconnects. Where should this config be maintained? What&#8217;s the best way to have puppet track changes to application configuration? I can&#8217;t get my developers to update the ERB templates in the Chef cookbook. All of these things are where Noah is helpful.</p>

<p>I still stand by the statement that <a href="http://lusislog.blogspot.com/2011/03/ad-hoc-configuration-coordination-and.html">not all configuration is equal</a>. Volatility is a thing and it doesn&#8217;t have to mean the end of all the effort in moving to a CM tool. I wanted to remove that friction point.</p>

<p>I was also immensely inspired by Kelsey Hightower here. I&#8217;ve told the story several times of how Kelsey got so frustrated that the developers wouldn&#8217;t cooperate with us on Puppet and config files for our applications that he learned enough Java to write a library for looking up information in Cobbler. Cobbler has an XMLRPC api and that was simple enough that he could port his python skills to java and write the fucking library himself. I wanted Noah to be friendly enough that a sysadmin could do what Kelsey did.</p>

<h2>Watches and Callbacks</h2>

<p>I&#8217;ve said this before but one of the most awesome things that ZK has is watches. They have pitfalls (reregister your watches after they fire for instance) but they&#8217;re awesome. Noah&#8217;s callback system is the thing that needs the most love (it works but the plugin API was never finalized). It&#8217;s also one of the most powerful parts that meets the needs of folks that I see posting on various mailing lists.</p>

<p>The idea is simple. When something changes in Noah, you should be able to fire off a message however the end-user wants to get it. I think this is one of the reasons I love working on Logstash so much. Writing plugins is so simple and it&#8217;s the gateway drug to anyone who wants to contribute to logstash.</p>
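<p>As a toy illustration of that idea (and explicitly not Noah&#8217;s actual API), here is a minimal in-memory Python sketch: callbacks are registered against a path and fired whenever the stored value changes. The class and method names are hypothetical. Unlike ZK watches, these registrations persist after firing, so there is nothing to reregister.</p>

```python
from collections import defaultdict


class WatchRegistry:
    """Toy in-memory registry: paths map to callbacks fired on change."""

    def __init__(self):
        self._watches = defaultdict(list)
        self._data = {}

    def watch(self, path, callback):
        """Register callback(path, old, new) for changes to path."""
        self._watches[path].append(callback)

    def put(self, path, value):
        """Store value and notify watchers only if it actually changed."""
        old = self._data.get(path)
        if old == value:
            return
        self._data[path] = value
        for callback in self._watches[path]:
            callback(path, old, value)
```

<p>In a real system each callback would be a plugin that delivers the change however the end-user wants it (HTTP POST, message queue, email), which is exactly the part that needs a stable plugin API.</p>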

<h1>Things I don&#8217;t care about</h1>

<p>What don&#8217;t I care about?</p>

<h2>Language</h2>

<p>I don&#8217;t care about the language it&#8217;s written in. If someone wants to take it and convert it to Python or Erlang or Clojure, be my guest. I just want the ideas to live on somewhere. In fact, I&#8217;ve rewritten various parts of Noah over the last year privately. Not just experimenting with moving from EM to Celluloid but as a CherryPy app, in Clojure and I even started an Erlang attempt (except that I know almost NO Erlang so it didn&#8217;t get very far).</p>

<h2>Name</h2>

<p>Honestly I don&#8217;t even care about the name. Yeah it&#8217;s witty and fits with the idea of ZooKeeper but I have no qualms about adding a link to your project from the Noah readme and recommending people use it instead.</p>

<h2>Paxos/ZAB</h2>

<p>This was never a requirement for Noah. Noah was specifically designed for certain types of information. If you need that, use the right tool.</p>

<h2>Persistence</h2>

<p>Let&#8217;s be honest. From a simplicity standpoint, it doesn&#8217;t get much simpler than Redis. It&#8217;s one of the reasons we changed the default logstash tutorial to use Redis instead of RabbitMQ. I know Redis reinvents a lot of wheels that have already been solved but it, along with ElasticSearch, is one of the lowest friction bits of software I&#8217;ve dealt with in a long time. Not having external dependencies is a godsend for getting started.</p>

<p>However I&#8217;ve also got small experiments privately where I used ZMQ internally and sqlite. I&#8217;ve written a git-based persistence for it too.</p>

<p>Riak is also a great fit for Noah and takes care of the availability issue on the persistence side. More on Riak in a sec.</p>

<h1>So that&#8217;s it</h1>

<p>That&#8217;s really all that matters. If you want to take ownership of the project, contact me. Let me know and we&#8217;ll talk. Who knows. Maybe I&#8217;m overestimating the level of interest. Maybe ZK isn&#8217;t as unapproachable to people anymore. The language bindings have certainly gotten much better. I just want the project to be useful to folks and I&#8217;m getting in the way of that.</p>

<h1>What are the other options?</h1>

<p>I don&#8217;t know of many other options out there. Doozer is picking up steam again as I understand it and it has a much smaller footprint than ZK does. There was a python project that did a subset of Noah but I can&#8217;t find it now.</p>

<p>One thing that is worth considering is a project that I found earlier today - <a href="https://github.com/cocagne/zpax">zpax</a>. While this is just a framework experiment of sorts, it could inspire you to add your own frontend to it. The same author is also working on DTLS on top of ZMQ.</p>

<p>I&#8217;ve thought about ways I could actually do this with Logstash plugins. It&#8217;s doable but not really feasible without making Logstash do something it isn&#8217;t shaped for.</p>

<p>Another idea that I&#8217;m actually toying around with is simply using Riak plus a ZeroMQ post-commit hook so that plugins could be written in a simpler way. <a href="https://github.com/seancribbs/riak_zmq">Sean Cribbs already took the idea and made a POC 2 years ago</a> based on a gist from Cody Soyland. You wouldn&#8217;t have the same API up front as Noah but you could stub that out in some framework and also have it be the recipient of the ZMQ publishes.</p>

<p>Finally you could just use ZooKeeper. Yes it has MUCH greater overhead but you DO get a lot more bang for the buck. There really isn&#8217;t anything in the opensource world right now that compares. It also provides additional features that I never really cared about or needed in Noah.</p>

<h1>Wrap up</h1>

<p>I&#8217;m not done in this space. I don&#8217;t know where I&#8217;m going next with it. Maybe I&#8217;ll start from scratch with a much simpler API. Maybe I&#8217;ll just run with the Riak idea.</p>

<p>I just want to give a shoutout to the countless people who helped me evangelize Noah over the last few years. It was recommended on mailing lists, twitter and many other places. It meant a lot to me and I only hope that someone will take up the mantle and make it something you would recommend again.</p>

<p>For those of you still using Noah, I hope we can find a home for it so that it can continue to provide value to you.</p>

<p>Thanks.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[How we vagrant]]></title>
    <link href="http://lusis.github.com/blog/2012/12/17/how-we-vagrant/"/>
    <updated>2012-12-17T22:10:00-05:00</updated>
    <id>http://lusis.github.com/blog/2012/12/17/how-we-vagrant</id>
    <content type="html"><![CDATA[<p>People may or may not have noticed but I&#8217;ve been largely offline for the past 4 weeks or so. This is because I&#8217;ve been in the middle of a pretty heavy redesign of a few key parts of our application stack. This also required me to learn Java so I&#8217;ve been doubly slammed.</p>

<p>As part of that redesign, I worked on what we lovingly refer to internally as the &#8220;solo installer&#8221;. I gave a bit of background on this in a post to the Chef mailing list at one point but I&#8217;ll go over it again as part of this post.</p>

<!-- more -->


<h1>Beginnings</h1>

<p>To understand why this is something of a departure for us, it&#8217;s worth understanding from whence we came. enStratus, like most hosted solutions, has experienced largely organic growth. One of the nice things about a SaaS product is that you have the freedom to experiment to some degree. You might be in the middle of a MySQL to Riak migration and still need two data stores for the time being. You might be in the process of changing how some background process works so you&#8217;ve got an older designed system running alongside the newer system which is only doing a subset of the work.</p>

<p>With a hosted platform these kinds of things are hidden from the end-user to some degree. They don&#8217;t know what&#8217;s going on behind the scenes and they really don&#8217;t care as long as you&#8217;re doing X, Y and Z that you&#8217;re being paid to do.</p>

<p>Now for those of you running/developing/managing some sort of SaaS/Hosted solution I want you to take a journey with me. Imagine that tomorrow someone walked into your office and said:</p>

<blockquote><p>We need to take our production environment and make it so that a customer can run it on-premise. Oh and it has to be able to run entirely isolated and can&#8217;t always talk back to any external service.</p></blockquote>


<p>That&#8217;s pretty much the place enStratus found itself. enStratus is a cloud management platform. Not all clouds are public. Maybe someone wants to use your service but has regulatory needs that prevent them from using it as is. Maybe it needs to run entirely isolated from the rest of the world. There are valid reasons for all of these despite my general attitude towards security theater in the enterprise.</p>

<p>Now you&#8217;ve got an interesting laundry list in front of you:</p>

<ul>
<li>How do you teach someone to manage this organic &#8220;thing&#8221; you&#8217;ve built?</li>
<li>How do you take not one application but an entire company and its stack and shove it in a box?</li>
<li>Do you wrap up all the external components (monitoring, file servers, access control) and deal with those?</li>
</ul>


<p>We aren&#8217;t the first company to do this and we won&#8217;t be the last. Take a look at Github. They offer a private version of Github. But we&#8217;re not just talking about one part of Github - e.g. Gist. We&#8217;re talking about the entire stack the company runs to provide Github as we know it.</p>

<p>Unless you design with this in mind, you can&#8217;t really begin to understand how difficult of a task this can be. As I understand it, Github finally went the appliance route and offers prefab VMs with some setup glue. Please correct me if I&#8217;m wrong here.</p>

<h1>Early iterations</h1>

<p>Obviously you can see that this was/is a daunting task. Original versions of our install process were based around a collection of shell scripts. Because of certain details of the stack (such as encryption keys for our key management system), we had to maintain state between each component of the stack when it was installed. Currently there are roughly 7 core components/services that make up enStratus:</p>

<ul>
<li>The console</li>
<li>The api endpoint</li>
<li>The key management subsystem</li>
<li>The provisioning subsystem</li>
<li>The directory integration service</li>
<li>The &#8220;worker&#8221; system</li>
<li>The &#8220;monitor&#8221; system</li>
</ul>


<p>and those are just the enStratus components. You also need RabbitMQ, MySQL and Riak (as we&#8217;re currently transitioning from MySQL to Riak). All of these things largely talk to each other over some web service or via RabbitMQ. With one or two exceptions, they can all be loadbalanced in an active/active type of configuration and scaled horizontally simply by adding an additional &#8220;whatever&#8221; node.</p>

<p>So the original installation process was a set of shell scripts that persisted some state and this &#8220;state&#8221; file had to be copied between systems. Yes, we could use some sort of external configuration store but that&#8217;s another component that we would have to install just to do the installation.</p>

<h1>Phase two</h1>

<p>One of my coworkers, <a href="https://twitter.com/zomgreg">Greg Moselle</a>, was sort of &#8220;sysadmin number one&#8221; at enStratus. This was in addition to his duties managing all customer installs. So he did what most of us would do and brute forced a workable solution with the original shell scripts. As enStratus started to offer Chef and Puppet support in the product, Greg gets this wild hair up his ass and thinks:</p>

<blockquote><p>I wonder if I can rewrite these shell scripts into something a bit more cross-platform and idempotent using chef-solo.</p></blockquote>


<p>You might be thinking the same thing I originally did that this was largely a bad idea. In my mind we had a workable solution for the interim in the existing shell scripts that had the install of enStratus down to a day or so. Pragmatism right? It&#8217;s also worth noting that this was how he wanted to learn Chef&#8230;</p>

<p>So off he goes and does what I recommend any new Puppet or Chef user does - exec resources all over the fucking place. Wrap your shell scripts in <code>exec</code>. Hardcode all the fucking things.</p>

<p>Once he did this, then I started working with him on some basic attributes to make them a bit more flexible. Before too long we had a stack of roles matching different components and we had moved everything from <code>cookbook_file</code> to <code>remote_file</code>. It was still a mess of execs but it worked.</p>

<p>But we still had this &#8220;state&#8221; we had to maintain between runs. This is not going away anytime soon. In production we store this state in attributes and use chef-server. We didn&#8217;t have that luxury here.</p>

<p>Then <a href="https://twitter.com/jimsander">Jim Sander</a> drops in and writes a small setup script that maintains some of that state for us. Basically a wrapper around raw <code>chef-solo</code>. Side note, if you ever need someone to drop some shell scripting knowledge on your ass, Jim&#8217;s the man to see. Ask him about his Tivoli days to really piss him off.</p>

<p>At this point, I start working on cleaning up the recipes as a sort of tutorial for folks. I&#8217;d pick a particular recipe and refactor it to all native resources and make it data driven. I&#8217;d commit these in small chunks so folks could easily see what the differences were - stuff like &#8220;instead of execing to call rpm, we&#8217;ll use the yum provider&#8221;.</p>

<p>At this point we&#8217;ve got something pretty far evolved from where we were. Now that we&#8217;ve got this workable chef-solo repository, I decide to hack out a quick Vagrantfile. The problem was it wasn&#8217;t entirely idempotent and we still had some manual steps that had to be dealt with. In addition to finishing up the recipes, I ended up rewriting large chunks of the setup script. Now that I had something largely repeatable and localized, we suddenly had a Vagrant setup that folks could use for development. It wasn&#8217;t fully automated but it worked. We also still had this shared state thing.</p>

<p>So I set out to refactor the setup script a bit more. What&#8217;s important to keep in mind is that the primary use-case for this chef-solo repository wasn&#8217;t for Vagrant. This is our &#8220;installer&#8221;. The interesting part to me is that the improvements to how we do on-premise installs are coming as a direct result of making this work better with Vagrant. There&#8217;s a lot of wrapper work tied up in the setup script that wouldn&#8217;t need to be done if we used a base box that had more stuff baked in. However not baking stuff in actually gives us a more real-world scenario for installation.</p>

<p>Additionally we needed to be able to somehow pass user-specific configuration settings into the <code>vagrant up</code> process and get those into <code>chef-solo</code> by way of the setup wrapper. We have things like license keys, hostnames and my personal hated favorite - database credentials - that need to be handled in a way that we can make it so a developer can just type <code>vagrant up</code> and be running. If I have to require someone to edit a json file or anything else, the whole thing will fall flat on its face.</p>

<p>So any time we needed something like that, we added support to the setup wrapper and then used environment variables to pass that information in to vagrant.</p>

<h1>So how do we vagrant?</h1>

<p>We leverage environment variables pretty heavily in our Vagrantfile. If it&#8217;s something that someone might need to tune for whatever reason, it&#8217;s an environment variable that triggers an option to our setup script.</p>

<h2>Current list of tunables</h2>

<p>This is just a subset of the tunables we control via environment variables. The majority of these map directly to options for the setup script:</p>

<ul>
<li><code>ES_DLPASS</code> and <code>ES_LICENSE</code>: the basic set of credentials needed to fetch our assets and your personal license key.</li>
<li><code>ES_MEM</code>: this is the result of some of our front end developers having less memory than others.</li>
<li><code>ES_CACHE</code>: We have an office in New Zealand and bandwidth there is &#8220;challenging&#8221;. This allows us to cache as much as possible between calls to <code>vagrant up</code>. This not only triggers caching of system packages downloaded but also triggers the <code>prefetch</code> option in our setup script that predownloads all the assets. These assets are all stored in the <code>cache</code> directory of the repository which is not coincidentally the value of <code>file_cache_path</code> in chef-solo. Remember that we may not always have external network access during installation so we offer a way to warm the <code>cache</code> directory with as many assets as possible.</li>
<li><code>ES_BOX</code>: lets you specify an alternate base box to use. This is how we test the installer on different distros.</li>
<li><code>ES_DEVDIR</code>: shares an additional directory from the host to the vagrant image. This is how development is done (at least for me). I map this to the root of all of my git repository checkouts.</li>
<li><code>ES_VAGRANT_NW</code>: Allows you to configure bridged networking in addition to the host-only network we use.</li>
<li><code>ES_PROFILE</code>: This directly maps to an option in our setup script for persisting state between runs.</li>
</ul>


<p>There are other options that are specific to the enStratus product as well but you get the idea.</p>

<h2>The setup script</h2>

<p>I can&#8217;t post the full thing here but I can give you a general idea of how it works and some of the options it supports. This is a &#8220;sanitized&#8221; truncated version of the help output:</p>

<figure class='code'><div class="highlight"><table><tr><td class='code'><pre><code class=''><span class='line'>Usage: setup.sh [-h] [-e] [-f] -p &lt;download password&gt; -l &lt;license key&gt; [-s savename] [-c &lt;console hostname&gt;] [-n &lt;number of nodes&gt;] [-m &lt;mapping string&gt;] [-a &lt;optional sourceCidr string&gt;]
</span><span class='line'>-------------------------------------------------------------------------
</span><span class='line'>-p: The password for downloading enStratus
</span><span class='line'>-l: The license key for enStratus
</span><span class='line'>
</span><span class='line'>For most single node installations, specify the download password and license key.
</span><span class='line'>
</span><span class='line'>optional arguments
</span><span class='line'>------------------
</span><span class='line'>-h: This text
</span><span class='line'>-e: extended help
</span><span class='line'>-f: fetch-only mode. Downloads and caches *MOST* assets. Requires download password and *WILL* install chef
</span><span class='line'>-c: Alternate hostname to use for the console. [e.g. cloud.mycompany.com] (default: fqdn of console node)
</span><span class='line'>-a: Alternate string to use for the sourceCidr entry. You know if you need this.
</span><span class='line'>-s: A name to identify this installation
</span><span class='line'>-n: Number of nodes in installation [1,2,4] (default: 1)
</span><span class='line'>-m: Mapping string [e.g. frontend:192.168.1.1,backend:backend.mydomain.com]
</span><span class='line'>
</span><span class='line'>About savename:
</span><span class='line'>---------------
</span><span class='line'>Savename is a way to persist settings between runs of enStratus.
</span><span class='line'>If you specify a save name, a directory will be created under local_settings.
</span><span class='line'>It will contain a YAML file with your settings as well as 4 JSON files.
</span><span class='line'>
</span><span class='line'>The YAML file is the source of truth for the named installation. The JSON files MAY
</span><span class='line'>be recreated if the contents of the YAML file change. They exist to migrate between systems.
</span><span class='line'>If a save file is found, no other arguments are honored. If you need to change the 
</span><span class='line'>download password or license key, please update the YAML file itself
</span><span class='line'>
</span><span class='line'>If you lose this YAML file you will not be able to recover this enStratus installation.
</span><span class='line'>You should save it somewhere secure and optionally version it.</span></code></pre></td></tr></table></div></figure>


<h3>Persisting settings</h3>

<p>One of the &#8220;gotchas&#8221; we have is how to build a node JSON file for chef-solo that carries any information we need to persist. Since we don&#8217;t know the state of all the systems involved when we go in, we have to &#8220;punt&#8221; on a few things. What we end up doing is something we call the <code>savename</code>. If you use this option, the settings you define will be persisted to a directory that git ignores called <code>local_settings</code>. This directory will contain directories named after the above <code>savename</code> parameter. The setup script (written for now in bash) will create a YAML file (easy to do in bash with a HEREDOC, as opposed to JSON) and also a copy of the generated encryption keys in a plain text file for the customer to store.</p>

<p>The only thing we can count on being on the system up front is the Chef omnibus install (since that&#8217;s a requirement). Instead of complicating things with ruby at this point (and chicken/egg issues since the setup script actually installs chef omnibus), we use the <code>erubis</code> binary that gets installed with omnibus to pass the YAML to a JSON erb template. That generated JSON is the node JSON with attribute overrides. We actually support multi-node installation in the setup script if you provide a mapping of where certain components are running when calling setup. If you rerun setup using an existing <code>savename</code> parameter, the YAML file is updated (only certain values) and the JSON file is then regenerated.</p>
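
<p>As a rough sketch of that YAML-to-JSON step (this is not our actual script): the snippet below uses Ruby&#8217;s stdlib <code>ERB</code> in place of the <code>erubis</code> binary, and the setting names, attribute layout and run list are all made up for illustration.</p>

```ruby
require 'yaml'
require 'erb'
require 'json'

# Hypothetical settings, standing in for the YAML file saved under
# local_settings/<savename>/. The key names are invented for this sketch.
SETTINGS_YAML = <<~YAML
  savename: vagrant-demo
  console_hostname: cloud.example.com
  license_key: not-a-real-key
  nodes:
    frontend: 192.168.1.1
    backend: backend.example.com
YAML

# Hypothetical node JSON template. The attribute layout and the role name
# are assumptions, not the real enStratus cookbook attributes.
NODE_TEMPLATE = <<~'TEMPLATE'
  {
    "enstratus": {
      "license_key": <%= settings['license_key'].to_json %>,
      "console_hostname": <%= settings['console_hostname'].to_json %>,
      "nodes": <%= settings['nodes'].to_json %>
    },
    "run_list": ["role[single_node]"]
  }
TEMPLATE

# Render the template against the parsed YAML and return the JSON text.
def render_node_json(yaml_text, template_text)
  settings = YAML.safe_load(yaml_text)
  rendered = ERB.new(template_text).result(binding)
  # Round-trip through the JSON parser so a broken template fails here
  # instead of in the middle of a chef-solo run.
  JSON.pretty_generate(JSON.parse(rendered))
end

NODE_JSON = render_node_json(SETTINGS_YAML, NODE_TEMPLATE)
puts NODE_JSON
```

<p>The rendered text is the kind of file that gets handed to <code>chef-solo -j</code>. Calling <code>to_json</code> on every value inside the template keeps quoting and escaping out of the template itself, which matters once passwords and license keys start containing odd characters.</p>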

<h1>The upshot</h1>

<p>The best part of all of this is that we can now say the same process is used when installing enStratus locally in Vagrant, in our dev, staging and production environments (though production uses chef-server) as well as what we install on the customer&#8217;s site. We version this repository around static points in our release cycle. We branch for a new release and create tags at given points in the branch based on either a patch release for enStratus itself in that release OR a patch to the installer itself.</p>

<p>It&#8217;s not all unicorns pooping rainbows. The process is much more complicated than it needs to be but it&#8217;s almost a world of difference from where it was when I started and it was entirely a team effort. This setup allowed us to do full testing to switch entirely off the SunJDK (and the need to manually download the JCE during customer installs) onto OpenJDK. We were able to migrate from Tomcat to Jetty and refactor our build process using this method. I was able to do this work without affecting anyone else. All I had to do when we were ready for full testing was tell everyone to switch branches, run <code>vagrant up</code> and test away.</p>

<h1>Special thanks</h1>

<p>I want to give a serious shout-out to Mitchell Hashimoto and John Bender for the work they did with Vagrant. Last year I said that no two software products impacted my career more than ElasticSearch and ZeroMQ. This year, without a doubt, Vagrant is at the top of that list.</p>

<h1>Addendum</h1>

<p>What follows is the sanitized version of our <code>Vagrantfile</code>. If anyone has any suggestions, I&#8217;m all ears:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class='code'><pre><code class='ruby'><span class='line'><span class="no">Vagrant</span><span class="o">::</span><span class="no">Config</span><span class="o">.</span><span class="n">run</span> <span class="k">do</span> <span class="o">|</span><span class="n">config</span><span class="o">|</span>
</span><span class='line'>  <span class="k">if</span> <span class="no">ENV</span><span class="o">[</span><span class="s1">&#39;ES_BOX&#39;</span><span class="o">]</span>
</span><span class='line'>    <span class="n">config</span><span class="o">.</span><span class="n">vm</span><span class="o">.</span><span class="n">box</span> <span class="o">=</span> <span class="no">ENV</span><span class="o">[</span><span class="s1">&#39;ES_BOX&#39;</span><span class="o">]</span>
</span><span class='line'>  <span class="k">else</span>
</span><span class='line'>    <span class="n">config</span><span class="o">.</span><span class="n">vm</span><span class="o">.</span><span class="n">box</span> <span class="o">=</span> <span class="s2">&quot;es-dev&quot;</span>
</span><span class='line'>    <span class="n">config</span><span class="o">.</span><span class="n">vm</span><span class="o">.</span><span class="n">box_url</span> <span class="o">=</span> <span class="s2">&quot;https://opscode-vm.s3.amazonaws.com/vagrant/boxes/opscode-ubuntu-12.04.box&quot;</span>
</span><span class='line'>  <span class="k">end</span>
</span><span class='line'>
</span><span class='line'>  <span class="k">if</span> <span class="no">ENV</span><span class="o">[</span><span class="s1">&#39;ES_VAGRANT_NW&#39;</span><span class="o">]</span> <span class="o">==</span> <span class="s2">&quot;bridged&quot;</span>
</span><span class='line'>    <span class="n">config</span><span class="o">.</span><span class="n">vm</span><span class="o">.</span><span class="n">network</span> <span class="ss">:bridged</span>
</span><span class='line'>  <span class="k">else</span>
</span><span class='line'>    <span class="c1"># If you change this address, the conditional logic</span>
</span><span class='line'>    <span class="c1"># in console.rb will break</span>
</span><span class='line'>    <span class="n">config</span><span class="o">.</span><span class="n">vm</span><span class="o">.</span><span class="n">network</span> <span class="ss">:hostonly</span><span class="p">,</span> <span class="s2">&quot;172.16.129.19&quot;</span>
</span><span class='line'>  <span class="k">end</span>
</span><span class='line'>
</span><span class='line'>  <span class="c1"># These entries allow you to run code locally and talk to a</span>
</span><span class='line'>  <span class="c1"># &quot;working set&quot; of data services</span>
</span><span class='line'>  <span class="n">config</span><span class="o">.</span><span class="n">vm</span><span class="o">.</span><span class="n">forward_port</span> <span class="mi">15000</span><span class="p">,</span> <span class="mi">15000</span>   <span class="c1"># api</span>
</span><span class='line'>  <span class="n">config</span><span class="o">.</span><span class="n">vm</span><span class="o">.</span><span class="n">forward_port</span> <span class="mi">3302</span><span class="p">,</span> <span class="mi">3302</span>     <span class="c1"># dispatcher</span>
</span><span class='line'>  <span class="n">config</span><span class="o">.</span><span class="n">vm</span><span class="o">.</span><span class="n">forward_port</span> <span class="mi">2013</span><span class="p">,</span> <span class="mi">2013</span>     <span class="c1"># km</span>
</span><span class='line'>  <span class="n">config</span><span class="o">.</span><span class="n">vm</span><span class="o">.</span><span class="n">forward_port</span> <span class="mi">5672</span><span class="p">,</span> <span class="mi">5672</span>     <span class="c1"># RabbitMQ (autostarts)</span>
</span><span class='line'>  <span class="n">config</span><span class="o">.</span><span class="n">vm</span><span class="o">.</span><span class="n">forward_port</span> <span class="mi">8098</span><span class="p">,</span> <span class="mi">8098</span>     <span class="c1"># Riak HTTP (autostarts)</span>
</span><span class='line'>  <span class="n">config</span><span class="o">.</span><span class="n">vm</span><span class="o">.</span><span class="n">forward_port</span> <span class="mi">8097</span><span class="p">,</span> <span class="mi">8097</span>     <span class="c1"># Riak protobuf (autostarts)</span>
</span><span class='line'>  <span class="n">config</span><span class="o">.</span><span class="n">vm</span><span class="o">.</span><span class="n">forward_port</span> <span class="mi">3306</span><span class="p">,</span> <span class="mi">3306</span>     <span class="c1"># MySQL (autostarts)</span>
</span><span class='line'>  <span class="n">config</span><span class="o">.</span><span class="n">vm</span><span class="o">.</span><span class="n">forward_port</span> <span class="mi">55672</span><span class="p">,</span> <span class="mi">55672</span>   <span class="c1"># RabbitMQ management interface</span>
</span><span class='line'>
</span><span class='line'>  <span class="k">if</span> <span class="no">ENV</span><span class="o">[</span><span class="s1">&#39;ES_MEM&#39;</span><span class="o">]</span>
</span><span class='line'>    <span class="n">config</span><span class="o">.</span><span class="n">vm</span><span class="o">.</span><span class="n">customize</span> <span class="o">[</span><span class="s2">&quot;modifyvm&quot;</span><span class="p">,</span> <span class="ss">:id</span><span class="p">,</span> <span class="s2">&quot;--memory&quot;</span><span class="p">,</span> <span class="no">ENV</span><span class="o">[</span><span class="s1">&#39;ES_MEM&#39;</span><span class="o">]]</span>
</span><span class='line'>  <span class="k">else</span>
</span><span class='line'>    <span class="n">config</span><span class="o">.</span><span class="n">vm</span><span class="o">.</span><span class="n">customize</span>  <span class="o">[</span><span class="s2">&quot;modifyvm&quot;</span><span class="p">,</span> <span class="ss">:id</span><span class="p">,</span> <span class="s2">&quot;--memory&quot;</span><span class="p">,</span> <span class="mi">8192</span><span class="o">]</span>
</span><span class='line'>  <span class="k">end</span>
</span><span class='line'>
</span><span class='line'>  <span class="k">if</span> <span class="no">ENV</span><span class="o">[</span><span class="s1">&#39;ES_DEVDIR&#39;</span><span class="o">]</span>
</span><span class='line'>    <span class="n">config</span><span class="o">.</span><span class="n">vm</span><span class="o">.</span><span class="n">share_folder</span> <span class="s2">&quot;es-dev-data&quot;</span><span class="p">,</span> <span class="s2">&quot;/es_dev_data&quot;</span><span class="p">,</span> <span class="no">ENV</span><span class="o">[</span><span class="s1">&#39;ES_DEVDIR&#39;</span><span class="o">]</span>
</span><span class='line'>  <span class="k">end</span>
</span><span class='line'>
</span><span class='line'>  <span class="k">if</span> <span class="no">ENV</span><span class="o">[</span><span class="s1">&#39;ES_CACHE&#39;</span><span class="o">]</span>
</span><span class='line'>    <span class="nb">puts</span> <span class="s2">&quot;Shared cache enabled&quot;</span>
</span><span class='line'>    <span class="no">FileUtils</span><span class="o">.</span><span class="n">mkdir_p</span><span class="p">(</span><span class="no">File</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="s2">&quot;cache&quot;</span><span class="p">,</span><span class="s2">&quot;apt&quot;</span><span class="p">,</span><span class="s2">&quot;partial&quot;</span><span class="p">))</span> <span class="k">unless</span> <span class="no">Dir</span><span class="o">.</span><span class="n">exists?</span><span class="p">(</span><span class="no">File</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="s2">&quot;cache&quot;</span><span class="p">,</span><span class="s2">&quot;apt&quot;</span><span class="p">,</span> <span class="s2">&quot;partial&quot;</span><span class="p">))</span>
</span><span class='line'>    <span class="n">config</span><span class="o">.</span><span class="n">vm</span><span class="o">.</span><span class="n">share_folder</span><span class="p">(</span><span class="s2">&quot;apt&quot;</span><span class="p">,</span> <span class="s2">&quot;/var/cache/apt/archives&quot;</span><span class="p">,</span> <span class="s2">&quot;cache/apt&quot;</span><span class="p">)</span>
</span><span class='line'>  <span class="k">end</span>
</span><span class='line'>
</span><span class='line'>  <span class="n">config</span><span class="o">.</span><span class="n">vm</span><span class="o">.</span><span class="n">provision</span> <span class="ss">:shell</span> <span class="k">do</span> <span class="o">|</span><span class="n">shell</span><span class="o">|</span>
</span><span class='line'>    <span class="no">ES_LICENSE</span><span class="o">=</span><span class="no">ENV</span><span class="o">[</span><span class="s1">&#39;ES_LICENSE&#39;</span><span class="o">]</span>
</span><span class='line'>    <span class="no">ES_DLPASS</span><span class="o">=</span><span class="no">ENV</span><span class="o">[</span><span class="s1">&#39;ES_DLPASS&#39;</span><span class="o">]</span>
</span><span class='line'>    <span class="no">ES_PROFILE</span><span class="o">=</span><span class="no">ENV</span><span class="o">[</span><span class="s1">&#39;ES_PROFILE&#39;</span><span class="o">]</span> <span class="o">||</span> <span class="s2">&quot;vagrant-</span><span class="si">#{</span><span class="no">Time</span><span class="o">.</span><span class="n">now</span><span class="o">.</span><span class="n">to_i</span><span class="si">}</span><span class="s2">&quot;</span>
</span><span class='line'>
</span><span class='line'>    <span class="k">if</span> <span class="no">ES_LICENSE</span><span class="o">.</span><span class="n">nil?</span> <span class="ow">or</span> <span class="no">ES_DLPASS</span><span class="o">.</span><span class="n">nil?</span>
</span><span class='line'>      <span class="nb">puts</span> <span class="s2">&quot;You must set the environment variables: ES_LICENSE and ES_DLPASS!&quot;</span>
</span><span class='line'>      <span class="nb">exit</span> <span class="mi">1</span>
</span><span class='line'>    <span class="k">end</span>
</span><span class='line'>    <span class="no">ES_CLOUD</span><span class="o">=</span><span class="no">ENV</span><span class="o">[</span><span class="s1">&#39;ES_CLOUD&#39;</span><span class="o">]</span>
</span><span class='line'>    <span class="no">ES_CIDR</span><span class="o">=</span><span class="no">ENV</span><span class="o">[</span><span class="s1">&#39;ES_CIDR&#39;</span><span class="o">]</span>
</span><span class='line'>    <span class="no">ES_DEBUG</span><span class="o">=</span><span class="no">ENV</span><span class="o">[</span><span class="s1">&#39;ES_DEBUG&#39;</span><span class="o">]</span> <span class="o">||</span> <span class="kp">false</span>
</span><span class='line'>    <span class="n">setup_opts</span> <span class="o">=</span> <span class="s2">&quot;-l </span><span class="si">#{</span><span class="no">ES_LICENSE</span><span class="si">}</span><span class="s2"> -p </span><span class="si">#{</span><span class="no">ES_DLPASS</span><span class="si">}</span><span class="s2"> -s </span><span class="si">#{</span><span class="no">ES_PROFILE</span><span class="si">}</span><span class="s2"> &quot;</span>
</span><span class='line'>    <span class="n">setup_opts</span> <span class="o">&lt;&lt;</span> <span class="s2">&quot;-c </span><span class="si">#{</span><span class="no">ES_CLOUD</span><span class="si">}</span><span class="s2"> &quot;</span> <span class="k">if</span> <span class="no">ES_CLOUD</span>
</span><span class='line'>    <span class="n">setup_opts</span> <span class="o">&lt;&lt;</span> <span class="s2">&quot;-a </span><span class="si">#{</span><span class="no">ES_CIDR</span><span class="si">}</span><span class="s2"> &quot;</span> <span class="k">if</span> <span class="no">ES_CIDR</span>
</span><span class='line'>    <span class="n">chef_opts</span> <span class="o">=</span> <span class="no">ES_DEBUG</span> <span class="p">?</span> <span class="s2">&quot;-l debug -L local_settings/</span><span class="si">#{</span><span class="no">ES_PROFILE</span><span class="si">}</span><span class="s2">/chef-run.log&quot;</span> <span class="p">:</span> <span class="s2">&quot;&quot;</span>
</span><span class='line'>    <span class="n">shell</span><span class="o">.</span><span class="n">inline</span> <span class="o">=</span> <span class="s2">&quot;cd /vagrant; ./setup.sh </span><span class="si">#{</span><span class="n">setup_opts</span><span class="si">}</span><span class="s2">; chef-solo -j local_settings/</span><span class="si">#{</span><span class="no">ES_PROFILE</span><span class="si">}</span><span class="s2">/single_node.json -c solo.rb </span><span class="si">#{</span><span class="n">chef_opts</span><span class="si">}</span><span class="s2">&quot;</span>
</span><span class='line'>  <span class="k">end</span>
</span><span class='line'>  <span class="k">if</span> <span class="no">ENV</span><span class="o">[</span><span class="s1">&#39;ES_POSTRUN&#39;</span><span class="o">]</span>
</span><span class='line'>    <span class="n">config</span><span class="o">.</span><span class="n">vm</span><span class="o">.</span><span class="n">provision</span> <span class="ss">:shell</span> <span class="k">do</span> <span class="o">|</span><span class="n">shell</span><span class="o">|</span>
</span><span class='line'>      <span class="n">shell</span><span class="o">.</span><span class="n">inline</span> <span class="o">=</span> <span class="s2">&quot;chef-solo -j /vagrant/local_settings/</span><span class="si">#{</span><span class="no">ES_PROFILE</span><span class="si">}</span><span class="s2">/single_node.json -c /vagrant/solo.rb -o </span><span class="se">\&quot;</span><span class="si">#{</span><span class="no">ENV</span><span class="o">[</span><span class="s1">&#39;ES_POSTRUN&#39;</span><span class="o">]</span><span class="si">}</span><span class="se">\&quot;</span><span class="s2">&quot;</span>
</span><span class='line'>    <span class="k">end</span>
</span><span class='line'>  <span class="k">end</span>
</span><span class='line'><span class="k">end</span>
</span></code></pre></td></tr></table></div></figure>

]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[What production means]]></title>
    <link href="http://lusis.github.com/blog/2012/07/09/what-production-means/"/>
    <updated>2012-07-09T21:31:00-04:00</updated>
    <id>http://lusis.github.com/blog/2012/07/09/what-production-means</id>
    <content type="html"><![CDATA[<p>This post is something that&#8217;s been brewing for a while. While it may sound targeted in tone, it&#8217;s more general than that. Let&#8217;s just call it an open letter to family, friends and coworkers around the world.</p>

<!-- more -->


<p>One thing that I have the hardest time communicating to friends and family who aren&#8217;t in the IT industry is the concept of &#8220;production&#8221; and what it means to be on-call. Even coworkers have a hard time understanding what it means.</p>

<p>The topic recently came up again and the confusion bothered me so much that I resolved to write this blog post as soon as humanly possible.</p>

<h1>A few clarifications</h1>

<p>I want to clarify a few very important things:</p>

<ul>
<li>I&#8217;m not whining about what I do</li>
<li>I love what I do</li>
<li>I&#8217;m not burned out</li>
<li>I&#8217;m not being self-important</li>
<li>I&#8217;ve always had a hard time &#8216;ranking&#8217; problems. EVERYTHING is important to me.</li>
<li>I&#8217;m not really interested in critiques of what SHOULD have been done. Riak wasn&#8217;t around, for instance, when I was managing the financial stuff.</li>
<li>Yes, rotations are important but not always viable. Luckily we have a solid rotation at enStratus.</li>
</ul>


<h1>What production means to me (and why)</h1>

<p>Production environments take many forms, and production itself is even harder to define. For me, production has always meant &#8220;any system, service or component that the business requires to do business&#8221;.
I&#8217;ve worked in several different companies over the last 17 years. In some cases, production was an ERP system or a file server. In other cases, production was a web presence. What&#8217;s interesting about each of these is that in some cases, production had a time associated with it.</p>

<p>Let&#8217;s take a few of these and compare them:</p>

<h2>The retail financial company</h2>

<p>Long ago I worked for a company that did payday and title loans. We operated over 600 retail locations from east to west coast. The primary application used by these outlets was a web-based loan management application (WebSphere + DB2). Stores were located across the country and store hours were from 9AM to 6PM (IIRC). From this you might think &#8220;production was the web application and it needed to be online from 9AM EST to 10PM EST&#8221;. You would be correct at the highest level.</p>

<p>However employees started the day at 8AM (lining up customers to call and what not) and left at 7PM (closing out books for the day). After the last store went &#8220;offline&#8221;, we began various nightly batch jobs. Being that this was financial in nature, batch jobs were the norm. We also had backups that had to run as we rebuilt our QA database nightly from sanitized production database backups. If the batch jobs didn&#8217;t finish in time, we actually had to delay the start of the day for the stores. Our data warehouse was also loaded from a secondary copy of the database restored from backup. If I recall, the main reason for this was that our nightly window was SO crunched for time that we couldn&#8217;t even load the warehouse from the main database because we had to start various batch jobs as soon as backup was done.</p>

<p>But that&#8217;s just the main system. We had ancillary services as well. None of the retail outlets had access to the internet except through a squid proxy. There were print servers that did server-side check printing. In the backoffice, we had collections and other things that depended on the reports that came out of the data warehouse. We had nightly backups of the LAN stuff. DNS servers that the stores had to use. VPN concentrators. ALL of this had the same SLA as production.</p>

<p>All told, I recall the final number for any business hours outage as costing us something like $100k for 15 minutes of downtime.</p>

<h2>The Learning Management System</h2>

<p>This system was used by a charter school system in the state of Ohio. It was an online classroom system that provided education for at-risk students. This WAS the school for these kids. Obviously it had normal school hours but since the students had no physical textbooks of any kind, the system HAD to be online for something as simple as homework. As with the previous setup, there were all sorts of ancillary services that we had to have available. All of the static content was shared across all of the tomcat servers via a SAN (OCFS2 - I have scars). It was backed by MySQL. We still had to do backups. We had to maintain connectivity. Everything we had was &#8216;production&#8217;. Since we had developers in other countries on different hours, we burned what money we had when the development environment or our SVN repo was offline. That was production too.</p>

<h2>Web applications in general</h2>

<p>To those in the industry I&#8217;m not telling you anything new. But to those not in the know, the internet doesn&#8217;t have office hours. Yes, you can gauge where your largest userbase is but take a system like enStratus.</p>

<p>It manages cloud resources for people all across the world. For many of these people, the only access they have to their cloud account is VIA enStratus. Take AWS for instance. enStratus is responsible for detecting outages and replacing components in the infrastructure for these companies or autoscaling to meet some demand. If enStratus is offline, these actions are NOT being taken on behalf of the user. The biggest fear for me is that enStratus is offline when AWS is having an outage. Some customers are paying us for this use case alone. Mind you in the last few outages, not even enStratus could fix the problem because of control plane issues. One thing enStratus can do is scale across multiple clouds so even AWS control plane issues are no excuse.</p>

<p>enStratus production itself is a pretty complex beast. The stack in general is designed to be fairly &#8220;SOA&#8221;. We use RabbitMQ pretty heavily for workers. However we&#8217;ve had some issues in the recent past where our workers were getting OutOfMemory exceptions. We run multiple workers (obviously) but in this case an OOM on one worker would eventually translate into OOMs across all the workers. When all the workers OOMd, they would stop processing messages from the queue. When that happened, RabbitMQ could eventually tank from units of work waiting to be picked up. We never had this happen, mind you but that was the end game.</p>

<p>This meant we had to be diligent on these OOMs. All the time. 24x7x365.</p>

<p>Luckily this problem was fairly short-lived but until the bug was identified and fixed (it happened to be related to an edge case with S3 bucket sizes), we had to be on guard for these OOM exceptions.</p>

<h1>What does it mean?</h1>

<p>The thing that I want to get across is that &#8220;production&#8221; is DIRECTLY related to the bottom line of the business. If &#8220;production&#8221; is offline, customers can&#8217;t use the system. Customers are unhappy. If customers are unhappy they eventually go elsewhere. If customers go elsewhere the company loses money. If the company loses money eventually the company lays people off. This isn&#8217;t rocket science. We talk about complex systems and cascading failures. This is a cascading failure that means someone doesn&#8217;t have a job at Christmas.</p>

<p>Yes, I take it that seriously. When I talk about production, that&#8217;s what I mean.</p>

<h1>On-call</h1>

<p>Now that you know what production means and what impact it has, I shouldn&#8217;t need to say much about being &#8220;on-call&#8221;. Yes, I&#8217;m on-call now and then. Yes, just like a doctor. Sometimes, depending on the company, I&#8217;ve been the only person on-call. Ever. No rotation. No help. Just me. The one person keeping shit running all the time so that customers (internal or external) aren&#8217;t impacted by the slightest glitch in the system. Yes we should build more resilient systems and we strive to do that. However, tech debt is a thing. It&#8217;s not always immediately an option.</p>

<p>So when I&#8217;m on-call it means I&#8217;m the person responsible for production as defined above and all the baggage that comes with it.</p>

<p>Someone once said to me &#8220;Most of us work nights and weekends, although we try to balance it when we can.&#8221;</p>

<p>While I appreciate the sentiment, I don&#8217;t just work nights and weekends, I work ALL the time. I have to be able to respond at a moment&#8217;s notice to a production issue when I&#8217;m on-call. Think about that for a moment. I have to essentially be 15 minutes from a working internet connection and may need to sit in front of a computer for an unspecified amount of time until the issue is resolved. I can&#8217;t just go to bed or decide that I&#8217;ve worked enough for the day.</p>

<h1>The joys of working remote</h1>

<p>I&#8217;ve been pretty fortunate in the past several years to be in a situation where I can support production from pretty much anywhere. I need at most a 3G connection and my laptop. I can VPN into production and fix most any problem. Yes, I always bring my laptop on &#8220;vacation&#8221;. If I don&#8217;t have a decent signal, then there&#8217;s a chance I&#8217;ll have to drive to the McDonald&#8217;s 15 minutes away to use the wifi. I try not to be on-call when I&#8217;m on vacation but that&#8217;s not always an option.</p>

<p>I just got back from visiting my in-laws Up North in Michigan. The only time I was able to get a single bar of 3G was late at night when enough subscribers stopped using the cell towers. Up until a year or so ago, there was no option for any sort of broadband internet in the area.</p>

<p>I&#8217;ve finally gotten my in-laws to understand much of what I&#8217;ve written here. They&#8217;ll be wiring the cottage for cable internet so that I can take the family up for longer periods of time. Yes, I&#8217;ll be &#8220;working&#8221; on vacation but I&#8217;d rather work a bit on vacation and stay longer than have to cut the trip short just so I can get back.</p>

<h1>Poor poor me</h1>

<p>As I said at the start, please don&#8217;t misinterpret what I&#8217;m saying as whining or some sort of god complex. This is the field I&#8217;ve chosen. I love what I do and I take that responsibility very seriously. As we start to build more resilient systems and accept that failure WILL happen and design for it, things will only get better. I still remember the first time I was able to replace a problematic system from scratch in 20 minutes via configuration management and call it a night.</p>

<p>As enStratus migrates its backend to Riak DS and builds out secondary and tertiary datacenters, I look forward to losing a data node not being the end of the world. DevOps über alles and all that.</p>

<p>So dear friends and family, the next time someone tells you they&#8217;re on-call for production&#8230;.cut &#8216;em some slack.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Why EBS was a bad idea]]></title>
    <link href="http://lusis.github.com/blog/2012/06/15/why-ebs-was-a-bad-idea/"/>
    <updated>2012-06-15T09:41:00-04:00</updated>
    <id>http://lusis.github.com/blog/2012/06/15/why-ebs-was-a-bad-idea</id>
    <content type="html"><![CDATA[<p>Since I just tweeted about this and I know people would want an explanation, I figured I&#8217;d short circuit 140 character hell and explain why I think EBS was the worst thing Amazon ever did to AWS.</p>

<!-- more -->


<p><em>First time I&#8217;ve had to do this but: the following is my personal opinion and in no way reflects any policy or position of my employer</em></p>

<h1>A journey through time</h1>

<p>I remember when EC2 was first unleashed. At the time I was working at Roundbox Media (later Roundbox Global - because we had an office in Costa Rica!). I was asked frequently if we could possibly host some of our production stuff there.</p>

<p>It was pretty much a no go from the start:</p>

<ul>
<li>No persistent IPs</li>
<li>No persistent storage</li>
</ul>


<p>Sure we could bake a bunch of stuff into the AMI root but the ephemeral storage wasn&#8217;t big enough to hold a fraction of our data. Still, we leveraged it as much as possible. We ran quite a bit of low risk stuff on it - test installs of our platform and demo sites for customers.</p>

<p>After I left RBX in Feb of 2008, I didn&#8217;t get an opportunity to work with AWS for a year or so and by then quite a bit had changed. If Amazon does one thing really well, it&#8217;s iterating quickly on its service offerings.</p>

<h1>So why is EBS a bad thing?</h1>

<p>For Amazon, EBS is NOT a bad thing. It was probably one of the smartest business moves they made (along with Elastic IPs). They could now claim that EC2 was JUST like running your own kit - you have a SAN! you have static IPs!</p>

<p>The problem is it&#8217;s not.</p>

<h1>The nature of block storage</h1>

<p>Anyone who&#8217;s dealt with any sort of networked filesystem knows the pains it can cause with certain application profiles. Traditional databases are notorious for expecting actual local storage and real block devices. It amazes me the number of people who put up with the pain of running a database in something like VMware using virtual disks hosted on an NFS device.</p>

<p>The point is the block devices have specific semantics and presumptions.</p>

<p>With EBS you&#8217;re promised a tasty block device that your OS can address as if it were local disk. Only it&#8217;s not&#8230;.</p>

<h2>Latency</h2>

<p>Let&#8217;s get the biggest elephant out of the way. EBS is a block device to the OS but under the hood it&#8217;s using the network. It may or may not be shared with non-block device traffic but it&#8217;s still subject to network latencies. God I hope that EBS at least gets its own port on the host side&#8230;</p>

<h2>Shared</h2>

<p>There&#8217;s a whole lot of sharing going on here too:</p>

<ul>
<li>local bandwidth from the physical server where your instance is to a given EBS subsystem (array, CEC, whatever)</li>
<li>aggregate bandwidth from all physical servers talking to a given EBS subsystem</li>
<li>disk I/O itself on a given EBS subsystem</li>
</ul>


<p>I don&#8217;t know how the connection from server to EBS is done. I would hope at least there are bonded ports or multiple uplinks/multipathing going on. I would REALLY hope that network I/O and Disk I/O are not on the same channel. Regardless, you&#8217;re still sharing whatever the size of that connection is with everyone else on the physical server your instance is on if they&#8217;re using EBS as well.</p>

<p>And the physical EBS array where your volume is? Depending on the size of your EBS volume, you&#8217;re dealing with network I/O on that unit&#8217;s connection from an unknown number of other customers. And to top it off, you&#8217;re not just sharing network bandwidth, you&#8217;re sharing disk bandwidth as well. There are still spindles under there folks. Sticking an API in front of it doesn&#8217;t change the fact that there is spinning rust under the covers.</p>

<p>Above ALL of that, you&#8217;ve got competing workloads - sequential vs random read.</p>

<p>Sure, just stick your root OS volume on that. That&#8217;s a great idea.</p>

<h1>Mixed messages</h1>

<p>To me, however, the biggest problem with EBS is not the latency. It&#8217;s not the shared resources. It&#8217;s not even taking something that is fundamentally locality oriented and trying to shoehorn it into something distributed.</p>

<p>It&#8217;s the fact that it sends the wrong damn message. I&#8217;ve said this before, I&#8217;ll say it again and I&#8217;ll stand by it.</p>

<p><strong>Unless you are willing, able or have designed your applications to have any single part of your infrastructure - connectivity, disk, node, whatever - ripped from under you with no warning whatsoever, you should not be running it on Amazon EC2.</strong></p>

<p>By providing EBS, Amazon sends the message that &#8220;you can treat this just like your own datacenter&#8221;. Just use EBS and you can treat it just like a SAN. Look, we have snapshots!</p>

<p>Hell, I get pissy when folks refer to instances as &#8220;boxes&#8221; and talk about them like they&#8217;re something they physically own. Stop trying to map physical datacenter analogies to AWS. It won&#8217;t work and you&#8217;ll be disappointed.</p>

<p>You want to know the real kicker? You should be designing like this ANYWAY. Yes, you have much greater control over failure points when you run everything yourself. You have much greater control over resource sharing and I/O profiles. That doesn&#8217;t remove the need to design for failure. How far you take it is up to you (and realistically your budget) but when you&#8217;re running on AWS, you need to be much more attentive to it.</p>

<h1>For the record</h1>

<p>I still think AWS and public clouds are awesome. I really do. I think private clouds are just as awesome. The flexibility they offer is almost unmatched but that flexibility comes at a price - performance hits, multiple layers of abstraction and other things.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Monitoring sucking just a little bit less]]></title>
    <link href="http://lusis.github.com/blog/2012/06/05/monitoring-sucking-just-a-little-bit-less/"/>
    <updated>2012-06-05T12:22:00-04:00</updated>
    <id>http://lusis.github.com/blog/2012/06/05/monitoring-sucking-just-a-little-bit-less</id>
    <content type="html"><![CDATA[<p>So <a href="http://blog.zenoss.com/2012/06/turning-monitoringsucks-into-monitoringsucksless">it&#8217;s come to my attention</a> that today is the &#8220;anniversary&#8221; of when I wrote my <a href="http://blog.lusis.org/blog/2011/06/05/why-monitoring-sucks/">first &#8220;#monitoringsucks&#8221; blog post</a>.</p>

<!--more-->


<p>So the question I found myself asking today is &#8220;Does monitoring suck less than it did a year ago?&#8221;. I&#8217;d have to be an idiot to say anything other than &#8220;yes&#8221;.</p>

<p>I didn&#8217;t set out to start a &#8220;movement&#8221;. I&#8217;m not someone who handles compliments very well. Mainly because I suffer from a bad case of <a href="http://kartar.net/2012/05/imposter-syndrome/">imposter syndrome</a>. The other side of this is that I know that I&#8217;ve done next to nothing to make it any better. Meanwhile the real heroes are people releasing code every day. People rethinking how we think about monitoring, trending, alerting and everything else. Companies like Etsy, Netflix, Yammer and countless more are sharing the deep squishy bits of how they do things and are releasing code to back it up.</p>

<h1>What&#8217;s gotten better?</h1>

<p>I think the biggest thing that&#8217;s gotten better is that people are really starting to leverage advances we&#8217;ve had in ancillary tooling in recent times. Not that any of these ideas are new (message busses existed long before RabbitMQ) but the barrier to entry is much lower.</p>

<p>I still feel like what drove this was a byproduct of configuration management uptake. As it became easier to stamp out servers, the rate of change in our infrastructure surpassed what tools were originally designed to deal with. As we started having more time to think about what we wanted to monitor (because we weren&#8217;t spending all our time building hand-crafted artisan machines), we started to feel the pain more. We wanted our monitoring systems to work as smoothly as the rest of the kit but they didn&#8217;t.</p>

<p>Combine that with:</p>

<ul>
<li>We now have data storage engines of varying complexity for storing time-series data. And with greater resolution and the ability to change that resolution.</li>
<li>We have tooling that can read that timeseries data and represent it in dynamic ways.</li>
<li>We revisited the idea of push vs. pull and leave/join of components in our infra.</li>
</ul>


<p>The world is a brighter place because of this and to everyone who had something to do with it - whether discussing on a mailing list, talking on IRC, tweeting, writing code or whatever - thank you.</p>

<h1>What&#8217;s coming down the pipe</h1>

<p>As a side effect of this monitoring thing, people ask my opinion a lot. Like I said, I&#8217;m still weirded out by this. Stepping back, though, and just geeking out on things, I see some really cool stuff in the future. Here&#8217;s just a subset of &#8216;stuff&#8217; I&#8217;ve been thinking about:</p>

<h2>Presence-based Discovery</h2>

<p>I couldn&#8217;t think of a better way to describe it but the idea is simply (or not so simply) that by virtue of coming online, a system is saying &#8220;I wish to be monitored in this way&#8221;. This is pretty dependent on configuration management for this to go smoothly, imho.</p>

<p>I mentioned this in a post on the devops toolchain but it goes something like this:</p>

<ul>
<li>new node comes online</li>
<li>new node registers its presence in some way (I&#8217;m kind of keen on the XMPP idea) with a notification of services it offers</li>
<li>centralized system is monitoring (har har) this presence system and starts monitoring the system based on predefined criteria at a &#8220;well-known&#8221; endpoint</li>
<li>optionally the system can dictate what it wants monitored</li>
</ul>


<p>Obviously this would all be very painful without some sort of configuration management system. However, it&#8217;s very easy for me in my base group or role for a system to say &#8220;Install <code>W</code>, register via <code>X</code>, listen on <code>Y</code> for active checks and publish everything to <code>Z</code> endpoint&#8221;. What <code>W</code>,<code>X</code>,<code>Y</code> and <code>Z</code> are is irrelevant. We can cookie cutter this stuff. FWIW this is nothing new. We&#8217;re just seeing &#8220;consumer-grade&#8221; options that are usable by everyone.</p>
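
<p>The join/register/monitor flow above can be sketched in a few lines. This is only an illustration, not any real tool: <code>PresenceRegistry</code>, the service names and the check names are all made up, and a real setup would receive the presence events over something like XMPP rather than direct method calls.</p>

<pre><code class="python">from dataclasses import dataclass, field


@dataclass
class PresenceRegistry:
    """Tracks which nodes are online and which checks each one should get."""

    # Hypothetical mapping of service name to the checks it implies.
    checks_by_service: dict
    monitored: dict = field(default_factory=dict)

    def node_joined(self, node, services, requested=()):
        """Handle a presence announcement from a node that just came online."""
        checks = []
        for service in services:
            checks.extend(self.checks_by_service.get(service, []))
        # Optionally the node dictates extra checks it wants.
        checks.extend(requested)
        # Drop duplicates but keep a stable order.
        self.monitored[node] = list(dict.fromkeys(checks))
        return self.monitored[node]

    def node_left(self, node):
        """Stop monitoring a node that announced its departure."""
        self.monitored.pop(node, None)


registry = PresenceRegistry({"riak": ["riak_ping", "disk_free"], "web": ["http_200"]})
print(registry.node_joined("node1", ["riak", "web"], requested=["disk_free", "queue_depth"]))
# ['riak_ping', 'disk_free', 'http_200', 'queue_depth']
</code></pre>

<p>The predefined criteria live in one place (<code>checks_by_service</code>), which is the part configuration management would stamp out for you.</p>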

<h2>Push vs Poll</h2>

<p>I&#8217;ve said many times that poll-based monitoring is dead. That&#8217;s a bit of hyperbole. What&#8217;s dead is the idea that we can only check <code>X</code> every <code>N</code> times over <code>K</code> period. This is a hold-over from inefficient polling mechanisms that would crumble under too-frequent polling as well as systems that weren&#8217;t able to handle being polled that often. I see polling moving from &#8220;check host <code>X</code> every 5 minutes for memory usage&#8221; to &#8220;watch this bus for memory usage stats and if there&#8217;s not anything in 2 minutes, make sure the world is okay&#8221;. We&#8217;ll always need the outside-in checks of things but that&#8217;s much less intensive than polling ALL the things.</p>
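
<p>Here&#8217;s a rough sketch of the &#8220;nothing in 2 minutes&#8221; idea. The class and host names are invented for illustration, and it assumes something else is consuming the bus and calling <code>record</code> for each message that arrives.</p>

<pre><code class="python">import time


class StalenessWatchdog:
    """Flags hosts whose pushed metrics have gone quiet."""

    def __init__(self, max_silence=120.0):
        # 120 seconds mirrors the "2 minutes" in the text above.
        self.max_silence = max_silence
        self.last_seen = {}

    def record(self, host, timestamp=None):
        """Call this whenever a metric from host arrives on the bus."""
        self.last_seen[host] = time.time() if timestamp is None else timestamp

    def silent_hosts(self, now=None):
        """Return hosts that have not reported within max_silence seconds."""
        now = time.time() if now is None else now
        return sorted(
            host for host, seen in self.last_seen.items()
            if now - seen > self.max_silence
        )


watchdog = StalenessWatchdog()
watchdog.record("web1", timestamp=1000.0)
watchdog.record("db1", timestamp=1100.0)
print(watchdog.silent_hosts(now=1150.0))  # ['web1']
</code></pre>

<p>Only the hosts that come back from <code>silent_hosts</code> need an active, outside-in check, which is a lot cheaper than polling everything on a fixed schedule.</p>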

<p>We&#8217;re so close with tools like Graphite now which can accept arbitrary metrics from anywhere with no need to preconfigure it to accept them. There are some concerns here around bad data being injected from unauthorized sources. As we automate more and make decisions based on this data, we need to be aware of it. Another discussion for another day though something akin to the way mcollective does trust is probably in order.</p>
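
<p>To show how low the barrier is: Graphite&#8217;s carbon listener takes one line of plain text per data point (metric path, value, Unix timestamp), by default on TCP port 2003. A minimal sketch follows; the host name and the metric path are placeholders I made up.</p>

<pre><code class="python">import socket
import time


def format_metric(path, value, timestamp=None):
    """Build one line of Graphite's plaintext protocol: path value timestamp."""
    timestamp = int(time.time()) if timestamp is None else int(timestamp)
    return "{} {} {}\n".format(path, value, timestamp)


def send_metric(line, host="graphite.example.com", port=2003):
    """Push one line to a carbon listener; the host name is a placeholder."""
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(line.encode("ascii"))


# Only formats the line; call send_metric(...) against a real carbon host.
print(format_metric("servers.web1.memory.used", 512, timestamp=1338913320), end="")
</code></pre>

<p>Note there is nothing resembling authentication in that exchange. Anything that can reach the port can write any metric name, which is exactly the bad-data concern above.</p>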

<h2>Self-service and Culture</h2>

<p>This is a big one too and I think will cut down on many complaints that people have around even something like Nagios.</p>

<p>We have to be able to say &#8220;You know what? We don&#8217;t need to monitor that. Let&#8217;s disable that check&#8221;. If something is unactionable, then why the hell are you alerting on it? This is where decoupling the trending/visualization from alerting can be so powerful. I&#8217;m currently rebuilding our monitoring setup to do most checks based on data in Graphite. Why? Because if I flip the relationship around, I&#8217;ve now got to deal with the alert question before I can even get the information. Instead of alerting on data and then storing it as an afterthought (perfdata anyone?) let&#8217;s start collecting the data, storing it and then alerting based on it.</p>

<p>This also provides for options around self-service. Not everyone needs to know about the disk space on nodeX. Only the people who can fix it do. Maybe your database folks want to get alerts on queries taking longer than N. As an operations person, you can&#8217;t do anything about that and certainly not at that moment (in most cases). You&#8217;re just going to push it down the line ANYWAY. And do you really want to have to deal with changing thresholds on behalf of someone?</p>

<p>I&#8217;m also a big fan of the idea that components in your infrastructure - applications, os, whatever - self-host a pubsub type of endpoint where users can get realtime information about the system themselves. I do this with every logstash install I set up where possible. Every remote logstash agent I&#8217;ve set up in enStratus also provides a pubsub 0mq socket that you can use to live tail log information from that host broken down in topic keys around metadata.</p>

<h2>Application health by default</h2>

<p>I&#8217;ve fawned over Coda Hale&#8217;s &#8220;Metrics&#8221; talk several times. I&#8217;m far from a fanboy but &#8220;Metrics&#8221; gets it right. This ties into self-service quite a bit. Developers need to be free to instrument the application without creating &#8220;yet another place to look&#8221;. Metrics does this so well with the idea of pushing instrumentation out of the application (oh look - push again!) and into Graphite or wherever is appropriate. And if you aren&#8217;t ready for push yet, you can still poll via JMX.</p>

<p>The idea has been ported to multiple other languages at this point. There&#8217;s no excuse not to deeply instrument your applications. If you have an application that CAN&#8217;T be instrumented properly, maybe you should consider a different application?</p>

<h2>Applying science and common sense</h2>

<p>The last thing that I see as being a step forward is we start applying science to our process. No more <code>-w 60,60,60 -c 75,75,75</code> canned thresholds. We start thinking about our thresholds. Maybe we do away with them entirely as static constructs. Instead we apply event processing and historical data to build thresholds that are intelligent.</p>
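
<p>As a very rough illustration of a threshold built from history instead of a canned number, here is a sketch that flags a sample sitting more than a few standard deviations above recent values. The three-sigma default and the sample data are arbitrary assumptions; real event processing would account for seasonality, trends and so on.</p>

<pre><code class="python">from statistics import mean, stdev


def dynamic_threshold(history, sigmas=3.0):
    """Derive an alert threshold from recent samples instead of a canned value."""
    if 2 > len(history):
        raise ValueError("need at least two samples")
    return mean(history) + sigmas * stdev(history)


def is_anomalous(history, current, sigmas=3.0):
    """True when the current sample is above the history-derived threshold."""
    return current > dynamic_threshold(history, sigmas)


load = [40, 42, 41, 39, 43]  # made-up recent samples of some metric
print(is_anomalous(load, 44), is_anomalous(load, 60))  # False True
</code></pre>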

<p>We start looking at the shape of our infra. Is it REALLY important that you get woken up at 3AM because a Riak node is down? Not if it&#8217;s just one but maybe if it&#8217;s two depending on your cluster size. Maybe both of those nodes were in the same rack. Okay that&#8217;s bad.</p>
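
<p>That kind of context can be written down as a rule. The sketch below is one possible reading of the Riak example, with made-up node and rack names and an arbitrary 25% cutoff: two failures in the same rack always page, otherwise it depends on how much of the cluster is gone.</p>

<pre><code class="python">from collections import Counter


def should_page(down_nodes, rack_of, cluster_size, max_down_fraction=0.25):
    """Decide whether a set of down nodes is worth waking someone up for."""
    if not down_nodes:
        return False
    # Two or more failures in one rack hints at a shared cause.
    racks = Counter(rack_of[node] for node in down_nodes)
    if max(racks.values()) >= 2:
        return True
    # Otherwise page only when a meaningful share of the cluster is gone.
    return len(down_nodes) / cluster_size > max_down_fraction


racks = {"riak1": "rack-a", "riak2": "rack-a", "riak3": "rack-b"}
print(should_page(["riak1"], racks, cluster_size=5))            # False
print(should_page(["riak1", "riak3"], racks, cluster_size=5))   # True
print(should_page(["riak1", "riak3"], racks, cluster_size=12))  # False
print(should_page(["riak1", "riak2"], racks, cluster_size=12))  # True
</code></pre>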

<p>We start to consider context and start applying science! We step out of the shamanistic ages of monitoring (who the hell still sets swap space to double physical memory and is 75% of a 1TB volume in use really something that can&#8217;t wait?) where &#8220;We&#8217;ve always done it this way&#8221;. Start thinking like the Apple 1984 commercial, whip out your hammer and smash your preexisting notions around what constitutes an alert.</p>

<h1>What I&#8217;m building</h1>

<p>Right now I&#8217;m spending most of my time being pragmatic. I&#8217;m still adding new checks to Nagios but the information is coming from different sources now. I&#8217;m dumping data into graphite via logstash and doing checks on that. Collectd is now pushing directly to graphite as well. Nagios is becoming less and less of a factor. When I finally strip it down to its bare essentials, I&#8217;ll have a better idea what gap needs to be filled. It&#8217;s starting to look like riemann at that point.</p>

<p>I still want to tackle this presence based idea in some form. Even if presence is just a signal to run chef-client. At my previous company, we used Noah for this. I&#8217;ve not yet had time to decide if Noah is the right fit here.</p>

<h1>What others are building</h1>

<p>What&#8217;s more important is what others are building. There are too many to list but I&#8217;m going to give it a shot off the top of my head. Please don&#8217;t take offense if your project isn&#8217;t here.</p>

<ul>
<li>Sensu</li>
<li>Graphite-tattle</li>
<li>Cepmon</li>
<li>Logstash</li>
<li>Librato</li>
<li>PagerDuty</li>
<li>Umpire</li>
<li>alerting-controller</li>
<li>Graphite</li>
<li>Statsd</li>
<li>Logster</li>
<li>Metrics</li>
<li>Icinga (yes, they&#8217;re starting to diverge from Nagios)</li>
<li>ZeroMQ</li>
<li>Chef</li>
<li>Puppet</li>
<li>Zookeeper</li>
<li>Riemann</li>
<li>OpenTSDB</li>
<li>Ganglia</li>
<li>TempoDB</li>
<li>CollectD</li>
<li>Datadog</li>
<li>Folsom</li>
<li>JMXTrans</li>
<li>Pencil</li>
<li>Rocksteady</li>
<li>Boundary</li>
<li>Circonus</li>
<li>GDash</li>
</ul>


<p>and so many more.</p>

<p>Probably the best bet is to head over to <a href="https://github.com/monitoringsucks/tool-repos">the monitoringsucks tool repo on github</a>. I can&#8217;t do the awesomeness of what people are doing justice here.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Configuration Drift and Next-gen CM]]></title>
    <link href="http://lusis.github.com/blog/2012/05/24/configuration-drift-and-next-gen-cm/"/>
    <updated>2012-05-24T01:10:00-04:00</updated>
    <id>http://lusis.github.com/blog/2012/05/24/configuration-drift-and-next-gen-cm</id>
    <content type="html"><![CDATA[<p>It always starts with a tweet. However it normally doesn&#8217;t start with a tweet from <a href="https://twitter.com/moonpolysoft">Cliff Moon</a>.</p>

<blockquote><p>Of all the problems to fix in chef or puppet, the diffusion and drift of state that occurs in idiomatic usage seems highest priority.</p></blockquote>




<!-- more -->


<p>Now for sure what spawned this comment was something unrelated but it got me thinking. Oddly enough <a href="https://twitter.com/dysinger">Tim Dysinger</a> was either poking around in my head or just had the same idea:</p>

<blockquote><p>Devops tools should move towards an active assertion of state (instead of passive/polling). This the next-level.</p></blockquote>


<p>Tim and I hooked up via Skype and bantered about this stuff back and forth. We were on the same wavelength. That&#8217;s pretty cool because Tim is pretty fucking smart (and he was able to explain Maybe Monads to me over dinner).</p>

<h1>My thoughts on the subject</h1>

<p>What follows is something of a brain dump on what both Cliff and Tim had to say. However I&#8217;m going to be scoping in the context of security because</p>

<ul>
<li><a href="https://twitter.com/beaker">Beaker</a> gave a <a href="http://www.rationalsurvivability.com/presentations/SMCES-Gluecon2012.pdf">presentation at Gluecon today</a> (<em>warning! bigass pdf</em>)</li>
<li>I work with <a href="https://twitter.com/mortman">David Mortman</a> who is one of the folks I say &#8220;gets it&#8221; w.r.t configuration management and security</li>
<li>Security was the FIRST context that came to mind</li>
<li>I was lucky enough to be involved in a conversation with <a href="https://twitter.com/markburgess_osl">Mark Burgess</a>, <a href="https://twitter.com/cjeffblaine">Jeff Blaine</a> and <a href="https://twitter.com/filler">Nick Silkey</a> about a very similar topic where Mark said</li>
</ul>


<blockquote><p>Good point. I wonder why folks often tear down a perfectly good machine and rebuild it instead of fixing what is broken.</p></blockquote>


<p>What I&#8217;m going to say isn&#8217;t new to anyone and smarter folks than I are already working on this (I&#8217;m sure) but this is the Internet. I get to babble too!</p>

<h2>On Drift</h2>

<p>So what exactly <em>IS</em> the problem here? What&#8217;s configuration drift and how the hell does it even happen?</p>

<p>The problem here is that, as Tim said, configuration management systems aren&#8217;t assertive enough. Look at how a typical CM client run behaves:</p>

<ul>
<li>Hey guys, cm is running</li>
<li>Oh look, this file doesn&#8217;t look like it&#8217;s supposed to</li>
<li>/me changes file</li>
<li>File looks good</li>
<li>Hey guys, cm isn&#8217;t running anymore</li>
</ul>


<p>That last line is part of the problem. I&#8217;ve talked about my Noah project to largish groups of folks (both Puppet and Chef users) a few times now and the answer to the question of &#8220;Do you leave puppet (or chef) running in the background?&#8221; has always been &#8220;No&#8221;. There are plenty of valid reasons for this but this is what I would consider the primary cause of drift at the node level. Now maybe this isn&#8217;t EXACTLY what Cliff was talking about. I&#8217;m not quite on his level so I sometimes misinterpret but when I heard &#8216;drift of state that occurs in idiomatic usage&#8217;, this was what came to mind.</p>

<p>The thing is that these tools are designed to verify state of a resource at the point they run:</p>

<ul>
<li>Does this file look right? No! Fix it.</li>
<li>Is this service running? Yes! Cool.</li>
</ul>


<p>And then they go away. They don&#8217;t manage the state of those resources until they next inspect them. The act of managing those resources is not in response to those resources changing but in response to a user ASKING them to be checked. I would even wager that when a user runs <code>chef-client</code> the first thing on her mind isn&#8217;t &#8220;I sure hope chef fixes my sudoers file&#8221; but &#8220;I need chef to update my Nagios configs again&#8221;. The incorrect state of the sudoers file isn&#8217;t really even thought about. That&#8217;s because we shove that stuff into some &#8220;base&#8221; role or group. Something that&#8217;s applied to all nodes in our infrastructure. We don&#8217;t think of a node as being a &#8220;managed sudoers&#8221; node. We think of it as a &#8220;web server&#8221;.</p>

<p>Additionally, because we aren&#8217;t in a constant state of verification about these resources, we may have drift that occurs across nodes of different types whilst they share a common base block. Sure I just ran my CM tool to update my Nagios server but what about my web servers? I don&#8217;t want to run it there because I <strong>KNOW</strong> nothing has changed in the web server role.</p>

<p>To me, this is the &#8220;idiomatic&#8221; usage Cliff spoke about. The tools encourage us to think in terms of composition and reusable patterns but the final function of the node is the way we classify it. Mind you, the answer here is really to run your CM tool in the background but that still doesn&#8217;t take us to the next level. We&#8217;re still exposed to drift even if it&#8217;s for a short period of time. What&#8217;s worse is these tools operate by default with a splay value. This actually makes the drift exposure even worse as you can&#8217;t even guarantee that it will run at the interval specified.</p>
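
<p>To put a number on that exposure, here is a back-of-the-envelope sketch. It assumes a client that sleeps for a fixed interval plus a random delay of up to the splay value before each run. That is an assumption about the scheduling model, the exact semantics differ between tools, and the time the run itself takes is ignored.</p>

<pre><code class="python">import random


def next_run_delay(interval, splay, rng=random):
    """Seconds until the next run: the fixed interval plus a random splay."""
    return interval + rng.uniform(0, splay)


def worst_case_exposure(interval, splay):
    """Longest time a drifted resource can sit unfixed between two runs."""
    return interval + splay


# 30-minute interval with a 5-minute splay (illustrative numbers).
print(worst_case_exposure(1800, 300))  # 2100
</code></pre>

<p>So a resource that drifts right after a run can stay wrong for up to 35 minutes in this example, and you can&#8217;t say in advance where in the 30 to 35 minute range the next run will land.</p>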

<p>I first heard about CFEngine when Mark talked about &#8220;Anomaly Detection&#8221; at LISA &#8216;04 in Atlanta. My mind was blown but I could never get past the idea that I couldn&#8217;t dictate state immediately. The idea (partially a naive understanding on my part) that systems would not become X when I said &#8220;Become X&#8221; bothered me. The idea that systems have a personality that needs to be respected bothered me.</p>

<p>The point here is that when I want a system to look like X, I want it to look like X right then. I want it to STAY looking like X and I don&#8217;t want it to try and account for localized variations. I might feel differently if I were managing a network of servers that were essentially treated like desktops.</p>

<h2>But does drift really matter?</h2>

<p>Yes and no. If you&#8217;re living the cloud life, probably not. The reason is that resources tend to have a short shelf life. If I&#8217;m autoscaling via &#8216;the cloud&#8217; to meet capacity demands then it&#8217;s highly likely that those systems won&#8217;t be around long enough to drift that far. In the land of configuration management, drift is largely time driven. The longer systems stay around, the greater the chance for drift.</p>

<p>However if you&#8217;re running physical hardware that you don&#8217;t tear down regularly, then drift is likely to become more pronounced. Interestingly enough there&#8217;s a psychological factor at play here. Systems become like pets instead of cattle. We become attached to them. &#8220;Oh that&#8217;s just db1 acting up again. You know it&#8217;s like the oldest one in the fleet&#8221;. People start forgetting that the system will fail (<em>Everything fails. Embrace failure!</em>). They start storing one-off scripts on there. Maybe it&#8217;s core kit and while it&#8217;s managed with Puppet, it&#8217;s not frequently touched.</p>

<p>Here&#8217;s another interesting point. As time progresses and modules, cookbooks, recipes, bundles, promises, whatever, are touched less frequently, the confidence in them goes down. I&#8217;ve frequently found myself saying &#8220;Shit&#8230;I wrote that code like&#8230;I don&#8217;t even fucking remember when. I have no idea what would happen if I ran it now.&#8221; This uncertainty eventually leads to me MANUALLY COMPARING the current state of resources that would be modified with the versions that would be generated. I&#8217;ve even copied comment blocks wholesale from one-off changes I&#8217;ve made into ERB templates just to ensure that a service restart didn&#8217;t happen.</p>

<p>How fucked is that? Pretty fucked, Alex.</p>

<p>This partially leads into a quote from <a href="https://twitter.com/allspaw">John Allspaw</a> on the dangers of OVER automation:</p>

<blockquote><p>Some people, when confronted with a problem, think &#8220;I&#8217;ll use more automation!&#8221; Now they have Three Mile Island problems.</p></blockquote>


<p>and is even discussed in <a href="https://twitter.com/mcdonnps">Patrick McDonnell&#8217;s</a> talk at <a href="http://www.youtube.com/watch?v=nSnJCJiZDDU">ChefConf</a>.</p>

<p>I don&#8217;t agree with John 100% on his take but I can totally understand his perspective. Maybe I&#8217;m just more optimistic around exactly how far we can automate.</p>

<h2>So where does security fit into this?</h2>

<p>Here&#8217;s yet another quote from Tim Dysinger on this topic:</p>

<blockquote><p>If a super-user logs in and changes the sshd_config, you could have the service change it back before he even exited vi.  they&#8217;d then find a warning email in their in-box.  if they tried it again, even on another box, it could send a warning to the team lead and lock the user out.</p></blockquote>


<p>Mind you security is only one aspect of this thought process. The thing that makes it applicable is that the security domain has already tackled this problem a bit. The problem is it still requires human response. We have tools like Tripwire, Samhain and OSSEC that do active inspection of state but the response is left up to a human. Additionally they&#8217;re cumbersome to configure (even with CM). What&#8217;s missing is the &#8216;glue&#8217; between the two problem spaces.</p>

<p>In my head I envisioned something much like Tim described. In fact I even thought about a way that existing tools could be leveraged. It&#8217;s not pretty but it&#8217;s possible. The idea here, if it wasn&#8217;t already blindingly obvious, is that the response to an event that the security tool recognizes should be to run configuration management to correct the errant state.</p>
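
<p>Here&#8217;s a very rough Ruby sketch of just the &#8220;notice the errant state and put it back&#8221; half of that idea. It is not how Tripwire, OSSEC or any CM tool actually works. The <code>drifted?</code> and <code>enforce_state</code> helpers are hypothetical names, the file lives in a temp directory, and the call that would normally be triggered by the security tool&#8217;s event is simply invoked by hand:</p>

<pre><code class='ruby'>require 'digest'
require 'tmpdir'

# Hypothetical helper: true when the file is missing or its digest
# differs from the digest of the content we want it to have.
def drifted?(path, desired)
  return true unless File.exist?(path)
  Digest::SHA256.file(path).hexdigest != Digest::SHA256.hexdigest(desired)
end

# Hypothetical helper: rewrite the file only when it has drifted.
# Returns true when a correction was made, false when nothing changed.
def enforce_state(path, desired)
  return false unless drifted?(path, desired)
  File.write(path, desired)
  true
end

Dir.mktmpdir do |dir|
  path = File.join(dir, 'sshd_config')
  desired = "PermitRootLogin no\n"

  enforce_state(path, desired)                # initial converge
  File.write(path, "PermitRootLogin yes\n")   # someone edits it by hand
  corrected = enforce_state(path, desired)    # the "event" fires
  puts "drift corrected: #{corrected}"
end
</code></pre>

<p>A real version would react to the inspection tool&#8217;s event instead of being called directly, and would hand the correction to the CM tool rather than writing the file itself.</p>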

<p>There are a few problems with this approach that should also be immediately obvious:</p>

<ul>
<li>CM is currently an all or nothing approach. Something like the idea behind &#8216;partial run lists&#8217; in Chef could sort of address this</li>
<li>If we start down the path of partial CM, we now have to take into account dependencies.</li>
<li>We have no way to express this. We lack primitives with which to build this logic</li>
<li>Security is still somewhat in the dark ages. Many security decisions are still binary in nature</li>
</ul>


<p>What&#8217;s also missing in this picture is something to identify patterns. So now we&#8217;re bolting three tools together - the tripwire component, our CM tool and some sort of CEP. But again, even if we had this wondertool nirvana, how do we express it?</p>

<p>Let me be clear that I firmly believe that configuration management is absolutely a part of the security story. I stand by my assertion that consistent and repeatable configuration of a system from base OS to in-service is the foundation. Being able to express in code what state a system should have means that you never have to think &#8220;Did I forget to disable that apache module?&#8221; or &#8220;Did I make sure to disable root logins over SSH?&#8221;. Where we find gaps in this is how we assert the negative without going insane.</p>

<p>Denied unless explicitly allowed is the mantra I followed for years when I was responsible for security. Ask me some time WHY I got out of security and why I don&#8217;t have ulcers anymore.</p>

<p>The questions I find myself asking are:</p>

<ul>
<li>If we use our CM system as the source of truth, can we sanely infer policy based on our CM codebase?</li>
<li>If we use the network as the source of truth, can we sanely infer policy based on our neighbors? Should we even trust our neighbors?</li>
<li>Do we even have the language to express what we mean and is it flexible and primitive enough to be used in composition? You can only go so far with &#8220;trusted&#8221;, &#8220;untrusted&#8221;, &#8220;mauve&#8221; and &#8220;taupe&#8221;.</li>
</ul>


<h1>Wrap up</h1>

<p>As I said, the original discussion was around configuration drift. I realize I went off on a tangent about security but that was intended as an example. I do believe that folks are working on this idea of &#8220;active assertion of state&#8221; as Tim puts it. I just wanted to brain dump my take on it. I don&#8217;t know that it can be solved with even the most flexible of CM tools. I do feel like it&#8217;s going to have to be a new generation of tool that takes these ideas into account and includes them in a ground-up design.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[It sucks to be right]]></title>
    <link href="http://lusis.github.com/blog/2012/03/20/it-sucks-to-be-right/"/>
    <updated>2012-03-20T09:00:00-04:00</updated>
    <id>http://lusis.github.com/blog/2012/03/20/it-sucks-to-be-right</id>
    <content type="html"><![CDATA[<p>So it looks like Adrian Cockcroft finally spilled the beans on <a href="http://perfcap.blogspot.com/2012/03/ops-devops-and-noops-at-netflix.html">Netflix (no)Operations</a> and sadly it reads like I expected.</p>

<!-- more -->


<h1>Netflix still does operations</h1>

<p>Regardless of what words Adrian uses, Netflix still does operations. <a href="http://twitter.com/allspaw">John Allspaw</a> summed it up pretty well in this tweet:</p>

<p><img src="http://i.imgur.com/OW0kh.png" alt="Imgur" /></p>

<p>and here are the things he mentions:</p>

<ul>
<li>Metrics collection</li>
<li>PaaS/IaaS evaluation/investigation</li>
<li>Automation (auto-build, auto-recovery)</li>
<li>Fault tolerance</li>
<li>Availability</li>
<li>Monitoring</li>
<li>Performance</li>
<li>Capex and Opex forecasting</li>
<li>Outage response</li>
</ul>


<h1>So what does Adrian get wrong?</h1>

<p>These are just a few things that jumped out at me (and annoyed me):</p>

<blockquote><p>However, there are teams at Netflix that do traditional Operations, and teams that do DevOps as well.</p></blockquote>


<p>Ops is ops is ops. No matter what you call it, Operations is operations.</p>

<blockquote><p>Notice that we didn&#8217;t use the typical DevOps tools Puppet or Chef to create builds at runtime</p></blockquote>


<p>There&#8217;s no such thing as a &#8220;DevOps tool&#8221;. People were using CFEngine, Puppet and Chef long before DevOps was even a term. These are configuration management tools. In fact Adrian has even said they use Puppet in their legacy datacenter:</p>

<p><img src="http://i.imgur.com/RJIX1.png" alt="Imgur" /></p>

<p>yet he seems to make the distinction between the ops guys there and the &#8220;devops&#8221; guys (whatever those are).</p>

<blockquote><p>There is no ops organization involved in running our cloud&#8230;</p></blockquote>


<p>Just because you outsourced it, doesn&#8217;t mean it doesn&#8217;t exist. Oh and it&#8217;s not your cloud. It&#8217;s Amazon&#8217;s.</p>

<h1>Reading between the lines</h1>

<p>Actually this doesn&#8217;t take much reading between the lines. It&#8217;s out there in plain sight:</p>

<blockquote><p>In reality we had the usual complaints about how long it took to get new capacity, the lack of consistency across supposedly identical systems, and failures in Oracle, in the SAN and the networks, that took the site down too often for too long.</p></blockquote>




<blockquote><p>We tried bringing in new ops managers, and new engineers, but they were always overwhelmed by the fire fighting needed to keep the current systems running.</p></blockquote>




<blockquote><p>This is largely because the people making decisions are development managers, who have been burned repeatedly by configuration bugs in systems that were supposed to be identical.</p></blockquote>




<blockquote><p>The developers used to spend hours a week in meetings with Ops discussing what they needed, figuring out capacity forecasts and writing tickets to request changes for the datacenter.</p></blockquote>




<blockquote><p>There is no ops organization involved in running our cloud, no need for the developers to interact with ops people to get things done, and less time spent actually doing ops tasks than developers would spend explaining what needed to be done to someone else.</p></blockquote>


<p>I&#8217;m glad to see this spelled out in such detail. This is what I&#8217;ve been telling people semi-privately for a while now. Because Netflix had such a terrible experience with its operations team, they went to the opposite extreme and disintermediated them.</p>

<p>Imagine you were scared as a kid by a clown. Now imagine you have kids of your own. You hate clowns. You had a bad experience with clowns. But it&#8217;s your kid&#8217;s birthday party so here you are making balloon animals, telling jokes and doing silly things to entertain the kids.</p>

<p>Just because you aren&#8217;t wearing makeup doesn&#8217;t make you any less of a clown. You&#8217;re doing clown shit. Through the eyes of the kids, you&#8217;re a clown. Deal with it.</p>

<p>Netflix is still doing operations. What should be telling and frightening to operations teams everywhere is this:</p>

<p>The Netflix response to poorly run operations that can&#8217;t service the business is going to become the norm and not the exception. Evolve or die.</p>

<p>Please note that I don&#8217;t lay all the blame on the Netflix operations team. I would love to hear the flipside of this story from someone who was there originally when the streaming initiative started. It would probably be full of stories we&#8217;ve heard before - no resources, misalignment of incentives and a whole host of others.</p>

<p>Adrian, thank you for writing the blog post. I hope it serves as a warning to those who come. Hopefully someday you&#8217;ll be able to see a clown again and not get scared ;)</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Why you should stop fighting distro vendors]]></title>
    <link href="http://lusis.github.com/blog/2012/03/16/why-you-should-stop-fighting-distro-vendors/"/>
    <updated>2012-03-16T14:37:00-04:00</updated>
    <id>http://lusis.github.com/blog/2012/03/16/why-you-should-stop-fighting-distro-vendors</id>
    <content type="html"><![CDATA[<p>Recently I saw a tweet from <a href="https://twitter.com/#!/kohsukekawa/status/180717301795008512">Kohsuke Kawaguchi</a> that really got me frustrated.</p>

<!-- more -->


<p>I&#8217;ve addressed this topic a bit before <a href="http://lusislog.blogspot.com/2010/09/distributions-and-dynamic-languages.html">here</a>. At the time it was addressing specifically dynamic languages. However the post that Kohsuke wrote (and the post that inspired it) have led me to a new attitude.</p>

<p><strong>Don&#8217;t bother trying to get your packages into upstream vendor distros</strong></p>

<h1>Wait. What? Let&#8217;s step back a sec</h1>

<p>Let me clarify something first. System packages are a good thing. The hassle has always been with BUILDING those packages. It was simply easier to build the software on the machine and install to <code>/usr/local/</code> than to try and express anything more than the most moderately simple application in RPM or DEB build scripts:</p>

<ul>
<li>If what you are packaging has dependencies not shipped with the OS, now you&#8217;ve got to package those</li>
<li>If your dependency conflicts with a vendor-shipped version, you&#8217;re screwed.</li>
<li>If your dependency is a language runtime, give up.</li>
<li>If your dependency is a specific version of python, just go into another line of work.</li>
<li>If it&#8217;s a distro LTS release, just don&#8217;t bother</li>
</ul>


<h1>Ahh but we can work around this!</h1>

<p>Yes, you&#8217;re right. We now have tools like <a href="https://github.com/jordansissel/fpm">fpm</a> that take the pain out of it! Maven has had plugins that generate rpms and debs for you for a while now. Things are looking up! Let&#8217;s just use those tools.</p>

<p>So now you think, I&#8217;ll just get these things submitted to Debian&#8230;.</p>

<p><strong>KABLOCK</strong></p>

<p>I could rant a bit about Debian&#8217;s packaging policy but it&#8217;s addressed in the posts above. So maybe the Fedora people are more flexible?</p>

<p><img src="http://i.imgur.com/px5ug.png" alt="Imgur" /></p>

<p><strong>WAT</strong></p>

<p>So here we have the two major distros that won&#8217;t even consider your package unless you give the end-user the &#8220;freedom&#8221; to make your application unusable. Essentially you are told if you want your package to be included in upstream then you have to make sure they can swap out <code>libfunkytown.so.23</code> with <code>libfunkytown.so.1</code>.</p>

<p>But maybe your application doesn&#8217;t work on that version. So maybe you think, I&#8217;ll just vendor ALL the things and shove it into <code>/opt</code> or <code>/usr/local</code>? Yeah that doesn&#8217;t fly either (for various reasons).</p>

<p>The point is that you&#8217;ll probably never be able to get your package included upstream because you&#8217;ll never be able to jump through the hoops to do it.</p>

<h1>So stop trying</h1>

<p>I know, I know. It would be awesome if you could tell users to just <code>yum install kickass</code> or <code>apt-get install kickass</code> but it&#8217;s not worth it for several reasons as enumerated above.</p>

<p>Distributions are not your friend. One could argue that it&#8217;s not their job to be your friend. I would even agree with that argument. The distros have (or at least SHOULD have) an allegiance to their user base. My argument is that position is directly opposed to your needs as a software provider.</p>

<h2>Things you should not do</h2>

<ul>
<li>Waste your time trying to ensure that your software works on some busted-ass old version of libfunkytown that won&#8217;t get upgraded for 7 years.</li>
<li>Waste your time breaking your application into 436 interdependent subpackages just to please upstream.</li>
<li>Ignore the preexisting dependency management ecosystem of your language of choice (especially if it works).</li>
</ul>


<h2>Things you should do</h2>

<ul>
<li>Use your language&#8217;s preexisting dependency management system to collect all your dependencies</li>
<li>Rebar, bundle, virtualenv, mavenize, fatjar whatever ALL the dependencies</li>
<li>Use FPM or some homegrown script to create a monolithic rpm or deb of your codebase that installs to <code>/opt/appname</code></li>
<li>Make these packages available to your users on your download site</li>
<li>Alternately, create a repo and repo config file they can use to stay up to date</li>
</ul>
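
<p>As a hedged illustration of the FPM step, here&#8217;s a small Ruby sketch that assembles an <code>fpm</code> command line for a directory of already-vendored code. The <code>fpm_command</code> helper, the <code>kickass</code> name, the version and the <code>./build</code> directory are all hypothetical. The flags are standard fpm options (<code>-s</code> source type, <code>-t</code> target type, <code>-n</code> name, <code>-v</code> version, <code>--prefix</code> install path, <code>-C</code> directory to package from):</p>

<pre><code class='ruby'>require 'shellwords'

# Hypothetical helper: build the argument list for packaging
# everything under build_dir so that it installs to /opt/NAME.
def fpm_command(name, version, build_dir, type)
  [
    'fpm',
    '-s', 'dir',                 # source is a plain directory
    '-t', type,                  # 'deb' or 'rpm'
    '-n', name,
    '-v', version,
    '--prefix', "/opt/#{name}",  # where the files land on the target
    '-C', build_dir,             # change into the build directory first
    '.'
  ]
end

cmd = fpm_command('kickass', '1.0.0', './build', 'deb')
puts Shellwords.join(cmd)
# To actually build the package you would run: system(*cmd)
</code></pre>

<p>Swap <code>'deb'</code> for <code>'rpm'</code> and you get the other package from the same directory, which is most of the appeal.</p>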


<p>You will be happy. Your users will be happy. The distros can go lick themselves. We have reached something of a crossroads. As I argued in the previous post, the concept of a distribution is becoming somewhat irrelevant. Distros are more concerned about politics and making statements and broken concepts like software that doesn&#8217;t need upgrading for 7 years (or even 2 years) than providing a framework and ecosystem that encourages developers to target software at it.</p>

<p>If someone takes up the noble cause of trying to get your software included upstream, I would go so far as to make it plainly clear on whatever communication you have that you simply cannot support an unofficial repackaging of your software. Be polite. These are still your potential userbase. Simply state that those were not created by you and that the official packages are here.</p>

<h1>A case in point</h1>

<p>What I&#8217;m suggesting you do is not unheard of and honestly is the most tenable long term path for your users. Look at projects like Vagrant, Chef and Puppet among others. All of these tools are &#8220;owning their availability&#8221; the right way and are arguably providing better end user experiences than getting included in upstream could provide. In fact the experience of official packaging is above and beyond trying to do it yourself. As it should be.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Graphs in operations]]></title>
    <link href="http://lusis.github.com/blog/2012/03/06/graphs-in-operations/"/>
    <updated>2012-03-06T23:59:00-05:00</updated>
    <id>http://lusis.github.com/blog/2012/03/06/graphs-in-operations</id>
    <content type="html"><![CDATA[<p>So anyone who knows me knows I spend an inordinate amount of time bitching about Maven. I don&#8217;t know if it&#8217;s the type of companies I end up working for or what but I always seem to find myself ass-deep in Maven.</p>

<!-- more -->


<p><em>please note that I&#8217;m drifting into deeply unfamiliar territory for me. Someone once told me the best way to learn about something is to write about it. Keep that in mind when making comments?</em></p>

<p>One of the more interesting parts of Maven is the dependency graph and concepts like transitive and (god forbid) circular dependencies. These problems aren&#8217;t exclusive to Java, mind you. See bundler for Ruby.</p>

<h2>A bit on graphs</h2>

<p>Graph is a fairly overloaded term. In the context of this discussion I&#8217;m talking about graph theory (insofar as I can grok it). Specifically I want to talk about it in the context of IT operations.</p>

<p>Graphs are nothing &#8220;new&#8221;. Programmers have binary trees. Network geeks have OSPF. Puppet and Git are fans of the DAG (directed acyclic graph). These are all rooted in the same place no? You have nodes and edges. It&#8217;s math all the way down. Unfortunately I suck at math.</p>

<p>If the topic interests you at all, Wikipedia has a couple of good articles worth reading. Seeing as I&#8217;m far from a domain expert, I can&#8217;t vouch for the quality:</p>

<ul>
<li><a href="http://en.wikipedia.org/wiki/Graph_theory">Graph Theory</a></li>
<li><a href="http://en.wikipedia.org/wiki/Graph_(mathematics)">Graph (mathematics)</a></li>
<li><a href="http://en.wikipedia.org/wiki/Glossary_of_graph_theory">Glossary of graph theory</a></li>
</ul>


<h1>How can graphs apply to IT operations</h1>

<p>I&#8217;ve said for a while now that I feel like there&#8217;s something fuzzy on the horizon that I can&#8217;t quite make out and it involves orchestration and graphs. I&#8217;m still not clear on how to express it but I&#8217;ll try.</p>

<p>Anyone who has ever used Puppet or Git has dabbled in graphs even if they don&#8217;t know it. However my interest in graphs in operations relates to the infrastructure as a whole. James Turnbull expressed it very well last year in Mt. View when discussing orchestration. Obviously this is a topic near and dear to my heart.</p>

<p>Right now much of orchestration is in the embryonic stages. We define relationships manually. We register watches on znodes. We define hard links between components in a stack. X depends on Y depends on Z. We&#8217;re not really being smart about it. If someone disagrees, I would LOVE to see a tool addressing the space.</p>

<p>Justin Sheehy did an awesome high level presentation on distributed systems, databases and the like at Velocity last year. While the talk was good, one thing that stuck out with me was his usage of the Riak logo:</p>

<p><img src="https://assets.github.com/img/b4d183fe3181209da593ed5c6bf0f4c805ab2a62/687474703a2f2f6769746875622d696d616765732e73332e616d617a6f6e6177732e636f6d2f626c6f672f323031302f7269616b2d6c6f676f2e706e67" alt="Riak Logo" /></p>

<p>During the presentation he would zoom out of the logo and replace it with the same logo. It expressed the idea of moving up the stack. Macro versus micro. I have the same feeling about where orchestration is going.</p>

<h2>Express yourself</h2>

<p>Currently we do a great job (and have the tools) to express relationships and dependencies at the node level:</p>

<ul>
<li>webapp needs container</li>
<li>container needs java</li>
<li>container needs system user</li>
</ul>
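
<p>Those three statements already form a tiny DAG. As a minimal sketch (the <code>Dependencies</code> class is a hypothetical wrapper, not how Puppet or Chef do it internally), Ruby&#8217;s standard library <code>TSort</code> module can turn them into an order where everything a thing needs comes before the thing itself:</p>

<pre><code class='ruby'>require 'tsort'

# Hypothetical wrapper: a Hash mapping each item to the items it needs,
# made sortable by implementing the two methods TSort asks for.
class Dependencies
  include TSort

  def initialize(needs)
    @needs = needs
  end

  # Every item we know about.
  def tsort_each_node
    @needs.each_key { |node| yield node }
  end

  # The items a given item needs (none if it was never declared).
  def tsort_each_child(node)
    @needs.fetch(node, []).each { |child| yield child }
  end
end

needs = {
  webapp: [:container],
  container: [:java, :system_user],
  java: [],
  system_user: []
}

order = Dependencies.new(needs).tsort
puts order.inspect
</code></pre>

<p>A circular dependency makes <code>tsort</code> raise <code>TSort::Cyclic</code>, which is about as polite as that situation gets.</p>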


<p>Going a level higher, we even have some limited ability to express relationships between nodes:</p>

<ul>
<li>Load balancer needs app servers</li>
<li>App server needs database</li>
</ul>


<p>We&#8217;re not quite as good at this part yet but people have workarounds. I use Noah for this. enStratus also handles this very well.</p>

<p>But we&#8217;re still defining those relationships manually.</p>

<p>When we get to this next level up, things get REALLY fuzzy. As people start to (re)discover SOA, we now have stacks that have dependencies on other stacks. Currently we use tools like Zookeeper to broker that relationship. But we still have to explicitly manage it.</p>

<p>The level of coupling here isn&#8217;t the problem. You can mitigate failure in one stack as it relates to another stack. Fail fast and fall back to sane/safe defaults. Read any article about how Netflix architects to get an idea.</p>

<h1>What&#8217;s missing?</h1>

<p>What I feel like we&#8217;re missing is a way to express those relationships and then trigger on them all the way up and down the chain as needed. We&#8217;re starting to get into graph territory here.</p>

<p>We must be able to express and act on changes at the micro level (<em>I changed a config, I must restart nginx</em>) and even at the intranode level (<em>something changed in my app tier, need to tell my load balancer</em>) but now we need a way to handle it at that macro level. Not only do we need a way to handle it but we must also be able to calculate what is impacted by that change.</p>

<ul>
<li>If I have this internode change, does it affect the intranode relationship?</li>
<li>If I have an intranode change, does it affect the intrastack relationship?</li>
</ul>


<p>It seems to me that a graph of SOME kind is the best way to express this. I just can&#8217;t quite make it out. Does current graph technology even handle that subgraph relationship? Excuse the pun but where do we draw the line? Are there multiple lines?</p>
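
<p>To make the &#8220;calculate what is impacted&#8221; part a little less fuzzy, here&#8217;s a naive Ruby sketch. It assumes every relationship, from a config file up to a whole stack, lives in one flat dependency map. The topology, the names and the <code>impacted_by</code> helper are all invented for illustration, and it sidesteps the subgraph question entirely:</p>

<pre><code class='ruby'>require 'set'

# Hypothetical topology: each key depends on the things in its list.
DEPENDS_ON = {
  nginx: [:nginx_config],
  load_balancer: [:nginx, :app_tier],
  app_tier: [:database],
  storefront_stack: [:load_balancer],
  billing_stack: [:database]
}.freeze

# Walk the graph backwards (breadth first) from the thing that changed
# and collect everything that directly or indirectly depends on it.
def impacted_by(changed, depends_on)
  dependents = Hash.new { |hash, key| hash[key] = [] }
  depends_on.each do |node, needs|
    needs.each { |need| dependents[need].push(node) }
  end

  seen = Set.new
  queue = [changed]
  until queue.empty?
    dependents[queue.shift].each do |node|
      queue.push(node) if seen.add?(node)
    end
  end
  seen.to_a
end

puts impacted_by(:nginx_config, DEPENDS_ON).inspect
puts impacted_by(:database, DEPENDS_ON).inspect
</code></pre>

<p>In this toy map a config change only ripples up through nginx, the load balancer and one stack, while a database change touches both stacks. The hard part, of course, is that nobody hands you that map.</p>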

<p>Maybe this isn&#8217;t an issue. Maybe through resilience engineering we simply keep that &#8220;intrastack&#8221; dependency as loose as possible so that we don&#8217;t have this problem?</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[ZeroMQ and Logstash - Part 2]]></title>
    <link href="http://lusis.github.com/blog/2012/02/08/zeromq-and-logstash-part-2/"/>
    <updated>2012-02-08T21:08:00-05:00</updated>
    <id>http://lusis.github.com/blog/2012/02/08/zeromq-and-logstash-part-2</id>
    <content type="html"><![CDATA[<p>A few days ago I wrote up some notes on how we&#8217;re making Logstash better by adding ZeroMQ as an option for inputs and outputs. That night we decided to take it a bit further and add support for ZeroMQ as a filter plugin as well.</p>

<!-- more -->


<p>I&#8217;ve had a lot of people ask me what&#8217;s so hot about ZeroMQ. It&#8217;s hard to explain but I really would suggest you read the excellent <a href="http://zguide.zeromq.org">zguide</a>. The best way I can describe it is that it&#8217;s sockets on steroids. Sockets that behave the way you would expect sockets to behave as opposed to the way they do now. <a href="http://www.quora.com/What-is-the-background-of-the-just-open-a-socket-meme">Just open a socket!</a>.</p>

<h1>Inputs and Outputs</h1>

<p>I&#8217;m only going to touch briefly on inputs and outputs. They were discussed briefly previously and I have a full fledged post in the wings about it.</p>

<p>They essentially work like the other implementations (AMQP and Redis) with the exception that you don&#8217;t have a broker in the middle. Let me show you:</p>

<pre><code>[Collector 1] ------ load balanced events ----&gt; [Indexer 1, Indexer 2, Indexer 3, Indexer 4]
[Collector 2] ------ load balanced events ----&gt; [Indexer 1, Indexer 2, Indexer 3, Indexer 4]
[Collector 3] ------ load balanced events ----&gt; [Indexer 1, Indexer 2, Indexer 3, Indexer 4]
[Collector 4] ------ load balanced events ----&gt; [Indexer 1, Indexer 2, Indexer 3, Indexer 4]
</code></pre>

<p>As you can see we&#8217;re using a pattern very similar to before. We want to send events from our nodes over to a cluster of indexers that do filtering. The difference here is that we don&#8217;t have a broker. No big deal, right? One less thing to worry about! You don&#8217;t have to learn some new tool just to get some simple load balancing of workers. This works great&#8230;until you need to scale workers.</p>

<p>Even using awesome configuration management, you&#8217;ve now got to cycle all your collectors to add the new endpoints. This means lost events. This makes me unhappy. It makes you unhappy. The world is sad. Why are you doing this to us?</p>

<p>Luckily I&#8217;ve been authorized by the Franklin Mint to release the source code to an enterprise class ZeroMQ broker that you can use. Not only is it enterprise class but it has built-in clustering. You can <a href="https://github.com/lusis/enterprise-zeromq-broker">grab the code here from github</a>.</p>

<p>Here are the configs for the logstash agents (output.conf is collector config, input.conf is indexer config):</p>

<p>output.conf:</p>

<figure class='code'><div class="highlight"><table><tr><td class='code'><pre><code class=''><span class='line'>input { stdin { type =&gt; "stdin" } }
</span><span class='line'>output {
</span><span class='line'>  zeromq {
</span><span class='line'>    topology =&gt; "pushpull"
</span><span class='line'>    address =&gt; ["tcp://localhost:5555", "tcp://localhost:5557"]
</span><span class='line'>  }
</span><span class='line'>}</span></code></pre></td></tr></table></div></figure>


<p>input.conf:</p>

<figure class='code'><div class="highlight"><table><tr><td class='code'><pre><code class=''><span class='line'>input {
</span><span class='line'>  zeromq {
</span><span class='line'>    type =&gt; "pull-input"
</span><span class='line'>    topology =&gt; "pushpull"
</span><span class='line'>    address =&gt; ["tcp://localhost:5556", "tcp://localhost:5558"]
</span><span class='line'>    mode =&gt; "client"
</span><span class='line'>  }
</span><span class='line'>}
</span><span class='line'>output { stdout { debug =&gt; true }}</span></code></pre></td></tr></table></div></figure>


<h2>Action shot</h2>

<p>Here&#8217;s a shot of our fancy clustered broker in action (click to zoom):</p>

<p><a href="http://lusis.github.com/images/posts/zeromq-part2/zeromq-broker-ss.png"><img src="http://lusis.github.com/images/posts/zeromq-part2/zeromq-broker-ss.png" alt="zeromq-broker-ss.png" /></a></p>

<p>As you can see the two events we sent were automatically load balanced across our <em>&#8220;brokers&#8221;</em> which then load balanced across our indexers.</p>

<h2>What have we bought ourselves?</h2>

<p>Obviously this is all something of a joke. All we have done is point our collectors at other nodes instead of directly at our indexers. But realize that you can create 2 fixed points on your network with 8 lines of core code and use those as the static information in your indexers and collectors. You can then scale either side without ever having to update a configuration file.</p>

<p>I dare say you can even run those on t1.micro instances on Amazon.</p>

<p>Oh and if you don&#8217;t like Ruby, write it in something else. That&#8217;s the beauty of ZeroMQ.</p>

<h1>Filters</h1>

<p>The thing that has me most excited is the addition of ZeroMQ as a filter to logstash. As you&#8217;ve already seen, ZeroMQ makes it REALLY easy to wire network topologies up with complex patterns. In the inputs and outputs we&#8217;ve exposed a few topologies that make sense. However there&#8217;s another topology that we had not yet exposed because it didn&#8217;t make sense - <code>reqrep</code>.</p>

<h2>REQ/REP</h2>

<p><code>reqrep</code> is short for request and reply. The reason we didn&#8217;t expose it previously is that it didn&#8217;t really make sense with the nature of inputs and outputs. However after talking with Jordan, we decided it actually DID make sense to use it for filters. After all, filters get a request -> do something -> return a response.</p>

<p>If it&#8217;s not immediately clear yet how this makes sense, I&#8217;ve got another example for you. Let&#8217;s take the case of needing to look something up externally to mutate a field. You COULD write a Logstash filter to do this ONE thing for you. Maybe you can make it generic enough to even submit a pull request.</p>

<p>Or you could use a ZeroMQ filter:</p>

<figure class='code'><div class="highlight"><table><tr><td class='code'><pre><code class=''><span class='line'>input { stdin { type =&gt; "stdin-type" } }
</span><span class='line'>filter { zeromq { } }
</span><span class='line'>output { stdout { debug =&gt; true } }</span></code></pre></td></tr></table></div></figure>


<p>Here&#8217;s the code for the filter:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class='code'><pre><code class='ruby'><span class='line'><span class="nb">require</span> <span class="s1">&#39;rubygems&#39;</span>
</span><span class='line'><span class="nb">require</span> <span class="s1">&#39;ffi-rzmq&#39;</span>
</span><span class='line'><span class="nb">require</span> <span class="s2">&quot;json&quot;</span>
</span><span class='line'>
</span><span class='line'><span class="n">context</span> <span class="o">=</span> <span class="no">ZMQ</span><span class="o">::</span><span class="no">Context</span><span class="o">.</span><span class="n">new</span>
</span><span class='line'><span class="n">socket</span> <span class="o">=</span> <span class="n">context</span><span class="o">.</span><span class="n">socket</span><span class="p">(</span><span class="no">ZMQ</span><span class="o">::</span><span class="no">REP</span><span class="p">)</span>
</span><span class='line'><span class="n">socket</span><span class="o">.</span><span class="n">bind</span><span class="p">(</span><span class="s2">&quot;tcp://*:2121&quot;</span><span class="p">)</span>
</span><span class='line'><span class="n">msg</span> <span class="o">=</span> <span class="s1">&#39;&#39;</span>
</span><span class='line'><span class="nb">puts</span> <span class="s2">&quot;starting up&quot;</span>
</span><span class='line'><span class="k">while</span> <span class="kp">true</span> <span class="k">do</span>
</span><span class='line'>  <span class="n">socket</span><span class="o">.</span><span class="n">recv_string</span><span class="p">(</span><span class="n">msg</span><span class="p">)</span>
</span><span class='line'>  <span class="n">modified_message</span> <span class="o">=</span> <span class="no">JSON</span><span class="o">.</span><span class="n">parse</span><span class="p">(</span><span class="n">msg</span><span class="p">)</span>
</span><span class='line'>  <span class="nb">puts</span> <span class="s2">&quot;Message received: </span><span class="si">#{</span><span class="n">msg</span><span class="si">}</span><span class="s2">&quot;</span>
</span><span class='line'>  <span class="c1"># Simulate using an external data source to </span>
</span><span class='line'>  <span class="c1"># do something that you need</span>
</span><span class='line'>  <span class="k">case</span> <span class="n">modified_message</span><span class="o">[</span><span class="s2">&quot;@source&quot;</span><span class="o">]</span>
</span><span class='line'>  <span class="k">when</span> <span class="s2">&quot;stdin://jvstratusmbp.lusis.org/&quot;</span>
</span><span class='line'>    <span class="nb">puts</span> <span class="s2">&quot;Doing db lookup&quot;</span>
</span><span class='line'>    <span class="nb">sleep</span> <span class="mi">10</span>
</span><span class='line'>    <span class="n">modified_message</span><span class="o">[</span><span class="s2">&quot;@source&quot;</span><span class="o">]</span> <span class="o">=</span> <span class="s2">&quot;john&#39;s laptop&quot;</span>
</span><span class='line'>  <span class="k">end</span>
</span><span class='line'>  <span class="nb">puts</span> <span class="s2">&quot;Message responded: </span><span class="si">#{</span><span class="n">modified_message</span><span class="o">.</span><span class="n">to_json</span><span class="si">}</span><span class="s2">&quot;</span>
</span><span class='line'>  <span class="n">socket</span><span class="o">.</span><span class="n">send_string</span><span class="p">(</span><span class="n">modified_message</span><span class="o">.</span><span class="n">to_json</span><span class="p">)</span>
</span><span class='line'><span class="k">end</span>
</span></code></pre></td></tr></table></div></figure>


<p>By default, the filter will send the entire event over a ZeroMQ <code>REQ</code> socket to <code>tcp://localhost:2121</code>. It will then take the reply and send it up the chain to the Logstash output with the following results:</p>

<p><a href="http://lusis.github.com/images/posts/zeromq-part2/zeromq-filter-event.png"><img src="http://lusis.github.com/images/posts/zeromq-part2/zeromq-filter-event.png" alt="zeromq-filter-event.png" /></a></p>

<p>Alternatively, you can send just a single field to the external process and have it work with that:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
</pre></td><td class='code'><pre><code class='ruby'><span class='line'><span class="n">input</span> <span class="p">{</span> <span class="n">stdin</span> <span class="p">{</span> <span class="n">type</span> <span class="o">=&gt;</span> <span class="s2">&quot;stdin-test&quot;</span> <span class="p">}</span> <span class="p">}</span>
</span><span class='line'><span class="n">filter</span> <span class="p">{</span> <span class="n">zeromq</span> <span class="p">{</span> <span class="n">field</span> <span class="o">=&gt;</span> <span class="s2">&quot;@message&quot;</span> <span class="p">}</span> <span class="p">}</span>
</span><span class='line'><span class="n">output</span> <span class="p">{</span> <span class="n">stdout</span> <span class="p">{</span> <span class="n">debug</span> <span class="o">=&gt;</span> <span class="kp">true</span> <span class="p">}}</span>
</span></code></pre></td></tr></table></div></figure>


<p>and the code:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
<span class='line-number'>13</span>
<span class='line-number'>14</span>
<span class='line-number'>15</span>
<span class='line-number'>16</span>
</pre></td><td class='code'><pre><code class='ruby'><span class='line'><span class="nb">require</span> <span class="s1">&#39;rubygems&#39;</span>
</span><span class='line'><span class="nb">require</span> <span class="s1">&#39;ffi-rzmq&#39;</span>
</span><span class='line'><span class="nb">require</span> <span class="s2">&quot;json&quot;</span>
</span><span class='line'>
</span><span class='line'><span class="n">context</span> <span class="o">=</span> <span class="no">ZMQ</span><span class="o">::</span><span class="no">Context</span><span class="o">.</span><span class="n">new</span>
</span><span class='line'><span class="n">socket</span> <span class="o">=</span> <span class="n">context</span><span class="o">.</span><span class="n">socket</span><span class="p">(</span><span class="no">ZMQ</span><span class="o">::</span><span class="no">REP</span><span class="p">)</span>
</span><span class='line'><span class="n">socket</span><span class="o">.</span><span class="n">bind</span><span class="p">(</span><span class="s2">&quot;tcp://*:2121&quot;</span><span class="p">)</span>
</span><span class='line'><span class="n">msg</span> <span class="o">=</span> <span class="s1">&#39;&#39;</span>
</span><span class='line'><span class="nb">puts</span> <span class="s2">&quot;starting up&quot;</span>
</span><span class='line'><span class="k">while</span> <span class="kp">true</span> <span class="k">do</span>
</span><span class='line'>  <span class="n">socket</span><span class="o">.</span><span class="n">recv_string</span><span class="p">(</span><span class="n">msg</span><span class="p">)</span>
</span><span class='line'>  <span class="nb">puts</span> <span class="s2">&quot;Received message: </span><span class="si">#{</span><span class="n">msg</span><span class="si">}</span><span class="s2">&quot;</span>
</span><span class='line'>  <span class="n">modified_message</span> <span class="o">=</span> <span class="s2">&quot;this field was changed externally&quot;</span>
</span><span class='line'>  <span class="nb">puts</span> <span class="s2">&quot;Modified message: </span><span class="si">#{</span><span class="n">modified_message</span><span class="si">}</span><span class="s2">&quot;</span>
</span><span class='line'>  <span class="n">socket</span><span class="o">.</span><span class="n">send_string</span><span class="p">(</span><span class="n">modified_message</span><span class="p">)</span>
</span><span class='line'><span class="k">end</span>
</span></code></pre></td></tr></table></div></figure>


<p>and the result:</p>

<p><a href="http://lusis.github.com/images/posts/zeromq-part2/zeromq-filter-field.png"><img src="http://lusis.github.com/images/posts/zeromq-part2/zeromq-filter-field.png" alt="zeromq-filter-field.png" /></a></p>

<p>Many people have been asking for an <code>exec</code> filter for some time now. Launching an external process from the JVM for that kind of work carries an insane amount of overhead. By doing this type of work over ZeroMQ, there&#8217;s much less overhead AND a reliable conduit for making it happen.</p>

<p>Here are just a few of the use cases I could think of:</p>

<ul>
<li>Artificially throttling your flow. Just use a sleep and return the original event.</li>
<li>Doing external lookups for replacing parts of the event</li>
<li>Adding arbitrary tags to a message using external criteria based on the event.</li>
<li>Moving underperforming filters out of logstash and into an external process that is more performant</li>
<li>Reducing the need to modify configs in logstash for greater uptime.</li>
</ul>
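
<p>To make a couple of those use cases concrete, here&#8217;s a minimal sketch of the part of the responder that does the actual work, pulled out of the socket loop so it can be exercised without ZeroMQ. The <code>handle_event</code> name, the <code>SOURCE_NAMES</code> table and the <code>source-resolved</code> tag are all made up for illustration, and I&#8217;m assuming the event carries an <code>@tags</code> array:</p>

```ruby
require "json"

# Hypothetical lookup table standing in for a database or API call.
SOURCE_NAMES = {
  "stdin://jvstratusmbp.lusis.org/" => "john's laptop"
}.freeze

# Takes the raw JSON event sent by the filter and returns the JSON reply.
# Unknown sources pass through untouched, so the event always gets a reply.
def handle_event(raw, lookups = SOURCE_NAMES)
  event = JSON.parse(raw)
  friendly = lookups[event["@source"]]
  if friendly
    event["@source"] = friendly
    # Tag the event (assumed @tags field) so later stages can see it was rewritten.
    event["@tags"] = (event["@tags"] || []) + ["source-resolved"]
  end
  event.to_json
end

puts handle_event('{"@source":"stdin://jvstratusmbp.lusis.org/","@message":"hello"}')
```

<p>In the responder loop you would call something like this between <code>recv_string</code> and <code>send_string</code>. Keeping it separate from the socket code means you can test the lookup logic on its own.</p>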


<h1>Wrap up</h1>

<p>All the ZeroMQ support is currently tagged experimental (hence the warnings you saw in my screenshots). The form described here also exists only in master. If this interests you at all, please build from master and run some tests of your own. We would love the feedback; any bug reports or tips you can provide are always valuable.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[ZeroMQ and Logstash - Part 1]]></title>
    <link href="http://lusis.github.com/blog/2012/02/06/zeromq-and-logstash-part-1/"/>
    <updated>2012-02-06T01:07:00-05:00</updated>
    <id>http://lusis.github.com/blog/2012/02/06/zeromq-and-logstash-part-1</id>
    <content type="html"><![CDATA[<p>Every once in a while, a software project comes along that makes you rethink how you&#8217;ve done things up until that point. I&#8217;ve often said that ElasticSearch was the first of those projects for me. The other is ZeroMQ.</p>

<!-- more -->


<h1>Edit and update</h1>

<p>Evidently my testing missed a pretty critical use case: pubsub. It doesn&#8217;t work right now. The way we&#8217;re doing sockopts works for setting topics. However, we don&#8217;t have a commensurate setting on the PUB side. I&#8217;ve created <a href="https://logstash.jira.com/browse/LOGSTASH-399">LOGSTASH-399</a> and <a href="https://logstash.jira.com/browse/LOGSTASH-400">LOGSTASH-400</a> to deal with these issues. I am so sorry about that; however, it doesn&#8217;t change the overall tone and content of this post, as <code>pair</code> and <code>pushpull</code> still work.</p>

<h1>A little history</h1>

<p>In January of this year, <a href="https://twitter.com/jordansissel">Jordan</a> merged the first iteration of ZeroMQ support for Logstash. Several people had been asking for it and I had it on my plate to do as well. Funny side note, the pull request for the ZeroMQ plugin was my inspiration for adding <a href="http://logstash.net/docs/1.1.0/plugin-status">plugin_status</a> to Logstash.</p>

<p>The reason for wanting to mark it experimental is that there was concern over the best approach to using ZeroMQ with Logstash. Did we create a single context per agent? Did we do a context per thread? How well would the multiple layers of indirection work (jvm + ruby + ffi)?</p>

<p><a href="https://twitter.com/_masterzen_">Brice&#8217;s</a> original pull request only handled one part of the total ZeroMQ package (PUBSUB) but it was an awesome start. We actually had two other pull requests around the same time but his was first.</p>

<p>A week or so ago, I started a series of posts around doing load balanced filter pipelines with Logstash. The first was <a href="http://goo.gl/vWyCH">AMQP</a> and then <a href="http://goo.gl/6W8Lv">Redis</a>. The next logical step was ZeroMQ (and something of a &#8220;Oh..and one more thing..&#8221; post). Sadly, the current version of the plugin was not amenable to doing the same flow. Since it only supported PUBSUB, I needed to do some work on the plugin to get the other socket types supported. I made this my weekend project.</p>

<h1>Something different</h1>

<p>One thing that ZeroMQ does amazingly well is make something complex very easy. It exposes common communication patterns over socket types and makes it easy to use them. It really is just plug and play communication.</p>

<p>However, it also makes some really powerful flows available to you if you dig deep enough. Look at this example from the <a href="http://zguide.zeromq.org">zguide</a>:</p>

<p><img src="https://github.com/imatix/zguide/raw/master/images/fig14.png" alt="complex-flow" /></p>

<p>Mind you the code for that is pretty simple (<a href="http://zguide.zeromq.org/rb:taskwork2">ruby example</a>) but we need to enable that level of flexibility and power behind the Logstash config language. We also wanted to avoid the confusion that we faced with the AMQP plugin around exchange vs. queue.</p>

<p>Jordan came up with the idea of removing the socket type confusion and just exposing the patterns. And that&#8217;s what we&#8217;ve done.</p>

<h1>Configuration</h1>

<p>In the configuration language, Logstash exposes the ZeroMQ socket type pairs using the same syntax on both inputs and outputs. We call each of these a &#8220;topology&#8221;. In fact, Logstash ZeroMQ support will work out of the box with two agents on the same machine:</p>

<h2>Output</h2>

<figure class='code'><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>input {
</span><span class='line'>  stdin { type =&gt; "stdin-input" }
</span><span class='line'>}
</span><span class='line'>output {
</span><span class='line'>  zeromq { topology =&gt; "pushpull" }
</span><span class='line'>}</span></code></pre></td></tr></table></div></figure>


<h2>Input</h2>

<figure class='code'><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>input {
</span><span class='line'>  zeromq { topology =&gt; "pushpull" type =&gt; "zeromq-input" }
</span><span class='line'>}
</span><span class='line'>output {
</span><span class='line'>  stdout { debug =&gt; true }
</span><span class='line'>}</span></code></pre></td></tr></table></div></figure>


<h2>Opinionated</h2>

<p>Because any side of a socket type in ZeroMQ can be the connecting or binding side (the underlying message flow is disconnected from how the connection is established), Logstash follows the recommendation of the zguide. The more &#8220;stable&#8221; parts of your infrastructure should be the side that binds/listens while the ephemeral side should be the one that initiates connections.</p>

<p>Following this, we have some sane defaults around the plugins:</p>

<ul>
<li>Logstash inputs will, by default, be the <code>bind</code> side and bind to all interfaces on port 2120</li>
<li>Logstash outputs will, by default, be the <code>connect</code> side</li>
<li>Logstash inputs will be the consumer side of a flow</li>
<li>Logstash outputs will be the producing side of a flow</li>
</ul>


<p>The last two are obviously pretty &#8220;duh&#8221; but worth mentioning. Right now Logstash exposes three socket types that make sense for Logstash:</p>

<ul>
<li>PUSHPULL (Output is PUSH. Input is PULL)</li>
<li>PUBSUB (Output is PUB. Input is SUB)</li>
<li>PAIR</li>
</ul>


<p>It&#8217;s worth reading up on ALL <a href="http://api.zeromq.org/2-1:zmq-socket">the socket types in ZeroMQ</a>.</p>

<p>By default, because of how ZeroMQ will most commonly be slotted into your pipeline, it sets the default message format to the Logstash native <em>json_event</em>.</p>

<p>You can still get to the low-level tuning of the sockets via the <code>sockopts</code> configuration setting. This is a Logstash config hash. For example, if you wanted to tune the high water mark of a socket (<code>ZMQ_HWM</code>), you would do so with this option:</p>

<p><code>zeromq { topology =&gt; "pushpull" sockopts =&gt; ["ZMQ::HWM", 20] }</code></p>

<p>These options are passed directly to the <code>ffi-rzmq</code> library we use (hence the syntax on the option name). If a new option is added in a later release, it&#8217;s already available that way.</p>
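
<p>If you&#8217;re wondering how a name like <code>"ZMQ::HWM"</code> can end up as a real socket option, here&#8217;s a rough sketch of one way to do it. This is not the plugin&#8217;s actual code. <code>FakeZMQ</code> and <code>FakeSocket</code> are stand-ins I made up so the example runs without <code>ffi-rzmq</code> installed, and <code>apply_sockopts</code> is a hypothetical helper:</p>

```ruby
# Stand-in for the constants the ZeroMQ binding defines; the value is illustrative.
module FakeZMQ
  HWM = 1
end

# Stand-in socket that just records what it was asked to set.
class FakeSocket
  attr_reader :options

  def initialize
    @options = {}
  end

  def setsockopt(option, value)
    @options[option] = value
    0
  end
end

# Resolve names such as "ZMQ::HWM" to constants and apply them to the socket.
# sockopts is a flat list of name/value pairs, as in the config example.
def apply_sockopts(socket, sockopts, namespace = FakeZMQ)
  sockopts.each_slice(2) do |name, value|
    constant = name.split("::").last
    unless namespace.const_defined?(constant)
      raise ArgumentError, "unknown socket option #{name}"
    end
    socket.setsockopt(namespace.const_get(constant), value)
  end
  socket
end

socket = apply_sockopts(FakeSocket.new, ["ZMQ::HWM", 20])
p socket.options
```

<p>Because the lookup happens by name at runtime, any constant the underlying library defines can be reached this way without the config language needing to know about it ahead of time.</p>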

<h1>Usage of each topology</h1>

<p>While I have a few more blog posts in the hopper around ZeroMQ (and various patterns with Logstash), I&#8217;ll briefly cover where each type might fit.</p>

<h2>PUBSUB</h2>

<p>This is exactly what it sounds like. Each output (PUB) broadcasts to all connected inputs (SUB).</p>

<h2>PUSHPULL</h2>

<p>This most closely mimics the examples in my previous posts on AMQP and Redis. Each output (PUSH) load-balances across all connected inputs (PULL).</p>

<h2>PAIR</h2>

<p>This is essentially a one-to-one streaming socket. While messages CAN flow both directions, Logstash does not support (nor need) that. Outputs stream events to the input.</p>

<p>ZeroMQ has other topologies (like REQREP - request response and ROUTER/DEALER) but they don&#8217;t really make sense for Logstash right now. For the type of messaging that Logstash does between peers, PAIR is a much better fit. We have plans to expose these in a future release.</p>
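
<p>If the difference between the three topologies is still fuzzy, this toy model in plain Ruby might help. There&#8217;s no ZeroMQ involved here at all. It only mimics who ends up with which event, and the method names and indexer names are invented for the example:</p>

```ruby
# Toy model of delivery semantics only; no ZeroMQ involved.

# PUB: every connected input receives every event.
def pubsub(events, inputs)
  inputs.to_h { |name| [name, events.dup] }
end

# PUSH: events are dealt out round-robin, so each goes to exactly one input.
def pushpull(events, inputs)
  delivered = inputs.to_h { |name| [name, []] }
  events.each_with_index do |event, i|
    delivered[inputs[i % inputs.size]].push(event)
  end
  delivered
end

# PAIR: exactly one output streams to exactly one input.
def pair(events, input)
  { input => events.dup }
end

events = %w[e1 e2 e3 e4]
inputs = %w[indexer-a indexer-b]
p pubsub(events, inputs)
p pushpull(events, inputs)
p pair(events, "indexer-a")
```

<p>With PUBSUB both indexers see all four events, with PUSHPULL each indexer sees two of them, and with PAIR there is only ever one indexer to deliver to.</p>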

<h1>Future</h1>

<p>As I said, I&#8217;ve got quite a few ideas for posts around this plugin. It opens up so many avenues for users and makes doing complex pipelines much easier. Here&#8217;s a sample of some things you&#8217;ll be able to do:</p>

<ul>
<li>Writing your own &#8220;broker&#8221; to sit between edges and indexers in whatever language works best (8 lines of Ruby)</li>
<li>Log directly from your application (e.g. log4j ZMQ appender) to logstash with minimal fuss</li>
<li>Tune ZeroMQ sockopts for durability</li>
</ul>


<p>ZeroMQ support currently exists only in master. However, building from source is very easy. Simply clone the repo and type <code>make</code>. You don&#8217;t even need to have Ruby installed. This will leave your very own jar file in the <code>build</code> directory.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Load balancing Logstash with Redis]]></title>
    <link href="http://lusis.github.com/blog/2012/01/31/load-balancing-logstash-with-redis/"/>
    <updated>2012-01-31T23:24:00-05:00</updated>
    <id>http://lusis.github.com/blog/2012/01/31/load-balancing-logstash-with-redis</id>
    <content type="html"><![CDATA[<p>After yesterday&#8217;s post about load balancing logstash with AMQP and RabbitMQ, I got to thinking that it might be useful to show a similar pattern with other inputs and outputs.
To me, this is the crux of what makes Logstash so awesome. Someone asked me to describe Logstash in one sentence. The best I could come up with was:</p>

<blockquote><p>Logstash is a unix pipe on steroids</p></blockquote>


<p>I hope this post helps you understand what I meant by that.</p>

<!-- more -->


<h1>Revisiting our requirements and pattern</h1>

<p>If you recall from the post <a href="http://goo.gl/vWyCH">yesterday</a>, we had the following &#8216;requirements&#8217;:</p>

<ul>
<li>No lost messages in transit/due to inputs or outputs.</li>
<li>Shipper only configuration on the source</li>
<li>Worker based filtering model</li>
<li>No duplicate messages due to transit mediums (i.e. fanout is inappropriate as all indexers would see the same message)</li>
</ul>


<h2>EDIT</h2>

<p>Originally our list stated the requirements as <em>No lost messages</em> and <em>No duplicate messages</em>. I&#8217;ve amended those with a slight modification to closer reflect the original intent. Please see <a href="http://blog.lusis.org/blog/2012/01/31/load-balancing-logstash-with-amqp/#comment-426175086">comment from Jelle Smet here</a> for details. Thanks Jelle!</p>

<p>Our design looked something like this:</p>

<p><a href="http://lusis.github.com/images/posts/load-balancing-logstash-with-amqp/gliffy-overview.png"><img src="http://lusis.github.com/images/posts/load-balancing-logstash-with-amqp/gliffy-overview.png" alt="gliffy-overview.png" /></a></p>

<p>One of the reasons that post was so long was that AMQP is a complicated beast. There was quite a bit of dense frontloading I had to do to cover AMQP before we got to the meat.
We&#8217;re going to take that same example, and swap out RabbitMQ for something a bit simpler and achieve the same results.</p>

<h1>Quick background on Redis</h1>

<p><a href="http://redis.io">Redis</a> is commonly lumped in with a group of data storage technologies called NoSQL. Its name is short for &#8220;REmoteDIctionaryServer&#8221;. It typically falls into the &#8220;key/value&#8221; family of NoSQL.
Several things set Redis apart from most key/value systems however:</p>

<ul>
<li>&#8220;data types&#8221; as values</li>
<li>native operations on those data types</li>
<li>atomic operations</li>
<li>built-in PUB/SUB subsystem</li>
<li>No external dependencies</li>
</ul>


<h2>Data types</h2>

<p>I&#8217;m not going to go into too much detail about the data types except to list them and highlight the one we&#8217;ll be leveraging. You can read more about them <a href="http://redis.io/topics/data-types">here</a></p>

<ul>
<li>Strings</li>
<li>Lists*</li>
<li>Sets</li>
<li>Hashes</li>
<li>Sorted Sets</li>
</ul>


<h3>How Logstash uses Redis</h3>

<p>Looking back at our AMQP example, we note three distinct exchange types. These are mapped to the following functionality in Redis (and Logstash <code>data_type</code> config for reference):</p>

<p><a href="http://lusis.github.com/images/posts/load-balancing-logstash-with-redis/mapping-table.png"><img src="http://lusis.github.com/images/posts/load-balancing-logstash-with-redis/mapping-table.png" alt="mapping-table.png" /></a></p>

<p>This is a somewhat oversimplified list. In the case of a message producer, mimicking <code>direct</code> exchanges is done by writing to a Redis <code>list</code> while consumption of that is done via the Redis command <code>BLPOP</code><a href="http://redis.io/commands/blpop">*</a>. However, mimicking the <code>fanout</code> and <code>topic</code> functionality is done strictly with the commands <code>PUBLISH</code><a href="http://redis.io/commands/publish">*</a>, <code>SUBSCRIBE</code><a href="http://redis.io/commands/subscribe">*</a> and <code>PSUBSCRIBE</code><a href="http://redis.io/commands/psubscribe">*</a>. It&#8217;s worth reading each of those for a better understanding.</p>

<p>Oddly enough, the use of Redis as a messaging bus is something of a side effect. Redis supported lists that are auto-sorted by insert order. The <code>POP</code> command variants allowed single transaction get and remove of the data. It just fit the use case.</p>

<h1>The configs</h1>

<p>As with our previous example, we&#8217;re going to show the configs needed on each side and explain them a little bit.</p>

<h2>Client-side/Producer</h2>

<figure class='code'><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>input { stdin { type =&gt; "producer"} }
</span><span class='line'>output {
</span><span class='line'>redis {
</span><span class='line'> host =&gt; 'localhost'
</span><span class='line'> data_type =&gt; 'list'
</span><span class='line'> key =&gt; 'logstash:redis'
</span><span class='line'>}
</span><span class='line'>}</span></code></pre></td></tr></table></div></figure>


<h3>data_type</h3>

<p>This is where we tell Logstash how to send the data to Redis. In this case, again, we&#8217;re storing it in a list data type.</p>

<h3>key</h3>

<p>Unfortunately, key means different things (though with the same effect) depending on the <code>data_type</code> being used. In the case of a <code>list</code> this maps cleanly to the understanding of a <code>key</code> in a key/value system. It&#8217;s common in Redis to namespace keys with a <code>:</code> though it&#8217;s entirely unnecessary.</p>

<p>As an aside, when using <code>key</code> on <code>channel</code> data type, this behaves like the routing key in AMQP parlance with the exception of being able to use any separator you like (in other words, you can namespace with <code>.</code>,<code>:</code>,<code>::</code> whatever).</p>

<h2>Indexer-side/Consumer</h2>

<figure class='code'><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>input {
</span><span class='line'>redis {
</span><span class='line'>  host =&gt; 'localhost'
</span><span class='line'>  data_type =&gt; 'list'
</span><span class='line'>  key =&gt; 'logstash:redis'
</span><span class='line'>  type =&gt; 'redis-input'
</span><span class='line'>}
</span><span class='line'>}
</span><span class='line'>output {stdout {debug =&gt; true} }</span></code></pre></td></tr></table></div></figure>


<h3>data_type</h3>

<p>This needs to match up with the value from the output plugin. Again, in this example <code>list</code>.</p>

<h3>key</h3>

<p>In the case of a <code>list</code> this needs to map EXACTLY to the output plugin. Following on to our previous aside, for <code>data_type</code> values of <code>channel</code> input, the key must match exactly while <code>pattern_channel</code> can support wildcards. Redis PSUBSCRIBE wildcards are actually much simpler than AMQP ones. You can use <code>*</code> at any point in the key name.</p>
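
<p>To get a feel for how those wildcards behave, here&#8217;s a small sketch that approximates Redis glob-style matching with Ruby&#8217;s <code>File.fnmatch?</code>. It is only an approximation for playing around. It is not how Redis or Logstash do the matching, and <code>channel_match?</code> and the channel names are made up:</p>

```ruby
# Approximates Redis glob-style channel patterns with Ruby's File.fnmatch?.
# FNM_DOTMATCH lets "*" match names that start with a dot as well.
def channel_match?(pattern, channel)
  File.fnmatch?(pattern, channel, File::FNM_DOTMATCH)
end

channels = ["logstash:redis", "logstash:apache:access", "metrics:redis"]

["logstash:*", "*:redis", "logstash:*:access"].each do |pattern|
  matches = channels.select { |channel| channel_match?(pattern, channel) }
  puts "#{pattern} matches #{matches.inspect}"
end
```

<p>Notice that the separator has no special meaning to the matcher, which is why you can namespace with whatever character you like.</p>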

<h1>Starting it all up</h1>

<p>We&#8217;re going to simplify our original tests a little bit in the interest of brevity. Showing 2 producers and 2 consumers gives us the same benefit as showing four of each. Since we don&#8217;t have the benefit of a pretty management interface, we&#8217;re going to use the redis server debug information and the <code>redis-cli</code> application to allow us to see certain management information.</p>

<h2>redis-server</h2>

<p>Start the server with the command <code>redis-server</code>. I&#8217;m running this from homebrew, but you can literally build Redis on any machine that has <code>make</code> and a compiler. That&#8217;s all you need. You can even run it straight from the source directory:</p>

<p><a href="http://lusis.github.com/images/posts/load-balancing-logstash-with-redis/redis-server.png"><img src="http://lusis.github.com/images/posts/load-balancing-logstash-with-redis/redis-server.png" alt="redis-server.png" /></a></p>

<p>You&#8217;ll notice that the redis server is periodically dumping some stats - number of connected clients and the amount of memory in use.</p>

<h2>Starting the logstash agents</h2>

<p>We&#8217;re going to start two producers (redis output) and two consumers (redis input):</p>

<p><a href="http://lusis.github.com/images/posts/load-balancing-logstash-with-redis/agents.png"><img src="http://lusis.github.com/images/posts/load-balancing-logstash-with-redis/agents.png" alt="agents.png" /></a></p>

<p>Back in our redis-server window, you should now see two connected clients in the periodic status messages. Why not four? Because the producers don&#8217;t have a persistent connection to Redis. Only the consumers do (via BLPOP):</p>

<p><a href="http://lusis.github.com/images/posts/load-balancing-logstash-with-redis/two-clients.png"><img src="http://lusis.github.com/images/posts/load-balancing-logstash-with-redis/two-clients.png" alt="two-clients.png" /></a></p>

<h1>Testing message flow</h1>

<p>As with our previous post, we&#8217;re going to alternate messages between the two producers. In the first producer, we&#8217;ll type <code>window 1</code> and in the second <code>window 2</code>. You&#8217;ll see the consumers pick up the messages:</p>

<p><a href="http://lusis.github.com/images/posts/load-balancing-logstash-with-redis/delivery.png"><img src="http://lusis.github.com/images/posts/load-balancing-logstash-with-redis/delivery.png" alt="delivery.png" /></a></p>

<p>If you look over in the redis-server window, you&#8217;ll also see that our client count went up to four. If we were to leave these clients alone, eventually it would drop back down to two.</p>

<p><a href="http://lusis.github.com/images/posts/load-balancing-logstash-with-redis/new-connections.png"><img src="http://lusis.github.com/images/posts/load-balancing-logstash-with-redis/new-connections.png" alt="new-connections.png" /></a></p>

<p>Feel free to run the tests a few times and get a feel for message flow.</p>

<h2>Offline consumers</h2>

<p>This is all well and good but as with the previous example, we want to test how this configuration handles the case of consumers going offline. Shut down the two indexer configs and let&#8217;s verify. To do this, we&#8217;re going to also open up a new window and run the <code>redis-cli</code> app. Technically, you don&#8217;t even need that. You can telnet to the redis port and just run these commands yourself. We&#8217;re going to use the <code>LLEN</code> command to get the size of our &#8220;backlog&#8221;.</p>

<p>In the producer windows, type a few messages. Alternate between producers for maximum effect. Then go over to the <code>redis-cli</code> window and type <code>LLEN logstash:redis</code>. You should see something like the following (obviously varied by how many messages you sent):</p>

<p><a href="http://lusis.github.com/images/posts/load-balancing-logstash-with-redis/llen.png"><img src="http://lusis.github.com/images/posts/load-balancing-logstash-with-redis/llen.png" alt="llen.png" /></a></p>

<p>You&#8217;ll also notice in the redis server window that the amount of memory in use went up slightly.</p>

<p>Now let&#8217;s start our consumers back up and ensure they drain (and in insert order):</p>

<p><a href="http://lusis.github.com/images/posts/load-balancing-logstash-with-redis/drain.png"><img src="http://lusis.github.com/images/posts/load-balancing-logstash-with-redis/drain.png" alt="drain.png" /></a></p>

<p>Looks good to me!</p>
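
<p>If you don&#8217;t have Redis handy, you can get a feel for the same backlog-and-drain behaviour with an in-process analogy. In this sketch Ruby&#8217;s thread-safe <code>Queue</code> plays the role of the Redis list, and threads play the role of the consumers. The <code>drain_backlog</code> helper is hypothetical, and none of this talks to Redis:</p>

```ruby
# In-process analogy: Queue stands in for the Redis list, threads for consumers.
def drain_backlog(messages, consumer_count)
  backlog = Queue.new

  # Producers keep writing while no consumer is running.
  messages.each { |msg| backlog.push(msg) }
  queued = backlog.size # comparable to LLEN on the list

  # Closing the queue lets consumers exit once it is empty.
  backlog.close

  delivered = Queue.new
  workers = Array.new(consumer_count) do |id|
    Thread.new do
      # pop blocks while waiting and returns nil once the closed queue is empty.
      while (msg = backlog.pop)
        delivered.push([id, msg])
      end
    end
  end
  workers.each { |worker| worker.join }

  results = []
  results.push(delivered.pop) until delivered.empty?
  { queued: queued, delivered: results }
end

outcome = drain_backlog(["window 1", "window 2", "window 1", "window 2"], 2)
puts "backlog before consumers started: #{outcome[:queued]}"
puts "messages delivered: #{outcome[:delivered].size}"
```

<p>Each message is handed to exactly one consumer, which is the property we care about. Which consumer gets which message will vary from run to run.</p>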

<h1>Persistence</h1>

<p>You might have noticed I didn&#8217;t address disk-based persistence at all. This was intentional. Redis is primarily a memory-based store. However it does have support for a few different ways of persisting to disk - RDB and AOF. I&#8217;m not going to go into too much detail on those. The Redis documentation does a good job of explaining the pros and cons of each. You can read that <a href="http://redis.io/topics/persistence">here</a>.</p>

<h1>Wrap up</h1>

<p>One thing that&#8217;s important to note is that Redis is pretty damn fast. The limitation for Redis is essentially memory. However if speed isn&#8217;t your primary concern, there&#8217;s an interesting alpha project called <a href="http://inaka.github.com/edis">edis</a> worth investigating. It is a port of Redis to Erlang. Its primary goal is better persistence for Redis. For this post I also tested Logstash against edis and I&#8217;m happy to say it works:</p>

<p><a href="http://lusis.github.com/images/posts/load-balancing-logstash-with-redis/edis.png"><img src="http://lusis.github.com/images/posts/load-balancing-logstash-with-redis/edis.png" alt="edis.png" /></a></p>

<p>I hope to do further testing with it in the future in a multinode setup.</p>

<h2>Part three</h2>

<p>I&#8217;m also working on a part three in this &#8220;series&#8221;. The last configuration I&#8217;d like to show is doing this same setup but using <a href="http://zeromq.org">0mq</a> as the bus. This is going to be especially challenging since our 0mq support is currently &#8216;alpha&#8217;-ish quality. Beyond that, I plan on doing a similar series using pub/sub patterns. If you&#8217;re enjoying these posts, please comment and let me know!</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Load balancing Logstash with AMQP]]></title>
    <link href="http://lusis.github.com/blog/2012/01/31/load-balancing-logstash-with-amqp/"/>
    <updated>2012-01-31T01:12:00-05:00</updated>
    <id>http://lusis.github.com/blog/2012/01/31/load-balancing-logstash-with-amqp</id>
    <content type="html"><![CDATA[<p>AMQP in Logstash is one of the most complicated parts of the workflow. I&#8217;ve taken it on myself, as the person with the most AMQP experience (both RabbitMQ and Qpid), to try and explain as much as needed for logstash users.</p>

<p><a href="https://twitter.com/patrickdebois">Patrick DeBois</a> hit me up with a common logstash design pattern that I felt warranted a full detailed post.</p>

<p><em>Warning: This is an image heavy post. Terminal screenshots are linked to larger versions</em></p>

<h2>Requirements</h2>

<ul>
<li>No lost messages in transit/due to inputs or outputs.</li>
<li>Shipper-only configuration on the source</li>
<li>Worker-based filtering model</li>
<li>No duplicate messages due to transit mediums (i.e. fanout is inappropriate as all indexers would see the same message)</li>
<li>External ElasticSearch cluster as final destination</li>
</ul>


<!-- more -->


<h2>EDIT</h2>

<p>Originally our list stated the requirements as <em>No lost messages</em> and <em>No duplicate messages</em>. I&#8217;ve amended those with a slight modification to closer reflect the original intent. Please see <a href="http://blog.lusis.org/blog/2012/01/31/load-balancing-logstash-with-amqp/#comment-426175086">comment from Jelle Smet here</a> for details. Thanks Jelle!</p>

<h2>Notes</h2>

<p>We&#8217;re going to leave the details of filtering and client-side input up to the imagination.
For this use case we&#8217;ll simply use <code>stdin</code> as our starting point. You can modify this as you see fit.
The same goes for filtering. The assumption is that your filters will be correct and not be the source of any messages NOT making it into ElasticSearch.</p>

<p>Each configuration will be explained so don&#8217;t stress over it at first glance. We&#8217;re also going to explicitly set some options for the sake of easier comprehension.</p>

<h1>Client-side agent config</h1>

<figure class='code'><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
<span class='line-number'>13</span>
<span class='line-number'>14</span>
<span class='line-number'>15</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>{
</span><span class='line'>  input {
</span><span class='line'>    stdin { debug =&gt; true type =&gt; "host-agent-input" }
</span><span class='line'>  }
</span><span class='line'>  output {
</span><span class='line'>    amqp {
</span><span class='line'>      name =&gt; "logstash-exchange"
</span><span class='line'>      exchange_type =&gt; "direct"
</span><span class='line'>      host =&gt; "rabbitmq-server"
</span><span class='line'>      key =&gt; "logstash-routing-key"
</span><span class='line'>      durable =&gt; true
</span><span class='line'>      persistent =&gt; true
</span><span class='line'>    }
</span><span class='line'>  }
</span><span class='line'>}</span></code></pre></td></tr></table></div></figure>


<h2>Config Explained</h2>

<p>The amqp output:</p>

<h3>name</h3>

<p>This is the name that will be provided to RabbitMQ for the exchange. By default, the Bunny driver will auto-generate a name. This won&#8217;t work in this use case because the consumers will need a known name. Remember exchanges are for producers. Queues are for consumers. When we wire up the indexer side, we&#8217;ll need to know the name of the exchange to perform the binding.</p>

<h3>exchange_type</h3>

<p>For this particular design, we want to use a direct exchange. It&#8217;s the only way we can guarantee that only one copy of a log message will be processed.</p>

<h3>key</h3>

<p>We&#8217;re going to explicitly set the routing key as direct exchanges do not support wildcard routing key bindings. Again, we&#8217;ll need this on the consumer side to ensure we get the right messages.</p>

<h3>durable</h3>

<p>This setting controls if the exchange should survive RabbitMQ restarts or not.</p>

<h3>persistent</h3>

<p>This is for the messages. Should they be persisted to disk or not?</p>

<p>Note that for a full &#8220;no lost messages&#8221; scenario to work in RabbitMQ, you have to jump through some hoops. This is explained more below.</p>

<h2>Running the agent</h2>

<p>This same configuration should be used on ALL host agents where logs are being read. You can have variation in the inputs. You can have additional outputs; however, the amqp output stanza above will ensure that all messages will be sent to RabbitMQ.</p>

<h1>Indexer agent config</h1>

<figure class='code'><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
<span class='line-number'>13</span>
<span class='line-number'>14</span>
<span class='line-number'>15</span>
<span class='line-number'>16</span>
<span class='line-number'>17</span>
<span class='line-number'>18</span>
<span class='line-number'>19</span>
<span class='line-number'>20</span>
<span class='line-number'>21</span>
<span class='line-number'>22</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>input {
</span><span class='line'>  amqp {
</span><span class='line'>    host =&gt; "rabbitmq-server"
</span><span class='line'>    name =&gt; "indexer-queue"
</span><span class='line'>    exchange =&gt; "logstash-exchange"
</span><span class='line'>    key =&gt; "logstash-routing-key"
</span><span class='line'>    exclusive =&gt; false
</span><span class='line'>    durable =&gt; true
</span><span class='line'>    auto_delete =&gt; false
</span><span class='line'>    type =&gt; "logstash-indexer-input"
</span><span class='line'>  }
</span><span class='line'>}
</span><span class='line'>
</span><span class='line'>filter {
</span><span class='line'>  # your filters here
</span><span class='line'>}
</span><span class='line'>
</span><span class='line'>output {
</span><span class='line'>  elasticsearch {
</span><span class='line'>    # your elasticsearch settings here
</span><span class='line'>  }
</span><span class='line'>}</span></code></pre></td></tr></table></div></figure>


<h2>Config explained</h2>

<p>The amqp input:</p>

<h3>name</h3>

<p>This is the name that will be provided to RabbitMQ for the queue. Again, as with exchange, we need a known name. The reason for this is that all of our indexers are going to share a common queue. This will make sense in a moment.</p>

<h3>exchange</h3>

<p>This should match exactly with the name of the exchange that was created before in the host-side config.</p>

<h3>key</h3>

<p>This should, again, match the routing key provided in the host-side configuration exactly. <code>direct</code> exchanges do NOT support wildcard routing keys. By providing a routing key, you are creating a <code>binding</code> in RabbitMQ terms. This <code>binding</code> says &#8220;I want all messages sent to the <code>logstash-exchange</code> with a routing key of <code>logstash-routing-key</code> to be sent to the queue named <code>indexer-queue</code>.&#8221;</p>
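
<p>To make the binding idea concrete, here is a minimal sketch of exact-match routing. It is a toy model in plain Python, not the RabbitMQ or Bunny API: the <code>DirectExchange</code> class and its <code>bind</code> and <code>publish</code> methods are hypothetical names invented for illustration, and a plain list stands in for the queue.</p>

```python
from collections import defaultdict

class DirectExchange:
    """Toy model of a direct exchange: exact-match routing only."""

    def __init__(self, name):
        self.name = name
        self.bindings = defaultdict(list)  # routing key -> bound queues

    def bind(self, queue, key):
        self.bindings[key].append(queue)

    def publish(self, key, body):
        """Deliver to every queue bound with exactly this key; return the count."""
        queues = self.bindings.get(key, [])
        for queue in queues:
            queue.append(body)
        return len(queues)

exchange = DirectExchange("logstash-exchange")
indexer_queue = []  # stands in for the queue named "indexer-queue"
exchange.bind(indexer_queue, "logstash-routing-key")

matched = exchange.publish("logstash-routing-key", "hello")
wildcard = exchange.publish("logstash-*", "ignored")
```

<p>Only the message published with the exact key lands in <code>indexer_queue</code>. The wildcard-looking key matches nothing, which is why both sides of the config must agree on the key character for character.</p>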

<h3>exclusive</h3>

<p>As with the exchange in the host-side config, we&#8217;re going to have multiple workers using this queue. This is another AMQP detail. When you bind a queue to an exchange, a <code>channel</code> is created for the messages to flow across. A single queue can have multiple channels. This is how our worker pool is going to operate.</p>

<p><strong>You do not want a different queue name for each worker despite how weird that sounds</strong></p>

<p>If you give each worker its own queue, then you <strong>WILL</strong> get duplicate messages. It&#8217;s counterintuitive, I know. Just trust me. The way to ensure that multiple consumers don&#8217;t see the same message is to use multiple channels on the same queue.</p>
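
<p>If &#8220;just trust me&#8221; isn&#8217;t enough, the difference can be sketched with a small simulation. This is an illustration in plain Python under simplifying assumptions (perfect round-robin, no acknowledgements, made-up worker names), not a model of RabbitMQ internals.</p>

```python
from itertools import cycle

messages = ["1", "2", "3", "4"]
workers = ["w1", "w2"]

def shared_queue(messages, workers):
    """One queue, many consumers: each message goes to exactly one worker."""
    seen = {w: [] for w in workers}
    for worker, msg in zip(cycle(workers), messages):
        seen[worker].append(msg)
    return seen

def queue_per_worker(messages, workers):
    """One queue per worker, all bound with the same key: every queue gets a copy."""
    return {w: list(messages) for w in workers}

shared = shared_queue(messages, workers)
separate = queue_per_worker(messages, workers)
```

<p>With the shared queue, each worker sees half of the messages. With a queue per worker, both workers see all four, which is exactly the duplication we are trying to avoid.</p>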

<h3>durable</h3>

<p>Same as the exchange declaration, this ensures that the queue will stick around if the broker (the RabbitMQ server) restarts.</p>

<h3>auto_delete</h3>

<p>This is the setting most people miss when trying to ensure no lost messages. By default, RabbitMQ will throw away even durable queues once the last user of the queue disconnects.</p>

<h3>type</h3>

<p>This is the standard logstash requirement for inputs. They must have a <code>type</code> defined. Arbitrary string.</p>

<h1>Sidebar on RabbitMQ message reliability</h1>

<p>Simply put, RabbitMQ makes you jump through hoops to ensure that no message is lost. There&#8217;s a trifecta of settings that you have to have for it to work:</p>

<ul>
<li>Your exchange must be durable with persistent messages</li>
<li>Your queue must be durable</li>
<li>Auto-delete must be disabled</li>
</ul>


<p><strong>EVEN IF YOU DO ALL THESE THINGS, YOU CAN STILL LOSE MESSAGES!</strong></p>

<h2>Order matters</h2>

<p>I know &#8230; you&#8217;re thinking &#8220;What the F&#8212;?&#8221;. There is still a scenario where you can lose messages. It has to do with how you start things up.</p>

<ul>
<li>If you start the exchange side but never start the queue side, messages are dropped on the floor</li>
<li>You can&#8217;t start the queue side without first starting the exchange side</li>
</ul>


<p>While RabbitMQ lets you predeclare exchanges and queues from the command-line, it normally only creates things when someone asks for it. Since exchanges know nothing about the consumption side of the messages (the queues), creating an exchange with all the right settings does NOT create the queue and thus no binding is ever created.</p>

<p>Conversely, you can&#8217;t declare a totally durable queue when there is no exchange in place to bind against.</p>

<p>Follow these rules and you&#8217;ll be okay. You only need to do it once:</p>

<ul>
<li>Start a producer (the host-side logstash agent)</li>
<li>Ensure via <code>rabbitmqctl</code> or the management web interface that the exchange exists</li>
<li>Start one of the consumers (the indexer config)</li>
</ul>


<p>Once the indexer agent has started, you will be good to go. You can shut down the indexers and messages will start piling up. You can shut everything down - RabbitMQ (with backlogged messages), the indexer agent and the host-side agent. When you start RabbitMQ, the queues, exchanges and messages will all still be there. If you start an indexer agent, it will drain the outstanding messages.</p>

<p>However, if you screw the configuration up you&#8217;ll have to delete the exchange and the queue via <code>rabbitmqctl</code> or the management web interface and start over.</p>
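
<p>The startup-order problem can be sketched the same way. The toy <code>Exchange</code> class below is hypothetical and ignores durability and persistence entirely. It only shows why a message published before any queue has been bound has nowhere to go, while a message published after the first indexer has created its queue simply waits.</p>

```python
class Exchange:
    """Toy exchange: it only knows about queues that have been bound to it."""

    def __init__(self):
        self.queues = []
        self.dropped = 0

    def bind(self, queue):
        self.queues.append(queue)

    def publish(self, body):
        if not self.queues:
            self.dropped += 1  # nowhere to route: the message is gone
            return
        for queue in self.queues:
            queue.append(body)

exchange = Exchange()
exchange.publish("sent before any indexer ever started")

indexer_queue = []          # created when the first indexer starts
exchange.bind(indexer_queue)

# Indexers go offline again, but the queue and its binding remain.
exchange.publish("sent while indexers are offline")
```

<p>The first message is counted in <code>exchange.dropped</code>. The second one sits in <code>indexer_queue</code> until a consumer comes back for it.</p>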

<h1>How it looks visually</h1>

<p>There are two plugins you should install with RabbitMQ:</p>

<ul>
<li>rabbitmq_management</li>
<li>rabbitmq_management_visualizer</li>
</ul>


<p>The first will provide a web interface (and HTTP API!) listening on port 55672 of your RabbitMQ server. It provides a really easy way to see messages backlogged, declared exchanges/queues and pretty much everything else. Seeing as it also provides a very nice REST API to everything inside the RabbitMQ server, you&#8217;ll want it anyway if for nothing but monitoring hooks.</p>

<p>The visualizer is an ad-hoc addon that helps you see the flows through the system. It&#8217;s not as pretty as the management web interface proper but it gets the job done.</p>

<h1>Starting it all up</h1>

<p>Now we can start things up</p>

<h2>Producers</h2>

<p>We&#8217;re going to start up our four client side agents. These will create the exchange (or alternately connect to the existing one). If you look at the management interface, you&#8217;ll see four channels established:</p>

<p>Management view:
<img src="http://lusis.github.com/images/posts/load-balancing-logstash-with-amqp/amqp-four-channels.png" alt="amqp-four-channels.png" /></p>

<p>Visualizer view:
<img src="http://lusis.github.com/images/posts/load-balancing-logstash-with-amqp/amqp-four-producers.png" alt="amqp-four-producers.png" /></p>

<p>Remember that until we connect with a consumer configuration (the indexer) messages sent to these exchanges WILL be lost.</p>

<h2>Consumers</h2>

<p>Now we start our indexer configurations - all four of them</p>

<p>Now if we take a peek around the management interface and the visualizer, we start to see some cool stuff.</p>

<p>In the management interface, you&#8217;ll see eight total channels - four for the queue and four for the exchange</p>

<p><img src="http://lusis.github.com/images/posts/load-balancing-logstash-with-amqp/amqp-eight-channels.png" alt="amqp-eight-channels.png" /></p>

<p>If you click on &#8220;Queues&#8221; at the top and then on the entry for our <code>indexer-queue</code>, you&#8217;ll see more details:</p>

<p><img src="http://lusis.github.com/images/posts/load-balancing-logstash-with-amqp/amqp-indexer-queue-details.png" alt="amqp-indexer-queue-details.png" /></p>

<p>But the real visual is in the visualizer tab. Click on it and then click on the <code>indexer-queue</code> on the far right</p>

<p><img src="http://lusis.github.com/images/posts/load-balancing-logstash-with-amqp/amqp-visualizer-detail.png" alt="amqp-visualizer-detail.png" /></p>

<p>You can see the lines showing the flow of messages.</p>

<p>One thing to note about RabbitMQ load balancing: messages are load balanced across CONSUMERS, not QUEUES. There&#8217;s a subtle distinction there from RabbitMQ&#8217;s semantic point of view.</p>

<h2>Testing the message flow</h2>

<p>Over in your terminal window, let&#8217;s send some test messages. For this test, again, I&#8217;m using <code>stdin</code> for my origination and <code>stdout</code> to mimic the ElasticSearch destination.</p>

<p>In my first input window, I&#8217;m going to just type 1 through 4 with a newline after each. This should result in each consumer getting a message round-robin style:</p>

<p><a href="http://lusis.github.com/images/posts/load-balancing-logstash-with-amqp/load-balance-test-1.png"><img src="http://lusis.github.com/images/posts/load-balancing-logstash-with-amqp/load-balance-test-1.png" alt="load-balance-test-1.png" /></a></p>

<p>Now I&#8217;m going to cycle through the input windows and send a single message from each:</p>

<p><a href="http://lusis.github.com/images/posts/load-balancing-logstash-with-amqp/load-balance-test-4.png"><img src="http://lusis.github.com/images/posts/load-balancing-logstash-with-amqp/load-balance-test-4.png" alt="load-balance-test-4.png" /></a></p>

<p>You can see that messages 4-7 were sent round-robin style.</p>

<h2>Testing persistence</h2>

<p>All of this is for naught if we lose messages because our workers are offline. Let&#8217;s shut down all of our workers and send a bunch of messages from each input window:</p>

<p><a href="http://lusis.github.com/images/posts/load-balancing-logstash-with-amqp/workers-offline-terminal.png"><img src="http://lusis.github.com/images/posts/load-balancing-logstash-with-amqp/workers-offline-terminal.png" alt="workers-offline-terminal.png" /></a></p>

<p>We sent two lines of text per window. This amounts to eight log messages that should be queued up for us. Let&#8217;s check the management interface:</p>

<p><img src="http://lusis.github.com/images/posts/load-balancing-logstash-with-amqp/eight-messages-waiting.png" alt="eight-messages-waiting.png" /></p>

<p>Now if we stop rabbitmq entirely and restart it, those messages should still be there (along with the queue and exchanges we created).</p>

<p>Once you&#8217;ve verified that, start one of the workers back up. When it comes fully online, it should drain all of the messages from the queue:</p>

<p><a href="http://lusis.github.com/images/posts/load-balancing-logstash-with-amqp/drained-messages.png"><img src="http://lusis.github.com/images/posts/load-balancing-logstash-with-amqp/drained-messages.png" alt="drained-messages.png" /></a></p>

<p>Yep, there they went. The last two messages you get should be the ones from window 4. This is another basic functionality of message queue software in general. Messages should be delivered in the order in which they were received.</p>

<h1>One last diagram</h1>

<p>Here&#8217;s a flowchart I created with Gliffy to show what the high-level overview of our setup would look like. Hope it helps and feel free to hit me up on freenode irc in the <code>#logstash</code> channel or on <a href="https://twitter.com/lusis">twitter</a>.</p>

<p><a href="http://lusis.github.com/images/posts/load-balancing-logstash-with-amqp/gliffy-overview.png"><img src="http://lusis.github.com/images/posts/load-balancing-logstash-with-amqp/gliffy-overview.png" alt="gliffy-overview.png" /></a></p>

<p><em>This post will eventually make its way into the <a href="http://cookbook.logstash.net">Logstash Cookbook Site</a>.</em></p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Lowtech monitoring with Jenkins]]></title>
    <link href="http://lusis.github.com/blog/2012/01/23/lowtech-monitoring-with-jenkins/"/>
    <updated>2012-01-23T00:07:00-05:00</updated>
    <id>http://lusis.github.com/blog/2012/01/23/lowtech-monitoring-with-jenkins</id>
    <content type="html"><![CDATA[<p>I mentioned briefly in my previous post that I got quite a few people coming up to me after the panel and asking me for advice on monitoring.</p>

<!-- more -->


<p>I tweeted about this scenario not long after it happened but here&#8217;s the gist:</p>

<blockquote><p>I just need something simple to check on the status of a few jobs and run some SQL statements. I&#8217;m a DBA and I can&#8217;t get any help from my ops team.</p></blockquote>


<p>The person who asked this was very friendly and I could sense the frustration in her voice. It frustrates me to no end to hear stories of my tribe being this way to customers.</p>

<p>I thought for a minute because I really wanted to help and the best thing I could think of was Jenkins. Yes, Jenkins.</p>

<h1>Reasoning</h1>

<p>Let&#8217;s look for a minute at what we need from a simple health check system:</p>

<ul>
<li>Performing some task on a given schedule</li>
<li>Ability to run a given command</li>
<li>Reporting on the output of the given command (success/failure)</li>
</ul>
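
<p>Those three requirements boil down to very little code. Here is a rough sketch of the &#8220;run a command, report success or failure&#8221; part, assuming the usual convention that exit status 0 means success. The <code>run_check</code> helper is a made-up name, and the two checks just use the Python interpreter as a stand-in command.</p>

```python
import subprocess
import sys

def run_check(command):
    """Run a command and report (ok, output); ok means exit status 0."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode == 0, result.stdout.strip()

# Two stand-in checks that use the Python interpreter itself as the command.
passing = run_check([sys.executable, "-c", "print('all good')"])
failing = run_check([sys.executable, "-c", "import sys; sys.exit(1)"])
```

<p>What this sketch leaves out is the scheduling, the history and the notifications.</p>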


<p>Now you might think to yourself &#8220;Self, this sure does sound a lot like cron&#8221;. You&#8217;d be right. And that&#8217;s EXACTLY what took me down the Jenkins path. There have been numerous posts about people replacing individual cron jobs with a centralized model based on Jenkins. This makes perfect sense and is something of a holy grail. I clearly remember researching and evaluating batch scheduling products many years ago to essentially do just this. If only Jenkins had been around then.</p>

<h2>Small disclaimer</h2>

<p>While Jenkins is a great low friction way to accomplish this task, it may or may not be scalable in the long run. While Jenkins jobs are defined as XML files and can be managed via an API, it&#8217;s still a bit cumbersome to automate.</p>

<h1>Getting started</h1>

<p>First thing to do is grab the <a href="http://mirrors.jenkins-ci.org/war/latest/jenkins.war">latest Jenkins war</a>. The nice thing about Jenkins is that it ships in such an easy to use format - a self-contained executable war. You can start it very simply with:</p>

<figure class='code'><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>java -jar jenkins.war</span></code></pre></td></tr></table></div></figure>


<p>You should probably click around to get comfortable with the interface. It&#8217;s pretty difficult to screw something up but if you do, just shut down Jenkins, <code>rm -rf ~/.jenkins</code> and start it back up.</p>

<p>Since this post is geared primarily at someone who probably isn&#8217;t familiar with Jenkins, I&#8217;m going to go over a few quick basics and key areas we&#8217;ll be working with:</p>

<h2>Menus</h2>

<p>The menu is the section on the left-hand side. It will change based on your location in the application. If you don&#8217;t always see something you&#8217;re expecting, you can use the breadcrumb navigation to work your way back. Alternately, you can click on the Jenkins logo to get to the main page.</p>

<h3>Main menu</h3>

<p>This is the menu you see from the top-level of the Jenkins interface</p>

<p><img src="http://lusis.github.com/images/posts/lowtech-monitoring-with-jenkins/menu.png" alt="Main interface" /></p>

<h3>Job menu</h3>

<p>This is the menu you see when you are viewing the main page of a job</p>

<p><img src="http://lusis.github.com/images/posts/lowtech-monitoring-with-jenkins/job-menu.png" alt="Job Menu" /></p>

<p>Note the &#8220;Build History&#8221; section at the bottom. This is a list of all builds that have been performed for this job. You can click on a given build to see details about it.</p>

<h3>Build menu</h3>

<p>This menu is visible when you select a specific build from the &#8220;Build History&#8221; menu of a Job page</p>

<p><img src="http://lusis.github.com/images/posts/lowtech-monitoring-with-jenkins/build-menu.png" alt="Build Menu" /></p>

<p>Notice the &#8220;Console Output&#8221; menu option. This will show you the log of what Jenkins did during a build. If you ever have problems with a build, you should come here and look at what happened.</p>

<h3>Auto Refresh</h3>

<p>In the interest of eliminating any confusion, we&#8217;re going to enable &#8220;Auto Refresh&#8221; from the link on the top right:</p>

<p><img src="http://lusis.github.com/images/posts/lowtech-monitoring-with-jenkins/enable-auto-refresh.png" alt="Enable Auto Refresh" /></p>

<h2>Configuration</h2>

<p>For the purposes of this exercise, we won&#8217;t do too much configuration. We&#8217;re going to take the perspective of the person above. We&#8217;ll make a few assumptions though in the interest of expediency:</p>

<ul>
<li>The user has no passphrase on the SSH key. While this is probably not true, it makes this demo easier.</li>
<li>The DB test will be executed locally.</li>
<li>The local environment is some unix-y/linux-y one. The environment for this post was OS X communicating with Linux VMs</li>
</ul>


<p>The key to success here is something called a &#8220;free-style software project&#8221;. This is essentially a blank canvas with very few requirements. I&#8217;m aware that the &#8220;Monitor an external job&#8221; type has been recently added but the steps were a bit too invasive for this particular case.</p>

<h1>Our test case</h1>

<p>Obviously I don&#8217;t have the specifics of what the user wanted checked so I&#8217;m going to extrapolate from her original statement:</p>

<ul>
<li>Run a SQL statement to see if some record is found</li>
<li>Check for a running process</li>
<li>Check a log file for some given string</li>
</ul>


<h2>The test database</h2>

<p>The test database will be MySQL running on a Linux VM. Getting this going is an exercise for the reader, however here is the DDL and test data we&#8217;re using:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
</pre></td><td class='code'><pre><code class='sql'><span class='line'><span class="c1">-- This is just a sample, folks. Yes I know it&#39;s insecure.</span>
</span><span class='line'><span class="k">create</span> <span class="k">database</span> <span class="n">foo_db</span><span class="p">;</span>
</span><span class='line'><span class="n">use</span> <span class="n">foo_db</span><span class="p">;</span>
</span><span class='line'><span class="k">create</span> <span class="k">table</span> <span class="n">jobs</span> <span class="p">(</span> <span class="n">id</span> <span class="nb">int</span> <span class="k">not</span> <span class="k">null</span> <span class="n">auto_increment</span> <span class="k">primary</span> <span class="k">key</span><span class="p">,</span> <span class="n">name</span> <span class="nb">varchar</span><span class="p">(</span><span class="mi">10</span><span class="p">),</span> <span class="n">depth</span> <span class="nb">int</span><span class="p">);</span>
</span><span class='line'><span class="k">insert</span> <span class="k">into</span> <span class="n">jobs</span> <span class="p">(</span><span class="n">name</span><span class="p">,</span> <span class="n">depth</span><span class="p">)</span> <span class="k">values</span> <span class="p">(</span><span class="ss">&quot;job_a&quot;</span><span class="p">,</span> <span class="mi">100</span><span class="p">);</span>
</span><span class='line'><span class="k">insert</span> <span class="k">into</span> <span class="n">jobs</span> <span class="p">(</span><span class="n">name</span><span class="p">,</span> <span class="n">depth</span><span class="p">)</span> <span class="k">values</span> <span class="p">(</span><span class="ss">&quot;job_b&quot;</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
</span><span class='line'><span class="k">insert</span> <span class="k">into</span> <span class="n">jobs</span> <span class="p">(</span><span class="n">name</span><span class="p">,</span> <span class="n">depth</span><span class="p">)</span> <span class="k">values</span> <span class="p">(</span><span class="ss">&quot;job_c&quot;</span><span class="p">,</span> <span class="mi">5</span><span class="p">);</span>
</span><span class='line'><span class="k">grant</span> <span class="k">select</span> <span class="k">on</span> <span class="n">jobs</span> <span class="k">to</span> <span class="s1">&#39;jenkins&#39;</span><span class="o">@</span><span class="s1">&#39;%&#39;</span> <span class="n">IDENTIFIED</span> <span class="k">BY</span> <span class="s1">&#39;password&#39;</span><span class="p">;</span>
</span><span class='line'><span class="n">flush</span> <span class="k">privileges</span><span class="p">;</span>
</span></code></pre></td></tr></table></div></figure>


<h2>First Job</h2>

<p>So we&#8217;ll create a new freestyle job called &#8220;Check FooDB Backlog&#8221;. Click on &#8220;New Job&#8221; from the main menu.</p>

<p><img src="http://lusis.github.com/images/posts/lowtech-monitoring-with-jenkins/create-db-job.png" alt="Create Job" /></p>

<p>Once you&#8217;ve created the job, the screen gets a bit more hectic. We&#8217;re only going to concern ourselves with a few key areas:</p>

<ul>
<li><p>Build Triggers <img class="right" src="http://lusis.github.com/images/posts/lowtech-monitoring-with-jenkins/build-triggers.png"></p></li>
<li><p>Build Steps <img class="right" src="http://lusis.github.com/images/posts/lowtech-monitoring-with-jenkins/build-step.png"></p></li>
<li><p>I&#8217;ll frequently refer to the <code>?</code> icon. <img class="right" src="http://lusis.github.com/images/posts/lowtech-monitoring-with-jenkins/help-icon.png"></p></li>
</ul>


<h3>Scheduling</h3>

<p>Under build triggers, we want to use the &#8220;Build Periodically&#8221; option. The syntax is akin to cron and there are some additional macros for known intervals. As with any Jenkins option, you can click on the <code>?</code> icon to the right of the option for inline help.</p>

<p><img src="http://lusis.github.com/images/posts/lowtech-monitoring-with-jenkins/build-triggers.png" alt="Build Triggers" /></p>

<p>So we&#8217;re going to set up our health check to run every 15 minutes:
<img src="http://lusis.github.com/images/posts/lowtech-monitoring-with-jenkins/build-periodically.png" alt="Build Periodically" /></p>
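
<p>For anyone who hasn&#8217;t written cron entries before, here is a small sketch of what an &#8220;every 15 minutes&#8221; schedule expands to. The schedule string <code>*/15 * * * *</code> is an assumption based on standard cron syntax rather than a copy of the screenshot, and <code>expand_minutes</code> is a made-up helper that only understands the minute field.</p>

```python
def expand_minutes(field):
    """Expand a cron minute field of the form '*', '*/N' or 'N' into minutes."""
    if field == "*":
        return list(range(60))
    if field.startswith("*/"):
        return list(range(0, 60, int(field[2:])))
    return [int(field)]

schedule = "*/15 * * * *"   # assumed: minute hour day-of-month month day-of-week
minutes = expand_minutes(schedule.split()[0])
```

<p>With that schedule the check fires at minutes 0, 15, 30 and 45 of every hour.</p>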

<h3>Defining the Build Step</h3>

<p>Through Jenkins plugins, you can get an insane amount of additional build steps. However, the shipped experience has the stuff we need for now. We&#8217;re going to be using the &#8220;Execute Shell&#8221; option. If you are running Jenkins on Windows, you&#8217;ll want to use the &#8220;Execute Windows Batch command&#8221; instead. You will, of course, need to modify the commands appropriately yourself.</p>

<p><img src="http://lusis.github.com/images/posts/lowtech-monitoring-with-jenkins/build-step.png" alt="Build Step Options" /></p>

<p>Here&#8217;s the body of our build step:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
</pre></td><td class='code'><pre><code class='sh'><span class='line'><span class="c">#!/bin/bash -l</span>
</span><span class='line'>CHECK=`mysql -u jenkins -ppassword -h 192.168.56.101 -BNe <span class="s1">&#39;SELECT COUNT(*) FROM foo_db.jobs WHERE depth &gt;= 100&#39;</span>`
</span><span class='line'>exit ${CHECK}
</span></code></pre></td></tr></table></div></figure>


<p><img src="http://lusis.github.com/images/posts/lowtech-monitoring-with-jenkins/db-build-step.png" alt="DB Build Step" /></p>

<h3>Running the job</h3>

<p>Once you click save, you can click &#8220;Build Now&#8221; on the job menu to give it a test. It should fail:</p>

<p><img src="http://lusis.github.com/images/posts/lowtech-monitoring-with-jenkins/db-failed.png" alt="Failed build" /></p>

<p>Let&#8217;s modify the job so we can see what success looks like. Click on the &#8220;Configure&#8221; link in the build menu and modify your build step. Set the threshold in the query to <code>101</code>. The build should now be blue:</p>

<p><img src="http://lusis.github.com/images/posts/lowtech-monitoring-with-jenkins/db-good.png" alt="Good build" /></p>

<p>This all works very well if you just want to manually inspect the status. However, let&#8217;s take it a step further. Click on the &#8220;Configure&#8221; link from the Job menu. Notice at the bottom of the following screen, there&#8217;s a section called &#8220;Post-build Actions&#8221;. The very last option is &#8220;E-mail Notification&#8221;. You can click the <code>?</code> to see the default behaviour. Check the box and add your email address:</p>

<p><img src="http://lusis.github.com/images/posts/lowtech-monitoring-with-jenkins/email-notification.png" alt="Email Notification" /></p>

<h3>Getting Notified</h3>

<p>Sadly, this isn&#8217;t enough to enable email notifications. You&#8217;ll need to tell Jenkins an SMTP server it can use. Go back to the main menu and click &#8220;Manage Jenkins&#8221;.</p>

<p><img src="http://lusis.github.com/images/posts/lowtech-monitoring-with-jenkins/manage-jenkins.png" alt="Manage Jenkins" /></p>

<p>From here, we&#8217;re going to click &#8220;Configure System&#8221;</p>

<p>Another busy screen! The settings in this section can get you in trouble if you aren&#8217;t careful. The most common problem is people attempting to enable security and inadvertently locking themselves out.
We&#8217;re not worried about that for now. Scroll to the bottom and configure your SMTP server. The settings shown are for Gmail, and you&#8217;ll need to click the &#8220;Advanced&#8221; button to enable additional settings.</p>

<p><img src="http://lusis.github.com/images/posts/lowtech-monitoring-with-jenkins/email-settings.png" alt="Configure Email Servers" /></p>

<p>You can select the last checkbox to test that your settings work.</p>

<p>Once that&#8217;s done, click save. Now we&#8217;re going to rerun the job (go back to the main menu, then click your job from the list to see the &#8220;Build Now&#8221; option in the job menu).
You most likely won&#8217;t get an email because the job is passing. Let&#8217;s configure our job again and set the threshold back to 100. Save the job and click &#8220;Build Now&#8221; again from the job menu.</p>

<p>You should get an email that looks something like this:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
</pre></td><td class='code'><pre><code class='text'><span class='line'>Subject: Build failed in Jenkins: Check FooDB backlog
</span><span class='line'>See &lt;http://localhost:8080/job/Check%20FooDB%20backlog/8/&gt;
</span><span class='line'>
</span><span class='line'>------------------------------------------
</span><span class='line'>Started by user anonymous
</span><span class='line'>Building in workspace &lt;http://localhost:8080/job/Check%20FooDB%20backlog/ws/&gt;
</span><span class='line'>[workspace] $ /bin/bash -l /var/folders/d6/h7dxb_zj49s8xlj91zd3z6fr0000gn/T/hudson1906255485094144268.sh
</span><span class='line'>Build step &#39;Execute shell&#39; marked build as failure
</span></code></pre></td></tr></table></div></figure>


<p>There&#8217;s not much information in there since our job is swallowing the mysql output and using it as the exit code. You can spice up the output however you like by adding <code>echo</code> statements to the build step. Any output from the job will be included in the email. If you change the threshold back to a value that you know will pass, you&#8217;ll get at least one email when the build recovers. Unless the build starts failing again, you won&#8217;t get any more emails.</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
</pre></td><td class='code'><pre><code class='text'><span class='line'>Subject: Jenkins build is back to normal : Check FooDB backlog
</span><span class='line'>See &lt;http://localhost:8080/job/Check%20FooDB%20backlog/9/&gt;
</span></code></pre></td></tr></table></div></figure>
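
<p>As a rough sketch of the <code>echo</code> idea, the build step could print a readable message and return a plain 0 or 1 instead of using the row count itself as the exit code. The function name <code>report_backlog</code> and the message wording are made up for illustration and are not part of the original job. One reason to consider this shape: a shell exit status only keeps the low 8 bits, so a count of exactly 256 would look like a passing build.</p>

```shell
#!/bin/bash
# Hypothetical helper (not from the original job): turn the row count into
# a readable message plus a simple 0/1 status.
report_backlog() {
  local count="$1"
  # Treat anything that is not a plain number as a failure, e.g. when the
  # mysql client could not connect and printed nothing.
  case "$count" in
    ''|*[!0-9]*)
      echo "FAIL: could not read a count (got '$count')"
      return 1
      ;;
  esac
  if [ "$count" -gt 0 ]; then
    echo "FAIL: $count row(s) at or above the depth threshold"
    return 1
  fi
  echo "OK: nothing at or above the depth threshold"
  return 0
}

# In the build step, CHECK would still come from the mysql query shown
# earlier, and the last line of the step would be:
#   report_backlog "$CHECK"
```

<p>Whatever the function prints ends up in the console log, so it also lands in the notification email.</p>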


<h2>Second Job</h2>

<p>So now we&#8217;ve got something handling our DB test. We also needed to check to see if some process was running. Let&#8217;s do a simple one to see if MySQL is running. Let&#8217;s call it &#8220;Check MySQL Running&#8221;. Follow the steps for creating a free-style job but this time we&#8217;re going to create our build step like so:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
</pre></td><td class='code'><pre><code class='sh'><span class='line'><span class="c">#!/bin/bash -l</span>
</span><span class='line'>ssh 192.168.56.101 <span class="s1">&#39;ps -ef&#39;</span> | grep mysqld
</span></code></pre></td></tr></table></div></figure>


<p>Again, we&#8217;re going to assume that SSH keys are set up with no password. We&#8217;re keeping it simple. Just as in the case of the other job, we should get a blue build status.</p>
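
<p>The job passes or fails purely on the exit status of the last command in the pipeline, which here is <code>grep</code>. If you want a friendlier console log, a small wrapper along these lines might help. This is only a sketch; <code>is_running</code> is a hypothetical name, not something Jenkins provides.</p>

```shell
#!/bin/bash
# Hypothetical wrapper: read a process listing on stdin and report whether
# a process name shows up in it.
is_running() {
  local name="$1"
  if grep -q -- "$name"; then
    echo "OK: $name is running"
    return 0
  fi
  echo "FAIL: $name not found in process list"
  return 1
}

# Usage in the build step (same host and ps command as above):
#   ssh 192.168.56.101 'ps -ef' | is_running mysqld
```

<p>Because <code>ps</code> runs on the remote host and the match runs locally, the search never matches itself. If the SSH connection fails, the listing is empty and the check fails too.</p>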

<h2>Third Job</h2>

<p>The third job is the most complex in that we&#8217;re going to need to install a plugin for maximum effect. This will have you jumping around a bit but hopefully you&#8217;re a bit more comfortable navigating by now.
At a high level we&#8217;re going to do the following:</p>

<ul>
<li>Install a new Jenkins plugin</li>
<li>Create a new job</li>
<li>Take note of a new build option</li>
<li>Configure the plugin globally</li>
<li>Enable the plugin in our job</li>
</ul>


<h3>Installing a Plugin</h3>

<p>We&#8217;re going to go back to &#8220;Manage Jenkins&#8221; (accessible from the main menu) but now we&#8217;re going to select &#8220;Manage Plugins&#8221;.</p>

<p><img src="http://lusis.github.com/images/posts/lowtech-monitoring-with-jenkins/plugin-main.png" alt="Plugin Page" /></p>

<p>Once on the plugin screen, click the &#8220;Available&#8221; tab. This part can be overwhelming. It&#8217;s especially confusing since plugins will be listed twice if they fall into multiple categories. However, you only need to mark it once.</p>

<p>The plugin we want is called the &#8220;Log Parser Plugin&#8221;. If you can&#8217;t easily find it, use your browser&#8217;s &#8220;find on page&#8221; (CTRL-F, APPLE-F) to find it.</p>

<p>Check the box and click &#8220;Install without Restart&#8221;. You should see a screen similar to this:</p>

<p><img src="http://lusis.github.com/images/posts/lowtech-monitoring-with-jenkins/plugin-install.png" alt="Plugin Install" /></p>

<h3>Back to the job</h3>

<p>Now let&#8217;s create our final job. Following the same steps as above, create a new job called &#8220;Check DHCP Errors&#8221;. Again, reaching for a contrived case, I&#8217;m going to check my VM&#8217;s syslog to see if it had any errors related to DHCP.</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
</pre></td><td class='code'><pre><code class='sh'><span class='line'><span class="c">#!/bin/bash -l</span>
</span><span class='line'>ssh 192.168.56.101 <span class="s1">&#39;tail -n 5 /var/log/syslog&#39;</span>
</span></code></pre></td></tr></table></div></figure>


<p>Now we could have done this with a grep statement just like above. However, I wanted to show installing plugins, and the &#8220;Log Parser Plugin&#8221; actually offers some more flexible options, understands more than just pass or fail, and can match multiple items without building overly complex flow into your shell step.</p>

<p>You&#8217;ll notice at the bottom we now have an ADDITIONAL option in our &#8220;Post-build Actions&#8221; - <code>Console output (build log) parsing</code>:</p>

<p><img src="http://lusis.github.com/images/posts/lowtech-monitoring-with-jenkins/post-build-log-parser.png" alt="New Option" /></p>

<p>Whenever you install a plugin, where it&#8217;s used depends on what it does. In this case, we&#8217;re doing post processing of the job run log. We can add a third state via this plugin as opposed to just &#8220;Pass&#8221; or &#8220;Fail&#8221; - &#8220;Unstable&#8221;. Before we can enable it, however, we need to give it some parsing rules.</p>

<p>For now leave the option unchecked and click &#8220;Save&#8221;</p>

<h3>Configuring the new plugin</h3>

<p>Go back to the &#8220;Manage Jenkins&#8221; screen (where you set the Email settings). At the bottom, you should now have an option for <code>Console Output Parsing</code>:</p>

<p><img src="http://lusis.github.com/images/posts/lowtech-monitoring-with-jenkins/console-output-parsing.png" alt="Console Output Parsing Configuration" /></p>

<p>Again, anything you configure in this section is GLOBAL. Luckily you can define various rule sets for parsing and apply them individually to jobs. This plugin is a bit complex so you&#8217;ll probably want to look at the <a href="https://wiki.jenkins-ci.org/display/JENKINS/Log+Parser+Plugin">documentation</a>.</p>

<p>We&#8217;re going to create a very basic rules file in <code>/tmp</code> on our LOCAL machine (where Jenkins is running) called <code>jenkins-dhclient-rules</code>:</p>

<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
</pre></td><td class='code'><pre><code class='sh'><span class='line'>warn /^.*dhclient: can<span class="err">&#39;</span>t create .*: No such file or directory<span class="nv">$/</span>
</span><span class='line'>info /^.*dhclient: bound to .*<span class="nv">$/</span>
</span></code></pre></td></tr></table></div></figure>


<p>This is telling the log parser that the following line is a &#8220;warning&#8221;:</p>

<p><code>Jan 23 01:49:52 ubuntu dhclient: can't create /var/lib/dhcp3/dhclient.eth1.leases: No such file or directory</code></p>

<p>and that</p>

<p><code>Jan 23 01:49:52 ubuntu dhclient: bound to 192.168.56.101 -- renewal in 1367 seconds.</code></p>

<p>is informational. These distinctions are handy for the plugin&#8217;s colorized output support.</p>
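
<p>If you want to sanity-check patterns like these before pointing Jenkins at the rules file, something like the following can approximate them locally. It is an assumption-laden sketch: it uses <code>grep -E</code>, which is not the regex engine the plugin uses, and <code>classify_line</code> is a made-up helper name.</p>

```shell
#!/bin/bash
# Rough local approximation of the two rules using grep -E.
# The plugin applies its own regex engine, so treat this as a sanity check only.
classify_line() {
  local line="$1"
  if printf '%s\n' "$line" | grep -Eq "dhclient: can't create .*: No such file or directory$"; then
    echo "warn"
  elif printf '%s\n' "$line" | grep -Eq "dhclient: bound to .*$"; then
    echo "info"
  else
    echo "none"
  fi
}

# Try it against one of the sample syslog lines from above.
classify_line "Jan 23 01:49:52 ubuntu dhclient: bound to 192.168.56.101 -- renewal in 1367 seconds."
```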

<p>Now that we&#8217;ve created that file, under the plugin settings we want to name it and give it the location to the file:</p>

<p><img src="http://lusis.github.com/images/posts/lowtech-monitoring-with-jenkins/console-parsing-plugin.png" alt="Configured Parsing Rules" /></p>

<h3>Back to our job</h3>

<p>Finally!</p>

<p>Let&#8217;s go back to our new job (Check DHCP Errors) and modify it. We want to enable the parsing plugin in the post-build steps. We&#8217;re going to check &#8220;Mark build Unstable&#8221; for warnings and select our rule. Now save the job. The reason we&#8217;re going for warning is that this error is not fatal. Our system still gets an IP address. What we want to do is draw attention to it.</p>

<p><img src="http://lusis.github.com/images/posts/lowtech-monitoring-with-jenkins/console-parsing-plugin-enabled.png" alt="Post Build Parsing" /></p>

<p>Now if you run the job, you&#8217;ll get a yellow ball indicating that the job is unstable. If we were to change the first line of our rule to error and check the appropriate box, the job would have been marked a failure. Something additional this plugin provides is the &#8220;parsed console output&#8221;. If you click on the job detail and select &#8220;Parse Console Output&#8221; from the job menu, you&#8217;ll actually get a nicer way to see exactly what was wrong:</p>

<p><img src="http://lusis.github.com/images/posts/lowtech-monitoring-with-jenkins/parsed-output.png" alt="Parsed Console Output" /></p>

<p>Again, this is a totally contrived example. Obviously we would fail long before the parsing had the host been down.</p>

<h1>Tying it all together</h1>

<p>All of these individual jobs are neat but there&#8217;s an obvious dependency there. We need to be able to SSH to the host, we need mysql to be running and then we want to query it. We don&#8217;t want multiple emails for each failure. We only want the actual failed job to alert us. Let&#8217;s chain these jobs together to match that flow.</p>

<ul>
<li>Under the &#8220;MySQL Running&#8221; and &#8220;FooDB&#8221; jobs, disable the cron schedule. We only want it on the &#8220;DHCP&#8221; job.</li>
<li>Under the DHCP job, we&#8217;re going to select the Post-build step of &#8220;Build other projects&#8221;</li>
<li>Check &#8220;Trigger even if build is unstable&#8221; since we know it&#8217;s going to be unstable.</li>
<li>In the text area, we want to add our &#8220;Check MySQL Running&#8221; job</li>
<li>Under the &#8220;Check MySQL Running&#8221; job, we want to select &#8220;Trigger only if build succeeds&#8221; and set our text area to the &#8220;Check FooDB Backlog&#8221; job.</li>
</ul>


<p>Now if you run the top-level job (Check DHCP Errors), all of the jobs will run. If any fail, the run will stop there and alert you! Since this is now scheduled, every 15 minutes this entire workflow will be checked.</p>
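
<p>If it helps to picture the chain, here is a loose shell analogy of what Jenkins is doing for us. The three check functions are placeholders with hypothetical names (in Jenkins each one is its own job), and the sketch ignores the &#8220;unstable&#8221; state entirely.</p>

```shell
#!/bin/bash
# Stand-in checks. In Jenkins each of these is its own job; the names here
# are hypothetical and the bodies are placeholders.
check_dhcp()    { return 0; }
check_mysql()   { return 1; }   # pretend mysqld is down
check_backlog() { return 0; }

# Run each check in order and stop at the first failure, which roughly
# mirrors "Trigger only if build succeeds" between downstream jobs.
run_chain() {
  local step
  for step in "$@"; do
    if ! "$step"; then
      echo "FAILED: $step (stopping here)"
      return 1
    fi
    echo "passed: $step"
  done
  return 0
}

run_chain check_dhcp check_mysql check_backlog || echo "chain stopped early"
```

<p>Only the step that actually failed reports a failure, which is the same reason the chained jobs send a single email instead of three.</p>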

<h1>Additional plugins and tips</h1>

<p>Jenkins has a boatload of plugins. It&#8217;s worth investigating them to see if they make some given task (like output parsing) easier. Some provide additional notification paths like Jabber or IRC. Others provide additional build step types in specific languages like Groovy or PowerShell. You can also do things like create a &#8220;Parameterized Build&#8221;. This is especially handy for thresholds. There&#8217;s also a very handy SSH plugin that lets you define hosts globally and keys per host. This helps clean up your build steps too.</p>

<p>One plugin that was recommended is the &#8220;Email-ext&#8221; plugin. This allows you to REALLY spice up and configure your email notifications.</p>

<p>There&#8217;s a plugin for checking a web site for some criteria and plugins for starting virtual machines. There are also plugins for creating a radiator view so you can get a nice big dashboard for just checking the state of jobs at a glance.</p>

<p>The key to remember is that Jenkins is an unopinionated build tool. This flexibility lends itself to doing off-the-wall stuff (like being a monitoring system or a cron replacement). The trick is translating the concepts and terminology of building software to something that fits your use case.</p>

<h1>Additional Credits</h1>

<p>I&#8217;d like to thank <a href="https://twitter.com/miller_joe">Joe Miller</a>, <a href="https://twitter.com/ches">Ches Martin</a> and <a href="https://twitter.com/agentdero">R. Tyler Croy</a> for reviewing this post and offering up corrections, tips and advice.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Scale10x Recap]]></title>
    <link href="http://lusis.github.com/blog/2012/01/22/scale10x-recap/"/>
    <updated>2012-01-22T07:02:00-05:00</updated>
    <id>http://lusis.github.com/blog/2012/01/22/scale10x-recap</id>
    <content type="html"><![CDATA[<p>This past week I had the awesome pleasure of participating in my first <a href="http://www.socallinuxexpo.org/">SoCal Linux Expo</a>. As I later discovered, this was the 10th installment of this awesome event (hence the 10x).</p>

<!-- more -->


<h1>The email and enStratus</h1>

<p>I got an email from <a href="https://twitter.com/irabinovitch">Ilan Rabinovitch</a> just as things were going down with me headed to <a href="http://enstratus.com">enStratus</a>. Since the event was going to be right around the time I started, I pretty much put it out of my mind.
Then I realized that <a href="https://twitter.com/botchagalupe">my boss</a> was going to be attending. I figured it wouldn&#8217;t hurt to ask and it was decided that I would shadow John on his trip to enStratus HQ for my mandatory cultural immersion (translation: does the fat redneck own a winter coat?) and on to SCaLE. While I didn&#8217;t make it to Minnesota (diverted to San Jose on business), I did still make it to the conference.</p>

<p>I had an awesome time in San Francisco and San Jose. I got to meet a great bunch of folks and geek out hardcore.</p>

<h1>Monitoring sucks</h1>

<p>Ilan asked me if I would be willing to be on a panel about the whole <a href="https://github.com/monitoringsucks">#monitoringsucks</a> thing. We were able to score a great panel of folks:</p>

<ul>
<li>Simon Jakesch from Zenoss</li>
<li>James Litton from PagerDuty</li>
<li>Jody Mulkey from Shopzilla</li>
</ul>


<p>The event was awesome. I did some minor introductions, got the ball rolling with some questions and then we let the audience take it from there.
The participation was AWESOME. We had great questions from the audience and the feedback I got AFTER the fact was mindblowing. One particular post-panel question is worth a blog post in its own right.</p>

<p>One thing that really stood out was this: People just don&#8217;t know where to start. The landscape is pretty &#8220;cluttered&#8221;. <a href="https://twitter.com/cwebber">Chris Webber</a> brought up a very salient point that I sometimes forget; When we talk about &#8220;monitoring&#8221;, we&#8217;re really talking about multiple things - collection, alerting, visualization, trending and multitudes of other aspects.</p>

<p>I got asked several times in the hallway - &#8220;What should I use?&#8221; or &#8220;What do you think about &lt;foo&gt;?&#8221;. My first response was always &#8220;What are you using now?&#8221;.</p>

<p>I like to think I&#8217;m pretty pragmatic. I love the new shiny. I love pretty graphs. I&#8217;m a technologist. However, I know when to be realistic. My thought process goes something like this:</p>

<h2>Do you have something in place now?</h2>

<h3>Yes</h3>

<p>Why are you looking to switch? Is it unreliable? Is it painful to configure? Basically, if it&#8217;s getting the job done and has relatively minor overhead there&#8217;s no reason to switch.
The pain points for me with monitoring solutions usually come much later. It doesn&#8217;t scale or scaling it is difficult. It doesn&#8217;t provide the visibility I need. It&#8217;s unreliable (usually due to scaling problems).
Until then, use what you&#8217;ve got and be on guard for early signs of problems like check latency going up or missed notifications.</p>

<p>If you have a configuration management solution in place, it probably has native support for configuring Nagios. When you add a new host to your environment, you only need to tell your CM tool to run on your monitoring server. If you&#8217;ve done any sort of logical grouping, you&#8217;ll have the right things monitored quickly.</p>

<h3>No</h3>

<p>If you don&#8217;t have ANYTHING in place, you need to cover two bases pretty quick:</p>

<ul>
<li>Outside-In Checks: is my site up and responding timely?</li>
<li>Stupid stuff: Are my disks filling up? Is my database slave behind?</li>
</ul>


<p>For outside-in checks, use something quick and easy like Pingdom. For the inside checks, don&#8217;t underestimate the power of a cron job. If you want something a bit more packaged, look at <a href="http://mmonit.com">monit</a>. It&#8217;s dead simple and can get you to a safe place.</p>
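
<p>To make the cron job idea concrete, here is a minimal sketch of a &#8220;are my disks filling up?&#8221; check. Everything in it is an assumption for illustration: the function names, the 90 percent limit and the script path in the comment. It stays quiet when things are fine, since cron typically mails whatever a job prints.</p>

```shell
#!/bin/bash
# Hypothetical cron-friendly disk check. usage_percent reads `df -P` output
# on stdin and prints the Capacity column of the first filesystem as a number.
usage_percent() {
  awk 'NR == 2 { gsub("%", "", $5); print $5 }'
}

# check_disk LIMIT: read df output on stdin and complain if usage is at or
# above LIMIT. Prints nothing when usage is below the limit.
check_disk() {
  local limit="$1" used
  used=$(usage_percent)
  if [ "$used" -ge "$limit" ]; then
    echo "WARNING: disk at ${used}% (limit ${limit}%)"
    return 1
  fi
  return 0
}

# If this were saved as a script that ends with:  check_disk "$1"
# a crontab entry might look like this (the path is an assumption):
#   */15 * * * * df -P / | /usr/local/bin/check_disk.sh 90
```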

<h2>A note on visibility</h2>

<p>Monitoring tools are great but many times they fall down when you need to diagnose a problem ex post facto. If you went the simple route, you probably don&#8217;t have any real trending data. This is where many complaints start to come from folks. You end up monitoring the same thing twice - once for alerting systems like Nagios and another time for your visualization, trending and other engines. When you reach this point, start looking at things like Sensu or all-in-one solutions that, while cumbersome and imprecise, use the same collected data - Zenoss, Zabbix, Icinga (originally a fork of Nagios).</p>

<p>The event was recorded (both audio and video). I have no timeframe for when it&#8217;s going to be available, but I&#8217;ll let you know as soon as it&#8217;s up.</p>

<h1>The rest of the conference</h1>

<p>The rest of the conference was epic as well. Being that this was my first time, I didn&#8217;t know what to expect. The thing that most stood out was the number of children. This was probably the most family friendly conference I&#8217;ve ever been to. Encouraging stuff. Plenty of events and in fact an entire track dedicated to children.</p>

<p>I didn&#8217;t get to attend as many talks as I wanted to. While the facility was really nice, the building is like a faraday cage. My phone spent what little battery life it had just trying to get a signal. I spent quite a bit of time running back to my room to charge up. <a href="https://twitter.com/cwebber">Chris Webber</a> totally got me hooked on portable chargers.</p>

<h1>Juju talk</h1>

<p><em>disclaimer: I&#8217;m fully aware that Juju is undergoing heavy active development and is a very young project</em></p>

<p>One of the talks I attended was on <a href="http://juju.ubuntu.com">Juju</a>. I was probably a bit harsh on Juju when it was first announced. The original name was much better and the whole witch doctor thing just doesn&#8217;t sit well with me.</p>

<p>I also hate the tag line - &#8220;DevOps distilled&#8221;. It&#8217;s marketing pure and simple. I have very little tolerance for things that bill themselves as a &#8220;devops tool&#8221; or &#8220;for devops&#8221;.</p>

<p>But more than the name, something about Juju didn&#8217;t feel right. After the talk, something still doesn&#8217;t feel right. I don&#8217;t like pooping all over someone else&#8217;s hard work, so writing this part is tough.</p>

<h2>Where does it fit?</h2>

<p>Right now, I don&#8217;t think Juju even knows where it fits. It&#8217;s got some great ideas and on any other day, I&#8217;d be all over it. The problem is that Juju tries to do too much in some areas and not enough in others.</p>

<p>Parts of Juju are EXACTLY what I see as my primary use case for Noah. The service orchestration is great. The ideas are pretty solid. Juju even uses ZooKeeper under the hood.</p>

<h2>Services not servers</h2>

<p>Everyone knows that I preach the mantra of &#8220;services matter. hosts don&#8217;t&#8221;</p>

<p>The problem is that in an attempt to be the Nagios (unlimited flexibility) of configuration management, it can&#8217;t actually do enough in that area. Because it only concerns itself with services (and the configuration of them), it doesn&#8217;t do enough to manage the host. Just because the end state is &#8220;I&#8217;m serving a web page&#8221; doesn&#8217;t mean you should ignore the host it&#8217;s running on. Since Juju isn&#8217;t designed to deal with that (and actually LACKS any primitives to do it), you&#8217;re left with needing to manage a system in multiple places - once with your CM tool and then again with the charms.</p>

<p>Someone said it best when he described Juju as &#8220;apt for services&#8221;. It&#8217;s quite evident that the same broken mentality that apt takes to managing packages is applied to Juju as well. Charms have upgrade and downgrade steps. They&#8217;re just as complicated too. Not only is there no standard (since charms can be written in any language), the lack of one is actually detrimental. The reason for a common DSL or language like the ones exposed by CM tools is not some academic mental masturbation. It&#8217;s repeatability and efficiency. I can go into a Puppet shop and look at a module and know what it does. I can look at most Chef recipes (outside of ones that might use a custom LWRP) and know what&#8217;s going on.</p>

<p>In the Juju world, a single charm could be written in one spot in Python and another spot in Bash. It pushes too much responsibility to the end user NOT to mess something up. I dare say that idempotence doesn&#8217;t even exist in Juju.</p>

<h2>A fair shake</h2>

<p>Again, I&#8217;m going to do some more playing around with Juju. I think it can meet a critical need for folks but I think they need to revisit what problem they&#8217;re trying to solve. I appreciate the work they&#8217;ve done and I&#8217;m totally excited that orchestration is getting the proper attention. The presenters were fantastic.</p>

<h1>Other stuff</h1>

<p>I attended a really good talk about the history of OpenStack and where it&#8217;s going. It was great. As someone who is working with OpenStack professionally now (and had just dealt with some of its warts not 3 days beforehand), I found it very valuable. Also congrats to the speaker, <a href="https://twitter.com/anotherjesse">Jesse Andrews</a>, on the birth of his first child!</p>

<p>I managed to make it to Brendan Gregg&#8217;s talk as well. If you ever have the opportunity to hear him speak, you should take it. While I&#8217;m not a SmartOS user, the talk was really not about that. I walked out with some amazing insight on how smart people troubleshoot performance problems. Very well done.</p>

<h1>The hallway track</h1>

<p>Of course the real value in any conference is the hallway track: the chance to interact with your peers. I met so many smart people (some twice because I suck at remembering faces at first - sorry!). I got to chat with folks like C. Flores, Jason Cook, Sean O&#8217;Meara, Chris Webber, Dave Rawks, Matt Ray, Matt Silvey and so many others that I can&#8217;t keep straight in my head. Everyone was awesome and I hope that you were able to get as much out of me as I got out of you.</p>

<p>Thanks again to Ilan for the invitation and for running such an amazing conference.</p>

<p>Also, little known made-up fact: Lusis is Tagalog for &#8220;He who eats with both hands&#8221;&#8230;..</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[2011-in-review]]></title>
    <link href="http://lusis.github.com/blog/2012/01/02/2011-in-review/"/>
    <updated>2012-01-02T15:40:00-05:00</updated>
    <id>http://lusis.github.com/blog/2012/01/02/2011-in-review</id>
    <content type="html"><![CDATA[<p>The holidays are a busy time for me. I was hoping to get this written before the end of the year but it didn&#8217;t happen.</p>

<!-- more -->


<p>I can say, without a doubt, that 2011 has been the most awesome year both professionally and personally in my life. And I have pretty much everyone else to thank for it.</p>

<h1>Some special shoutouts</h1>

<p>There are so many folks to thank for this year. I owe so many beers that I can&#8217;t keep count. I can&#8217;t possibly thank everyone but I want to throw a few special shoutouts to folks.</p>

<h2>My wife</h2>

<p>Why she puts up with my shit, I&#8217;ll never know. Needless to say, without her this year would have been radically different. She managed two toddlers by herself on almost every trip I took. She&#8217;s been nothing but encouraging and she also helps keep me grounded by reminding me what&#8217;s important.</p>

<h2><a href="http://twitter.com/patrickdebois">Patrick DeBois</a></h2>

<p>Thanks for giving me the opportunity to help with the DevOps Days events. Through the events I&#8217;ve met some of the most amazing people in the world. The DevOps community is a wonderful group of folks and I would never have met half of them if Patrick hadn&#8217;t given me the opportunity to participate.</p>

<h2><a href="http://verticalacuity.com">Vertical Acuity</a></h2>

<p>While I&#8217;m sad to be leaving friends behind, VA was amazing in letting me travel so much. Not only that but they trusted and valued my opinion on so many things. Whoever takes my place will be lucky to work with such an awesome group of folks.</p>

<h2><a href="http://twitter.com/botchagalupe">John Willis</a></h2>

<p>Not only for being a good friend but for giving me an opportunity to work with him at enStratus.</p>

<h2><a href="http://twitter.com/puppetmasterd">Luke Kanies</a>, <a href="http://twitter.com/kartar">James Turnbull</a>, <a href="http://twitter.com/cruzfox">Jose Palafox</a> and the Puppet Labs crew</h2>

<p>Puppet Labs gets double the thanks - for giving me the opportunity to talk about my project at PuppetConf and also for sponsoring me to travel to Goteborg and speak. The whole crew over there is amazing.</p>

<h2><a href="http://twitter.com/damonedwards">Damon Edwards</a>, <a href="http://twitter.com/alexhonor">Alex Honor</a> and <a href="http://www.dtosolutions.com">DTO Solutions</a></h2>

<p>I&#8217;m grateful to DTO for giving me the opportunity to attend Velocity and letting me be a booth babe. Damon and Alex both have been forces of awesome for the DevOps community.</p>

<h2><a href="http://twitter.com/potus98">John Christian</a></h2>

<p>John asked me early on to help with the Atlanta DevOps meetups and I&#8217;m glad he did. He stands alone in the corporate bullshit world of the financial services industry. He taught me a lot and I want to thank him for it.</p>

<h2><a href="http://twitter.com/lnxchk">Mandi Walls</a></h2>

<p>For being pretty much awesome by listening to my ranting, letting me bounce ideas off her. And for being my sister from another mother.</p>

<h2><a href="http://twitter.com/schisamo">Seth Chisamore</a></h2>

<p>For the various lunches, talks, introductions and meetup involvement. Local folks rock. Seth rocks.</p>

<h2><a href="http://twitter.com/jordansissel">Jordan Sissel</a></h2>

<p>For being awesome, down to earth and not an asshole. And for all the code. And for giving me the honor of contributing to SysAdvent.</p>

<h2><a href="http://twitter.com/kelseyhightower">Kelsey Hightower</a></h2>

<p>For helping me navigate my foray into the world of Python (technically that was a few years ago). Also for showing folks how to just get shit done.</p>

<h2>Everyone else</h2>

<p>I can&#8217;t possibly fit everyone, so here&#8217;s an abbreviated list of folks, in no specific order, who have impacted me this year off the top of my head. If I leave you off, please don&#8217;t take offense. I&#8217;m shooting from the hip here.</p>

<p><a href="http://twitter.com/bradleyktaylor">Bradley Taylor</a>, <a href="http://twitter.com/wfarr">Will Farrington</a>, <a href="http://twitter.com/coreyhaines">Corey Haines</a>, <a href="http://twitter.com/dysinger">Tim Dysinger</a>, <a href="http://twitter.com/miller_joe">Joe Miller</a>, <a href="http://twitter.com/roidrage">Mathias Meyer</a>, <a href="http://twitter.com/vvuksan">Vladimir Vuksan</a>, <a href="http://twitter.com/adamhjk">Adam Jacob</a>, <a href="http://twitter.com/portertech">Sean Porter</a>, <a href="http://twitter.com/bascule">Tony Arcieri</a>, <a href="http://twitter.com/ripienaar">R.I. Pienaar</a>, <a href="http://twitter.com/adamfblahblah">Adam Fletcher</a>, <a href="http://twitter.com/anthonygoddard">Anthony Goddard</a>, <a href="http://twitter.com/williamsjoe">Joe Williams</a>, <a href="http://twitter.com/boorad">Brad Anderson</a>, Cat Muecke (alas, Cat does not tweet!), <a href="http://twitter.com/harlanbarnes">Harlan Barnes</a>, <a href="http://twitter.com/geemus">Wesley Beary</a>, <a href="http://twitter.com/mitchellh">Mitchell Hashimoto</a>, <a href="http://twitter.com/wayneeseguin">Wayne Seguin</a>, <a href="http://twitter.com/kallistec">Dan DeLeo</a>, <a href="http://twitter.com/jtimberman">Josh Timberman</a>, <a href="http://twitter.com/kantrn">Noah Kantrowitz</a>, <a href="http://twitter.com/littleidea">Andrew Clay Schafer</a>, <a href="http://twitter.com/markimbriaco">Mark Imbriaco</a>, <a href="http://twitter.com/lordcope">Stephen Nelson-Smith</a>, <a href="http://twitter.com/garethr">Gareth Rushgrove</a>, <a href="http://twitter.com/ianmeyer">Ian Meyer</a>, <a href="http://twitter.com/f3ew">Devdas Bhagat</a>, <a href="http://twitter.com/actionjack">Martin Jackson</a>, <a href="http://twitter.com/mleinart">Michael Leinartas</a>, <a href="http://twitter.com/KrisBuytaert">Kris Buytaert</a>, <a href="http://twitter.com/solarce">Brandon Burton</a>, <a href="http://twitter.com/altobey">Al Tobey</a>, <a 
href="http://twitter.com/matthew_jones">Matthew Jones</a>, <a href="http://twitter.com/builddoctor">Julian Simpson</a>, <a href="http://twitter.com/macros">Jason Cook</a>, <a href="http://twitter.com/jiboumans">Jos Boumans</a>, <a href="http://twitter.com/susanpotter">Susan Potter</a>, <a href="http://twitter.com/thommay">Thom May</a>, <a href="http://twitter.com/kit_plummer">Kit Plummer</a>, <a href="http://twitter.com/sascha_d">Sascha Bates</a>, <a href="http://twitter.com/unclebobmartin">Bob Martin</a>, <a href="http://twitter.com/bdha">Bryan Horstmann-Allen</a>, <a href="http://twitter.com/benjaminws">Benjamin W. Smith</a>, <a href="http://twitter.com/ches">Ches Martin</a>, <a href="http://twitter.com/obfuscurity">Jason Dixon</a>, <a href="http://twitter.com/philiph">Phil Hollenback</a>, <a href="http://twitter.com/rockpapergoat">Nate St. Germain</a>, <a href="http://twitter.com/ohlol">Scott Smith</a>, <a href="http://twitter.com/seancribbs">Sean Cribbs</a>, <a href="http://twitter.com/argv0">Andy Gross</a>, <a href="http://twitter.com/benr">Ben Rockwood</a>, <a href="http://twitter.com/jamesc_000">James Casey</a>, <a href="http://twitter.com/lhazlewood">Les Hazlewood</a>, <a href="http://twitter.com/aditzel">Allan Ditzel</a>, <a href="http://twitter.com/mariusducea">Marius Ducea</a>, <a href="http://twitter.com/noahcampbell">Noah Campbell</a>, <a href="http://twitter.com/timanglade">Tim Anglade</a>, <a href="http://twitter.com/atmos">Corey Donohoe</a>, <a href="http://twitter.com/standaloneSA">Matt Simmons</a>, <a href="http://twitter.com/ernestmueller">Ernest Mueller</a>, <a href="http://twitter.com/auxesis">Lindsay Holmwood</a>, <a href="http://twitter.com/redbluemagenta">Christian Paredes</a>, <a href="http://twitter.com/_masterzen_">Brice Figureau</a>, <a href="http://twitter.com/griggheo">Grig Gheorghiu</a>, <a href="http://twitter.com/dje">Darrin Eden</a>, <a href="http://twitter.com/kimchy">Shay Banon</a>, <a 
href="http://twitter.com/ramonvanalteren">Ramon Van Alteren</a> and so many others.</p>

<h1>Software that changed my world</h1>

<p>I also wanted to give a shoutout to a few projects that pretty much changed how I thought about the software world around me.</p>

<h2><a href="http://elasticsearch.org">ElasticSearch</a></h2>

<p>ElasticSearch has been, bar none, in my top two amazing things the past year. Having first heard about it via Logstash, when I started digging in it blew my mind. The one thing that amazed me most about ES was the Zen discovery. It&#8217;s like the first time you heard about consistent hashing. It&#8217;s one of those things that makes you say &#8220;how the fuck did I not think of this first?&#8221;. The other thing that I find awesome is that ES not only makes scaling up painless but scaling DOWN (which is the hard part) is just as easy. As a sysadmin, ElasticSearch has been the most pleasant bit of infrastructure I&#8217;ve ever had the pleasure of standing up.</p>

<h2><a href="http://zeromq.org">0mq</a></h2>

<p>The other thing that amazed me this year was 0mq. 0mq essentially makes the difficult and next to impossible things possible. I am not lying when I say that every project I have floating around in my head is either built around or has a perfect spot for 0mq. Along with ElasticSearch, it has fundamentally changed how I think about software, infrastructure and more.</p>

<h2><a href="http://zookeeper.apache.org">Apache ZooKeeper</a></h2>

<p>While I&#8217;m not a fan of ZooKeeper on several levels, it would be wrong to totally ignore it. For the longest time, ZK stood alone in what it provided. People are building amazing things with it and it inspired me to write Noah.</p>

<h2><a href="http://logstash.net">Logstash</a></h2>

<p>At first I was pretty dismissive of Logstash. Mainly because I didn&#8217;t have a real need. Then I started digging in and realized that Logstash is only tangentially about logs. Logstash is kind of what you always wanted a pipe to be. Arbitrary input, arbitrary filtering, arbitrary output. The use cases for logstash are so much greater when you stop thinking about logs and start thinking about moving data.</p>

<h2><a href="http://erlang.org">Erlang</a>, <a href="http://www.erlang.org/doc/design_principles/users_guide.html">OTP</a> and <a href="http://basho.com">Riak</a></h2>

<p>While my Erlang only extends to passable reading, via Riak, I found a desire to learn more about it. The biggest thing that stuck in my head and also changed the way I think is the Actor model. For the first time that I can remember, a concept made perfect sense to me. While learning Erlang in earnest is a goal for 2012, I think about how simple and understandable the Actor model was.</p>

<h2><a href="http://celluloid.github.com">Celluloid</a></h2>

<p>Following up on the Actor model, I have mad respect for Tony Arcieri. If you look back at his projects, they all follow a similar theme: improving the concurrency story on Ruby. He&#8217;s tenacious and passionate about something that most people would have given up on (or, if they&#8217;re arrogant dickfaces, laughed at) by now. Celluloid brings some amazing functionality to Ruby inspired by Erlang and OTP. I find myself defaulting to it whenever I need to even think about concurrency in Ruby. Even outside of that, it encourages good behaviour with threads and reduces the chances you&#8217;ll fuck something up. The companion project, DCell, is something I&#8217;m itching to work with as well.</p>

<h2><a href="http://codeascraft.etsy.com/2011/02/15/measure-anything-measure-everything/">Statsd</a> and <a href="http://graphite.wikidot.com/">Graphite</a></h2>

<p>If this past year was about anything, it was about metrics. Lots and lots of metrics. Etsy pushed out statsd and brought metrics collection to the masses. Coda Hale gave an amazing talk on metrics and released the code to back it up. Shooting in the dark sucks. You need numbers. Collect ALL the metrics.</p>

<h1>Final Thoughts</h1>

<p>The world of open source and the community around devops is amazing. I learned from and met so many people in 2011. I&#8217;m hoping that 2012 is the year that I can give back to them in some small way. Thanks to everyone for making this past year amazing.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Github Trolling for Fun and Profit]]></title>
    <link href="http://lusis.github.com/blog/2011/11/22/github-trolling-for-fun-and-profit/"/>
    <updated>2011-11-22T21:43:00-05:00</updated>
    <id>http://lusis.github.com/blog/2011/11/22/github-trolling-for-fun-and-profit</id>
    <content type="html"><![CDATA[<p>Last Friday was a pretty crappy day.</p>

<!-- more -->


<p>I&#8217;m a fairly <em>active</em> Twitter user.</p>

<p>Patrick has joked that it&#8217;s as if I have Twitter wired directly to my brain. It&#8217;s not far from the truth.
I like to engage people and normally Twitter is a great medium for engaging folks. Unfortunately, the message size limit makes Twitter an imperfect medium for involved discussions.</p>

<p>I know better but sometimes I forget.</p>

<p>Anyway, last Friday I realized near the end of the day that I had pretty much gone off the rails. If I wasn&#8217;t bitching about Maven and Java, I was involved in random discussions about the SaltStack project. Combine that with normal inane bullshit and I somehow managed to pull off 60+ tweets. It took a comment by <code>roidrage</code> on IRC to point out that I really needed to calm down.</p>

<p>With that, I declared communication blackout for the weekend. I decided to go to happy hour and spend the weekend just having fun with the family. It was awesome. So sorry for the sheer number of people I managed to piss off on Friday.</p>

<h1>Trolling Github</h1>

<p>One of the things I also decided to do was not stress about making time to hack. I knew that if I got working on one of my projects, I would totally stay distracted thinking about it.</p>

<p>So I went trolling. On Github.</p>

<p>I was just poking around Github, when I saw in my feed that some commits were done to the <a href="https://github.com/imatix/zguide">user&#8217;s guide for ZeroMQ</a>. Because I have a serious geek woody for ZeroMQ, I got wrapped up reading the guide and looking at some of the more advanced examples. I&#8217;ve got some ideas I want to implement in Ark and Noah that involved 0mq so I figured it would be time well spent.</p>

<p>Now if anyone has bothered to read the <a href="http://zguide.zeromq.org/">zguide</a>, you&#8217;ll know that one of the BEST parts is the code samples. Seriously. They have examples for all of the architectures in almost every language. I don&#8217;t know a single goddamn person who knows <a href="http://en.wikipedia.org/wiki/HaXe">Haxe</a>, but there are examples in the guide for Haxe. You can see an example of what I&#8217;m talking about <a href="http://zguide.zeromq.org/page:all#Divide-and-Conquer">here</a>.</p>

<p>Notice at the bottom the list of examples for languages. If you mouse over the last entry, many times you&#8217;ll get multiples highlighted. This means that chunk of highlighted languages doesn&#8217;t have any examples written.</p>

<p>I noticed that quite a few of the advanced ones didn&#8217;t have Ruby versions. I started back at the beginning of the guide until I found the first one that didn&#8217;t have a Ruby example - the <code>interrupt</code> example.</p>

<h1>Challenge Accepted</h1>

<p>So here I am - resolved not to work on any of my own projects and knowing that I didn&#8217;t have time to get involved with something TOO heavy. I decided to fork the guide and start adding missing Ruby examples.</p>

<p>Now I only got two examples done the entire weekend. This mainly revolved around how limited my time was but also around getting REALLY comfortable with <a href="https://github.com/chuckremes/ffi-rzmq">ffi-rzmq</a>. I wanted to make sure that the examples I wrote had the right mix of idiomatic Ruby and yet were explicit enough for someone who didn&#8217;t know the specifics of <code>ffi-rzmq</code>.</p>

<p>One that I really struggled with was this one:</p>

<p><a href="https://github.com/imatix/zguide/commit/4c231d1023819152813fad09a45458bd33cb02a9">https://github.com/imatix/zguide/commit/4c231d1023819152813fad09a45458bd33cb02a9</a></p>

<p>If you get familiar with the zguide, you&#8217;ll see a lot of references to <code>zhelpers</code>. It&#8217;s really just a bunch of boilerplate code that helps keep the actual examples to a nice consumable chunk size. There was not a <code>zhelpers</code> for the Ruby examples. I looked at the others to get an idea of what kinds of things were in there. In relation to the <code>identity</code> examples, there was a dump helper that just dumped the contents of a message. If you look at the <a href="https://github.com/imatix/zguide/blob/master/examples/Python/zhelpers.py">Python</a> and <a href="https://github.com/imatix/zguide/blob/master/examples/C/zhelpers.h">C</a> examples for <code>dump</code>, you&#8217;ll see how they pull the identity of the message out. An interesting comparison is how the <a href="https://github.com/imatix/zguide/blob/master/examples/Scala/utils.scala">Scala</a> version of <code>dump</code> works.</p>

<p>Instead of focusing on duplicating the strategy employed by the C and Python versions, I went with something that fit how <code>ffi-rzmq</code> works a bit more. I realized that the point was not the content of the helpers so much as the end result, showing that 0mq would generate an identity for a message if one wasn&#8217;t explicitly provided.</p>

<p>I&#8217;m quite sure that at some point, ZMQ::Message objects will get an attribute accessor to simply return the identity. Right now the code base is under a bit of a refactor.</p>

<h1>Call to Action</h1>

<p>I really want to encourage others to do something like this. No pressure. Just troll Github. Look at some projects that are interesting. Look at the open issues. Fork, fix a bug or two and make some pull requests. After that, go on your merry way. No obligations. At worst, you&#8217;ve spent some time sharpening your skills. At best, however, you&#8217;ve made a lasting contribution.</p>

<p>And shit, it doesn&#8217;t even have to be a code contribution. If you can grok the project well enough, add some wiki pages.</p>

<p>I dunno. Github just has this amazingly easy flow for contribution. Do unto others and all that.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Deploy ALL the Things]]></title>
    <link href="http://lusis.github.com/blog/2011/10/18/deploy-all-the-things/"/>
    <updated>2011-10-18T06:59:00-04:00</updated>
    <id>http://lusis.github.com/blog/2011/10/18/deploy-all-the-things</id>
    <content type="html"><![CDATA[<p><em>This is part 2 in a post on deployment strategies. The previous post is located <a href="http://blog.lusis.org/blog/2011/10/18/rollbacks-and-other-deployment-myths/">here</a></em></p>

<p>My previous post covered some of the annoying excuses and complaints that people like to use when discussing deployments. The big take away should have been the following:</p>

<ul>
<li>The risk associated with deploying new code is not in the deploy itself but everything you did up to that point.</li>
<li>The way to make deploying new code less risky is to do it more often, not less.</li>
<li>Create a culture and environment that enables and encourages small, frequent releases.</li>
<li>Everything fails. Embrace failure.</li>
<li>Make deploys trivial, automated and tolerant of failure.</li>
</ul>


<p>I want to make one thing perfectly clear. I&#8217;ve said this several times before. You can get 90% of the way to a fully automated environment, never go that last 10% and still be better off than you were before. I understand that people have regulations, requirements and other things that prevent a fully automated system. You don&#8217;t ever have to flip that switch but you should strive to get as close as possible.</p>

<!--more-->


<h1>Understanding the role of operations</h1>

<p>Operations is an interesting word. Inside the field of IT it means something completely different than it does everywhere else in the business world. <a href="http://en.wikipedia.org/wiki/Business_operations">According to Wikipedia</a>:</p>

<blockquote><p>Business operations encompasses three fundamental management imperatives that collectively aim to maximize value harvested from business assets</p>

<ul>
<li><p>Generate recurring income</p></li>
<li><p>Increase the value of the business assets</p></li>
<li><p>Secure the income and value of the business</p></li>
</ul>
</blockquote>

<p>IT operations traditionally does nothing in that regard. Instead IT operations has become about cock blocking and being greybearded gatekeepers who always say &#8220;No&#8221; regardless of the question. We shunt the responsibility off to the development staff and then, in some sick game of &#8216;fuck you&#8217;, we do all we can to prevent the code from going live. This is unsustainable; counter-productive; and in a random twist of fate, self destructive.</p>

<p>One thing I&#8217;ve always tried to get my operations and sysadmin peers to understand is that we are fundamentally a cost center. Unless we are in the business of managing systems for profit, we provide no direct benefit to the company. This is one of the reasons I&#8217;m so gung-ho on automation. <a href="https://twitter.com/botchagalupe">John Willis</a> really resonated with me in the first Devops Cafe podcast when he talked about the 80/20 split. Traditionally operations staff spends 80% of its time dealing with bullshit fire-fighting muck and 20% actually providing value to the business. The idea that we can flip that and become contributing members of our respective companies is amazing.</p>

<p>Don&#8217;t worry. I&#8217;ll address development down below but I felt it was important to set my perspective down before going any further.</p>

<h1>Technical Debt and Risk Management</h1>

<p>Glancing back to my list of take-aways from the last post, I make a pretty bold (to some people) statement. When I say that deploy risk is not the deploy itself but everything up to that point, I&#8217;m talking about technical debt.</p>

<p>Technical debt takes many forms and is the result of both conscious, deliberate choices as well as unintended side-effects. Some examples of that are:</p>

<ul>
<li>Lack of or insufficient testing and associated</li>
<li>Overreliance on time consuming manual processes</li>
<li>Shortcuts to meet deadlines - both artificial and real</li>
<li>Violation of the 10-minute maxim</li>
<li>Technological choices</li>
<li>Cultural choices</li>
<li>Fiscal limitations</li>
</ul>


<p>All of these things can lead to technical debt - the accumulation of dead bodies in the room as a byproduct of how we work. At best, someone at least acknowledges they exist. At worst, we stock up on clothespins, pinch our nostrils shut and hope no one notices the stench. Let&#8217;s address a couple of foundational things before we get into the fun stuff.</p>

<h2>Testing</h2>

<p>Test coverage is one of the easiest ways to manage risk in software development. One of the first things to go in a pinch is testing. Even that assumes that testing was actually a priority at some point. I&#8217;m not going to harp on things like 100% code coverage. As I said previously, humans tend to overcompensate. Test coverage is also, however, one of the easiest places to get your head above water. If you don&#8217;t have a culture of commitment to testing, it&#8217;s hard but not impossible to get started. You don&#8217;t have to shut down development for a week.</p>

<ol>
<li>Start by having a commitment to write tests for any new code going forth.</li>
<li>As bugs arise in untested code, make a test case for the bug a requirement to close the bug.</li>
<li>Find a small victory in existing code. Create test coverage for low hanging fruit.</li>
<li>Plan for a schedule to cover any remaining code</li>
</ol>
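
<p>To make step 2 a bit more concrete, here&#8217;s a minimal sketch of what &#8220;a test case for the bug&#8221; might look like. Everything in it is made up for illustration: <code>split_host_port</code> is a hypothetical helper and the bug report (a bare hostname blowing up) is invented. The point is only that the bug doesn&#8217;t get closed until a test like this exists and passes.</p>

```python
import unittest


def split_host_port(value, default_port=80):
    """Hypothetical helper: split 'host' or 'host:port' into (host, port)."""
    host, sep, port = value.partition(":")
    return host, int(port) if sep else default_port


class SplitHostPortRegressionTest(unittest.TestCase):
    def test_missing_port_uses_default(self):
        # Made-up bug report: a bare hostname used to raise ValueError.
        self.assertEqual(split_host_port("db1"), ("db1", 80))

    def test_explicit_port_still_works(self):
        # Guard the behaviour that already worked before the fix.
        self.assertEqual(split_host_port("db1:5432"), ("db1", 5432))
```

<p>Saved in a file, this runs with <code>python -m unittest</code>, which is the kind of thing a CI server can run on every check-in.</p>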


<p>The key here is baby steps. Small victories. Think Fezzik in &#8216;The Princess Bride&#8217; - <em>&#8220;I want you to feel like you&#8217;re winning&#8221;.</em></p>

<p>Testing is one of the foundations you have to have to reach deploy nirvana. System administrators have a big responsibility here. Running tests has to be painless, unobtrusive and performant. You should absolutely stand up something like Jenkins that actually runs your test suite on check-in. As that test suite grows, you&#8217;ll need to be able to provide the capacity to grow with it. That&#8217;s where the next point can be so important.</p>

<h2>Manual processes</h2>

<p>Just as testing is a foundation on the code side, operations has a commensurate responsibility to reduce the number of human hands involved with creating systems. We humans, despite the amazing potential that our brains provide, are generally stupid. We make mistakes. Repeatability is not something we&#8217;re good at. Some sort of automated and repeatable configuration management strategy needs to be adopted. As with testing, you can make some amazing progress in baby steps by introducing some sort of proper configuration management going forward. I don&#8217;t recommend you attempt to retrofit complex automation on top of existing systems beyond some basics. Otherwise you&#8217;ll be spending too much time trying to differentiate between &#8220;legacy&#8221; and &#8220;new&#8221; server roles. If you are using some sort of virtualization or cloud provider like EC2, this is a no brainer. It&#8217;s obviously a bit harder when you&#8217;re using physical hardware but still doable.</p>

<p>Have you ever played the little travel puzzle game where you have a grid of moving squares? The idea is the same. You need just ONE empty system that you can work with to automate. Pick your simplest server role such as an apache webserver. Using something like Puppet or Chef, write the &#8216;code&#8217; that will create that role. Don&#8217;t get bogged down in the fancy stuff the tools provide. Keep it simple at first. Once you think you&#8217;ve got it all worked out, blow the server away and apply that code from bootstrap. Red, green, refactor. Once you&#8217;re comfortable that you can reprovision that server from bare metal, move it into service. Make sure you have your own set of &#8216;test cases&#8217; that ensure the provisioned state is the correct one. This will become important later on.</p>

<p>Take whatever server it&#8217;s replacing and do the same for the next role. When I came on board with my company I spent many useless cycles trying to retrofit an automation process on top of existing systems. In the end, I opted to take a few small victories (using Chef in this case):</p>

<ol>
<li>Create a base role that is non-destructive to existing configuration and systems. In my case, this was managing yum repos and user accounts.</li>
<li>Pick the &#8216;simplest&#8217; component in our infrastructure and start creating a role for it.</li>
<li>Spin up a new EC2 instance and test the role over and over until it works.</li>
<li>Terminate the instance and apply the role from the top on a fresh one.</li>
<li>Replace the old instances providing that role with the new ones and move to the next role.</li>
</ol>


<p>Using this strategy, I was able to replace all of our legacy instances for the first and second tiers of our stack in a couple of months&#8217; time. We are now at the point where, assuming Amazon plays nice with instance creation, we can have any role in those tiers recreated at a moment&#8217;s notice. Again, this will directly contribute to how we mitigate risk later on.</p>

<h2>10 minute maxim</h2>

<p>I came up with this from first principles so I&#8217;m sure there&#8217;s a better name for it. The idea is simply this:</p>

<blockquote><p>Any problem that has to be solved in five minutes can be afforded 10 minutes to think about the solution.</p></blockquote>

<p>System Administrators often pride ourselves on how cleverly and quickly we can solve a problem. It&#8217;s good for our egos. It&#8217;s not, however, good for our company. Take a little extra time and consider the longer term impact of what solution you&#8217;re about to do. Step away from the desk and move. Consult peers. Many times I&#8217;ve come to the conclusion that my first instinct was the right one. However more often than not, I&#8217;ve come across another solution that would create less technical debt for us to deal with later.</p>

<p>A corollary to this is the decision to &#8216;fix it or kick it&#8217;. That is &#8216;Do we spend an unpredictable amount of time trying to solve some obscure issue or do we simply recreate the instance providing the service from our above configuration management&#8217;. If you&#8217;ve gone through the previous step, you should have amazing code confidence in your infrastructure. This is very important to have with Amazon EC2 where you can have an instance perform worse over time thanks to the wonders of oversubscription and noisy neighbors.</p>

<p>Fuck that. Provision a new instance and run your smoke tests (I/O test for instance). If the smoke tests fail, throw it away and start a new one. It&#8217;s amazing the freedom of movement afforded by being able to describe your infrastructure as code.</p>
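
<p>A rough sketch of that &#8216;kick it&#8217; loop, assuming you already have some way to provision, smoke test and terminate an instance. The three callables here are placeholders, not any real cloud API; plug in whatever your tooling gives you.</p>

```python
def provision_healthy_instance(provision, smoke_test, terminate, max_attempts=3):
    """Provision instances until one passes the smoke test; discard failures.

    provision, smoke_test and terminate are hypothetical callables supplied
    by the caller (for example, thin wrappers around your cloud tooling).
    """
    for _ in range(max_attempts):
        instance = provision()
        if smoke_test(instance):
            return instance
        # Kick it: don't burn time debugging a bad instance, just drop it.
        terminate(instance)
    raise RuntimeError(f"no healthy instance after {max_attempts} attempts")
```

<p>The cap on attempts is a design choice in this sketch: if three fresh instances in a row fail the same smoke test, the problem is probably not a noisy neighbor and a human should look.</p>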

<h1>Getting back to deploys</h1>

<p>I would say that without the above, most of the stuff from here on out is pretty pointless. While you <strong>CAN</strong> do automated and non-offhour deploys without the above, you&#8217;re really setting yourself up for failure. Whether it&#8217;s a system change or new code, you need to be able to ensure that some baseline criteria can be met. Now that we&#8217;ve got the foundation though, we can build on it and finally adopt some distinct strategies for releases.</p>

<h1>Building on the foundation</h1>

<p>The next areas you need to work on are a bit harder.</p>

<h2>Metrics and monitoring</h2>

<p>Shooting in the dark sucks. Without some sort of baseline metric, you can&#8217;t authoritatively say whether or not a deploy was &#8216;good&#8217;. If it moves, graph it. If it moves, monitor it. You need to leverage systems like <a href="https://github.com/etsy/statsd">statsd</a> (available in non-node.js flavors as well) that can accept metrics easily from your application and make them available in the amazing <a href="http://graphite.wikidot.com/">graphite</a>.</p>

<p>The key here is that getting those metrics should be as frictionless as possible. To fully understand this, watch <a href="http://pivotallabs.com/talks/139-metrics-metrics-everywhere">this presentation from Coda Hale of Yammer</a>. Coda has also created a kick-ass metrics library for the JVM and others have duplicated his efforts in their respective languages.</p>
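
<p>To give a feel for how little friction is involved, here&#8217;s a bare-bones sketch of a statsd client using nothing but the standard library. It relies on the statsd wire format (<code>name:value|type</code> over UDP, port 8125 by default); the class and metric names are my own invention for illustration, and a real client library will do more than this.</p>

```python
import socket
import time
from contextlib import contextmanager


def format_metric(name, value, metric_type):
    """Build a statsd datagram of the form name:value|type."""
    return f"{name}:{value}|{metric_type}".encode("ascii")


class StatsClient:
    """Illustrative fire-and-forget statsd client (not a real library)."""

    def __init__(self, host="127.0.0.1", port=8125):
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def _send(self, payload):
        try:
            self.sock.sendto(payload, self.addr)
        except OSError:
            pass  # metrics must never take the application down

    def incr(self, name, count=1):
        self._send(format_metric(name, count, "c"))  # counter

    def timing(self, name, millis):
        self._send(format_metric(name, millis, "ms"))  # timer

    @contextmanager
    def timer(self, name):
        # Time the wrapped block and report it even if the block raises.
        start = time.monotonic()
        try:
            yield
        finally:
            self.timing(name, int((time.monotonic() - start) * 1000))
```

<p>Usage is a one-liner at the call site, something like <code>stats.incr("signup.completed")</code> or <code>with stats.timer("db.query"):</code> around the code you care about. UDP means the app doesn&#8217;t wait on the metrics server and doesn&#8217;t care if it&#8217;s down.</p>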

<h2>Backwards compatibility</h2>

<p>You need to adopt a culture of backwards compatibility between releases. This is not Microsoft levels we&#8217;re talking about. This affects interim releases. As soon as you have upgraded all the components, you clean up the cruft and move on. This is critical to getting to zero/near-zero downtime deploys.</p>

<h2>Reduce interdependencies</h2>

<p>I won&#8217;t go into the whole SOA buzzword bingo game here except to say that treating your internal systems like a third party vendor can have some benefits. You don&#8217;t need to isolate the teams but you need to stop with shit like RMI. Have an established and versioned interface between your components. If component A needs to make a REST call to component B, upgrades to the B API should be versioned. A needs version 1 of B&#8217;s api. Meanwhile new component C can use version 2 of the API.</p>
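
<p>A toy sketch of what a versioned interface between components can look like. The routing table, paths and payloads are all hypothetical; the only point is that version 2 of B&#8217;s API shows up next to version 1 instead of replacing it, so A keeps working while C uses the new shape.</p>

```python
HANDLERS = {}


def route(version, name):
    """Register a handler under an explicit (version, name) pair."""
    def register(func):
        HANDLERS[(version, name)] = func
        return func
    return register


@route("v1", "user")
def user_v1(user_id):
    # Component A depends on this shape; it stays exactly as it is.
    return {"id": user_id, "name": "Ada Lovelace"}


@route("v2", "user")
def user_v2(user_id):
    # New component C gets the new shape without breaking A.
    return {"id": user_id, "first_name": "Ada", "last_name": "Lovelace"}


def handle(path, *args):
    """Dispatch paths like '/v1/user' to the matching handler."""
    _, version, name = path.split("/")
    return HANDLERS[(version, name)](*args)
```

<p>Once nobody calls <code>/v1/user</code> anymore, you delete the v1 handler and move on.</p>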

<h2>Automation as a default</h2>

<p>While this ties a lot into the testing and configuration management topics, the real goal here is that you adopt a posture of automation by default. The reason for this should be clear in <a href="http://www.startuplessonslearned.com/2009/07/how-to-conduct-five-whys-root-cause.html">Eric Ries&#8217; &#8220;Five Whys&#8221; post</a>:</p>

<blockquote><p>Five Whys will often pierce the illusion of separate departments and discover the human problems that lurk beneath the surface of supposedly technical problems.</p></blockquote>

<p>One of the best ways to eliminate human problems is to take the human out of the problem. Machines are very good at doing things repeatedly and doing them the same way every single time. Humans are not good at this. Let the machines do it.</p>

<h1>Deploy Strategies</h1>

<p>Here are some of the key strategies that I (and others) have found effective for making deploys a non issue.</p>

<h2>Dark Launches</h2>

<p>The idea here is that for any new code path you insert in the system, you actually exercise it before it goes live. Let&#8217;s face it, you can never REALLY simulate production traffic. The only way to truly test if code is performant or not is to get it out there. With a dark launch, you&#8217;re still making new database calls but using your handy dandy metrics culture above, you now know how performant it really is. When it gets to acceptable levels, make it visible to the user.</p>
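
<p>Here&#8217;s one way a dark launch wrapper could look, as a sketch under some assumptions: <code>old_path</code> and <code>new_path</code> are whatever functions implement the two code paths, and <code>record_timing</code> is a stand-in for your metrics client. The metric names are invented.</p>

```python
import time


def dark_launch(old_path, new_path, record_timing, *args):
    """Serve the old code path; exercise the new one in the shadows."""
    result = old_path(*args)
    start = time.monotonic()
    try:
        new_path(*args)  # result is discarded; users never see it
        outcome = "ok"
    except Exception:
        outcome = "error"  # a broken dark path must not break the request
    elapsed_ms = (time.monotonic() - start) * 1000
    record_timing(f"dark.new_path.{outcome}", elapsed_ms)
    return result
```

<p>This version runs the new path inline, so it adds latency to the request. Whether you do that or push the work to a background worker is a judgment call that depends on the service.</p>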

<h2>Feature flags</h2>

<p>Feature flags are amazing and one of the neat tricks that people who perform frequent deploys leverage. The idea is that you make aspects of your application into a series of toggles. In the event that some feature is causing issues, you can simply disable it through some admin panel or API call. Not only does this let you degrade gracefully but it also provides for a way to A/B test new features. With a bit more thought put into the process, you can enable a new feature for a subset of users. People love to feel special. Being a part of something like a &#8220;beta&#8221; channel is an awesome way to build advocates of your system.</p>
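
<p>A minimal sketch of the toggle idea, including the &#8220;subset of users&#8221; part. The class and the hashing scheme are my own illustration rather than any particular library: each user gets hashed into a stable bucket from 0 to 99, and a flag is on for a user when the rollout percentage is above their bucket.</p>

```python
import hashlib


class FeatureFlags:
    """Illustrative in-memory feature flags with percentage rollout."""

    def __init__(self):
        self._rollout = {}  # flag name -> percentage of users (0-100)

    def set_rollout(self, flag, percent):
        # 0 is the kill switch, 100 means everyone gets the feature.
        self._rollout[flag] = max(0, min(100, percent))

    def enabled(self, flag, user_id):
        percent = self._rollout.get(flag, 0)  # unknown flags are off
        digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 100  # stable bucket per (flag, user)
        return percent > bucket
```

<p>Because the bucket is derived from the flag name and user id, a given user keeps seeing the same answer as you ramp from 5% to 50% to 100%. In real life the rollout table would live somewhere you can change at runtime, which is what makes it a kill switch.</p>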

<h2>Smoke testing at startup</h2>

<p>This is one that I really like. The idea is simply that your application has a basic set of &#8216;tests&#8217; it runs through at startup. If any of those tests fail, the code is rolled back.</p>

<p>Now this is where someone will call me a hypocrite because I said you should and can never really roll back. You&#8217;re partially right. In my mind, however, it&#8217;s not the same thing. I consider code deployed once it&#8217;s taken production traffic. Up until that point, it&#8217;s just &#8216;pre-work&#8217; essentially. Let&#8217;s take a random API service in our stack. I&#8217;m assuming you have two API servers in this case.</p>

<ul>
<li>Take one out of service</li>
<li>Deploy code</li>
<li>Smoke tests run</li>
<li>If smoke tests fail, stop new code and start old code</li>
<li>If smoke tests pass, start sending production traffic to server</li>
<li>If acceptable, push to other server</li>
<li>profit!</li>
</ul>
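
<p>The steps above can be sketched roughly as follows. The <code>Server</code> class and the single smoke test are hypothetical placeholders. A real version would talk to your load balancer and your process supervisor instead of flipping attributes.</p>

```python
class Server:
    """Hypothetical stand-in for one API server behind a load balancer."""

    def __init__(self, name, version):
        self.name = name
        self.version = version
        self.in_service = True


def deploy_with_smoke_tests(server, new_version, smoke_tests):
    """Deploy to one out-of-service server; restore old code if a test fails.

    Returns True if the server came back on the new version.
    """
    old_version = server.version
    server.in_service = False         # take it out of service
    server.version = new_version      # deploy code
    passed = all(test(server) for test in smoke_tests)
    if not passed:
        server.version = old_version  # stop new code, start old code
    # Back in service: new code if the tests passed, old code otherwise.
    server.in_service = True
    return passed


def version_is_sane(server):
    """Placeholder smoke test; a real one would hit the running service."""
    return not server.version.endswith("-broken")


servers = [Server("api1", "1.0"), Server("api2", "1.0")]
for server in servers:
    # Stop at the first failure so the other server keeps the old code.
    if not deploy_with_smoke_tests(server, "1.1", [version_is_sane]):
        break
```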


<p>Now you might see a bit of a gotcha there. I intentionally left out a step. This is a bit different than how shops like Wealthfront do it. They actually <strong>DO</strong> roll back if production monitoring fails. My preference is to use something similar to <a href="https://github.com/igrigorik/em-proxy">em-proxy</a> to do a sort of mini-dark launch before actually turning it over to end-users. You don&#8217;t have to actually use em-proxy. You could write your own or use something like RabbitMQ or other messaging system. This doesn&#8217;t always work depending on the service the component is providing but it does provide another level of &#8216;comfort&#8217;.</p>

<p>Of course this only works if you maintain backwards compatibility.</p>

<h2>Backwards Compatibility</h2>

<p>This is probably the hardest of all to accomplish. You may be limited by your technology stack or even some poor decision made years ago. Backwards compatibility also applies to more than just your software stack. This is pretty much a critical component of managing database changes with zero downtime.</p>

<h3>Code related</h3>

<p>Your code needs to understand &#8216;versions&#8217; of what it needs. If you leverage some internal API, you need to maintain support for an older version of that API until all users are upgraded. Always be deprecating and NEVER EVER redefine what something means. Don&#8217;t change a property or setting that means &#8220;This is my database server hostname&#8221; to &#8220;This is my mail server hostname&#8221;. Instead create a new property, start using it and remove the old one in a future release. Don&#8217;t laugh, I&#8217;ve seen this done. As much as I have frustrations with Java, constructor overloading is a good example of backwards compatibility.</p>
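
<p>One way to deprecate a setting without redefining it, sketched in Python. The helper <code>get_setting</code> and the key names in the usage note are invented for this example. The new key wins when it is present, and the old key keeps working (with a warning) until a later release removes it.</p>

```python
import warnings


def get_setting(config, new_key, old_key=None, default=None):
    """Read new_key, falling back to a deprecated old_key if present."""
    if new_key in config:
        return config[new_key]
    if old_key is not None and old_key in config:
        warnings.warn(
            f"{old_key} is deprecated; use {new_key} instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return config[old_key]
    return default
```

<p>A caller that still ships the old property, for example <code>get_setting({"db.host": "db1"}, "database.hostname", "db.host")</code>, gets the old value plus a deprecation warning instead of a surprise.</p>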

<h3>Database related</h3>

<p>Specifically as it relates to databases, consider some of the following approaches:</p>

<ul>
<li>Never perform backwards incompatible schema changes.</li>
<li>Don&#8217;t perform ALTERs on really large tables. Create a new table that updated systems use and copy on read to the new table. Migrate older records in the background.</li>
<li>Consider isolating access to a given table via a service. Instead of giving all your applications access to the &#8216;users&#8217; table, create a users service that does that.</li>
<li>Start exercising code paths to new tables early by leveraging dark launches</li>
</ul>
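
<p>The &#8220;copy on read&#8221; idea from the list above might look something like this. It uses an in-memory SQLite database purely so the sketch is self-contained, and the table and column names are made up.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users_old (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE users_new (id INTEGER PRIMARY KEY, name TEXT, email TEXT)"
)
conn.execute("INSERT INTO users_old VALUES (1, 'alice')")


def get_user(conn, user_id):
    """Read from the new table, copying the row over on first read."""
    row = conn.execute(
        "SELECT id, name, email FROM users_new WHERE id = ?", (user_id,)
    ).fetchone()
    if row is not None:
        return row
    old = conn.execute(
        "SELECT id, name FROM users_old WHERE id = ?", (user_id,)
    ).fetchone()
    if old is None:
        return None
    # Copy on read: the old row is migrated the first time it is requested.
    conn.execute(
        "INSERT INTO users_new (id, name, email) VALUES (?, ?, NULL)", old
    )
    conn.commit()
    return (old[0], old[1], None)
```

<p>Rows nobody ever reads still need the background migration job. Copy on read only takes care of the hot records.</p>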


<p>Some of these techniques would make Codd spin in his grave.</p>

<p>We&#8217;re undergoing a similar situation right now. We originally stored a large amount of &#8216;blob&#8217; data in Voldemort. This was a bit perplexing as we were already leveraging S3 for similar data. To migrate that data (several hundred gigs) we took the following approach:</p>

<ul>
<li>Deploy a minor release that writes any new data to both Voldemort and S3.</li>
<li>Start a &#8216;copy&#8217; job in the background to migrate older data</li>
<li>Continue to migrate data</li>
<li>When the migration is finished, we&#8217;ll deploy a new release that uses S3 exclusively</li>
<li>Profit (because we get to terminate a few m1.large EC2 instances)</li>
</ul>
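
<p>Stripped down to its shape, that migration looks roughly like the sketch below. Plain dictionaries stand in for Voldemort and S3, every function name is invented for this example, and the read fallback is an assumption about how you would serve reads while the copy job is still running.</p>

```python
# Plain dicts stand in for the two stores; real clients would replace them.
old_store = {"a": b"blob-a", "b": b"blob-b"}
new_store = {}


def write_blob(key, data):
    """During the migration, every new write goes to both stores."""
    old_store[key] = data
    new_store[key] = data


def read_blob(key):
    """Prefer the new store; fall back to the old one until migration ends."""
    if key in new_store:
        return new_store[key]
    return old_store[key]


def backfill(batch_size=100):
    """Background job: copy up to batch_size unmigrated keys; return the count."""
    pending = [k for k in old_store if k not in new_store][:batch_size]
    for key in pending:
        # setdefault avoids clobbering a newer value written in the meantime.
        new_store.setdefault(key, old_store[key])
    return len(pending)


def migration_complete():
    """True once every old key exists in the new store."""
    return all(k in new_store for k in old_store)
```

<p>Once <code>migration_complete()</code> holds, the follow-up release can drop the old store from both the write and the read path.</p>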


<p>This worked really well in this scenario. These aren&#8217;t new techniques either. Essentially, we&#8217;re doing a variation of a two-phase commit.</p>

<p>Now you might think that all this backwards compatibility creates cruft. It does. Again, this is something that requires a cultural shift. When things are no longer needed, you need to clean up the code. This prevents bloat and makes understanding it down the road so much easier.</p>

<h1>Swinging like a boss</h1>

<p>Here&#8217;s another real world example:</p>

<p>Our code base originally used a home-rolled load balancing technique to communicate with one of our internal services. Additionally, all communication happened over RPC using Hessian. Eventually this became untenable and we decided to move to RabbitMQ and JSON. This was a pretty major change but at face value, we should have been able to manage with dual interfaces on the provider of the service. That didn&#8217;t happen.</p>

<p>You see, to be able to use the RabbitMQ libraries, we had to upgrade our version of Spring. Again, not a big deal. However our version of Hessian was so old that the version of Hessian we would have to use with the new version of Spring was backwards incompatible. This is yak shaving at its finest, folks. So basically we had to upgrade 5 different components all at once just to get to where we wanted and NEEDED to be for the long term.</p>

<p>Since I had already finished coding our Chef cookbooks, we went down the path of duplicating our entire front-end stack. What made this even remotely possible was the fact that we were using configuration management in the first place. Here&#8217;s how it went down:</p>

<ul>
<li>Duplicate the various components in a new Chef environment called &#8216;prodB&#8217;</li>
<li>Push new code to these new components</li>
<li>Add the new components to the ELBs and internal load balancers for a very short 5-10 minute window. Sort of a mini-A/B test.</li>
<li>Check the logs for anything that stands out. Validate the expected behavior of the new systems. This also gave us a chance to &#8216;load-test&#8217; our RabbitMQ setup. We actually did catch a few small bugs this way.</li>
</ul>


<p>Once we were sure that things looked good, we swung all the traffic to the new instances and pulled the old ones out. We never even bothered to upgrade the old instances. We just shut them down.</p>

<p>Obviously this technique doesn&#8217;t work for everyone. If you&#8217;re using physical hardware, it&#8217;s much more costly and harder to pull off. Even internally, however, you can leverage virtualization to make these kinds of things possible.</p>

<h2>Bitrot</h2>

<p>What should be the real story in this is that bitrot happens. Don&#8217;t slack on keeping third-party libraries current. If a third-party library introduces a breaking change and it affects more than one part of your stack, you probably have a bit too tight of a coupling between resources.</p>

<h1>Wrap up/Take away</h1>

<p>This post came out longer than I had planned. I hope it&#8217;s provided you with some information and things to consider. Companies of all shapes, markets and sizes are doing continuous deployment, zero downtime deploys and all sorts of things that we never considered possible. Look at companies like Wealthfront, IMVU, Flickr and Etsy. Google around for phrases like &#8216;continuous deployment&#8217; and &#8216;continuous delivery&#8217;.</p>

<p>I&#8217;m also painfully aware that even with these tricks, some folks simply cannot do them. There are many segments of industry that might not even allow for this. That doesn&#8217;t mean that some of these ideas can&#8217;t be implemented on a smaller scale.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Rollbacks and other deployment myths]]></title>
    <link href="http://lusis.github.com/blog/2011/10/18/rollbacks-and-other-deployment-myths/"/>
    <updated>2011-10-18T00:35:00-04:00</updated>
    <id>http://lusis.github.com/blog/2011/10/18/rollbacks-and-other-deployment-myths</id>
    <content type="html"><![CDATA[<p>I came across an interesting post today via HN. I&#8217;m surprised (only moderately) that I missed it the first time around since this is right up my alley:</p>

<p><a href="http://briancrescimanno.com/2011/09/29/why-are-you-still-deploying-overnight/">Why are you still deploying overnight?</a></p>

<p>I thought this post was particularly apropos for several reasons. I just got back from DevOpsDays EU <strong>AND</strong> I&#8217;m currently in the process of refactoring our deploy process.</p>

<p>I&#8217;m breaking this up into two parts since it&#8217;s a big topic. The first one will cover the more &#8220;theoretical&#8221; aspects of the issue while the second will provide more concrete information.</p>

<!--more-->


<h1>Myths, Lies and other bullshit</h1>

<p>Before I go any further, we should probably clear up a few things.</p>

<p>Understand, first and foremost, that I&#8217;m no spring chicken in this business. I&#8217;ve worked in what we now call web operations and I&#8217;ve worked in traditional financial environments (multiple places). If it CAN go wrong, it has gone wrong for me. Shit, I&#8217;ve been the guy who dictated that we had to deploy after hours.</p>

<p>Also, this is not a &#8220;tell you what to do&#8221; post.</p>

<p>So what are some of the myths and other crap people like to pull out when having these discussions?</p>

<ul>
<li>Change == Risk</li>
<li>Deploys are risky</li>
<li>Rollbacks</li>
<li>Nothing fails</li>
<li>SLAs</li>
</ul>


<p>There&#8217;s plenty more but these are some of the key ones that I hear.</p>

<h2>Change is change</h2>

<p>There is nothing inherent in change that makes it risky, dangerous or anything more than change. Change is neither good nor bad. It&#8217;s just change.</p>

<p>The idea that change has a risk associated with it is entirely a human construct. We have this false assumption that if we don&#8217;t change something then nothing can go wrong.
At first blush that would make sense, right? If it ain&#8217;t broke, don&#8217;t fix it.</p>

<p>Why do we think this? It&#8217;s mainly because we&#8217;re captives to our own fears. We changed something once, somewhere, and everything went tango uniform. The first reaction after a bad experience is never to do whatever caused that bad experience again. This makes sense in quite a few cases. Touch fire, get burned. Don&#8217;t touch fire, don&#8217;t get burned!</p>

<p>However this pain response tends to bleed over into other areas. We deployed code one time that took the site down. We changed something and bad things happened. Engage overcompensation - We should never change anything.</p>

<h2>Deploys are not risky</h2>

<p>As with change, a deploy (a change in and of itself) is not inherently risky. Is there a risk associated with a deploy? Yes but understand that the risk associated with pushing out new code is the culmination of everything you&#8217;ve done up to that point.</p>

<p>I can&#8217;t even begin to count the number of ways that a deploy or release has gone wrong for me. Configuration settings were missed. Code didn&#8217;t run properly. The wrong code was deployed. You name it, I&#8217;ve probably seen it.</p>

<p>The correct response to this is <strong>NOT</strong> to stop doing deploys, do them off-hours or do them less often. Again with the overcompensation.</p>

<p>The correct way to handle deployment problems is to do MORE deploys. Practice. Paraphrasing myself here from an HN comment:</p>

<blockquote><p>Make deploys trivial, automated and tolerant to failure because everything fails.</p></blockquote>

<p>&#8220;Release early, release often&#8221; isn&#8217;t just about time to market. The way to reduce risk is not to add more risky behavior (introducing more vectors for shit to go wrong). The way to reduce the risk associated with deploys is to break them into smaller chunks.</p>

<p>You need to stop thinking like Subversion and start thinking like Git.</p>

<p>One of the reasons people don&#8217;t feel comfortable performing deploys during the day is because deploys are such a big deal. You&#8217;ve got to make deploys a non-issue.</p>

<h2>Rollbacks are a myth</h2>

<p>Yes, it&#8217;s true. You can never roll back. You can&#8217;t go back in time. You can fake it but understand that it&#8217;s typically riskier to roll back than to roll forward. Always be rolling forward.</p>

<p>The difficulty in rolling forward is that it requires a shift in how you think. You need to create a culture and environment that enables, encourages and allows for small and frequent changes.</p>

<h2>Everything fails. Embrace failure.</h2>

<p>It amazes me that in this day and age people seem to think you can prevent failure. Not only can you not prevent it, you should embrace it. Learn to accept that failure will happen.  Often spending your effort decreasing MTTR (mean time to recovery) as opposed to increasing MTBF (mean time between failures) is a much better investment. Failure is not a question of &#8216;if&#8217; but a question of &#8216;when&#8217;.</p>
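
<p>To put rough numbers on the MTTR versus MTBF trade-off, here is a small Python sketch using the standard steady-state availability formula, MTBF / (MTBF + MTTR). The hour figures are invented purely for illustration.</p>

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: uptime as a fraction of total time."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical service: fails about every 500 hours, takes 1 hour to recover.
baseline = availability(mtbf_hours=500, mttr_hours=1)

# Two ways to improve it: fail half as often, or recover twice as fast.
double_mtbf = availability(mtbf_hours=1000, mttr_hours=1)
halve_mttr = availability(mtbf_hours=500, mttr_hours=0.5)
```

<p>Both improvements land on the same availability (1000/1001). The difference is cost: getting recovery from an hour down to thirty minutes is usually a matter of automation and practice, while doubling the time between failures means finding and removing half of everything that can go wrong.</p>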

<p>Systems should be designed to be tolerant of failure. This is not easy, it&#8217;s not always cheap and it can be quite painful at first. Failure sucks. Especially as systems administrators, we tend to personalize a failure in our systems as a personal failure.</p>

<p>The best way to deal with failure is to make failure a non-issue. If it&#8217;s going to happen and you can&#8217;t prevent it, why stress over trying to prevent it? This absolutely doesn&#8217;t mean that you shouldn&#8217;t do some level of due diligence. I&#8217;m not saying that you should give up. What I&#8217;m saying is that you should design a robust system that handles failures gracefully and returns you to service as quickly as possible. It&#8217;s called fault TOLERANCE for a reason.</p>

<h2>SLAs are not about servers</h2>

<p>SLAs are in general fairly silly things. Before you get all twisted and ranty, let me clarify. SLAs have value but the majority of that value is to the provider of the SLA and not the person on the other end. SLAs are a lot like backup policies.</p>

<p>Look at it this way. I&#8217;m giving you an SLA for four nines of availability. That allows me to take around 50 minutes of downtime a year. Of course you assume that means 50 minutes spread over a year. What you fail to realize is that I can take all 50 minutes at once and still meet my SLA. Taking 50 minutes at one time has far more impact than taking ten 5-minute outages. What&#8217;s worse is that I can take that downtime not only in one chunk but also at the worst possible time for you.</p>
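
<p>The arithmetic behind that figure, as a quick Python sketch. The function name is made up, and the calculation assumes a 365-day year.</p>

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_budget_minutes(nines):
    """Allowed downtime per year for an availability of N nines."""
    unavailability = 10 ** -nines
    return MINUTES_PER_YEAR * unavailability


budget = downtime_budget_minutes(4)  # four nines = 99.99%
```

<p>Four nines works out to about 52.6 minutes a year, which is where the &#8220;around 50 minutes&#8221; comes from. Nothing in that number says how the minutes are distributed.</p>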

<p>The other side of SLAs is that we tend to equate them with servers as opposed to services. The SLA is a <em>Service Level Agreement</em>. Not a <em>Server Level Agreement</em>. Services are what matters, not servers.</p>

<p>When you start to equate an SLA with a specific server, you&#8217;ve already lost.</p>

<h1>Wrap up and part 2</h1>

<p>As I said, this topic is too big to fit in one post. The next post will go into specifics about strategies and techniques that will hopefully give you ideas on how to make deploys less painful.</p>
]]></content>
  </entry>
  
</feed>
