blog dot lusis

development, operations and everything in between

So You Wanna Go On-prem Do Ya


If you run a successful SaaS platform, at some point someone is going to come to you with the question: can I run it myself? If you’re considering offering a private version of your SaaS, this post might be for you.

At this point, I’ve worked for a few companies that are SaaS vendors. In most cases, we’ve also done some sort of private install.

Drivers for private installs

There are a lot of reasons people want private installs. Depending on the nature of your product, some of them are even legitimate.

  • If your product is in the dependency chain of your customer’s release management
  • If your product is security-related in nature
  • If your product is a core dependency for normal operation of a product
  • If your product is metered based on storage or transit OR if your product has high storage or transit requirements
  • If your product has stability issues due to multi-tenancy
  • Customer requirements around tenancy
  • Customer requirements around visibility of data
  • Varying degrees of industry requirements
  • Control of upgrade cycle
  • Payment restrictions for monthly services

These are all valid reasons that your customers will want a dedicated installation of your product. I’ve even been on the asking side.

That doesn’t mean, however, that you should DO them.

Types of private installs

Before we go any further, it’s worth categorizing the types of “on-prem” installs:

  • managed hosted
  • fully on-premise

For the purposes of this discussion, we’ll define them like so:

  • managed hosted: You host, manage and run the environment entirely on behalf of the customer. You are still a SaaS but reduced to single tenancy for a given customer
  • fully on-premise: Your software is installed fully on the customer’s environment - hardware, location, cloud, whatever

Both of these options “suck” for you to varying degrees.

So why do them?

If they suck so much, why do them?

Money.

The companies that want private installs are the kind of companies that can pay well for them. The problem is that the companies that can pay well for private installs are, in general, the hardest companies to do business with - “enterprises” (and, worse, government).

Enterprise sidebar

Let’s clarify. I don’t hate enterprises. I know it’s easy to think that from my general commentary. The problem isn’t enterprises (what does that term even mean these days anyway?)

The problem is how those organizations operate at an IT level in general, how they interact with “vendors” and what they expect.

To most enterprises, software is a purchase with a tangible deliverable. A thing that has a license and a lifecycle.

Even a semi-enlightened enterprise like <insert large Texas-based computer manufacturer here> still has trouble with SaaS products, the billing therein and how to categorize them. Sometimes it’s just “easier” to go on-prem than fight your AP department.

Don’t do it IF you can

Startups aren’t easy. It’s hard to say no to a customer who’s willing to “pay anything” for a private install of your SaaS platform. However, as I get to below, if you have ANY options at all, you shouldn’t do it. I can’t speak for your situation or your board though. You may not be given a choice. It may be the only way to get a cash influx to keep things going.

Just be aware that by doing it, you could be signing on for a cost and time sink for the next 12 months.

So what do you need to know before diving into this madness?

Operational Excellence/Support Excellence

I’ve used this phrase A LOT in the past. I’ve referred to it when talking about OpenStack, private PaaS and other things. What is important to understand is that when you move to any sort of on-prem model, your operational and support problems are multiplied.

If you have not yet leveled up your existing operational experience, then you might not get the chance.

Take a look at your current SaaS platform you run. Now break down the following:

  • Release cadence
  • Level of automation
  • Failure modes
  • Scaling model
  • Footprint
  • Support model
  • Logging
  • Monitoring
  • Supportability

Now multiply this by 10. By 50. By 100.

Now find your security blanket and go weep softly in the corner. Realize that the multiplier is actually an unbounded number.

And this is just a small fraction of the things you have to consider.

Are you having trouble staffing/keeping ops people? Well you’re going to need a LOT more.

Changing focus

When you move to the on-prem world, depending on which model you go with, you move from being a SaaS provider into either:

  • an ops organization
  • a helpdesk organization

if not both.

Either you’re spending all your time operating the environments or you’re spending all your time supporting those environments.

Operational supportability

This is a phrase that we used pretty heavily when I was involved with Enstratus. This concern is cross-cutting regardless of which model you go with:

In a managed hosted model, you have to have almost zero touch administration of your stack.

How many people does it take to manage your SaaS right now? Exclude human factors for a minute (i.e. we have two people so one actually gets to sleep and take vacation). If you need at least two people to manage the steady state of your stack full time, get ready to hire two more for each private install.

You MUST be able to reach a greater operational density. You need numbers like 2 operators for every 10 (or more) private installs.

It gets even worse for you in a fully on-prem model. You have to have absolute zero touch administration of your stack. Your stack may not be able to get to the internet. Depending on the customer, you won’t be able to get to it to support it. Every support issue is going to be you describing to a customer (either through documentation or on the phone) how to fix any given issue.

Here’s an example:

  • How many times have you had to manually touch a system to resolve an issue
  • How many times have you had to manually massage data to fix an issue

You can’t do that anymore. With a SaaS you can chalk that up to tech debt. Something you’ll pay down in the future but you tolerate for now because it doesn’t happen THAT often. With an on-prem model “THAT often” becomes much more frequent and it bleeds your staff dry.
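“Zero touch” starts with the stack doing its own first-line remediation. A minimal sketch of the idea, assuming systemd and a hypothetical `foo-api` component (names and paths are made up):

```ini
# /etc/systemd/system/foo-api.service (hypothetical component)
[Unit]
Description=foo-api service
After=network-online.target
# Stop restarting after 5 failures in 5 minutes; at that point
# you want an alert, not a flapping service
StartLimitIntervalSec=300
StartLimitBurst=5

[Service]
ExecStart=/usr/bin/foo-api --config /etc/foo/api.toml
# A crash becomes a restart, not a support call
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```

This doesn’t fix full disks or corrupted data, but it removes an entire class of “the service died, ssh in and start it” calls that you physically cannot take in a fully on-prem world.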

I can clearly remember cases of having to support a production system via Microsoft Lync over a customer’s workstation (like the actual one the user used on a daily basis) via an SSH client. These are the customers you’re getting.

Loss of Flexibility

Once you decide to go to supporting on-prem (again we’re lumping this term to mean a dedicated install of your SaaS platform for a single tenant), you now have to deal with one of the most difficult questions. Maybe you’ve decided to brute force your operational concerns with a meat cloud. Maybe you’re embracing a toxic hero culture. Fine. Everyone is overworked trying to keep that snowflake up and running.

Now you’ve tied your hands for software development. Let me paint a pretty picture. This is TOTALLY hypothetical and never happened to any cloud management software that I was ever involved with.

  • MySQL scaling and availability is a pain point
  • Decision is made to move to Riak
  • Migration to Riak gets partially done
  • Bits and bobs are moved from MySQL to Riak as time allows. All new features use Riak though

Now this is a common, TOTALLY THEORETICAL thing. Migrating data store technology is a long, incremental process.

So this is in your SaaS and you still have private install customers being signed up. These make money after all.

What was a 4 node environment (HA frontend pair, HA MySQL) now has to have a 3 (or, if you’re doing it right, 5) node Riak cluster.

Now let’s go back to the customers. Depending on your private model, you’re faced with two different scenarios:

  • Deal with customers who can barely support MySQL (at least they can google it) but now have to manage a quorum-based distributed data store
  • Add more machines to the environment you are hosting for your customer

The first one, as you can predict, is pretty much a no-go. Your customer now has to deal with something they’ve never seen before, and you’re essentially providing support not only for the actual product you write but also for the underlying data store.

In both scenarios you now have to justify to your customer why they need 9 systems to run this application when only two of them are running the ACTUAL application. Either way it’s a cost issue and that directly affects your bottom line.

So if you want to do private deploys, EVERY decision has to be couched in the context of how it affects your private environments.

  • Wanna migrate to Mesos? You now have to run n mesos clusters
  • Wanna roll out ES? You now have to run n ES clusters
  • Wanna move to “microservices”? Now you have n×n problems.

What’s really fun is when you’ve already MADE these decisions and realize it’s barely supportable in one environment much less 10 or 20 or 100 of them.

A really good rule of thumb is that if you’re adding or moving to something that has quorum requirements, you may have to rethink that decision or worse consider bifurcating your product.

Don’t bifurcate your product. Down that path lies madness. Instead you’re stuck considering the least common denominator and deferring to the private deploy model because that’s where the money comes from, right?

X in a Box

If anyone watches Silicon Valley on HBO, you might have seen the recent thread of episodes about installing “Pied Piper” in a box at datacenters.

Let’s assume for a minute that your product is easily “boxable”. By that I mean:

  • small footprint (single or small number of artifacts/services)
  • minimum/pluggable external dependencies (think databases)

This is actually how we got to something resembling a sane private deployment model with a certain cloud management platform. We built our “foo in a box” as a way to give developers environments closer to production. Because of the way this was built, it actually became the installer for our private deploys.

But even this was barely supportable. Near the end we started down the path of appliances (benefit of working for a hardware company - they understand physical boxes).

Sidebar on Appliances

Appliances are actually a pretty solid way to manage your stack if you can pull it off. Virtual appliances CAN work but you’ve still not addressed the clustering issue.

Physical appliances, however, work a little differently. There’s a different cognitive load in the mind of the customer when it comes to an appliance. Most IT staff don’t care about what OS is running on the headends of the NAS they bought. It’s a novelty if it’s running Linux but they’re not trying to manage it like a Linux server. They’re managing it like an appliance.

To that end, you have to invest considerable (but well-spent) time in wrapping your ‘appliance’ in something that looks like a Netgear home router. If you have physical hardware, you can work with an LCD panel on the front where they can set the network specifics and then the rest of the configuration is done via a management interface.

As much as I have issues with Docker, containers are actually a great way to handle treating your application components as “firmware”. The primitives are there for essentially partitioning the bits of your hardware for your services.

If you’re being as stateless as possible, a “firmware upgrade” for your “appliance” is a docker pull foo:nextver; docker stop foo; docker rm foo; docker run --name foo foo:nextver. You even get rollback baked in, since the previous image stays cached locally. Taking backups is as “simple” as snapshotting your data volume. Use a dedicated container with a privileged shared socket mounted to proxy OS commands to your base host.

But as I said, this still isn’t a free lunch. You have to develop your admin UI, the shim for host OS interaction (upgrades, backups, network config).
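The upgrade-plus-backup path can be sketched in a few lines of shell. The image name, volume names and versions are all invented; it prints the plan by default (set DRYRUN=0 to actually execute against a real Docker daemon):

```shell
#!/bin/sh
# Hypothetical appliance "firmware upgrade" for a container named foo.
# Runs in dry-run mode (prints commands) unless DRYRUN=0.
IMAGE="registry.example.com/foo"   # made-up registry/name

run() {
  if [ "${DRYRUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

upgrade() {
  ver="$1"
  # Snapshot the data volume first so rollback covers state, not just code
  run docker run --rm -v foo-data:/data -v /backups:/backups alpine \
      tar czf "/backups/foo-data-$(date +%s).tgz" -C /data .
  run docker pull "$IMAGE:$ver"
  run docker stop foo
  run docker rm foo
  # Re-create against the same data volume; the old image stays cached
  # locally, so rollback is the same dance with the previous tag
  run docker run -d --name foo -v foo-data:/data "$IMAGE:$ver"
}

upgrade "${1:-1.2.3}"
```

The point of the sketch is the ordering: backup, pull, then replace the container, so the failure mode at every step is “old version still running”.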

Monitoring and Logging

This is where cost can go pear-shaped REALLY fast.

Unless you are willing to bake the cost of every third-party SaaS you use into each private/on-prem install, you will not be able to afford to keep using those products.

I don’t care WHAT the product is: APM, Logging, Monitoring, Metrics. You can’t afford it anymore. Multiply your current bill for those products by a conservative 10 private installs.

New Relic alone would bankrupt you.

This leaves you with a few options:

  • bifurcate these parts of your infrastructure with homegrown tools in each environment
  • become a logging saas, monitoring saas, apm saas, alerting saas

Let’s be clear. There is a million lightyear difference between running your own ELK stack for your production infrastructure and either:

  • running an ELK SaaS
  • maintaining an unbounded number of ELK stacks

Either way you just became a fucking SME on running ELK at scale. Maybe you should pivot into being a logging SaaS.

Again it goes back to operational excellence.

There are “hybrid” options here but it comes down to what business you want to be in.

Now I can hear people saying “this isn’t a big deal. We run qa, staging and prod environments now. What’s a few more?” Those people are being insanely (and probably intentionally) naive. That’s great. Let’s assume that you’re willing to eat the cost.

Now tell me, genius. How do you eat that cost when the environment cannot access the internet at all?

“Well I…ummm….”

Yeah I thought so.

So at this point you have to decide whether holding to hard requirements, and the contracts you’ll lose over them, is worth it. Either way you still have to come up with a solution.

“Well we just make the customer monitor it…”

That won’t work either and let me tell you why. Your 20 node stack is going to be shoved onto the absolute minimum resources it can fit on. All 5 of your ZooKeeper nodes will MOST likely live on the same physical server running as VMs (lolantiaffinitywut). Those VMs will be periodically “frozen” for snapshotting. The OS will not be monitored because adding 20 HP OpenView licenses is cost prohibitive. So here’s what you’re going to have to support in about a week:

  • My disk filled up and your data node crashed. Have fun with regular recoveries.
  • We keep having periodic timeouts (go find out what happens when your carefully tuned distributed system has its entire OS paused while the hypervisor buffers all memory and network traffic during a live migration)

Yes, I have seen all of these.

So you can decide to make the customer monitor things and deal with the fallout of periodic full system recoveries or you can ship a monitoring solution with your product. And we’re just talking about monitoring.

Your customer is going to understand ONE paradigm for logging - syslog. And depending on the component, they’re probably going to end up without the logs you need when you support them.

So really you need to ship a logging solution too.
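If you do ship one, the customer-facing half can stay inside the syslog paradigm they already know. A sketch of an rsyslog forwarding fragment, assuming hypothetical `foo-` component names and a bundled collector host:

```
# /etc/rsyslog.d/50-foo.conf (hypothetical)
# Forward everything logged by foo components to the bundled collector.
# @@ means TCP; a single @ would be UDP.
:programname, startswith, "foo-" @@foo-logs.internal:5140

# Keep a local copy so support still has something when the collector is down
:programname, startswith, "foo-" /var/log/foo/components.log
```

Everything behind the collector - parsing, storage, retention - is still your problem, but at least the customer-side story is the one paradigm they already understand.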

And since APM is prohibitively expensive at this volume (assuming you can even get to the internet), you’re probably going to want to write your own New Relic clone and ship it as well.

But even if you can use third party systems and get to the internet, you now have to deal with managing security credentials for those services.

  • Do they allow unlimited, limited scope api tokens?
  • Can you automate provisioning of those tokens via an API?
  • Can you automate provisioning of everything else via an API?

For instance we use SignalFX, Loggly and OpsGenie. Looking at the above concerns, we have the following gaps:

Loggly

Loggly allows two submission tokens. That’s it. Two. You are now stuck sharing your Loggly security credentials across n customers’ systems. Loggly has indicated time and time again that they have zero interest in letting you create any additional keys. They seem to think people want to use these for routing (not true) and that you should use other mechanisms for routing. Whatever.

SignalFX

SignalFX has a single token at the moment. The upshot over Loggly is that they’re actively working on it. SignalFX is also much younger than Loggly so gaps are to be expected. SignalFX has a nice and constantly evolving API, however, that API is lacking automation for “integrations”. If you’re deploying your customer infra to AWS, you still have the manual step to enable the AWS monitoring and cross-account access.

OpsGenie

OpsGenie lets you create all the API keys in the world. You can create all the integrations you want but you cannot yet automate the creation of those integrations via the API.

In the end, you may be forced to choose tooling and services that offer nested multi-tenancy which are, sadly, few and far between.

Maintenance

For the sake of moving the discussion forward, let’s make one final assumption. You’ve worked out all the kinks with monitoring. You’ve got your install game on fleek. It’s all looking good.

Now you have to upgrade.

How long does it take to upgrade your current stack?

Maybe you’ve moved to a continuous delivery nirvana. It’s all wired up to a handy chatbot to do the magic.

Throw it all out. All of it. You get to start over.

Even if your customer can get out to the internet (though you may have to go through a proxy - all your tools support connectivity via a proxy right?) and even if you can get into the customer environment on-demand (though you might have to fire up a vpn client or go through a jump box), you may have one final blocker:

The customer says no.

This statement is going to be somewhat “controversial” but if your customer could handle software that has a rolling release schedule and the operational excellence that is required for it, they probably wouldn’t be using your software (unless you are literally the only person in the world offering it).

What I mean by this is that most traditional enterprises upgrade software on a cadence measured in quarters if not years. The level of effort to get to that point brings along a maturity and mindset from the effort. If you’re already writing software in house, have the expertise and maturity to do continuous delivery and are doing it then you’ll probably either write whatever it is yourself or you’re perfectly okay with using the public SaaS version of whatever said product is.

Depending on the number of customers and your own level of automation around upgrades, you could literally be spending months upgrading all the customers on their own schedules. Again, that’s assuming you’re the one allowed to do the upgrades. Does that mean having to go on-site? You could conceivably be doing the upgrade over the previously mentioned Lync session….

One option you can consider is a rollup release on a schedule for your customers. You can do this quarterly or biannually but as I said before, you’re trading an ops problem for a support problem. You now have to provide support for a 6 month old version of your product. If you’re doing CD in your SaaS environment, that version could literally be 182 versions behind what your SaaS is running (assuming a 6 month rollup release with SaaS releases once a day). Do you even remember what state your codebase was in 6 months ago? How many bugs were fixed? How many features added? You could be looking at someone using a version with an entirely different UI.

What if you’ve added an entirely new component to your stack? How was that added at the SaaS level? Did you rush it out with a bit of meat cloud special sauce to get it out there? Now you have to apply that meat cloud special sauce across n environments just to be able to upgrade.

All hope is lost

Not really. It’s not all hopeless but you’re going to have to make a tradeoff and you’re going to have to live with permanent bifurcation regardless of the path you go down.

I’m going to drop a few ideas/thoughts from my experience that may help. I’ve still yet to find nirvana for this. The reason is that the needs and goals of a SaaS are directly opposed to those of someone who would be an on-prem customer. Consider this a brain dump of things we’ve tried over the years. Maybe you can make it work.

True on-premise vs managed hosted

If you have the power, limit your deploys to the “managed hosted” types. You’ll have to bake some non-negotiable things into your contract but the upshot is you eliminate an entire cross-section of problems described above. When you go fully on-prem (like being installed in a Boeing datacenter that requires Top Secret clearance) you lose ALL flexibility.

You also get the benefit of being able to fake it for a bit. You may not have all the automation problems solved but you have the flexibility to brute force it a bit while you work out how the hell you’re going to do this.

Frankly speaking, if your product depends heavily on a cloud specific technology (read: any AWS service), then your only target is AWS. Do you want to spend (albeit probably valuable) time refactoring to uncouple your product from AWS-specific technology? Probably not at first.

Take the case of RDS. It has issues but it’s a solid way to punt on the operational load of maintaining a highly available MySQL instance for a time. If you spend the effort to move off RDS to go fully on-prem, you now have the additional load of trying to approximate RDS on-prem.

Basically you have to choose:

  • spend time making your SaaS look like true COTS software and staff up a helpdesk
  • manage and run the environment yourself and staff up operations

There’s no free lunch here.

One tactic that really started to work for us with the cloud manager was a pure byproduct of being bought by D***. They had the help desk game covered. It was old hat. We did fight with how the ticket systems worked but before I left the model worked something like this:

  • Helpdesk was front line support
  • Our customer team was second level
  • Ops maintained the SaaS platform entirely in isolation
  • Improvements in operational aspects from running the SaaS filtered down to the on-prem

Becoming a customer of yourself

As I said, bifurcation is a real issue. You can mitigate this a bit by a small mental trick of treating the SaaS version of your product as just another on-prem install.

Mentally, at least, you’ve unified the deployment model a bit. It also can free up the development team a bit in not having to worry about any discrepancies between the on-prem world and the SaaS. Your ops team running the SaaS become the best advocates for improving the on-prem experience AND, if you’re lucky, for running the product at scale.

Complementary to this, you should have a dedicated team of people with ops experience handling the on-prem install and lifecycle, separate from the SaaS. You’re all the same organization. Ideas will flow bidirectionally between the teams but it can eliminate the cognitive load of your SaaS ops team trying to align how they run a SaaS with the limitations of private installations.

The ops team may use OpsGenie, SignalFX and Loggly to meet their requirements while your on-prem team has decided to package up a small-footprint Nagios + ELK setup. You might use Chef in the SaaS but the customer team wants to use Ansible or Puppet.

The reason you hire ops people with experience running web applications at scale and not experience running a Windows AD farm is not just one of tools. A “webscale” operations person is going to be approaching every private install through the lens of a highly available “webscale” saas. This isn’t always appropriate for your on-prem customers.

Put it in the contract

The upgrade issue is real. The capacity issue is real. You need to, for your own protection and the sanity of your employees, bake things INTO the contract. Don’t be scared. The things you’re asking for are not onerous compared to most enterprise contracts.

This is general language. I’m not a lawyer. Run it by counsel for appropriate verbiage and whatnot.

Required upgrade cycle to maintain support

“Active support is only provided for the previous two versions of the software stack. Customer may be required to upgrade to the current version to be eligible for support. Security releases are required upgrades for continued support.”

What you are doing here is setting the bar for what you can support. I realize that in a CD world, versions are kind of pointless. One challenge here is how you determine what is a “previous version”. You can move to something like a quarterly rollup of your stack and give it a customer facing version. The goal is to try and at least keep to something released in the current year.

You also explicitly state that security releases are required. You’re going to have fixes required for security reasons and this is a way to short circuit any risk to you because the customer failed to patch as well as force upgrades at other points.

One thing you’ll need to work out is how you handle new customers. Do you give them the latest quarterly release (which may be 3 months old at this point if you’re at the end of the quarter) or do you always go with the latest release? I can’t answer that for you but I can say you want to keep ALL of your customers on the same version if possible.

Maybe you simply state that customers are required to upgrade every quarter for continued support. Continued support is the key here. If the customer chooses not to upgrade, that’s fine but they lose all support from you. It’s their call.

Supported configurations and reference architecture

This is a big one. I mentioned before that your customer may shove your entire distributed stack on a single hypervisor, thus undoing every bit of reliability in your product.

You MUST define explicitly what a supported configuration is. With the DCM product, we initially installed our stack in whatever configuration it took to get the contract signed. You wanted to run it on 2 machines? We would figure out a breakdown that worked. This meant that no two installs were the same. We had to have extensive documentation on what was running where because nothing was standardized.

Our world got much better when we created and documented standard configurations. These came in three sizes:

  • single node (acceptable for POC and testing)
  • four node (minimum production viable configuration - active/passive model)
  • nine node (fully distributed HA model with scaling capabilities)

Only the last two options were eligible for any on-call support from us. Additionally we explicitly called out things like the hypervisor placement issue. If you ran the two backend nodes on the same ESX host and you had an outage, that was your fault. We also stated that NONE of our nodes were allowed to be live migrated as that cut out an entire swath of problems and calls.

You may also want to add to this a baseline set of requirements for the customer to monitor (disk space is a BIG one) to be eligible for support.

Your customers will like this document anyway as it helps them decide how to provision for your product. It’s a win/win.
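The disk-space item in that baseline is cheap to hand the customer as a script they can cron. A sketch (the threshold and mount points are illustrative, not real product requirements):

```shell
#!/bin/sh
# Baseline disk-space check a customer can run from cron.
# The 80% threshold is illustrative, not a real product requirement.
THRESHOLD=${THRESHOLD:-80}

# Print usage for one mount point; non-zero return when over threshold.
check_mount() {
  mount="$1"
  used=$(df -P "$mount" | awk 'NR==2 { sub(/%/, "", $5); print $5 }')
  if [ "$used" -ge "$THRESHOLD" ]; then
    echo "WARN: $mount at ${used}% used (threshold ${THRESHOLD}%)"
    return 1
  fi
  echo "OK: $mount at ${used}% used"
}

# In the real script you'd loop over every mount the product writes to
check_mount /
```

Wire the non-zero exit into whatever alerting the customer already has, and make running it part of the support eligibility language.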

Make “operational supportability” a class of bug in your issue tracker

This is REALLY REALLY important. You should be doing this anyway but you need a place for people to report issues with the stack that directly affect the operability/maintenance of the stack. These deserve equal footing with any feature release or security fix as they directly contribute to the economies of scale in private deploys.

Use science

You’re going to want to invest some serious time and effort into sizing your initial offerings. If you don’t have a dedicated team for the on-prem stuff and it instead falls to your SaaS Ops folks, “you’re gonna have a bad time”. I alluded to this in my burnout blog post. Part of the challenge of sizing private deploys is knowing how much load a given size can actually handle. This is a slow process.

It’s made worse by the fact that if you can’t artificially throttle load for a private deploy, it can quickly become undersized and you’re in a firefighting mode trying to scale it under duress. Example:

  • a X node configuration can handle 100k frobnitz/sec
  • a Y node configuration can handle 500k frobnitz/sec

You present this to the customer and they opt for the X node configuration. Within a week, the customer is pushing 500k frobnitz/sec. Because the on-prem world is a “white glove” service, you can’t take the same approach you do with your SaaS. The reason people asked for a private install is to reduce the possible issues of outages from a multi-tenant system right?

Think back to your last capacity related outage. How did you address that? How much meat cloud was required? Multiply it.

You need science to determine these sizings and you need to allow yourself some wiggle room for burst and give yourself enough time to scale that environment. This isn’t new but many times in a SaaS-only world we can punt a bit because we DON’T know how much load we’re going to have. Capacity planning was never something you should have ignored but I’m damn positive you pushed it off a bit.
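The arithmetic on top of those measurements is the easy part; the per-node throughput number is the part that takes the science. A sketch with invented frobnitz numbers:

```shell
#!/bin/sh
# Size a private install from measured per-node throughput plus burst headroom.
# Every number here is invented for illustration.
PER_NODE=120000     # frobnitz/sec a single node handled in load tests
HEADROOM_PCT=30     # burst allowance so you can scale calmly, not under duress

size_for() {
  expected="$1"
  # Add headroom, then take the integer ceiling of target / per-node capacity
  target=$(( expected + expected * HEADROOM_PCT / 100 ))
  echo $(( (target + PER_NODE - 1) / PER_NODE ))
}

echo "100k frobnitz/sec -> $(size_for 100000) node(s)"
echo "500k frobnitz/sec -> $(size_for 500000) node(s)"
```

The headroom percentage is exactly the “wiggle room for burst” above: it buys you time to scale the environment before the customer outgrows it.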

Packaging and distribution

Packaging here is important. Are you still using tarballs with execute resources to extract them everywhere? You should probably start using FPM or something higher-level like Omnibus to package those as system packages. Hell, use Docker images if you need to. The benefit here is not so much one of packaging but in the things you get for free.

System packages give you auditability but they also give you a distribution mechanism for free. Docker images do the same to some degree. The logic for downloading, checksumming and installing is largely solved for you.
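The checksumming piece is small enough to sketch end to end - the publisher side writes a manifest, and the customer-side installer refuses to proceed unless it verifies (artifact names are stand-ins):

```shell
#!/bin/sh
# Publisher writes a checksum manifest; the customer-side installer
# verifies it before touching anything. Artifact names are stand-ins.
set -e
workdir=$(mktemp -d)
cd "$workdir"
mkdir release

# Stand-in for your real build output (debs, rpms, images...)
echo "pretend this is a package" > release/foo-api_1.2.3_amd64.deb

# Publisher side: checksum every artifact into a manifest
( cd release && sha256sum -- *.deb > SHA256SUMS )

# Customer side: refuse to install unless every checksum matches
( cd release && sha256sum -c SHA256SUMS )
```

Sign the manifest as well if you can; in an environment with no internet access, the manifest may be the only integrity check your installer ever gets.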

Consider a platform

This is my current thinking that I’ve been investigating. Take a look at the requirements of your customers. Do they want a dedicated install because they need to minimize the effect of multi-tenancy in your SaaS? Do they need data partition guarantees?

Maybe you want to build a platform that allows you to deploy single tenant instances of some of your components while leveraging shared resources. Yes, this could mean something like CloudFoundry or Mesos or k8s. The benefit here is that you can slice and dice the resources you deploy along customer requirement boundaries.

Maybe 80% of your private deploy customers want to have reduced tenancy. You can deploy dedicated “web nodes” for them to your platform. If a customer requires data isolation, you can deploy dedicated web nodes and dedicated datastores.

Mind you this doesn’t mean you can’t do FULLY on-prem installs but like I said you probably don’t want to do those ANYWAY.

Charge appropriately

Pricing is always a balance game but don’t be afraid to charge what it’s REALLY going to cost you. You have to take economies of scale as a factor but don’t price a footprint for a managed install that is going to eat up 90% of your support time fighting fires.

You can also use pricing as a way to filter out drive-by customers and really drive home that the SaaS is the way you want customers to go.

Wrap up

As I said, if you can avoid doing on-prem you should. If you have to do it, the managed hosted option is the one that buys you the most flexibility. Listen to your operations team. If you’re going to rely on them to handle the private installs, you HAVE to spend time on the unautomated bits they have concerns over. You have MAYBE about 3 private deploys in addition to your SaaS duties before they will be so overwhelmed that they won’t be able to put any focus on automation anymore.

I would love to hear feedback from folks on strategies that have worked for them.

Special Thanks

I want to send a special shoutout to the following people who offered feedback while I was writing this:

and everyone else who offered. It was invaluable.
