We’re currently in the process of reconfiguring our monitoring, logging and alerting setup.
Obviously this is something near and dear to my heart. At my previous employer we did everything in house due to various constraints. I’ve got a rich set of experiences in this space so my first inclination was to build not buy. However, due to OTHER constraints, building was not a practical solution at present.
After much research and evaluation, we finally settled on the following stack:
I’m not going into reasons for switching in this post or details about WHY these providers were chosen. This is to discuss a specific AWS service and how it can really change the nature of ChatOps and more.
Our monitoring and metrics provider, SignalFx, is still building out its integrations. They have a rich set already and are iterating very quickly but, coupled with our migration to OpsGenie, we needed to work out a solution for wiring the two together. Using the power of Google, I came across a great post from R.I. Pienaar about leveraging Lambda and the AWS API Gateway products.
In his post, he didn’t really go over the API Gateway configuration so I figured I would document what I “figured out” for others.
The above post covers most of it but the general idea is:
- SignalFx Alerting webhook fires
- AWS API Gateway gets webhook and fires off a Lambda task
- Lambda task translates webhook into OpsGenie create alert API call
- OpsGenie wakes you from a dead sleep
Now you might wonder why go through all this hassle? OpsGenie can create email integrations and SignalFx can send alerts to an email address.
It’s about context.
When email alerts come in to OpsGenie, they’re just “pings”:
Hey this thing sent us an email and here’s the subject. Log in to the website to see the body
Frankly there’s not much else they can do. SignalFx will also send another email when an alert auto-clears but to OpsGenie, that’s another “ping”. It’s unrelated to the previous email.
As you can see above this could get painful and I already have issues with “alert fatigue”.
Additionally, even if you DID log into the OpsGenie website to look at the details, there’s really not much there:
Yeah that’s helpful…
So we need to accomplish two things:
- Correlate alerts and auto-closes from SignalFx
- Get some damn context into the alert so we can act intelligently on it
The end result gives us this:
and that’s MUCH more useful.
How we did it
AWS API Gateway is a PRETTY intimidating thing. There’s lots of very specific terminology and frankly I found the docs not so useful. The general idea is this:
- Create an API
- Create a “resource”: this is just a route that the API gateway responds to i.e.
- Create a “model”: this is a JSON schema that describes what the incoming POST body looks like and its content type
- Create an “integration request”: What do you do when you get it? In this case call a Lambda task
There are also concepts like “stages” which frankly were a bit useless in my case. In the end, after you get all of this wired up, you’ll be given a URL that looks something like this:
https://<random id>.execute-api.<region>.amazonaws.com/<stage name>/<resource name>
and that’s your webhook URL.
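To make the shape of that URL concrete, here is a minimal sketch that assembles it from its four parts. The function name and the sample values are made up for illustration; the real API id, region, stage and resource names come from your own deployment.

```javascript
// Builds the invoke URL for a deployed API Gateway resource, following the
// pattern shown above. All four arguments are placeholders from your own setup.
function buildWebhookUrl(apiId, region, stageName, resourceName) {
  // Drop any leading slashes so we do not end up with "//" in the path.
  const resource = String(resourceName).replace(/^\/+/, "");
  return "https://" + apiId + ".execute-api." + region +
    ".amazonaws.com/" + stageName + "/" + resource;
}

// Hypothetical values, for illustration only.
console.log(buildWebhookUrl("abc123", "us-east-1", "prod", "/signalfx"));
// prints https://abc123.execute-api.us-east-1.amazonaws.com/prod/signalfx
```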
In the case of SignalFx, a webhook post looks like this:
which translates to the following model:
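As a rough illustration of how a webhook body and an API Gateway model relate, here is a sketch. It is not SignalFx’s documented format: the only keys this post actually mentions are sources, incidentId and a status that can be ok, so every other key and value below is an assumption. API Gateway models are written as JSON Schema, which is what the second object imitates.

```javascript
// Hypothetical SignalFx webhook body. Only `sources`, `incidentId` and an
// "ok" status are mentioned in this post; the other keys are assumptions.
const samplePayload = {
  incidentId: "example-incident-1",
  status: "anomalous",                               // assumed non-ok value
  sources: "host-a,host-b",                          // comma-separated groupings
  detector: "CPU high",                              // assumed key
  detectorUrl: "https://example.invalid/detector/1"  // assumed key
};

// A JSON Schema style model describing that body.
const webhookModel = {
  $schema: "http://json-schema.org/draft-04/schema#",
  title: "SignalFxWebhook",
  type: "object",
  properties: {
    incidentId: { type: "string" },
    status: { type: "string" },
    sources: { type: "string" },
    detector: { type: "string" },
    detectorUrl: { type: "string" }
  },
  required: ["incidentId", "status"]
};

// Tiny illustrative check, NOT a full JSON Schema validator. It only verifies
// that required keys exist and that present keys have the declared type, which
// works here because every property in the model is a string.
function matchesModel(body, model) {
  const hasRequired = model.required.every(function (key) {
    return Object.prototype.hasOwnProperty.call(body, key);
  });
  const typesOk = Object.keys(model.properties).every(function (key) {
    return !(key in body) || typeof body[key] === model.properties[key].type;
  });
  return hasRequired && typesOk;
}
```

With a model attached to the resource, the Lambda function can rely on the keys it needs being present instead of checking for them itself.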
The lambda function
I’m still cleaning up the hastily banged out code for the Lambda function before posting it on GitHub. In the meantime, here are the relevant bits from the opsgenie.js file I added to R.I.’s code:
This is the meat of the translation. You can customize this as needed but note how we immediately add a note to the incident with the link to the graph in SignalFx.
When you create a detector in SignalFx, any groupings you create in the signal function are put into the sources key as comma-separated values. We take these and make them tags in OpsGenie.
We also leverage the OpsGenie alias to create our own ID for the event. Normally you would store this ID somewhere for reference later but instead we use the unique ID from SignalFx.
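A minimal sketch of that translation might look like the following. This is not the actual opsgenie.js code: the function name, the webhook keys detector and detectorUrl, and the exact shape of the OpsGenie body are assumptions, and authentication is left out entirely. What it does take from the description above is the mapping itself: sources become tags, incidentId becomes the alias, and the graph link goes into a note.

```javascript
// Translate a SignalFx-style webhook body into an OpsGenie-style create-alert
// body. Field names on both sides are assumptions for illustration.
function toCreateAlert(webhook) {
  // "host-a, host-b" -> ["host-a", "host-b"], ignoring empty entries.
  const tags = String(webhook.sources || "")
    .split(",")
    .map(function (s) { return s.trim(); })
    .filter(function (s) { return s.length > 0; });

  return {
    message: webhook.detector || "SignalFx alert",  // assumed title field
    alias: webhook.incidentId,                      // our own ID for the event
    tags: tags,
    note: "Graph: " + webhook.detectorUrl           // assumed link field
  };
}

// Example with a hypothetical webhook body.
const alertBody = toCreateAlert({
  incidentId: "example-incident-1",
  sources: "host-a, host-b",
  detector: "CPU high",
  detectorUrl: "https://example.invalid/detector/1"
});
console.log(alertBody.alias, alertBody.tags);
```

Because the alias is taken straight from the incoming webhook, nothing has to be stored between the alert opening and clearing.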
This makes correlating a previous alert dead simple, as you can see below:
SignalFx has four statuses it can set for an alert:
Here we check if the status is ok and, if so, we call a different function to generate the request object for a different path (the close alert path).
Since we created the alert using an alias of the SignalFx incidentId, we don’t even need to do any more parsing - just POST to the close resource with that alias.
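The status check can be sketched roughly like this. The two paths and the function name are hypothetical placeholders rather than real OpsGenie endpoints, so check the API version you are using for the actual ones. The only logic taken from the description above is that an ok status produces a close request keyed on the alias, and anything else produces a create request.

```javascript
// Hypothetical paths; substitute the real ones for your OpsGenie API version.
const CREATE_PATH = "/alert";
const CLOSE_PATH = "/alert/close";

// Decide which request to send based on the webhook status.
function buildOpsGenieRequest(webhook) {
  const isOk = String(webhook.status).toLowerCase() === "ok";
  if (isOk) {
    // The alert was created with alias = incidentId, so the alias alone is
    // enough to find and close it.
    return {
      method: "POST",
      path: CLOSE_PATH,
      body: { alias: webhook.incidentId }
    };
  }
  return {
    method: "POST",
    path: CREATE_PATH,
    body: { alias: webhook.incidentId, message: "SignalFx alert" }
  };
}

// Example: an auto-clear webhook maps to a close request.
console.log(buildOpsGenieRequest({ status: "ok", incidentId: "example-incident-1" }).path);
// prints /alert/close
```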
As you saw above, we get much more data and context about the alert, and all of it is also visible in the OpsGenie mobile app.
Honestly the biggest benefit is not having to wait on service providers to create native integrations. Almost every service I’ve used over the past several years offers an outgoing webhook capability for their system. API Gateway solves the part of getting those webhooks and its native Lambda support means I don’t need to leave anything “running” to maintain, upgrade and support just to do something with that webhook.
However, one other huge benefit that is also hinted at in the other post is that, using this model, you get a simple abstraction from your service provider. If we wanted to move to some other alerting provider, we just change the Lambda function to post there instead. No need to redo all the integrations in SignalFx. My plan is to work towards a model where we try and utilize the API Gateway as the webhook endpoint for various services and translate to our other providers from there.
I really enjoyed working with the gateway. Testing wasn’t too painful as it has a “mock” mode as well where it behaves similarly to requestb.in. It supports a couple of different authentication methods well, though sadly they couldn’t be leveraged in this case due to provider webhook formats (note: if you offer webhook support, let your users define a custom header as part of the webhook!).
Loggly has HipChat support already but it posts the messages in HipChat as a raw json dump which is useless:
By migrating it to use API Gateway and Lambda, we get the following instead:
Obviously that’s an MVP.
We’ll probably go further down this route for other integrations, especially when HipChat Connect goes GA. Then we’ll likely start posting richer messages using “cards”, similar to my previous experience with Slack’s API.
One thing that’s really neat is that the API Gateway approach can let you make some REALLY simple tools for ChatOps using outgoing webhooks from your chat system.
However, it will REALLY open up when Lambda gets VPC support. Imagine a Lambda function that can fire inside your VPC and interact with all your private resources. It’s both terrifying and thrilling, assuming all the appropriate controls are in place (using API keys, RBAC in your outgoing webhook).
Thanks for reading. I hope it was valuable to you.