#4 Build availability monitoring platform : jira-projects/MISC#4

btasker Permalink
27-May-22 19:07

assigned to @btasker

btasker Permalink
27-May-22 19:08

My current line of thinking is to use AWS's Elastic Container Service to run a batch task periodically.

That task will be to run Telegraf with something like

telegraf --once --config https://projectsstatic.bentasker.co.uk/MISC/telegraf.conf

So that it runs its checks, reports in and then exits.

btasker Permalink
27-May-22 19:44

OK, after a bit of fiddling around, I think I've got the beginnings of this.

In Cloud2, created a telegraf config to request against my sites

Screenshot_20220527_204003

In AWS, went into Elastic Container Service and created an ECS cluster (equipped with a single t3.micro)

Screenshot_20220527_204125

Created a task TelegrafWebsiteMonitor and added a container to it with the following settings

image: telegraf:latest
command: telegraf,--once,--config,https://eu-central-1-1.aws.cloud2.influxdata.com/api/v2/telegrafs/<blah>
Env vars: INFLUX_TOKEN: [my token]

I should probably have used secret storage for the token, but that'll do for now.
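For reference, the console settings above can be written as an ECS container definition. This is only a sketch: the container name `telegraf` is my own placeholder, the config URL keeps the elided `<blah>` from the notes, and the token value is a dummy. The `name`, `image`, `command`, `environment` and `essential` keys are the standard container definition fields.

```python
import json

# Placeholder: the real config ID is elided in the notes, so it stays as <blah> here.
CONFIG_URL = "https://eu-central-1-1.aws.cloud2.influxdata.com/api/v2/telegrafs/<blah>"


def telegraf_container(config_url: str, influx_token: str) -> dict:
    """Build an ECS container definition that runs Telegraf once and exits."""
    return {
        "name": "telegraf",  # assumed name, not given in the notes
        "image": "telegraf:latest",
        # The console's comma-separated command field becomes a list of arguments.
        "command": ["telegraf", "--once", "--config", config_url],
        "environment": [
            {"name": "INFLUX_TOKEN", "value": influx_token},
        ],
        "essential": True,
    }


if __name__ == "__main__":
    # "my-token" is a dummy value for illustration only.
    print(json.dumps(telegraf_container(CONFIG_URL, "my-token"), indent=2))
```

Moving the token to secret storage would mean replacing the `environment` entry with a `secrets` entry pointing at the stored value.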

Running the task works

Screenshot_20220527_204356

btasker Permalink
27-May-22 19:47

So, we've got a minimum viable product. What we need to work out now (well, tomorrow probably) is

How to tell AWS to automate the task
How best to have telegraf communicate the AWS region
Whether this actually makes sense (does the billing work out less than just running a t3.micro with telegraf on?)

Might also be interesting to look at whether we can achieve similar with Google Cloud Run

btasker Permalink
27-May-22 19:58

Figured I'd schedule some runs before wandering off and enjoying Friday night.

Following the docs to schedule with EventBridge

I defined a rule using type schedule

Screenshot_20220527_204957

Then set the schedule as every 5 mins

Screenshot_20220527_205030

Then defined the target as

AWS service
Target type ECS task
Selected my cluster
Selected my Task definition
Left count at 1

Screenshot_20220527_205114

It looks like the rule fires on creation, as I got a datapoint at 19:52 UTC. It'll be interesting to see whether the next is at 19:57 (5 minutes after the start, so every 5 minutes from creation) or at 19:55 (so on each 5 minute boundary). I assume the former.

Screenshot_20220527_205742

Every 5 minutes it is then
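The two possibilities being compared can be sketched as below. This just illustrates the arithmetic and is not how EventBridge is implemented: a `rate(5 minutes)` style schedule counting from rule creation versus one snapped to wall-clock 5 minute boundaries. The function names are mine.

```python
from datetime import datetime, timedelta

INTERVAL = timedelta(minutes=5)


def next_relative(created: datetime) -> datetime:
    """Next run if the schedule counts from the moment the rule was created."""
    return created + INTERVAL


def next_aligned(created: datetime) -> datetime:
    """Next run if the schedule snapped to wall-clock 5 minute boundaries."""
    midnight = created.replace(hour=0, minute=0, second=0, microsecond=0)
    intervals_done = (created - midnight) // INTERVAL  # whole intervals elapsed today
    return midnight + (intervals_done + 1) * INTERVAL


created = datetime(2022, 5, 27, 19, 52)  # first datapoint seen at 19:52 UTC
print(next_relative(created).strftime("%H:%M"))  # 19:57
print(next_aligned(created).strftime("%H:%M"))   # 19:55
```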

btasker Permalink
27-May-22 20:06

Quickly then, creating a notebook

Created an API token

Screenshot_20220527_205914

Built a notebook

Screenshot_20220527_210559

btasker Permalink
28-May-22 07:57

So that's working

Screenshot_20220528_084803

I want to get it to declare its region now though, so adding the following to the Telegraf config in cloud:

[global_tags]
  region = "${TEST_REGION}"

Created a new revision of the task specifying a container with

image: telegraf:latest
command: telegraf,--once,--config,https://eu-central-1-1.aws.cloud2.influxdata.com/api/v2/telegrafs/<blah>
Env vars: INFLUX_TOKEN: [my token]
Env vars: TEST_REGION: eu-west

When I created the schedule, it was set to use the latest revision of the task definition, so in theory we should see that tag start appearing in Cloud.
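To illustrate what the `${TEST_REGION}` placeholder does, here's a small stand-in using Python's `string.Template`, which happens to use the same `${VAR}` syntax. Telegraf does its own substitution when it loads the config, so this only mimics the effect. For example, `substitute()` raises on a missing variable, which may not match Telegraf's behaviour.

```python
from string import Template

# The fragment added to the Telegraf config in cloud.
CONFIG_FRAGMENT = '''[global_tags]
  region = "${TEST_REGION}"
'''


def render(config_text: str, env: dict) -> str:
    """Replace ${VAR} placeholders with values from env (raises KeyError if one is missing)."""
    return Template(config_text).substitute(env)


# Same value as the env var set on the container.
print(render(CONFIG_FRAGMENT, {"TEST_REGION": "eu-west"}))
```

Each region's task definition then only needs a different `TEST_REGION` value; the config itself stays the same.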

btasker Permalink
28-May-22 08:10

And there it is

Screenshot_20220528_090715

The query underlying each of those graphs is

from(bucket: "Systemstats")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "http_response")
  |> filter(fn: (r) => r["_field"] == "response_time")
  |> filter(fn: (r) => r["server"] == "https://www.bentasker.co.uk")
  |> group(columns: ["region", "server"])
  |> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)
  |> map(fn: (r) => ({
      _time: r._time,
      _field: r._field,
      _value: r._value * 1000.0,
      region: r.region,
      server: r.server
  }))

So, the next thing to do would be to set exactly the same thing up again, in a different AWS region

btasker Permalink
28-May-22 08:37

Just going back to the billing though, it's probably worth seeing whether we can get this working with Fargate - that way you only pay for the time the container is running.

For comparison's sake, costs are

EC2 based cluster with 1 t3.micro: $0.0104/hr ($7.49/month at 720 hours)
Fargate: $0.04048 per vCPU/hr + $0.004445 per GB RAM/hr

Data transfer rates aren't included, but are the same for both.

Working out Fargate pricing is obviously a little more complex...

Fargate bills to the second but has a 1 minute minimum, and we know we only need part of a vCPU. There's 20GB of storage baked in, we'll never need more than that for this.

If we assume

average run will always be < 1 minute (it looks more like it's actually 2 seconds, so we should be safe)
we use 0.25vcpu and 0.5GB RAM
we run at 5 minute intervals

Then, we'd be billed for 12 minutes per hour:

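# Assumptions: 12 runs per hour, each billed at Fargate's 60 second minimum
billed_seconds = 12 * 60    # 720
num_cpus = 0.25
gb_ram = 0.5
hourly_cpu_cost = 0.04048   # USD per vCPU per hour
hourly_ram_cost = 0.004445  # USD per GB of RAM per hour

# vCPU charge for the billed time
cpu_cost_second = hourly_cpu_cost / 3600
billed_cpu_cost = (billed_seconds * cpu_cost_second) * num_cpus

# RAM charge for the billed time
ram_cost_second = hourly_ram_cost / 3600
billed_ram_cost = (ram_cost_second * billed_seconds) * gb_ram

total_cost = billed_ram_cost + billed_cpu_cost
print(total_cost)  # cost per hour in USD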

That would make the cost of Fargate roughly $0.00247/hr ($1.777/month, assuming a 720 hour month)
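It's worth seeing how sensitive that is to the check interval. The sketch below generalises the calculation, using only the prices quoted above and assuming each run is billed at the 1 minute minimum and runs never overlap. The function name and the default 2 second run time are my own choices.

```python
VCPU_HOUR = 0.04048          # USD per vCPU-hour, from the figures above
GB_HOUR = 0.004445           # USD per GB-hour of RAM, from the figures above
EC2_T3_MICRO_HOUR = 0.0104   # USD per hour for the t3.micro, from the figures above
MIN_BILLED_SECONDS = 60      # Fargate's 1 minute minimum


def fargate_hourly_cost(interval_minutes: float, run_seconds: float = 2.0,
                        vcpus: float = 0.25, gb_ram: float = 0.5) -> float:
    """Estimated Fargate cost per hour when a task runs once per interval."""
    runs_per_hour = 60 / interval_minutes
    billed_seconds = runs_per_hour * max(run_seconds, MIN_BILLED_SECONDS)
    per_second = (VCPU_HOUR * vcpus + GB_HOUR * gb_ram) / 3600
    return billed_seconds * per_second


for interval in (1, 5, 15):
    cost = fargate_hourly_cost(interval)
    verdict = "cheaper" if cost < EC2_T3_MICRO_HOUR else "dearer"
    print(f"every {interval:>2} min: ${cost:.5f}/hr ({verdict} than the t3.micro)")
```

On these assumptions Fargate stays cheaper until the task is billed for around 50 minutes in each hour, so checking every minute would tip the balance back towards the t3.micro.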

btasker Permalink
28-May-22 09:08

So:

Go to Elastic Containers Service

Create cluster, "Networking only" (called mine FargateCluster)

Screenshot_20220528_093810

That's all the options there are....

Creating a new task definition

Screenshot_20220528_094227

Creating the container

image: telegraf:latest
command: telegraf,--once,--config,https://eu-central-1-1.aws.cloud2.influxdata.com/api/v2/telegrafs/<blah>
Env vars: INFLUX_TOKEN: [my token]
Env vars: TEST_REGION: eu-west
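Unlike the EC2 version, a Fargate task definition has to declare its size at the task level and use `awsvpc` networking. A sketch of those extra fields is below, assuming the 0.25 vCPU / 0.5GB sizing from the cost estimate (the sizes actually picked are only in the screenshot). The helper name is mine.

```python
def fargate_task_size(vcpus: float, gb_ram: float) -> dict:
    """Convert human-friendly sizes into the task-level fields Fargate expects."""
    return {
        "requiresCompatibilities": ["FARGATE"],
        "networkMode": "awsvpc",             # the only network mode Fargate supports
        "cpu": str(int(vcpus * 1024)),       # 1 vCPU = 1024 CPU units
        "memory": str(int(gb_ram * 1024)),   # memory is expressed in MiB
    }


print(fargate_task_size(0.25, 0.5))
```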

Created and giving it a test run

Screenshot_20220528_094342

It worked.

Next is scheduling it,

Go to Clusters
Click into the Fargate cluster
Click Scheduled Tasks
Click Create

Screenshot_20220528_100643

I've disabled the schedule I created earlier so that the EC2 based cluster is no longer running the task

The first run of the task has started

Screenshot_20220528_100751

btasker Permalink
28-May-22 09:24

The Fargate runs are reporting in happily, so I'm going to kill off the EC2 based cluster and associated stuff.

Note: deleting a cluster in ECS is unbelievably slow...

btasker Permalink
28-May-22 09:29

Next step then, is to switch region and see whether the runbook above gets me there.

I think we go for Eastern USA next - the UptimeRobot stuff that triggered this experiment in the first place was hitting a datacentre in that region, so it'll help get us towards replicating that.

btasker Permalink
28-May-22 10:09

It looks like the task scheduling works a little differently in AWS Singapore: unlike the other regions, the task doesn't run at the time of creation.

btasker Permalink
28-May-22 10:55

We can tabulate aggregate response time stats across regions with the following Flux query

data = from(bucket: "Systemstats")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "http_response")
  |> filter(fn: (r) => r["_field"] == "response_time")
  |> filter(fn: (r) => r["server"] == "https://www.bentasker.co.uk")
  |> group(columns: ["region"])
  |> map(fn: (r) => ({r with _value: r._value * 1000.0}))
  |> keep(columns: ["_value", "region"])

mean = data
  |> mean()

max = data
  |> max()

min = data
  |> min()

p95 = data
  |> quantile(q: 0.95)

// Join the aggregates pairwise on region, as join() takes two tables at a time
j1 = join(tables: {mean: mean, max: max},
      on: ["region"],
      method: "inner"
    )

j2 = join(tables: {min: min, j1: j1},
      on: ["region"],
      method: "inner"
    )

// j2 is keyed as "min" so that the final map() can read r._value_min
join(tables: {min: j2, perc: p95},
      on: ["region"],
      method: "inner"
    )
    |> map(fn: (r) => ({
      region: r.region,
      min: r._value_min,
      max: r._value_max,
      mean: r._value_mean,
      p95: r._value_perc
    }))
    |> group()

Screenshot_20220528_114718
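For anyone wanting to sanity-check the table outside of Flux, the same per-region summary can be sketched in Python. The sample rows are made up purely for illustration. The p95 here uses a simple nearest-rank method, whereas Flux's `quantile()` defaults to a t-digest estimate, so the two won't necessarily agree exactly.

```python
import math
from collections import defaultdict
from statistics import mean

# Hypothetical sample rows: (region, response_time in seconds). Not real measurements.
ROWS = [
    ("eu-west", 0.120), ("eu-west", 0.150), ("eu-west", 0.110),
    ("us-east", 0.310), ("us-east", 0.290), ("us-east", 0.350),
]


def p95_nearest_rank(values):
    """95th percentile using the nearest-rank method."""
    ordered = sorted(values)
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return ordered[rank - 1]


def summarise(rows):
    """Group response times by region and compute min/max/mean/p95 in milliseconds."""
    by_region = defaultdict(list)
    for region, seconds in rows:
        by_region[region].append(seconds * 1000.0)  # same ms conversion as the query
    return {
        region: {
            "min": min(v),
            "max": max(v),
            "mean": mean(v),
            "p95": p95_nearest_rank(v),
        }
        for region, v in by_region.items()
    }


for region, stats in summarise(ROWS).items():
    print(region, stats)
```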

btasker Permalink
28-May-22 15:36

I've documented this process here: https://www.bentasker.co.uk/posts/blog/general/website-availability-monitoring-with-telegraf-fargate-and-influxdb.html
