project jira-projects / Miscellaneous

jira-projects/MISC#4: Build availability monitoring platform

Issue Information

Issue Type: issue
Status: closed
Reported By: btasker
Assigned To: btasker

Created: 27-May-22 19:07


I've been investigating an increase (trebling) of latency to my website reported by UptimeRobot.

The increase it shows doesn't appear in any of the other latency metrics I have, so I suspect it's probably on UR's side.

But, I thought it'd be interesting to look at building a platform which periodically runs a test, and then reports into a Cloud2 instance, with a Notebook used as a sort of status page

assigned to @btasker

My current line of thinking is to use AWS's Elastic Container Service to run a batch task periodically.

That task will be to run Telegraf with something like

telegraf --once --config

So that it runs its checks, reports in and then exits.
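The full invocation ends up looking something like the following - a sketch only, as the config URL and token here are placeholders (InfluxDB Cloud's Telegraf config page provides the real URL):

```
# Hypothetical invocation - the config URL and token are placeholders.
# Telegraf can fetch its config over HTTP, so the container only needs
# the URL plus an INFLUX_TOKEN environment variable for authentication.
export INFLUX_TOKEN="<my token>"
telegraf --once --config "https://<cloud2-host>/api/v2/telegrafs/<telegraf-config-id>"
```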

OK, after a bit of fiddling around, I think I've got the beginnings of this.

In Cloud2, created a telegraf config to request against my sites


In AWS, went into Elastic Container Service and created an ECS cluster (equipped with a single t3.micro)


Created a task TelegrafWebsiteMonitor and added a container to it with the following settings

  • image: telegraf:latest
  • command: telegraf,--once,--config,<blah>
  • Env vars: INFLUX_TOKEN: [my token]

I should probably have used secret storage for the token, but that'll do for now.
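For future reference, ECS task definitions can pull the token from AWS Secrets Manager rather than a plain environment variable. A sketch of the relevant container-definition fragment (the secret ARN is a placeholder):

```json
{
  "name": "telegraf",
  "image": "telegraf:latest",
  "secrets": [
    {
      "name": "INFLUX_TOKEN",
      "valueFrom": "arn:aws:secretsmanager:eu-west-1:111111111111:secret:influx-token"
    }
  ]
}
```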

Running the task works


So, we've got a minimum viable product. What we need to work out now (well, tomorrow, probably) is

  • How to tell AWS to automate the task
  • How best to have telegraf communicate the AWS region
  • Whether this actually makes sense (does the billing work out less than just running a t3.micro with telegraf on?)

Might also be interesting to look at whether we can achieve similar with Google Cloud Run

Figured I'd schedule some runs before wandering off and enjoying Friday night.

Following the docs to schedule with Eventbridge

I defined a rule using type schedule


Then set the schedule as every 5 mins


Then defined the target as

  • AWS service
  • Target type ECS task
  • Selected my cluster
  • Selected my Task definition
  • Left count at 1


It looks like the rule fires once on creation, as I got a datapoint at 19:52 UTC - it'll be interesting to see whether the next is at 19:57 (5 minutes after the first run, so every 5 minutes from creation) or at 19:55 (i.e. on each 5-minute wall-clock boundary). I assume the former.


Every 5 minutes it is then
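That lines up with how EventBridge schedule expressions are documented to behave: a rate expression runs relative to when the rule was created, whereas a cron expression would pin runs to wall-clock boundaries. For comparison:

```
rate(5 minutes)        # relative schedule - matches the behaviour observed here
cron(0/5 * * * ? *)    # would instead run at :00, :05, :10, ... each hour
```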

Quickly then, creating a notebook

Created an API token


Built a notebook


So that's working


I want to get it to declare its region now though, so adding the following to the Telegraf config in cloud:

  region = "${TEST_REGION}"

Created a new revision of the task specifying a container with

  • image: telegraf:latest
  • command: telegraf,--once,--config,<blah>
  • Env vars: INFLUX_TOKEN: [my token]
  • Env vars: TEST_REGION: eu-west

When I created the schedule, it was set to use the latest revision of the task definition, so in theory we should see that tag start appearing in Cloud.

And there it is


The query underlying each of those graphs is

from(bucket: "Systemstats")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "http_response")
  |> filter(fn: (r) => r["_field"] == "response_time")
  |> filter(fn: (r) => r["server"] == "")
  |> group(columns: ["region", "server"])
  |> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)
  |> map(fn: (r) => ({
      _time: r._time,
      _field: r._field,
      _value: r._value * 1000.0,
      region: r.region,
      server: r.server
  }))

So, the next thing to do would be to set exactly the same thing up again, in a different AWS region

Just going back to the billing though, it's probably worth seeing whether we can get this working with Fargate - that way you only pay for the time the container is running.

For comparison's sake, costs are

  • EC2 based cluster with 1 t3.micro: $0.0104/hr ($7.49/month at 720 hrs)
  • Fargate: $0.04048 per vCPU/hr + $0.004445 per GB RAM/hr

Data transfer rates aren't included, but are the same for both.

Working out Fargate pricing is obviously a little more complex...

Fargate bills to the second but has a 1 minute minimum, and we know we only need part of a vCPU. There's 20GB of storage baked in; we'll never need more than that for this.

If we assume

  • average run will always be < 1 minute (it looks more like it's actually 2 seconds, so we should be safe)
  • we use 0.25vcpu and 0.5GB RAM
  • we run at 5 minute intervals

Then, we'd be billed for 12 minutes per hour:

cpu_seconds = 720  # 12 one-minute minimum charges per hour
num_cpus = 0.25
gb_ram = 0.5
hourly_cpu_cost = 0.04048   # $ per vCPU-hour
hourly_ram_cost = 0.004445  # $ per GB-hour

cpu_cost_second = hourly_cpu_cost / 3600
billed_cpu_cost = (cpu_seconds * cpu_cost_second) * num_cpus

ram_cost_second = hourly_ram_cost / 3600
billed_ram_cost = (ram_cost_second * cpu_seconds) * gb_ram

total_cost = billed_ram_cost + billed_cpu_cost

That would make the cost of Fargate $0.0024685/hr ($1.78/month)
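As a sanity check on the arithmetic, the same comparison can be run as a small Python sketch (all figures are the per-hour prices quoted above; 720 hours is used as a nominal month):

```python
# Sanity-check of the EC2 vs Fargate cost comparison above.
# Prices are the per-hour figures quoted earlier; 720 hrs = nominal month.

HOURS_PER_MONTH = 720

# EC2: one t3.micro running continuously
ec2_hourly = 0.0104
ec2_monthly = ec2_hourly * HOURS_PER_MONTH

# Fargate: 12 one-minute minimum charges per hour (one run every 5 minutes)
billed_seconds_per_hour = 12 * 60
vcpus, gb_ram = 0.25, 0.5
cpu_price_hr, ram_price_hr = 0.04048, 0.004445

fargate_hourly = (billed_seconds_per_hour / 3600) * (
    vcpus * cpu_price_hr + gb_ram * ram_price_hr
)
fargate_monthly = fargate_hourly * HOURS_PER_MONTH

print(f"EC2: ${ec2_monthly:.2f}/month, Fargate: ${fargate_monthly:.2f}/month")
```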


Go to Elastic Containers Service

Create cluster, "Networking only" (called mine FargateCluster)

That's all the options there are....

Creating a new task definition


Creating the container

  • image: telegraf:latest
  • command: telegraf,--once,--config,<blah>
  • Env vars: INFLUX_TOKEN: [my token]
  • Env vars: TEST_REGION: eu-west

Created and giving it a test run


It worked.

Next is scheduling it:

  • Go to Clusters
  • Click into the Fargate cluster
  • Click Scheduled Tasks
  • Click Create


I've disabled the schedule I created earlier so that the ECS based cluster is no longer running the task

The first run of the task has started


The fargate runs are reporting in happily, so I'm going to kill off the ECS cluster and associated stuff.

Note: deleting a cluster in ECS is unbelievably slow...

Next step then, is to switch region and see whether the runbook above gets me there.

I think we go for Eastern USA next - the UptimeRobot stuff that triggered this experiment in the first place was hitting a datacentre in that region, so it'll help get us towards replicating that.

It looks like the task scheduling works a little differently in AWS Singapore: unlike other regions, the task doesn't run at time of first creation.

We can tabulate aggregate response time stats across regions with the following Flux query

data = from(bucket: "Systemstats")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "http_response")
  |> filter(fn: (r) => r["_field"] == "response_time")
  |> filter(fn: (r) => r["server"] == "")
  |> group(columns: ["region"])
  |> map(fn: (r) => ({r with _value: r._value * 1000.0}))
  |> keep(columns: ["_value", "region"])

mean = data
  |> mean()

max = data
  |> max()

min = data
  |> min()

p95 = data
  |> quantile(q: 0.95)

j1 = join(tables: {mean: mean, max: max},
      on: ["region"],
      method: "inner")

j2 = join(tables: {min: min, j1: j1},
      on: ["region"],
      method: "inner")

join(tables: {min: j2, perc: p95},
      on: ["region"],
      method: "inner")
    |> map(fn: (r) => ({
      region: r.region,
      min: r._value_min,
      max: r._value_max,
      mean: r._value_mean,
      p95: r._value_perc
    }))
    |> group()
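The same min/max/mean/p95 roll-up can be sanity-checked outside of Flux. A small Python sketch with made-up sample values (note the percentile estimator here is nearest-rank, so it only approximates Flux's quantile() defaults):

```python
import statistics

def summarise(samples_ms):
    """Roughly mirror the Flux roll-up: min/max/mean/p95 of response times (ms)."""
    ordered = sorted(samples_ms)
    # nearest-rank 95th percentile - an approximation of Flux's quantile()
    idx = max(0, round(0.95 * len(ordered)) - 1)
    return {
        "min": ordered[0],
        "max": ordered[-1],
        "mean": statistics.mean(ordered),
        "p95": ordered[idx],
    }

# Made-up sample response times, purely for illustration
print(summarise([120.0, 130.0, 125.0, 400.0, 128.0]))
```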


mentioned in issue CDN#19

mentioned in issue CDN#57