
# ContainerSpy

ContainerSpy is a lightweight daemon that connects to Docker, collects metrics (CPU%, RAM, etc.), and exports them
via the OpenTelemetry Protocol (OTLP).

You can then send them to a metrics store such as Prometheus, a collection agent such as Grafana Alloy,
or a cloud observability platform.

Note that ContainerSpy currently targets only Docker, not Kubernetes or any other orchestration system.
It outputs the same metrics as cAdvisor for drop-in compatibility with existing data series and dashboards.

## Why make/use this?

ContainerSpy is intended to replace [cAdvisor](https://github.com/google/cadvisor) in a Prometheus/Grafana monitoring
setup. It takes inspiration from [Beszel](https://www.beszel.dev/) in its approach.
The main reason it exists is my own difficulty deploying cAdvisor.

cAdvisor is rather RAM-heavy, and it really does not need to be so.
It also requires a plethora of different mounts to get working inside of a container, including /sys, or even the entire
/ filesystem, and in some cases must be run as a privileged user!

This is mostly because cAdvisor actually collects statistics on *cgroups*, not specifically on docker containers.
It does have specific integration for containerd, docker, and podman, but it will also happily report statistics about
systemd services to Prometheus too!
If you only want to support Docker, you need not bother with cgroups, as the Docker Engine can report all you need.

I have previously used Beszel for my monitoring, and its agent runs as an unprivileged user,
needs access to only the docker socket, collects all data out of the box, and has a lightweight footprint.

ContainerSpy aims to do what beszel-agent does, but instead of outputting to an opinionated AIO system,
outputs to (e.g.) Prometheus for a more heavyweight setup.

**I can highly recommend Beszel** as an easy-to-set-up monitoring solution for Docker.
It will give you CPU use, RAM use, disk and bandwidth use, swap use, both system-wide and per-container, with
configurable email alerting OOTB with very little setup. It is a great piece of software.

My motivation to move to a Prometheus/Grafana setup is that I want the centralised rich logging that Loki can give me.

## Setup Instructions

See the following section for detailed instructions on configuring containerspy.

<details>
<summary>Running the binary directly</summary>

If you are running an instance of the [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/)
on localhost, then you can simply run:
```bash
./containerspy
```

To pass configuration options you can either (recommended for quick testing) use env vars:
```bash
CSPY_OTLP_PROTO=grpc ./containerspy
```

or you can create a config file. If you create it at `/etc/containerspy/config.json`, it will be picked up
automatically, and you can just run `./containerspy` as before. If you create it anywhere else, you can specify its
location like so:
```bash
CSPY_CONFIG=./config.json ./containerspy
```
</details>

<details>
<summary>Docker</summary>

You can use either env vars or a config.json file to configure containerspy:

```bash
docker run \
	-v /var/run/docker.sock:/var/run/docker.sock:ro \
	-v ./config.json:/etc/containerspy/config.json:ro \
	ghcr.io/uwu/containerspy
```

```bash
# see the configuration table below for the available CSPY_ variables
docker run \
	-v /var/run/docker.sock:/var/run/docker.sock:ro \
	-e CSPY_XXX=YYY \
	ghcr.io/uwu/containerspy
```
</details>

<details>
<summary>Docker Compose</summary>

```yml
services:
  containerspy:
    image: ghcr.io/uwu/containerspy
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
    environment:
      CSPY_OTLP_ENDPOINT: http://collector:4318
#      CSPY_OTLP_INTERVAL: 30000 # 30s
    networks: [otlpnet]

# OTLP collector (you can use any OTLP receiver such as Alloy, Mimir, Prometheus)
#  collector:
#    image: otel/opentelemetry-collector-contrib
#    networks: [otlpnet]
#    ...

networks:
  otlpnet:
```
</details>

It is also possible to run containerspy as a service under your preferred init system.
No service files or units are provided here; please consult your init system's documentation.

## How to configure

| `config.json`          | env var              | description                                                       | default                                              |
|------------------------|----------------------|-------------------------------------------------------------------|------------------------------------------------------|
| `docker_socket`        | `CSPY_DOCKER_SOCKET` | The docker socket / named pipe to connect to                      | default docker socket for host OS                    |
| `otlp_protocol`        | `CSPY_OTLP_PROTO`    | Whether to use httpbinary, httpjson, or grpc to send OTLP metrics | httpbinary                                           |
| `otlp_endpoint`        | `CSPY_OTLP_ENDPOINT` | Where to post metrics to                                          | OTLP spec default endpoint                           |
| `otlp_export_interval` | `CSPY_OTLP_INTERVAL` | How often to report metrics, in milliseconds                      | value of `OTEL_METRIC_EXPORT_INTERVAL` or 60 seconds |

You can set configuration in the config file specified in the `CSPY_CONFIG` env variable
(`/etc/containerspy/config.json` by default), which supports JSON5 syntax, or configure via the `CSPY_` env vars.
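
As a rough illustration, the snippet below writes a config file that uses every key from the table above. The values are only examples, and writing `otlp_export_interval` as a JSON number (rather than a string) is an assumption. The `//` comments rely on the JSON5 support mentioned above.

```bash
# Write an example config file to the current directory.
cat > ./config.json <<'EOF'
{
	// keys come from the `config.json` column of the table above
	"docker_socket": "/var/run/docker.sock",
	"otlp_protocol": "grpc",
	"otlp_endpoint": "http://localhost:4317",
	// milliseconds; assumed here to be a JSON number
	"otlp_export_interval": 30000
}
EOF
echo "wrote ./config.json"
```

You would then point ContainerSpy at it with `CSPY_CONFIG=./config.json ./containerspy`.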

If a docker socket path is not set, ContainerSpy will try to connect to
`/var/run/docker.sock` on *NIX or `//./pipe/docker_engine` on Windows.

If an endpoint is not set, CSpy will try to post to the default ports and endpoints for an OTLP collector running on
the chosen protocol (`http://localhost:4318` for HTTP, `http://localhost:4317` for gRPC, see
[here](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/protocol/exporter.md) and
[here](https://github.com/open-telemetry/opentelemetry-rust/blob/bc82d4f6/opentelemetry-otlp/src/exporter/mod.rs#L60)).
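
The fallback described above can be sketched in shell. This is only an illustration of the documented behaviour, not ContainerSpy's actual code, and `default_otlp_endpoint` is a hypothetical helper name.

```bash
# Hypothetical helper: map an OTLP protocol name to the default endpoint
# that the paragraph above describes.
default_otlp_endpoint() {
	case "$1" in
		grpc)                echo "http://localhost:4317" ;;
		httpbinary|httpjson) echo "http://localhost:4318" ;;
		*)                   echo "unknown protocol: $1" >&2; return 1 ;;
	esac
}

# Protocol falls back to httpbinary; endpoint falls back to the protocol's default.
proto="${CSPY_OTLP_PROTO:-httpbinary}"
endpoint="${CSPY_OTLP_ENDPOINT:-$(default_otlp_endpoint "$proto")}"
echo "protocol=$proto endpoint=$endpoint"
```

So setting only `CSPY_OTLP_PROTO=grpc` should be enough to target a collector listening on the default gRPC port.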

Note: to send directly to Prometheus (with `--enable-feature=otlp-write-receiver`), use
`http://localhost:9090/api/v1/otlp/v1/metrics` as your endpoint, swapping `localhost:9090` for your Prometheus `host:port`.
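
For example, the environment for a direct-to-Prometheus setup might be prepared like this. `PROM_ADDR` is a hypothetical variable used only in this sketch, and keeping the default `httpbinary` protocol is an assumption based on this being an HTTP endpoint.

```bash
# PROM_ADDR is a placeholder for your Prometheus host:port (not a ContainerSpy option).
PROM_ADDR="${PROM_ADDR:-localhost:9090}"

# Assumption: the default HTTP/protobuf protocol is what Prometheus expects here.
export CSPY_OTLP_PROTO=httpbinary
export CSPY_OTLP_ENDPOINT="http://${PROM_ADDR}/api/v1/otlp/v1/metrics"

echo "$CSPY_OTLP_ENDPOINT"
```

Running `./containerspy` from the same shell would then pick up both variables.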

## TODO

ContainerSpy is now ready for deployment, but is still a work in progress. The planned features are:
 - implement cpu and fs metric labels
 - implement any metrics that should be available on Windows but aren't
 - automatically load configs from ./config.json too
 - (maybe?) add `--config` as another way to specify the location of the config file
 - use structured (json or syslog) logging to integrate nicely with log aggregation systems like Loki
 - (maybe?) read swap metrics if /sys is mounted (technically out of scope but might add anyway, not sure...)

## Supported metrics

!!! CONTAINERSPY DOES NOT SUPPORT CGROUPS V1 !!!
*Most* RAM metrics will be unavailable on cgroups v1, and any v1-only metrics are excluded.
ContainerSpy only officially supports Windows and Linux on cgroups v2. It will, however, not break on cgroups v1 hosts
and should just have missing metrics.
Yes, I know that implementing RAM metrics for cgroups v1 is totally possible, and in fact more data is available in many
cases, but I have no system to test on, and you really should be using v2 by now.
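
If you are unsure which cgroups version a Linux host uses, a common check is the filesystem type mounted at `/sys/fs/cgroup`. The sketch below assumes GNU `stat`, and `cgroup_version_from_fstype` is a hypothetical helper name. ContainerSpy itself does not need this check.

```bash
# Hypothetical helper: map the filesystem type of /sys/fs/cgroup to a cgroups version.
cgroup_version_from_fstype() {
	case "$1" in
		cgroup2fs) echo "v2" ;;      # unified hierarchy
		tmpfs)     echo "v1" ;;      # legacy or hybrid hierarchy
		*)         echo "unknown" ;;
	esac
}

# GNU stat: -f queries the filesystem, %T prints its type in human-readable form.
fstype="$(stat -fc %T /sys/fs/cgroup 2>/dev/null || echo unknown)"
echo "cgroups: $(cgroup_version_from_fstype "$fstype")"
```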

This is intended to be a drop-in replacement for cAdvisor, which lists its supported metrics
[here](https://github.com/google/cadvisor/blob/master/docs/storage/prometheus.md).

All generic labels attached to all metrics are implemented, and the status of labels applied only to specific metrics
is listed below ("N/A" if there are none).

The metrics from that list which ContainerSpy currently supports are:

| Name                                               | Metric-specific labels  | Notes                          |
|----------------------------------------------------|-------------------------|--------------------------------|
| `container_cpu_usage_seconds_total`                | TODO: `cpu`             |                                |
| `container_cpu_user_seconds_total`                 | N/A                     |                                |
| `container_cpu_system_seconds_total`               | N/A                     |                                |
| `container_cpu_cfs_periods_total`                  |                         |                                |
| `container_cpu_cfs_throttled_periods_total`        |                         |                                |
| `container_cpu_cfs_throttled_seconds_total`        |                         |                                |
| `container_fs_reads_bytes_total`                   | TODO: `device`          | Not reported on Windows (TODO) |
| `container_fs_writes_bytes_total`                  | TODO: `device`          | Not reported on Windows (TODO) |
| `container_last_seen`                              | N/A                     |                                |
| `container_memory_cache`                           | N/A                     | Not reported on Windows        |
| `container_memory_failures_total`                  | `failure_type`, `scope` | Not reported on Windows        |
| `container_memory_mapped_file`                     | N/A                     | Not reported on Windows        |
| `container_memory_rss`                             | N/A                     | Not reported on Windows        |
| `container_memory_usage_bytes`                     | N/A                     | Not reported on Windows        |
| `container_memory_working_set_bytes`               | N/A                     | Not reported on Windows        |
| `container_network_receive_bytes_total`            | `interface`             |                                |
| `container_network_receive_errors_total`           | `interface`             | Not reported on Windows        |
| `container_network_receive_packets_dropped_total`  | `interface`             |                                |
| `container_network_receive_packets_total`          | `interface`             |                                |
| `container_network_transmit_bytes_total`           | `interface`             |                                |
| `container_network_transmit_errors_total`          | `interface`             | Not reported on Windows        |
| `container_network_transmit_packets_dropped_total` | `interface`             |                                |
| `container_network_transmit_packets_total`         | `interface`             |                                |
| `container_start_time_seconds`                     | N/A                     |                                |

Additional TODO: figure out which of these metrics are or are not reportable on Windows.

The known omitted metrics are:

| Name                                             | Reason                                                      |
|--------------------------------------------------|-------------------------------------------------------------|
| `container_cpu_load_average_10s`                 | Not reported by Docker Engine API                           |
| `container_cpu_schedstat_run_periods_total`      | Not reported by Docker Engine API                           |
| `container_cpu_schedstat_runqueue_seconds_total` | Not reported by Docker Engine API                           |
| `container_cpu_schedstat_run_seconds_total`      | Not reported by Docker Engine API                           |
| `container_file_descriptors`                     | Not reported by Docker Engine API                           |
| `container_fs_inodes_free`                       | Not reported by Docker Engine API                           |
| `container_fs_inodes_total`                      | Not reported by Docker Engine API                           |
| `container_fs_io_current`                        | Not reported by Docker Engine API                           |
| `container_fs_io_time_seconds_total`             | Only reported on cgroups v1 hosts                           |
| `container_fs_io_time_weighted_seconds_total`    | Not reported by Docker Engine API                           |
| `container_fs_limit_bytes`                       | Not reported by Docker Engine API                           |
| `container_fs_read_seconds_total`                | Only reported on cgroups v1 hosts                           |
| `container_fs_reads_merged_total`                | Only reported on cgroups v1 hosts                           |
| `container_fs_reads_total`                       | Not reported by Docker Engine API                           |
| `container_fs_sector_reads_total`                | Only reported on cgroups v1 hosts                           |
| `container_fs_write_seconds_total`               | Only reported on cgroups v1 hosts                           |
| `container_fs_writes_merged_total`               | Only reported on cgroups v1 hosts                           |
| `container_fs_writes_total`                      | Not reported by Docker Engine API                           |
| `container_fs_sector_writes_total`               | Only reported on cgroups v1 hosts                           |
| `container_fs_usage_bytes`                       | Requires SystemDataUsage API                                |
| `container_hugetlb_failcnt`                      | Not reported by Docker Engine API                           |
| `container_hugetlb_max_usage_bytes`              | Not reported by Docker Engine API                           |
| `container_hugetlb_usage_bytes`                  | Not reported by Docker Engine API                           |
| `container_llc_occupancy_bytes`                  | Not reported by Docker Engine API                           |
| `container_memory_bandwidth_bytes`               | Not reported by Docker Engine API                           |
| `container_memory_bandwidth_local_bytes`         | Not reported by Docker Engine API                           |
| `container_memory_failcnt`                       | Only reported on cgroups v1 hosts                           |
| `container_memory_kernel_usage`                  | Undocumented, cspy has it, but I'm unsure my math's right!  |
| `container_memory_max_usage_bytes`               | Only reported on cgroups v1 hosts                           |
| `container_memory_migrate`                       | Not reported by Docker Engine API (or cA on my pc!)         |
| `container_memory_numa_pages`                    | Difficult to collect, not reported by cA on my pc           |
| `container_memory_swap`                          | Not reported by Docker Engine API                           |
| `container_network_advance_tcp_stats_total`      | Not reported by Docker Engine API                           |
| `container_network_tcp6_usage_total`             | Not reported by Docker Engine API                           |
| `container_network_tcp_usage_total`              | Not reported by Docker Engine API                           |
| `container_network_udp6_usage_total`             | Not reported by Docker Engine API                           |
| `container_network_udp_usage_total`              | Not reported by Docker Engine API                           |
| `container_oom_events_total`                     | Not reported by Docker Engine API                           |
| `container_perf_*`, `container_uncore_perf_*`    | Not reported by Docker Engine API                           |
| `container_processes`                            | Not reported by Docker Engine API (only threads, not procs) |
| `container_referenced_bytes`                     | Collection affects paging and causes mem latency            |
| `container_sockets`                              | Not reported by Docker Engine API                           |
| `container_spec_*`                               | Not reported by Docker Engine API                           |
| `container_tasks_state`                          | Not reported by Docker Engine API                           |
| `container_ulimits_soft`                         | Not reported by Docker Engine API                           |
| `machine_*`                                      | Out of scope, liable to be incorrect when containerised     |