How to self-host Plausible Analytics on Kubernetes
Google will decommission Universal Analytics on July 1st, 2023. This made me think about whether to migrate my Analytics properties to Google Analytics 4 or to look for an alternative.
I ended up with Plausible Analytics, a privacy-friendly and open-source alternative to Google Analytics. In this blog post I will explain how to self-host it on Kubernetes.
Plausible Analytics is not for everyone: you don't get the detailed reporting that Google Analytics or Matomo offer. But if your main goal is to track basic metrics in a privacy-friendly way, give Plausible a try.
Read the full documentation about self-hosting Plausible here.
My Kubernetes cluster is hosted on DigitalOcean, but the setup is very similar on any other Kubernetes cluster. We will create a Postgres database, a ClickHouse database, and of course the Plausible instance itself.
1: Create a namespace
apiVersion: v1
kind: Namespace
metadata:
  name: plausible
2: Define Plausible configuration
Most of these parameters are self-explanatory. The SMTP settings are only required if you want email reporting, and the Google variables are only required if you want to link Plausible with Google Search Console.
apiVersion: v1
kind: Secret
metadata:
  name: plausible
  namespace: plausible
type: Opaque
stringData:
  SECRET_KEY_BASE: <long random string>
  ADMIN_USER_NAME: admin
  ADMIN_USER_EMAIL: mail@sboersma.nl
  ADMIN_USER_PWD: <password>
  GOOGLE_CLIENT_ID: <google_client_id>
  GOOGLE_CLIENT_SECRET: <google_client_secret>
  DATABASE_URL: postgres://postgres:password@postgres-service:5432/postgres
  CLICKHOUSE_DATABASE_URL: http://clickhouse-service:8123/plausible
  BASE_URL: https://plausible.sboersma.nl
  MAILER_EMAIL: plausible@sboersma.nl
  SMTP_HOST_ADDR: <todo:replaceme>
  SMTP_HOST_PORT: "465"
  SMTP_HOST_SSL_ENABLED: "true"
  SMTP_USER_NAME: <todo:replaceme>
  SMTP_USER_PWD: <todo:replaceme>
  DISABLE_REGISTRATION: "true"
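SECRET_KEY_BASE must be a long random string. A quick way to generate one (a sketch, assuming `openssl` is installed) is:

```shell
# Generate 64 random bytes and base64-encode them for SECRET_KEY_BASE.
# openssl wraps base64 output across lines, so strip the newlines.
KEY=$(openssl rand -base64 64 | tr -d '\n')
echo "${#KEY}"   # length of the generated key (88 characters for 64 bytes)
```

Paste the resulting value into the Secret above.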
3: Plausible deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: plausible
  namespace: plausible
spec:
  selector:
    matchLabels:
      app: plausible
  template:
    metadata:
      labels:
        app: plausible
    spec:
      containers:
        - name: plausible
          image: plausible/analytics:latest
          command: ["/bin/sh"]
          args:
            [
              "-c",
              "sleep 10 && /entrypoint.sh db createdb && /entrypoint.sh db migrate && /entrypoint.sh db init-admin && /entrypoint.sh run"
            ]
          ports:
            - containerPort: 8000
          envFrom:
            - secretRef:
                name: plausible
4: Service for Plausible
apiVersion: v1
kind: Service
metadata:
  name: app
  namespace: plausible
spec:
  selector:
    app: plausible
  ports:
    - name: http
      protocol: TCP
      port: 80
      targetPort: 8000
5: Postgres variables
apiVersion: v1
kind: Secret
metadata:
  name: postgres-secret
  namespace: plausible
type: Opaque
stringData:
  POSTGRES_DB: postgres
  POSTGRES_USER: postgres
  POSTGRES_PASSWORD: password
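These credentials must line up with the DATABASE_URL in the plausible Secret from step 2. As a sanity check, that URL is assembled from the same values (shown here with this article's example defaults):

```shell
# Build the connection string Plausible uses; the host and port come from
# the postgres-service created in step 7.
PG_USER=postgres
PG_PASS=password
PG_DB=postgres
echo "postgres://${PG_USER}:${PG_PASS}@postgres-service:5432/${PG_DB}"
# → postgres://postgres:password@postgres-service:5432/postgres
```

If you change the password here, remember to update DATABASE_URL as well.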
6: Postgres Statefulset
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres-statefulset
  namespace: plausible
  labels:
    app: postgres
spec:
  serviceName: "postgres"
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:12
          envFrom:
            - secretRef:
                name: postgres-secret
          ports:
            - containerPort: 5432
              name: postgresdb
          volumeMounts:
            - name: postgres-data
              mountPath: /var/lib/postgresql
  volumeClaimTemplates:
    - metadata:
        name: postgres-data
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 2Gi
        storageClassName: do-block-storage
7: Service for Postgres
apiVersion: v1
kind: Service
metadata:
  name: postgres-service
  namespace: plausible
  labels:
    app: postgres
spec:
  ports:
    - port: 5432
      name: postgres
  type: ClusterIP
  selector:
    app: postgres
8: ClickHouse config
apiVersion: v1
kind: ConfigMap
metadata:
  name: clickhouse-config
  namespace: plausible
data:
  config.xml: |
    <?xml version="1.0"?>
    <yandex>
      <logger>
        <level>trace</level>
        <log>/var/log/clickhouse-server/clickhouse-server.log</log>
        <errorlog>/var/log/clickhouse-server/clickhouse-server.err.log</errorlog>
        <!-- Built-in method of log rotation.
             The threshold is big enough because 'logrotate' is used instead of the built-in method.
             So, this threshold setting just protects from cases when 'logrotate' gets called too late.
             Why logrotate? Because the built-in method doesn't allow specifying
             that the two last files must be uncompressed and the others compressed
             (important for some administration scripts).
        -->
        <size>1000M</size>
        <count>10</count>
      </logger>
      <http_port>8123</http_port>
      <tcp_port>9000</tcp_port>
      <!-- Port for communication between replicas. Used for data exchange. -->
      <interserver_http_port>9009</interserver_http_port>
      <!-- Hostname that is used by other replicas to request this server.
           If not specified, then it is determined analogously to the 'hostname -f' command.
           This setting could be used to switch replication to another network interface.
      -->
      <!--
      <interserver_http_host>example.yandex.ru</interserver_http_host>
      -->
      <!-- Listen on the specified host. :: is the wildcard IPv6 address; it allows accepting connections over both IPv4 and IPv6 from everywhere. -->
      <listen_host>::</listen_host>
      <max_connections>4096</max_connections>
      <keep_alive_timeout>3</keep_alive_timeout>
      <!-- Maximum number of concurrent queries. -->
      <max_concurrent_queries>100</max_concurrent_queries>
      <!-- Size of the cache of uncompressed blocks of data, used in tables of the MergeTree family.
           In bytes. There is a single cache for the server. Memory is allocated only on demand.
           The cache is used when the 'use_uncompressed_cache' user setting is turned on (off by default).
           The uncompressed cache is advantageous only for very short queries and in rare cases.
      -->
      <uncompressed_cache_size>8589934592</uncompressed_cache_size>
      <!-- Approximate size of the mark cache, used in tables of the MergeTree family.
           In bytes. There is a single cache for the server. Memory is allocated only on demand.
           You should not lower this value.
      -->
      <mark_cache_size>5368709120</mark_cache_size>
      <!-- Path to the data directory, with trailing slash. -->
      <path>/opt/clickhouse/</path>
      <!-- Path to temporary data for processing hard queries. -->
      <tmp_path>/opt/clickhouse/tmp/</tmp_path>
      <!-- Path to the configuration file with users, access rights, settings profiles, and quotas. -->
      <users_config>users.xml</users_config>
      <!-- Default profile of settings. -->
      <default_profile>default</default_profile>
      <!-- Default database. -->
      <default_database>default</default_database>
      <!-- Configuration of clusters that could be used in Distributed tables.
           https://clickhouse.yandex/reference_en.html#Distributed
      <remote_servers incl="clickhouse_remote_servers" />
      -->
      <!-- If an element has an 'incl' attribute, the corresponding substitution from another file will be used as its value.
           By default, the path to the file with substitutions is /etc/metrika.xml. It can be changed in the 'include_from' element of the config.
           Values for substitutions are specified in /yandex/name_of_substitution elements in that file.
      -->
      <!-- ZooKeeper is used to store metadata about replicas when using Replicated tables.
           Optional. If you don't use replicated tables, you can omit this.
           See https://clickhouse.yandex/reference_en.html#Data%20replication
      -->
      <zookeeper incl="zookeeper-servers" optional="true" />
      <!-- Substitutions for parameters of replicated tables.
           Optional. If you don't use replicated tables, you can omit this.
           See https://clickhouse.yandex/reference_en.html#Creating%20replicated%20tables
      -->
      <macros incl="macros" optional="true" />
      <!-- Reloading interval for embedded dictionaries, in seconds. Default: 3600. -->
      <builtin_dictionaries_reload_interval>3600</builtin_dictionaries_reload_interval>
      <!-- Sending data to Graphite for monitoring. -->
      <use_graphite>false</use_graphite>
      <!-- Uncomment if use_graphite.
      <graphite>
        <host>127.0.0.1</host>
        <port>42000</port>
        <root_path>one_min</root_path>
        <timeout>0.1</timeout>
      </graphite>
      -->
      <!-- Query log. Used only for queries with the setting log_queries = 1. -->
      <query_log>
        <!-- What table to insert data into. If the table does not exist, it will be created.
             When the query log structure is changed after a system update,
             the old table will be renamed and a new table will be created automatically.
        -->
        <database>system</database>
        <table>query_log</table>
        <!-- Interval of flushing data. -->
        <flush_interval_milliseconds>7500</flush_interval_milliseconds>
      </query_log>
      <!-- Parameters for embedded dictionaries, used in Yandex.Metrica.
           See https://clickhouse.yandex/reference_en.html#Internal%20dictionaries
      -->
      <!-- Path to the file with the region hierarchy. -->
      <!-- <path_to_regions_hierarchy_file>/opt/geo/regions_hierarchy.txt</path_to_regions_hierarchy_file> -->
      <!-- Path to the directory with files containing names of regions -->
      <!-- <path_to_regions_names_files>/opt/geo/</path_to_regions_names_files> -->
      <!-- Configuration of external dictionaries. See:
           https://clickhouse.yandex/reference_en.html#External%20Dictionaries
      -->
      <dictionaries_config>*_dictionary.xml</dictionaries_config>
      <!-- Uncomment if you want data to be compressed 30-100% better.
           Don't do that if you just started using ClickHouse.
      -->
      <compression incl="clickhouse_compression">
        <!--
        <!- - Set of variants. Checked in order. The last matching case wins. If nothing matches, lz4 will be used. - ->
        <case>
          <!- - Conditions. All must be satisfied. Some conditions may be omitted. - ->
          <min_part_size>10000000000</min_part_size> <!- - Min part size in bytes. - ->
          <min_part_size_ratio>0.01</min_part_size_ratio> <!- - Min size of part relative to whole table size. - ->
          <!- - Which compression method to choose. - ->
          <method>zstd</method> <!- - Keep in mind that the zstd compression library is highly experimental. - ->
        </case>
        -->
      </compression>
      <!-- Settings to fine-tune MergeTree tables. See the documentation in the source code, in MergeTreeSettings.h -->
      <!--
      <merge_tree>
        <max_suspicious_broken_parts>5</max_suspicious_broken_parts>
      </merge_tree>
      -->
      <!-- Example of parameters for the GraphiteMergeTree table engine -->
      <graphite_rollup_example>
        <pattern>
          <regexp>click_cost</regexp>
          <function>any</function>
          <retention>
            <age>0</age>
            <precision>3600</precision>
          </retention>
          <retention>
            <age>86400</age>
            <precision>60</precision>
          </retention>
        </pattern>
        <default>
          <function>max</function>
          <retention>
            <age>0</age>
            <precision>60</precision>
          </retention>
          <retention>
            <age>3600</age>
            <precision>300</precision>
          </retention>
          <retention>
            <age>86400</age>
            <precision>3600</precision>
          </retention>
        </default>
      </graphite_rollup_example>
    </yandex>
  users.xml: |
    <?xml version="1.0"?>
    <yandex>
      <!-- Profiles of settings. -->
      <profiles>
        <!-- Default settings. -->
        <default>
          <!-- Maximum memory usage for processing a single query, in bytes. -->
          <max_memory_usage>10000000000</max_memory_usage>
          <!-- Use the cache of uncompressed blocks of data. Meaningful only for processing many very short queries. -->
          <use_uncompressed_cache>0</use_uncompressed_cache>
          <!-- How to choose between replicas during distributed query processing.
               random - choose a random replica from the set of replicas with the minimum number of errors
               nearest_hostname - from the set of replicas with the minimum number of errors, choose the replica
                 with the minimum number of different symbols between the replica's hostname and the local hostname
                 (Hamming distance).
               in_order - the first live replica is chosen, in the specified order.
               first_or_random - if the first replica has a higher number of errors, pick a random one from the replicas with the minimum number of errors.
          -->
          <load_balancing>random</load_balancing>
          <readonly>0</readonly>
        </default>
      </profiles>
      <!-- Users and ACL. -->
      <users>
        <!-- If no user name was specified, the 'default' user is used. -->
        <default>
          <password></password>
          <networks incl="networks" replace="replace">
            <ip>::/0</ip>
          </networks>
          <!-- Settings profile for the user. -->
          <profile>default</profile>
          <!-- Quota for the user. -->
          <quota>default</quota>
        </default>
      </users>
      <!-- Quotas. -->
      <quotas>
        <!-- Name of the quota. -->
        <default>
          <!-- Limits for a time interval. You could specify many intervals with different limits. -->
          <interval>
            <!-- Length of the interval. -->
            <duration>3600</duration>
            <!-- No limits. Just calculate resource usage for the time interval. -->
            <queries>0</queries>
            <errors>0</errors>
            <result_rows>0</result_rows>
            <read_rows>0</read_rows>
            <execution_time>0</execution_time>
          </interval>
        </default>
      </quotas>
    </yandex>
9: ClickHouse Statefulset
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: clickhouse-statefulset
  namespace: plausible
  labels:
    app: clickhouse
spec:
  serviceName: "clickhouse"
  replicas: 1
  selector:
    matchLabels:
      app: clickhouse
  template:
    metadata:
      labels:
        app: clickhouse
    spec:
      containers:
        - name: clickhouse
          image: yandex/clickhouse-server
          ports:
            - containerPort: 9000
              name: native
            - containerPort: 8123
              name: http
          volumeMounts:
            - name: clickhouse-data
              mountPath: /var/lib/clickhouse
            - name: clickhouse-config-vol
              mountPath: "/etc/clickhouse-server"
              readOnly: true
      volumes:
        - name: clickhouse-config-vol
          configMap:
            name: clickhouse-config
            items:
              - key: "config.xml"
                path: "config.xml"
              - key: "users.xml"
                path: "users.xml"
  volumeClaimTemplates:
    - metadata:
        name: clickhouse-data
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 10Gi
        storageClassName: do-block-storage
10: ClickHouse service
apiVersion: v1
kind: Service
metadata:
  name: clickhouse-service
  namespace: plausible
  labels:
    app: clickhouse
spec:
  ports:
    - port: 9000
      name: native
    - port: 8123
      name: http
  type: ClusterIP
  selector:
    app: clickhouse
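The CLICKHOUSE_DATABASE_URL in the plausible Secret from step 2 targets the HTTP port (8123) of this service; the database name in the path is created by the `db createdb` step in the deployment's startup command. As a sanity check:

```shell
# Assemble the ClickHouse URL Plausible expects; the host is the
# clickhouse-service defined above.
CH_HOST=clickhouse-service
CH_PORT=8123
CH_DB=plausible
echo "http://${CH_HOST}:${CH_PORT}/${CH_DB}"
# → http://clickhouse-service:8123/plausible
```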
11: Ingress and SSL certificate
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app
  namespace: plausible
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt"
    kubernetes.io/ingress.class: "nginx"
spec:
  tls:
    - hosts:
        - plausible.sboersma.nl
      secretName: app-tls
  rules:
    - host: plausible.sboersma.nl
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app
                port:
                  number: 80
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: app
  namespace: plausible
spec:
  # Secret names are always required.
  secretName: app-tls
  duration: 2160h # 90d
  renewBefore: 360h # 15d
  isCA: false
  privateKey:
    algorithm: RSA
    encoding: PKCS1
    size: 2048
  usages:
    - server auth
    - client auth
  dnsNames:
    - plausible.sboersma.nl
  issuerRef:
    name: letsencrypt
    kind: ClusterIssuer
Exclude yourself from being tracked
When you want to opt out of being tracked, Plausible suggests making changes to your ad blocker. I chose a different approach: I created an extra Ingress (stats.sboersma.nl, in addition to plausible.sboersma.nl) that I use for the tracking script. This way it's easy to block that domain in your hosts file or Pi-hole.
Check my complete deployment YAML file here on GitLab.
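The hosts-file variant of that block is a one-liner. A sketch, written against a scratch file for illustration (on a real machine you would append the same line to /etc/hosts with sudo, or add the domain to Pi-hole's blocklist):

```shell
# Map the tracking domain to a null address so the tracking script never loads.
# Using a scratch file here instead of /etc/hosts to avoid needing root.
HOSTS_FILE=./hosts.example
echo "0.0.0.0 stats.sboersma.nl" >> "$HOSTS_FILE"
grep "stats.sboersma.nl" "$HOSTS_FILE"
```

Because only the tracking script lives on the extra domain, the dashboard on plausible.sboersma.nl keeps working while your own visits are no longer recorded.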