How to self-host Plausible Analytics on Kubernetes
Google will decommission Universal Analytics on July 1st, 2023. This made me think about whether to migrate my Analytics properties to Google Analytics 4 or to look for an alternative.
I ended up with Plausible Analytics, a privacy-friendly and open-source alternative to Google Analytics. In this blog post I will explain how to self-host it on Kubernetes.
Plausible Analytics is not for everyone: you don't get the detailed reporting that Google Analytics or Matomo offer. But if your main goal is to track basic metrics in a privacy-friendly way, give Plausible a try.
Read the full documentation about self-hosting Plausible here.
My Kubernetes cluster is hosted on DigitalOcean, but the setup is very similar on any other Kubernetes cluster. We will create a Postgres database, a ClickHouse database, and of course the Plausible instance itself.
1: Create a namespace
apiVersion: v1
kind: Namespace
metadata:
  name: plausible
2: Define Plausible configuration
Most of these parameters are self-explanatory. The SMTP settings are only required if you want email reporting, and the Google variables are only required if you want to link Plausible with Google Search Console.
apiVersion: v1
kind: Secret
metadata:
  name: plausible
  namespace: plausible
type: Opaque
stringData:
  SECRET_KEY_BASE: <long random string>
  ADMIN_USER_NAME: admin
  ADMIN_USER_EMAIL: mail@sboersma.nl
  ADMIN_USER_PWD: <password>
  GOOGLE_CLIENT_ID: <google_client_id>
  GOOGLE_CLIENT_SECRET: <google_client_secret>
  DATABASE_URL: postgres://postgres:password@postgres-service:5432/postgres
  CLICKHOUSE_DATABASE_URL: http://clickhouse-service:8123/plausible
  BASE_URL: https://plausible.sboersma.nl
  MAILER_EMAIL: plausible@sboersma.nl
  SMTP_HOST_ADDR: <todo:replaceme>
  SMTP_HOST_PORT: "465"
  SMTP_HOST_SSL_ENABLED: "true"
  SMTP_USER_NAME: <todo:replaceme>
  SMTP_USER_PWD: <todo:replaceme>
  DISABLE_REGISTRATION: "true"
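SECRET_KEY_BASE must be a long random string. A quick way to generate one (a sketch, assuming `openssl` is installed) is:

```shell
# Generate 64 random bytes and base64-encode them for SECRET_KEY_BASE.
# openssl wraps base64 output across lines, so strip the newlines.
KEY=$(openssl rand -base64 64 | tr -d '\n')
echo "${#KEY}"   # length of the generated key (88 characters for 64 bytes)
```

Paste the resulting value into the Secret above.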
3: Plausible deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: plausible
  namespace: plausible
spec:
  selector:
    matchLabels:
      app: plausible
  template:
    metadata:
      labels:
        app: plausible
    spec:
      containers:
        - name: plausible
          image: plausible/analytics:latest
          command: ["/bin/sh"]
          args:
            [
              "-c",
              "sleep 10 && /entrypoint.sh db createdb && /entrypoint.sh db migrate && /entrypoint.sh db init-admin && /entrypoint.sh run"
            ]
          ports:
            - containerPort: 8000
          envFrom:
            - secretRef:
                name: plausible
4: Service for Plausible
apiVersion: v1
kind: Service
metadata:
  name: app
  namespace: plausible
spec:
  selector:
    app: plausible
  ports:
    - name: http
      protocol: TCP
      port: 80
      targetPort: 8000
5: Postgres variables
apiVersion: v1
kind: Secret
metadata:
  name: postgres-secret
  namespace: plausible
type: Opaque
stringData:
  POSTGRES_DB: postgres
  POSTGRES_USER: postgres
  POSTGRES_PASSWORD: password
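These credentials must line up with the DATABASE_URL in the plausible Secret from step 2. As a sanity check, that URL is assembled from the same values (shown here with this article's example defaults):

```shell
# Build the connection string Plausible uses; the host and port come from
# the postgres-service created in step 7.
PG_USER=postgres
PG_PASS=password
PG_DB=postgres
echo "postgres://${PG_USER}:${PG_PASS}@postgres-service:5432/${PG_DB}"
# → postgres://postgres:password@postgres-service:5432/postgres
```

If you change the password here, remember to update DATABASE_URL as well.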
6: Postgres Statefulset
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres-statefulset
  namespace: plausible
  labels:
    app: postgres
spec:
  serviceName: "postgres"
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:12
          envFrom:
            - secretRef:
                name: postgres-secret
          ports:
            - containerPort: 5432
              name: postgresdb
          volumeMounts:
            - name: postgres-data
              mountPath: /var/lib/postgresql
  volumeClaimTemplates:
    - metadata:
        name: postgres-data
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 2Gi
        storageClassName: do-block-storage
7: Service for Postgres
apiVersion: v1
kind: Service
metadata:
  name: postgres-service
  namespace: plausible
  labels:
    app: postgres
spec:
  ports:
    - port: 5432
      name: postgres
  type: ClusterIP
  selector:
    app: postgres
8: ClickHouse config
apiVersion: v1
kind: ConfigMap
metadata:
  name: clickhouse-config
  namespace: plausible
data:
  config.xml: |
    <?xml version="1.0"?>
    <yandex>
      <logger>
        <level>trace</level>
        <log>/var/log/clickhouse-server/clickhouse-server.log</log>
        <errorlog>/var/log/clickhouse-server/clickhouse-server.err.log</errorlog>
        <!-- Built-in method of log rotation.
             The threshold is big enough because 'logrotate' is used instead of the built-in method.
             So, this threshold setting just protects from cases when 'logrotate' gets called too late.
             Why logrotate? Because the built-in method doesn't allow specifying
             that the two last files must be uncompressed and the others compressed
             (important for some administration scripts).
        -->
        <size>1000M</size>
        <count>10</count>
      </logger>
      <http_port>8123</http_port>
      <tcp_port>9000</tcp_port>
      <!-- Port for communication between replicas. Used for data exchange. -->
      <interserver_http_port>9009</interserver_http_port>
      <!-- Hostname that is used by other replicas to request this server.
           If not specified, then it is determined analogously to the 'hostname -f' command.
           This setting could be used to switch replication to another network interface.
      -->
      <!--
      <interserver_http_host>example.yandex.ru</interserver_http_host>
      -->
      <!-- Listen on the specified host. :: is the wildcard IPv6 address; it allows accepting connections over both IPv4 and IPv6 from everywhere. -->
      <listen_host>::</listen_host>
      <max_connections>4096</max_connections>
      <keep_alive_timeout>3</keep_alive_timeout>
      <!-- Maximum number of concurrent queries. -->
      <max_concurrent_queries>100</max_concurrent_queries>
      <!-- Size of the cache of uncompressed blocks of data, used in tables of the MergeTree family.
           In bytes. There is a single cache for the server. Memory is allocated only on demand.
           The cache is used when the 'use_uncompressed_cache' user setting is turned on (off by default).
           The uncompressed cache is advantageous only for very short queries and in rare cases.
      -->
      <uncompressed_cache_size>8589934592</uncompressed_cache_size>
      <!-- Approximate size of the mark cache, used in tables of the MergeTree family.
           In bytes. There is a single cache for the server. Memory is allocated only on demand.
           You should not lower this value.
      -->
      <mark_cache_size>5368709120</mark_cache_size>
      <!-- Path to the data directory, with trailing slash. -->
      <path>/opt/clickhouse/</path>
      <!-- Path to temporary data for processing hard queries. -->
      <tmp_path>/opt/clickhouse/tmp/</tmp_path>
      <!-- Path to the configuration file with users, access rights, settings profiles, and quotas. -->
      <users_config>users.xml</users_config>
      <!-- Default profile of settings. -->
      <default_profile>default</default_profile>
      <!-- Default database. -->
      <default_database>default</default_database>
      <!-- Configuration of clusters that could be used in Distributed tables.
           https://clickhouse.yandex/reference_en.html#Distributed
      <remote_servers incl="clickhouse_remote_servers" />
      -->
      <!-- If an element has an 'incl' attribute, the corresponding substitution from another file will be used as its value.
           By default, the path to the file with substitutions is /etc/metrika.xml. It can be changed in the 'include_from' element of the config.
           Values for substitutions are specified in /yandex/name_of_substitution elements in that file.
      -->
      <!-- ZooKeeper is used to store metadata about replicas when using Replicated tables.
           Optional. If you don't use replicated tables, you can omit this.
           See https://clickhouse.yandex/reference_en.html#Data%20replication
      -->
      <zookeeper incl="zookeeper-servers" optional="true" />
      <!-- Substitutions for parameters of replicated tables.
           Optional. If you don't use replicated tables, you can omit this.
           See https://clickhouse.yandex/reference_en.html#Creating%20replicated%20tables
      -->
      <macros incl="macros" optional="true" />
      <!-- Reloading interval for embedded dictionaries, in seconds. Default: 3600. -->
      <builtin_dictionaries_reload_interval>3600</builtin_dictionaries_reload_interval>
      <!-- Sending data to Graphite for monitoring. -->
      <use_graphite>false</use_graphite>
      <!-- Uncomment if use_graphite.
      <graphite>
        <host>127.0.0.1</host>
        <port>42000</port>
        <root_path>one_min</root_path>
        <timeout>0.1</timeout>
      </graphite>
      -->
      <!-- Query log. Used only for queries with the setting log_queries = 1. -->
      <query_log>
        <!-- What table to insert data into. If the table does not exist, it will be created.
             When the query log structure is changed after a system update,
             the old table will be renamed and a new table will be created automatically.
        -->
        <database>system</database>
        <table>query_log</table>
        <!-- Interval of flushing data. -->
        <flush_interval_milliseconds>7500</flush_interval_milliseconds>
      </query_log>
      <!-- Parameters for embedded dictionaries, used in Yandex.Metrica.
           See https://clickhouse.yandex/reference_en.html#Internal%20dictionaries
      -->
      <!-- Path to the file with the region hierarchy. -->
      <!-- <path_to_regions_hierarchy_file>/opt/geo/regions_hierarchy.txt</path_to_regions_hierarchy_file> -->
      <!-- Path to the directory with files containing names of regions -->
      <!-- <path_to_regions_names_files>/opt/geo/</path_to_regions_names_files> -->
      <!-- Configuration of external dictionaries. See:
           https://clickhouse.yandex/reference_en.html#External%20Dictionaries
      -->
      <dictionaries_config>*_dictionary.xml</dictionaries_config>
      <!-- Uncomment if you want data to be compressed 30-100% better.
           Don't do that if you just started using ClickHouse.
      -->
      <compression incl="clickhouse_compression">
        <!--
        <!- - Set of variants. Checked in order. The last matching case wins. If nothing matches, lz4 will be used. - ->
        <case>
          <!- - Conditions. All must be satisfied. Some conditions may be omitted. - ->
          <min_part_size>10000000000</min_part_size> <!- - Min part size in bytes. - ->
          <min_part_size_ratio>0.01</min_part_size_ratio> <!- - Min size of part relative to whole table size. - ->
          <!- - Which compression method to choose. - ->
          <method>zstd</method> <!- - Keep in mind that the zstd compression library is highly experimental. - ->
        </case>
        -->
      </compression>
      <!-- Settings to fine-tune MergeTree tables. See the documentation in the source code, in MergeTreeSettings.h -->
      <!--
      <merge_tree>
        <max_suspicious_broken_parts>5</max_suspicious_broken_parts>
      </merge_tree>
      -->
      <!-- Example of parameters for the GraphiteMergeTree table engine -->
      <graphite_rollup_example>
        <pattern>
          <regexp>click_cost</regexp>
          <function>any</function>
          <retention>
            <age>0</age>
            <precision>3600</precision>
          </retention>
          <retention>
            <age>86400</age>
            <precision>60</precision>
          </retention>
        </pattern>
        <default>
          <function>max</function>
          <retention>
            <age>0</age>
            <precision>60</precision>
          </retention>
          <retention>
            <age>3600</age>
            <precision>300</precision>
          </retention>
          <retention>
            <age>86400</age>
            <precision>3600</precision>
          </retention>
        </default>
      </graphite_rollup_example>
    </yandex>
  users.xml: |
    <?xml version="1.0"?>
    <yandex>
      <!-- Profiles of settings. -->
      <profiles>
        <!-- Default settings. -->
        <default>
          <!-- Maximum memory usage for processing a single query, in bytes. -->
          <max_memory_usage>10000000000</max_memory_usage>
          <!-- Use the cache of uncompressed blocks of data. Meaningful only for processing many very short queries. -->
          <use_uncompressed_cache>0</use_uncompressed_cache>
          <!-- How to choose between replicas during distributed query processing.
               random - choose a random replica from the set of replicas with the minimum number of errors
               nearest_hostname - from the set of replicas with the minimum number of errors, choose the replica
                 with the minimum number of different symbols between the replica's hostname and the local hostname
                 (Hamming distance).
               in_order - the first live replica is chosen, in the specified order.
               first_or_random - if the first replica has a higher number of errors, pick a random one from the replicas with the minimum number of errors.
          -->
          <load_balancing>random</load_balancing>
          <readonly>0</readonly>
        </default>
      </profiles>
      <!-- Users and ACL. -->
      <users>
        <!-- If no user name was specified, the 'default' user is used. -->
        <default>
          <password></password>
          <networks incl="networks" replace="replace">
            <ip>::/0</ip>
          </networks>
          <!-- Settings profile for the user. -->
          <profile>default</profile>
          <!-- Quota for the user. -->
          <quota>default</quota>
        </default>
      </users>
      <!-- Quotas. -->
      <quotas>
        <!-- Name of the quota. -->
        <default>
          <!-- Limits for a time interval. You could specify many intervals with different limits. -->
          <interval>
            <!-- Length of the interval. -->
            <duration>3600</duration>
            <!-- No limits. Just calculate resource usage for the time interval. -->
            <queries>0</queries>
            <errors>0</errors>
            <result_rows>0</result_rows>
            <read_rows>0</read_rows>
            <execution_time>0</execution_time>
          </interval>
        </default>
      </quotas>
    </yandex>
9: ClickHouse Statefulset
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: clickhouse-statefulset
  namespace: plausible
  labels:
    app: clickhouse
spec:
  serviceName: "clickhouse"
  replicas: 1
  selector:
    matchLabels:
      app: clickhouse
  template:
    metadata:
      labels:
        app: clickhouse
    spec:
      containers:
        - name: clickhouse
          image: yandex/clickhouse-server
          ports:
            - containerPort: 9000
              name: native
            - containerPort: 8123
              name: http
          volumeMounts:
            - name: clickhouse-data
              mountPath: /var/lib/clickhouse
            - name: clickhouse-config-vol
              mountPath: "/etc/clickhouse-server"
              readOnly: true
      volumes:
        - name: clickhouse-config-vol
          configMap:
            name: clickhouse-config
            items:
              - key: "config.xml"
                path: "config.xml"
              - key: "users.xml"
                path: "users.xml"
  volumeClaimTemplates:
    - metadata:
        name: clickhouse-data
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 10Gi
        storageClassName: do-block-storage
10: ClickHouse service
apiVersion: v1
kind: Service
metadata:
  name: clickhouse-service
  namespace: plausible
  labels:
    app: clickhouse
spec:
  ports:
    - port: 9000
      name: native
    - port: 8123
      name: http
  type: ClusterIP
  selector:
    app: clickhouse
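The CLICKHOUSE_DATABASE_URL in the plausible Secret from step 2 targets the HTTP port (8123) of this service; the database name in the path is created by the `db createdb` step in the deployment's startup command. As a sanity check:

```shell
# Assemble the ClickHouse URL Plausible expects; the host is the
# clickhouse-service defined above.
CH_HOST=clickhouse-service
CH_PORT=8123
CH_DB=plausible
echo "http://${CH_HOST}:${CH_PORT}/${CH_DB}"
# → http://clickhouse-service:8123/plausible
```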
11: Ingress and SSL certificate
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app
  namespace: plausible
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt"
    kubernetes.io/ingress.class: "nginx"
spec:
  tls:
    - hosts:
        - plausible.sboersma.nl
      secretName: app-tls
  rules:
    - host: plausible.sboersma.nl
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app
                port:
                  number: 80
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: app
  namespace: plausible
spec:
  # Secret names are always required.
  secretName: app-tls
  duration: 2160h # 90d
  renewBefore: 360h # 15d
  isCA: false
  privateKey:
    algorithm: RSA
    encoding: PKCS1
    size: 2048
  usages:
    - server auth
    - client auth
  dnsNames:
    - plausible.sboersma.nl
  issuerRef:
    name: letsencrypt
    kind: ClusterIssuer
Exclude yourself from being tracked
When you want to opt out of being tracked, Plausible suggests making changes to your ad blocker. I chose a different approach: I created an extra Ingress (stats.sboersma.nl, in addition to plausible.sboersma.nl) that I use for the tracking script. This way it's easy to block that domain in your hosts file or Pi-hole.
Check my complete deployment YAML file here on GitLab.
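The hosts-file variant of that block is a one-liner. A sketch, written against a scratch file for illustration (on a real machine you would append the same line to /etc/hosts with sudo, or add the domain to Pi-hole's blocklist):

```shell
# Map the tracking domain to a null address so the tracking script never loads.
# Using a scratch file here instead of /etc/hosts to avoid needing root.
HOSTS_FILE=./hosts.example
echo "0.0.0.0 stats.sboersma.nl" >> "$HOSTS_FILE"
grep "stats.sboersma.nl" "$HOSTS_FILE"
```

Because only the tracking script lives on the extra domain, the dashboard on plausible.sboersma.nl keeps working while your own visits are no longer recorded.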