ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
At PostHog we use it to store metadata for ClickHouse and Kafka.
Failure modes
Disk space usage increases rapidly
It has been observed that ZooKeeper can suddenly increase its disk usage after being in a stable state for some time. This can sometimes be resolved by ensuring that old ZooKeeper snapshots are cleared. If you experience this issue, you can validate this solution by running zkCleanup.sh with a retention count of three and checking the disk usage before and after.
This will remove all snapshots aside from the last three.
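The cleanup described above can be sketched roughly as follows. This is a minimal sketch, not the exact command from our tooling: the namespace, pod name, data directory and script path are assumptions based on a typical Bitnami-style ZooKeeper deployment, so adjust them to match your cluster. By default it only prints the commands; set DRY_RUN=0 to actually run them.

```shell
#!/usr/bin/env bash
# Sketch: clear old ZooKeeper snapshots and show disk usage before and after.
# Every default below is an assumption; override it via the environment.
set -euo pipefail

ZK_NAMESPACE="${ZK_NAMESPACE:-posthog}"                              # assumed namespace
ZK_POD="${ZK_POD:-posthog-posthog-zookeeper-0}"                      # assumed pod name
ZK_DATA_DIR="${ZK_DATA_DIR:-/bitnami/zookeeper}"                     # assumed data mount
ZK_CLEANUP="${ZK_CLEANUP:-/opt/bitnami/zookeeper/bin/zkCleanup.sh}"  # assumed script path
DRY_RUN="${DRY_RUN:-1}"                                              # 1 = print only, 0 = execute

# Run a command inside the ZooKeeper pod, or just print it when DRY_RUN=1.
zk_exec() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "kubectl exec -n $ZK_NAMESPACE $ZK_POD -- $*"
  else
    kubectl exec -n "$ZK_NAMESPACE" "$ZK_POD" -- "$@"
  fi
}

cleanup_snapshots() {
  zk_exec df -h "$ZK_DATA_DIR"   # disk usage before the cleanup
  zk_exec "$ZK_CLEANUP" -n 3     # keep only the three most recent snapshots
  zk_exec df -h "$ZK_DATA_DIR"   # disk usage after the cleanup
}

cleanup_snapshots
```

If the second df shows noticeably less usage than the first, stale snapshots were likely the cause of the growth.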
In newer versions of our Helm chart, snapshot cleanups run every hour. If you experience
ZooKeeper disk space issues and are on chart 18.2.0 or below, you can update to a later
version to enable this. Alternatively, you can specify the Helm value
zookeeper.autopurge.purgeInterval=1, which causes the cleanup job to run every hour.
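As an illustration of setting that Helm value, here is a hedged sketch. The release name, chart reference and namespace are assumptions and will differ per installation; only the zookeeper.autopurge.purgeInterval key comes from the text above. The script prints the command by default; set DRY_RUN=0 to run it.

```shell
#!/usr/bin/env bash
# Sketch: enable hourly ZooKeeper snapshot purging through a Helm value.
# Release, chart and namespace are assumptions; override via the environment.
set -euo pipefail

HELM_RELEASE="${HELM_RELEASE:-posthog}"          # assumed release name
HELM_CHART="${HELM_CHART:-posthog/posthog}"      # assumed chart reference
HELM_NAMESPACE="${HELM_NAMESPACE:-posthog}"      # assumed namespace
DRY_RUN="${DRY_RUN:-1}"                          # 1 = print only, 0 = execute

# Build (and optionally run) a helm upgrade that sets the purge interval in hours.
set_purge_interval() {
  local hours="$1"
  local cmd=(helm upgrade "$HELM_RELEASE" "$HELM_CHART"
             --namespace "$HELM_NAMESPACE"
             --reuse-values
             --set "zookeeper.autopurge.purgeInterval=$hours")
  if [ "$DRY_RUN" = "1" ]; then
    echo "${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}

set_purge_interval 1
```

If you manage your configuration in a values file, adding the same key there and passing the file with -f is an equivalent alternative to --set.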
If you wish to further debug what is being added to your cluster, you can inspect a snapshot diff by running zkSnapshotComparer.sh.
This will give you a breakdown of the number of nodes in each snapshot, as well as the exact node difference between the two.
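A snapshot comparison along these lines might be invoked as in the sketch below. The pod name, namespace, snapshot directory, script path and the two snapshot file names are all hypothetical placeholders, and the threshold values are arbitrary; it also assumes a ZooKeeper version that ships zkSnapshotComparer.sh. The command is only printed unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# Sketch: compare two ZooKeeper snapshots to see which nodes were added.
# Every default below is an assumption; override it via the environment.
set -euo pipefail

ZK_NAMESPACE="${ZK_NAMESPACE:-posthog}"                                       # assumed namespace
ZK_POD="${ZK_POD:-posthog-posthog-zookeeper-0}"                               # assumed pod name
ZK_SNAP_DIR="${ZK_SNAP_DIR:-/bitnami/zookeeper/data/version-2}"               # assumed snapshot dir
ZK_COMPARER="${ZK_COMPARER:-/opt/bitnami/zookeeper/bin/zkSnapshotComparer.sh}" # assumed script path
DRY_RUN="${DRY_RUN:-1}"                                                       # 1 = print only, 0 = execute

# Compare an older (left) snapshot with a newer (right) one.
compare_snapshots() {
  local left="$1" right="$2"
  local cmd=(kubectl exec -n "$ZK_NAMESPACE" "$ZK_POD" -- "$ZK_COMPARER"
             -l "$ZK_SNAP_DIR/$left"    # left (older) snapshot file
             -r "$ZK_SNAP_DIR/$right"   # right (newer) snapshot file
             -b 2                       # data size delta threshold in bytes (arbitrary)
             -n 1)                      # descendant node delta threshold (arbitrary)
  if [ "$DRY_RUN" = "1" ]; then
    echo "${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}

# Hypothetical file names; list the snapshot directory to find the real ones.
compare_snapshots "${LEFT_SNAPSHOT:-snapshot.100}" "${RIGHT_SNAPSHOT:-snapshot.200}"
```

Picking one snapshot from before the growth started and one from after tends to make the node difference easiest to interpret.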