Making sure PostHog operates fast at scale is key to our success.
This document outlines some best practices to archive good query performance at scale, as well as describing tools and procedures to discover and fix performance issues.
PostHog uses two different datastores:
PostgreSQL: row-oriented OLTP database, mainly used to store and query datasets using predictable clause(s). It is likely your best choice if:
- the query pattern to access your dataset is predictable
- the dataset will likely not grow overtime above (<= 1 TB)
- the dataset needs to mutate often (
DELETE
/UPDATE
) - the query pattern requires joins across multiple tables
ClickHouse: column-oriented OLAP database, used to store large datasets and run on them analytical queries. It is likely your best choice if:
- the query pattern to access your dataset is unpredictable
- the dataset will likely grow overtime (> 1 TB)
- the dataset doesn't need to mutate often (
DELETE
/UPDATE
) - the query pattern doesn't requires joins across multiple tables
PostgreSQL
Coding best practices
only ask for the field(s) you need:
SELECT name, surname
is better thanSELECT *
(the latter is only helpful in few edge cases)only ask for the row(s) you need: use a
LIMIT
condition at the end of your query(if possible) avoid explicit transactions: if you can't, keep them small since transactions lock the processing tables data and may result in deadlocks (super discouraged to use them in application hot paths)
(if possible) avoid
JOIN
avoid the use of subqueries: a subquery is a
SELECT
statement that is embedded in a clause of another SQL statement. It's easier to write, butJOIN
s are usually better-optimized for the database engines.use appropriate data type(s): not all the types occupy the same, and when we use a concrete data type, we can also limit its size according to what we store. For example,
VARCHAR(4000)
is not the same asVARCHAR(40)
. We always have to adjust to what we will store in our fields not to occupy unnecessary space in our database (and we should enforce this limit in the application code to avoid query errors).use the
LIKE
operator only if necessary: if you know what you are looking for use the=
operator
Note: for the Django app we currently rely on the Django-ORM as interface between our data and the relational database. While we don't directly write SQL queries in this case, the following best practices should be considered anyway.
If you want to print executed queries (while running with DEBUG
) you can run:
from django.db import connectionprint(connection.queries)
while for an individual query you can run:
print(Model.objects.filter(name='test').query)
Indexing
If you are programmatically ordering, sorting, or grouping by a column, you should probably have an index on it. The caveat is that indexing slows down writes to the table and takes disk space (please drop unused indexes).
Composite indices are useful when you want to optimize querying on multiple non-conditional columns. For more info on indices and multi-column indices see the official docs.
How-to find slow queries
To find and debug slow queries in production you have a few options available:
- Browse to the Diagnose tab in Heroku Data's dashboard. You can break queries down by:
- Most time consuming
- Most frequently invoked
- Slowest execution time
- Slowest I/O
- You can also use Heroku's Diagnose feature by running
heroku pg:diagnose
to get a breakdown of long running queries, long transactions, among other diagnostics. - For a more raw approach you can access real time logs from Heroku by executing
heroku logs --app posthog --ps postgres
- With any logs pulled from PostgreSQL you can use pgbadger to find exactly the queries that are consuming the most time and resources.
How-to fix slow queries
Fixing a slow query is usually a 3 steps process:
identify which part of the codebase is generating it (adding the stacktrace as query comments is usually helpful to map query <-> code).
re-run the query with
EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS, FORMAT JSON)
as prefix to get the query plan. Query plans aren't the easiest thing to read. They're packed with information and it's closer to being machine parsable than human readable. Postgres Explain Viewer 2 (aka pev2, see GitHub repository) is a tool to simplify reading query plans. It provides a horizontal tree with each node representing a node in the query plan. It includes timing information, the error amount in the planned versus actual times, and badges for interesting nodes like "costliest" or "bad estimate".fix the query that should now generate a less costly
EXPLAIN
plan.
How-to reduce IO
- Indices require IO, we can get rid of some IO by removing unused indices
- Can check writes IO with something like:
SELECT total_time, blk_write_time, calls, queryFROM pg_stat_statementsORDER BY (blk_write_time) DESCLIMIT 10;
- SELECTs can cause writes IO: https://blog.okmeter.io/postgresql-exploring-how-select-queries-can-produce-disk-writes-f36c8bee6b6f
Removing unused indices on foreign key fields
Say you have an index on team_id
, person_id
. If team_id
and person_id
are Django foreign keys, it’s going to have created indices on team_id
and
person_id
. We can still use the composite index for both team_id
and
person_id
lookups, as mentioned on
https://www.postgresql.org/docs/11/indexes-multicolumn.html , thus we can avoid
having to write the other two indices by adding db_index=False
Removing foreign key fields
We don’t want to remove immediately as this is backwards incompatible. Do this as a deprecation first. Let's get the gains of not having an index and constraint first.
Rename e.g. foreign_key_field
to __deprecated_foreign_key_field
, add db_column= foreign_key_field
such that attempts to reference from outside the model will require full qualification (we keep the field around such that Django doesn’t try to create deletion migrations)
Wait for one release of field deprecation to have been in place.
TODO: Somehow make select queries not request this field (i.e. to make it such
that we can drop the column).
Remove field completely in next release, add note that users should update through deprecation version such that running code is compatible
Finding and removing unused indices
How do you know if they are unused? Do something like
SELECT s.schemaname,s.relname AS tablename,s.indexrelname AS indexname,pg_relation_size(s.indexrelid) AS index_sizeFROM pg_catalog.pg_stat_user_indexes sJOIN pg_catalog.pg_index i ON s.indexrelid = i.indexrelidWHERE s.idx_scan = 0 -- has never been scannedORDER BY pg_relation_size(s.indexrelid) DESC;
If indices are unused, it should be safe to remove via removing db_index=False
and running ./manage.py makemigration
This will generate a migration, however, if you look at the ./manage.py sqlmigrate
output it may not be dropping the index concurrently, so will be a blocking
operation. To get around this we need to modify the migration:
- use
SeparateDatabaseAndState
to allow django to keep track of the state of the model in the db, but let us modify how the index is created. - use
RemoveIndexConcurrently
to drop the index without blocking.
Avoiding locking on related tables
When e.g. bulk inserting, we can end up needing to select a lot of keys from
referenced tables. When we don't actually care about this, we can specify
db_constraint=False
, along with making any required migrations if we're
updating an existing field.
ClickHouse
How-to find slow queries
To find and debug slow queries in production you have several options available
Grafana
The Clickhouse queries - by endpoint dashboard gives a breakdown of how things are looking reliability and performance-wise. Highly used and slow/unreliable endpoints often indicate issues with queries.
PostHog instance/status
dashboard
Under https://app.posthog.com/instance/status/internal_metrics you will find various metrics and query logs. Note: if you are a staff user you can also analyze queries by clicking on them (or copying your own queries).
This analysis will output:
- Query runtime
- Number of rows read / Bytes read
- Memory used
- Flamegraphs for CPU, time and memory
These can be useful for figuring out why certain queries are performing slow.
Metabase
Need more granular access to queries than these dashboards provide? Take a look at this Metabase query. The ClickHouse system
tables (e.g. system.query_log
) provide a lot of useful information for identifying and diagnosing slow queries.
How-to fix slow queries
TODO
If you have a fix, you need to make sure it works and doesn't regress.
Under the main repo, you can find a query performance benchmark suite at ee/benchmarks
. These are run nightly on master. You can refresh this manually by running the Benchmark workflow, which might be handy for chained PRs.
After fixing an issue:
- Make sure there's an appropriate benchmark for the fix
- Add
performance
label to your PR to get information about the performance changes
Only changes > 20% are reported. ClickHouse queries are run on a private ClickHouse installation with pre-populated data.