PostgreSQL count_estimate() function for large tables and views

I once found all the admin pages of a web app were taking 100ms longer than they should have, due to a count query on a huge table. I couldn’t remove the feature, but as admins only needed a rough idea of how many rows there were, I could substitute an estimate 🔮

COUNT(*) is slow!

In PostgreSQL (and other MVCC relational databases) count(*) is surprisingly slow because of MVCC consistency. PostgreSQL allows concurrent transactions (which could each be modifying rows), so there’s no single canonical state of the data; each transaction sees its own version. A count must scan the whole table (or a suitable index), literally counting every row visible to the current transaction. Counting a subset (with a WHERE clause) performs better because Postgres can scan fewer rows, but performance still hurts for big tables.
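
You can see this for yourself with EXPLAIN ANALYZE against any big table (large_table below is just a placeholder name): there’s no shortcut in the plan, only a scan over every row visible to your transaction.

-- expect a Seq Scan (or at best an index-only scan) that still visits
-- every visible row, so the actual time grows with the table
EXPLAIN ANALYZE SELECT count(*) FROM large_table;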

A side-note about count(1) vs count(*): you might guess count(1) would be fastest, because count(*) looks like it inefficiently says “fetch every column of every row and count the result”. The opposite is true. Per the SQL standard, “If COUNT(*) is specified, then the result is the cardinality of T”, so no columns need to be loaded; the * here means cardinality, not “all columns”. Some databases optimise count(1) in the same way, but PostgreSQL does not: it must check every row to confirm that the argument 1 is not NULL. The query plans look identical, yet that extra NULL check makes count(1) measurably slower in benchmarks. Follow the spec and use count(*)!
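
If you want to verify the count(1) penalty yourself, a rough benchmark (large_table is again a placeholder, and exact numbers will vary) is to compare the two with timings on:

-- with \timing on in psql the plans look identical, but count(1)
-- pays an extra per-row check that its argument isn’t NULL
EXPLAIN ANALYZE SELECT count(*) FROM large_table;
EXPLAIN ANALYZE SELECT count(1) FROM large_table;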

Estimates are fast!

COUNT(*) is slow because it’s “correct”, but if an estimate is good enough (exploratory queries, vanity metrics, etc.) we can use PostgreSQL’s planner statistics for a near-instant answer:

Stats table query

You can query PostgreSQL’s catalogue table pg_class for stats about a whole table. A typical use case is admin panels or database clients that show whole-table row counts.

SELECT reltuples::bigint AS count_estimate
FROM pg_class
WHERE oid = to_regclass('public.large_table');

Or, for a bit more accuracy, you can mimic what PostgreSQL’s query planner does and scale the last-known rows-per-page by the table’s current size on disk:

SELECT
  (reltuples / relpages) * (
    pg_relation_size('public.large_table') /
      (current_setting('block_size')::integer)
  )
FROM pg_class
WHERE oid = to_regclass('public.large_table');

But at this point you might as well use the planner’s estimate directly…

EXPLAIN estimate function

This function works for estimating the count of any query (including WHERE clauses) by parsing EXPLAIN output. It's still just an estimate, but it's fast and good enough for big approximate counts.

I based it on a well-circulated function, with a few improvements of my own:

CREATE FUNCTION count_estimate(query text) RETURNS bigint AS $$
DECLARE
  rec record;
BEGIN
  EXECUTE 'EXPLAIN (FORMAT json) ' || query INTO rec;
  RETURN (rec."QUERY PLAN"->0->'Plan'->>'Plan Rows')::bigint;
END;
$$ LANGUAGE plpgsql VOLATILE STRICT;

Use it without a count in the query, as the function fetches the count for you (otherwise you’ll get the estimated count of a count, which is just 1 aggregate row):

SELECT count_estimate('SELECT * FROM socks WHERE plain = false');

How bad can it be?

The result can be exactly correct! But I’ve often seen it be up to 10x off, and when estimating the count of a very complex view (hundreds of lines of SQL) it was 10,000x off! So be aware: the more complex the query, the more wrong the estimate is likely to be.

That said, there are a couple of things you can do to improve accuracy:

Vacuum frequency

The PostgreSQL planner stats are updated when VACUUM and ANALYZE run (whether manually or via the autovacuum daemon). If a vacuum has just run and the data hasn’t changed since, count_estimate should be spot on. So you can increase how frequently autovacuum runs (but be careful you don’t overwhelm the system).
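
As a sketch (the table name and numbers are illustrative; tune them to your own write volume), you can lower the autovacuum scale factors on just the table in question so its stats are refreshed more often:

-- trigger autovacuum / auto-analyse after roughly 1% of rows change,
-- instead of the 20% / 10% defaults
ALTER TABLE large_table SET (
  autovacuum_vacuum_scale_factor = 0.01,
  autovacuum_analyze_scale_factor = 0.01
);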

Additional statistics

Since v10 you can give PostgreSQL additional (“extended”) statistics via CREATE STATISTICS to improve query-plan estimates, typically where one column has some relationship with another. It takes some research to get right and still won’t give perfect results, but it’s worth a look to get the most out of estimates.
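
For example (the socks columns here are hypothetical), functional-dependency statistics tell the planner that two columns move together, which feeds through into better row estimates when you filter on both:

-- record that plain and colour are correlated, then re-analyse
-- so the new statistics are actually populated
CREATE STATISTICS socks_plain_colour (dependencies)
  ON plain, colour FROM socks;
ANALYZE socks;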