Aggregating indeterminate JSON structures in Postgres

Postgres JSON types are great, but what do you do when your users fill them with garbage and you want to do maths.

The problem I was presented with was to find a way to aggregate user-defined (and structured) data that was stored in a serialized Ruby hash in MySQL, and could come in pretty much any shape or form.

I initially thought about expanding our serialized hash store into a key value table, however, this raised a number of problems:

  • We'd need to use a TEXT/BLOB type for the values, they could be anything – strings, integers, floats, booleans, who knows
  • Non-numeric values would still be mixed up amongst numeric values, meaning we'd have to cast and/or filter in order to use MySQL's aggregation functions, support for which appeared less than satisfactory
  • Indexing restrictions apply to BLOB (and therefore TEXT) data types in MySQL anyway
  • We would still be unable to aggregate multidimensional/nested data

I toyed with the idea of denormalizing the data structure and having columns like value_to_s, value_to_i and value_to_f as per the Ruby vernacular, and casting string values into those columns if they match a relevant regex for the type.

However, whilst this would still only allow us to aggregate top-level values, make writes pretty complicated, synchronisation of updates would be an issue, and engineering this seemed like a headache. In short, I was largely dissatisfied with this as a solution, I didn't want to build that. Fuck that.

I quickly came to the conclusion that:

  • I wanted to store this data as documents, JSON, ideally
  • This seemed like a difficult problem to solve in MySQL
  • I didn't really want to bolt another database onto the side of the application, yet another thing to keep in sync
  • I didn't have much faith that the JSON data type added in MySQL 5.7.8 would make the situation much better
  • MariaDB didn't make me any more confident either
  • Postgres' JSON types, type casts, and regex support appeared to solve all of my problems

After more research I concluded that migrating the database from MySQL to Postgres and fixing the application issues that doing that raised would be a less painful experience than trying to implement this feature in MySQL – and that in addition to that, we'd get a truck load of benefits outside of this problem as well.

So, what did those queries look like once I'd done that?

Usage examples

Selecting properties

Selecting from JSON types is pretty easy once you've got your head around the syntax, it's basically columname->'propname' to select a property and retrieve it as a JSON type (allowing further traversal) – and if you want to retrieve its text representation use a ->>.

It's quite common to end up using -> for everything up until the last property, i.e. traverse the structure then grab the value e.g. reviews->'author'->>'name'.

More examples can be found in the functions and operators documentation.

Aggregating known types

The most basic aggregations, where you know for sure what the data looks like (e.g. all integers) are pretty simple:

id, title, custom  
1, "Entry 1", {"rating": 0}  
2, "Entry 2", {"rating": 100}  
SELECT AVG(custom->>'rating') FROM entries  

=> 50

Aggregating strings via casting

The dataset I had was user supplied, often from form submissions, so integers and floats were usually stored as strings, which isn't ideal, but casting the values to numerics gets around it:

1, "Entry 1", {"rating": "0"}  
2, "Entry 2", {"rating": "100"}  
SELECT AVG((custom->>'rating')::NUMERIC) FROM entries  

=> 50

Note: an extra pair of parenthesis are required to wrap the property selection otherwise Postgres will try to cast the string 'rating' rather than your property value.

Aggregating strings that might be numeric

As this data was structured by end users, properties were often reassigned or repurposed, meaning the result set often contained a variety of data types for the same key:

1, "Entry 1", {"rating": 0}  
2, "Entry 2", {"rating": "100"}  
3, "Entry 3", {"rating": ":("}  

If you try and query data that looks like this with aggregation functions, you'll get all sorts of casting errors.

Not a massive deal though, Postgres has pretty robust regex support. The following example will match strings which Postgres can cast to numeric, which includes integers and floats (with optional leading/trailing dots and zeros) i.e. 1, 0.1, .1 and 1..

SELECT AVG((custom->>'rating')::NUMERIC)  
FROM entries  
WHERE (custom->>'rating')::TEXT ~ '^([0-9]+\.?[0-9]*|\.[0-9]+)$'  

=> 50

What this means is that the non-numeric rows will simply be ignored, and won't form part of the aggregation, which is exactly what I wanted, incorrectly entered data is not aggregated.

If you want to include those rows, and count them as zeros,you might want to move this line into a CASE statement instead, and try aggregating that.

In addition to handling uncastable values, this also filters out rows that omit that property or have mismatching JSON types e.g. booleans. Which means your data can look like this:

1, "Entry 1", {"rating": 0}  
2, "Entry 2", {"rating": "100"}  
3, "Entry 3", {"rating": ":("}  
4, "Entry 4", {"rating": true}  
5, "Entry 5", {"other": false}  
6, "Entry 6", NULL  

=> 50

Any rows (e.g. 4-6) that don't have numeric values are simply ignored for the purposes of aggregation, which is ideal.

Aggregating nested properties

Since Postgres has a dedicated syntax for JSON types, nested properties are no big deal at all, just get the operators right:

5, "Entry 5", {"ratings": {"score": 0}}  
6, "Entry 6", {"ratings": {"score": "100"}}  
7, "Entry 7", {"ratings": {"score": ":("}}  
8, "Entry 8", {"ratings": {"score": true}}
SELECT AVG((custom->'ratings'->>'score')::NUMERIC)  
FROM entries  
WHERE (custom->'ratings'->>'score')::TEXT ~ '^([0-9]+\.?[0-9]*|\.[0-9]+)$'  

=> 50

Aggregation values within arrays

Arrays are supported too:

5, "Entry 5", {"ratings": [0, 1000]}  
6, "Entry 6", {"ratings": ["100", 2000]}  
7, "Entry 7", {"ratings": [":("]}  
8, "Entry 8", {"ratings": [true]}  
SELECT AVG((custom->'ratings'->0->>'score')::NUMERIC)  
FROM entries  
WHERE (custom->'ratings'->0->>'score')::TEXT ~ '^([0-9]+\.?[0-9]*|\.[0-9]+)$'  

=> 50

SELECT AVG((custom->'ratings'->1->>'score')::NUMERIC)  
FROM entries  
WHERE (custom->'ratings'->1->>'score')::TEXT ~ '^([0-9]+\.?[0-9]*|\.[0-9]+)$'  

=> 1500

Aggregating values within objects within arrays

5, "Entry 5", {"ratings": [{"score": 0}, {"score": 1000}]}  
6, "Entry 6", {"ratings": [{"score": "100"), {"score": 2000}}]}  
7, "Entry 7", {"ratings": [{"score": ":("}]}  
8, "Entry 8", {"ratings": [{"score": true}]}  
SELECT AVG((custom->'ratings'->0->>'score')::NUMERIC)  
FROM entries  
WHERE (custom->'ratings'->0->>'score')::TEXT ~ '^([0-9]+\.?[0-9]*|\.[0-9]+)$'  
=> 50
SELECT AVG((custom->'ratings'->1->>'score')::NUMERIC)  
FROM entries  
WHERE (custom->'ratings'->1->>'score')::TEXT ~ '^([0-9]+\.?[0-9]*|\.[0-9]+)$'  
=> 1500

Bonus points: identifying keys

As part of the work I was doing I needed to be able to identify what keys had been used so that users could select the properties they've stored via a UI in order to build out aggregations.

Selecting a list of top level keys is pretty trivial, there's a basic function for that:

SELECT JSON_OBJECT_KEYS(custom) FROM entries  

However, doing this recursively to get nested keys is significantly more troublesome:

WITH RECURSIVE doc_key_and_value_recursive(key, value) AS (  
  SELECT
    t.key,
    t.value
  FROM entries, JSON_EACH(entries.custom) AS t

  -- WHERE something = 123 # apply filtering here

  UNION ALL

  SELECT
    CONCAT(doc_key_and_value_recursive.key, '.', t.key),
    t.value
  FROM doc_key_and_value_recursive,
    JSON_EACH(
      CASE
        WHEN JSON_TYPEOF(doc_key_and_value_recursive.value) <> 'object' THEN '{}' :: JSON
        ELSE doc_key_and_value_recursive.value
      END
    ) AS t
)
SELECT DISTINCT key  
FROM doc_key_and_value_recursive  
WHERE JSON_TYPEOF(doc_key_and_value_recursive.value) NOT IN ('object')  
ORDER BY key  

http://stackoverflow.com/a/30133181/160821

What this returns is a list of dot delimited paths to retrieve a property e.g. custom.ratings.score – similar to how you'd access it if the JSON was loaded into JavaScript.

Surprisingly, this executes in a relatively reasonable timeframe. When testing against half a million rows I was finding it takes around 20s – but given I only ever needed to query a fraction of that dataset, a few hundred/thousand, the query time was sufficiently negligable.

Array support (i.e. keys of objects within arrays) is feasible with enough tinkering too:

WITH RECURSIVE doc_key_and_value_recursive(key, value) AS (  
  SELECT
    CASE WHEN JSON_TYPEOF(t.value) = 'array' THEN CONCAT(t.key,'[]') ELSE t.key END AS key,
    CASE WHEN JSON_TYPEOF(t.value) = 'array' THEN JSON_ARRAY_ELEMENTS(t.value) ELSE t.value END AS value,
    CASE WHEN JSON_TYPEOF(t.value) = 'array' THEN JSON_ARRAY_LENGTH(t.value) ELSE NULL END AS i
  FROM entries, JSON_EACH(entries.custom) AS t

  -- WHERE something = 123 # apply filtering here

  UNION ALL

  SELECT
    CASE WHEN JSON_TYPEOF(t.value) = 'array' THEN CONCAT(doc_key_and_value_recursive.key, '.', t.key, '[]') ELSE CONCAT(doc_key_and_value_recursive.key, '.', t.key) END AS key,
    CASE WHEN JSON_TYPEOF(t.value) = 'array' THEN JSON_ARRAY_ELEMENTS(t.value) ELSE t.value END AS value,
    CASE WHEN JSON_TYPEOF(t.value) = 'array' THEN JSON_ARRAY_LENGTH(t.value) ELSE NULL END AS i
  FROM doc_key_and_value_recursive,
    JSON_EACH(
      CASE
        WHEN JSON_TYPEOF(doc_key_and_value_recursive.value) <> 'object' THEN '{}' :: JSON
        ELSE doc_key_and_value_recursive.value
      END
    ) AS t
)
SELECT key, MAX(i)  
FROM doc_key_and_value_recursive  
GROUP BY key  
ORDER BY key  

What this returns is a little/a lot weird, it's a similar dot delimited notation to describe paths to fields, with [] slotted in wherever an array was found i.e. custom.ratings[].score, as well as a max column, that describes the largest number of items found in that array for any record.

Whilst this is definitely not what I wanted, I could process that data into what I did want, i.e. I could infer that a result of custom.ratings[].score, 2 meant that both custom.ratings.0.score and custom.ratings.1.score were therefore possible, and use that to build out a list of keys for user to select. Crazy, I know.

Summary

Hopefuly this will help you get started aggregating your JSON data in Postgres. Whilst some/most of these queries are pretty lengthy – Postgres does a pretty good job of handling them in small to medium size datasets, and I expect they'll scale pretty far into larger ones too.

Best of luck!