Handling Broken JSON with jq

April 19, 2017

Problem

$ cat broken.json
{
"color": "red"
}
{
"color": "green"
{
"color": "blue"
}

There’s a missing } after green.

$ cat broken.json | jq .
{
"color": "red"
}
parse error: Expected separator between values at line 6, column 1

When jq hits invalid JSON, it stops processing the stream entirely: the valid “blue” record after the error is never printed.
That’s not always great.

Some Unhelpful Solutions

People will be quick to “fix” your problem:

Why don’t you fix the JSON at the source?

If you can, that’s the cleanest way out.

But… real life is messy. You don’t always control the JSON you have to process.
I ran into this recently: I extracted JSON logs from a system that decided to truncate some lines, some of the time.

Why don’t you “check” your JSON before you … “check” your JSON?

Yes – you could do some minimal regex-based checks, possibly with AWK or grep.
But you know what’s already great at handling JSON? jq.

Using --seq

That’s what you’ll find if you keep searching. The documentation says:

    --seq:

    Use the application/json-seq MIME type scheme for separating JSON texts in jq’s
    input and output. This means that an ASCII RS (record separator) character is
    printed before each value on output and an ASCII LF (line feed) is printed
    after every output. Input JSON texts that fail to parse are ignored (but warned
    about), discarding all subsequent input until the next RS. This mode also
    parses the output of jq without the --seq option.

An example might be clearer:

$ cat broken-with-rs.json
<RS>
{
"color": "red"
}
<RS>
{
"color": "green"
<RS>
{
"color": "blue"
}

Just put an RS (ASCII character 0x1e) in front of each record; <RS> in the listing above stands for that single non-printing byte. See below for an example.
(or check RFC 7464, “JSON Text Sequences”, for more details)
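
Since RS is a non-printing byte, it is easiest to generate with printf rather than type. A minimal sketch for building a test stream like the one above, assuming a POSIX shell; `emit_seq` is a hypothetical helper name, not something jq provides:

```shell
# RS is the ASCII record separator (0x1e); the octal escape \036 is understood by POSIX printf.
RS=$(printf '\036')

# Hypothetical helper: print each argument as one record, RS first, LF last.
emit_seq() {
  for record in "$@"; do
    printf '%s%s\n' "$RS" "$record"
  done
}

# Usage (the second record is deliberately truncated):
#   emit_seq '{"color": "red"}' '{"color": "green"' '{"color": "blue"}' | jq --seq .
```

Fed to jq --seq, this stream should behave like broken-with-rs.json, with the truncated record dropped.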

$ cat broken-with-rs.json | jq --seq .
{
"color": "red"
}
{
"color": "blue"
}

The broken “green” entry is skipped…

Why isn’t there a better solution?

When a JSON parser finds a problem, what’s the best solution?

I don’t think there is one.

If the JSON is invalid … how much of it needs to be thrown out?

The answer is probably application-specific. And wouldn’t it be worse if jq silently skipped invalid JSON? How long would it take to debug that?!

With RS delimiters, you’re explicitly boxing failures: on a parse error, jq drops the current record and skips ahead to the next RS.

An Example

$ cat broken.json | sed -e "s/^{/$(printf '\036'){/" | jq --seq .
{
"color": "red"
}
{
"color": "blue"
}

The sed expression catches lines beginning with { and inserts an RS in front of them.
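
The same idea should carry over to line-delimited logs like the truncated ones mentioned earlier. A rough sketch, assuming exactly one JSON record per line; `rs_per_line` and `app.log` are hypothetical names:

```shell
# Hypothetical helper: prefix every non-empty input line with RS (octal 036),
# assuming each line holds exactly one JSON record.
rs_per_line() {
  awk 'NF { printf "\036%s\n", $0 }'
}

# Usage: rs_per_line < app.log | jq --seq -c .
```

Because every line becomes its own record, a truncated line can only take itself down, not its neighbours.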

DISCLAIMER again: the broken “green” entry is skipped… in this case it’s silent, but I’ve seen other broken cases where a message is shown on STDERR. Use responsibly.
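
One way to make skipped records visible is to compare the number of RS markers going in with the number of values coming out. A rough sketch, assuming one JSON value per record; `count_rs` and `check_seq` are hypothetical helper names, and the check relies on the documented behaviour that jq --seq prints an RS before each output value:

```shell
# Count RS bytes (octal 036) on stdin; each one marks the start of a record.
count_rs() {
  tr -cd '\036' | wc -c | tr -d ' '
}

# Hypothetical check: warn on stderr when jq emitted fewer values
# than the input file has RS-delimited records.
check_seq() {
  expected=$(count_rs < "$1")
  parsed=$(jq --seq -c . < "$1" 2>/dev/null | count_rs)
  if [ "$expected" -ne "$parsed" ]; then
    echo "warning: $((expected - parsed)) of $expected records skipped in $1" >&2
    return 1
  fi
}

# Usage: check_seq broken-with-rs.json
```

The check only tells you that records were dropped, not which ones; for that you would still need to look at the input.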
