
Add line and column information to errors #410

Open
jakubadamw wants to merge 6 commits into 01mf02:main from jakubadamw:line-column-information

Conversation

@jakubadamw

I’ve been using jaq as a replacement for jq, and one thing I noticed is that for an invalid JSON file it does not report the location of the parsing error.

$ cat invalid.json | jaq
Error: failed to parse: comma or end of sequence expected

This PR adds line & column information to errors such that the output is now:

$ cat invalid.json | jaq
Error: failed to parse: at line 5, column 5: comma or end of sequence expected

Unlike jq, the offset calculation is character-based rather than byte-based, which I believe is an improvement.
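To illustrate the difference, here is a minimal standalone example (the input and the `char_column` helper are illustrative, not part of the PR): a byte-based column overshoots as soon as a multi-byte UTF-8 character precedes the error position.

```rust
/// Character-based column (0-indexed) for a byte offset within a line.
fn char_column(line: &str, byte_offset: usize) -> usize {
    line[..byte_offset].chars().count()
}

fn main() {
    let line = "héllo wörld";
    // Byte offset of 'w': 'é' occupies two bytes, so this is 7 ...
    let byte_offset = line.find('w').unwrap();
    assert_eq!(byte_offset, 7);
    // ... but only 6 characters precede 'w'.
    assert_eq!(char_column(line, byte_offset), 6);
    println!(
        "byte column {}, char column {}",
        byte_offset + 1,
        char_column(line, byte_offset) + 1
    );
}
```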

jq’s output for reference:

$ cat invalid.json | jq
jq: parse error: Expected separator between values at line 5, column 8

I also added a number of tests.

For iterator-based parsing, this works through a wrapper iterator that tracks the line/column position while counting bytes.

When the original sequence of bytes is still available on error, however, we calculate the line & column values from the byte offset by rescanning the sequence.
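The rescanning approach can be sketched roughly as follows (`line_col` is an illustrative name, not the PR’s actual API): count newlines up to the error offset for the line, then count non-continuation bytes since the last newline for the character-based column.

```rust
/// Compute a 1-based (line, column) from a byte offset by rescanning `bytes`.
/// The column counts characters, i.e. skips UTF-8 continuation bytes (0b10xxxxxx).
fn line_col(bytes: &[u8], offset: usize) -> (usize, usize) {
    let upto = &bytes[..offset.min(bytes.len())];
    // Line: number of newlines before the offset, plus one.
    let line = upto.iter().filter(|&&b| b == b'\n').count() + 1;
    // Column: characters since the start of the current line, plus one.
    let line_start = upto.iter().rposition(|&b| b == b'\n').map_or(0, |i| i + 1);
    let column = upto[line_start..]
        .iter()
        .filter(|&&b| b & 0xC0 != 0x80) // keep only non-continuation bytes
        .count()
        + 1;
    (line, column)
}

fn main() {
    let input = b"[1,\n 2,\n 3x]";
    assert_eq!(line_col(input, 10), (3, 3)); // the offending 'x'
}
```

This runs in O(offset), which is fine since it only happens on the error path.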
I introduced a new test macro and made the `golden_test()` function take
an extra argument denoting whether the input is expected to parse successfully.
@01mf02
Owner

01mf02 commented Mar 12, 2026

Thanks for your PR!

The parsing part looks quite good; I'm not yet sold on the reading part --- I'm particularly worried about the use of Rc<Cell<_>> in a hot path.
Could you benchmark reading performance (cat big_file_with_single_value | jaq empty) before and after your PR?
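For readers following along, the pattern in question is presumably something along these lines (a hypothetical sketch, not the PR's actual code): a position counter shared via Rc<Cell<_>> between the byte iterator and the error path, updated once per byte yielded.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical wrapper: yields bytes while bumping a shared offset counter.
struct Positioned<I> {
    inner: I,
    offset: Rc<Cell<usize>>,
}

impl<I: Iterator<Item = u8>> Iterator for Positioned<I> {
    type Item = u8;
    fn next(&mut self) -> Option<u8> {
        let byte = self.inner.next()?;
        // One extra get/set per byte: cheap in isolation,
        // but measurable when parsing gigabytes.
        self.offset.set(self.offset.get() + 1);
        Some(byte)
    }
}

fn main() {
    let offset = Rc::new(Cell::new(0));
    let iter = Positioned {
        inner: b"[1,2]".iter().copied(),
        offset: Rc::clone(&offset),
    };
    let bytes: Vec<u8> = iter.collect();
    assert_eq!(bytes.len(), 5);
    assert_eq!(offset.get(), 5); // the shared counter saw every byte
}
```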

@jakubadamw
Author

jakubadamw commented Mar 12, 2026

@01mf02, hello, thanks for taking a look and for hinting at the performance implications of this!

I generated a 4GiB JSON file consisting of a single array of consecutive numbers. Here are the numbers that I am seeing:

$ hyperfine --runs 3 'cat numbers.json | ./target/release/jaq empty' --reference 'cat numbers.json | jaq empty'
Benchmark 1: cat numbers.json | jaq empty
  Time (mean ± σ):     20.311 s ±  0.498 s    [User: 17.702 s, System: 4.057 s]
  Range (min … max):   19.772 s … 20.755 s    3 runs
 
Benchmark 2: cat numbers.json | ./target/release/jaq empty
  Time (mean ± σ):     23.575 s ±  0.320 s    [User: 21.485 s, System: 3.771 s]
  Range (min … max):   23.265 s … 23.904 s    3 runs

So you’re right, this can lead to a 15-20% performance regression. I suppose it’s not worth doing it for the stream-based path, then? What do you think?

@01mf02
Owner

01mf02 commented Mar 12, 2026

I generated a 4GB JSON file consisting of a single array of consecutive numbers.

It would be nice to have instructions to reproduce your experiment.
I think that such an array of consecutive numbers might even be quite nice in your case --- you could also try just generating a lot of whitespace. I think that would be the worst case for the change that you propose, because parsing whitespace is normally the fastest and would thus show the impact most faithfully.

So you’re right, this can lead to a 15-20% performance regression. I suppose it’s not worth doing it for the stream-based path, then? What do you think?

I agree with you. It's not worth it, also because you normally cannot just look up the line number of something that you piped into jaq. For that, you will probably look into a file. And if you are already doing that, then you should be using jaq ... file anyway, for performance. ;)

// TODO: replace with `!byte.is_utf8_continuation()` once stable.
let column = line_start
.iter()
.filter(|&&byte| byte & 0xC0 != 0x80)
Owner

Can you add a source to this magic trick?
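For context: in UTF-8 (RFC 3629), every continuation byte of a multi-byte character has the bit pattern 10xxxxxx, so `byte & 0xC0 == 0x80` holds exactly for continuation bytes. Filtering them out therefore counts characters (code points), as this standalone check shows:

```rust
fn main() {
    // 'a' = 1 byte, 'é' = 2 bytes, '€' = 3 bytes: 6 bytes, 3 characters.
    let s = "aé€";
    // Only the leading byte of each character survives the mask.
    let chars = s.bytes().filter(|&b| b & 0xC0 != 0x80).count();
    assert_eq!(chars, s.chars().count());
    assert_eq!(chars, 3);
}
```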

#[derive(Debug)]
pub struct Error(usize, hifijson::Error);
pub struct Error {
line: usize,
Owner

I would make a type LineCol = (usize, usize) and store that here, instead of having separate line and column fields. That would simplify the call sites of byte_offset_to_position below.
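The suggested shape might look like this (a sketch of the suggestion only; the `message` field stands in for the wrapped hifijson error):

```rust
/// 1-based line and column, as suggested.
type LineCol = (usize, usize);

#[derive(Debug)]
struct Error {
    position: LineCol,
    message: String, // stand-in for the wrapped hifijson::Error
}

fn main() {
    let e = Error {
        position: (5, 5),
        message: "comma or end of sequence expected".into(),
    };
    // One field to pass around and destructure at each call site.
    let (line, column) = e.position;
    println!("at line {line}, column {column}: {}", e.message);
}
```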

paste::paste! {
#[test]
fn [<$name _stdin>]() -> io::Result<()> {
golden_test($args, $input, false, $stderr)
Owner

I would actually just add golden_test_file after golden_test in the test! macro, and delete test_error.

@jakubadamw
Author

It would be nice to have instructions to reproduce your experiment. I think that such an array of consecutive numbers might even be quite nice in your case --- you could also try just generating a lot of whitespace. I think that would be the worst case for the change that you propose, because parsing whitespace is normally the fastest and would thus show the impact most faithfully.

Of course! This is the script I was using. I also added an option to generate a file containing a given ratio of whitespace bytes.

#!/usr/bin/env -S cargo +nightly -Zscript
---
[dependencies]
bytesize = "1"
structopt = "0.3"
---

use std::io::{BufWriter, Write};
use structopt::StructOpt;

#[derive(StructOpt)]
struct Args {
    /// Total output size, e.g. 4GiB.
    size: bytesize::ByteSize,
    /// Fraction of bytes that should be spaces, e.g. 0.4.
    #[structopt(default_value = "0")]
    whitespace_byte_count_ratio: f64,
}

fn write_spaces(out: &mut impl Write, count: usize) -> std::io::Result<()> {
    use std::io::Read;
    std::io::copy(&mut std::io::repeat(b' ').take(count as u64), out)?;
    Ok(())
}

fn main() -> std::io::Result<()> {
    let args = Args::from_args();
    let desired_total_byte_count = args.size.as_u64() as usize;
    let whitespace_byte_count_ratio = args.whitespace_byte_count_ratio;

    assert!(
        (0.0..1.0).contains(&whitespace_byte_count_ratio),
        "whitespace byte count ratio must be in [0, 1)"
    );

    let desired_non_whitespace_byte_count =
        (desired_total_byte_count as f64 * (1.0 - whitespace_byte_count_ratio)) as usize;

    let stdout = std::io::stdout();
    let mut out = BufWriter::new(stdout.lock());

    out.write_all(b"[")?;

    let mut non_whitespace_byte_count = 1usize; // Counts the opening '['.
    let mut whitespace_byte_count = 0usize;

    for i in 0u64.. {
        let number = i.to_string();
        let entry_byte_count = number.len() + if i == 0 { 0 } else { 1 }; // +1 for ','.

        if non_whitespace_byte_count + entry_byte_count + 1 > desired_non_whitespace_byte_count {
            // +1 for the closing ']'.
            break;
        }

        if i > 0 {
            out.write_all(b",")?;
            if i % 10 == 0 {
                out.write_all(b"\n")?;
                whitespace_byte_count += 1;
            }
        }

        // Add spaces proportionally to maintain the whitespace ratio.
        let desired_whitespace_byte_count = (non_whitespace_byte_count as f64
            * whitespace_byte_count_ratio
            / (1.0 - whitespace_byte_count_ratio)) as usize;
        let spaces_to_write = desired_whitespace_byte_count.saturating_sub(whitespace_byte_count);
        write_spaces(&mut out, spaces_to_write)?;
        whitespace_byte_count += spaces_to_write;

        out.write_all(number.as_bytes())?;
        non_whitespace_byte_count += entry_byte_count;
    }

    // Pad with remaining whitespace to reach desired_total_byte_count.
    let trailing_spaces = desired_total_byte_count
        .saturating_sub(non_whitespace_byte_count + whitespace_byte_count + 1);
    write_spaces(&mut out, trailing_spaces)?;
    whitespace_byte_count += trailing_spaces;

    out.write_all(b"]")?;
    non_whitespace_byte_count += 1;

    assert_eq!(
        non_whitespace_byte_count + whitespace_byte_count,
        desired_total_byte_count,
        "total bytes written does not match desired total"
    );

    Ok(())
}

So with a test file generated with:

./generate 4GiB 0.9 > numbers.json

(around 90% whitespace)

and the benchmark run:

hyperfine --warmup 1 --runs 5 'cat numbers.json | jaq empty' --reference 'cat numbers.json | ./target/release/jaq empty'

I am seeing these:

Benchmark 1: cat numbers.json | ./target/release/jaq empty
  Time (mean ± σ):     11.647 s ±  0.140 s    [User: 10.755 s, System: 2.665 s]
  Range (min … max):   11.404 s … 11.742 s    5 runs
 
Benchmark 2: cat numbers.json | jaq empty
  Time (mean ± σ):      9.787 s ±  0.043 s    [User: 9.079 s, System: 2.461 s]
  Range (min … max):    9.741 s …  9.835 s    5 runs
 
Summary
  cat numbers.json | ./target/release/jaq empty ran
    1.19 ± 0.02 times slower than cat numbers.json | jaq empty

so, again, a 20% slowdown.

So I’ll exclude the line/column information from the stream-based mode and make the other requested changes.
