Add line and column information to errors #410
jakubadamw wants to merge 6 commits into 01mf02:main
Conversation
For iterator-based parsing, this works through a wrapper iterator that tracks the line/column position by counting bytes. However, when the original sequence of bytes remains available on error (an unhappy path, so to say), we calculate the line & column values from the byte offset, by rescanning the sequence.
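The rescanning approach described above can be sketched roughly like this (`line_col_of_offset` is an illustrative name for this sketch, not the PR's actual function; the real implementation may differ in details):

```rust
/// Compute 1-based line and column for `offset` by rescanning `input`.
/// The column counts characters (non-continuation bytes), not raw bytes.
fn line_col_of_offset(input: &[u8], offset: usize) -> (usize, usize) {
    let before = &input[..offset.min(input.len())];
    // The line number is the count of newlines before the offset, plus one.
    let line = before.iter().filter(|&&b| b == b'\n').count() + 1;
    // The current line starts right after the last newline (or at index 0).
    let line_start = before
        .iter()
        .rposition(|&b| b == b'\n')
        .map_or(0, |i| i + 1);
    // Count characters by skipping UTF-8 continuation bytes (10xxxxxx).
    let column = before[line_start..]
        .iter()
        .filter(|&&b| b & 0xC0 != 0x80)
        .count()
        + 1;
    (line, column)
}
```

This only runs on the error path, so the extra linear scan does not affect the cost of successful parses.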
I introduced a new test macro and made the `golden_test()` function take an extra argument denoting whether the input is expected to parse successfully.
Thanks for your PR! The parsing part looks quite good; I'm not sold yet on the reading part. I'm particularly worried about the usage of
@01mf02, hello, thanks for taking a look and for hinting at the performance implications of this! I generated a 4 GiB JSON file consisting of a single array of consecutive numbers. Here are the numbers that I am seeing:

```
$ hyperfine --runs 3 'cat numbers.json | ./target/release/jaq empty' --reference 'cat numbers.json | jaq empty'
Benchmark 1: cat numbers.json | jaq empty
  Time (mean ± σ):     20.311 s ±  0.498 s    [User: 17.702 s, System: 4.057 s]
  Range (min … max):   19.772 s … 20.755 s    3 runs

Benchmark 2: cat numbers.json | ./target/release/jaq empty
  Time (mean ± σ):     23.575 s ±  0.320 s    [User: 21.485 s, System: 3.771 s]
  Range (min … max):   23.265 s … 23.904 s    3 runs
```

So you’re right, this can lead to a 15-20% performance regression. I suppose it’s not worth doing it for the stream-based path, then? What do you think?
It would be nice to have instructions to reproduce your experiment.
I agree with you. It's not worth it, also because you normally cannot just look up the line number of something that you piped into jaq. For that, you will probably look into a file. And if you are already doing that, then you should be using
```rust
// TODO: replace with `!byte.is_utf8_continuation()` once stable.
let column = line_start
    .iter()
    .filter(|&&byte| byte & 0xC0 != 0x80)
```
Can you add a source to this magic trick?
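For reference, the trick relies on the fact that in UTF-8, continuation bytes always have the bit pattern `10xxxxxx`, so masking with `0xC0` and comparing against `0x80` filters them out, leaving exactly one byte per character. A small stand-alone illustration (not the PR's code):

```rust
// Counts characters (Unicode scalar values) in a UTF-8 byte slice by
// skipping continuation bytes, which always match 10xxxxxx, i.e.
// (byte & 0b1100_0000) == 0b1000_0000.
fn char_count(bytes: &[u8]) -> usize {
    bytes.iter().filter(|&&byte| byte & 0xC0 != 0x80).count()
}
```

For valid UTF-8 this agrees with `str::chars().count()`, e.g. both yield 5 for `"héllo"`.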
```diff
 #[derive(Debug)]
-pub struct Error(usize, hifijson::Error);
+pub struct Error {
+    line: usize,
```
I would make a `type LineCol = (usize, usize)` and store that here, instead of having separate `line` and `column` fields. That would simplify the call sites of `byte_offset_to_position` below.
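A sketch of that suggestion (field and method names here are illustrative, not the PR's exact code, and the wrapped `hifijson::Error` is omitted):

```rust
// A single position type instead of separate `line`/`column` fields.
type LineCol = (usize, usize);

pub struct Error {
    position: LineCol,
    // In the PR, the underlying `hifijson::Error` would be stored here too.
}

impl Error {
    fn new(position: LineCol) -> Self {
        Error { position }
    }
    fn line(&self) -> usize {
        self.position.0
    }
    fn column(&self) -> usize {
        self.position.1
    }
}
```

With this shape, a position-computing helper can return one `LineCol` value that is stored directly, instead of being destructured into two fields at every call site.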
```rust
paste::paste! {
    #[test]
    fn [<$name _stdin>]() -> io::Result<()> {
        golden_test($args, $input, false, $stderr)
```
I would actually just add `golden_test_file` after `golden_test` in the `test!` macro, and delete `test_error`.
Of course! This is the script I was using. I added an option to generate a file filled with a specific ratio of whitespace as well.

```rust
#!/usr/bin/env -S cargo +nightly -Zscript
---
[dependencies]
bytesize = "1"
structopt = "0.3"
---
use std::io::{BufWriter, Write};
use structopt::StructOpt;

#[derive(StructOpt)]
struct Args {
    /// Total output size, e.g. 4GiB.
    size: bytesize::ByteSize,
    /// Fraction of bytes that should be spaces, e.g. 0.4.
    #[structopt(default_value = "0")]
    whitespace_byte_count_ratio: f64,
}

fn write_spaces(out: &mut impl Write, count: usize) -> std::io::Result<()> {
    use std::io::Read;
    std::io::copy(&mut std::io::repeat(b' ').take(count as u64), out)?;
    Ok(())
}

fn main() -> std::io::Result<()> {
    let args = Args::from_args();
    let desired_total_byte_count = args.size.as_u64() as usize;
    let whitespace_byte_count_ratio = args.whitespace_byte_count_ratio;
    assert!(
        (0.0..1.0).contains(&whitespace_byte_count_ratio),
        "whitespace_byte_count ratio must be in [0, 1)"
    );
    let desired_non_whitespace_byte_count_byte_count =
        (desired_total_byte_count as f64 * (1.0 - whitespace_byte_count_ratio)) as usize;
    let stdout = std::io::stdout();
    let mut out = BufWriter::new(stdout.lock());
    out.write_all(b"[")?;
    let mut non_whitespace_byte_count_byte_count = 1usize; // Counts the opening '['.
    let mut whitespace_byte_count = 0usize;
    for i in 0u64.. {
        let number = i.to_string();
        let entry_byte_count = number.len() + if i == 0 { 0 } else { 1 }; // +1 for ','.
        if non_whitespace_byte_count_byte_count + entry_byte_count + 1
            > desired_non_whitespace_byte_count_byte_count
        {
            // +1 for the closing ']'.
            break;
        }
        if i > 0 {
            out.write_all(b",")?;
            if i % 10 == 0 {
                out.write_all(b"\n")?;
                whitespace_byte_count += 1;
            }
        }
        // Add spaces proportionally to maintain the whitespace_byte_count ratio.
        let desired_whitespace_byte_count = (non_whitespace_byte_count_byte_count as f64
            * whitespace_byte_count_ratio
            / (1.0 - whitespace_byte_count_ratio)) as usize;
        let spaces_to_write = desired_whitespace_byte_count.saturating_sub(whitespace_byte_count);
        write_spaces(&mut out, spaces_to_write)?;
        whitespace_byte_count += spaces_to_write;
        out.write_all(number.as_bytes())?;
        non_whitespace_byte_count_byte_count += entry_byte_count;
    }
    // Pad with remaining whitespace to reach desired_total_byte_count.
    let trailing_spaces = desired_total_byte_count
        .saturating_sub(non_whitespace_byte_count_byte_count + whitespace_byte_count + 1);
    write_spaces(&mut out, trailing_spaces)?;
    whitespace_byte_count += trailing_spaces;
    out.write_all(b"]")?;
    non_whitespace_byte_count_byte_count += 1;
    assert_eq!(
        non_whitespace_byte_count_byte_count + whitespace_byte_count,
        desired_total_byte_count,
        "total bytes written does not match desired total"
    );
    Ok(())
}
```

So with a test file generated with the whitespace option
(around 90% whitespace) and the benchmark rerun, I am again seeing a 20% slowdown. So I’ll exclude the line/column information from the stream-based mode and make the other requested changes.
So I’ve been using jaq as a replacement for jq a bit, and one thing I noticed is that for an invalid JSON file it did not provide me with the location of the parsing error.

```
$ cat invalid.json | jaq
Error: failed to parse: comma or end of sequence expected
```

This PR adds line & column information to errors such that the output is now:

```
$ cat invalid.json | jaq
Error: failed to parse: at line 5, column 5: comma or end of sequence expected
```

Unlike jq, the offset calculation is character-based rather than byte-based, which I believe is an improvement. jq’s output for reference:

```
$ cat invalid.json | jq
jq: parse error: Expected separator between values at line 5, column 8
```

I also added a number of tests.