We may need to find something other than SciGen after the ICLR/ICML submission(s). The SciGen "datasets" are JSON representations of the tables from each paper, which means they are quite noisy: compared to "actual" data, they contain a lot of spurious content that is an artifact of the translation from visual presentation to JSON. Including visualisations is an important use case for us, but it seems weird to obfuscate the visual version of the table by encoding it as JSON -- why not just send the image to the LLM?
SciGen also only considers the problem of text that explains (visual) tables, which in turn summarise data, whereas we also want to consider the situation where the text itself summarises some data.
As an example of the problem, in 1707.05853v2-230 we find:
{
"_method": "average pooling",
"_goals": "63.7 66.460.0",
"_requests": "96.6 96.896.0"
},
with the spaces after 63.7 and 96.6 being illegal characters, and the spaces that should follow 66.4 and 96.8 missing, because this is actually just a JSON screengrab of the following table:
Not only is "63.7 66.460.0" unparsable, but we also can't reasonably expect an LLM to guess the interpretation in terms of mean, max and min, as given in the table caption.
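To make the parsing problem concrete, here is a minimal Python sketch. The function names are hypothetical, and the "exactly one decimal digit per number" rule is an assumption that happens to fit this one record, not something SciGen guarantees. Even when the split succeeds, nothing in the JSON says which value is the mean, max or min.

```python
import re


def parse_cell_naive(cell: str) -> list[float]:
    """Split on whitespace and convert every token to a float."""
    return [float(token) for token in cell.split()]


def parse_cell_heuristic(cell: str) -> list[float]:
    """Recover the numbers from a fused cell.

    ASSUMPTION (ours, not SciGen's): every number is written with
    exactly one digit after the decimal point.
    """
    return [float(match) for match in re.findall(r"\d+\.\d", cell)]


cell = "63.7 66.460.0"

try:
    parse_cell_naive(cell)
except ValueError as err:
    # "66.460.0" is not a valid float literal, so the naive parse fails.
    print(f"naive parse failed: {err}")

# The heuristic splits the fused token, but the meaning of each value
# still has to come from the table caption.
print(parse_cell_heuristic(cell))
```

A heuristic like this would silently produce wrong numbers for any table that uses a different precision, which is part of why patching the JSON seems less attractive than working from the original table.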
Other potential problems with SciGen:
- Records that represent headings and other visual information, leading to non-homogeneous collections when represented as JSON;
- The text sometimes refers to multiple figures, but there is only one JSON table (is this a problem in our rendering of SciGen?).
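One rough way to spot the non-homogeneous records mentioned in the first point is to compare each record's keys against the most common key set in the table. This is only a sketch: `odd_records` is a hypothetical helper, and apart from the first record (taken from the example above) the sample records are made up for illustration.

```python
from collections import Counter


def odd_records(records: list[dict]) -> list[dict]:
    """Return the records whose key set differs from the most common one."""
    key_sets = [frozenset(record) for record in records]
    most_common_keys, _ = Counter(key_sets).most_common(1)[0]
    return [
        record
        for record, keys in zip(records, key_sets)
        if keys != most_common_keys
    ]


records = [
    # Taken from the 1707.05853v2-230 example above.
    {"_method": "average pooling", "_goals": "63.7 66.460.0", "_requests": "96.6 96.896.0"},
    # HYPOTHETICAL records, invented only to illustrate the check.
    {"_method": "hypothetical method", "_goals": "1.0 2.03.0", "_requests": "4.0 5.06.0"},
    {"_heading": "hypothetical heading row"},
]

# Prints only the record that does not share the dominant key set.
print(odd_records(records))
```

This only catches records with a different shape; a heading row that reuses the same keys as the data rows would go unnoticed.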
See also: