
Alternative to SciGen #245

@rolyp

We may need to find something other than SciGen after the ICLR/ICML submission(s). The SciGen "datasets" are JSON representations of the tables from the papers, which means they are quite noisy: compared to "actual" data, they contain a lot of spurious content that is an artifact of the translation from visual presentation to JSON. Including visualisations is an important use case for us, but it seems weird to obfuscate the visual version of the table by encoding it as JSON -- why not just send the image to the LLM?

SciGen also only considers text that explains (visual) tables that summarise data, whereas we also want to consider the situation where the text itself summarises some data.

As an example of the problem, in 1707.05853v2-230 we find:

    {
        "_method": "average pooling",
        "_goals": "63.7 66.460.0",
        "_requests": "96.6 96.896.0"
    },

The spaces after 63.7 and 96.6 are illegal characters, and the spaces after 66.4 and 96.8 are missing, because this is really just a JSON "screengrab" of the following table:

Image

Not only is "63.7 66.460.0" unparsable, but we also can't reasonably expect an LLM to guess the interpretation in terms of mean, max and min, as given in the table caption.
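
To make the parsing problem concrete, here is a minimal Python sketch (not part of SciGen or our pipeline; `parse_cell` and `unparsable_fields` are hypothetical helpers). It assumes a numeric cell should be a whitespace-separated list of floats and flags the fields of a record that break that assumption. The record is the one quoted above, typed here with ordinary spaces.

```python
from typing import List, Optional


def parse_cell(cell: str) -> Optional[List[float]]:
    """Return the numbers in a cell, or None if any token is not a valid float.

    Assumption: a numeric cell is a whitespace-separated list of floats.
    str.split() with no arguments splits on any Unicode whitespace.
    """
    values = []
    for token in cell.split():
        try:
            values.append(float(token))
        except ValueError:
            return None
    return values


def unparsable_fields(record, text_fields=("_method",)):
    """List the keys whose values cannot be read as numbers.

    `text_fields` names the keys we assume hold free text rather than numbers.
    """
    return [
        key
        for key, value in record.items()
        if key not in text_fields and parse_cell(value) is None
    ]


# The record quoted in this issue.
record = {
    "_method": "average pooling",
    "_goals": "63.7 66.460.0",
    "_requests": "96.6 96.896.0",
}

print(unparsable_fields(record))
```

Both `_goals` and `_requests` are flagged, because the fused tokens `66.460.0` and `96.896.0` are not valid floats. A check like this could only tell us how widespread the problem is; it cannot recover the mean/max/min interpretation, which lives in the caption.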

Other potential problems with SciGen

  • Records that represent headings and other visual information, leading to non-homogeneous collections when represented as JSON;
  • Text sometimes refers to multiple figures, but is paired with only one JSON table (is this a problem in our rendering of SciGen?).
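
The first point can be checked mechanically. As a rough sketch (the records below are made up for illustration, not taken from SciGen, and `key_signatures` and `is_homogeneous` are hypothetical helpers), we could count the distinct key sets among a table's records; more than one suggests heading rows or other visual structure mixed in with the data.

```python
from collections import Counter


def key_signatures(records):
    """Count how many records share each set of keys."""
    return Counter(frozenset(record) for record in records)


def is_homogeneous(records):
    """True if every record has the same keys (an empty list counts as homogeneous)."""
    return len(key_signatures(records)) <= 1


# Made-up example: the first record is a heading row, not data.
table = [
    {"_heading": "placeholder heading"},
    {"_method": "method A", "_goals": "1.0"},
    {"_method": "method B", "_goals": "2.0"},
]

print(is_homogeneous(table))      # heading row included
print(is_homogeneous(table[1:]))  # data rows only
```

This only compares key sets, so it would miss heading rows that happen to reuse the data keys.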
