Alternative to SciGen

We may need to find something other than SciGen after the ICLR/ICML submission(s). The SciGen "datasets" are JSON representations of the _tables from the paper_, which means they are quite noisy: compared to "actual" data, they contain a lot of spurious content that is an artifact of the translation from visual presentation to JSON. Including visualisations *is* an important use case for us, but it seems weird to obfuscate the visual version of the table by encoding it as JSON -- why not just send the image to the LLM?

SciGen also *only* considers the problem of text that explains (visual) tables that summarise data, whereas we also want to consider the situation where the text itself summarises some data.

As an example of the problem, in `1707.05853v2-230` we find:

```json
    {
        "_method": "average pooling",
        "_goals": "63.7 66.460.0",
        "_requests": "96.6 96.896.0"
    },
```
Here the spaces after `63.7` and `96.6` are illegal characters, and the spaces after `66.4` and `96.8` are missing, because this is effectively just a JSON screengrab of the following table:

<img width="300" alt="Image" src="https://github.com/user-attachments/assets/d546e737-dc18-49ef-a8d5-8f67d9b5214f" />

Not only is `"63.7 66.460.0"` unparsable, but we also can't reasonably expect an LLM to guess that the three values are the mean, max and min, which is stated only in the table caption.
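To make "unparsable" concrete, here is a minimal sketch of how such cells could be flagged. It assumes a cell is meant to hold whitespace-separated numbers, which is my reading of the table rather than anything SciGen documents. `parse_cell` is a hypothetical helper, and the record is the one quoted above with ordinary spaces standing in for the illegal characters.

```python
import json
from typing import List, Optional


def parse_cell(cell: str) -> Optional[List[float]]:
    """Split a cell on whitespace and convert each token to a float.

    Returns None as soon as a token is not a valid number.
    (Hypothetical helper, not part of SciGen.)
    """
    values = []
    for token in cell.split():
        try:
            values.append(float(token))
        except ValueError:
            return None
    return values


# The record from `1707.05853v2-230`, as quoted above.
record = json.loads(
    '{"_method": "average pooling",'
    ' "_goals": "63.7 66.460.0",'
    ' "_requests": "96.6 96.896.0"}'
)

# Collect the numeric-looking fields that cannot be parsed.
unparsable = {
    key: value
    for key, value in record.items()
    if key != "_method" and parse_cell(value) is None
}
print(unparsable)
```

Both `_goals` and `_requests` are flagged, because the fused tokens `66.460.0` and `96.896.0` are not valid floats. Even with the spacing repaired, nothing in the record says which value is the mean, the max or the min.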

#### Other potential problems with SciGen
- Some records represent _headings_ and other visual information, leading to non-homogeneous collections when represented as JSON;
- The text sometimes refers to multiple figures, but only one JSON table is provided (is this a problem in our rendering of SciGen?).

See also:
- #91 
- #241 
