Raw input column #390

rvashurin · 2025-08-14T11:06:26Z

Adds a stripped_input column to most of the datasets
Reduces the memory footprint of dataset publishing by switching to generator-based iteration
Adds a validation script to prevent regressions in datasets
Adds a publishing script that batch-publishes all datasets to a specified namespace
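
As a rough illustration of the generator-based iteration mentioned above, the sketch below yields one row at a time instead of building full lists in memory, then runs a small column check of the kind a validation script might perform. The names `iter_rows` and `validate_rows`, the default prompt, and the `input`/`output` column names are assumptions made for illustration; only `stripped_input` is named in this PR. A generator of this shape could be passed to `datasets.Dataset.from_generator`, though whether the PR does exactly that is not shown here.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Assumed column names; only "stripped_input" is confirmed by the PR text.
REQUIRED_COLUMNS = ("input", "stripped_input", "output")


def iter_rows(
    pairs: Iterable[Tuple[str, str]], prompt: str = "Q: {text}\nA:"
) -> Iterator[Dict[str, str]]:
    """Yield one row at a time so the full dataset is never held in memory."""
    for question, answer in pairs:
        yield {
            "input": prompt.format(text=question),
            "stripped_input": question,  # assumption: raw text without the prompt
            "output": answer,
        }


def validate_rows(rows: Iterable[Dict[str, str]]) -> int:
    """Check that every row has the expected string columns; return the row count."""
    count = 0
    for i, row in enumerate(rows):
        missing = [c for c in REQUIRED_COLUMNS if c not in row]
        if missing:
            raise ValueError(f"row {i} is missing columns: {missing}")
        if not all(isinstance(row[c], str) for c in REQUIRED_COLUMNS):
            raise ValueError(f"row {i} has a non-string column")
        count += 1
    return count


if __name__ == "__main__":
    sample = [("What is the capital of France?", "Paris")]
    print(validate_rows(iter_rows(sample)))
```

Because `validate_rows` consumes the generator lazily, the check itself also runs in constant memory.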

ArtemVazh · 2025-08-19T11:17:06Z

dataset_builders/README.md

+  - QA: `Q: <question>\nA:`
+  - Multiple choice: `Q: <question>\nA. <choice0>\nB. <choice1>\n...\nA:`
+  - Translation: `Original:\n<source>\nTranslation:`
+  - Summarization: `Original:\n<source>\nSummary:`


Should we keep this structure? It seems that all methods using stripped_input expect raw input, without prefixes such as Original: or Q:.

I'm not sure whether it's best to mark the parts of the task (input/output) for the NLI task. For summarization it probably makes sense to mark the output as a summary, since otherwise it could be read as just another sentence from the text. I'll check how it works with and without these markers and report back.
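
For reference, the four formats quoted from the README diff can be written as small formatting helpers. This is a sketch only: the function names are hypothetical (the PR's own helpers, such as `qa_stripped`, are not shown in full here), and the multiple-choice helper assumes choices are lettered A, B, C, ... in order.

```python
from string import ascii_uppercase
from typing import Sequence


def format_qa(question: str) -> str:
    """QA: `Q: <question>\\nA:`"""
    return f"Q: {question}\nA:"


def format_multiple_choice(question: str, choices: Sequence[str]) -> str:
    """Multiple choice: question, lettered choices, then the answer marker."""
    if len(choices) > len(ascii_uppercase):
        raise ValueError("too many choices to letter A-Z")
    lines = [f"Q: {question}"]
    lines += [f"{letter}. {choice}" for letter, choice in zip(ascii_uppercase, choices)]
    lines.append("A:")
    return "\n".join(lines)


def format_translation(source: str) -> str:
    """Translation: `Original:\\n<source>\\nTranslation:`"""
    return f"Original:\n{source}\nTranslation:"


def format_summarization(source: str) -> str:
    """Summarization: `Original:\\n<source>\\nSummary:`"""
    return f"Original:\n{source}\nSummary:"


if __name__ == "__main__":
    print(format_multiple_choice("2 + 2 = ?", ["3", "4"]))
```

Dropping the markers, as the review comment suggests, would amount to returning `question` or `source` unchanged.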

ArtemVazh · 2025-08-19T11:22:05Z

dataset_builders/builders/person.py

+        x.append(prompt.format(text=q) if len(prompt) else q)
        y.append("")
-    return x, y
+        s.append(qa_stripped(q))


I’m not sure QA is the best format for this dataset.
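
To make the diff fragment above easier to follow, here is a hedged sketch of the loop shape it implies: three parallel lists holding the prompted input, the (empty) target, and the stripped input. The function name `build_person_columns`, the three-value return, and the body of `qa_stripped` are assumptions; the diff shows only the three `append` calls and the removal of `return x, y`.

```python
from typing import Iterable, List, Tuple


def qa_stripped(question: str) -> str:
    # Hypothetical stand-in: the real helper's body is not shown in the PR excerpt.
    return f"Q: {question}\nA:"


def build_person_columns(
    questions: Iterable[str], prompt: str = ""
) -> Tuple[List[str], List[str], List[str]]:
    """Build parallel lists of prompted inputs, targets, and stripped inputs."""
    x: List[str] = []
    y: List[str] = []
    s: List[str] = []
    for q in questions:
        # Apply the prompt template only when one is provided.
        x.append(prompt.format(text=q) if len(prompt) else q)
        y.append("")  # this dataset has no reference output
        s.append(qa_stripped(q))
    # Assumption: the stripped inputs are returned alongside x and y.
    return x, y, s
```

If a different stripped format were chosen for this dataset, as the comment above questions, only the `qa_stripped` call would need to change in a layout like this.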

dataset_builders/builders/trivia_qa.py

rvashurin added 12 commits August 12, 2025 11:21

Add publish, validate scripts. Modify datasets to contain raw input

633faf2

Publish all

bf1e8a0

Try generator dataset to avoid OOM

31e9cf5

Fix samsum reference on hub

37096b2

Fix another samsum

79d38aa

Fix trivia qa

17493f0

Use dataset-specific features when publishing a dataset

5b81d55

Fix final prompt for trivia qa

49c055d

Align xsum with current deployed version

d63168e

Update readme for datasets

9ca0e42

Lint

9481a75

Remove junk line from readme

0447d5d

rvashurin requested review from ArtemVazh, IINemo, alfekka and cant-access-rediska0123 August 14, 2025 11:07

ArtemVazh reviewed Aug 19, 2025

dataset_builders/builders/trivia_qa.py (outdated, resolved)

IINemo assigned alfekka Dec 18, 2025

IINemo assigned rvashurin Dec 25, 2025

rvashurin added 2 commits January 9, 2026 14:17

Update stripped formats

db055c4

Merge main

89c93e0

ArtemVazh approved these changes Jan 10, 2026

rvashurin merged commit c70e378 into main Jan 10, 2026
1 check passed

rvashurin deleted the raw_input_column branch January 10, 2026 18:47
