Question about F1 Performance Degradation #20

@zziho

Hi, thank you for the outstanding study. I have been trying to reproduce the results.
However, I see a significant performance gap and would like to confirm whether the warnings shown below can affect evaluation quality.

[Image: screenshot of my experiment results]

This is my experiment result. My F1 score is about 20 points lower than what is reported in the paper.
I also ran two additional experiments on different datasets, and both showed a similar gap.

WARNING:graphr1:Some nodes are missing, maybe the storage is damaged
WARNING:graphr1:Some edges are missing, maybe the storage is damaged

I would like to know whether these warnings could cause incomplete graph construction or lead to degraded F1/EM scores.
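
To quantify how often this happens, I could count how many retrieved IDs are absent from the graph store. The following is only a minimal sketch, not graphr1 code: `missing_ratio`, `retrieved_ids`, and `stored_ids` are hypothetical names, and it assumes that both the retrieved node (or edge) IDs and the IDs actually present in storage can be dumped to plain Python collections.

```python
from typing import Hashable, Iterable, Set, Tuple


def missing_ratio(
    retrieved_ids: Iterable[Hashable],
    stored_ids: Iterable[Hashable],
) -> Tuple[float, Set[Hashable]]:
    """Return the fraction of distinct retrieved IDs absent from storage,
    together with the set of missing IDs.

    Hypothetical helper: how the two ID collections are obtained depends on
    the storage backend and is not shown here.
    """
    retrieved = set(retrieved_ids)
    if not retrieved:
        # Nothing was retrieved, so nothing can be missing.
        return 0.0, set()
    missing = retrieved - set(stored_ids)
    return len(missing) / len(retrieved), missing


# Toy data for illustration only; edge IDs could be (source, target) tuples.
ratio, missing = missing_ratio(["a", "b", "c", "d"], ["a", "b", "x"])
print(f"{ratio:.0%} missing: {sorted(missing)}")
```

A ratio near zero would suggest the warnings are incidental, while a large ratio would point to an incomplete graph as a plausible cause of the score drop.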

  • Experiment Setup
    Model: Qwen2.5-3B-Instruct
    Datasets: 2WikiMultiHopQA, NQ
    Training & Evaluation: Same hyperparameters and pipeline as the official implementation

Questions:

  1. Do these graphr1 warnings indicate corrupted storage or missing nodes/edges that can affect retrieval or reasoning steps?
  2. Could this be the source of the large performance drop?
  3. Is there any recommended fix?
