Question about the metrics chosen in the paper

<img width="1141" height="1219" alt="Image" src="https://github.com/user-attachments/assets/291235b1-0e2d-4920-992e-338b79a9e20d" />

In the LoCoMo test, why did the authors choose F1 and BLEU (traditional metrics), while for the long-context QA tasks they chose QA accuracy?
I think F1 and BLEU may not be an ideal choice.
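
To illustrate the concern, here is a minimal sketch of a SQuAD-style token-level F1. The `token_f1` helper and the example strings are made up for illustration; they are not taken from LoCoMo or from the paper's evaluation code, and the paper may compute F1 differently. Under these assumptions, a correct but verbose answer scores lower than a short wrong one:

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 (hypothetical helper, whitespace tokenization only)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Illustrative strings only, not real benchmark data.
reference = "7 May 2023"
verbose_correct = "She went to the support group on 7 May 2023"
short_wrong = "8 May 2023"

print(f"verbose but correct: {token_f1(verbose_correct, reference):.3f}")
print(f"short but wrong:     {token_f1(short_wrong, reference):.3f}")
```

An accuracy-style judgment would mark the first answer correct and the second wrong, which is the opposite of the ordering that token F1 gives here.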