FidEx: Fidelity Detection for Web Archives

This repository contains the code and artifact for our NSDI 2026 paper "Detecting and Diagnosing Errors in Serving Archived Web Pages". FidEx reliably detects when an archived page differs from its original version and pinpoints the root cause.

Paper

Detecting and Diagnosing Errors in Serving Archived Web Pages
Jingyuan Zhu, Huanchen Sun, Harsha V. Madhyastha
NSDI 2026

FidEx detects fidelity violations in archived web pages by:

Comparing layout trees between live and archived versions
Detecting silent JavaScript errors
Analyzing screenshot differences
Pinpointing root causes of fidelity violations
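
To make the first step concrete, here is a toy comparison of two layout trees. It is only an illustration, not FidEx's algorithm: the `Node` structure, the exact-match rule on bounding boxes, and the path format are all assumptions made for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical layout node: a tag, a bounding box, and child nodes."""
    tag: str
    box: tuple  # (x, y, width, height) in pixels
    children: list = field(default_factory=list)


def diff_trees(live: Node, archived: Node, path: str = "") -> list[str]:
    """Return paths of nodes whose tag, box, or child count differ."""
    here = f"{path}/{live.tag}"
    # A mismatch at this node is reported once; its subtree is not descended.
    if live.tag != archived.tag or live.box != archived.box:
        return [here]
    if len(live.children) != len(archived.children):
        return [here]
    diffs = []
    for i, (a, b) in enumerate(zip(live.children, archived.children)):
        diffs.extend(diff_trees(a, b, f"{here}[{i}]"))
    return diffs


# Example: an image that collapses to zero size in the archived copy.
live = Node("body", (0, 0, 800, 600), [Node("img", (10, 10, 200, 100))])
archived = Node("body", (0, 0, 800, 600), [Node("img", (10, 10, 0, 0))])
print(diff_trees(live, archived))  # ['/body[0]/img']
```

Reporting only the topmost differing node keeps the output short; a real detector would likely need tolerance for small pixel shifts and for reordered children.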

Prerequisites (If not using Docker)

Python 3.11
Node.js 22
Chrome for Testing (version 127 or later)
At least 4 CPU cores and 8 GB of memory are recommended (we tested on Ubuntu 24.04)
Large datasets need ample disk space for WARC files, screenshots, and layout trees
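
Before installing without Docker, a quick preflight check can confirm the Python and Node.js versions listed above. The sketch below is not part of the repository: `parse_major` and `check_prerequisites` are hypothetical helpers, the listed versions are treated as minimums (an assumption), and `node` is assumed to be on PATH. Chrome for Testing is not checked because its binary location varies by installation.

```python
import shutil
import subprocess
import sys


def parse_major(version: str) -> int:
    """Return the major version from strings such as 'v22.3.0' or '3.11.4'."""
    return int(version.strip().lstrip("v").split(".")[0])


def check_prerequisites() -> list[str]:
    """Collect human-readable problems; an empty list means the checks passed."""
    problems = []
    # Assumption: Python 3.11 is a minimum, not an exact requirement.
    if sys.version_info[:2] < (3, 11):
        problems.append(f"Python 3.11 expected, found {sys.version.split()[0]}")
    node = shutil.which("node")
    if node is None:
        problems.append("node not found on PATH")
    else:
        out = subprocess.run([node, "--version"], capture_output=True, text=True)
        try:
            # Assumption: Node.js 22 is a minimum, not an exact requirement.
            if parse_major(out.stdout) < 22:
                problems.append(f"Node.js 22 expected, found {out.stdout.strip()}")
        except ValueError:
            problems.append("could not parse the output of `node --version`")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("WARNING:", problem)
```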

Container Setup

Running sudo ./run-docker.sh should build and run the container end to end. To build and run it manually, follow the instructions below.

Building the Docker Image

The repository includes a Dockerfile that sets up all dependencies. Build the image using:

docker build -t fidex .

This will:

Install Python 3.11 and Node.js 22
Set up pywb (web archive replay system)
Install all Python and Node.js dependencies
Configure the FidEx environment

Running the Container

docker run -it --rm \
    --name fidex \
    -p 5901:5901 \
    -e VNC_DISPLAY=1 \
    -v $(pwd)/fidelity-files/writes:/root/fidelity-files/writes \
    -v $(pwd)/fidelity-files/warcs:/root/fidelity-files/warcs \
    -v $(pwd)/measurement:/root/measurement \
    -v $(pwd)/fidex:/root/fidex \
    fidex

The container exposes:

Port 5901: VNC server for graphical access (optional)
Volume mounts: For persistent data storage

Accessing the Container

Once the container is running, you'll be dropped into a bash shell. The working directory is /root.

Artifact Evaluation

Quick Start: Running a Simple Fidelity Check

Enter the container (if not already inside):
```
docker exec -it fidex /bin/bash
```
Activate the FidEx virtual environment:
```
source /root/venv/fidex/bin/activate
```

Running Full Record/Replay Pipeline

The main evaluation pipeline involves three stages:

Record stage (capture live page):

cd /root/measurement
STAGE=record python auto_record_replay.py --input_file test_urls.json

Proxy stage (replay through proxy):

STAGE=proxy python auto_record_replay.py --input_file test_urls.json

Archive stage (replay from archive):

STAGE=archive python auto_record_replay.py --input_file test_urls.json
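
The three stages differ only in the `STAGE` environment variable, so they can be driven from one script. This is a minimal sketch, assuming the stages must run in the order listed and that a non-zero exit code should stop the run; `stage_env` and `run_pipeline` are hypothetical helpers, not part of FidEx.

```python
import os
import subprocess
import sys
from typing import Mapping

STAGES = ("record", "proxy", "archive")  # order taken from the steps above


def stage_env(stage: str, base_env: Mapping[str, str]) -> dict:
    """Return a copy of base_env with STAGE set for one pipeline stage."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage}")
    env = dict(base_env)
    env["STAGE"] = stage
    return env


def run_pipeline(input_file: str, cwd: str = "/root/measurement") -> None:
    """Run auto_record_replay.py once per stage, stopping at the first failure."""
    for stage in STAGES:
        cmd = [sys.executable, "auto_record_replay.py", "--input_file", input_file]
        # check=True raises CalledProcessError if the stage exits non-zero.
        subprocess.run(cmd, cwd=cwd, env=stage_env(stage, os.environ), check=True)
```

Inside the container, with the virtual environment active, `run_pipeline("test_urls.json")` mirrors the three commands above.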

Running Fidelity Detection

The layout_diff.py script compares layout trees between live and archived versions:

cd /root/measurement
python layout_diff.py fidelity \
    --base live \
    --comp archive \
    --input_file test_urls.json \
    --collection test

This will:

Compare live (--base live) vs archived (--comp archive) versions
Process URLs from the input file
Generate difference reports in the diffs directory (e.g., diffs/archive/live_archive_test.json)

Running Error Pinpointing

The error_pinpoint.py script pinpoints errors in the archived version:

cd /root/measurement
python error_pinpoint.py \
    --base live \
    --comp archive \
    --input_file diffs/live_archive_test.json \
    --collection test

This will:

Pinpoint errors in the archived version based on the diff file
Generate error reports in the pinpoint directory (e.g., pinpoint/live_archive_test.json)
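
Fidelity detection and error pinpointing can be chained, since the second step reads the diff file written by the first. The sketch below only assembles the two command lines shown above. `build_commands` and `run_all` are hypothetical helpers, and the diff path pattern `diffs/<base>_<comp>_<collection>.json` is inferred from the example path passed to error_pinpoint.py, so check it against the actual output before relying on it.

```python
import subprocess


def build_commands(base: str, comp: str, input_file: str, collection: str) -> list[list[str]]:
    """Return argv lists for layout_diff.py followed by error_pinpoint.py."""
    # Assumed naming pattern, inferred from diffs/live_archive_test.json.
    diff_file = f"diffs/{base}_{comp}_{collection}.json"
    detect = [
        "python", "layout_diff.py", "fidelity",
        "--base", base, "--comp", comp,
        "--input_file", input_file, "--collection", collection,
    ]
    pinpoint = [
        "python", "error_pinpoint.py",
        "--base", base, "--comp", comp,
        "--input_file", diff_file, "--collection", collection,
    ]
    return [detect, pinpoint]


def run_all(commands: list[list[str]], cwd: str = "/root/measurement") -> None:
    """Run each command in order; check=True stops at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, cwd=cwd, check=True)
```

For the test collection, `run_all(build_commands("live", "archive", "test_urls.json", "test"))` would run both steps back to back.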

Expected Outputs

After running evaluations, you should see:

fidelity-files/writes/: Contains recorded page data, screenshots, and instrumentation
fidelity-files/warcs/: Contains WARC files for archived pages
measurement/diffs/: Contains layout difference analysis results in JSON format

Example Output Structure

fidelity-files/
├── writes/
│   └── test/
│       └── google.com_<hash>/
│           ├── live_done
│           ├── archive_done
│           ├── live_screenshot.png
│           ├── archive_screenshot.png
│           └── ...
└── warcs/
    └── test/
        └── ...

measurement/
├── diffs/
├── pinpoint/
├── fidex_result/
└── ...
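
The `live_done` and `archive_done` entries in the tree above look like per-stage completion markers, which would make it easy to check which pages finished both stages. This is a minimal sketch under that assumption; `pages_with_both_markers` is a hypothetical helper, and the marker semantics are not confirmed by this README.

```python
from pathlib import Path


def pages_with_both_markers(collection_dir: Path) -> list[str]:
    """List page directories that contain both live_done and archive_done."""
    complete = []
    for page_dir in sorted(collection_dir.iterdir()):
        if not page_dir.is_dir():
            continue
        if (page_dir / "live_done").exists() and (page_dir / "archive_done").exists():
            complete.append(page_dir.name)
    return complete
```

For the test collection, the call would be `pages_with_both_markers(Path("fidelity-files/writes/test"))`.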

Configuration

The system uses configuration files located in:

/root/config.json: Main FidEx configuration
/root/fidelity-files/config.yaml: pywb configuration

You can modify these files to adjust behavior, ports, or paths.

Example Dataset

A companion dataset is available for artifact evaluation. See dataset/README.md for details on downloading and using the dataset.

Full results (raw)

The full results of our main run (crawl metadata, fidelity detection, error pinpointing, and clustering) are available in the measurement/fidex_result directory. The raw dataset, including the original WARC files and crawl data (screenshots, layout trees, etc.), is too large to host; we will try to host a sample subset in the future.

Directory Structure

FidEx/
├── fidex/              # Main FidEx codebase
│   ├── fidelity_check/ # Fidelity detection logic
│   ├── record_replay/  # Record/replay functionality
│   ├── error_pinpoint/ # Error pinpointing
│   └── tests/          # Test suites
├── measurement/        # Evaluation scripts
├── dataset/            # Dataset information
├── docker-only/        # Docker-specific files
├── Dockerfile          # Container definition
└── run-docker.sh       # Container startup script

Citation

If you use FidEx in your research, please cite our NSDI 2026 paper:

@inproceedings{zhu2026detecting,
  title={Detecting and Diagnosing Errors in Serving Archived Web Pages},
  author={Zhu, Jingyuan and Sun, Huanchen and Madhyastha, Harsha V.},
  booktitle={Proceedings of the 23rd USENIX Symposium on Networked Systems Design and Implementation (NSDI 26)},
  year={2026}
}

Contact

For questions about the artifact or paper, please contact the paper authors.
