Handle rate limit headers more reliably #2027

@jcharaoui

As one of the maintainers of https://support.torproject.org, I'm looking into options for incorporating dead link checking into our CI pipeline, and this project seems very promising so far.

I've hit a snag, however, when crawling links to gitlab.torproject.org, our self-hosted GitLab instance: as soon as lychee hits that host, it seems to go into retry-backoff mode with a 1 minute delay between requests, which pretty much kills the performance of the whole process. If I exclude all gitlab.torproject.org links (and other links that redirect to that domain), it goes through something like 16 thousand links in a matter of seconds.

Given a bogus test.html file containing a small sample of such links:

<a href="https://gitlab.torproject.org/">foo</a>
<a href="https://gitlab.torproject.org/tpo/applications/vpn/-/issues/new">bar</a>
<a href="https://gitlab.torproject.org/tpo/applications/vpn-leak-test/-/blob/main/app/src/main/java/org/torproject/vpnleaktest/tests/DeviceIDsTest.kt#L235">baz</a>

Running lychee on this file takes several minutes:

$ podman run --init --rm -it -v $PWD:/app --workdir /app docker.io/lycheeverse/lychee:master test.html -vvv --max-concurrency 1
   [DEBUG] Following redirect to https://gitlab.torproject.org/tpo/team
    [WARN] Host gitlab.torproject.org sent an unexpectedly big rate limit backoff duration of 584542046years 1month 2days 15h 52m 15s 615ms. Capping the duration to 1m instead.
   [DEBUG] Host gitlab.torproject.org applying backoff delay of 60000ms due to previous rate limiting or errors
     [200] https://gitlab.torproject.org/ | Redirect: Followed 1 redirect resolving to the final status of: OK. Redirects: https://gitlab.torproject.org/ --[302]--> https://gitlab.torproject.org/tpo/team
   [DEBUG] Following redirect to https://gitlab.torproject.org/users/sign_in
    [WARN] Host gitlab.torproject.org sent an unexpectedly big rate limit backoff duration of 584542046years 1month 2days 15h 52m 15s 615ms. Capping the duration to 1m instead.
   [DEBUG] Host gitlab.torproject.org applying backoff delay of 60000ms due to previous rate limiting or errors
     [200] https://gitlab.torproject.org/tpo/applications/vpn/-/issues/new | Redirect: Followed 1 redirect resolving to the final status of: OK. Redirects: https://gitlab.torproject.org/tpo/applications/vpn/-/issues/new --[302]--> https://gitlab.torproject.org/users/sign_in
    [WARN] Host gitlab.torproject.org sent an unexpectedly big rate limit backoff duration of 584542046years 1month 2days 15h 52m 15s 615ms. Capping the duration to 1m instead.
     [200] https://gitlab.torproject.org/tpo/applications/vpn-leak-test/-/blob/main/app/src/main/java/org/torproject/vpnleaktest/tests/DeviceIDsTest.kt#L235
🔍 3 Total (in 2m 1s) ✅ 1 OK 🚫 0 Errors 🔀 2 Redirects

I'm not really sure what's happening here: where is this "584542046years 1month 2days 15h 52m 15s 615ms" duration coming from? From looking into the code a little bit, it seems related to a Retry-After HTTP header, but as far as I can tell, no such header is actually sent by this host. The log also mentions "previous rate limiting or errors", but it's unclear where these were encountered, e.g. is the 302 redirect being considered an error here?
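One observation that may help narrow this down: the odd duration in the warning appears to be exactly the largest unsigned 64-bit integer, interpreted as milliseconds. The sketch below is only arithmetic, not lychee code. The `breakdown` helper is hypothetical, and the unit sizes (a 365.25-day year and a 30.44-day month) are an assumption about how the log formats durations. If that assumption holds, the value looks more like a saturated or sentinel number than something parsed from a real header, though this does not prove where it originates.

```python
# Hypothetical helper: split a millisecond count into calendar-like units.
# Assumption: 1 year = 365.25 days and 1 month = 30.44 days. lychee's actual
# formatting code was not checked; these constants are a guess that happens
# to reproduce the logged string.
YEAR_S = 31_557_600   # 365.25 days in seconds
MONTH_S = 2_630_016   # 30.44 days in seconds
DAY_S = 86_400


def breakdown(total_ms: int) -> str:
    secs, ms = divmod(total_ms, 1000)
    years, rem = divmod(secs, YEAR_S)
    months, rem = divmod(rem, MONTH_S)
    days, rem = divmod(rem, DAY_S)
    hours, rem = divmod(rem, 3600)
    minutes, seconds = divmod(rem, 60)
    return (f"{years}years {months}month {days}days "
            f"{hours}h {minutes}m {seconds}s {ms}ms")


U64_MAX = 2**64 - 1  # largest unsigned 64-bit integer
print(breakdown(U64_MAX))
# → 584542046years 1month 2days 15h 52m 15s 615ms
```

The trailing 615ms matches the last three digits of 18446744073709551615, which is what made me suspect this value in the first place.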

I've tried running with RUST_LOG=trace hoping I could look at the actual HTTP headers sent and received by lychee, but I couldn't decipher those logs...
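As a way to look at the headers outside of lychee, something like the following could be used. It is a minimal sketch using only the Python standard library; `rate_limit_headers` and `fetch_headers` are hypothetical names, and the filter (Retry-After plus anything containing "ratelimit") is my guess at which headers might be relevant, not a list taken from lychee. Redirects are deliberately not followed, so the headers of the 302 response itself are visible.

```python
import http.client


def rate_limit_headers(headers):
    """Keep only headers that look rate-limit related (assumed filter)."""
    return [
        (name, value)
        for name, value in headers
        if name.lower() == "retry-after" or "ratelimit" in name.lower()
    ]


def fetch_headers(host, path="/"):
    """Issue one GET without following redirects; return (status, headers)."""
    conn = http.client.HTTPSConnection(host, timeout=10)
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        return resp.status, resp.getheaders()
    finally:
        conn.close()
```

For example, `status, headers = fetch_headers("gitlab.torproject.org")` followed by `print(status, rate_limit_headers(headers))` would show whether the first response carries any such header.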
