@dibahlfi
Member

@dibahlfi dibahlfi commented Jan 14, 2026

When a partition split occurs in Azure Cosmos DB, the SDK's retry logic for 410 (Gone) errors could recurse infinitely, causing queries to time out. Observed call chain:
TimeoutError after asyncio.exceptions.CancelledError
-> routing_map_provider.py: get_overlapping_ranges
-> routing_map_provider.py: init_collection_routing_map_if_needed
-> base_execution_context.py: _fetch_items_helper_with_retries
-> (recursive loop)

How the loop arises:

  1. A user query fails because the partition it targeted no longer exists (it was split into smaller partitions).
  2. The SDK tries to recover by refreshing its internal map of partitions.
  3. To refresh the map, the SDK must query Cosmos DB for the new partition list.
  4. That query also fails, because the partition information is itself affected by the split.
  5. The SDK tries to recover again, which triggers another refresh, which triggers another query, and so on without end.
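
The loop described above can be reproduced in a toy model. This is only an illustrative sketch, not the SDK's code: `ToyClient`, `GoneError`, `fetch_with_retries`, `refresh_routing_map`, and the `max_depth` safety cap are all hypothetical names invented here so the demo terminates.

```python
class GoneError(Exception):
    """Hypothetical stand-in for an HTTP 410 (Gone) response after a split."""


class ToyClient:
    """Toy model of a retry wrapper whose recovery path re-enters itself."""

    def __init__(self, max_depth=5):
        self.depth = 0
        self.max_depth = max_depth  # safety cap (assumption) so the demo ends

    def fetch_with_retries(self, what):
        self.depth += 1
        if self.depth > self.max_depth:
            raise RecursionError(f"gave up after {self.max_depth} nested retries")
        try:
            return self._execute(what)
        except GoneError:
            # Recovery: refresh the partition map, then retry the request.
            self.refresh_routing_map()
            return self._execute(what)

    def refresh_routing_map(self):
        # The refresh issues its own query through the same retry wrapper,
        # so a 410 here starts yet another refresh.
        self.fetch_with_retries("pk_ranges")

    def _execute(self, what):
        # In this toy, every request is affected by the split.
        raise GoneError(what)


client = ToyClient(max_depth=5)
try:
    client.fetch_with_retries("user query")
except RecursionError as exc:
    print(exc)
```

Without the cap, the nesting would continue until the interpreter's recursion limit or an outer timeout stopped it, which matches the timeout symptom reported here.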

Two changes fix the issue:

  1. A new internal flag (_internal_pk_range_fetch) marks partition-info queries as "internal". When this flag is set, the SDK skips the 410 retry logic, which breaks the infinite loop.
  2. After a partition split error, the SDK now resets its internal state before retrying, so the retry actually fetches data instead of returning empty results.

Additional improvement: logging was added throughout the partition split handling code to speed up diagnosis of any future issues.
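
A minimal sketch of the first change, under stated assumptions: only the flag name `_internal_pk_range_fetch` comes from this PR. The class, method names, the way the flag is passed (an options dict), and the `pk_ranges_fail` switch are hypothetical and exist only to show the guard. The state reset from the second change is not modeled.

```python
class GoneError(Exception):
    """Hypothetical stand-in for an HTTP 410 (Gone) response after a split."""


class ToyClient:
    """Toy retry wrapper that skips 410 recovery for internal fetches."""

    def __init__(self, pk_ranges_fail=False):
        self.calls = []              # record of executed requests
        self.split_pending = True    # the first user query hits a stale partition
        self.pk_ranges_fail = pk_ranges_fail
        self.routing_map = ["0"]

    def fetch_with_retries(self, what, options=None):
        options = options or {}
        if options.get("_internal_pk_range_fetch"):
            # Internal fetch: no retry handling, so errors propagate
            # to the caller instead of starting another refresh.
            return self._execute(what)
        try:
            return self._execute(what)
        except GoneError:
            self.refresh_routing_map()
            return self._execute(what)

    def refresh_routing_map(self):
        self.routing_map = self.fetch_with_retries(
            "pk_ranges", {"_internal_pk_range_fetch": True}
        )
        self.split_pending = False

    def _execute(self, what):
        self.calls.append(what)
        if what == "pk_ranges":
            if self.pk_ranges_fail:
                raise GoneError(what)
            return ["0-a", "0-b"]    # ranges after the split
        if self.split_pending:
            raise GoneError(what)
        return [f"row from {r}" for r in self.routing_map]


client = ToyClient()
print(client.fetch_with_retries("user query"))
print(client.calls)
```

In this model a failing internal fetch raises once and stops, so the worst case is a single surfaced error rather than unbounded nesting.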

@dibahlfi
Member Author

/azp run python - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Member

@tvaron3 tvaron3 left a comment

LGTM

@simorenoh simorenoh merged commit d764da3 into hotfix/azure-cosmos_4.14.4 Jan 15, 2026
21 checks passed
@simorenoh simorenoh deleted the users/dibahl/partition-recrsion-fix branch January 15, 2026 22:38
Projects

Status: Done

4 participants