# Advanced Usage

## Choosing a Connector Type

| Connector | Data Client | Best For |
|---|---|---|
| `BaseDatasourceConnector` | `BaseDataClient` | Small-to-medium datasets that fit in memory |
| `BaseStreamingDatasourceConnector` | `BaseStreamingDataClient` | Large datasets with sync/paginated APIs |
| `BaseAsyncStreamingDatasourceConnector` | `BaseAsyncStreamingDataClient` | Large datasets with async APIs (aiohttp, httpx async) |

### BaseDatasourceConnector

**Use when:**

- All data fits comfortably in memory
- Your API returns all data in one call (or a small number of calls)
- You're indexing wikis, knowledge bases, documentation sites, or file systems with moderate content

**Avoid when:**

- The dataset is too large to fit in memory
- Individual documents are very large (> 10 MB each)
- Memory usage is a concern
| 24 | + |
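As a minimal sketch of the in-memory pattern: the whole dataset is fetched in one call and held in a list. The class and method names below (`WikiConnector`, `fetch_documents`, the `Document` shape) are illustrative assumptions standing in for a `BaseDatasourceConnector` subclass, not the library's confirmed API.

```python
# Hypothetical sketch of an in-memory connector. `WikiConnector`
# stands in for a BaseDatasourceConnector subclass; the method name
# and Document fields are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class Document:
    id: str
    title: str
    body: str


class WikiConnector:
    def fetch_documents(self) -> list[Document]:
        # One call returns the entire dataset -- acceptable only
        # when everything fits comfortably in memory.
        return [
            Document(id="1", title="Home", body="Welcome"),
            Document(id="2", title="FAQ", body="Answers"),
        ]
```

Because the full list materializes at once, memory usage grows linearly with the dataset, which is why the streaming variants below exist.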
### BaseStreamingDatasourceConnector

**Use when:**

- Data is too large to load all at once
- Your source API is paginated
- You want to process data incrementally to limit memory usage
- You're in a memory-constrained environment

**Avoid when:**

- Your dataset fits comfortably in memory (use `BaseDatasourceConnector` instead for simplicity)
| 37 | + |
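The incremental pattern can be sketched with a plain generator: each page is fetched, yielded, and released before the next one is requested, so peak memory is bounded by the page size. This is a stand-in for the real `BaseStreamingDataClient` interface, whose actual signatures may differ; `FAKE_API` is a hypothetical placeholder for a paginated source.

```python
# Hypothetical sketch of paginated streaming. A generator yields one
# page at a time, so only `page_size` items are in memory at once.
# The real BaseStreamingDataClient API may differ.
from typing import Iterator

FAKE_API = [f"doc-{i}" for i in range(10)]  # stands in for a paginated source


def stream_documents(page_size: int = 3) -> Iterator[list[str]]:
    """Yield pages of documents instead of loading the full dataset."""
    for start in range(0, len(FAKE_API), page_size):
        # In a real connector this would be one HTTP call per page.
        yield FAKE_API[start : start + page_size]
```

The caller iterates with a simple `for page in stream_documents():` loop, processing and discarding each page before the next is fetched.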
### BaseAsyncStreamingDatasourceConnector

**Use when:**

- Your data source provides async APIs (e.g., `aiohttp`, `httpx` async client)
- You want non-blocking I/O during data retrieval
- You're already working in an async codebase
- You need to make concurrent requests to your source system

**Avoid when:**

- Your source API only has synchronous clients (use `BaseStreamingDatasourceConnector` instead)
- You don't need async I/O benefits
| 51 | + |
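The async variant follows the same page-at-a-time shape, but as an async generator, so the event loop can service other work while each page's I/O is in flight. This is a sketch under stated assumptions: `fetch_page` stands in for a real non-blocking HTTP call, and the real `BaseAsyncStreamingDataClient` interface may differ.

```python
# Hypothetical sketch of async streaming. An async generator yields
# pages while awaiting I/O, keeping the event loop unblocked. The
# real BaseAsyncStreamingDataClient interface may differ.
import asyncio
from typing import AsyncIterator


async def fetch_page(page: int) -> list[str]:
    await asyncio.sleep(0)  # stands in for a non-blocking HTTP call
    return [f"doc-{page}-{i}" for i in range(2)]


async def stream_documents(pages: int = 3) -> AsyncIterator[list[str]]:
    for page in range(pages):
        yield await fetch_page(page)


async def collect_all() -> list[str]:
    docs: list[str] = []
    async for batch in stream_documents():
        docs.extend(batch)
    return docs
```

With a real async client, the per-page fetches could also be overlapped (e.g., via `asyncio.gather`) to issue concurrent requests to the source system.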
## Forced Restart Uploads

All connector types support forced restart uploads via `force_restart=True`:

```python
connector.index_data(mode=IndexingMode.FULL, force_restart=True)
```

Or for async connectors:

```python
await connector.index_data_async(mode=IndexingMode.FULL, force_restart=True)
```

### When to Use

- Aborting and restarting a failed or interrupted upload
- Ensuring a clean upload state by discarding partial uploads
- Recovering from upload errors or inconsistent states

### How It Works

1. Generates a new `upload_id` to ensure clean separation from previous uploads
2. Sets `forceRestartUpload=True` on the **first batch only**
3. Continues with normal batch processing for subsequent batches

This feature is available on `BaseDatasourceConnector`, `BaseStreamingDatasourceConnector`, `BaseAsyncStreamingDatasourceConnector`, and `BasePeopleConnector`.
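
The three steps above can be sketched as follows. The field names `upload_id` and `forceRestartUpload` come from the text; the batching helper itself is a hypothetical illustration, not the library's internal implementation.

```python
# Illustrative sketch of forced-restart batching. The batch shape
# and helper are hypothetical; only the upload_id/forceRestartUpload
# behavior mirrors the documented steps.
import uuid


def build_batches(items: list[str], batch_size: int, force_restart: bool) -> list[dict]:
    upload_id = str(uuid.uuid4())  # step 1: fresh upload_id per run
    batches = []
    for i, start in enumerate(range(0, len(items), batch_size)):
        batches.append({
            "upload_id": upload_id,
            "items": items[start : start + batch_size],
            # step 2: flag only the first batch; step 3: subsequent
            # batches proceed normally under the same upload_id
            "forceRestartUpload": force_restart and i == 0,
        })
    return batches
```

Flagging only the first batch lets the server discard any previous partial upload exactly once, after which the remaining batches stream in as a normal upload.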