
Commit 9ccda2f

Merge pull request #12 from gleanwork/docs/restructure-readme-for-onboarding
Restructure README for fast onboarding
2 parents 8304cec + 78206b6

13 files changed

Lines changed: 639 additions & 426 deletions

.markdown-coderc.json

Lines changed: 1 addition & 2 deletions
```diff
@@ -1,6 +1,5 @@
 {
   "snippetRoot": "./snippets",
-  "markdownGlob": "README.md",
+  "markdownGlob": "{README.md,docs/**/*.md}",
   "includeExtensions": [".ts", ".js", ".py"]
 }
-
```

README.md

Lines changed: 57 additions & 330 deletions
Large diffs are not rendered by default.

docs/advanced.md

Lines changed: 78 additions & 0 deletions
# Advanced Usage

## Choosing a Connector Type

| Connector | Data Client | Best For |
|---|---|---|
| `BaseDatasourceConnector` | `BaseDataClient` | Small-to-medium datasets that fit in memory |
| `BaseStreamingDatasourceConnector` | `BaseStreamingDataClient` | Large datasets with sync/paginated APIs |
| `BaseAsyncStreamingDatasourceConnector` | `BaseAsyncStreamingDataClient` | Large datasets with async APIs (aiohttp, httpx async) |
### BaseDatasourceConnector

**Use when:**

- All data fits comfortably in memory
- Your API returns all data in one call (or a small number of calls)
- You're indexing wikis, knowledge bases, documentation sites, or file systems with moderate content

**Avoid when:**

- The dataset is too large to fit in memory
- Individual documents are very large (> 10 MB each)
- Memory usage is a concern
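For illustration, the matching data client is a single fetch call. This is a sketch, not the SDK's exact API: the import path, endpoint, and field names are assumptions; `BaseDataClient` and `get_source_data()` are the names used in these docs.

```python
from typing import Sequence, TypedDict

import requests

# Assumed import path; adjust to the SDK's actual package layout.
from glean_indexing_sdk import BaseDataClient


class WikiPage(TypedDict):
    id: str
    title: str
    body: str


class WikiDataClient(BaseDataClient[WikiPage]):
    def get_source_data(self) -> Sequence[WikiPage]:
        # One request returns the whole dataset; fine while it fits in memory.
        response = requests.get("https://wiki.example.com/api/pages", timeout=30)
        response.raise_for_status()
        return response.json()["pages"]
```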
### BaseStreamingDatasourceConnector

**Use when:**

- Data is too large to load all at once
- Your source API is paginated
- You want to process data incrementally to limit memory usage
- You're in a memory-constrained environment

**Avoid when:**

- Your dataset fits comfortably in memory (use `BaseDatasourceConnector` instead for simplicity)
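A paginated source maps naturally onto a generator, so memory stays bounded by one page at a time. Again a sketch under the same assumptions (import path and method signature inferred from the data-client hierarchy in docs/architecture.md):

```python
from typing import Generator, TypedDict

import requests

# Assumed import path; adjust to the SDK's actual package layout.
from glean_indexing_sdk import BaseStreamingDataClient


class Ticket(TypedDict):
    id: str
    subject: str


class TicketDataClient(BaseStreamingDataClient[Ticket]):
    def get_source_data(self) -> Generator[Ticket, None, None]:
        # Walk the paginated API, yielding one record at a time so only
        # the current page is ever held in memory.
        page = 1
        while True:
            resp = requests.get(
                "https://support.example.com/api/tickets",
                params={"page": page},
                timeout=30,
            )
            resp.raise_for_status()
            tickets = resp.json()["tickets"]
            if not tickets:
                return
            yield from tickets
            page += 1
```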
### BaseAsyncStreamingDatasourceConnector

**Use when:**

- Your data source provides async APIs (e.g., `aiohttp`, `httpx` async client)
- You want non-blocking I/O during data retrieval
- You're already working in an async codebase
- You need to make concurrent requests to your source system

**Avoid when:**

- Your source API only has synchronous clients (use `BaseStreamingDatasourceConnector` instead)
- You don't need async I/O benefits
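The async variant is the same shape with an async generator, here sketched with `httpx` (import path and signature are assumptions, as above):

```python
from typing import AsyncGenerator, TypedDict

import httpx

# Assumed import path; adjust to the SDK's actual package layout.
from glean_indexing_sdk import BaseAsyncStreamingDataClient


class Ticket(TypedDict):
    id: str
    subject: str


class AsyncTicketDataClient(BaseAsyncStreamingDataClient[Ticket]):
    async def get_source_data(self) -> AsyncGenerator[Ticket, None]:
        # Non-blocking pagination with httpx's async client.
        async with httpx.AsyncClient() as client:
            page = 1
            while True:
                resp = await client.get(
                    "https://support.example.com/api/tickets",
                    params={"page": page},
                )
                resp.raise_for_status()
                tickets = resp.json()["tickets"]
                if not tickets:
                    return
                for ticket in tickets:
                    yield ticket
                page += 1
```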
## Forced Restart Uploads

All connector types support forced restart uploads via `force_restart=True`:

```python
connector.index_data(mode=IndexingMode.FULL, force_restart=True)
```

Or for async connectors:

```python
await connector.index_data_async(mode=IndexingMode.FULL, force_restart=True)
```
### When to Use

- Aborting and restarting a failed or interrupted upload
- Ensuring a clean upload state by discarding partial uploads
- Recovering from upload errors or inconsistent states
### How It Works

1. Generates a new `upload_id` to ensure clean separation from previous uploads
2. Sets `forceRestartUpload=True` on the **first batch only**
3. Continues with normal batch processing for subsequent batches

This feature is available on `BaseDatasourceConnector`, `BaseStreamingDatasourceConnector`, `BaseAsyncStreamingDatasourceConnector`, and `BasePeopleConnector`.
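For example, one plausible recovery pattern, following the snippets above (a sketch: it assumes `index_data()` raises an exception on a failed upload, which these docs don't specify):

```python
from glean_indexing_sdk import IndexingMode  # assumed import path

try:
    connector.index_data(mode=IndexingMode.FULL)
except Exception:
    # Retry from a clean slate: a new upload_id is generated and
    # forceRestartUpload is set on the first batch of the retry.
    connector.index_data(mode=IndexingMode.FULL, force_restart=True)
```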

docs/architecture.md

Lines changed: 59 additions & 0 deletions
# Architecture Overview

The Glean Indexing SDK follows a simple, predictable pattern for all connector types. Understanding this flow will help you implement any connector quickly.

## Data Flow

```mermaid
sequenceDiagram
    participant User
    participant Connector as "Connector<br/>(BaseDatasourceConnector<br/>or BasePeopleConnector)"
    participant DataClient as "DataClient<br/>(BaseDataClient<br/>or StreamingDataClient)"
    participant External as "External System<br/>(API/Database)"
    participant Glean as "Glean API"

    User->>+Connector: 1. connector.index_data()<br/>or connector.index_people()
    Connector->>+DataClient: 2. get_source_data()
    DataClient->>+External: 3. Fetch data
    External-->>-DataClient: Raw source data
    DataClient-->>-Connector: Typed source data
    Connector->>Connector: 4. transform() or<br/>transform_people()
    Note over Connector: Transform to<br/>DocumentDefinition or<br/>EmployeeInfoDefinition
    Connector->>+Glean: 5. Batch upload documents<br/>or employee data
    Glean-->>-Connector: Upload response
    Connector-->>-User: Indexing complete
```
## Key Components

1. **DataClient** — Fetches raw data from your external system (API, database, files, etc.)
2. **Connector** — Transforms your data into Glean's format and handles the upload process
## Connector Hierarchy

```
BaseConnector (abstract)
├── BaseDatasourceConnector[T] — documents that fit in memory
│   ├── BaseStreamingDatasourceConnector[T] — large/paginated datasets (sync generator)
│   └── BaseAsyncStreamingDatasourceConnector[T] — large datasets with async I/O
└── BasePeopleConnector — employee/identity indexing
```
## Data Client Hierarchy

```
BaseDataClient[T]               — fetches all data at once, returns Sequence[T]
BaseStreamingDataClient[T]      — yields data incrementally via Generator[T]
BaseAsyncStreamingDataClient[T] — yields data incrementally via AsyncGenerator[T]
```
## Implementation Pattern

Every connector follows the same four steps:

1. **Define your data type** — a `TypedDict` describing your source data
2. **Create a data client** — extends the appropriate `BaseDataClient` variant to fetch from your source
3. **Create a connector** — extends the appropriate `BaseDatasourceConnector` variant, sets `configuration`, and implements `transform()`
4. **Run it** — call `index_data()` (or `index_data_async()` for async connectors)
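As a compressed sketch of those four steps (class and method names follow this page; the import path, constructor shape, `configuration` contents, and `DocumentDefinition` fields are assumptions, not the SDK's exact API):

```python
from typing import Sequence, TypedDict

# Assumed import path; adjust to the SDK's actual package layout.
from glean_indexing_sdk import (
    BaseDataClient,
    BaseDatasourceConnector,
    DocumentDefinition,
    IndexingMode,
)


# 1. Define your data type.
class Article(TypedDict):
    id: str
    title: str
    body: str


# 2. Create a data client that fetches from your source.
class ArticleDataClient(BaseDataClient[Article]):
    def get_source_data(self) -> Sequence[Article]:
        return [{"id": "1", "title": "Hello", "body": "World"}]  # stand-in fetch


# 3. Create a connector that transforms records into Glean's format.
class ArticleConnector(BaseDatasourceConnector[Article]):
    configuration = {"name": "articles"}  # shape of the datasource config is assumed

    def transform(self, data: Sequence[Article]) -> list[DocumentDefinition]:
        # DocumentDefinition field names here are illustrative, not exact.
        return [
            DocumentDefinition(id=a["id"], title=a["title"], body=a["body"])
            for a in data
        ]


# 4. Run it (constructor shape assumed).
connector = ArticleConnector(data_client=ArticleDataClient())
connector.index_data(mode=IndexingMode.FULL)
```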
See the [Quickstart](../README.md) for a complete working example.
