A Python SDK for building custom Glean indexing connectors. Provides base classes and utilities to create connectors that fetch data from external systems and upload to Glean's indexing APIs.
- Python >= 3.10
- A Glean instance and an indexing API token
```shell
pip install glean-indexing-sdk
```

Every connector has two parts:
- DataClient — fetches raw data from your external system (API, database, files)
- Connector — transforms that data into Glean's format and uploads it
The workflow is: fetch → transform → upload. You implement get_source_data() on your data client and transform() on your connector; the SDK handles batching and upload.
See Architecture overview for a data flow diagram and the full class hierarchy.
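The contract can be sketched without the SDK at all. A minimal, hypothetical version of the fetch → transform → upload split (the class names and record shapes here are invented for illustration and are not the SDK's base classes):

```python
from typing import Any, Dict, List, Sequence

# Illustrative stand-ins for the DataClient/Connector split; these are
# NOT the SDK's base classes, just the shape of the contract.
class TicketClient:
    def get_source_data(self) -> Sequence[Dict[str, Any]]:
        # fetch: pull raw records from the external system
        return [{"id": "1", "name": "Doc One"}, {"id": "2", "name": "Doc Two"}]

class TicketConnector:
    def __init__(self, client: TicketClient) -> None:
        self.client = client

    def transform(self, data: Sequence[Dict[str, Any]]) -> List[Dict[str, Any]]:
        # transform: map raw records into the target document shape
        return [{"doc_id": r["id"], "title": r["name"]} for r in data]

    def run(self) -> List[Dict[str, Any]]:
        # upload elided; the real SDK batches and posts the documents
        return self.transform(self.client.get_source_data())

docs = TicketConnector(TicketClient()).run()
```

The SDK's base classes follow the same division of labor, with batching and upload handled for you.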
```shell
export GLEAN_SERVER_URL="https://your-company-be.glean.com"
export GLEAN_INDEXING_API_TOKEN="your-indexing-api-token"
# Deprecated alternative: GLEAN_INSTANCE is supported as a legacy fallback
# export GLEAN_INSTANCE="acme"
```

This complete example defines a data type, a data client, and a connector, then indexes everything into Glean:
```python
from typing import List, Sequence, TypedDict

from glean.indexing.connectors import BaseConnectorDataClient, BaseDatasourceConnector
from glean.indexing.models import (
    ContentDefinition,
    CustomDatasourceConfig,
    DocumentDefinition,
    IndexingMode,
    UserReferenceDefinition,
)


class WikiPageData(TypedDict):
    id: str
    title: str
    content: str
    author: str
    created_at: str
    updated_at: str
    url: str
    tags: List[str]


class WikiDataClient(BaseConnectorDataClient[WikiPageData]):
    def __init__(self, wiki_base_url: str, api_token: str):
        self.wiki_base_url = wiki_base_url
        self.api_token = api_token

    def get_source_data(self, since=None) -> Sequence[WikiPageData]:
        # Example static data
        return [
            {
                "id": "page_123",
                "title": "Engineering Onboarding Guide",
                "content": "Welcome to the engineering team...",
                "author": "jane.smith@company.com",
                "created_at": "2024-01-15T10:00:00Z",
                "updated_at": "2024-02-01T14:30:00Z",
                "url": f"{self.wiki_base_url}/pages/123",
                "tags": ["onboarding", "engineering"],
            },
            {
                "id": "page_124",
                "title": "API Documentation Standards",
                "content": "Our standards for API documentation...",
                "author": "john.doe@company.com",
                "created_at": "2024-01-20T09:15:00Z",
                "updated_at": "2024-01-25T16:45:00Z",
                "url": f"{self.wiki_base_url}/pages/124",
                "tags": ["api", "documentation", "standards"],
            },
        ]


class CompanyWikiConnector(BaseDatasourceConnector[WikiPageData]):
    configuration: CustomDatasourceConfig = CustomDatasourceConfig(
        name="company_wiki",
        display_name="Company Wiki",
        url_regex=r"https://wiki\.company\.com/.*",
        trust_url_regex_for_view_activity=True,
        is_user_referenced_by_email=True,
    )

    def transform(self, data: Sequence[WikiPageData]) -> List[DocumentDefinition]:
        documents = []
        for page in data:
            documents.append(
                DocumentDefinition(
                    id=page["id"],
                    title=page["title"],
                    datasource=self.name,
                    view_url=page["url"],
                    body=ContentDefinition(mime_type="text/plain", text_content=page["content"]),
                    author=UserReferenceDefinition(email=page["author"]),
                    created_at=self._parse_timestamp(page["created_at"]),
                    updated_at=self._parse_timestamp(page["updated_at"]),
                    tags=page["tags"],
                )
            )
        return documents

    def _parse_timestamp(self, timestamp_str: str) -> int:
        from datetime import datetime

        dt = datetime.fromisoformat(timestamp_str.replace("Z", "+00:00"))
        return int(dt.timestamp())


data_client = WikiDataClient(wiki_base_url="https://wiki.company.com", api_token="your-wiki-token")
connector = CompanyWikiConnector(name="company_wiki", data_client=data_client)

connector.configure_datasource()
connector.index_data(mode=IndexingMode.FULL)
```

| Connector | Data Client | Best For |
|---|---|---|
| `BaseDatasourceConnector` | `BaseDataClient` | Small-to-medium datasets that fit in memory. Wikis, knowledge bases, file systems. |
| `BaseStreamingDatasourceConnector` | `BaseStreamingDataClient` | Large or paginated datasets where you need to limit memory usage. Uses sync generators. |
| `BaseAsyncStreamingDatasourceConnector` | `BaseAsyncStreamingDataClient` | Large datasets with async APIs (aiohttp, httpx async). Non-blocking I/O. |
| `BasePeopleConnector` | — | Employee and identity data indexing. |
For detailed guidance on choosing between these, see the decision matrix.
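The streaming variants exist to keep memory flat: instead of materializing the whole dataset, the data client yields one page of records at a time from a generator. A sketch of that pattern, independent of the SDK's class names (the `fetch_page` helper is hypothetical and stands in for a real paginated API call):

```python
from typing import Dict, Iterator, List

def fetch_page(offset: int, limit: int) -> List[Dict[str, str]]:
    # Hypothetical paginated API call; replace with a real HTTP request.
    dataset = [{"id": str(i)} for i in range(5)]
    return dataset[offset : offset + limit]

def stream_records(page_size: int = 2) -> Iterator[Dict[str, str]]:
    """Yield records one at a time; only one page is ever held in memory."""
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        if not page:
            return
        yield from page
        offset += page_size

ids = [r["id"] for r in stream_records()]  # -> ['0', '1', '2', '3', '4']
```

The async variant is the same idea with an async generator, so a slow upstream API never blocks the event loop.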
- `IndexingMode.FULL` — Re-indexes all documents. Use for initial loads or when you need a complete refresh.
- `IndexingMode.INCREMENTAL` — Only indexes documents modified since the last crawl. Use for scheduled updates to minimize API calls.
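For incremental runs, the data client's `since` argument is the natural hook: return only records whose `updated_at` is newer than the last crawl. A minimal filtering sketch, assuming ISO-8601 timestamps like those in the wiki example above (the record shape is illustrative, not an SDK type):

```python
from datetime import datetime, timezone
from typing import Dict, List, Optional

def _ts(s: str) -> datetime:
    # Accept a trailing "Z"; datetime.fromisoformat on Python < 3.11 does not.
    return datetime.fromisoformat(s.replace("Z", "+00:00"))

def filter_since(records: List[Dict[str, str]], since: Optional[datetime]) -> List[Dict[str, str]]:
    """Keep only records modified after `since`; None means a full crawl."""
    if since is None:
        return records
    return [r for r in records if _ts(r["updated_at"]) > since]

records = [
    {"id": "a", "updated_at": "2024-01-25T16:45:00Z"},
    {"id": "b", "updated_at": "2024-02-01T14:30:00Z"},
]
since = datetime(2024, 1, 31, tzinfo=timezone.utc)
changed = filter_since(records, since)  # -> only "b" survives
```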
```python
connector.index_data(mode=IndexingMode.FULL)         # full re-index
connector.index_data(mode=IndexingMode.INCREMENTAL)  # only changes since last run
```

The SDK includes a `ConnectorTestHarness` that lets you validate your connector without making real API calls. It intercepts uploads and captures the documents your connector produces so you can assert on them.
```python
from glean.indexing.connectors import ConnectorTestHarness

harness = ConnectorTestHarness(connector)
harness.run()

validator = harness.get_validator()
validator.assert_documents_posted(count=2)

# Inspect individual documents
for doc in validator.documents_posted:
    print(doc.title)
```

This project uses mise for toolchain management and uv for Python dependencies.
```shell
mise run setup     # create venv and install dependencies
mise run test      # run all tests
mise run lint      # run all linters (ruff, pyright, markdown-code)
mise run lint:fix  # auto-fix lint issues and format code
```

- Architecture overview — data flow diagram and component hierarchy
- Streaming connectors — sync and async streaming walkthroughs
- Advanced usage — connector selection guide, forced restart uploads