Commit 6c2e1e4

Enhance documentation for language-specific tokenizers and per-property stopword overrides
1 parent 508bd44 commit 6c2e1e4

2 files changed

Lines changed: 11 additions & 8 deletions

docs/weaviate/concepts/indexing/inverted-index.md

Lines changed: 5 additions & 2 deletions
@@ -207,11 +207,14 @@ Weaviate provides several tokenization methods optimized for different data type
 - **`field`**: No splitting - entire value is one token. For exact matching.

 **Language-specific methods** (for languages without word boundaries):
-- **`gse`**: Chinese text segmentation using Jieba algorithm
+- **`gse`**: Japanese text segmentation using the [`gse`](https://pkg.go.dev/github.com/go-ego/gse) tokenizer (Japanese dictionary)
+- **`gse_ch`**: Chinese text segmentation using the same `gse` tokenizer with a Chinese dictionary
 - **`trigram`**: Splits into character trigrams for CJK languages
 - **`kagome_ja`**: Japanese morphological analysis
 - **`kagome_kr`**: Korean morphological analysis

+These language-specific tokenizers are not loaded by default. Enable them with the corresponding environment variables (`ENABLE_TOKENIZER_GSE`, `ENABLE_TOKENIZER_GSE_CH`, `ENABLE_TOKENIZER_KAGOME_JA`, `ENABLE_TOKENIZER_KAGOME_KR`).
+
 See the [tokenization configuration reference](../../config-refs/collections.mdx#tokenization) for detailed specifications and behavior examples.

 ### Accent folding
@@ -309,7 +312,7 @@ Beyond the built-in `en` and `none` presets, you can declare custom stopword pre

 #### Per-property stopword overrides

-Each text property can override the collection-level stopword behavior via `textAnalyzer.stopwordPreset`. This is useful for multilingual collections where different properties contain text in different languages.
+Each text property can override the collection-level stopword behavior via `textAnalyzer.stopwordPreset`. This is useful for multilingual collections where different properties contain text in different languages. The override is only supported on properties with `tokenization: "word"` — schema validation rejects it on other tokenizers.

 ```json
 "properties": [

docs/weaviate/config-refs/indexing/inverted-index.mdx

Lines changed: 6 additions & 6 deletions
@@ -207,13 +207,13 @@ The existing [`stopwords`](#stopwords) configuration remains as the default for

 <TokenizerPreview />

-Part of a **property definition** (not `invertedIndexConfig`). Configures text analysis behavior for individual `text` properties, including accent folding and per-property stopword overrides.
+Part of a **property definition** (not `invertedIndexConfig`). Configures text analysis behavior for individual `text` properties, including accent folding and per-property stopword overrides. Supported on properties with tokenization `word`, `lowercase`, `whitespace`, `field`, or `trigram` — not on the language-specific tokenizers (`gse`, `gse_ch`, `kagome_ja`, `kagome_kr`).

-| Parameter | Type | Default | Details |
-| :---------------- | :--------- | :------ | :---------------------------------------------------------------------------------------------------------------------- |
-| `asciiFold` | `boolean` | `false` | Normalizes accented Latin characters to ASCII equivalents during indexing and querying. Uses Unicode NFD decomposition. |
-| `asciiFoldIgnore` | `string[]` | `[]` | Characters exempt from ASCII folding. **Immutable** after property creation. |
-| `stopwordPreset` | `string` | (none) | Name of a collection-level stopword preset to use for this property, overriding the default `stopwords` config. |
+| Parameter | Type | Default | Details |
+| :---------------- | :--------- | :------ | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `asciiFold` | `boolean` | `false` | Normalizes accented Latin characters to ASCII equivalents during indexing and querying. Uses Unicode NFD decomposition. **Immutable** after the property is created. |
+| `asciiFoldIgnore` | `string[]` | `[]` | Characters exempt from ASCII folding. Each entry must be a single character. Can be updated after property creation, but changes only apply to newly indexed data. |
+| `stopwordPreset` | `string` | (none) | Name of a built-in (`en`, `none`) or collection-level stopword preset to use for this property, overriding the default `stopwords` config. **Only supported on properties with `tokenization: "word"`** — schema validation rejects it on other tokenizers. |

 <details>
 <summary>Example <code>textAnalyzer</code> configuration - JSON object</summary>
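
Reviewer note: the constraints this commit documents can be read as a small set of schema checks. The Python sketch below is illustrative only and is not Weaviate's validation code. `check_property`, `required_env_var`, and the message strings are hypothetical names, and the sketch assumes a property is a plain dict with an explicit `tokenization` key. It also treats "single character" as one Unicode code point, which may differ from the server's rule.

```python
from typing import Dict, List, Optional

# Language-specific tokenizers and the opt-in environment variables named in the diff.
LANGUAGE_TOKENIZER_ENV: Dict[str, str] = {
    "gse": "ENABLE_TOKENIZER_GSE",
    "gse_ch": "ENABLE_TOKENIZER_GSE_CH",
    "kagome_ja": "ENABLE_TOKENIZER_KAGOME_JA",
    "kagome_kr": "ENABLE_TOKENIZER_KAGOME_KR",
}

# Tokenizers on which the diff says `textAnalyzer` is supported.
TEXT_ANALYZER_TOKENIZERS = {"word", "lowercase", "whitespace", "field", "trigram"}


def required_env_var(tokenization: str) -> Optional[str]:
    """Return the env var that must be enabled for this tokenizer, if any."""
    return LANGUAGE_TOKENIZER_ENV.get(tokenization)


def check_property(prop: dict) -> List[str]:
    """Return a list of problems with a property's textAnalyzer block (hypothetical helper)."""
    problems: List[str] = []
    name = prop.get("name", "?")
    tokenization = prop["tokenization"]  # assumption: always given explicitly
    analyzer = prop.get("textAnalyzer")
    if analyzer is None:
        return problems

    # textAnalyzer is documented as unsupported on the language-specific tokenizers.
    if tokenization not in TEXT_ANALYZER_TOKENIZERS:
        problems.append(f"{name}: textAnalyzer is not supported with tokenization '{tokenization}'")

    # stopwordPreset is documented as valid only with tokenization "word".
    if "stopwordPreset" in analyzer and tokenization != "word":
        problems.append(f"{name}: stopwordPreset requires tokenization 'word'")

    # Each asciiFoldIgnore entry must be a single character (here: one code point).
    for entry in analyzer.get("asciiFoldIgnore", []):
        if len(entry) != 1:
            problems.append(f"{name}: asciiFoldIgnore entry {entry!r} is not a single character")

    return problems
```

A property such as `{"name": "title", "tokenization": "word", "textAnalyzer": {"stopwordPreset": "en"}}` passes these checks, while the same `textAnalyzer` on a `lowercase` property is flagged.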
