Enhance documentation for language-specific tokenizers and per-property stopword overrides
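
The support rules this change documents can be summarized as a small checker. This is a minimal sketch under stated assumptions, not Weaviate's schema validation code: `check_text_analyzer` and its messages are hypothetical names introduced here, and treating an omitted `tokenization` as `word` is an assumption.

```python
# Illustrative only: a hypothetical checker, not Weaviate's validation code.
LANGUAGE_TOKENIZERS = {"gse", "gse_ch", "kagome_ja", "kagome_kr"}


def check_text_analyzer(prop: dict) -> list[str]:
    """Return the rule violations for one property definition."""
    analyzer = prop.get("textAnalyzer")
    if not analyzer:
        return []

    # Assumption: an omitted tokenization is treated as "word".
    tokenization = prop.get("tokenization", "word")
    problems = []

    # textAnalyzer is documented as unsupported on language-specific tokenizers.
    if tokenization in LANGUAGE_TOKENIZERS:
        problems.append(f"textAnalyzer is not supported with tokenization {tokenization!r}")

    # stopwordPreset is documented as supported only with tokenization "word".
    if "stopwordPreset" in analyzer and tokenization != "word":
        problems.append("stopwordPreset requires tokenization 'word'")

    # Each asciiFoldIgnore entry is documented as a single character.
    for entry in analyzer.get("asciiFoldIgnore", []):
        if len(entry) != 1:
            problems.append(f"asciiFoldIgnore entry {entry!r} is not a single character")

    return problems


multilingual = [
    {"name": "title_en", "tokenization": "word", "textAnalyzer": {"stopwordPreset": "en"}},
    {"name": "sku", "tokenization": "field", "textAnalyzer": {"stopwordPreset": "en"}},
    {"name": "body_ja", "tokenization": "kagome_ja", "textAnalyzer": {"asciiFold": True}},
]
for prop in multilingual:
    print(prop["name"], check_text_analyzer(prop))
```

Only the first property passes: the second combines `stopwordPreset` with `field` tokenization, and the third sets `textAnalyzer` on a language-specific tokenizer.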

amourao · amourao · commit 6c2e1e4573af · 2026-04-23T15:49:13.000+01:00
diff --git a/docs/weaviate/concepts/indexing/inverted-index.md b/docs/weaviate/concepts/indexing/inverted-index.md
@@ -207,11 +207,14 @@ Weaviate provides several tokenization methods optimized for different data type
 - **`field`**: No splitting - entire value is one token. For exact matching.
 
 **Language-specific methods** (for languages without word boundaries):
-- **`gse`**: Chinese text segmentation using Jieba algorithm
+- **`gse`**: Japanese text segmentation using the [`gse`](https://pkg.go.dev/github.com/go-ego/gse) tokenizer (Japanese dictionary)
+- **`gse_ch`**: Chinese text segmentation using the same `gse` tokenizer with a Chinese dictionary
 - **`trigram`**: Splits into character trigrams for CJK languages
 - **`kagome_ja`**: Japanese morphological analysis
 - **`kagome_kr`**: Korean morphological analysis
 
+The `gse`, `gse_ch`, `kagome_ja`, and `kagome_kr` tokenizers are not loaded by default. Enable each one with its corresponding environment variable (`ENABLE_TOKENIZER_GSE`, `ENABLE_TOKENIZER_GSE_CH`, `ENABLE_TOKENIZER_KAGOME_JA`, `ENABLE_TOKENIZER_KAGOME_KR`).
+
 See the [tokenization configuration reference](../../config-refs/collections.mdx#tokenization) for detailed specifications and behavior examples.
 
 ### Accent folding
@@ -309,7 +312,7 @@ Beyond the built-in `en` and `none` presets, you can declare custom stopword pre
 
 #### Per-property stopword overrides
 
-Each text property can override the collection-level stopword behavior via `textAnalyzer.stopwordPreset`. This is useful for multilingual collections where different properties contain text in different languages.
+Each text property can override the collection-level stopword behavior via `textAnalyzer.stopwordPreset`. This is useful for multilingual collections where different properties contain text in different languages. The override is supported only on properties with `tokenization: "word"`; schema validation rejects it on other tokenizers.
 
 ```json
 "properties": [
diff --git a/docs/weaviate/config-refs/indexing/inverted-index.mdx b/docs/weaviate/config-refs/indexing/inverted-index.mdx
@@ -207,13 +207,13 @@ The existing [`stopwords`](#stopwords) configuration remains as the default for
 
 <TokenizerPreview />
 
-Part of a **property definition** (not `invertedIndexConfig`). Configures text analysis behavior for individual `text` properties, including accent folding and per-property stopword overrides.
+Part of a **property definition** (not `invertedIndexConfig`). Configures text analysis behavior for individual `text` properties, including accent folding and per-property stopword overrides. `textAnalyzer` is supported on properties with tokenization `word`, `lowercase`, `whitespace`, `field`, or `trigram`, but not on the language-specific tokenizers (`gse`, `gse_ch`, `kagome_ja`, `kagome_kr`). Individual parameters can be more restrictive: `stopwordPreset` requires `tokenization: "word"`.
 
-| Parameter         | Type       | Default | Details                                                                                                                 |
-| :---------------- | :--------- | :------ | :---------------------------------------------------------------------------------------------------------------------- |
-| `asciiFold`       | `boolean`  | `false` | Normalizes accented Latin characters to ASCII equivalents during indexing and querying. Uses Unicode NFD decomposition. |
-| `asciiFoldIgnore` | `string[]` | `[]`    | Characters exempt from ASCII folding. **Immutable** after property creation.                                            |
-| `stopwordPreset`  | `string`   | (none)  | Name of a collection-level stopword preset to use for this property, overriding the default `stopwords` config.         |
+| Parameter         | Type       | Default | Details                                                                                                                                                                                                                                  |
+| :---------------- | :--------- | :------ | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `asciiFold`       | `boolean`  | `false` | Normalizes accented Latin characters to ASCII equivalents during indexing and querying. Uses Unicode NFD decomposition. **Immutable** after the property is created.                                                                     |
+| `asciiFoldIgnore` | `string[]` | `[]`    | Characters exempt from ASCII folding. Each entry must be a single character. Can be updated after property creation, but changes only apply to newly indexed data.                                                                       |
+| `stopwordPreset`  | `string`   | (none)  | Name of a built-in (`en`, `none`) or collection-level stopword preset to use for this property, overriding the default `stopwords` config. **Only supported on properties with `tokenization: "word"`** — schema validation rejects it on other tokenizers. |
 
 <details>
   <summary>Example <code>textAnalyzer</code> configuration - JSON object</summary>