Commit 27f74dc

Adding Ruff Lint: adding missing docstrings, improving consistency, updating README
1 parent 7805690 commit 27f74dc

13 files changed: 437 additions & 176 deletions
.github/workflows/pylint.yml

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
+name: Pylint
+
+on: [push]
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        python-version: ["3.8", "3.9", "3.10"]
+    steps:
+    - uses: actions/checkout@v4
+    - name: Set up Python ${{ matrix.python-version }}
+      uses: actions/setup-python@v3
+      with:
+        python-version: ${{ matrix.python-version }}
+    - name: Install dependencies
+      run: |
+        python -m pip install --upgrade pip
+        pip install pylint
+    - name: Analysing the code with pylint
+      run: |
+        pylint $(git ls-files '*.py')

README.md

Lines changed: 31 additions & 65 deletions
@@ -1,81 +1,47 @@
 # Wikidata Textifier
 
-**Wikidata Textifier** is an API that transforms Wikidata items into compact format for use in LLMs and GenAI applications. It resolves missing labels of properties and claim values by querying the Wikidata Action API, making it efficient and suitable for AI pipelines.
+**Wikidata Textifier** is an API that transforms Wikidata entities into compact outputs for LLM and GenAI use cases.
+It resolves missing labels for properties and claim values using the Wikidata Action API and caches labels to reduce repeated lookups.
 
-🔗 Live API: [https://wd-textify.toolforge.org/](https://wd-textify.toolforge.org/)
+Live API: [wd-textify.wmcloud.org](https://wd-textify.wmcloud.org/)
+API Docs: [wd-textify.wmcloud.org/docs](https://wd-textify.wmcloud.org/docs)
 
----
+## Features
 
-## Functionalities
+- Textify Wikidata entities as `json`, `text`, or `triplet`.
+- Resolve labels for linked entities and properties.
+- Cache labels in MariaDB for faster repeated requests.
+- Support multilingual output with fallback language support.
+- Avoid SPARQL and use Wikidata Action API / EntityData endpoints.
 
-- **Textifies** any Wikidata item into a readable or JSON format suitable for LLMs.
-- **Resolves all labels**, including those missing when querying the Wikidata API.
-- **Caches labels** for 90 days to boost performance and reduce API load.
-- **Avoids SPARQL** and uses the Wikidata Action API for better efficiency and compatibility.
-- **Hosted on Toolforge**: [https://wd-textify.toolforge.org/](https://wd-textify.toolforge.org/)
+## Output Formats
 
----
+- `json`: Structured representation with claims (and optionally qualifiers/references).
+- `text`: Readable summary including label, description, aliases, and attributes.
+- `triplet`: Triplet-style lines with labels and IDs for graph-style traversal.
 
-## Formats
-
-- **Text**: A textual representation or summary of the Wikidata item, including its label, description, aliases, and claims. Useful for helping LLMs understand what the item represents.
-- **Triplet**: Outputs each triplet as a structured line, including labels and IDs, but omits descriptions and aliases. Ideal for agentic LLMs to traverse and explore Wikidata.
-- **JSON**: A structured and compact representation of the full item, suitable for custom formats.
-
----
-
-## API Usage
+## API
 
 ### `GET /`
 
-#### Query Parameters
-
-| Name | Type | Required | Description |
-|----------------|---------|----------|-----------------------------------------------------------------------------|
-| `id` | string | Yes | Wikidata item ID (e.g., `Q42`) |
-| `lang` | string | No | Language code for labels (default: `en`) |
-| `format` | string | No | The format of the response, either 'json', 'text', or 'triplet' (default: `json`) |
-| `external_ids` | bool | No | Whether to include external IDs in the output (default: `true`) |
-| `all_ranks` | bool | No | If false, returns ranked preferred statements, falling back to normal when unavailable (default: `false`) |
-| `references` | bool | No | Whether to include references (default: `false`) |
-| `fallback_lang` | string | No | Fallback language code if the preferred language is not available (default: `en`) |
-
----
-
-## Deploy to Toolforge
-
-1. Shell into the Toolforge system:
-
-```bash
-ssh [UNIX shell username]@login.toolforge.org
-```
-
-2. Switch to tool user account:
-
-```bash
-become wd-textify
-```
-
-3. Build from Git:
-
-```bash
-toolforge build start https://github.com/philippesaade-wmde/WikidataTextifier.git
-```
+#### Query parameters
 
-4. Start the web service:
+| Name | Type | Required | Description |
+|---|---|---|---|
+| `id` | string | Yes | Comma-separated Wikidata IDs (for example: `Q42` or `Q42,Q2`). |
+| `pid` | string | No | Comma-separated property IDs to filter claims (for example: `P31,P279`). |
+| `lang` | string | No | Preferred language code (default: `en`). |
+| `fallback_lang` | string | No | Fallback language code (default: `en`). |
+| `format` | string | No | Output format: `json`, `text`, or `triplet` (default: `json`). |
+| `external_ids` | bool | No | Include `external-id` datatype claims (default: `true`). |
+| `all_ranks` | bool | No | Include all statement ranks instead of preferred/normal filtering (default: `false`). |
+| `qualifiers` | bool | No | Include qualifiers in claim values (default: `true`). |
+| `references` | bool | No | Include references in claim values (default: `false`). |
 
-```bash
-webservice buildservice start --mount all
-```
-
-5. Debugging the web service:
-
-Read the logs:
-```bash
-webservice logs
-```
+#### Example requests
 
-Open the service shell:
 ```bash
-webservice shell
+curl "https://wd-textify.wmcloud.org/?id=Q42"
+curl "https://wd-textify.wmcloud.org/?id=Q42&format=text&lang=en"
+curl "https://wd-textify.wmcloud.org/?id=Q42,Q2&pid=P31,P279&format=triplet"
 ```
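
The query parameters documented in the updated README can also be assembled programmatically. The following is a minimal sketch using only the Python standard library. The helper name `build_textify_url` is hypothetical and not part of the repository, and it assumes the API accepts lowercase `true`/`false` strings for boolean parameters.

```python
from urllib.parse import urlencode

BASE_URL = "https://wd-textify.wmcloud.org/"  # Live API URL taken from the README


def build_textify_url(ids, pids=None, lang="en", fmt="json", **flags):
    """Build a GET URL for the textifier API (hypothetical helper)."""
    params = {"id": ",".join(ids), "lang": lang, "format": fmt}
    if pids:
        params["pid"] = ",".join(pids)
    # Assumption: boolean flags (all_ranks, references, ...) are sent as "true"/"false".
    for name, value in flags.items():
        params[name] = str(value).lower()
    # Keep commas unescaped so the result resembles the curl examples.
    return BASE_URL + "?" + urlencode(params, safe=",")


if __name__ == "__main__":
    print(build_textify_url(["Q42", "Q2"], pids=["P31", "P279"], fmt="triplet"))
```

The helper always sends `lang` and `format` explicitly, so the resulting URLs are slightly longer than the curl examples but request the same data.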

main.py

Lines changed: 26 additions & 19 deletions
@@ -1,14 +1,16 @@
-from fastapi import FastAPI, HTTPException, Query, Request
-from fastapi.middleware.cors import CORSMiddleware
-from fastapi import BackgroundTasks
+"""FastAPI application that exposes Wikidata textification endpoints."""
+
+import os
+import time
 import traceback
+
 import requests
-import time
-import os
+from fastapi import BackgroundTasks, FastAPI, HTTPException, Query, Request
+from fastapi.middleware.cors import CORSMiddleware
 
-from src.Normalizer import TTLNormalizer, JSONNormalizer
-from src.WikidataLabel import WikidataLabel, LazyLabelFactory
 from src import utils
+from src.Normalizer import JSONNormalizer, TTLNormalizer
+from src.WikidataLabel import LazyLabelFactory, WikidataLabel
 
 # Start Fastapi app
 app = FastAPI(
@@ -34,6 +36,7 @@
 
 @app.on_event("startup")
 async def startup():
+    """Initialize database resources required by the API."""
     WikidataLabel.initialize_database()
 
 @app.get(
@@ -71,22 +74,26 @@ async def get_textified_wd(
     qualifiers: bool = True,
     fallback_lang: str = 'en'
 ):
-    """
-    Retrieve a Wikidata item with all labels or textual representations for an LLM.
+    """Return normalized Wikidata entities in JSON, text, or triplet format.
 
     Args:
-        id (str): The Wikidata item ID (e.g., "Q42").
-        pid (str): Comma-separated list of property IDs to filter claims (e.g., "P31,P279").
-        format (str): The format of the response, either 'json', 'text', or 'triplet'.
-        lang (str): The language code for labels (default is 'en').
-        external_ids (bool): If True, includes external IDs in the response.
-        all_ranks (bool): If True, includes statements of all ranks (preferred, normal, deprecated).
-        references (bool): If True, includes references in the response. (only available in JSON format)
-        qualifiers (bool): If True, includes qualifiers in the response.
-        fallback_lang (str): The fallback language code if the preferred language is not available.
+        request (Request): Incoming request object (currently unused).
+        background_tasks (BackgroundTasks): Background task queue for periodic cache cleanup.
+        id (str): Comma-separated entity IDs (for example, ``"Q42,Q2"``).
+        pid (str): Optional comma-separated property IDs used to filter claims.
+        lang (str): Preferred language code for labels and formatted values.
+        format (str): Output format: ``"json"``, ``"text"``, or ``"triplet"``.
+        external_ids (bool): Whether to include claims with the ``external-id`` datatype.
+        references (bool): Whether to include references in claim values.
+        all_ranks (bool): Whether to include all statement ranks (preferred, normal, deprecated).
+        qualifiers (bool): Whether to include qualifiers in claim values.
+        fallback_lang (str): Fallback language when ``lang`` is unavailable.
 
     Returns:
-        list: A list of dictionaries containing QIDs and the similarity scores.
+        dict[str, object | None]: Mapping of requested QIDs to their normalized payloads.
+
+    Raises:
+        HTTPException: If an entity is not found, an upstream request fails, or internal processing fails.
     """
     try:
         filter_pids = []

pyproject.toml

Lines changed: 25 additions & 0 deletions
@@ -13,3 +13,28 @@ dependencies = [
     "sqlalchemy>=2.0.41",
     "uvicorn>=0.35.0",
 ]
+
+[dependency-groups]
+dev = [
+    "ruff>=0.9.0"
+]
+
+[tool.ruff]
+target-version = "py313"
+line-length = 120
+
+[tool.ruff.lint]
+select = [
+    "E",  # pycodestyle errors
+    "F",  # Pyflakes (catches undefined names, unused imports, etc.)
+    "I",  # isort (import sorting)
+    "D",  # pydocstyle (function/class documentation)
+]
+
+[tool.ruff.lint.pydocstyle]
+convention = "google"
+
+[tool.ruff.lint.isort]
+known-first-party = [
+    "wikidatasearch"
+]
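
As an illustration of what the `D` rule set with `convention = "google"` expects, here is a small self-contained module. The function `split_ids` is a hypothetical example and does not appear in the repository. It only demonstrates the docstring layout (module docstring, one-line summary, `Args:` and `Returns:` sections) used throughout this commit.

```python
"""Example module docstring, which the pydocstyle rules require for public modules."""


def split_ids(raw: str) -> list[str]:
    """Split a comma-separated ID string into a list of IDs.

    Args:
        raw (str): Comma-separated IDs, for example ``"Q42,Q2"``.

    Returns:
        list[str]: IDs with surrounding whitespace removed and empty items dropped.
    """
    return [part.strip() for part in raw.split(",") if part.strip()]


if __name__ == "__main__":
    print(split_ids("Q42, Q2"))
```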

src/Normalizer/JSONNormalizer.py

Lines changed: 27 additions & 3 deletions
@@ -1,9 +1,11 @@
+"""Normalize Wikidata Action API JSON into internal textifier objects."""
+
 from __future__ import annotations
 
 from typing import Any, Dict, List, Optional
+
 import requests
 
-from ..WikidataLabel import WikidataLabel, LazyLabelFactory
 from ..Textifier.WikidataTextifier import (
     WikidataClaim,
     WikidataClaimValue,
@@ -14,11 +16,11 @@
     WikidataTime,
 )
 from ..utils import wikidata_geolocation_to_text, wikidata_time_to_text
+from ..WikidataLabel import LazyLabelFactory, WikidataLabel
 
 
 class JSONNormalizer:
-    """Build WikidataEntity + claims tree from Wikidata JSON (wbgetentities style).
-    """
+    """Normalize ``wbgetentities`` JSON into internal textifier objects."""
 
     def __init__(
         self,
@@ -29,6 +31,16 @@ def __init__(
         label_factory: Optional[LazyLabelFactory] = None,
         debug: bool = False,
     ):
+        """Initialize a normalizer for a single entity payload.
+
+        Args:
+            entity_id (str): Entity ID being normalized.
+            entity_json (dict[str, Any]): Raw ``wbgetentities`` JSON for ``entity_id``.
+            lang (str): Preferred language for label selection.
+            fallback_lang (str): Fallback language when ``lang`` is unavailable.
+            label_factory (LazyLabelFactory | None): Shared lazy label factory for nested entities.
+            debug (bool): Whether to print additional debug output while parsing.
+        """
         self.entity_id = entity_id
         self.entity_json = entity_json
 
@@ -51,6 +63,18 @@ def normalize(
         qualifiers: bool = True,
         filter_pids: List[str] = [],
     ) -> WikidataEntity:
+        """Normalize the entity JSON payload into a ``WikidataEntity`` tree.
+
+        Args:
+            external_ids (bool): Whether to include ``external-id`` datatype claims.
+            references (bool): Whether to include references for each statement value.
+            all_ranks (bool): Whether to include statements of all ranks.
+            qualifiers (bool): Whether to include qualifiers for statement values.
+            filter_pids (list[str]): Optional allow-list of property IDs to keep.
+
+        Returns:
+            WikidataEntity: Parsed entity object with claims and values.
+        """
         e = self.entity_json
         if not isinstance(e, dict) or "labels" not in e:
             if self.debug:
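
The `all_ranks` flag documented above controls rank filtering. The body of `normalize` is not shown in this diff, so the following is only a sketch of the behaviour described in the README (preferred statements, falling back to normal when none are preferred). The function `select_statements` is hypothetical. It assumes each statement is a dict with a `rank` key set to `preferred`, `normal`, or `deprecated`, as in Wikidata's JSON output.

```python
def select_statements(statements, all_ranks=False):
    """Filter one property's statements by rank (illustrative sketch only)."""
    if all_ranks:
        # Keep everything, including deprecated statements.
        return list(statements)
    preferred = [s for s in statements if s.get("rank") == "preferred"]
    if preferred:
        return preferred
    # No preferred statement: fall back to normal rank and drop deprecated ones.
    return [s for s in statements if s.get("rank") == "normal"]


if __name__ == "__main__":
    example = [
        {"id": "s1", "rank": "normal"},
        {"id": "s2", "rank": "preferred"},
        {"id": "s3", "rank": "deprecated"},
    ]
    print([s["id"] for s in select_statements(example)])
```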

src/Normalizer/TTLNormalizer.py

Lines changed: 34 additions & 11 deletions
@@ -1,13 +1,14 @@
+"""Normalize Wikidata TTL into internal textifier objects."""
+
 from __future__ import annotations
 
 from collections import defaultdict
 from typing import Any, DefaultDict, Dict, List, Optional, Set
-import requests
 
+import requests
 from rdflib import Graph, Literal, Namespace, URIRef
 from rdflib.namespace import RDF, RDFS
 
-from ..WikidataLabel import WikidataLabel, LazyLabelFactory
 from ..Textifier.WikidataTextifier import (
     WikidataClaim,
     WikidataClaimValue,
@@ -18,9 +19,9 @@
     WikidataTime,
 )
 from ..utils import wikidata_geolocation_to_text, wikidata_time_to_text
+from ..WikidataLabel import LazyLabelFactory, WikidataLabel
 
-
-# Namespaces used by Wikidata TTL exports
+# Namespaces used by Wikidata TTL
 WD = Namespace("http://www.wikidata.org/entity/")
 P = Namespace("http://www.wikidata.org/prop/")
 PS = Namespace("http://www.wikidata.org/prop/statement/")
@@ -39,18 +40,18 @@
 
 
 class TTLNormalizer:
-    """Parse a Wikidata Special:EntityData TTL and build a WikidataEntity with claims.
+    """Normalize ``Special:EntityData`` TTL into internal textifier objects.
 
     Label resolution order:
-        1) labels present in TTL
-        2) LazyLabelFactory bulk lookup for the remainder
+        1) Labels present in TTL.
+        2) ``LazyLabelFactory`` bulk lookup for unresolved IDs.
 
     Notes:
-        - Claims are extracted from wd:<Q> p:<P> <statement-node> triples only.
-        - Statement nodes are validated structurally before value extraction.
-        - Special values (somevalue/novalue) are treated as "no main value" when
+        - Claims are extracted from ``wd:<Q> p:<P> <statement-node>`` triples only.
+        - Statement nodes are validated structurally before value extraction.
+        - Special values (somevalue/novalue) are treated as "no main value" when
           neither ps:<pid> nor psv:<pid> is present on the statement node.
-        - Property datatype is read from wikibase:propertyType when available,
+        - Property datatype is read from ``wikibase:propertyType`` when available,
           otherwise inferred from the statement's value nodes when possible.
     """
 
@@ -63,6 +64,16 @@ def __init__(
         label_factory: Optional[LazyLabelFactory] = None,
         debug: bool = False,
     ):
+        """Initialize a normalizer for a single TTL document.
+
+        Args:
+            entity_id (str): Entity ID being normalized.
+            ttl_text (str): Raw TTL document from ``Special:EntityData``.
+            lang (str): Preferred language for label selection.
+            fallback_lang (str): Fallback language when ``lang`` is unavailable.
+            label_factory (LazyLabelFactory | None): Shared lazy label factory for nested entities.
+            debug (bool): Whether to print additional debug output while parsing.
+        """
         self.entity_id = entity_id
         self.g = Graph()
         self.g.parse(data=ttl_text, format="turtle")
@@ -85,6 +96,18 @@ def normalize(
         qualifiers: bool = True,
         filter_pids: List[str] = []
     ) -> WikidataEntity:
+        """Normalize the parsed graph into a ``WikidataEntity`` tree.
+
+        Args:
+            external_ids (bool): Whether to include ``external-id`` datatype claims.
+            references (bool): Whether to include references for each statement value.
+            all_ranks (bool): Whether to include statements of all ranks.
+            qualifiers (bool): Whether to include qualifiers for statement values.
+            filter_pids (list[str]): Optional allow-list of property IDs to keep.
+
+        Returns:
+            WikidataEntity: Parsed entity object with claims and values.
+        """
         # Preload labels found inside TTL so LazyLabelFactory can avoid lookups.
         self.label_factory._resolved_labels = self._build_label_cache_from_ttl()
 

src/Normalizer/__init__.py

Lines changed: 5 additions & 1 deletion
@@ -1,2 +1,6 @@
+"""Public exports for normalizer classes."""
+
+from .JSONNormalizer import JSONNormalizer
 from .TTLNormalizer import TTLNormalizer
-from .JSONNormalizer import JSONNormalizer
+
+__all__ = ["JSONNormalizer", "TTLNormalizer"]
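
For readers unfamiliar with `__all__`, the sketch below shows its effect on star-imports using a throwaway in-memory module. The module name `demo_normalizer` and the extra names in it are made up for the demonstration, and the classes are empty stand-ins rather than the real normalizers.

```python
import sys
import types

# Source of a throwaway module shaped like the package __init__ above.
source = '''
class JSONNormalizer: ...
class TTLNormalizer: ...
helper_value = 1
__all__ = ["JSONNormalizer", "TTLNormalizer"]
'''

# Register the module so that a normal import statement can find it.
demo = types.ModuleType("demo_normalizer")
exec(source, demo.__dict__)
sys.modules["demo_normalizer"] = demo

# A star-import copies only the names listed in __all__.
namespace = {}
exec("from demo_normalizer import *", namespace)
exported = sorted(name for name in namespace if not name.startswith("__"))
print(exported)  # ['JSONNormalizer', 'TTLNormalizer']
```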
