Challenges extracting structured data from large Brazilian fund regulation PDFs (230K+ chars, 22 entity types) #358

@reichaves

Description

Hi! I'm Reinaldo Chaves, a data journalist from Brazil.

I'm building an open-source tool to extract structured information from Brazilian investment fund regulation PDFs using LangExtract + Gemini. These are standardized documents (CVM Resolution 175/2022) with ~100 pages / 230K characters each, containing 22 entity types (fund name, CNPJ, administrator, manager, fees, duration, risk factors, liquidation events, legal forum, etc.).

Project repo: https://github.com/reichaves/langextract-fundos

The goal is to enable investigative journalists to systematically analyze thousands of fund regulations for transparency and accountability purposes.

Problems encountered

1. JSON truncation with many entity types (ResolverParsingError: Unterminated string)

Related: #127, #287

When extracting 22 entity types in a single lx.extract() call, the model generates JSON responses exceeding ~22K characters. This causes the output to be truncated mid-JSON, resulting in:

langextract.resolver.ResolverParsingError: Failed to parse JSON content: 
Unterminated string starting at: line 1125 column 7 (char 22564)

Workaround attempted: I split the 22 entity types into 2-3 separate extraction groups (7-12 types each) and merge results. This reduces per-call JSON size but multiplies API calls (from ~16 to ~48).
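For reference, the grouping step of that workaround looks roughly like this (a minimal sketch; `group_entities` and the merge step are my own helpers, not LangExtract API):

```python
# Sketch of the workaround: split the entity types into smaller groups so
# each lx.extract() call yields a JSON response small enough to avoid
# truncation. Each group gets its own prompt and call; the per-group
# extraction lists are then concatenated into one result set.

ENTITY_TYPES = [
    "nome_fundo", "cnpj_fundo", "tipo_fundo", "administrador",
    "gestor", "custodiante", "auditor", "taxa_administracao",
    "taxa_gestao", "taxa_performance", "taxa_custodia", "prazo_duracao",
    "regime_condominial", "publico_alvo", "classe_cotas", "ativo_alvo",
    "limite_concentracao", "fator_risco", "evento_avaliacao",
    "evento_liquidacao", "aplicacao_minima", "foro",
]  # 22 entity types from CVM Resolution 175/2022 regulations

def group_entities(types, group_size=8):
    """Split the entity-type list into groups of at most group_size."""
    return [types[i:i + group_size] for i in range(0, len(types), group_size)]

groups = group_entities(ENTITY_TYPES)  # 3 groups: 8 + 8 + 6 types
```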

Suggestion: Could LangExtract implement a max_output_tokens parameter, or automatically split entity types into groups when the expected JSON output might exceed the model's output limit? A warning when many extraction classes are defined would also help.

2. No retry/backoff for rate limit errors (429 RESOURCE_EXHAUSTED)

Related: #240

With 48 API calls (3 groups × 16 chunks) on Gemini free tier (15 RPM), I consistently hit:

Parallel inference error: Gemini API error: 429 RESOURCE_EXHAUSTED

When this happens, LangExtract raises an exception and all progress is lost — including groups that already completed successfully.

Suggestion:

  • Built-in exponential backoff for 429/503 errors (currently missing for the Gemini provider)
  • Save partial results when a later group fails, so completed work isn't lost
  • A configurable max_rpm parameter to self-throttle API calls
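As a stopgap I'm wrapping calls in a generic backoff helper along these lines (a sketch; `RateLimitError` is a hypothetical stand-in, since in practice the 429 must be caught from the Gemini provider's actual exception type):

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for the provider's 429/503 exception type."""

def with_backoff(fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on RateLimitError with exponential backoff.

    Delay doubles on each attempt, with a little jitter to avoid
    synchronized retries across parallel workers. The sleep function is
    injectable so the helper can be tested without real waiting.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries, propagate the error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            sleep(delay)
```

Having this built into the Gemini provider (plus partial-result saving) would remove the need for each caller to reinvent it.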

3. Performance challenges with large documents (230K chars)

Related: #178, #188

Processing a 230K-character document requires significant pre-filtering to stay within reasonable API call counts. Without filtering, a single document needs ~75 API calls (with 3 groups), which takes 25+ minutes and exhausts the free-tier quota.

I implemented a custom text filter that identifies relevant sections by clause headers ("7. TAXA DE ADMINISTRAÇÃO", "26. EVENTOS DE LIQUIDAÇÃO", etc.) and reduces the text from 230K to ~50K characters. However, this domain-specific filtering shouldn't be necessary — ideally LangExtract could handle long documents more efficiently.
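The filter itself is essentially a regex pass over the numbered clause headers; roughly (a simplified sketch with placeholder keywords, since real clause numbering and titles vary between funds):

```python
import re

# Keywords marking clauses likely to contain target entities
# (fees, liquidation events, legal forum, risk factors, ...).
RELEVANT_KEYWORDS = ["TAXA", "LIQUIDA", "FORO", "ADMINISTRA", "RISCO"]

def filter_relevant_sections(text, keywords=RELEVANT_KEYWORDS):
    """Keep only numbered clauses ("7. TAXA DE ADMINISTRAÇÃO", ...) whose
    header line contains one of the keywords; drop the rest."""
    # Split at lines that start a numbered, uppercase clause header.
    sections = re.split(r"(?m)^(?=\d+\.\s+[A-ZÀ-Ü])", text)
    kept = []
    for section in sections:
        if not section.strip():
            continue
        header = section.splitlines()[0].upper()
        if any(k in header for k in keywords):
            kept.append(section)
    return "\n".join(kept)
```

This cuts a 230K-character regulation down to the ~50K characters that actually mention the target entities, but it is tied to the clause conventions of one document type, which is exactly why built-in support would be preferable.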

Suggestion: Consider implementing a relevance-aware chunking strategy that uses the prompt_description to prioritize sections likely to contain target entities, rather than processing the entire document uniformly.

Environment

  • langextract 1.0.9
  • Python 3.11
  • Model: gemini-2.0-flash (via free-tier API key)
  • OS: macOS
  • Document language: Portuguese (Brazilian)

Minimal reproduction

import langextract as lx
import textwrap

# 22 entity types for a Brazilian fund regulation
prompt = textwrap.dedent("""
    Extract: nome_fundo, cnpj_fundo, tipo_fundo, administrador, 
    gestor, custodiante, auditor, taxa_administracao, taxa_gestao, 
    taxa_performance, taxa_custodia, prazo_duracao, regime_condominial, 
    publico_alvo, classe_cotas, ativo_alvo, limite_concentracao, 
    fator_risco, evento_avaliacao, evento_liquidacao, 
    aplicacao_minima, foro
""")

# With a 230K char document, this will:
# 1. Generate JSON too large → truncation → ResolverParsingError
# 2. Make ~75 API calls → 429 errors on free tier
# 3. Lose all progress when error occurs
# (large_pdf_text and examples are placeholders, defined elsewhere)
result = lx.extract(
    text_or_documents=large_pdf_text,  # 230K chars
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.0-flash",
    max_char_buffer=3000,
)

Summary

LangExtract is an excellent library and I'd love to use it more effectively for journalism applications. The main pain points for my use case (large regulatory documents + many entity types) are:

  1. JSON output truncation when many entity types generate large responses
  2. No retry logic for rate limit errors, losing completed work
  3. No way to prioritize relevant sections in very large documents

I'm happy to contribute test cases with Brazilian Portuguese documents if that would be helpful.

Thank you for building this tool!
