Challenges extracting structured data from large Brazilian fund regulation PDFs (230K+ chars, 22 entity types) #358

@reichaves

Description

Hi! I'm Reinaldo Chaves, a data journalist from Brazil.

I'm building an open-source tool to extract structured information from Brazilian investment fund regulation PDFs using LangExtract + Gemini. These are standardized documents (CVM Resolution 175/2022) with ~100 pages / 230K characters each, containing 22 entity types (fund name, CNPJ, administrator, manager, fees, duration, risk factors, liquidation events, legal forum, etc.).

Project repo: https://github.com/reichaves/langextract-fundos

The goal is to enable investigative journalists to systematically analyze thousands of fund regulations for transparency and accountability purposes.

Problems encountered

1. JSON truncation with many entity types (ResolverParsingError: Unterminated string)

Related: #127, #287

When extracting 22 entity types in a single lx.extract() call, the model generates JSON responses exceeding ~22K characters. This causes the output to be truncated mid-JSON, resulting in:

langextract.resolver.ResolverParsingError: Failed to parse JSON content: 
Unterminated string starting at: line 1125 column 7 (char 22564)

Workaround attempted: I split the 22 entity types into 2-3 separate extraction groups (7-12 types each) and merge results. This reduces per-call JSON size but multiplies API calls (from ~16 to ~48).
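For reference, the grouping step of that workaround looks roughly like this (a minimal sketch; `group_entities` and the merge step are my own helpers, not LangExtract API):

```python
# Sketch of the workaround: split the entity types into smaller groups so
# each lx.extract() call yields a JSON response small enough to avoid
# truncation. Each group gets its own prompt and call; the per-group
# extraction lists are then concatenated into one result set.

ENTITY_TYPES = [
    "nome_fundo", "cnpj_fundo", "tipo_fundo", "administrador",
    "gestor", "custodiante", "auditor", "taxa_administracao",
    "taxa_gestao", "taxa_performance", "taxa_custodia", "prazo_duracao",
    "regime_condominial", "publico_alvo", "classe_cotas", "ativo_alvo",
    "limite_concentracao", "fator_risco", "evento_avaliacao",
    "evento_liquidacao", "aplicacao_minima", "foro",
]  # 22 entity types from CVM Resolution 175/2022 regulations

def group_entities(types, group_size=8):
    """Split the entity-type list into groups of at most group_size."""
    return [types[i:i + group_size] for i in range(0, len(types), group_size)]

groups = group_entities(ENTITY_TYPES)  # 3 groups: 8 + 8 + 6 types
```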

Suggestion: Could LangExtract implement a max_output_tokens parameter, or automatically split entity types into groups when the expected JSON output might exceed the model's output limit? A warning when many extraction classes are defined would also help.

2. No retry/backoff for rate limit errors (429 RESOURCE_EXHAUSTED)

Related: #240

With 48 API calls (3 groups × 16 chunks) on Gemini free tier (15 RPM), I consistently hit:

Parallel inference error: Gemini API error: 429 RESOURCE_EXHAUSTED

When this happens, LangExtract raises an exception and all progress is lost — including groups that already completed successfully.

Suggestion:

  • Built-in exponential backoff for 429/503 errors (currently missing for the Gemini provider)
  • Save partial results when a later group fails, so completed work isn't lost
  • A configurable max_rpm parameter to self-throttle API calls
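As a stopgap I'm wrapping calls in a generic backoff helper along these lines (a sketch; `RateLimitError` is a hypothetical stand-in, since in practice the 429 must be caught from the Gemini provider's actual exception type):

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for the provider's 429/503 exception type."""

def with_backoff(fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on RateLimitError with exponential backoff.

    Delay doubles on each attempt, with a little jitter to avoid
    synchronized retries across parallel workers. The sleep function is
    injectable so the helper can be tested without real waiting.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries, propagate the error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            sleep(delay)
```

Having this built into the Gemini provider (plus partial-result saving) would remove the need for each caller to reinvent it.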

3. Performance challenges with large documents (230K chars)

Related: #178, #188

Processing a 230K-character document requires significant pre-filtering to stay within reasonable API call counts. Without filtering, a single document needs ~75 API calls (with 3 groups), which takes 25+ minutes and exhausts the free-tier quota.

I implemented a custom text filter that identifies relevant sections by clause headers ("7. TAXA DE ADMINISTRAÇÃO", "26. EVENTOS DE LIQUIDAÇÃO", etc.) and reduces the text from 230K to ~50K characters. However, this domain-specific filtering shouldn't be necessary — ideally LangExtract could handle long documents more efficiently.
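The filter itself is essentially a regex pass over the numbered clause headers; roughly (a simplified sketch with placeholder keywords, since real clause numbering and titles vary between funds):

```python
import re

# Keywords marking clauses likely to contain target entities
# (fees, liquidation events, legal forum, risk factors, ...).
RELEVANT_KEYWORDS = ["TAXA", "LIQUIDA", "FORO", "ADMINISTRA", "RISCO"]

def filter_relevant_sections(text, keywords=RELEVANT_KEYWORDS):
    """Keep only numbered clauses ("7. TAXA DE ADMINISTRAÇÃO", ...) whose
    header line contains one of the keywords; drop the rest."""
    # Split at lines that start a numbered, uppercase clause header.
    sections = re.split(r"(?m)^(?=\d+\.\s+[A-ZÀ-Ü])", text)
    kept = []
    for section in sections:
        if not section.strip():
            continue
        header = section.splitlines()[0].upper()
        if any(k in header for k in keywords):
            kept.append(section)
    return "\n".join(kept)
```

This cuts a 230K-character regulation down to the ~50K characters that actually mention the target entities, but it is tied to the clause conventions of one document type, which is exactly why built-in support would be preferable.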

Suggestion: Consider implementing a relevance-aware chunking strategy that uses the prompt_description to prioritize sections likely to contain target entities, rather than processing the entire document uniformly.

Environment

  • langextract 1.0.9
  • Python 3.11
  • Model: gemini-2.0-flash (via free-tier API key)
  • OS: macOS
  • Document language: Portuguese (Brazilian)

Minimal reproduction

import langextract as lx
import textwrap

# 22 entity types for a Brazilian fund regulation
prompt = textwrap.dedent("""
    Extract: nome_fundo, cnpj_fundo, tipo_fundo, administrador, 
    gestor, custodiante, auditor, taxa_administracao, taxa_gestao, 
    taxa_performance, taxa_custodia, prazo_duracao, regime_condominial, 
    publico_alvo, classe_cotas, ativo_alvo, limite_concentracao, 
    fator_risco, evento_avaliacao, evento_liquidacao, 
    aplicacao_minima, foro
""")

# With a 230K char document, this will:
# 1. Generate JSON too large → truncation → ResolverParsingError
# 2. Make ~75 API calls → 429 errors on free tier
# 3. Lose all progress when error occurs
# (large_pdf_text and examples are placeholders, defined elsewhere)
result = lx.extract(
    text_or_documents=large_pdf_text,  # 230K chars
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.0-flash",
    max_char_buffer=3000,
)

Summary

LangExtract is an excellent library and I'd love to use it more effectively for journalism applications. The main pain points for my use case (large regulatory documents + many entity types) are:

  1. JSON output truncation when many entity types generate large responses
  2. No retry logic for rate limit errors, losing completed work
  3. No way to prioritize relevant sections in very large documents

I'm happy to contribute test cases with Brazilian Portuguese documents if that would be helpful.

Thank you for building this tool!
