Description
Hi! I'm Reinaldo Chaves, a data journalist from Brazil.
I'm building an open-source tool to extract structured information from Brazilian investment fund regulation PDFs using LangExtract + Gemini. These are standardized documents (CVM Resolution 175/2022) with ~100 pages / 230K characters each, containing 22 entity types (fund name, CNPJ, administrator, manager, fees, duration, risk factors, liquidation events, legal forum, etc.).
Project repo: https://github.com/reichaves/langextract-fundos
The goal is to enable investigative journalists to systematically analyze thousands of fund regulations for transparency and accountability purposes.
Problems encountered
1. JSON truncation with many entity types (ResolverParsingError: Unterminated string)
When extracting 22 entity types in a single lx.extract() call, the model generates JSON responses exceeding ~22K characters. This causes the output to be truncated mid-JSON, resulting in:
langextract.resolver.ResolverParsingError: Failed to parse JSON content:
Unterminated string starting at: line 1125 column 7 (char 22564)
Workaround attempted: I split the 22 entity types into 2-3 separate extraction groups (7-12 types each) and merge results. This reduces per-call JSON size but multiplies API calls (from ~16 to ~48).
Suggestion: Could LangExtract add a max_output_tokens parameter, or automatically split entity types into groups when the expected JSON output might exceed the model's output limit? A warning when many extraction classes are defined would also help.
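The grouping workaround can be sketched as follows. The helper and the group size are illustrative, not a LangExtract feature; each group would become one prompt, and the per-group extraction lists are concatenated afterwards:

```python
# Sketch of the workaround: split many entity types into smaller groups so
# each lx.extract() call produces a smaller JSON response and avoids
# truncation. split_entity_types is a hypothetical helper, not LangExtract API.

def split_entity_types(entity_types, max_per_group=8):
    """Partition entity types into groups of at most max_per_group."""
    return [entity_types[i:i + max_per_group]
            for i in range(0, len(entity_types), max_per_group)]

ALL_TYPES = [
    "nome_fundo", "cnpj_fundo", "tipo_fundo", "administrador", "gestor",
    "custodiante", "auditor", "taxa_administracao", "taxa_gestao",
    "taxa_performance", "taxa_custodia", "prazo_duracao",
    "regime_condominial", "publico_alvo", "classe_cotas", "ativo_alvo",
    "limite_concentracao", "fator_risco", "evento_avaliacao",
    "evento_liquidacao", "aplicacao_minima", "foro",
]

# 22 types -> 3 groups (8 + 8 + 6), i.e. 3x the API calls per chunk.
groups = split_entity_types(ALL_TYPES, max_per_group=8)
```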
2. No retry/backoff for rate limit errors (429 RESOURCE_EXHAUSTED)
Related: #240
With 48 API calls (3 groups × 16 chunks) on Gemini free tier (15 RPM), I consistently hit:
Parallel inference error: Gemini API error: 429 RESOURCE_EXHAUSTED
When this happens, LangExtract raises an exception and all progress is lost — including groups that already completed successfully.
Suggestion:
- Built-in exponential backoff for 429/503 errors (currently missing for Gemini provider)
- Save partial results when a later group fails, so completed work isn't lost
- A configurable max_rpm parameter to self-throttle API calls
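A minimal sketch of the retry behavior I have in mind, assuming the provider exception exposes a readable status code (which may differ in practice); nothing here is part of the LangExtract API:

```python
# Illustrative retry wrapper with exponential backoff and jitter.
# `status_code` on the exception is an assumption about the provider error.
import random
import time


def with_backoff(fn, max_retries=5, base_delay=2.0, retry_statuses=(429, 503)):
    """Call fn(), retrying transient provider errors with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception as exc:
            status = getattr(exc, "status_code", None)
            if status not in retry_statuses or attempt == max_retries:
                raise  # non-retryable error, or retries exhausted
            # Delay doubles each attempt, plus jitter to avoid bursts.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```

Each entity-type group could then be wrapped as `with_backoff(lambda: lx.extract(...))`, with its result saved to disk before the next group starts, so a late 429 does not discard completed groups.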
3. Performance challenges with large documents (230K chars)
Processing a 230K-character document requires significant pre-filtering to stay within reasonable API call counts. Without filtering, a single document needs ~75 API calls (with 3 groups), which takes 25+ minutes and exhausts the free-tier quota.
I implemented a custom text filter that identifies relevant sections by clause headers ("7. TAXA DE ADMINISTRAÇÃO", "26. EVENTOS DE LIQUIDAÇÃO", etc.) and reduces the text from 230K to ~50K characters. However, this domain-specific filtering shouldn't be necessary — ideally LangExtract could handle long documents more efficiently.
Suggestion: Consider implementing a relevance-aware chunking strategy that uses the prompt_description to prioritize sections likely to contain target entities, rather than processing the entire document uniformly.
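The clause-header pre-filter described above can be sketched roughly like this; the helper name and regex are illustrative only, and real CVM 175 regulations would need a more careful pattern:

```python
# Sketch of a domain-specific pre-filter: keep only the numbered clauses
# whose headers match the target entity types. Not part of LangExtract.
import re


def filter_relevant_sections(text, header_patterns):
    """Split on numbered clause headers (e.g. "7. TAXA DE ADMINISTRAÇÃO")
    and keep only the sections matching one of header_patterns."""
    sections = re.split(r"(?m)^(?=\d{1,2}\.\s+[A-ZÀ-Ü])", text)
    kept = [s for s in sections
            if any(re.search(p, s, re.IGNORECASE) for p in header_patterns)]
    return "\n".join(kept)
```

In my project this kind of filter reduces a 230K-character regulation to ~50K characters before extraction, but a relevance-aware chunking strategy inside LangExtract would make it unnecessary.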
Environment
- langextract 1.0.9
- Python 3.11
- Model: gemini-2.0-flash (via free-tier API key)
- OS: macOS
- Document language: Portuguese (Brazilian)
Minimal reproduction
import langextract as lx
import textwrap
# 22 entity types for a Brazilian fund regulation
prompt = textwrap.dedent("""
    Extract: nome_fundo, cnpj_fundo, tipo_fundo, administrador,
    gestor, custodiante, auditor, taxa_administracao, taxa_gestao,
    taxa_performance, taxa_custodia, prazo_duracao, regime_condominial,
    publico_alvo, classe_cotas, ativo_alvo, limite_concentracao,
    fator_risco, evento_avaliacao, evento_liquidacao,
    aplicacao_minima, foro
""")
# With a 230K char document, this will:
# 1. Generate JSON too large → truncation → ResolverParsingError
# 2. Make ~75 API calls → 429 errors on free tier
# 3. Lose all progress when error occurs
result = lx.extract(
    text_or_documents=large_pdf_text,  # 230K chars
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.0-flash",
    max_char_buffer=3000,
)
Summary
LangExtract is an excellent library and I'd love to use it more effectively for journalism applications. The main pain points for my use case (large regulatory documents + many entity types) are:
- JSON output truncation when many entity types generate large responses
- No retry logic for rate limit errors, losing completed work
- No way to prioritize relevant sections in very large documents
I'm happy to contribute test cases with Brazilian Portuguese documents if that would be helpful.
Thank you for building this tool!