Add new handled content_types for processable_file validator (CSV, XML, JSON, txt, md, epub, open office, zip, ...) (POTENTIALLY BREAKING) #405

@Mth0158


Expected behaviour

As part of the change described in #404, we would like to support more content_types for deep analysis. We currently handle:

  • image (vips/minimagick)
  • video (ffmpeg)
  • audio (ffmpeg)
  • pdf (poppler)

The gem's code is clearly designed to accommodate new content_types, so in theory we could add the following without too much work:

  • csv (ruby csv with binary detection, encoding check, and parsing of the first 1 or 2 lines) => text/csv, application/csv, application/vnd.ms-excel, text/tab-separated-values
  • json (ruby json)
  • ndjson (line by line ruby json, with a sanity cap)
  • xml (nokogiri XML)
  • yaml (ruby yaml with safe_load)
  • txt (internal check with binary detection)
  • xlsx (rubyzip (>= 3.1), which should reject zip bombs + nokogiri XML)
  • docx (rubyzip (>= 3.1), which should reject zip bombs + nokogiri XML)
  • zip (rubyzip (>= 3.1), which should reject zip bombs)
  • gzip (ruby zlib; should reject zip bombs)
  • epub (rubyzip (>= 3.1), which should reject zip bombs)
  • html, xhtml (nokogiri HTML)
  • svg (nokogiri XML with extra checks)
  • markdown (internal check with binary detection)
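
To make a few of these checks concrete, here is a rough stdlib-only sketch of three of them: binary detection for text-like files (txt, markdown, and the first step for csv), line-by-line ndjson parsing with a sanity cap, and a gzip check that stops inflating past a size limit. The module name, method names, and both cap values are hypothetical and are not part of the gem. The ndjson check assumes the cap means "only validate the first N lines".

```ruby
require "json"
require "stringio"
require "zlib"

# Hypothetical helpers for illustration only; none of these names exist in the gem.
module ProcessableSketch
  MAX_NDJSON_LINES   = 1_000             # assumed sanity cap
  MAX_INFLATED_BYTES = 10 * 1024 * 1024  # assumed cap on decompressed size

  module_function

  # Text-like content: reject NUL bytes, then require valid UTF-8.
  def text?(bytes)
    return false if bytes.b.include?("\x00".b)

    bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?
  end

  # ndjson: parse each non-blank line as JSON, checking at most max_lines lines.
  def ndjson?(io, max_lines: MAX_NDJSON_LINES)
    io.each_line.with_index do |line, index|
      break if index >= max_lines
      next if line.strip.empty?

      JSON.parse(line)
    end
    true
  rescue JSON::ParserError
    false
  end

  # gzip: inflate in chunks and bail out once the output exceeds max_bytes,
  # so a tiny archive cannot force a huge allocation.
  def gzip_within_limit?(io, max_bytes: MAX_INFLATED_BYTES)
    reader = Zlib::GzipReader.new(io)
    total = 0
    while (chunk = reader.read(16 * 1024))
      total += chunk.bytesize
      return false if total > max_bytes
    end
    true
  rescue Zlib::Error
    false
  ensure
    reader&.close
  end
end
```

For example, `ProcessableSketch.text?("hello\n")` returns `true` while `ProcessableSketch.text?("\x00\x01")` returns `false`. The zip-based formats (xlsx, docx, epub) and the nokogiri-based ones would need their respective gems and are left out of this sketch.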

To not forget

  • Update readme accordingly

Labels

enhancement (New feature or request), security (Security related issues)