Do sanitization / anonymisation as a Spark job

Run the sanitization as a Spark job instead of the Python script. This would scale better on large datasets and could process the data straight from S3.

Because it would live with the Java code, this class would know all the columns that SPRUCE needs. It could therefore filter by keeping only what is needed (plus some user-defined columns), instead of maintaining a long list of columns to remove.

Since the class would know the columns needed per provider, no additional sanitization work would be required when adding support for new providers.
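
A minimal sketch of the keep-list idea, in plain Java with no Spark dependency so it can be run on its own. The class name `ColumnAllowlist`, the provider names, and the column names are all hypothetical and only illustrate the shape of the logic; the real list of required columns would come from the existing SPRUCE code.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper: decides which input columns survive sanitization.
final class ColumnAllowlist {

    // Assumption: each provider declares the columns it needs.
    // Provider and column names below are placeholders, not SPRUCE's actual ones.
    static final Map<String, Set<String>> REQUIRED_BY_PROVIDER = Map.of(
            "provider_a", Set.of("usage_amount", "product_code"),
            "provider_b", Set.of("quantity", "service_name"));

    // Returns the input columns that are either required by some provider
    // or explicitly requested by the user, preserving the input order.
    static List<String> columnsToKeep(List<String> inputColumns, Set<String> userColumns) {
        Set<String> allowed = new LinkedHashSet<>(userColumns);
        REQUIRED_BY_PROVIDER.values().forEach(allowed::addAll);
        return inputColumns.stream()
                .filter(allowed::contains)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> input = List.of("account_id", "usage_amount", "product_code", "tags");
        // "account_id" is dropped because no provider or user asked for it.
        System.out.println(columnsToKeep(input, Set.of("tags")));
    }
}
```

In a Spark job, the resulting list could presumably be passed to `Dataset.select` to project the input down to the allowed columns before writing the sanitized output. A new provider would then only need an entry in the required-columns map, with no change to the filtering logic.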

Do sanitization / anonymisation as a Spark job #232
