Do sanitization / anonymisation as a Spark job #232

@jnioche

Do this as a Spark job instead of the Python script: it would scale better on large datasets and could process data straight from S3.

Because it would live with the Java code, this class would know all the columns that SPRUCE needs. It could therefore filter by keeping only what is needed (plus some user-defined columns), instead of maintaining a long list of things to remove.

Because the class would know which columns are needed per provider, no additional sanitization work would be required when adding support for new providers.
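To make the allow-list idea concrete, here is a minimal sketch of the filtering step in plain Java. The class name `ColumnAllowList`, the method `columnsToKeep`, and the column names in the usage note are all hypothetical and are not taken from the SPRUCE code base. How SPRUCE would expose its per-provider list of required columns is also an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: decides which columns survive an allow-list filter. */
final class ColumnAllowList {

    private ColumnAllowList() {
    }

    /**
     * Returns the input columns to keep, in their original order.
     * A column is kept if it is required (assumed to come from the
     * provider-specific code) or user defined. Everything else is dropped
     * without having to be listed explicitly.
     */
    static List<String> columnsToKeep(List<String> available,
                                      Set<String> required,
                                      Set<String> userDefined) {
        Set<String> allowed = new LinkedHashSet<>(required);
        allowed.addAll(userDefined);

        List<String> kept = new ArrayList<>();
        for (String column : available) {
            if (allowed.contains(column)) {
                kept.add(column);
            }
        }
        return kept;
    }

    // In a Spark job, the result could be applied roughly like this
    // (assumes spark-sql on the classpath and a Dataset<Row> named df):
    //
    //   List<String> kept = columnsToKeep(List.of(df.columns()), required, userDefined);
    //   Dataset<Row> sanitized = df.select(
    //       kept.stream().map(functions::col).toArray(Column[]::new));
}
```

For example, with input columns `account_id`, `usage_amount`, `product_code` and `team_tag`, required columns `usage_amount` and `product_code`, and a user-defined column `team_tag`, the helper keeps everything except `account_id`. Only columns already present in the input are returned, so the later `select` would not fail on a required column that is missing from a given report.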
