instead of the Python script - this would scale better on large datasets and could process the data straight from S3.
Because it would live alongside the Java code, this class would know all the columns that SPRUCE needs. It could therefore filter by keeping only what is needed (plus some user-defined columns), instead of maintaining a long list of columns to remove.
Because the class would know which columns each provider needs, no additional work would be required to add support for new providers.
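
To make the keep-list idea concrete, here is a minimal sketch in plain Java. It is only an illustration: the class name `ColumnAllowList`, the provider keys, and the column names are all hypothetical placeholders, not SPRUCE's actual columns or API. It assumes the required columns can be expressed as a set per provider, to which the user-defined columns are added.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Sketch of keep-list column filtering; every name here is hypothetical. */
final class ColumnAllowList {

    // Placeholder per-provider requirements. In practice these would be
    // derived from the Java code that already reads the columns.
    private static final Map<String, Set<String>> REQUIRED_BY_PROVIDER = Map.of(
            "provider_a", Set.of("usage_amount", "region", "service_code"),
            "provider_b", Set.of("quantity", "location"));

    /**
     * Returns the input columns to keep: those required for the provider
     * plus the user-defined ones, in their original input order.
     */
    static List<String> columnsToKeep(String provider,
                                      List<String> inputColumns,
                                      Set<String> userColumns) {
        Set<String> required = REQUIRED_BY_PROVIDER.get(provider);
        if (required == null) {
            throw new IllegalArgumentException("Unknown provider: " + provider);
        }
        // Allowed = required columns + user-defined columns.
        Set<String> allowed = new LinkedHashSet<>(required);
        allowed.addAll(userColumns);

        List<String> kept = new ArrayList<>();
        for (String column : inputColumns) {
            if (allowed.contains(column)) {
                kept.add(column);
            }
        }
        return kept;
    }
}
```

With this shape, any column not explicitly allowed is dropped by default, so new or unexpected columns in the input never need to be added to a removal list. The returned list could then drive the column selection in whatever engine reads the data.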