DLP Inspect initial commit by YadnikiPawar · Pull Request #48 · GoogleCloudPlatform/dlp-dataflow-deidentification

YadnikiPawar · 2020-05-22T18:06:22Z

@santhh
Please review the code changes. I have pulled and then modified the code for better understanding of the changes.

santhh · 2020-05-22T20:02:59Z

src/main/java/com/google/swarm/tokenization/DLPS3ScannerPipeline.java

-        p.apply(
-            "File Read Transforrm",
-            FileReaderTransform.newBuilder().setSubscriber(options.getSubscriber()).build());
+    PCollection<KV<String, String>> nonInspectedContents =


Can we please return a tuple? Anywhere you have a try/catch, you should use multi-output with an error tag. I think there are quite a few places where you will need this change. Without multi-output, the pipeline will fail to recover.

Okay, sure. I will change this code block to return a tuple and will take care of such try-catch blocks.
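
A minimal plain-Java sketch of the pattern requested above: a failing element is routed to a separate error output instead of letting the exception propagate. It deliberately does not use the Beam API. The class `MultiOutputSketch` and its `main` and `errors` lists are hypothetical stand-ins for a multi-output transform whose outputs would be tagged with separate `TupleTag`s in the real pipeline.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for a multi-output step: successful results go to
// `main`, failures go to `errors` instead of propagating as exceptions.
final class MultiOutputSketch {
  final List<String> main = new ArrayList<>();
  final List<String> errors = new ArrayList<>();

  // Applies `fn` to each element. A thrown exception is recorded on the
  // error output together with the offending element, and processing continues.
  static MultiOutputSketch process(List<String> inputs, Function<String, String> fn) {
    MultiOutputSketch out = new MultiOutputSketch();
    for (String element : inputs) {
      try {
        out.main.add(fn.apply(element));
      } catch (RuntimeException e) {
        out.errors.add(element + ": " + e);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // "x" cannot be parsed, so it lands on the error output.
    MultiOutputSketch out =
        process(Arrays.asList("1", "x", "3"), s -> String.valueOf(Integer.parseInt(s) * 2));
    System.out.println("main output:  " + out.main);
    System.out.println("error output: " + out.errors);
  }
}
```

The point of the design is that one bad element costs one error record rather than the whole bundle; downstream steps only ever see the main output.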

santhh · 2020-05-22T20:09:47Z

src/main/java/com/google/swarm/tokenization/common/AuditInspectDataTransform.java

                  Util.INSPECTED)
              .build();
-      LOG.info("Audit Row {}", aggrRow.toString());
+      LOG.info("FileTrackerTransform:MergePartialStatsRow: Audit Row {}", aggrRow.toString());


I think this is AuditInspectDataTransform

Yes, the class name is AuditInspectDataTransform. But for logging I have kept the transform names that are displayed in the UI DAG. So here the name of this transform is FileTrackerTransform and the sub-step is MergePartialStatsRow. I can take out MergePartialStatsRow if it doesn't seem right in a log message.

ok makes sense.

santhh · 2020-05-22T20:13:14Z

src/main/java/com/google/swarm/tokenization/common/DLPTransform.java

            String errorMessage =
                String.format(
-                    "Payload Size %s Exceeded Batch Size %s",
+                    "DLPTransform:DLPInspect: Payload Size %s Exceeded Batch Size %s",


Optional: change the log.error to warn.

Yes got that. Will run this through the team and decide accordingly.

santhh · 2020-05-22T20:20:14Z

src/main/java/com/google/swarm/tokenization/common/DLPTransform.java

+                  Row.withSchema(Util.bqAuditSchema)
+                          .addValues(fileName, Util.getTimeStamp(),0L, "EMPTY")
+                          .build());
+        }


For processElement, I think the error is thrown and caught in the class after DLP is called. Can we have a tuple like this? This should make sure an internal error does not crash the pipeline.

dlp-dataflow-deidentification/src/main/java/com/google/swarm/tokenization/S3Import.java

Line 340 in 2c9bbd1

c.output(apiResponseFailedElements, e.toString());

Yes I got it. But just for clarification the try block is already handling the exception, right?
We are just adding the catch block to be sure.

Also, this class is already returning a tuple. I will add this additional TupleTag for the Errors.
Is there a need to log these separately in our main DLPS3ScannerPipeline.java file, or can we leave it?

Also, should we handle this initialization error as follows: https://github.com/GoogleCloudPlatform/dlp-dataflow-deidentification/blob/2c9bbd178ccaaa4422f89ba49fd0e8b4a0e4f26b/src/main/java/com/google/swarm/tokenization/S3Import.java#L333

You can leave it. Ideally you will flatten all the errors and write them back somewhere. processElement throws the error, and the catch block is there to output it without crashing the pipeline.
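
A rough plain-Java sketch of the "flatten all the errors" idea, assuming each step exposes its failures as a list of strings. It is not Beam code. `ErrorFlattenSketch`, `flatten`, and the sample error strings are hypothetical; in the pipeline the per-step lists would be the error outputs of the individual transforms, merged before a single write or log step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration: merge the error outputs of several steps into
// one collection so they can be logged or written in a single place.
final class ErrorFlattenSketch {

  // Concatenates the per-step error lists, preserving their order.
  @SafeVarargs
  static List<String> flatten(List<String>... perStepErrors) {
    List<String> all = new ArrayList<>();
    for (List<String> errors : perStepErrors) {
      all.addAll(errors);
    }
    return all;
  }

  public static void main(String[] args) {
    List<String> readErrors = Arrays.asList("read: file-a could not be opened");
    List<String> auditErrors = Collections.<String>emptyList();
    List<String> dlpErrors = Arrays.asList("dlp: payload too large", "dlp: request failed");

    // One place to report every failure, regardless of which step produced it.
    for (String error : flatten(readErrors, auditErrors, dlpErrors)) {
      System.err.println(error);
    }
  }
}
```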

santhh · 2020-05-22T20:22:03Z

src/main/java/com/google/swarm/tokenization/common/FileReaderSplitDoFn.java

      }
    } catch (Exception e) {
-      c.output(Util.readRowFailure, KV.of(fileName, e.getMessage()));
+      LOG.error("File Read Transform:ReadFile: Error processing the file "+ fileName +" - " + Arrays.toString(e.getStackTrace()));


This is where the multi-output should be added (see the first comment I added). So this class should output a tuple.

So should we keep the LOG as a warning here, or just take out the log and keep only the output statement?
Or log these separately in our main DLPS3ScannerPipeline.java file?

I think we should log everything in the main class after the flatten.

santhh · 2020-05-22T20:30:31Z

src/main/java/com/google/swarm/tokenization/common/FileReaderTransform.java

+                    LOG.info("File Read Transform:ConvertToGCSUri: Valid File Located: {}", file_name);
+                    c.output(file_name);
+                }
+        }


Feel like this logic can be simplified. (Optional)

santhh · 2020-05-22T20:31:53Z

src/main/java/com/google/swarm/tokenization/common/FileReaderTransform.java

+                            Instant file_ts = Instant.parse(file_ts_string);
+                            Instant tf_ts = new Instant(metadata.lastModifiedMillis());
+                            LOG.warn(file_ts.toString());
+                            LOG.warn(tf_ts.toString());


Should have one statement and a meaningful error message. This is not helpful. I feel you have too many logs here. Let's try to summarize and output what will be useful.

Yes, sure. Some parts of the code were added by Robert L. I will look into those parts and take out the unnecessary logs.
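
As a sketch of what "one statement and a meaningful message" could look like, the two bare timestamp lines can be folded into a single descriptive message. This is only an illustration: it uses `java.time.Instant` rather than the `Instant` type in the diff, and the class name, the `describe` helper, the step prefix, and the wording of the message are all assumptions, not the project's code.

```java
import java.time.Instant;

// Hypothetical helper: one log line that names the step, the file, and what
// each timestamp means, instead of two unlabeled timestamp lines.
final class TimestampLogSketch {

  static String describe(String step, String fileName, Instant fileTs, Instant lastModified) {
    return String.format(
        "%s: file %s has timestamp %s, last modified %s",
        step, fileName, fileTs, lastModified);
  }

  public static void main(String[] args) {
    // The step name and file name below are placeholders.
    System.out.println(
        describe(
            "File Read Transform:ExampleStep",
            "example.csv",
            Instant.parse("2020-05-22T18:06:22Z"),
            Instant.ofEpochMilli(0L)));
  }
}
```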

realjordanna · 2020-06-09T06:14:54Z

src/main/java/com/google/swarm/tokenization/common/Util.java

  public static TupleTag<Row> inspectData = new TupleTag<Row>() {};
  public static TupleTag<Row> auditData = new TupleTag<Row>() {};
  public static TupleTag<Row> errorData = new TupleTag<Row>() {};
+  public static TupleTag<KV<String, String>> readRowSuccess = new TupleTag<KV<String, String>>() {};


Needs documentation explaining what the key/value pairs are whenever using such generic structs.

santhh and others added 2 commits May 14, 2020 19:16

error handling

c885428

DLP Inspect initial commit

d35edec

googlebot added the cla: yes label May 22, 2020