When using remote machines as workers in a standalone setup, the RAPIDS jar file is downloaded into the `$SPARK_HOME/work/app-xxx` folder, and when the client disconnects, the jar is not deleted. On a machine with a multi-GPU setup this wastes a lot of disk space, since each executor directory holds its own ~380 MB copy of the jar.
**Steps/Code to reproduce bug**
Run a query against a cluster with a remote worker configured, then check the `$SPARK_HOME/work` directory on that worker. See the bottom of this report for how it looks on disk.

**Expected behavior**
I would have expected the temporary binaries to be cleaned up when the application disconnects.
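A minimal reproduction sketch (the master host, jar paths, and application class are placeholders, not taken from this report; the plugin class is the standard RAPIDS Accelerator entry point):

```shell
# Hypothetical repro against a standalone cluster with a remote worker.
# Replace master-host and the jar/app paths with your own values.
spark-submit \
  --master spark://master-host:7077 \
  --jars /path/to/rapids-4-spark_2.12-22.06.0-SNAPSHOT-cuda11.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --class com.example.SomeQueryJob \
  /path/to/app.jar

# After the application finishes and the client disconnects,
# inspect the worker's work directory:
ls -lR $SPARK_HOME/work
```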
```
[eyal.h@kubegpu00139 work]$ ls -lR
.:
total 8
drwxrwxr-x. 5 eyal.h eyal.h 4096 May 24 18:50 app-20220524185015-0036
drwxrwxr-x. 5 eyal.h eyal.h 4096 May 24 18:53 app-20220524185313-0037
./app-20220524185015-0036:
total 12
drwxrwxr-x. 2 eyal.h eyal.h 4096 May 24 18:50 0
drwxrwxr-x. 2 eyal.h eyal.h 4096 May 24 18:50 1
drwxrwxr-x. 2 eyal.h eyal.h 4096 May 24 18:50 2
./app-20220524185015-0036/0:
total 370112
-rwxrwxr-x. 1 eyal.h eyal.h 378645171 May 24 18:50 rapids-4-spark_2.12-22.06.0-SNAPSHOT-cuda11.jar
-rw-rw-r--. 1 eyal.h eyal.h 82 May 24 18:50 resource-executor-16298842974096566969.json
-rw-rw-r--. 1 eyal.h eyal.h 338910 May 24 18:52 stderr
-rw-rw-r--. 1 eyal.h eyal.h 0 May 24 18:50 stdout
./app-20220524185015-0036/1:
total 370112
-rwxrwxr-x. 1 eyal.h eyal.h 378645171 May 24 18:50 rapids-4-spark_2.12-22.06.0-SNAPSHOT-cuda11.jar
-rw-rw-r--. 1 eyal.h eyal.h 82 May 24 18:50 resource-executor-15387637843995448273.json
-rw-rw-r--. 1 eyal.h eyal.h 338256 May 24 18:52 stderr
-rw-rw-r--. 1 eyal.h eyal.h 0 May 24 18:50 stdout
./app-20220524185015-0036/2:
total 370120
-rwxrwxr-x. 1 eyal.h eyal.h 378645171 May 24 18:50 rapids-4-spark_2.12-22.06.0-SNAPSHOT-cuda11.jar
-rw-rw-r--. 1 eyal.h eyal.h 82 May 24 18:50 resource-executor-13751777750447906128.json
-rw-rw-r--. 1 eyal.h eyal.h 345852 May 24 18:52 stderr
-rw-rw-r--. 1 eyal.h eyal.h 0 May 24 18:50 stdout
./app-20220524185313-0037:
total 12
drwxrwxr-x. 2 eyal.h eyal.h 4096 May 24 18:53 0
drwxrwxr-x. 2 eyal.h eyal.h 4096 May 24 18:53 1
drwxrwxr-x. 2 eyal.h eyal.h 4096 May 24 18:53 2
./app-20220524185313-0037/0:
total 370284
-rwxrwxr-x. 1 eyal.h eyal.h 378645171 May 24 18:53 rapids-4-spark_2.12-22.06.0-SNAPSHOT-cuda11.jar
-rw-rw-r--. 1 eyal.h eyal.h 82 May 24 18:53 resource-executor-3347874181209501324.json
-rw-rw-r--. 1 eyal.h eyal.h 512544 May 24 18:54 stderr
-rw-rw-r--. 1 eyal.h eyal.h 0 May 24 18:53 stdout
./app-20220524185313-0037/1:
total 370432
-rwxrwxr-x. 1 eyal.h eyal.h 378645171 May 24 18:53 rapids-4-spark_2.12-22.06.0-SNAPSHOT-cuda11.jar
-rw-rw-r--. 1 eyal.h eyal.h 82 May 24 18:53 resource-executor-13850073259340036214.json
-rw-rw-r--. 1 eyal.h eyal.h 665362 May 24 18:54 stderr
-rw-rw-r--. 1 eyal.h eyal.h 0 May 24 18:53 stdout
./app-20220524185313-0037/2:
total 370252
-rwxrwxr-x. 1 eyal.h eyal.h 378645171 May 24 18:53 rapids-4-spark_2.12-22.06.0-SNAPSHOT-cuda11.jar
-rw-rw-r--. 1 eyal.h eyal.h 82 May 24 18:53 resource-executor-1556116940110034438.json
-rw-rw-r--. 1 eyal.h eyal.h 479924 May 24 18:54 stderr
-rw-rw-r--. 1 eyal.h eyal.h 0 May 24 18:53 stdout
```
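As a possible mitigation in the meantime, the standalone worker's own periodic cleanup can be enabled via the standard `spark.worker.cleanup.*` properties. Note this purges directories of stopped applications on a timer rather than immediately on disconnect; the values below are the documented defaults and are shown only as a sketch:

```shell
# Add to conf/spark-env.sh on each worker, then restart the worker.
export SPARK_WORKER_OPTS="\
 -Dspark.worker.cleanup.enabled=true \
 -Dspark.worker.cleanup.interval=1800 \
 -Dspark.worker.cleanup.appDataTtl=604800"
```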