
[BUG] RAPIDS binaries not cleaned up on remote workers #5616

@eyalhir74

Description


When remote machines are used as workers in a standalone setup, the RAPIDS jar file is downloaded to the $SPARK_HOME/work/app-xxx folder, and it is not deleted when the client disconnects. On a multi-GPU machine, where each executor directory gets its own copy of the jar, this adds up to a lot of wasted disk space.

Steps/Code to reproduce bug
Run a query with a remote worker configured, then check the $SPARK_HOME/work directory on that worker.
See bottom of bug report for how it looks on disk.

I would have expected the temporary binaries to be cleaned up when the client disconnects.
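
To put a number on the leftover space, here is a minimal Python sketch that sums the jar files left under each app-* directory. It assumes the `<work>/app-*/<executor id>/*.jar` layout shown in the listing below. The function name `leftover_jar_bytes` is made up for illustration and is not part of Spark or RAPIDS. In the listing below, six copies of the 378645171-byte jar come to roughly 2.3 GB after only two applications.

```python
import os
from collections import defaultdict
from pathlib import Path


def leftover_jar_bytes(work_dir):
    """Return {app directory name: total bytes of *.jar files} under work_dir.

    Assumes the layout <work_dir>/app-*/<executor id>/<files>.
    """
    totals = defaultdict(int)
    for jar in Path(work_dir).glob("app-*/*/*.jar"):
        if jar.is_file():
            # jar.parent is the executor dir; its parent is the app-* dir.
            totals[jar.parent.parent.name] += jar.stat().st_size
    return dict(totals)


if __name__ == "__main__":
    # Falls back to the current directory when SPARK_HOME is not set.
    spark_home = os.environ.get("SPARK_HOME", ".")
    usage = leftover_jar_bytes(Path(spark_home) / "work")
    for app, size in sorted(usage.items()):
        print(f"{app}: {size / 2**20:.1f} MiB in jars")
    print(f"total: {sum(usage.values()) / 2**20:.1f} MiB")
```

Run it on the worker machine after the client has disconnected. Any non-zero total is jar data that was left behind.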

`[eyal.h@kubegpu00139 work]$ ls -lR
.:
total 8
drwxrwxr-x. 5 eyal.h eyal.h 4096 May 24 18:50 app-20220524185015-0036
drwxrwxr-x. 5 eyal.h eyal.h 4096 May 24 18:53 app-20220524185313-0037

./app-20220524185015-0036:
total 12
drwxrwxr-x. 2 eyal.h eyal.h 4096 May 24 18:50 0
drwxrwxr-x. 2 eyal.h eyal.h 4096 May 24 18:50 1
drwxrwxr-x. 2 eyal.h eyal.h 4096 May 24 18:50 2

./app-20220524185015-0036/0:
total 370112
-rwxrwxr-x. 1 eyal.h eyal.h 378645171 May 24 18:50 rapids-4-spark_2.12-22.06.0-SNAPSHOT-cuda11.jar
-rw-rw-r--. 1 eyal.h eyal.h 82 May 24 18:50 resource-executor-16298842974096566969.json
-rw-rw-r--. 1 eyal.h eyal.h 338910 May 24 18:52 stderr
-rw-rw-r--. 1 eyal.h eyal.h 0 May 24 18:50 stdout

./app-20220524185015-0036/1:
total 370112
-rwxrwxr-x. 1 eyal.h eyal.h 378645171 May 24 18:50 rapids-4-spark_2.12-22.06.0-SNAPSHOT-cuda11.jar
-rw-rw-r--. 1 eyal.h eyal.h 82 May 24 18:50 resource-executor-15387637843995448273.json
-rw-rw-r--. 1 eyal.h eyal.h 338256 May 24 18:52 stderr
-rw-rw-r--. 1 eyal.h eyal.h 0 May 24 18:50 stdout

./app-20220524185015-0036/2:
total 370120
-rwxrwxr-x. 1 eyal.h eyal.h 378645171 May 24 18:50 rapids-4-spark_2.12-22.06.0-SNAPSHOT-cuda11.jar
-rw-rw-r--. 1 eyal.h eyal.h 82 May 24 18:50 resource-executor-13751777750447906128.json
-rw-rw-r--. 1 eyal.h eyal.h 345852 May 24 18:52 stderr
-rw-rw-r--. 1 eyal.h eyal.h 0 May 24 18:50 stdout

./app-20220524185313-0037:
total 12
drwxrwxr-x. 2 eyal.h eyal.h 4096 May 24 18:53 0
drwxrwxr-x. 2 eyal.h eyal.h 4096 May 24 18:53 1
drwxrwxr-x. 2 eyal.h eyal.h 4096 May 24 18:53 2

./app-20220524185313-0037/0:
total 370284
-rwxrwxr-x. 1 eyal.h eyal.h 378645171 May 24 18:53 rapids-4-spark_2.12-22.06.0-SNAPSHOT-cuda11.jar
-rw-rw-r--. 1 eyal.h eyal.h 82 May 24 18:53 resource-executor-3347874181209501324.json
-rw-rw-r--. 1 eyal.h eyal.h 512544 May 24 18:54 stderr
-rw-rw-r--. 1 eyal.h eyal.h 0 May 24 18:53 stdout

./app-20220524185313-0037/1:
total 370432
-rwxrwxr-x. 1 eyal.h eyal.h 378645171 May 24 18:53 rapids-4-spark_2.12-22.06.0-SNAPSHOT-cuda11.jar
-rw-rw-r--. 1 eyal.h eyal.h 82 May 24 18:53 resource-executor-13850073259340036214.json
-rw-rw-r--. 1 eyal.h eyal.h 665362 May 24 18:54 stderr
-rw-rw-r--. 1 eyal.h eyal.h 0 May 24 18:53 stdout

./app-20220524185313-0037/2:
total 370252
-rwxrwxr-x. 1 eyal.h eyal.h 378645171 May 24 18:53 rapids-4-spark_2.12-22.06.0-SNAPSHOT-cuda11.jar
-rw-rw-r--. 1 eyal.h eyal.h 82 May 24 18:53 resource-executor-1556116940110034438.json
-rw-rw-r--. 1 eyal.h eyal.h 479924 May 24 18:54 stderr
-rw-rw-r--. 1 eyal.h eyal.h 0 May 24 18:53 stdout
`


Labels: bug, invalid
