Fix issues with tensorflow tf generator support for Object Storage#288
Fix issues with tensorflow tf generator support for Object Storage#288annie-anna wants to merge 5 commits intoargonne-lcf:mainfrom
Conversation
|
@johnugeorge , could you take a look and see whether the changes look good to you? |
|
@hariharan-devarajan please review this PR also. |
| try: | ||
| if not use_pattern: | ||
| return tf.io.gfile.listdir(id) | ||
| # parse id to get bucket name and prefix |
There was a problem hiding this comment.
Can you explain the motivation in overriding the function?
There was a problem hiding this comment.
Can you explain the bug that you referred to?
There was a problem hiding this comment.
Because when I run tf.io.gfile.listdir, I encountered an IndexError, saying the position passed into s.substr() is greater than s.size().
I looked into tensorflow code, tf.io.gfile.listdir is defined by list_directory_v2(), which calls GetChildren(). GetChildren() is defined in tensorflow/io and will basically return all the objects in the bucket with the given prefix. This func tries to do s.substr(s.size()+1). There is a detailed code walk in tensorflow/io#2149 that explains the bug.
There was a problem hiding this comment.
@johnugeorge does this look good to you? Are you able to do a quick test?
There was a problem hiding this comment.
Won't this break the storage/framework abstraction? Right now, this PR will tie S3 storage with the tf framework.
There was a problem hiding this comment.
@johnugeorge yes, but this is just to fix the original implementation, right? The original implementation for tf framework as well. All the create_node, walk_node functions are part of the framework class.
For pytorch, we can implement the same storage functions under pytorch_framework class.
There was a problem hiding this comment.
@annie-anna , after discussing with Johnu, we cannot accept the change in the framework layer. All the change should be inside the storage layer.
There was a problem hiding this comment.
@zhenghh04 @johnugeorge ack, I reverted changes in walk_node.
setup.py
Outdated
| f"hydra-core>={HYDRA_VERSION}", | ||
| "nvidia-dali-cuda120>=1.34.0", | ||
| "tensorflow>=2.11.0", | ||
| "tensorflow==2.13.1", |
There was a problem hiding this comment.
Could we set tensorflow>=2.13.1?
There was a problem hiding this comment.
See comment below.
|
@annie-anna The github action jobs are failing because of the dependency issue. tensorflow==2.13.1 and torch>=2.2.0 have incompatible dependencies. Could you check whether changing to tensorflow>=2.13.1 (and also releasing the constrain for tensorflow_io) still works for the S3 storage? |
@zhenghh04 Yes, we can set tensorflow to be >= 2.13.1. As long as tensorflow_io is installed the compatible version, it will still work for S3 storage. |
|
@zhenghh04 @johnugeorge Is there any other question or concern regarding the PR? |
|
Thanks very much @annie-anna. |
|
@zhenghh04 Why did the build fail again? |
|
@annie-anna , I incorporated your changes in the most recent PR #294 . Could you try to pull the most recent changes from main? |
Found a few bugs when I run MLPerf storage on Object Storage:
s3://.makedirsinstead ofmkdirto create intermediate dirs.walk_nodeis usingtf.io.gfile.listdir, which has a bug that causes out-of-index for substr.list_object_v2.