This section describes how to set up a development environment for running tests and building documentation.
```bash
virtualenv env
source env/bin/activate
pip install -U pip setuptools
pip install -e .[opencv,tf,test,torch]
```

Additionally, you can use the Dockerized Linux workspace via the Makefile provided at `docker/Makefile`. The following will build the Docker image, start a running container with the petastorm source mounted into it from the host, and open a Bash shell into it (you must have GNU Make and Docker installed beforehand):
```bash
make build run shell
```

Within the Dockerized workspace, you can find the Python virtual environments at `/petastorm_venv2.7` and `/petastorm_venv3.6`, and the local `petastorm/` directory mounted at `/petastorm`. Remember to set the Python interpreter for pyspark after activating a virtual environment, for example:
```bash
export PYSPARK_PYTHON=`which python3`
```

Also, if you see `ImportError: libGL.so.1` when running `import cv2`, run `apt-get update; apt-get install ffmpeg libsm6 libxext6 -y` (reference: https://stackoverflow.com/questions/55313610/importerror-libgl-so-1-cannot-open-share).
To run unit tests:
```bash
pytest -v petastorm
```

NOTE: Java 1.8 must be installed for the tests to pass (it is a dependency of Spark).
pytest has multiple useful plugins. Consider installing the following:

```bash
pip install pytest-xdist pytest-repeat pytest-pycharm
```

which enable you to run tests in parallel (`-n` switch) and repeat tests multiple times (`--count` switch).
Some unit tests rely on mock data. Generating these datasets is not very fast, as it spins up a local Spark instance.
Use the `-Y` switch to cache these datasets. Be careful: dataset generation exercises Petastorm code, so
in some cases you will need to invalidate the cache for a test to take all code changes into account.
Use the `--cache-clear` switch to do so.
The petastorm project uses sphinx autodoc capabilities, along with free documentation hosting by ReadTheDocs.org (RTD), to serve up auto-generated API docs at http://petastorm.rtfd.io.
The RTD site is configured via webhooks to trigger sphinx doc builds on changes in the petastorm GitHub repo. The docs are configured to build identically locally and on RTD.
All the source files needed to generate the autodocs reside under docs/autodoc/.
To make documents locally:
```bash
pip install -e .[docs]
cd docs/autodoc
# To nuke all generated HTMLs
make clean
# Each run incrementally updates HTML based on file changes
make html
```

Once the HTML build process completes successfully, navigate your browser to
file:///tmp/autodocs/_build/html/index.html.
Some changes may require a build and deployment to see, including:

- Changes to `readthedocs.yml`
- Changes to `docs/autodoc/conf.py`
- A change that makes the RTD build differ from a local build
To see the above documentation changes:
- One needs to create a petastorm branch and push it
- Then configure RTD to activate a version for that branch
- A project maintainer will need to effect such version activation
- The status of a built version, as well as the resulting docs, can then be viewed
By default, RTD defines the latest version, which can be pointed at master
or another branch. Additionally, each release may have an associated RTD build
version, which must be explicitly activated in the
Versions settings page.
As with any source file, once a release is tagged, it is essentially immutable, so be sure that all the desired documentation changes are in place before tagging a release.
Note that `conf.py` defines `release` and `version` properties. For ease
of maintenance, we've set them to the same version string as defined in
`petastorm/__init__.py`.
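As an illustration of this single-source-of-version pattern, a `conf.py` can parse the version string out of the package's `__init__.py` rather than hard-coding it. The file contents and version value below are stand-ins, not the actual petastorm source:

```python
import re

# Stand-in for the contents of petastorm/__init__.py; the real file
# defines __version__ similarly, but the value here is an example.
init_contents = '__version__ = "0.9.8"\n'

# Pull the version string out with a regular expression, which lets a
# conf.py avoid importing the package (and its heavy dependencies).
match = re.search(r'__version__\s*=\s*[\'"]([^\'"]+)[\'"]', init_contents)
release = match.group(1)  # full version string, e.g. "0.9.8"
version = release         # Sphinx's shorter version; kept identical here
print(release)
```

Keeping `release` and `version` derived from one string means a release bump touches only `petastorm/__init__.py`.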
- Due to RTD's build resource limitations, we are unable to pip install any of the petastorm extra-required library packages.
- Since Sphinx must be able to load a Python module to read its docstrings, the doc page for any module that imports `cv2`, `tensorflow`, or `torch` will, unfortunately, fail to build.
- The alabaster Sphinx theme defaults to using `travis-ci.org` for the Travis CI build badge, whereas the petastorm project is served on `travis-ci.com`, so we don't currently have a working Travis CI build status.
Sphinx has the ability to auto-generate the entire API, either via the
autosummary extension, or the sphinx-apidoc tool.
The following sphinx-apidoc invocation will autogenerate an api/
subdirectory of rST files for each of the petastorm modules. Those files can
then be glob'd into a TOC tree.
```bash
cd docs/autodocs
sphinx-apidoc -fTo api ../.. ../../setup.py
```

The `apidoc_experiment` branch and its RTD output demonstrate the outcome of
vanilla usage. Actually leveraging this approach to produce uncluttered
auto-generated API docs will require:
- Code package reorganization
- Experimentation with sphinx settings, if available, to shorten link names
- A configuration change to auto-run `sphinx-apidoc` in the RTD build, as opposed to committing the `api/*.rst` files
- Make sure you are on the latest master in your local workspace (`git checkout master && git pull`).
- Update `__version__` in `petastorm/__init__.py` and commit.
- Update `docs/release-notes.rst`:
  - Delete `(unreleased)` from the release we are about to release.
  - Add any additional information if needed.
  - Add a kudos message to any new contributors who contributed to the release.
  - Create a future release entry and tag it with the `(unreleased)` string.
- Commit the changes.
- Tag as `vX.X.Xrc0` (`git tag vX.X.Xrc0`) and push both master and the tag (`git push origin master vX.X.Xrc0`). This will trigger a build and a PyPI release.
- Provide an opportunity for users to test the new release (Slack channel/Twitter). Create new release candidates as needed.
- Tag as `vX.X.X` (`git tag vX.X.X`) and push both master and the tag (`git push origin master vX.X.X`). This will trigger a build and a PyPI release.
- Once the build finishes, a new Python wheel will be pushed to the public PyPI server.
- Navigate to https://readthedocs.org/ --> "My Projects" --> "Builds" --> trigger a build of the `latest` documentation (it is not clear when RTD picks up new tags from GitHub, so you may see only outdated release versions there).
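The tag-and-push steps above can be sketched as follows; the version string is an example and the throwaway repository exists only so the commands can run anywhere (the real flow runs on petastorm's master branch):

```shell
# Create a throwaway repo purely to demonstrate the tagging convention
tmpdir=$(mktemp -d)
cd "$tmpdir"
git init -q .
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "stub commit"

# First a release candidate, then the final release (example version 0.9.8)
git tag v0.9.8rc0
git tag v0.9.8
git tag --list
```

In the real flow, each `git tag` is followed by `git push origin master <tag>`, which triggers the CI build and PyPI upload.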
These instructions were verified with pyspark 3.0.1.

1. Download the following files into some local directory:
   - https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar
   - https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.4/hadoop-aws-2.7.4.jar
   - https://repo1.maven.org/maven2/net/java/dev/jets3t/jets3t/0.9.4/jets3t-0.9.4.jar (we were not able to confirm the s3 protocol due to authentication issues)
2. Add or set the `CLASSPATH` environment variable to point to the directory containing these jars.
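For example, assuming the jars were downloaded into `~/spark-jars` (the directory name is an assumption; substitute whatever directory you used):

```shell
# Hypothetical download location for the three jars listed above
JARS_DIR="$HOME/spark-jars"
mkdir -p "$JARS_DIR"

# Point CLASSPATH at the jars (colon-separated, as on Linux/macOS)
export CLASSPATH="$JARS_DIR/aws-java-sdk-1.7.4.jar:$JARS_DIR/hadoop-aws-2.7.4.jar:$JARS_DIR/jets3t-0.9.4.jar"
echo "$CLASSPATH"
```

The variable must be set in the same shell session from which pyspark is launched, so that Spark's JVM picks up the jars.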