Thanks for your interest in contributing to Pathling.
You can find out a bit more about Pathling by reading the README file within this repository.
Issues can be used to:
- Report a defect
- Request a new feature or enhancement
- Ask a question
New issues are automatically populated with a template that highlights the information required in a defect report. If your issue is not a defect, delete the template and replace it with a detailed description of the problem you are trying to solve.
Please communicate with us (preferably through creation of an issue) before embarking on any significant work within a pull request. This will prevent situations where people are working at cross-purposes.
Your branch should be named issue/[GitHub issue #].
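For example, work on a hypothetical issue number 1234 would happen on a branch named issue/1234. The Python sketch below checks a branch name against that convention. It assumes the issue number consists of digits only, and the names ISSUE_BRANCH and is_issue_branch are illustrative rather than part of any Pathling tooling.

```python
import re

# Assumed pattern: the literal prefix "issue/" followed by the issue number.
ISSUE_BRANCH = re.compile(r"issue/\d+")


def is_issue_branch(name: str) -> bool:
    """Return True if the branch name follows the issue/[number] convention."""
    return ISSUE_BRANCH.fullmatch(name) is not None


print(is_issue_branch("issue/1234"))
print(is_issue_branch("fix-typo"))
```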
You will need the following software to build the solution:
- Java 17
- Maven 3.9+
- Python 3.8+
- R 3.5+
To build and install locally, run:
mvn clean install
This is a multi-module Maven project with inter-module dependencies. The following sections describe how to compile and test specific modules efficiently.
To compile just a specific module (if dependencies are already compiled):
mvn compile -pl <module-name>
Example:
mvn compile -pl library-api
To compile a module and all its upstream dependencies:
mvn compile -pl <module-name> -am
Example:
mvn compile -pl library-api -am
This will compile utilities, encoders, terminology, fhirpath, and
library-api in the correct dependency order.
For a fresh build from scratch:
mvn clean compile -pl <module-name> -am
To compile all modules:
mvn clean compile
If you are testing changes to a single module and don't need to recompile dependencies:
mvn test -pl <module-name> -Dtest=<TestClassName>
Example:
mvn test -pl library-api -Dtest=DataSourcesTest
If you know that code from module dependencies needs to be compiled, use the
-am (also make) flag with the failIfNoSpecifiedTests parameter:
mvn test -pl <module-name> -am -Dtest=<TestClassName> -Dsurefire.failIfNoSpecifiedTests=false
Example:
mvn test -pl library-api -am -Dtest=DataSourcesTest -Dsurefire.failIfNoSpecifiedTests=false
The -Dsurefire.failIfNoSpecifiedTests=false parameter prevents the build from
failing when upstream modules don't contain the specified test class.
You can also use patterns to run multiple test classes:
mvn test -pl library-api -Dtest="*DataSource*"
To run all tests in a specific module:
mvn test -pl <module-name>
The available modules are:
- utilities
- encoders
- terminology
- fhirpath
- library-api
- library-runtime
- lib/python
- lib/R
- site
- benchmark
The Python and R libraries depend on both JARs and generated files created by
the Maven build process. You cannot run tests using language-specific tools
(such as pytest or devtools::test), install the libraries in editable mode
(pip install -e .), or use them in any way without first building through
Maven to generate the required files.
To test the Python library:
mvn test -pl lib/python -am
To test the R library:
mvn test -pl lib/R -am
The -am flag ensures that all upstream dependencies are built before running
the tests.
Before working with the Python or R libraries, you must run a Maven build to
generate required files (such as _version.py for Python). For a complete
build:
mvn clean install -pl lib/python -am
or
mvn clean install -pl lib/R -am
After this initial build, you can install the Python library in editable mode if needed:
cd lib/python
pip install -e .
When rebuilding after making changes to upstream modules, you may need to clear
the local Ivy cache before the changes will be picked up by the Python and R
libraries. The Ivy cache is typically located at ~/.ivy2/cache.
To clear the cache:
rm -rf ~/.ivy2/cache
After clearing the cache, rebuild the libraries:
mvn clean install -pl lib/python -am
mvn clean install -pl lib/R -am
All versioning follows Semantic Versioning 2.0.0.
The core of Pathling consists of the following modules, all of which inherit
from the main pom.xml in the root of the repository.
- utilities - Utility functions used by different components of Pathling.
- encoders - Encoders for transforming FHIR data into Spark Datasets.
- terminology - Interact with a FHIR terminology server from Spark.
- fhirpath - A library that can translate FHIRPath expressions into Spark queries.
- library-api - An API that exposes Pathling functionality to language libraries.
- library-runtime - A Spark package that bundles the Pathling Library API and its runtime dependencies for cluster deployment.
- lib/python - Python language bindings for Pathling.
- lib/R - R language bindings for Pathling.
graph TD
utilities[utilities<br/>Utility functions]
encoders[encoders<br/>FHIR data encoders]
terminology[terminology<br/>FHIR terminology server]
fhirpath[fhirpath<br/>FHIRPath engine]
library-api[library-api<br/>Language library API]
library-runtime[library-runtime<br/>Spark package bundle]
python[lib/python<br/>Python bindings]
r[lib/R<br/>R bindings]
encoders --> utilities
terminology --> utilities
terminology --> encoders
fhirpath --> terminology
library-api --> utilities
library-api --> encoders
library-api --> terminology
library-api --> fhirpath
library-runtime --> library-api
python --> library-runtime
r --> library-runtime
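As a rough illustration of what the -am flag has to resolve, the Python sketch below walks the dependency graph above and lists a module together with its upstream dependencies, dependencies first. The DEPENDENCIES table and the build_order function are hypothetical names transcribed from the graph for this example only; Maven works out the real build order itself from the POM files.

```python
from typing import Dict, List

# Direct dependencies of each module, transcribed from the graph above.
DEPENDENCIES: Dict[str, List[str]] = {
    "utilities": [],
    "encoders": ["utilities"],
    "terminology": ["utilities", "encoders"],
    "fhirpath": ["terminology"],
    "library-api": ["utilities", "encoders", "terminology", "fhirpath"],
    "library-runtime": ["library-api"],
    "lib/python": ["library-runtime"],
    "lib/R": ["library-runtime"],
}


def build_order(module: str) -> List[str]:
    """Return the module and its upstream dependencies, dependencies first."""
    ordered: List[str] = []

    def visit(name: str) -> None:
        # A module that is already scheduled does not need to be visited again.
        if name in ordered:
            return
        for dependency in DEPENDENCIES[name]:
            visit(dependency)
        ordered.append(name)

    visit(module)
    return ordered


print(build_order("library-api"))
```

For library-api this prints utilities, encoders, terminology, fhirpath, and library-api, in that order.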
The "public API" of Pathling is defined as the public API of the library API module.
Other modules are versioned independently of the library API, but should still follow the principles of Semantic Versioning based upon their public, user-facing interfaces.
The branching strategy is very simple and is based on
GitHub Flow. There are no long-lived branches. All changes are made via pull
requests, each from an issue branch that is created from and targets main.
We release frequently, and we will use a short-lived release/-prefixed branch
to aggregate more than one PR into a new version.
The POM versions of the core modules should be on a SNAPSHOT version when developing on a release branch. Successful builds of the release branch will be published to the Maven Central repository as SNAPSHOT versions.
Maven POM versions on main are always release versions. Builds are always
verified to be green within CI before merging to main. Merging to main
automatically triggers publishing of artifacts and deployment of the software to
production environments such as the Pathling website and sandbox instance.
- All comments must use correct grammar and be written as complete sentences.
- Every comment, including single-line comments, must be terminated with a period.
- Comments should clearly explain the purpose or logic of the code they annotate.
- Avoid redundant comments that restate obvious code; focus on intent, rationale, or non-obvious behaviour.
- Update comments when code changes to keep them accurate and relevant.
- TODOs should not be present in code that is being submitted for review. If you have a task that you want to complete in the future, please create an issue for it.
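A short Python sketch of comments written to these rules follows. The constant, the function, and the rationale given in the comments are invented purely for illustration and do not come from the Pathling codebase.

```python
from typing import List

# The retry limit is kept small so that a persistent failure surfaces quickly.
MAX_ATTEMPTS = 3


def backoff_delays(base_seconds: float) -> List[float]:
    """Return the delay before each retry, doubling after every attempt."""
    # The first attempt is immediate, so only the retries need a delay.
    return [base_seconds * 2**attempt for attempt in range(MAX_ATTEMPTS - 1)]


print(backoff_delays(1.0))
```

Each comment is a complete sentence ending in a period, and each explains intent rather than restating what the code does.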
- Use meaningful and descriptive names for classes, methods, and variables (avoid abbreviations).
- Follow standard Java naming conventions:
  - Classes and interfaces: PascalCase (e.g., MyClass)
  - Methods and variables: camelCase (e.g., myVariable, calculateTotal)
  - Constants: UPPER_SNAKE_CASE (e.g., MAX_SIZE)
- Keep methods short and focused on a single responsibility.
- Avoid code duplication; extract common logic into reusable methods.
- Always use braces {} for if, else, for, while, and do statements, even for single statements.
- Use final for variables, parameters, and methods that should not change.
- Avoid using magic numbers; define constants with meaningful names.
- Avoid the use of inner classes, records and enums - having each class defined in its own file is preferred and avoids any implicit dependencies on code within the enclosing scope.
- Document public classes and methods with Javadoc comments.
- Handle exceptions appropriately; do not use empty catch blocks.
- Close resources (e.g., streams, connections) in a finally block or use try-with-resources.
- Avoid deeply nested code; refactor to improve readability.
- Do not ignore method return values unless intentional and documented.
- Use logging frameworks instead of System.out or System.err for output.
- Remove unused code, imports, and variables.
- Write unit tests for all public methods and critical logic.
- Avoid hardcoding file paths, URLs, or credentials; use configuration files or environment variables.
- Use access modifiers (private, protected, public) appropriately to encapsulate data.
- Do not suppress warnings without a clear justification.
- Use nullability annotations (jakarta.annotation.Nonnull and jakarta.annotation.Nullable) on method parameters, return values, and class or record fields.
- Do not leave unused or commented-out code in the codebase.
- Ensure code is free of major bugs, vulnerabilities, and code smells as reported by SonarQube.
- Use meaningful and descriptive names following Scala conventions.
- Follow standard Scala naming conventions:
  - Classes and traits: PascalCase (e.g., MyClass)
  - Methods and variables: camelCase (e.g., myVariable, calculateTotal)
  - Constants: UPPER_SNAKE_CASE (e.g., MAX_SIZE)
- Prefer immutable data structures and functional programming patterns.
- Use val instead of var whenever possible.
- Leverage pattern matching instead of complex conditional statements.
- Use case classes for data containers.
- Handle errors with Option, Either, or Try instead of exceptions where appropriate.
- Use appropriate collection methods (map, filter, fold) instead of loops.
- Keep methods short and focused on a single responsibility.
- Follow PEP 8 style guidelines.
- Use meaningful and descriptive names with snake_case convention.
- Use type hints for function parameters and return values.
- Use docstrings for all public functions, classes, and modules.
- Prefer list comprehensions over loops where appropriate.
- Use context managers (with statements) for resource management.
- Handle exceptions explicitly; avoid bare except clauses.
- Write unit tests using the pytest framework.
- Keep imports organized: standard library, third-party, local imports.
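The sketch below applies several of the Python conventions above in one place: snake_case names, type hints, docstrings, a list comprehension, a context manager, and an explicitly named exception. The function read_resource_types and the file it reads are hypothetical and are not part of Pathling.

```python
"""Hypothetical helper module used only to illustrate the conventions above."""

import os
import tempfile
from typing import List


def read_resource_types(path: str) -> List[str]:
    """Read resource type names from a text file, one per line.

    Blank lines are skipped. An empty list is returned if the file is missing.
    """
    try:
        # The context manager closes the file even if reading fails.
        with open(path, encoding="utf-8") as handle:
            return [line.strip() for line in handle if line.strip()]
    except FileNotFoundError:
        # Only the expected error is caught; anything else propagates.
        return []


# Usage: write a small sample file, then read it back.
with tempfile.TemporaryDirectory() as directory:
    sample_path = os.path.join(directory, "types.txt")
    with open(sample_path, "w", encoding="utf-8") as handle:
        handle.write("Patient\n\nObservation\n")
    loaded = read_resource_types(sample_path)
    missing = read_resource_types(os.path.join(directory, "absent.txt"))

print(loaded)
print(missing)
```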
- Follow tidyverse style guide conventions.
- Use snake_case for variable and function names.
- Use meaningful and descriptive names.
- Prefer vectorized operations over loops.
- Use the pipe operator (%>% or |>) for readable data transformations.
- Use appropriate data structures (data.frames, tibbles, lists).
- Document functions with roxygen2 comments.
- Use testthat for unit testing.
- Handle missing values (NA) appropriately.
- Use consistent indentation (2 spaces recommended).
This repository uses multiple code formatters:
- Prettier for JavaScript/TypeScript code, JSON and YAML
- Ruff for Python code
- EditorConfig for all other languages
Please use the appropriate formatter to reformat your code before pushing.
Before making a contribution, please read the code of conduct.