Adds a new task to detect CSAM using an external provider #2815
hauck-jvsh wants to merge 4 commits into
Pull request overview
This PR adds a new RemoteImageExternalClassifierTask for detecting CSAM (Child Sexual Abuse Material) using an external AI provider. Unlike the existing RemoteImageClassifierTask, this new task communicates with an external service that returns reviewPriority classifications (e.g., VERY_HIGH, HIGH, MEDIUM, LOW, VERY_LOW) and optional embedding vectors. The task batches image/video thumbnails in ZIP files, sends them to the external service, and stores results as extra attributes on each evidence item.
Changes:
- Adds the `RemoteImageExternalClassifierTask` class with full classification logic (batching, retry, duplicate caching, dimension/size filtering)
- Adds supporting configuration class `RemoteImageExternalClassifierConfig` and config file `RemoteImageExternalClassifierConfig.txt`
- Registers the task in `TaskInstaller.xml`, adds an AI filter group in `AIFiltersConfig.json`, and enables the task by default in `IPEDConfig.txt`
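
The batching step described in the overview (thumbnails packed into a ZIP before upload) might look roughly like the sketch below. It uses only `java.util.zip`; the class name `ThumbnailBatchZipper`, the method `zipBatch`, and the choice of a caller-supplied entry name per thumbnail are assumptions for illustration and are not taken from the PR.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not the PR's implementation.
final class ThumbnailBatchZipper {

    private ThumbnailBatchZipper() {
    }

    // Packs each (entry name -> thumbnail bytes) pair into one in-memory ZIP archive.
    static byte[] zipBatch(Map<String, byte[]> thumbnails) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer)) {
            for (Map.Entry<String, byte[]> item : thumbnails.entrySet()) {
                zip.putNextEntry(new ZipEntry(item.getKey()));
                zip.write(item.getValue());
                zip.closeEntry();
            }
        }
        // The stream is closed at this point, so the archive is complete.
        return buffer.toByteArray();
    }
}
```

Building the archive in memory keeps the sketch short; a real task would likely cap the batch by item count or byte size before sending it.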
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| `iped-engine/src/main/java/iped/engine/task/RemoteImageExternalClassifierTask.java` | Core task implementation: batching, HTTP communication, result processing, statistics |
| `iped-engine/src/main/java/iped/engine/config/RemoteImageExternalClassifierConfig.java` | Configuration class for the new task |
| `iped-app/resources/config/conf/RemoteImageExternalClassifierConfig.txt` | Default configuration file |
| `iped-app/resources/config/conf/TaskInstaller.xml` | Registers the new task in the processing pipeline |
| `iped-app/resources/config/conf/AIFiltersConfig.json` | Adds UI filter group for the External CSAM Detector results |
| `iped-app/resources/config/IPEDConfig.txt` | Adds enable toggle for the new task (defaulting to true) |

    urlZip = config.getUrl() + "/zip"; // enforces secure communication (required for sensitive data
                                       // transfer)
    urlVersion = config.getUrl() + "/version";

The inline comment says "enforces secure communication (required for sensitive data transfer)" but, unlike the original RemoteImageClassifierTask where "https://" is prepended to the URL, this task uses config.getUrl() directly without enforcing any protocol. The default URL in RemoteImageExternalClassifierConfig.txt is http://localhost:8000 (plain HTTP). If the service is deployed on a remote host and users configure an HTTPS URL, the SSL behavior depends on validateSSL, but there is no code here that enforces HTTPS. The comment is misleading and should either be removed or corrected to reflect the actual behavior. More critically, since this task is dealing with CSAM data, the lack of enforced HTTPS means sensitive image data could be transmitted unencrypted if the user does not manually configure an HTTPS URL.
Suggested change:

    - urlZip = config.getUrl() + "/zip"; // enforces secure communication (required for sensitive data
    -                                    // transfer)
    - urlVersion = config.getUrl() + "/version";
    + String baseUrl = config.getUrl();
    + if (baseUrl == null || baseUrl.trim().isEmpty()) {
    +     throw new IllegalArgumentException("RemoteImageExternalClassifierTask URL must not be empty");
    + }
    + baseUrl = baseUrl.trim();
    + // Normalize URL: default to HTTPS if no scheme is provided, and disallow insecure HTTP for non-local hosts
    + if (!baseUrl.startsWith("http://") && !baseUrl.startsWith("https://")) {
    +     baseUrl = "https://" + baseUrl;
    + }
    + boolean isLocalHttp = baseUrl.startsWith("http://localhost")
    +         || baseUrl.startsWith("http://127.0.0.1")
    +         || baseUrl.startsWith("http://[::1]");
    + if (baseUrl.startsWith("http://") && !isLocalHttp) {
    +     throw new IllegalArgumentException(
    +             "Insecure HTTP URL configured for RemoteImageExternalClassifierTask. Use HTTPS for remote endpoints: "
    +                     + baseUrl);
    + }
    + urlZip = baseUrl + "/zip"; // ensures secure communication for remote endpoints (required for sensitive data transfer)
    + urlVersion = baseUrl + "/version";

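
One caveat with the prefix checks in the suggestion: `startsWith("http://localhost")` would also accept a host such as `localhost.example.com`. A possible variant, sketched below, parses the URL with `java.net.URI` and compares the host exactly. The class `ClassifierUrlPolicy`, the method `normalize`, and the error messages are hypothetical and not part of the PR; treat this as one way to tighten the check, not as the intended fix.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of the PR.
final class ClassifierUrlPolicy {

    // Hosts for which plain HTTP is tolerated (loopback only).
    private static final Set<String> LOOPBACK_HOSTS = Set.of("localhost", "127.0.0.1", "[::1]");

    private ClassifierUrlPolicy() {
    }

    // Returns the trimmed URL, defaulting to HTTPS when no scheme is given.
    // Rejects plain HTTP unless the host is exactly a loopback name or address.
    static String normalize(String configured) {
        if (configured == null || configured.trim().isEmpty()) {
            throw new IllegalArgumentException("URL must not be empty");
        }
        String url = configured.trim();
        if (!url.contains("://")) {
            url = "https://" + url;
        }
        URI uri;
        try {
            uri = new URI(url);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Invalid URL: " + url, e);
        }
        if (uri.getScheme() == null || uri.getHost() == null) {
            throw new IllegalArgumentException("URL needs a scheme and a host: " + url);
        }
        String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
        String host = uri.getHost().toLowerCase(Locale.ROOT);
        if (scheme.equals("https")) {
            return url;
        }
        if (scheme.equals("http") && LOOPBACK_HOSTS.contains(host)) {
            return url;
        }
        throw new IllegalArgumentException("Use HTTPS for remote endpoints: " + url);
    }
}
```

With this sketch, `normalize("example.org:8000")` yields `https://example.org:8000`, while `normalize("http://localhost.example.com")` throws.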

    # Enables remote external image classifier.
    # It performs classification of image and video files using review priority and embeddings output.
    # Advanced settings can be modified in file "conf/RemoteImageExternalClassifierConfig.txt"
    enableRemoteImageExternalClassifier = true

The new task is enabled by default (enableRemoteImageExternalClassifier = true), while the existing similar task enableRemoteImageClassifier is disabled by default (false), and all other optional/external-service-dependent features in IPEDConfig.txt (PhotoDNA, OCR, face recognition, remote image classifier, etc.) are also disabled by default. Enabling this task by default means every user who does not have the external service running will encounter a connection error at startup (which only logs a warning and disables the task). This is inconsistent with the project's convention of shipping optional features disabled. It should be set to false by default.
Suggested change:

    - enableRemoteImageExternalClassifier = true
    + enableRemoteImageExternalClassifier = false

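
The "optional features ship disabled" convention can also be mirrored on the reading side by treating a missing key as `false`. The sketch below assumes the config text parses as `java.util.Properties`, which may not match how IPED actually loads `IPEDConfig.txt`; `FeatureToggle` and `isEnabled` are hypothetical names.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical helper; IPED's real configuration loading is not shown in this PR.
final class FeatureToggle {

    private FeatureToggle() {
    }

    // Returns true only when the key is present and set to "true" (case-insensitive).
    static boolean isEnabled(String configText, String key) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(configText));
        return Boolean.parseBoolean(props.getProperty(key, "false").trim());
    }
}
```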

    public double getLabelingThreshold() {
        return labelingThreshold;
    }

The labelingThreshold property is defined and parsed in RemoteImageExternalClassifierConfig (with a getter getLabelingThreshold()), and also documented in RemoteImageExternalClassifierConfig.txt. However, the task RemoteImageExternalClassifierTask never reads or uses this value. Since this is described as "Optional: used only if backend also returns legacy 'class' probabilities", the field appears to be dead configuration. Either remove the unused property from the config and config file, or document clearly that it will be used in a future version.

    private boolean skipHashDBFiles = true;

    // Validate server SSL certificate
    private boolean validateSSL = false;

The default for validateSSL is false in both the Java config class and the config file. When validateSSL = false, the HTTP client trusts all SSL certificates and disables hostname verification entirely. Since this task handles CSAM data — which is highly sensitive — the default should be true to enforce proper SSL validation and prevent man-in-the-middle attacks. Users who need self-signed certificates in a controlled environment can explicitly set it to false.
Suggested change:

    - private boolean validateSSL = false;
    + private boolean validateSSL = true;


    ##########################################

    # URL of the service/central node used by the RemoteImageExternalClassifier implementation
    url = http://localhost:8000

The default url = http://localhost:8000 for RemoteImageExternalClassifier means the thumbnails ZIP is sent over plain HTTP whenever the "service/central node" is not truly local, exposing sensitive evidence to eavesdropping or tampering on the network. Because RemoteImageExternalClassifierTask uses config.getUrl() verbatim to build urlZip/urlVersion, an attacker on the path can read or modify the ZIP; default this to an https:// URL and document that TLS is required for any non-local deployments.
Suggested change:

    - url = http://localhost:8000
    + # For any non-local deployment, this endpoint MUST use HTTPS/TLS.
    + url = https://localhost:8000


    skipHashDBFiles = true

    # Validate server SSL certificate
    validateSSL = false

Setting validateSSL = false by default disables TLS certificate and hostname validation in RemoteImageExternalClassifierTask.getClient(), which relies on a trust-all TrustStrategy and NoopHostnameVerifier. This allows any host with a spoofed certificate on the network path to impersonate the classifier service and read or alter the evidence ZIP; make this flag default to true and avoid using the trust-all branch outside tightly controlled testing.
Suggested change:

    - validateSSL = false
    + validateSSL = true

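
To make the effect of the flag concrete, here is a JDK-only sketch of the two branches. The review comments say the PR uses Apache HttpClient's `TrustStrategy` and `NoopHostnameVerifier`; the code below is an analogue built on `javax.net.ssl`, not the PR's code, and `ClassifierTls` and `contextFor` are hypothetical names. It covers certificate trust only; hostname verification is configured separately on the HTTP client.

```java
import java.security.GeneralSecurityException;
import java.security.cert.X509Certificate;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

// Hypothetical helper illustrating what a validateSSL flag typically controls.
final class ClassifierTls {

    private ClassifierTls() {
    }

    static SSLContext contextFor(boolean validateSSL) throws GeneralSecurityException {
        if (validateSSL) {
            // Safe default: the JVM's trust store and normal certificate checks.
            return SSLContext.getDefault();
        }
        // Insecure branch: accepts any certificate chain. Only for controlled test setups.
        TrustManager[] trustAll = { new X509TrustManager() {
            @Override
            public void checkClientTrusted(X509Certificate[] chain, String authType) {
                // Intentionally no check.
            }

            @Override
            public void checkServerTrusted(X509Certificate[] chain, String authType) {
                // Intentionally no check.
            }

            @Override
            public X509Certificate[] getAcceptedIssuers() {
                return new X509Certificate[0];
            }
        } };
        SSLContext context = SSLContext.getInstance("TLS");
        context.init(null, trustAll, null);
        return context;
    }
}
```

Keeping the trust-all path behind an explicit opt-out, as the suggested `validateSSL = true` default does, means an unconfigured deployment gets the safe branch.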
Thank you very much @hauck-jvsh! This is suspended until our external service provider is able to increase our quota or improve service latency, which is currently too slow to handle our daily number of requests.
8d6d7b5 to 18e3c68
72c7c77 to 477407e