Adds a new task to detect CSAM using an external provider#2815

Open: hauck-jvsh wants to merge 4 commits into master from external_asi

Conversation

@hauck-jvsh
Member

No description provided.

Contributor

Copilot AI left a comment

Pull request overview

This PR adds a new RemoteImageExternalClassifierTask for detecting CSAM (Child Sexual Abuse Material) using an external AI provider. Unlike the existing RemoteImageClassifierTask, this new task communicates with an external service that returns reviewPriority classifications (e.g., VERY_HIGH, HIGH, MEDIUM, LOW, VERY_LOW) and optional embedding vectors. The task batches image/video thumbnails in ZIP files, sends them to the external service, and stores results as extra attributes on each evidence item.

Changes:

  • Adds the RemoteImageExternalClassifierTask class with full classification logic (batching, retry, duplicate caching, dimension/size filtering)
  • Adds supporting configuration class RemoteImageExternalClassifierConfig and config file RemoteImageExternalClassifierConfig.txt
  • Registers the task in TaskInstaller.xml, adds an AI filter group in AIFiltersConfig.json, and enables the task by default in IPEDConfig.txt
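
The batching step described above (thumbnails packed into a ZIP before upload) could look roughly like the following. This is a minimal sketch using only `java.util.zip`, not the PR's actual code: the class name `ThumbBatchSketch`, both method names, and the choice of entry names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not taken from the PR: packs a batch of thumbnails into one in-memory ZIP.
final class ThumbBatchSketch {

    // Keys are entry names (assumed here to be something like an item hash plus ".jpg");
    // values are the thumbnail bytes. Entry names must be unique within one batch.
    static byte[] zipBatch(Map<String, byte[]> thumbs) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            for (Map.Entry<String, byte[]> e : thumbs.entrySet()) {
                zos.putNextEntry(new ZipEntry(e.getKey()));
                zos.write(e.getValue());
                zos.closeEntry();
            }
        }
        // The stream is closed at this point, so the ZIP central directory has been written.
        return bos.toByteArray();
    }

    // Reads the entry names back, useful for checking that a batch round-trips.
    static List<String> listEntries(byte[] zip) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zip))) {
            for (ZipEntry e = zis.getNextEntry(); e != null; e = zis.getNextEntry()) {
                names.add(e.getName());
            }
        }
        return names;
    }
}
```

Keying each entry by a stable identifier is one way to map the service's per-file results back to evidence items; whether the PR does this is not stated here.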

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| iped-engine/src/main/java/iped/engine/task/RemoteImageExternalClassifierTask.java | Core task implementation: batching, HTTP communication, result processing, statistics |
| iped-engine/src/main/java/iped/engine/config/RemoteImageExternalClassifierConfig.java | Configuration class for the new task |
| iped-app/resources/config/conf/RemoteImageExternalClassifierConfig.txt | Default configuration file |
| iped-app/resources/config/conf/TaskInstaller.xml | Registers the new task in the processing pipeline |
| iped-app/resources/config/conf/AIFiltersConfig.json | Adds UI filter group for the External CSAM Detector results |
| iped-app/resources/config/IPEDConfig.txt | Adds enable toggle for the new task (defaulting to true) |

Comment on lines +251 to +253
urlZip = config.getUrl() + "/zip"; // enforces secure communication (required for sensitive data
// transfer)
urlVersion = config.getUrl() + "/version";

Copilot AI Mar 3, 2026

The inline comment says "enforces secure communication (required for sensitive data transfer)" but, unlike the original RemoteImageClassifierTask where "https://" is prepended to the URL, this task uses config.getUrl() directly without enforcing any protocol. The default URL in RemoteImageExternalClassifierConfig.txt is http://localhost:8000 (plain HTTP). If the service is deployed on a remote host and users configure an HTTPS URL, the SSL behavior depends on validateSSL, but there is no code here that enforces HTTPS. The comment is misleading and should either be removed or corrected to reflect the actual behavior. More critically, since this task is dealing with CSAM data, the lack of enforced HTTPS means sensitive image data could be transmitted unencrypted if the user does not manually configure an HTTPS URL.

Suggested change
- urlZip = config.getUrl() + "/zip"; // enforces secure communication (required for sensitive data
- // transfer)
- urlVersion = config.getUrl() + "/version";
+ String baseUrl = config.getUrl();
+ if (baseUrl == null || baseUrl.trim().isEmpty()) {
+     throw new IllegalArgumentException("RemoteImageExternalClassifierTask URL must not be empty");
+ }
+ baseUrl = baseUrl.trim();
+ // Normalize URL: default to HTTPS if no scheme is provided, and disallow insecure HTTP for non-local hosts
+ if (!baseUrl.startsWith("http://") && !baseUrl.startsWith("https://")) {
+     baseUrl = "https://" + baseUrl;
+ }
+ boolean isLocalHttp = baseUrl.startsWith("http://localhost")
+         || baseUrl.startsWith("http://127.0.0.1")
+         || baseUrl.startsWith("http://[::1]");
+ if (baseUrl.startsWith("http://") && !isLocalHttp) {
+     throw new IllegalArgumentException(
+             "Insecure HTTP URL configured for RemoteImageExternalClassifierTask. Use HTTPS for remote endpoints: "
+                     + baseUrl);
+ }
+ urlZip = baseUrl + "/zip"; // ensures secure communication for remote endpoints (required for sensitive data transfer)
+ urlVersion = baseUrl + "/version";
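
One caveat with the prefix checks in the suggested change above: `startsWith("http://localhost")` would also accept a host such as `localhost.evil.example`. A possible variant, sketched here under the assumption that parsing the URL with `java.net.URI` is acceptable, compares the parsed host instead. The class and method names (`ClassifierUrlSketch`, `normalizeBaseUrl`) are hypothetical and not part of the PR.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical helper, not part of the PR: validates the configured base URL by parsed host.
final class ClassifierUrlSketch {

    static String normalizeBaseUrl(String configured) {
        if (configured == null || configured.trim().isEmpty()) {
            throw new IllegalArgumentException("URL must not be empty");
        }
        String url = configured.trim();
        // Default to HTTPS when no scheme was given (for example "host:8000").
        if (!url.contains("://")) {
            url = "https://" + url;
        }
        URI uri;
        try {
            uri = new URI(url);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Invalid URL: " + url, e);
        }
        String scheme = uri.getScheme() == null ? "" : uri.getScheme().toLowerCase(Locale.ROOT);
        String host = uri.getHost() == null ? "" : uri.getHost().toLowerCase(Locale.ROOT);
        // Exact host match, so "localhost.evil.example" is not treated as local.
        boolean loopback = host.equals("localhost") || host.equals("127.0.0.1")
                || host.equals("[::1]") || host.equals("::1");
        if (scheme.equals("https") || (scheme.equals("http") && loopback)) {
            // Drop one trailing slash so "/zip" and "/version" can be appended directly.
            return url.endsWith("/") ? url.substring(0, url.length() - 1) : url;
        }
        throw new IllegalArgumentException("Use HTTPS for non-local endpoints: " + url);
    }
}
```

With this sketch, `http://localhost:8000` is accepted unchanged, while `http://10.0.0.5:8000` and `http://localhost.evil.example:8000` are rejected.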

# Enables remote external image classifier.
# It performs classification of image and video files using review priority and embeddings output.
# Advanced settings can be modified in file "conf/RemoteImageExternalClassifierConfig.txt"
enableRemoteImageExternalClassifier = true

Copilot AI Mar 3, 2026

The new task is enabled by default (enableRemoteImageExternalClassifier = true), while the existing similar task enableRemoteImageClassifier is disabled by default (false), and all other optional/external-service-dependent features in IPEDConfig.txt (PhotoDNA, OCR, face recognition, remote image classifier, etc.) are also disabled by default. Enabling this task by default means every user who does not have the external service running will encounter a connection error at startup (which only logs a warning and disables the task). This is inconsistent with the project's convention of shipping optional features disabled. It should be set to false by default.

Suggested change
- enableRemoteImageExternalClassifier = true
+ enableRemoteImageExternalClassifier = false

Comment on lines +65 to +67
public double getLabelingThreshold() {
return labelingThreshold;
}

Copilot AI Mar 3, 2026

The labelingThreshold property is defined and parsed in RemoteImageExternalClassifierConfig (with a getter getLabelingThreshold()), and also documented in RemoteImageExternalClassifierConfig.txt. However, the task RemoteImageExternalClassifierTask never reads or uses this value. Since this is described as "Optional: used only if backend also returns legacy 'class' probabilities", the field appears to be dead configuration. Either remove the unused property from the config and config file, or document clearly that it will be used in a future version.

private boolean skipHashDBFiles = true;

// Validate server SSL certificate
private boolean validateSSL = false;

Copilot AI Mar 3, 2026

The default for validateSSL is false in both the Java config class and the config file. When validateSSL = false, the HTTP client trusts all SSL certificates and disables hostname verification entirely. Since this task handles CSAM data — which is highly sensitive — the default should be true to enforce proper SSL validation and prevent man-in-the-middle attacks. Users who need self-signed certificates in a controlled environment can explicitly set it to false.

Suggested change
- private boolean validateSSL = false;
+ private boolean validateSSL = true;

##########################################

# URL of the service/central node used by the RemoteImageExternalClassifier implementation
url = http://localhost:8000

Copilot AI Mar 3, 2026

The default url = http://localhost:8000 for RemoteImageExternalClassifier means the thumbnails ZIP is sent over plain HTTP whenever the "service/central node" is not truly local, exposing sensitive evidence to eavesdropping or tampering on the network. Because RemoteImageExternalClassifierTask uses config.getUrl() verbatim to build urlZip/urlVersion, an attacker on the path can read or modify the ZIP; default this to an https:// URL and document that TLS is required for any non-local deployments.

Suggested change
- url = http://localhost:8000
+ # For any non-local deployment, this endpoint MUST use HTTPS/TLS.
+ url = https://localhost:8000

skipHashDBFiles = true

# Validate server SSL certificate
validateSSL = false

Copilot AI Mar 3, 2026

Setting validateSSL = false by default disables TLS certificate and hostname validation in RemoteImageExternalClassifierTask.getClient(), which relies on a trust-all TrustStrategy and NoopHostnameVerifier. This allows any host with a spoofed certificate on the network path to impersonate the classifier service and read or alter the evidence ZIP; make this flag default to true and avoid using the trust-all branch outside tightly controlled testing.

Suggested change
- validateSSL = false
+ validateSSL = true
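
To illustrate what the flag toggles, here is a minimal sketch of the two branches. It uses the JDK's own `javax.net.ssl` classes as a stand-in, not the Apache HttpClient `TrustStrategy` and `NoopHostnameVerifier` that the comment says `getClient()` relies on. The class and method names (`SslModeSketch`, `sslContextFor`) are hypothetical, and the sketch covers certificate trust only, not hostname verification.

```java
import java.security.GeneralSecurityException;
import java.security.cert.X509Certificate;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

// Hypothetical illustration, not the PR's getClient(): picks an SSLContext from the flag.
final class SslModeSketch {

    // Accepts every certificate chain without checking it. Only for tightly controlled testing.
    private static final class TrustAll implements X509TrustManager {
        @Override
        public void checkClientTrusted(X509Certificate[] chain, String authType) {
        }

        @Override
        public void checkServerTrusted(X509Certificate[] chain, String authType) {
        }

        @Override
        public X509Certificate[] getAcceptedIssuers() {
            return new X509Certificate[0];
        }
    }

    static SSLContext sslContextFor(boolean validateSSL) throws GeneralSecurityException {
        if (validateSSL) {
            // JVM default context: validates the server chain against the configured trust store.
            return SSLContext.getDefault();
        }
        // Trust-all branch: any certificate is accepted, so a spoofed server is not detected.
        SSLContext ctx = SSLContext.getInstance("TLS");
        ctx.init(null, new TrustManager[] { new TrustAll() }, null);
        return ctx;
    }
}
```

With `validateSSL = true` as the default, the trust-all branch is reached only when a user opts out explicitly, which matches the intent of the suggestion above.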

@lfcnassif
Member

Thank you very much @hauck-jvsh!

This is suspended until our external service provider can increase our quota or improve the service latency. Today the service is too slow to handle our current daily number of requests.

3 participants