Merged
4 changes: 4 additions & 0 deletions .github/workflows/main-jdk17-build.yml
@@ -20,8 +20,12 @@ name: main jdk17 build
on:
  pull_request:
    branches: [ main ]
    paths-ignore:
      - 'docs/**'
  push:
    branches: [ main ]
    paths-ignore:
      - 'docs/**'

jobs:
  build:
4 changes: 4 additions & 0 deletions .github/workflows/main-jdk17-windows-build-multi-locale.yml
@@ -20,8 +20,12 @@ name: main jdk17 windows build (multi-locale)
on:
  pull_request:
    branches: [ main ]
    paths-ignore:
      - 'docs/**'
  push:
    branches: [ main ]
    paths-ignore:
      - 'docs/**'

jobs:
  build:
4 changes: 4 additions & 0 deletions .github/workflows/main-jdk17-windows-build.yml
@@ -20,8 +20,12 @@ name: main jdk17 windows build
on:
  pull_request:
    branches: [ main ]
    paths-ignore:
      - 'docs/**'
  push:
    branches: [ main ]
    paths-ignore:
      - 'docs/**'

jobs:
  build:
2 changes: 2 additions & 0 deletions .github/workflows/main-jdk21-build.yml
@@ -20,6 +20,8 @@ name: main jdk21 build
on:
  push:
    branches: [ main ]
    paths-ignore:
      - 'docs/**'

jobs:
  build:
2 changes: 2 additions & 0 deletions .github/workflows/main-jdk25-build.yml
@@ -20,6 +20,8 @@ name: main jdk25 build
on:
  push:
    branches: [ main ]
    paths-ignore:
      - 'docs/**'

jobs:
  build:
96 changes: 96 additions & 0 deletions docs/pom.xml
@@ -0,0 +1,96 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<parent>
<groupId>org.apache.tika</groupId>
<artifactId>tika</artifactId>
<version>4.0.0-SNAPSHOT</version>
</parent>

<artifactId>tika-docs</artifactId>
<packaging>pom</packaging>
<name>Apache Tika Documentation</name>

<properties>
<!-- Update this when a new stable version is released -->
<tika.stable.version>3.2.3</tika.stable.version>
</properties>

<build>
<plugins>
<plugin>
<groupId>org.asciidoctor</groupId>
<artifactId>asciidoctor-maven-plugin</artifactId>
<version>3.2.0</version>
<executions>
<execution>
<id>output-html</id>
<phase>generate-resources</phase>
<goals>
<goal>process-asciidoc</goal>
</goals>
<configuration>
<doctype>article</doctype>
<attributes>
<source-highlighter>coderay</source-highlighter>
<toc />
<linkcss>false</linkcss>
<icons>font</icons>
<tika-stable-version>${tika.stable.version}</tika-stable-version>
<!-- Paths to config examples for include directives -->
<parser-examples>${project.basedir}/../tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/config-examples</parser-examples>
<server-examples>${project.basedir}/../tika-server/tika-server-core/src/test/resources/config-examples</server-examples>
<pipes-fs-examples>${project.basedir}/../tika-pipes/tika-pipes-plugins/tika-pipes-file-system/src/test/resources/config-examples</pipes-fs-examples>
</attributes>
</configuration>
</execution>
</executions>
<configuration>
<sourceDirectory>src/main/asciidoc</sourceDirectory>
<preserveDirectories>true</preserveDirectories>
</configuration>
</plugin>

<!-- Maven Assembly plugin to create tar.gz -->
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<executions>
<execution>
<id>make-docs-archive</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
<configuration>
<descriptors>
<descriptor>src/assembly/docs.xml</descriptor>
</descriptors>
<finalName>${project.artifactId}-${project.version}</finalName>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>

</project>
37 changes: 37 additions & 0 deletions docs/src/assembly/docs.xml
@@ -0,0 +1,37 @@
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<assembly xmlns="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.3"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.3
https://maven.apache.org/xsd/assembly-1.1.3.xsd">
<id>docs</id>
<formats>
<format>tar.gz</format>
</formats>
<includeBaseDirectory>false</includeBaseDirectory>
<fileSets>
<fileSet>
<directory>${project.build.directory}/generated-docs</directory>
<outputDirectory>/</outputDirectory>
<includes>
<include>**/*</include>
</includes>
</fileSet>
</fileSets>
</assembly>
31 changes: 31 additions & 0 deletions docs/src/main/asciidoc/advanced/index.adoc
@@ -0,0 +1,31 @@
//
// Licensed to the Apache Software Foundation (ASF) under one or more
// contributor license agreements. See the NOTICE file distributed with
// this work for additional information regarding copyright ownership.
// The ASF licenses this file to You under the Apache License, Version 2.0
// (the "License"); you may not use this file except in compliance with
// the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
//

= Advanced Topics

This section covers advanced usage and internals of Apache Tika.

== Topics

* xref:robustness.adoc[Robustness] - Process isolation and fault tolerance when parsing untrusted content
* xref:spooling.adoc[TikaInputStream and Spooling] - Understanding how TikaInputStream handles buffering, caching, and spooling to disk

// Add links to specific topics as they are created
// * link:custom-parsers.html[Writing Custom Parsers]
// * link:custom-detectors.html[Writing Custom Detectors]
// * link:configuration.html[Advanced Configuration]
// * link:performance.html[Performance Tuning]
137 changes: 137 additions & 0 deletions docs/src/main/asciidoc/advanced/robustness.adoc
@@ -0,0 +1,137 @@
//
// Licensed to the Apache Software Foundation (ASF) under one or more
// contributor license agreements. See the NOTICE file distributed with
// this work for additional information regarding copyright ownership.
// The ASF licenses this file to You under the Apache License, Version 2.0
// (the "License"); you may not use this file except in compliance with
// the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
//

= The Robustness of Apache Tika

Running parsers on untrusted data carries inherent risks. In rare cases, Tika can
enter an infinite loop or allocate enough memory to trigger an `OutOfMemoryError`.
When processing documents at scale, you must implement protective measures.

IMPORTANT: Avoid running Tika in the same process as critical infrastructure like
indexers or search systems.

== Process Isolation

The primary defense against parser failures is process isolation. By running parsers
in separate processes, you protect your main application from:

* OutOfMemoryErrors
* Infinite loops
* Native code crashes
* Resource exhaustion
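
As a rough illustration of the idea (not how Tika Pipes or `ForkParser` is implemented),
the sketch below runs a command in a child process, waits up to a timeout, and forcibly
kills the child if it hangs. The class name `IsolatedRunner`, its `run` method, and the
`sleep` command used to simulate a hung parser are all hypothetical. A real setup would
launch a child JVM with its own heap limit (for example `-Xmx`) rather than `sleep`.

```java
import java.io.IOException;
import java.util.List;
import java.util.OptionalInt;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: runs a command in a child process with a hard timeout. */
public class IsolatedRunner {

    /**
     * Returns the child's exit code, or an empty OptionalInt if the child
     * exceeded the timeout and had to be killed.
     */
    public static OptionalInt run(List<String> command, long timeoutMillis)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        // Discard child output so a chatty child cannot block on a full pipe.
        pb.redirectErrorStream(true);
        pb.redirectOutput(ProcessBuilder.Redirect.DISCARD);
        Process child = pb.start();

        if (!child.waitFor(timeoutMillis, TimeUnit.MILLISECONDS)) {
            // The child hung (e.g. an infinite loop): kill it. The parent JVM is unaffected.
            child.destroyForcibly();
            child.waitFor();
            return OptionalInt.empty();
        }
        // A non-zero exit code (crash, OutOfMemoryError in a child JVM) is reported, not thrown.
        return OptionalInt.of(child.exitValue());
    }

    public static void main(String[] args) throws Exception {
        // Assumes a POSIX `sleep` binary on the PATH; it stands in for a hung parser.
        OptionalInt result = run(List.of("sleep", "30"), 500);
        System.out.println(result.isPresent()
                ? "child exited with " + result.getAsInt()
                : "child timed out and was killed");
    }
}
```

The key point is that the failure stays in the child: the parent only observes a timeout or
an exit code and can decide whether to retry, skip the document, or record the error.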

=== Tika 4.x

**In Tika 4.x, xref:../pipes/index.adoc[Tika Pipes] is the recommended approach for
robust document processing.** It provides:

* Automatic process isolation
* Fault tolerance and recovery
* Scalable parallel processing
* Unified architecture for all deployment scenarios

Pipes can be used in multiple ways:

* **Programmatically** - Via `PipesForkParser` in the `tika-pipes-fork-parser` module
(see xref:../using-tika/java-api/getting-started.adoc[Java API Getting Started])
* **Via tika-server** - REST endpoints for pipes-based processing
* **Via tika-grpc** - gRPC interface with pipes backend

In Tika 4.x, the approach to robustness has been simplified. Previous versions offered
four different forking mechanisms; their status in 4.x is:

[cols="1,2,1"]
|===
|Mechanism |Description |Status in 4.x

|ForkParser
|Spawned child processes for individual parse operations
|Deprecated

|tika-batch
|Desktop/VM-scale batch processing
|Deprecated

|tika-server (forked mode)
|REST server with forked parsing processes
|Available, but Pipes recommended

|tika-pipes
|Scalable, fault-tolerant pipeline processing
|*Recommended approach*
|===

=== Tika 3.x and Earlier

If you are using Tika 3.x or earlier, you have several options for process isolation:

ForkParser::
Spawns child processes to protect against out-of-memory errors and infinite loops.
Suitable for programmatic use in Java applications.

tika-batch::
For desktop/VM-scale processing (not cloud-scale):
+
[source,bash]
----
java -jar tika-app.jar -i <input_dir> -o <output_dir>
----

tika-server::
In version 2.x and later, parsing defaults to forked processes. Clients must handle
tika-server restarts gracefully.

tika-pipes::
Available through programmatic use, the tika-app `-a` option, or tika-server's `/async`
and `/pipes` endpoints.
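
For the tika-server case above, handling restarts gracefully usually means retrying a
request that failed because the server was briefly unavailable. The sketch below is one
way a client might do that, under stated assumptions: `RetryingClient` and `callWithRetry`
are hypothetical names, not part of Tika, and the simulated `IOException` stands in for
the connection error a real HTTP client would raise during a restart.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

/** Hypothetical client-side helper: retries a call with exponential backoff. */
public class RetryingClient {

    public static <T> T callWithRetry(Callable<T> call, int maxAttempts, long initialBackoffMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long backoff = initialBackoffMillis;
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (IOException e) {
                // Only I/O failures are retried; other exceptions propagate immediately.
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(backoff);
                    backoff *= 2; // double the wait before the next attempt
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulate a server that is restarting during the first two attempts.
        String result = callWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new IOException("connection refused (simulated restart)");
            }
            return "ok";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts"); // prints "ok after 3 attempts"
    }
}
```

Bounding the number of attempts matters: a document that reliably crashes the server should
eventually be reported as failed rather than retried forever.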

== Security Testing and Prevention

The Apache Tika team implements several measures to identify and prevent vulnerabilities:

* **Regression testing** against ~2 million files from Common Crawl before releases
* **Code reviews** of dependencies to identify vulnerability patterns
* **Fuzzing modules** for automated vulnerability discovery
* **Collaboration** with security researchers
* **Maintained forks** of parsers with critical fixes (released independently when needed)
* **Public documentation** of vulnerabilities on the xref:../security.adoc[security page]

== MockParser for Testing

Tika provides a `MockParser` tool for testing your system's robustness. You can
configure it to simulate various failure modes:

* Infinite loops
* OutOfMemoryErrors
* Excessive runtime
* Large output generation

This allows you to verify that your integration handles parser failures gracefully.

== Recommendations

1. **Use Tika Pipes** (4.x) for production workloads with untrusted content
2. **Isolate Tika** from critical systems - never run it in the same JVM as your indexer
3. **Set timeouts** for all parsing operations
4. **Monitor memory usage** and set appropriate limits
5. **Plan for failures** - your system should handle parser crashes gracefully
6. **Stay updated** - apply security updates promptly

== Further Reading

* xref:../pipes/index.adoc[Tika Pipes] - Recommended approach for robust processing
* xref:../security.adoc[Security] - Known vulnerabilities and security model
File renamed without changes.