This is a straightforward port of Martin Porter's C implementation of the Porter stemming algorithm. The C version this port is based on is available for download here: http://tartarus.org/~martin/PorterStemmer/c_thread_safe.txt
The original algorithm is described in the paper:
M.F. Porter, 1980, An algorithm for suffix stripping, Program, 14(3), pp. 130-137.
- Thread-safe implementation
- Multiple APIs: simple string API and zero-allocation byte-slice API
- Command-line tool for batch processing
- Comprehensive test suite
- Benchmarked and optimized
- No external dependencies
go get github.com/a2800276/porter

go install github.com/a2800276/porter/cmd/porter@latest

package main
import (
	"fmt"
	"log"

	"github.com/a2800276/porter"
)

func main() {
	// Simple string API (with allocations)
	stemmed, err := porter.Stem("running")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(stemmed) // Output: run

	// Efficient byte-slice API (zero allocations); stems word in place
	word := []byte("running")
	stemmedBytes, err := porter.StemBytes(word)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(stemmedBytes)) // Output: run
}

Install the command-line tool:
go install github.com/a2800276/porter/cmd/porter@latest

Use it to stem words:
# Stem words from arguments
$ porter running jumped easily
run
jump
easili
# Stem words from stdin
$ echo -e "running\njumped\neasily" | porter
run
jump
easili
# Process a file
$ cat words.txt | porter > stemmed.txt
# Count unique stems
$ cat corpus.txt | porter | sort | uniq -c | sort -rn

The package provides two functions for different use cases:
Stem() is the simplest API: it takes a string and returns the stemmed string, converting the input to lowercase automatically. It returns an error if stemming fails, which is rare in normal use.
StemBytes() is the zero-allocation API: it lowercases and stems the byte slice in place and returns the stemmed portion as a slice of the input. Because the input is modified, copy it first if the original bytes are still needed. Best for high-performance scenarios. It returns an error if stemming fails.
Benchmark results for both APIs (time, bytes allocated, and allocations per operation):
BenchmarkStem-24 14064384 77.29 ns/op 16 B/op 2 allocs/op
BenchmarkStemBytes-24 23443530 51.85 ns/op 0 B/op 0 allocs/op
The byte-slice API (StemBytes) takes about 33% less time per operation (51.85 ns vs. 77.29 ns) and performs zero allocations,
making it ideal for high-performance applications.
Note: Error handling adds minimal overhead (~2ns) but provides explicit feedback on failures.
- The algorithm operates on English words only. Input is automatically converted to lowercase.
- For the Stem() function, strings are converted to byte slices internally. For zero-copy operation, use StemBytes().
- Unicode handling: The algorithm is designed for ASCII English text. Non-ASCII characters should be handled by the caller before stemming.
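
One way a caller might handle the ASCII restriction is to filter input before stemming. The sketch below is an assumption about caller-side preprocessing, not part of the package: the hypothetical helper isASCIIWord reports whether a word consists solely of ASCII letters, so that anything else can be skipped or normalized first. It is deliberately strict and also rejects digits and hyphenated words.

```go
package main

import "fmt"

// isASCIIWord reports whether s is non-empty and consists only of
// ASCII letters (a-z, A-Z). Hypothetical helper for illustration.
func isASCIIWord(s string) bool {
	if s == "" {
		return false
	}
	for i := 0; i < len(s); i++ {
		c := s[i]
		isLetter := (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
		if !isLetter {
			return false
		}
	}
	return true
}

func main() {
	for _, w := range []string{"running", "café", "Easily", "e-mail"} {
		fmt.Printf("%q ascii-only=%t\n", w, isASCIIWord(w))
	}
}
```

Words that fail the check could be passed through unchanged or transliterated by the caller, depending on the application.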
make build # Build the CLI tool
make install # Install CLI to $GOPATH/bin

make test # Run tests
make coverage # Generate coverage report
make bench # Run benchmarks

make fmt # Format code
make vet # Run go vet
make lint # Run golangci-lint (requires installation)

Contributions are welcome! Please ensure:
- Tests pass: make test
- Code is formatted: make fmt
- No linting errors: make lint
MIT licensed. See LICENSE file for details.