Commit 7603493

Merge pull request #170 from ArcInstitute/bqtools-0.5.4
Bqtools 0.5.4
2 parents 302a08e + 0a2a5d5 commit 7603493

24 files changed, with 467 additions and 333 deletions

CLAUDE.md

Lines changed: 3 additions & 1 deletion
@@ -44,7 +44,9 @@ Build without defaults: `cargo build --no-default-features -F fuzzy,gcs`

 **Parallel processing**: Commands use the `paraseq` crate's `ParallelProcessor` trait for embarrassingly parallel batch processing. Each command has a `processor.rs` implementing this trait with thread-local buffers and `Arc<Mutex<T>>` for shared global state.

-**Grep backends**: The grep command uses a `PatternMatcher` enum dispatching to three backends — `regex`, `aho-corasick` (fixed-string, multi-pattern), and `sassy` (fuzzy, feature-gated). The same pattern applies to `PatternCounter` for the `-P` pattern-count mode.
+**Grep backends**: The grep command uses a `PatternMatcher` enum dispatching to three backends — `regex`, `aho-corasick` (fixed-string, multi-pattern), and `sassy` (fuzzy, feature-gated). The same pattern applies to `PatternCounter` for the `-P` pattern-count mode. All backends accept `PatternCollection` which carries optional pattern names (from FASTA headers).
+
+**Pattern types**: `patterns.rs` defines `Pattern` (name + sequence) and `PatternCollection` (newtype over `Vec<Pattern>`) with methods `.bytes()`, `.regexes()`, `.names()`. Pattern files (`--file`, `--sfile`, `--xfile`) auto-detect FASTA vs plain text. FASTA headers become pattern names; plain text patterns have no name and fall back to the pattern string in output.

 **Encode modes**: Encoding dispatches across atomic (single/paired files), recursive (directory walk via `walkdir`), manifest (file list), and batch (multi-file thread distribution) modes.
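
The `patterns.rs` source is not included in the hunks shown here, so the following is only a minimal sketch of what the **Pattern types** note describes. The field names `name` and `sequence` and the tuple-struct shape of `PatternCollection` match how they are constructed in `src/cli/grep.rs`; the method bodies and the `display_name` helper are assumptions made for illustration, and `.regexes()` is left out because it would need the `regex` crate.

```rust
/// A search pattern with an optional name (e.g. taken from a FASTA header).
#[derive(Debug, Clone, PartialEq)]
pub struct Pattern {
    pub name: Option<String>,
    pub sequence: Vec<u8>,
}

impl Pattern {
    /// Hypothetical helper: the name if present, otherwise the pattern string.
    pub fn display_name(&self) -> String {
        match &self.name {
            Some(name) => name.clone(),
            None => String::from_utf8_lossy(&self.sequence).into_owned(),
        }
    }
}

/// Newtype over `Vec<Pattern>`.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct PatternCollection(pub Vec<Pattern>);

impl PatternCollection {
    /// Raw pattern bytes, in insertion order (assumed return type).
    pub fn bytes(&self) -> Vec<Vec<u8>> {
        self.0.iter().map(|p| p.sequence.clone()).collect()
    }

    /// Output names, falling back to the pattern string for unnamed patterns.
    pub fn names(&self) -> Vec<String> {
        self.0.iter().map(Pattern::display_name).collect()
    }
}
```

With this shape, a pattern given on the command line (no name) and a pattern loaded from a FASTA record (named) can sit in the same collection, and the output code only ever asks for `.names()`.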

Cargo.toml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 [package]
 name = "bqtools"
-version = "0.5.3"
+version = "0.5.4"
 edition = "2021"
 license = "MIT"
 authors = ["Noam Teyssier <noam.teyssier@arcinstitute.org>"]

README.md

Lines changed: 12 additions & 2 deletions
@@ -320,6 +320,7 @@ bqtools grep input.bq "ACGTACGT" -zi
 ```

 `bqtools` can also handle a large collection of patterns which can be provided on the CLI as a file.
+Pattern files can be either **plain text** (one pattern per line) or **FASTA** format (sequences are used as patterns, auto-detected).
 You can provide files for either primary/extended, just primary, or just extended patterns with the relevant flags.
 Notably this will match _solely_ with OR logic.
 This can be used also with fuzzy matching as well as with pattern counting described below.
@@ -329,9 +330,12 @@ If your patterns are all fixed strings (and not regex), you can improve performa
 This will use the more efficient [Aho-Corasick algorithm](https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm) to match patterns.

 ```bash
-# Run grep with patterns from a file
+# Run grep with patterns from a plain text file (one pattern per line)
 bqtools grep input.bq --file patterns.txt

+# Run grep with patterns from a FASTA file (sequences used as patterns)
+bqtools grep input.bq --file patterns.fa
+
 # Run grep with patterns from a file (primary)
 bqtools grep input.bq --sfile patterns.txt

@@ -387,7 +391,13 @@ bqtools grep input.bq --file patterns.txt -P
 bqtools grep input.bq --file patterns.txt -Px
 ```

-The output of pattern count is a TSV with three columns: [Pattern, Count, Fraction of Total]
+The output of pattern count is a TSV with three columns: [Name, Count, Fraction of Total].
+When patterns are loaded from a FASTA file, the FASTA sequence headers are used as names; otherwise, the pattern string itself is used.
+
+```bash
+# Count patterns from a FASTA file (names column shows FASTA headers)
+bqtools grep input.bq --file patterns.fa -P
+```

 ### Pipe

src/cli/grep.rs

Lines changed: 80 additions & 81 deletions
@@ -1,7 +1,16 @@
-use anyhow::{bail, Result};
+use std::{
+    fs,
+    io::{self, Read},
+};
+
+use anyhow::Result;
 use clap::Parser;
+use paraseq::{fasta, Record};

-use crate::{cli::FileFormat, commands::grep::SimpleRange};
+use crate::{
+    cli::FileFormat,
+    commands::grep::{Pattern, PatternCollection, SimpleRange},
+};

 use super::{InputBinseq, OutputFile};

@@ -121,51 +130,6 @@ impl GrepArgs {
         }
         Ok(())
     }
-    fn chain_regex(
-        &self,
-        cli_patterns: &[String],
-        filetype: PatternFileType,
-    ) -> Result<Vec<regex::bytes::Regex>> {
-        let mut all_patterns = cli_patterns
-            .iter()
-            .map(std::borrow::ToOwned::to_owned)
-            .collect::<Vec<String>>();
-        if !self.file_args.empty_file(filetype) {
-            all_patterns.extend(self.file_args.read_file_patterns(filetype)?);
-        }
-
-        // all patterns are kept separate for:
-        // 1. AND logic
-        // 2. Individual pattern counting
-        if self.and_logic() || self.pattern_count {
-            Ok(all_patterns
-                .iter()
-                .map(|s| {
-                    regex::bytes::Regex::new(s).expect("Could not build regex from pattern: {s}")
-                })
-                .collect())
-
-        // for OR logic they can be compiled into a single regex for performance
-        } else {
-            let global_pattern = all_patterns.join("|");
-            if global_pattern.is_empty() {
-                Ok(vec![])
-            } else {
-                Ok(vec![regex::bytes::Regex::new(&global_pattern).expect(
-                    "Could not build regex from pattern: {global_pattern}",
-                )])
-            }
-        }
-    }
-    pub fn bytes_reg1(&self) -> Result<Vec<regex::bytes::Regex>> {
-        self.chain_regex(&self.reg1, PatternFileType::SFile)
-    }
-    pub fn bytes_reg2(&self) -> Result<Vec<regex::bytes::Regex>> {
-        self.chain_regex(&self.reg2, PatternFileType::XFile)
-    }
-    pub fn bytes_reg(&self) -> Result<Vec<regex::bytes::Regex>> {
-        self.chain_regex(&self.reg, PatternFileType::File)
-    }
     pub fn and_logic(&self) -> bool {
         if self.file_args.empty() {
             !self.or_logic
@@ -177,27 +141,30 @@ impl GrepArgs {
 }

 impl GrepArgs {
-    fn chain_bytes(
+    fn chain_patterns(
         &self,
         cli_patterns: &[String],
         filetype: PatternFileType,
-    ) -> Result<Vec<Vec<u8>>> {
-        let bytes_iter = cli_patterns.iter().map(|s| s.as_bytes().to_vec());
+    ) -> Result<PatternCollection> {
+        let cli_iter = cli_patterns.iter().map(|s| Pattern {
+            name: None,
+            sequence: s.as_bytes().to_vec(),
+        });
         if self.file_args.empty_file(filetype) {
-            Ok(bytes_iter.collect())
+            Ok(PatternCollection(cli_iter.collect()))
         } else {
-            let patterns = self.file_args.patterns(filetype)?;
-            Ok(bytes_iter.chain(patterns).collect())
+            let file_patterns = self.file_args.patterns(filetype)?;
+            Ok(PatternCollection(cli_iter.chain(file_patterns).collect()))
         }
     }
-    pub fn bytes_pat1(&self) -> Result<Vec<Vec<u8>>> {
-        self.chain_bytes(&self.reg1, PatternFileType::SFile)
+    pub fn patterns_m1(&self) -> Result<PatternCollection> {
+        self.chain_patterns(&self.reg1, PatternFileType::SFile)
     }
-    pub fn bytes_pat2(&self) -> Result<Vec<Vec<u8>>> {
-        self.chain_bytes(&self.reg2, PatternFileType::XFile)
+    pub fn patterns_m2(&self) -> Result<PatternCollection> {
+        self.chain_patterns(&self.reg2, PatternFileType::XFile)
     }
-    pub fn bytes_pat(&self) -> Result<Vec<Vec<u8>>> {
-        self.chain_bytes(&self.reg, PatternFileType::File)
+    pub fn patterns(&self) -> Result<PatternCollection> {
+        self.chain_patterns(&self.reg, PatternFileType::File)
     }
 }

@@ -229,28 +196,32 @@ pub struct FuzzyArgs {
 pub struct PatternFileArgs {
     /// File of patterns to search for
     ///
-    /// This assumes one pattern per line.
+    /// Accepts a plain text file (one pattern per line) or a FASTA file
+    /// (sequences are used as patterns). FASTA files are auto-detected.
     /// Patterns may be regex or literal (fuzzy doesn't support regex).
     /// These will match against either primary or extended sequence.
     #[clap(long)]
     pub file: Option<String>,

     /// File of patterns to search for in primary sequence
     ///
-    /// This assumes one pattern per line.
+    /// Accepts a plain text file (one pattern per line) or a FASTA file
+    /// (sequences are used as patterns). FASTA files are auto-detected.
     /// Patterns may be regex or literal (fuzzy doesn't support regex).
     #[clap(long)]
     pub sfile: Option<String>,

     /// File of patterns to search for in extended sequence
     ///
-    /// This assumes one pattern per line.
+    /// Accepts a plain text file (one pattern per line) or a FASTA file
+    /// (sequences are used as patterns). FASTA files are auto-detected.
     /// Patterns may be regex or literal (fuzzy doesn't support regex).
     #[clap(long)]
     pub xfile: Option<String>,
 }
+
 impl PatternFileArgs {
-    fn empty(&self) -> bool {
+    pub(crate) fn empty(&self) -> bool {
         self.file.is_none() && self.sfile.is_none() && self.xfile.is_none()
     }

@@ -262,34 +233,62 @@ impl PatternFileArgs {
         }
     }

-    fn read_file(&self, filetype: PatternFileType) -> Result<String> {
+    fn file_path(&self, filetype: PatternFileType) -> Result<&str> {
         let file = match filetype {
             PatternFileType::File => &self.file,
             PatternFileType::SFile => &self.sfile,
             PatternFileType::XFile => &self.xfile,
         };
-        if let Some(file) = file {
-            Ok(std::fs::read_to_string(file)?)
-        } else {
-            bail!("Specified file type {filetype:?} not provided at CLI")
-        }
+        file.as_deref()
+            .ok_or_else(|| anyhow::anyhow!("Specified file type {filetype:?} not provided at CLI"))
     }

-    fn read_file_patterns(&self, filetype: PatternFileType) -> Result<Vec<String>> {
-        let contents = self.read_file(filetype)?;
-        Ok(contents
-            .lines()
-            .map(std::string::ToString::to_string)
-            .collect())
+    /// Returns true if the file starts with '>' (FASTA format).
+    fn is_fasta(path: &str) -> Result<bool> {
+        let file = fs::File::open(path)?;
+        // only take up to 10 bytes to determine fasta status
+        for byte in io::BufReader::new(file).bytes().take(10) {
+            let b = byte?;
+            if b != b'\n' && b != b'\r' {
+                return Ok(b == b'>');
+            }
+        }
+        Ok(false)
     }

-    fn patterns(&self, filetype: PatternFileType) -> Result<Vec<Vec<u8>>> {
-        let contents = self.read_file(filetype)?;
-        let mut patterns = Vec::new();
-        for line in contents.lines() {
-            patterns.push(line.as_bytes().to_vec());
+    /// Load patterns from a file, auto-detecting FASTA vs plain text.
+    fn load_patterns(path: &str) -> Result<Vec<Pattern>> {
+        if Self::is_fasta(path)? {
+            let mut reader = fasta::Reader::from_path(path)?;
+            let mut rset = fasta::RecordSet::default();
+            let mut patterns = Vec::new();
+
+            while rset.fill(&mut reader)? {
+                for record in rset.iter() {
+                    let record = record?;
+                    patterns.push(Pattern {
+                        name: Some(record.id_str().to_string()),
+                        sequence: record.seq().into_owned(),
+                    });
+                }
+            }
+
+            Ok(patterns)
+        } else {
+            let contents = std::fs::read_to_string(path)?;
+            Ok(contents
+                .lines()
+                .map(|line| Pattern {
+                    name: None,
+                    sequence: line.as_bytes().to_vec(),
+                })
+                .collect())
         }
-        Ok(patterns)
+    }
+
+    fn patterns(&self, filetype: PatternFileType) -> Result<Vec<Pattern>> {
+        let path = self.file_path(filetype)?;
+        Self::load_patterns(path)
     }
 }
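
The auto-detection in `is_fasta` can be exercised without touching the filesystem. The sketch below restates the same check over any `Read` source, using only the standard library; the function names `starts_like_fasta` and `bytes_look_like_fasta` and the `limit` parameter are made up for this illustration and are not part of the commit.

```rust
use std::io::{self, BufReader, Cursor, Read};

/// Returns true if the first byte that is not `\n` or `\r`, searched for
/// within the first `limit` bytes of the input, is `>`.
pub fn starts_like_fasta<R: Read>(reader: R, limit: usize) -> io::Result<bool> {
    for byte in BufReader::new(reader).bytes().take(limit) {
        let b = byte?;
        if b != b'\n' && b != b'\r' {
            // The first meaningful byte decides the format.
            return Ok(b == b'>');
        }
    }
    // Empty input, or nothing but line breaks within the limit.
    Ok(false)
}

/// Hypothetical convenience wrapper for in-memory data, with the same
/// 10-byte window used in the commit.
pub fn bytes_look_like_fasta(data: &[u8]) -> bool {
    // Reading from a `Cursor` over a slice does not produce I/O errors.
    starts_like_fasta(Cursor::new(data), 10).unwrap_or(false)
}
```

One consequence of this check, as far as the shown code goes: a plain text pattern file whose first non-newline character is `>` would be routed to the FASTA reader, and leading spaces or tabs before `>` make the file count as plain text.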

src/cli/input.rs

Lines changed: 2 additions & 4 deletions
@@ -272,17 +272,15 @@ impl Span {
         if let Some(start) = self.start {
             if start > max_records {
                 error!(
-                    "Provided start ({}) exceeds maximum number of records ({})",
-                    start, max_records
+                    "Provided start ({start}) exceeds maximum number of records ({max_records})"
                 );
                 bail!("Maximum number of records exceeded")
             }
         }
         if let Some(end) = self.end {
             if end > max_records {
                 warn!(
-                    "Clipping provided endpoint ({}) to maximum number of records ({})",
-                    end, max_records
+                    "Clipping provided endpoint ({end}) to maximum number of records ({max_records})"
                 );
             }
             self.end = Some(end.min(max_records));

src/cli/output.rs

Lines changed: 6 additions & 6 deletions
@@ -364,12 +364,12 @@ impl BinseqMode {
         }
     }
 }
-impl Into<binseq::write::Format> for BinseqMode {
-    fn into(self) -> binseq::write::Format {
-        match self {
-            Self::Bq => binseq::write::Format::Bq,
-            Self::Vbq => binseq::write::Format::Vbq,
-            Self::Cbq => binseq::write::Format::Cbq,
+impl From<BinseqMode> for binseq::write::Format {
+    fn from(val: BinseqMode) -> Self {
+        match val {
+            BinseqMode::Bq => binseq::write::Format::Bq,
+            BinseqMode::Vbq => binseq::write::Format::Vbq,
+            BinseqMode::Cbq => binseq::write::Format::Cbq,
         }
     }
 }

src/commands/cat/mod.rs

Lines changed: 4 additions & 4 deletions
@@ -80,9 +80,9 @@ fn record_vbq_header(paths: &[String]) -> Result<vbq::FileHeader> {
     for path in &paths[1..] {
         let reader = vbq::MmapReader::new(path)?;
         if reader.header() != header {
-            error!("Inconsistent header found for path: {}", path);
+            error!("Inconsistent header found for path: {path}");
             warn!("Note: The first VBQ used in `cat` will be considered as the reference header. All subsequent VBQs must have the same header.");
-            bail!("Inconsistent header found for path: {}", path);
+            bail!("Inconsistent header found for path: {path}");
         }
     }
     Ok(header)
@@ -97,9 +97,9 @@ fn record_cbq_header(paths: &[String]) -> Result<cbq::FileHeader> {
     for path in &paths[1..] {
         let reader = cbq::MmapReader::new(path)?;
         if reader.header() != header {
-            error!("Inconsistent header found for path: {}", path);
+            error!("Inconsistent header found for path: {path}");
             warn!("Note: The first CBQ used in `cat` will be considered as the reference header. All subsequent CBQs must have the same header.");
-            bail!("Inconsistent header found for path: {}", path);
+            bail!("Inconsistent header found for path: {path}");
         }
     }
     Ok(header)

src/commands/decode/mod.rs

Lines changed: 1 addition & 1 deletion
@@ -65,7 +65,7 @@ pub fn run(args: &DecodeCommand) -> Result<()> {
             proc.clone(),
             args.output.threads(),
             span.get_range(num_records)?,
-        )?
+        )?;
     } else {
         reader.process_parallel(proc.clone(), args.output.threads())?;
     }

src/commands/encode/encode.rs

Lines changed: 3 additions & 3 deletions
@@ -55,7 +55,7 @@ pub fn encode_collection(
         }
     } else {
         bail!("All input files must have the same format.");
-    };
+    }
     let ohandle = match_output(opath)?;
     let mut builder = BinseqWriterBuilder::new(mode.into())
         .block_size(config.block_size)
@@ -82,12 +82,12 @@ pub fn encode_collection(
             let inner = collection.inner_mut();
             let slen = get_sequence_len(&mut inner[0])?;
             let xlen = get_sequence_len(&mut inner[1])?;
-            builder = builder.slen(slen as u32).xlen(xlen as u32)
+            builder = builder.slen(slen as u32).xlen(xlen as u32);
         }
         fastx::CollectionType::Interleaved => {
             let inner = collection.inner_mut();
             let (slen, xlen) = get_interleaved_sequence_len(&mut inner[0])?;
-            builder = builder.slen(slen as u32).xlen(xlen as u32)
+            builder = builder.slen(slen).xlen(xlen);
         }
         _ => {
             bail!("Unsupported collection type found in `encode_collection_bq`");

src/commands/encode/processor.rs

Lines changed: 2 additions & 2 deletions
@@ -29,8 +29,8 @@ impl<W: Write + Send> Clone for Encoder<W> {
     fn clone(&self) -> Self {
         Self {
             t_writer: self.t_writer.clone(),
-            t_count: self.t_count.clone(),
-            t_skip: self.t_skip.clone(),
+            t_count: self.t_count,
+            t_skip: self.t_skip,
             writer: self.writer.clone(),
             count: self.count.clone(),
             skip: self.skip.clone(),
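
Dropping `.clone()` on `t_count` and `t_skip` only compiles if those fields are `Copy`, since a field cannot otherwise be moved out of `&self`. The field types are not visible in this hunk, so the sketch below uses a hypothetical `Counter` struct with one `Copy` field and one heap-backed field to show the distinction.

```rust
/// Hypothetical struct: `local_count` is `Copy`, `buffer` is not.
#[derive(Debug, Default, PartialEq)]
pub struct Counter {
    pub local_count: usize,
    pub buffer: Vec<u8>,
}

impl Clone for Counter {
    fn clone(&self) -> Self {
        Self {
            // `usize` is `Copy`: reading the field through `&self` copies it,
            // so an explicit `.clone()` adds nothing.
            local_count: self.local_count,
            // `Vec<u8>` owns heap memory and must be cloned explicitly.
            buffer: self.buffer.clone(),
        }
    }
}
```

The behaviour is unchanged; the shorter form simply makes it visible which fields are cheap value copies and which involve real duplication.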
