Skip to content

Conversation

@cardmagic
Copy link
Owner

Summary

  • Add Streaming module for memory-efficient batch training across all classifiers
  • Implement Brand's incremental SVD algorithm for LSI to avoid full recomputation
  • Add native C extension for fast incremental SVD operations
  • Standardize train_batch API across Bayes, KNN, LSI, and Logistic Regression

Features

Streaming Training

  • train_from_stream(category, io) - Train from IO streams (files, etc.)
  • train_batch(category, documents) - Train with progress callbacks
  • save_checkpoint / list_checkpoints / delete_checkpoint - Checkpoint support
  • LineReader - Efficient batched line reading with line count estimation
  • Progress - Progress tracking with completion percentage

Incremental SVD (LSI)

  • Brand's algorithm for adding documents without full SVD recomputation
  • Native C implementation for 5-50x faster matrix operations
  • Automatic fallback to full rebuild when vocabulary grows >20%
  • Batch re-projection for consistent LSI vectors

API Consistency

  • All classifiers now support train_batch with same signature
  • KNN adds train alias matching other classifiers

Test plan

  • Unit tests for Streaming module (LineReader, Progress)
  • Integration tests for streaming training on each classifier
  • Incremental SVD correctness tests
  • Checkpoint save/load tests
  • All 558 tests passing
  • Rubocop clean
  • Steep type checking (2 warnings, 0 errors)

…ency

- Add `train` as alias for `add` method
- Add dynamic `train_*` methods (train_spam, train_ham, etc.)
- Standardize `classify` to return String instead of String|Symbol
- Standardize `categories` to return Array[String]
- Add tests for new API consistency features

This enables writing classifier-agnostic code that works with
Bayes, LogisticRegression, and kNN interchangeably.
Strip inline step comments that restate what the code does while
keeping method-level RDoc descriptions and @RBS type annotations.
Simplifies code without losing documentation value.
Reduce Steep errors from 183 to 2 warnings by:
- Adding Matrix/Vector methods to sig/vendor/matrix.rbs
- Fixing train_batch signatures to allow nil optional params
- Adding type assertions for type narrowing in streaming code
- Using explicit type casts for storage.path access
- Fixing Enumerable generic type for LineReader

The remaining 2 warnings are Steep limitations with restarg (*).
Ensures all classifiers (Bayes, LSI, KNN, LogisticRegression, TFIDF)
include the Streaming module and respond to required methods. This
catches missing implementations when new classifiers are added.
Replace hardcoded CLASSIFIERS list with dynamic discovery using
Classifier.constants. Classes with `classify` or `transform` methods
are considered classifiers/vectorizers that must include Streaming.

New classifiers are automatically tested when added to the module.
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Dec 29, 2025

Greptile Summary

This PR adds streaming training capabilities and incremental SVD support to the classifier gem, enabling memory-efficient batch processing across all classifiers.

Key changes:

  • New Streaming module provides train_from_stream, train_batch, and checkpoint management
  • Brand's incremental SVD algorithm for LSI avoids full recomputation when adding documents
  • Native C extension delivers 5-50x speedup for matrix operations
  • Consistent train_batch API across Bayes, KNN, LSI, and Logistic Regression classifiers

Code quality issues found:

  • Multiple instances of nested conditionals that violate the early return pattern (see inline comments on lib/classifier/lsi.rb:229, lib/classifier/lsi.rb:925, lib/classifier/lsi.rb:992, lib/classifier/streaming/progress.rb:64, lib/classifier/streaming/line_reader.rb:68)
  • These nested structures reduce readability and should be refactored to use guard clauses

The implementation is functionally sound with comprehensive test coverage (558 tests passing). The incremental SVD algorithm is correctly implemented and the streaming architecture is well-designed.

Confidence Score: 4/5

  • This PR is safe to merge with minor style improvements recommended
  • Score reflects solid implementation with comprehensive test coverage (558 tests passing) and correct algorithm implementation. The issues found are purely stylistic (nested conditionals) and don't affect functionality. The C extension code is well-structured. Deducting one point for the multiple nested conditional violations that reduce code maintainability.
  • Pay attention to lib/classifier/lsi.rb which has the most conditional nesting issues that should be refactored

Important Files Changed

Filename Overview
lib/classifier/streaming/line_reader.rb Memory-efficient line reader - has deeply nested conditionals in estimate_line_count that should be simplified
lib/classifier/streaming/progress.rb Progress tracking implementation - has nested conditionals in eta method that should use early returns
lib/classifier/lsi.rb LSI classifier with incremental SVD - multiple nested conditionals need refactoring to early return pattern
lib/classifier/lsi/incremental_svd.rb Brand's incremental SVD algorithm - clean implementation, algorithm comments are valuable context
ext/classifier/incremental_svd.c Native C extension for fast matrix operations - well-structured, performance-critical code

Sequence Diagram

sequenceDiagram
    participant User
    participant Classifier
    participant Streaming
    participant LineReader
    participant Progress
    participant Storage

    User->>Classifier: train_from_stream(category, io)
    Classifier->>LineReader: new(io, batch_size)
    Classifier->>Progress: new(total: estimated_lines)
    
    loop For each batch
        LineReader->>LineReader: each_batch
        LineReader-->>Classifier: batch of documents
        Classifier->>Classifier: train_batch_internal(category, batch)
        Classifier->>Classifier: synchronize { update internal state }
        Classifier->>Progress: completed += batch.size
        Progress-->>User: yield progress (percent, rate, eta)
    end
    
    opt Checkpoint requested
        User->>Classifier: save_checkpoint(checkpoint_id)
        Classifier->>Storage: checkpoint_storage_for(id)
        Classifier->>Storage: save to checkpoint
        Storage-->>Classifier: checkpoint saved
    end
    
    alt Incremental SVD (LSI only)
        Classifier->>Classifier: perform_incremental_update(node, word_hash)
        Classifier->>Classifier: check vocabulary_growth_threshold
        alt Vocabulary growth < 20%
            Classifier->>Classifier: extend_vocabulary_for_incremental
            Classifier->>Classifier: IncrementalSVD.update(U, s, vector)
            Classifier->>Classifier: reproject_all_documents (if rank changed)
        else Vocabulary growth >= 20%
            Classifier->>Classifier: disable_incremental_mode
            Classifier->>Classifier: build_index (full SVD)
        end
    else Full rebuild
        Classifier->>Classifier: build_index
    end
Loading

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

20 files reviewed, 24 comments

Edit Code Review Agent Settings | Greptile

Comment on lines 88 to 89
return nil if sample_lines.zero?
return nil unless @io.respond_to?(:size)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: redundant check - @io.respond_to?(:size) already verified on line 69

Suggested change
return nil if sample_lines.zero?
return nil unless @io.respond_to?(:size)
avg_line_size = sample_bytes.to_f / sample_lines
io_size = @io.__send__(:size) #: Integer
(io_size / avg_line_size).round
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/streaming/line_reader.rb
Line: 88:89

Comment:
**style:** redundant check - `@io.respond_to?(:size)` already verified on line 69

```suggestion
          avg_line_size = sample_bytes.to_f / sample_lines
          io_size = @io.__send__(:size) #: Integer
          (io_size / avg_line_size).round
```

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +33 to +41
# Updates SVD with a new document vector using Brand's algorithm.
#
# @param u [Matrix] current left singular vectors (m × k)
# @param s [Array<Float>] current singular values (k values)
# @param c [Vector] new document vector (m × 1)
# @param max_rank [Integer] maximum rank to maintain
# @param epsilon [Float] threshold for zero detection
# @return [Array<Matrix, Array<Float>>] updated [u, s]
#
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: rdoc comment restates parameters already defined in method signature and @rbs annotation

Suggested change
# Updates SVD with a new document vector using Brand's algorithm.
#
# @param u [Matrix] current left singular vectors (m × k)
# @param s [Array<Float>] current singular values (k values)
# @param c [Vector] new document vector (m × 1)
# @param max_rank [Integer] maximum rank to maintain
# @param epsilon [Float] threshold for zero detection
# @return [Array<Matrix, Array<Float>>] updated [u, s]
#
# Updates SVD with a new document vector using Brand's algorithm.
#
# @rbs (Matrix, Array[Float], Vector, max_rank: Integer, ?epsilon: Float) -> [Matrix, Array[Float]]

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/lsi/incremental_svd.rb
Line: 33:41

Comment:
**style:** rdoc comment restates parameters already defined in method signature and `@rbs` annotation

```suggestion
        # Updates SVD with a new document vector using Brand's algorithm.
        #
        # @rbs (Matrix, Array[Float], Vector, max_rank: Integer, ?epsilon: Float) -> [Matrix, Array[Float]]
```

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +56 to +62
# Projects a document vector onto the semantic space defined by U.
# Returns the LSI representation: lsi_vec = U^T * raw_vec
#
# @param u [Matrix] left singular vectors (m × k)
# @param raw_vec [Vector] document vector in term space (m × 1)
# @return [Vector] document in semantic space (k × 1)
#
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: comment restates what the method name already makes obvious

Suggested change
# Projects a document vector onto the semantic space defined by U.
# Returns the LSI representation: lsi_vec = U^T * raw_vec
#
# @param u [Matrix] left singular vectors (m × k)
# @param raw_vec [Vector] document vector in term space (m × 1)
# @return [Vector] document in semantic space (k × 1)
#
# @rbs (Matrix, Vector) -> Vector
def project(u, raw_vec)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/lsi/incremental_svd.rb
Line: 56:62

Comment:
**style:** comment restates what the method name already makes obvious

```suggestion
        # @rbs (Matrix, Vector) -> Vector
        def project(u, raw_vec)
```

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines 229 to 235
# Use incremental update if enabled and we have a U matrix
if @incremental_mode && @u_matrix
perform_incremental_update(node, clean_word_hash)
elsif @auto_rebuild
build_index
end
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: nested conditionals - use early return pattern

Suggested change
# Use incremental update if enabled and we have a U matrix
if @incremental_mode && @u_matrix
perform_incremental_update(node, clean_word_hash)
elsif @auto_rebuild
build_index
end
end
# Use incremental update if enabled and we have a U matrix
return perform_incremental_update(node, clean_word_hash) if @incremental_mode && @u_matrix
build_index if @auto_rebuild

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/lsi.rb
Line: 229:235

Comment:
**style:** nested conditionals - use early return pattern

```suggestion
      # Use incremental update if enabled and we have a U matrix
      return perform_incremental_update(node, clean_word_hash) if @incremental_mode && @u_matrix
      
      build_index if @auto_rebuild
```

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +247 to +252

/*
* Matrix.vstack(top, bottom)
* Vertically stack two matrices
*/
static VALUE rb_cmatrix_vstack(VALUE klass, VALUE rb_top, VALUE rb_bottom)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: comment restates function name

Suggested change
/*
* Matrix.vstack(top, bottom)
* Vertically stack two matrices
*/
static VALUE rb_cmatrix_vstack(VALUE klass, VALUE rb_top, VALUE rb_bottom)
static VALUE rb_cmatrix_vstack(VALUE klass, VALUE rb_top, VALUE rb_bottom)
{

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: ext/classifier/incremental_svd.c
Line: 247:252

Comment:
**style:** comment restates function name

```suggestion
static VALUE rb_cmatrix_vstack(VALUE klass, VALUE rb_top, VALUE rb_bottom)
{
```

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +265 to +270
/*
* Matrix.zeros(rows, cols)
* Create a zero matrix
*/
static VALUE rb_cmatrix_zeros(VALUE klass, VALUE rb_rows, VALUE rb_cols)
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: comment restates function name

Suggested change
/*
* Matrix.zeros(rows, cols)
* Create a zero matrix
*/
static VALUE rb_cmatrix_zeros(VALUE klass, VALUE rb_rows, VALUE rb_cols)
{
static VALUE rb_cmatrix_zeros(VALUE klass, VALUE rb_rows, VALUE rb_cols)
{

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: ext/classifier/incremental_svd.c
Line: 265:270

Comment:
**style:** comment restates function name

```suggestion
static VALUE rb_cmatrix_zeros(VALUE klass, VALUE rb_rows, VALUE rb_cols)
{
```

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +280 to +284
/*
* Vector#-(other)
* Vector subtraction
*/
static VALUE rb_cvector_subtract(VALUE self, VALUE other)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: comment restates function signature

Suggested change
/*
* Vector#-(other)
* Vector subtraction
*/
static VALUE rb_cvector_subtract(VALUE self, VALUE other)
static VALUE rb_cvector_subtract(VALUE self, VALUE other)
{

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: ext/classifier/incremental_svd.c
Line: 280:284

Comment:
**style:** comment restates function signature

```suggestion
static VALUE rb_cvector_subtract(VALUE self, VALUE other)
{
```

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +300 to +307

/*
* Matrix#batch_project(vectors_array)
* Project multiple vectors onto this matrix (as U)
* Returns array of projected vectors
*
* This is the high-performance batch operation for re-projecting documents
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: "Project multiple vectors onto this matrix (as U)" and "Returns array of projected vectors" restates signature - only "performance-critical operation" comment adds value

Suggested change
/*
* Matrix#batch_project(vectors_array)
* Project multiple vectors onto this matrix (as U)
* Returns array of projected vectors
*
* This is the high-performance batch operation for re-projecting documents
*/
/*
* Matrix#batch_project(vectors_array)
* High-performance batch operation for re-projecting documents
*/

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: ext/classifier/incremental_svd.c
Line: 300:307

Comment:
**style:** "Project multiple vectors onto this matrix (as U)" and "Returns array of projected vectors" restates signature - only "performance-critical operation" comment adds value

```suggestion
/*
 * Matrix#batch_project(vectors_array)
 * High-performance batch operation for re-projecting documents
 */
```

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +345 to +350

/*
* Matrix#incremental_svd_update(singular_values, new_vector, max_rank, epsilon)
* Perform one Brand's incremental SVD update
* Returns [new_u, new_singular_values]
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: comment restates function signature

Suggested change
/*
* Matrix#incremental_svd_update(singular_values, new_vector, max_rank, epsilon)
* Perform one Brand's incremental SVD update
* Returns [new_u, new_singular_values]
*/
static VALUE rb_cmatrix_incremental_update(VALUE self, VALUE rb_s, VALUE rb_c,
VALUE rb_max_rank, VALUE rb_epsilon)
{

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: ext/classifier/incremental_svd.c
Line: 345:350

Comment:
**style:** comment restates function signature

```suggestion
static VALUE rb_cmatrix_incremental_update(VALUE self, VALUE rb_s, VALUE rb_c,
                                            VALUE rb_max_rank, VALUE rb_epsilon)
{
```

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

Apply same dynamic discovery pattern to StorageAPIConsistencyTest.
New classifiers are automatically tested for storage API methods.
Address PR review feedback:
- Remove duplicate respond_to?(:size) check in LineReader
- Convert nested if/elsif to early return in LSI#add_item
- Use next to flatten synchronize block in perform_incremental_update
@cardmagic
Copy link
Owner Author

@greptile-apps Please re-review. I've addressed the legitimate code fixes (redundant check, nested conditionals), but I'm keeping all the algorithm documentation comments - they explain Brand's incremental SVD algorithm which is not obvious from the code alone. These comments are important for future maintainers.

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

21 files reviewed, 5 comments

Edit Code Review Agent Settings | Greptile

Comment on lines +229 to 233
# Use incremental update if enabled and we have a U matrix
return perform_incremental_update(node, clean_word_hash) if @incremental_mode && @u_matrix

build_index if @auto_rebuild
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: nested conditionals - use early return

Suggested change
# Use incremental update if enabled and we have a U matrix
return perform_incremental_update(node, clean_word_hash) if @incremental_mode && @u_matrix
build_index if @auto_rebuild
end
# Use incremental update if enabled and we have a U matrix
if @incremental_mode && @u_matrix
perform_incremental_update(node, clean_word_hash)
return
end
build_index if @auto_rebuild

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/lsi.rb
Line: 229:233

Comment:
**style:** nested conditionals - use early return

```suggestion
    # Use incremental update if enabled and we have a U matrix
    if @incremental_mode && @u_matrix
      perform_incremental_update(node, clean_word_hash)
      return
    end

    build_index if @auto_rebuild
```

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +64 to +70
def eta
return nil unless @total
return nil if rate.zero?
return 0.0 if @completed >= @total

(@total - @completed) / rate
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: nested conditionals - use early returns

Suggested change
def eta
return nil unless @total
return nil if rate.zero?
return 0.0 if @completed >= @total
(@total - @completed) / rate
end
# @rbs () -> Float?
def eta
return nil unless @total
return nil if rate.zero?
return 0.0 if @completed >= @total
(@total - @completed) / rate
end

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/streaming/progress.rb
Line: 64:70

Comment:
**style:** nested conditionals - use early returns

```suggestion
    # @rbs () -> Float?
    def eta
      return nil unless @total
      return nil if rate.zero?
      return 0.0 if @completed >= @total

      (@total - @completed) / rate
    end
```

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +68 to +96
def estimate_line_count(sample_size: 100)
return nil unless @io.respond_to?(:size) && @io.respond_to?(:rewind)

begin
original_pos = @io.pos
@io.rewind

sample_bytes = 0
sample_lines = 0

sample_size.times do
line = @io.gets
break unless line

sample_bytes += line.bytesize
sample_lines += 1
end

@io.seek(original_pos)

return nil if sample_lines.zero?

avg_line_size = sample_bytes.to_f / sample_lines
io_size = @io.__send__(:size) #: Integer
(io_size / avg_line_size).round
rescue IOError, Errno::ESPIPE
nil
end
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: deeply nested conditionals and try/catch - simplify structure

Suggested change
def estimate_line_count(sample_size: 100)
return nil unless @io.respond_to?(:size) && @io.respond_to?(:rewind)
begin
original_pos = @io.pos
@io.rewind
sample_bytes = 0
sample_lines = 0
sample_size.times do
line = @io.gets
break unless line
sample_bytes += line.bytesize
sample_lines += 1
end
@io.seek(original_pos)
return nil if sample_lines.zero?
avg_line_size = sample_bytes.to_f / sample_lines
io_size = @io.__send__(:size) #: Integer
(io_size / avg_line_size).round
rescue IOError, Errno::ESPIPE
nil
end
end
def estimate_line_count(sample_size: 100)
return nil unless @io.respond_to?(:size) && @io.respond_to?(:rewind)
original_pos = @io.pos
@io.rewind
sample_bytes = 0
sample_lines = 0
sample_size.times do
line = @io.gets
break unless line
sample_bytes += line.bytesize
sample_lines += 1
end
@io.seek(original_pos)
return nil if sample_lines.zero?
avg_line_size = sample_bytes.to_f / sample_lines
io_size = @io.__send__(:size) #: Integer
(io_size / avg_line_size).round
rescue IOError, Errno::ESPIPE
nil
end

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/streaming/line_reader.rb
Line: 68:96

Comment:
**style:** deeply nested conditionals and try/catch - simplify structure

```suggestion
    def estimate_line_count(sample_size: 100)
      return nil unless @io.respond_to?(:size) && @io.respond_to?(:rewind)

      original_pos = @io.pos
      @io.rewind

      sample_bytes = 0
      sample_lines = 0

      sample_size.times do
        line = @io.gets
        break unless line

        sample_bytes += line.bytesize
        sample_lines += 1
      end

      @io.seek(original_pos)
      return nil if sample_lines.zero?

      avg_line_size = sample_bytes.to_f / sample_lines
      io_size = @io.__send__(:size) #: Integer
      (io_size / avg_line_size).round
    rescue IOError, Errno::ESPIPE
      nil
    end
```

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines 63 to 67
# @rbs (Matrix, Vector) -> Vector
def project(u, raw_vec)
result = u.transpose * raw_vec
result.is_a?(Vector) ? result : Vector[*result.to_a.flatten]
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: unnecessary defensive code - u.transpose * raw_vec already returns a Vector, the ternary is handling an impossible scenario

Suggested change
# @rbs (Matrix, Vector) -> Vector
def project(u, raw_vec)
result = u.transpose * raw_vec
result.is_a?(Vector) ? result : Vector[*result.to_a.flatten]
end
# @rbs (Matrix, Vector) -> Vector
def project(u, raw_vec)
u.transpose * raw_vec
end
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/lsi/incremental_svd.rb
Line: 63:67

Comment:
**style:** unnecessary defensive code - `u.transpose * raw_vec` already returns a Vector, the ternary is handling an impossible scenario

```suggestion
    # @rbs (Matrix, Vector) -> Vector
    def project(u, raw_vec)
      u.transpose * raw_vec
    end
```

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines 997 to 1002
if self.class.native_available? && @u_matrix.respond_to?(:batch_project)
reproject_all_documents_native
else
reproject_all_documents_ruby
end
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: unnecessary conditional - just call the native version when available

Suggested change
if self.class.native_available? && @u_matrix.respond_to?(:batch_project)
reproject_all_documents_native
else
reproject_all_documents_ruby
end
end
def reproject_all_documents
return unless @u_matrix
if self.class.native_available? && @u_matrix.respond_to?(:batch_project)
reproject_all_documents_native
return
end
reproject_all_documents_ruby
end

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/lsi.rb
Line: 997:1002

Comment:
**style:** unnecessary conditional - just call the native version when available

```suggestion
    def reproject_all_documents
      return unless @u_matrix

      if self.class.native_available? && @u_matrix.respond_to?(:batch_project)
        reproject_all_documents_native
        return
      end

      reproject_all_documents_ruby
    end
```

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

- Collapse multiline @RBS annotations to single lines (rbs-inline requirement)
- Add explicit @RBS! block for KNN alias declaration
- Increase RuboCop line length limit to 140 for type annotations
- Fix Style/IfUnlessModifier in tfidf.rb
- Remove unnecessary defensive code in IncrementalSVD#project
- Use early return in reproject_all_documents
@cardmagic
Copy link
Owner Author

@greptile-apps Please re-review. Addressed the remaining style feedback (simplified IncrementalSVD#project, early return in reproject_all_documents). The line_reader begin/rescue is intentional for handling IO errors on non-seekable streams.

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

22 files reviewed, 4 comments

Edit Code Review Agent Settings | Greptile

Comment on lines +64 to +69
def eta
return nil unless @total
return nil if rate.zero?
return 0.0 if @completed >= @total

(@total - @completed) / rate
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: nested conditionals - use early returns

Suggested change
def eta
return nil unless @total
return nil if rate.zero?
return 0.0 if @completed >= @total
(@total - @completed) / rate
def eta
return nil unless @total
return nil if rate.zero?
return 0.0 if @completed >= @total
(@total - @completed) / rate
end

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/streaming/progress.rb
Line: 64:69

Comment:
**style:** nested conditionals - use early returns

```suggestion
def eta
  return nil unless @total
  return nil if rate.zero?
  return 0.0 if @completed >= @total
  
  (@total - @completed) / rate
end
```

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +68 to +95
def estimate_line_count(sample_size: 100)
return nil unless @io.respond_to?(:size) && @io.respond_to?(:rewind)

begin
original_pos = @io.pos
@io.rewind

sample_bytes = 0
sample_lines = 0

sample_size.times do
line = @io.gets
break unless line

sample_bytes += line.bytesize
sample_lines += 1
end

@io.seek(original_pos)

return nil if sample_lines.zero?

avg_line_size = sample_bytes.to_f / sample_lines
io_size = @io.__send__(:size) #: Integer
(io_size / avg_line_size).round
rescue IOError, Errno::ESPIPE
nil
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: deeply nested try/catch and conditionals - simplify with early returns

Suggested change
def estimate_line_count(sample_size: 100)
return nil unless @io.respond_to?(:size) && @io.respond_to?(:rewind)
begin
original_pos = @io.pos
@io.rewind
sample_bytes = 0
sample_lines = 0
sample_size.times do
line = @io.gets
break unless line
sample_bytes += line.bytesize
sample_lines += 1
end
@io.seek(original_pos)
return nil if sample_lines.zero?
avg_line_size = sample_bytes.to_f / sample_lines
io_size = @io.__send__(:size) #: Integer
(io_size / avg_line_size).round
rescue IOError, Errno::ESPIPE
nil
end
def estimate_line_count(sample_size: 100)
return nil unless @io.respond_to?(:size) && @io.respond_to?(:rewind)
original_pos = @io.pos
@io.rewind
sample_bytes = 0
sample_lines = 0
sample_size.times do
line = @io.gets
break unless line
sample_bytes += line.bytesize
sample_lines += 1
end
@io.seek(original_pos)
return nil if sample_lines.zero?
avg_line_size = sample_bytes.to_f / sample_lines
io_size = @io.__send__(:size)
(io_size / avg_line_size).round
rescue IOError, Errno::ESPIPE
nil
end

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/streaming/line_reader.rb
Line: 68:95

Comment:
**style:** deeply nested try/catch and conditionals - simplify with early returns

```suggestion
def estimate_line_count(sample_size: 100)
  return nil unless @io.respond_to?(:size) && @io.respond_to?(:rewind)

  original_pos = @io.pos
  @io.rewind

  sample_bytes = 0
  sample_lines = 0

  sample_size.times do
    line = @io.gets
    break unless line

    sample_bytes += line.bytesize
    sample_lines += 1
  end

  @io.seek(original_pos)

  return nil if sample_lines.zero?

  avg_line_size = sample_bytes.to_f / sample_lines
  io_size = @io.__send__(:size)
  (io_size / avg_line_size).round
rescue IOError, Errno::ESPIPE
  nil
end
```

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +925 to +949
synchronize do
if vocabulary_growth_exceeds_threshold?(word_hash)
disable_incremental_mode!
needs_full_rebuild = true
next
end

old_rank = @u_matrix.column_size
extend_vocabulary_for_incremental(word_hash)
raw_vec = node.raw_vector_with(@word_list)
raw_vector = Vector[*raw_vec.to_a]

@u_matrix, @singular_values = IncrementalSVD.update(
@u_matrix, @singular_values, raw_vector, max_rank: @max_rank
)

new_rank = @u_matrix.column_size
if new_rank > old_rank
reproject_all_documents
else
assign_lsi_vector_incremental(node)
end

@built_at_version = @version
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: nested if within synchronize - use early returns

Suggested change
synchronize do
if vocabulary_growth_exceeds_threshold?(word_hash)
disable_incremental_mode!
needs_full_rebuild = true
next
end
old_rank = @u_matrix.column_size
extend_vocabulary_for_incremental(word_hash)
raw_vec = node.raw_vector_with(@word_list)
raw_vector = Vector[*raw_vec.to_a]
@u_matrix, @singular_values = IncrementalSVD.update(
@u_matrix, @singular_values, raw_vector, max_rank: @max_rank
)
new_rank = @u_matrix.column_size
if new_rank > old_rank
reproject_all_documents
else
assign_lsi_vector_incremental(node)
end
@built_at_version = @version
end
synchronize do
if vocabulary_growth_exceeds_threshold?(word_hash)
disable_incremental_mode!
needs_full_rebuild = true
next
end
old_rank = @u_matrix.column_size
extend_vocabulary_for_incremental(word_hash)
raw_vec = node.raw_vector_with(@word_list)
raw_vector = Vector[*raw_vec.to_a]
@u_matrix, @singular_values = IncrementalSVD.update(
@u_matrix, @singular_values, raw_vector, max_rank: @max_rank
)
new_rank = @u_matrix.column_size
if new_rank > old_rank
reproject_all_documents
else
assign_lsi_vector_incremental(node)
end
@built_at_version = @version
end

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/lsi.rb
Line: 925:949

Comment:
**style:** nested `if` within `synchronize` - use early returns

```suggestion
synchronize do
  if vocabulary_growth_exceeds_threshold?(word_hash)
    disable_incremental_mode!
    needs_full_rebuild = true
    next
  end

  old_rank = @u_matrix.column_size
  extend_vocabulary_for_incremental(word_hash)
  raw_vec = node.raw_vector_with(@word_list)
  raw_vector = Vector[*raw_vec.to_a]

  @u_matrix, @singular_values = IncrementalSVD.update(
    @u_matrix, @singular_values, raw_vector, max_rank: @max_rank
  )

  new_rank = @u_matrix.column_size
  if new_rank > old_rank
    reproject_all_documents
  else
    assign_lsi_vector_incremental(node)
  end

  @built_at_version = @version
end
```

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +992 to +997
def reproject_all_documents
return unless @u_matrix
return reproject_all_documents_native if self.class.native_available? && @u_matrix.respond_to?(:batch_project)

reproject_all_documents_ruby
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: unnecessary conditional chain - just call native version directly when conditions met

Suggested change
def reproject_all_documents
return unless @u_matrix
return reproject_all_documents_native if self.class.native_available? && @u_matrix.respond_to?(:batch_project)
reproject_all_documents_ruby
end
def reproject_all_documents
return unless @u_matrix
if self.class.native_available? && @u_matrix.respond_to?(:batch_project)
reproject_all_documents_native
else
reproject_all_documents_ruby
end
end

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/lsi.rb
Line: 992:997

Comment:
**style:** unnecessary conditional chain - just call native version directly when conditions met

```suggestion
def reproject_all_documents
  return unless @u_matrix
  
  if self.class.native_available? && @u_matrix.respond_to?(:batch_project)
    reproject_all_documents_native
  else
    reproject_all_documents_ruby
  end
end
```

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

- Make Matrix and Vector non-generic in sig/vendor/matrix.rbs
- Add type parameter to Enumerable include in LineReader (#[String])

This fixes InvalidTypeApplicationError from rbs validate.
- Make train_batch params nilable (?(Type)? syntax) in all classifiers
- Add _StreamingHost interface for Streaming module dependencies
- Fix TFIDF stub methods with proper RBS and steep:ignore
- Make Matrix/Vector non-generic in vendor RBS
- Add Enumerable type param to LineReader
@cardmagic cardmagic merged commit fe24d99 into master Dec 29, 2025
5 checks passed
@cardmagic cardmagic deleted the feature/streaming-training branch December 29, 2025 16:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant