-
Notifications
You must be signed in to change notification settings - Fork 124
feat: Add streaming training and incremental SVD support #109
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…ency - Add `train` as alias for `add` method - Add dynamic `train_*` methods (train_spam, train_ham, etc.) - Standardize `classify` to return String instead of String|Symbol - Standardize `categories` to return Array[String] - Add tests for new API consistency features This enables writing classifier-agnostic code that works with Bayes, LogisticRegression, and kNN interchangeably.
Strip inline step comments that restate what the code does while keeping method-level RDoc descriptions and @RBS type annotations. Simplifies code without losing documentation value.
Reduce Steep errors from 183 to 2 warnings by: - Adding Matrix/Vector methods to sig/vendor/matrix.rbs - Fixing train_batch signatures to allow nil optional params - Adding type assertions for type narrowing in streaming code - Using explicit type casts for storage.path access - Fixing Enumerable generic type for LineReader The remaining 2 warnings are Steep limitations with restarg (*).
Ensures all classifiers (Bayes, LSI, KNN, LogisticRegression, TFIDF) include the Streaming module and respond to required methods. This catches missing implementations when new classifiers are added.
Replace hardcoded CLASSIFIERS list with dynamic discovery using Classifier.constants. Classes with `classify` or `transform` methods are considered classifiers/vectorizers that must include Streaming. New classifiers are automatically tested when added to the module.
Greptile SummaryThis PR adds streaming training capabilities and incremental SVD support to the classifier gem, enabling memory-efficient batch processing across all classifiers. Key changes:
Code quality issues found:
The implementation is functionally sound with comprehensive test coverage (558 tests passing). The incremental SVD algorithm is correctly implemented and the streaming architecture is well-designed. Confidence Score: 4/5
Important Files Changed
Sequence DiagramsequenceDiagram
participant User
participant Classifier
participant Streaming
participant LineReader
participant Progress
participant Storage
User->>Classifier: train_from_stream(category, io)
Classifier->>LineReader: new(io, batch_size)
Classifier->>Progress: new(total: estimated_lines)
loop For each batch
LineReader->>LineReader: each_batch
LineReader-->>Classifier: batch of documents
Classifier->>Classifier: train_batch_internal(category, batch)
Classifier->>Classifier: synchronize { update internal state }
Classifier->>Progress: completed += batch.size
Progress-->>User: yield progress (percent, rate, eta)
end
opt Checkpoint requested
User->>Classifier: save_checkpoint(checkpoint_id)
Classifier->>Storage: checkpoint_storage_for(id)
Classifier->>Storage: save to checkpoint
Storage-->>Classifier: checkpoint saved
end
alt Incremental SVD (LSI only)
Classifier->>Classifier: perform_incremental_update(node, word_hash)
Classifier->>Classifier: check vocabulary_growth_threshold
alt Vocabulary growth < 20%
Classifier->>Classifier: extend_vocabulary_for_incremental
Classifier->>Classifier: IncrementalSVD.update(U, s, vector)
Classifier->>Classifier: reproject_all_documents (if rank changed)
else Vocabulary growth >= 20%
Classifier->>Classifier: disable_incremental_mode
Classifier->>Classifier: build_index (full SVD)
end
else Full rebuild
Classifier->>Classifier: build_index
end
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
20 files reviewed, 24 comments
| return nil if sample_lines.zero? | ||
| return nil unless @io.respond_to?(:size) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: redundant check - @io.respond_to?(:size) already verified on line 69
| return nil if sample_lines.zero? | |
| return nil unless @io.respond_to?(:size) | |
| avg_line_size = sample_bytes.to_f / sample_lines | |
| io_size = @io.__send__(:size) #: Integer | |
| (io_size / avg_line_size).round |
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/streaming/line_reader.rb
Line: 88:89
Comment:
**style:** redundant check - `@io.respond_to?(:size)` already verified on line 69
```suggestion
avg_line_size = sample_bytes.to_f / sample_lines
io_size = @io.__send__(:size) #: Integer
(io_size / avg_line_size).round
```
How can I resolve this? If you propose a fix, please make it concise.| # Updates SVD with a new document vector using Brand's algorithm. | ||
| # | ||
| # @param u [Matrix] current left singular vectors (m × k) | ||
| # @param s [Array<Float>] current singular values (k values) | ||
| # @param c [Vector] new document vector (m × 1) | ||
| # @param max_rank [Integer] maximum rank to maintain | ||
| # @param epsilon [Float] threshold for zero detection | ||
| # @return [Array<Matrix, Array<Float>>] updated [u, s] | ||
| # |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: rdoc comment restates parameters already defined in method signature and @rbs annotation
| # Updates SVD with a new document vector using Brand's algorithm. | |
| # | |
| # @param u [Matrix] current left singular vectors (m × k) | |
| # @param s [Array<Float>] current singular values (k values) | |
| # @param c [Vector] new document vector (m × 1) | |
| # @param max_rank [Integer] maximum rank to maintain | |
| # @param epsilon [Float] threshold for zero detection | |
| # @return [Array<Matrix, Array<Float>>] updated [u, s] | |
| # | |
| # Updates SVD with a new document vector using Brand's algorithm. | |
| # | |
| # @rbs (Matrix, Array[Float], Vector, max_rank: Integer, ?epsilon: Float) -> [Matrix, Array[Float]] |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/lsi/incremental_svd.rb
Line: 33:41
Comment:
**style:** rdoc comment restates parameters already defined in method signature and `@rbs` annotation
```suggestion
# Updates SVD with a new document vector using Brand's algorithm.
#
# @rbs (Matrix, Array[Float], Vector, max_rank: Integer, ?epsilon: Float) -> [Matrix, Array[Float]]
```
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.| # Projects a document vector onto the semantic space defined by U. | ||
| # Returns the LSI representation: lsi_vec = U^T * raw_vec | ||
| # | ||
| # @param u [Matrix] left singular vectors (m × k) | ||
| # @param raw_vec [Vector] document vector in term space (m × 1) | ||
| # @return [Vector] document in semantic space (k × 1) | ||
| # |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: comment restates what the method name already makes obvious
| # Projects a document vector onto the semantic space defined by U. | |
| # Returns the LSI representation: lsi_vec = U^T * raw_vec | |
| # | |
| # @param u [Matrix] left singular vectors (m × k) | |
| # @param raw_vec [Vector] document vector in term space (m × 1) | |
| # @return [Vector] document in semantic space (k × 1) | |
| # | |
| # @rbs (Matrix, Vector) -> Vector | |
| def project(u, raw_vec) |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/lsi/incremental_svd.rb
Line: 56:62
Comment:
**style:** comment restates what the method name already makes obvious
```suggestion
# @rbs (Matrix, Vector) -> Vector
def project(u, raw_vec)
```
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.| # Use incremental update if enabled and we have a U matrix | ||
| if @incremental_mode && @u_matrix | ||
| perform_incremental_update(node, clean_word_hash) | ||
| elsif @auto_rebuild | ||
| build_index | ||
| end | ||
| end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: nested conditionals - use early return pattern
| # Use incremental update if enabled and we have a U matrix | |
| if @incremental_mode && @u_matrix | |
| perform_incremental_update(node, clean_word_hash) | |
| elsif @auto_rebuild | |
| build_index | |
| end | |
| end | |
| # Use incremental update if enabled and we have a U matrix | |
| return perform_incremental_update(node, clean_word_hash) if @incremental_mode && @u_matrix | |
| build_index if @auto_rebuild |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/lsi.rb
Line: 229:235
Comment:
**style:** nested conditionals - use early return pattern
```suggestion
# Use incremental update if enabled and we have a U matrix
return perform_incremental_update(node, clean_word_hash) if @incremental_mode && @u_matrix
build_index if @auto_rebuild
```
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.|
|
||
| /* | ||
| * Matrix.vstack(top, bottom) | ||
| * Vertically stack two matrices | ||
| */ | ||
| static VALUE rb_cmatrix_vstack(VALUE klass, VALUE rb_top, VALUE rb_bottom) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: comment restates function name
| /* | |
| * Matrix.vstack(top, bottom) | |
| * Vertically stack two matrices | |
| */ | |
| static VALUE rb_cmatrix_vstack(VALUE klass, VALUE rb_top, VALUE rb_bottom) | |
| static VALUE rb_cmatrix_vstack(VALUE klass, VALUE rb_top, VALUE rb_bottom) | |
| { |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: ext/classifier/incremental_svd.c
Line: 247:252
Comment:
**style:** comment restates function name
```suggestion
static VALUE rb_cmatrix_vstack(VALUE klass, VALUE rb_top, VALUE rb_bottom)
{
```
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.| /* | ||
| * Matrix.zeros(rows, cols) | ||
| * Create a zero matrix | ||
| */ | ||
| static VALUE rb_cmatrix_zeros(VALUE klass, VALUE rb_rows, VALUE rb_cols) | ||
| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: comment restates function name
| /* | |
| * Matrix.zeros(rows, cols) | |
| * Create a zero matrix | |
| */ | |
| static VALUE rb_cmatrix_zeros(VALUE klass, VALUE rb_rows, VALUE rb_cols) | |
| { | |
| static VALUE rb_cmatrix_zeros(VALUE klass, VALUE rb_rows, VALUE rb_cols) | |
| { |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: ext/classifier/incremental_svd.c
Line: 265:270
Comment:
**style:** comment restates function name
```suggestion
static VALUE rb_cmatrix_zeros(VALUE klass, VALUE rb_rows, VALUE rb_cols)
{
```
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.| /* | ||
| * Vector#-(other) | ||
| * Vector subtraction | ||
| */ | ||
| static VALUE rb_cvector_subtract(VALUE self, VALUE other) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: comment restates function signature
| /* | |
| * Vector#-(other) | |
| * Vector subtraction | |
| */ | |
| static VALUE rb_cvector_subtract(VALUE self, VALUE other) | |
| static VALUE rb_cvector_subtract(VALUE self, VALUE other) | |
| { |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: ext/classifier/incremental_svd.c
Line: 280:284
Comment:
**style:** comment restates function signature
```suggestion
static VALUE rb_cvector_subtract(VALUE self, VALUE other)
{
```
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.|
|
||
| /* | ||
| * Matrix#batch_project(vectors_array) | ||
| * Project multiple vectors onto this matrix (as U) | ||
| * Returns array of projected vectors | ||
| * | ||
| * This is the high-performance batch operation for re-projecting documents | ||
| */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: "Project multiple vectors onto this matrix (as U)" and "Returns array of projected vectors" restates signature - only "performance-critical operation" comment adds value
| /* | |
| * Matrix#batch_project(vectors_array) | |
| * Project multiple vectors onto this matrix (as U) | |
| * Returns array of projected vectors | |
| * | |
| * This is the high-performance batch operation for re-projecting documents | |
| */ | |
| /* | |
| * Matrix#batch_project(vectors_array) | |
| * High-performance batch operation for re-projecting documents | |
| */ |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: ext/classifier/incremental_svd.c
Line: 300:307
Comment:
**style:** "Project multiple vectors onto this matrix (as U)" and "Returns array of projected vectors" restates signature - only "performance-critical operation" comment adds value
```suggestion
/*
* Matrix#batch_project(vectors_array)
* High-performance batch operation for re-projecting documents
*/
```
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.|
|
||
| /* | ||
| * Matrix#incremental_svd_update(singular_values, new_vector, max_rank, epsilon) | ||
| * Perform one Brand's incremental SVD update | ||
| * Returns [new_u, new_singular_values] | ||
| */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: comment restates function signature
| /* | |
| * Matrix#incremental_svd_update(singular_values, new_vector, max_rank, epsilon) | |
| * Perform one Brand's incremental SVD update | |
| * Returns [new_u, new_singular_values] | |
| */ | |
| static VALUE rb_cmatrix_incremental_update(VALUE self, VALUE rb_s, VALUE rb_c, | |
| VALUE rb_max_rank, VALUE rb_epsilon) | |
| { |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: ext/classifier/incremental_svd.c
Line: 345:350
Comment:
**style:** comment restates function signature
```suggestion
static VALUE rb_cmatrix_incremental_update(VALUE self, VALUE rb_s, VALUE rb_c,
VALUE rb_max_rank, VALUE rb_epsilon)
{
```
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.Apply same dynamic discovery pattern to StorageAPIConsistencyTest. New classifiers are automatically tested for storage API methods.
Address PR review feedback: - Remove duplicate respond_to?(:size) check in LineReader - Convert nested if/elsif to early return in LSI#add_item - Use next to flatten synchronize block in perform_incremental_update
|
@greptile-apps Please re-review. I've addressed the legitimate code fixes (redundant check, nested conditionals), but I'm keeping all the algorithm documentation comments - they explain Brand's incremental SVD algorithm which is not obvious from the code alone. These comments are important for future maintainers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
21 files reviewed, 5 comments
| # Use incremental update if enabled and we have a U matrix | ||
| return perform_incremental_update(node, clean_word_hash) if @incremental_mode && @u_matrix | ||
|
|
||
| build_index if @auto_rebuild | ||
| end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: nested conditionals - use early return
| # Use incremental update if enabled and we have a U matrix | |
| return perform_incremental_update(node, clean_word_hash) if @incremental_mode && @u_matrix | |
| build_index if @auto_rebuild | |
| end | |
| # Use incremental update if enabled and we have a U matrix | |
| if @incremental_mode && @u_matrix | |
| perform_incremental_update(node, clean_word_hash) | |
| return | |
| end | |
| build_index if @auto_rebuild |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/lsi.rb
Line: 229:233
Comment:
**style:** nested conditionals - use early return
```suggestion
# Use incremental update if enabled and we have a U matrix
if @incremental_mode && @u_matrix
perform_incremental_update(node, clean_word_hash)
return
end
build_index if @auto_rebuild
```
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.| def eta | ||
| return nil unless @total | ||
| return nil if rate.zero? | ||
| return 0.0 if @completed >= @total | ||
|
|
||
| (@total - @completed) / rate | ||
| end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: nested conditionals - use early returns
| def eta | |
| return nil unless @total | |
| return nil if rate.zero? | |
| return 0.0 if @completed >= @total | |
| (@total - @completed) / rate | |
| end | |
| # @rbs () -> Float? | |
| def eta | |
| return nil unless @total | |
| return nil if rate.zero? | |
| return 0.0 if @completed >= @total | |
| (@total - @completed) / rate | |
| end |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/streaming/progress.rb
Line: 64:70
Comment:
**style:** nested conditionals - use early returns
```suggestion
# @rbs () -> Float?
def eta
return nil unless @total
return nil if rate.zero?
return 0.0 if @completed >= @total
(@total - @completed) / rate
end
```
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.| def estimate_line_count(sample_size: 100) | ||
| return nil unless @io.respond_to?(:size) && @io.respond_to?(:rewind) | ||
|
|
||
| begin | ||
| original_pos = @io.pos | ||
| @io.rewind | ||
|
|
||
| sample_bytes = 0 | ||
| sample_lines = 0 | ||
|
|
||
| sample_size.times do | ||
| line = @io.gets | ||
| break unless line | ||
|
|
||
| sample_bytes += line.bytesize | ||
| sample_lines += 1 | ||
| end | ||
|
|
||
| @io.seek(original_pos) | ||
|
|
||
| return nil if sample_lines.zero? | ||
|
|
||
| avg_line_size = sample_bytes.to_f / sample_lines | ||
| io_size = @io.__send__(:size) #: Integer | ||
| (io_size / avg_line_size).round | ||
| rescue IOError, Errno::ESPIPE | ||
| nil | ||
| end | ||
| end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: deeply nested conditionals and try/catch - simplify structure
| def estimate_line_count(sample_size: 100) | |
| return nil unless @io.respond_to?(:size) && @io.respond_to?(:rewind) | |
| begin | |
| original_pos = @io.pos | |
| @io.rewind | |
| sample_bytes = 0 | |
| sample_lines = 0 | |
| sample_size.times do | |
| line = @io.gets | |
| break unless line | |
| sample_bytes += line.bytesize | |
| sample_lines += 1 | |
| end | |
| @io.seek(original_pos) | |
| return nil if sample_lines.zero? | |
| avg_line_size = sample_bytes.to_f / sample_lines | |
| io_size = @io.__send__(:size) #: Integer | |
| (io_size / avg_line_size).round | |
| rescue IOError, Errno::ESPIPE | |
| nil | |
| end | |
| end | |
| def estimate_line_count(sample_size: 100) | |
| return nil unless @io.respond_to?(:size) && @io.respond_to?(:rewind) | |
| original_pos = @io.pos | |
| @io.rewind | |
| sample_bytes = 0 | |
| sample_lines = 0 | |
| sample_size.times do | |
| line = @io.gets | |
| break unless line | |
| sample_bytes += line.bytesize | |
| sample_lines += 1 | |
| end | |
| @io.seek(original_pos) | |
| return nil if sample_lines.zero? | |
| avg_line_size = sample_bytes.to_f / sample_lines | |
| io_size = @io.__send__(:size) #: Integer | |
| (io_size / avg_line_size).round | |
| rescue IOError, Errno::ESPIPE | |
| nil | |
| end |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/streaming/line_reader.rb
Line: 68:96
Comment:
**style:** deeply nested conditionals and try/catch - simplify structure
```suggestion
def estimate_line_count(sample_size: 100)
return nil unless @io.respond_to?(:size) && @io.respond_to?(:rewind)
original_pos = @io.pos
@io.rewind
sample_bytes = 0
sample_lines = 0
sample_size.times do
line = @io.gets
break unless line
sample_bytes += line.bytesize
sample_lines += 1
end
@io.seek(original_pos)
return nil if sample_lines.zero?
avg_line_size = sample_bytes.to_f / sample_lines
io_size = @io.__send__(:size) #: Integer
(io_size / avg_line_size).round
rescue IOError, Errno::ESPIPE
nil
end
```
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.| # @rbs (Matrix, Vector) -> Vector | ||
| def project(u, raw_vec) | ||
| result = u.transpose * raw_vec | ||
| result.is_a?(Vector) ? result : Vector[*result.to_a.flatten] | ||
| end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: unnecessary defensive code - u.transpose * raw_vec already returns a Vector, the ternary is handling an impossible scenario
| # @rbs (Matrix, Vector) -> Vector | |
| def project(u, raw_vec) | |
| result = u.transpose * raw_vec | |
| result.is_a?(Vector) ? result : Vector[*result.to_a.flatten] | |
| end | |
| # @rbs (Matrix, Vector) -> Vector | |
| def project(u, raw_vec) | |
| u.transpose * raw_vec | |
| end |
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/lsi/incremental_svd.rb
Line: 63:67
Comment:
**style:** unnecessary defensive code - `u.transpose * raw_vec` already returns a Vector, the ternary is handling an impossible scenario
```suggestion
# @rbs (Matrix, Vector) -> Vector
def project(u, raw_vec)
u.transpose * raw_vec
end
```
How can I resolve this? If you propose a fix, please make it concise.
lib/classifier/lsi.rb
Outdated
| if self.class.native_available? && @u_matrix.respond_to?(:batch_project) | ||
| reproject_all_documents_native | ||
| else | ||
| reproject_all_documents_ruby | ||
| end | ||
| end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: unnecessary conditional - just call the native version when available
| if self.class.native_available? && @u_matrix.respond_to?(:batch_project) | |
| reproject_all_documents_native | |
| else | |
| reproject_all_documents_ruby | |
| end | |
| end | |
| def reproject_all_documents | |
| return unless @u_matrix | |
| if self.class.native_available? && @u_matrix.respond_to?(:batch_project) | |
| reproject_all_documents_native | |
| return | |
| end | |
| reproject_all_documents_ruby | |
| end |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/lsi.rb
Line: 997:1002
Comment:
**style:** unnecessary conditional - just call the native version when available
```suggestion
def reproject_all_documents
return unless @u_matrix
if self.class.native_available? && @u_matrix.respond_to?(:batch_project)
reproject_all_documents_native
return
end
reproject_all_documents_ruby
end
```
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.- Remove unnecessary defensive code in IncrementalSVD#project - Use early return in reproject_all_documents
|
@greptile-apps Please re-review. Addressed the remaining style feedback (simplified IncrementalSVD#project, early return in reproject_all_documents). The line_reader begin/rescue is intentional for handling IO errors on non-seekable streams. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
22 files reviewed, 4 comments
| def eta | ||
| return nil unless @total | ||
| return nil if rate.zero? | ||
| return 0.0 if @completed >= @total | ||
|
|
||
| (@total - @completed) / rate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: nested conditionals - use early returns
| def eta | |
| return nil unless @total | |
| return nil if rate.zero? | |
| return 0.0 if @completed >= @total | |
| (@total - @completed) / rate | |
| def eta | |
| return nil unless @total | |
| return nil if rate.zero? | |
| return 0.0 if @completed >= @total | |
| (@total - @completed) / rate | |
| end |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/streaming/progress.rb
Line: 64:69
Comment:
**style:** nested conditionals - use early returns
```suggestion
def eta
return nil unless @total
return nil if rate.zero?
return 0.0 if @completed >= @total
(@total - @completed) / rate
end
```
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.| def estimate_line_count(sample_size: 100) | ||
| return nil unless @io.respond_to?(:size) && @io.respond_to?(:rewind) | ||
|
|
||
| begin | ||
| original_pos = @io.pos | ||
| @io.rewind | ||
|
|
||
| sample_bytes = 0 | ||
| sample_lines = 0 | ||
|
|
||
| sample_size.times do | ||
| line = @io.gets | ||
| break unless line | ||
|
|
||
| sample_bytes += line.bytesize | ||
| sample_lines += 1 | ||
| end | ||
|
|
||
| @io.seek(original_pos) | ||
|
|
||
| return nil if sample_lines.zero? | ||
|
|
||
| avg_line_size = sample_bytes.to_f / sample_lines | ||
| io_size = @io.__send__(:size) #: Integer | ||
| (io_size / avg_line_size).round | ||
| rescue IOError, Errno::ESPIPE | ||
| nil | ||
| end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: deeply nested try/catch and conditionals - simplify with early returns
| def estimate_line_count(sample_size: 100) | |
| return nil unless @io.respond_to?(:size) && @io.respond_to?(:rewind) | |
| begin | |
| original_pos = @io.pos | |
| @io.rewind | |
| sample_bytes = 0 | |
| sample_lines = 0 | |
| sample_size.times do | |
| line = @io.gets | |
| break unless line | |
| sample_bytes += line.bytesize | |
| sample_lines += 1 | |
| end | |
| @io.seek(original_pos) | |
| return nil if sample_lines.zero? | |
| avg_line_size = sample_bytes.to_f / sample_lines | |
| io_size = @io.__send__(:size) #: Integer | |
| (io_size / avg_line_size).round | |
| rescue IOError, Errno::ESPIPE | |
| nil | |
| end | |
| def estimate_line_count(sample_size: 100) | |
| return nil unless @io.respond_to?(:size) && @io.respond_to?(:rewind) | |
| original_pos = @io.pos | |
| @io.rewind | |
| sample_bytes = 0 | |
| sample_lines = 0 | |
| sample_size.times do | |
| line = @io.gets | |
| break unless line | |
| sample_bytes += line.bytesize | |
| sample_lines += 1 | |
| end | |
| @io.seek(original_pos) | |
| return nil if sample_lines.zero? | |
| avg_line_size = sample_bytes.to_f / sample_lines | |
| io_size = @io.__send__(:size) | |
| (io_size / avg_line_size).round | |
| rescue IOError, Errno::ESPIPE | |
| nil | |
| end |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/streaming/line_reader.rb
Line: 68:95
Comment:
**style:** deeply nested try/catch and conditionals - simplify with early returns
```suggestion
def estimate_line_count(sample_size: 100)
return nil unless @io.respond_to?(:size) && @io.respond_to?(:rewind)
original_pos = @io.pos
@io.rewind
sample_bytes = 0
sample_lines = 0
sample_size.times do
line = @io.gets
break unless line
sample_bytes += line.bytesize
sample_lines += 1
end
@io.seek(original_pos)
return nil if sample_lines.zero?
avg_line_size = sample_bytes.to_f / sample_lines
io_size = @io.__send__(:size)
(io_size / avg_line_size).round
rescue IOError, Errno::ESPIPE
nil
end
```
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.| synchronize do | ||
| if vocabulary_growth_exceeds_threshold?(word_hash) | ||
| disable_incremental_mode! | ||
| needs_full_rebuild = true | ||
| next | ||
| end | ||
|
|
||
| old_rank = @u_matrix.column_size | ||
| extend_vocabulary_for_incremental(word_hash) | ||
| raw_vec = node.raw_vector_with(@word_list) | ||
| raw_vector = Vector[*raw_vec.to_a] | ||
|
|
||
| @u_matrix, @singular_values = IncrementalSVD.update( | ||
| @u_matrix, @singular_values, raw_vector, max_rank: @max_rank | ||
| ) | ||
|
|
||
| new_rank = @u_matrix.column_size | ||
| if new_rank > old_rank | ||
| reproject_all_documents | ||
| else | ||
| assign_lsi_vector_incremental(node) | ||
| end | ||
|
|
||
| @built_at_version = @version | ||
| end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: nested if within synchronize - use early returns
| synchronize do | |
| if vocabulary_growth_exceeds_threshold?(word_hash) | |
| disable_incremental_mode! | |
| needs_full_rebuild = true | |
| next | |
| end | |
| old_rank = @u_matrix.column_size | |
| extend_vocabulary_for_incremental(word_hash) | |
| raw_vec = node.raw_vector_with(@word_list) | |
| raw_vector = Vector[*raw_vec.to_a] | |
| @u_matrix, @singular_values = IncrementalSVD.update( | |
| @u_matrix, @singular_values, raw_vector, max_rank: @max_rank | |
| ) | |
| new_rank = @u_matrix.column_size | |
| if new_rank > old_rank | |
| reproject_all_documents | |
| else | |
| assign_lsi_vector_incremental(node) | |
| end | |
| @built_at_version = @version | |
| end | |
| synchronize do | |
| if vocabulary_growth_exceeds_threshold?(word_hash) | |
| disable_incremental_mode! | |
| needs_full_rebuild = true | |
| next | |
| end | |
| old_rank = @u_matrix.column_size | |
| extend_vocabulary_for_incremental(word_hash) | |
| raw_vec = node.raw_vector_with(@word_list) | |
| raw_vector = Vector[*raw_vec.to_a] | |
| @u_matrix, @singular_values = IncrementalSVD.update( | |
| @u_matrix, @singular_values, raw_vector, max_rank: @max_rank | |
| ) | |
| new_rank = @u_matrix.column_size | |
| if new_rank > old_rank | |
| reproject_all_documents | |
| else | |
| assign_lsi_vector_incremental(node) | |
| end | |
| @built_at_version = @version | |
| end |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/lsi.rb
Line: 925:949
Comment:
**style:** nested `if` within `synchronize` - use early returns
```suggestion
synchronize do
if vocabulary_growth_exceeds_threshold?(word_hash)
disable_incremental_mode!
needs_full_rebuild = true
next
end
old_rank = @u_matrix.column_size
extend_vocabulary_for_incremental(word_hash)
raw_vec = node.raw_vector_with(@word_list)
raw_vector = Vector[*raw_vec.to_a]
@u_matrix, @singular_values = IncrementalSVD.update(
@u_matrix, @singular_values, raw_vector, max_rank: @max_rank
)
new_rank = @u_matrix.column_size
if new_rank > old_rank
reproject_all_documents
else
assign_lsi_vector_incremental(node)
end
@built_at_version = @version
end
```
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.| def reproject_all_documents | ||
| return unless @u_matrix | ||
| return reproject_all_documents_native if self.class.native_available? && @u_matrix.respond_to?(:batch_project) | ||
|
|
||
| reproject_all_documents_ruby | ||
| end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: unnecessary conditional chain - just call native version directly when conditions met
| def reproject_all_documents | |
| return unless @u_matrix | |
| return reproject_all_documents_native if self.class.native_available? && @u_matrix.respond_to?(:batch_project) | |
| reproject_all_documents_ruby | |
| end | |
| def reproject_all_documents | |
| return unless @u_matrix | |
| if self.class.native_available? && @u_matrix.respond_to?(:batch_project) | |
| reproject_all_documents_native | |
| else | |
| reproject_all_documents_ruby | |
| end | |
| end |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/lsi.rb
Line: 992:997
Comment:
**style:** unnecessary conditional chain - just call native version directly when conditions met
```suggestion
def reproject_all_documents
return unless @u_matrix
if self.class.native_available? && @u_matrix.respond_to?(:batch_project)
reproject_all_documents_native
else
reproject_all_documents_ruby
end
end
```
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.- Make Matrix and Vector non-generic in sig/vendor/matrix.rbs - Add type parameter to Enumerable include in LineReader (#[String]) This fixes InvalidTypeApplicationError from rbs validate.
- Make train_batch params nilable (?(Type)? syntax) in all classifiers - Add _StreamingHost interface for Streaming module dependencies - Fix TFIDF stub methods with proper RBS and steep:ignore - Make Matrix/Vector non-generic in vendor RBS - Add Enumerable type param to LineReader
Summary
Streamingmodule for memory-efficient batch training across all classifierstrain_batchAPI across Bayes, KNN, LSI, and Logistic RegressionFeatures
Streaming Training
train_from_stream(category, io)- Train from IO streams (files, etc.)train_batch(category, documents)- Train with progress callbackssave_checkpoint/list_checkpoints/delete_checkpoint- Checkpoint supportLineReader- Efficient batched line reading with line count estimationProgress- Progress tracking with completion percentageIncremental SVD (LSI)
API Consistency
train_batchwith same signaturetrainalias matching other classifiersTest plan