feat: add save/load methods for classifier persistence#85
Conversation
There was a problem hiding this comment.
Pull request overview
This PR adds JSON-based serialization and file persistence capabilities to the Bayes and LSI classifiers, enabling models to be saved, loaded, and continued training after persistence.
Key changes:
- Implemented
as_json,to_json,from_json,save, andloadmethods for both Bayes and LSI classifiers - Used versioned JSON format for forward compatibility
- LSI serializes only source data (word_hash, categories) and rebuilds vectors on load for portability
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| test/test_helper.rb | Added tmpdir and json requires for persistence tests |
| test/lsi/lsi_test.rb | Added comprehensive tests for LSI serialization, round-trip persistence, and classification preservation |
| test/bayes/bayesian_test.rb | Added comprehensive tests for Bayes serialization, round-trip persistence, and classification preservation |
| lib/classifier/lsi.rb | Implemented serialization methods for LSI with source data preservation and index rebuilding |
| lib/classifier/bayes.rb | Implemented serialization methods for Bayes with full state restoration |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| { | ||
| version: 1, | ||
| type: 'bayes', | ||
| categories: @categories.transform_keys(&:to_s).transform_values { |v| v.transform_keys(&:to_s) }, | ||
| total_words: @total_words, | ||
| category_counts: @category_counts.transform_keys(&:to_s), | ||
| category_word_count: @category_word_count.transform_keys(&:to_s) |
There was a problem hiding this comment.
Nested hash transformations create intermediate objects. Consider using each_with_object to build the hash in a single pass for better performance with large category sets.
| { | |
| version: 1, | |
| type: 'bayes', | |
| categories: @categories.transform_keys(&:to_s).transform_values { |v| v.transform_keys(&:to_s) }, | |
| total_words: @total_words, | |
| category_counts: @category_counts.transform_keys(&:to_s), | |
| category_word_count: @category_word_count.transform_keys(&:to_s) | |
| categories = @categories.each_with_object({}) do |(key, value), result| | |
| result[key.to_s] = value.each_with_object({}) do |(inner_key, inner_value), inner_result| | |
| inner_result[inner_key.to_s] = inner_value | |
| end | |
| end | |
| category_counts = @category_counts.each_with_object({}) do |(key, value), result| | |
| result[key.to_s] = value | |
| end | |
| category_word_count = @category_word_count.each_with_object({}) do |(key, value), result| | |
| result[key.to_s] = value | |
| end | |
| { | |
| version: 1, | |
| type: 'bayes', | |
| categories: categories, | |
| total_words: @total_words, | |
| category_counts: category_counts, | |
| category_word_count: category_word_count |
| items_data = @items.transform_values do |node| | ||
| { | ||
| word_hash: node.word_hash.transform_keys(&:to_s), |
There was a problem hiding this comment.
Nested transformations with transform_values and transform_keys create multiple intermediate hashes. For large item sets, consider using each_with_object to build the structure in a single pass.
| items_data = @items.transform_values do |node| | |
| { | |
| word_hash: node.word_hash.transform_keys(&:to_s), | |
| items_data = @items.each_with_object({}) do |(item_key, node), acc| | |
| word_hash = {} | |
| node.word_hash.each do |word_key, value| | |
| word_hash[word_key.to_s] = value | |
| end | |
| acc[item_key] = { | |
| word_hash: word_hash, |
Provides a cleaner API than raw Marshal for persisting trained
classifiers. Users can now save training state and resume later:
classifier.save('model.json')
loaded = Classifier::Bayes.load('model.json')
loaded.train_spam('more data') # continue training
Both Bayes and LSI classifiers support:
- to_json / from_json for string serialization
- save(path) / load(path) for file operations
LSI serializes only source data (word_hash, categories), not computed
vectors. The index rebuilds on load, making JSON files portable across
GSL/non-GSL environments.
Closes #17
- Add as_json method that returns a Hash representation - Modify to_json to use as_json internally - Modify from_json to accept both String and Hash arguments This provides more flexibility for serialization workflows.
- Extract restore_state private method to reduce from_json AbcSize - Change as_json return type to untyped for Steep compatibility - Use assert_path_exists in tests per Minitest/AssertPathExists - Add JSON RBS vendor file for type checking - Regenerate RBS files
935833e to
39edd45
Compare
|
Regarding the Copilot suggestions to use The current implementation using The branch has been rebased onto master and now includes the thread-safety changes. Fixed an issue where |
|
@greptile |
Greptile SummaryAdded JSON-based persistence to both Key implementation details:
Testing:
Confidence Score: 5/5
Important Files Changed
Sequence DiagramsequenceDiagram
participant User
participant Bayes/LSI
participant FileSystem
participant JSON
Note over User,JSON: Save Flow
User->>Bayes/LSI: save(path)
Bayes/LSI->>Bayes/LSI: to_json()
Bayes/LSI->>Bayes/LSI: as_json()
Note right of Bayes/LSI: Transform symbols to strings<br/>Serialize state data
Bayes/LSI->>JSON: to_json
JSON-->>Bayes/LSI: JSON string
Bayes/LSI->>FileSystem: File.write(path, json)
FileSystem-->>User: Success
Note over User,JSON: Load Flow
User->>Bayes/LSI: load(path)
Bayes/LSI->>FileSystem: File.read(path)
FileSystem-->>Bayes/LSI: JSON string
Bayes/LSI->>Bayes/LSI: from_json(json)
Bayes/LSI->>JSON: JSON.parse(json)
JSON-->>Bayes/LSI: Hash
Bayes/LSI->>Bayes/LSI: allocate new instance
alt Bayes
Bayes/LSI->>Bayes/LSI: restore_state(data)
Note right of Bayes/LSI: Initialize mutex<br/>Transform strings to symbols<br/>Restore all state variables
else LSI
Bayes/LSI->>Bayes/LSI: new(auto_rebuild: false)
Note right of Bayes/LSI: Create ContentNodes from data<br/>Increment version counter
Bayes/LSI->>Bayes/LSI: build_index()
Note right of Bayes/LSI: Rebuild SVD vectors<br/>from source data
end
Bayes/LSI-->>User: Loaded classifier
|
Summary
as_jsonmethod that returns a Hash representationto_jsonmethod that returns a JSON stringfrom_jsonclass method that accepts either a JSON string or Hashsave(path)andload(path)for file operationsUsage
Design Decisions
{"version": 1, "type": "bayes|lsi", ...}allows future format changesfrom_jsonhandles both for flexibilityTest plan
as_jsonreturning Hashfrom_jsonwith both String and Hashbundle exec rake testNATIVE_VECTOR=true bundle exec rake testCloses #17