Hello @nicklein! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
Comment last updated at 2022-05-19 22:08:31 UTC |
kgtk/cli/build_faiss.py
Outdated
| # REQUIRED # | ||
| # Related to input file | ||
| parser.add_argument('-i', '--input_file', '--embeddings_file', action='store', dest='embeddings_file', |
There was a problem hiding this comment.
There is an add_input_file function; please use it:
https://github.com/usc-isi-i2/kgtk/blob/master/kgtk/cli/cat.py#L45
kgtk/cli/build_faiss.py
Outdated
| help='Input file containing the embeddings for which a Faiss index will be created.') | ||
| # Related to output | ||
| parser.add_argument('-o', '--output_file', '--index_file_out', action='store', dest='index_file_out', |
There was a problem hiding this comment.
There is also an add_output_file() function:
https://github.com/usc-isi-i2/kgtk/blob/master/kgtk/cli/cat.py#L50
kgtk/cli/build_faiss.py
Outdated
| required=True, help="Output .idx file where the index fill be saved.", | ||
| metavar="INDEX_FILE_OUT") | ||
| parser.add_argument('-id2n', '--index_to_node_file_out', action='store', dest='index_to_node_file_out', |
There was a problem hiding this comment.
All secondary options should use hyphens rather than underscores: --index_to_node_file_out should be --index-to-node-file-out.
Also, as a general rule, do not use stop words such as "to" in parameter names. This review comment applies to all subsequent parameters.
The dest parameter is fine.
There was a problem hiding this comment.
Is the reason to avoid stop words simply for brevity? I am struggling to think of a name for this that omits 'to' and isn't ambiguous. This parameter specifies the path where a kgtk file will be saved. The kgtk file contains a mapping of index to corresponding node. Do you have a suggestion for this parameter name?
There was a problem hiding this comment.
Can you explain a bit more about what exactly this file contains? A mapping of which index? How is this different from the output file?
There was a problem hiding this comment.
When you use a faiss index to search for nearest neighbors, it returns distances and corresponding indexes/IDs of the nearest neighbors, rather than the names or embeddings of the nearest neighbors. This file would allow you to look up the entity name that corresponds to the index/ID. Here's an example of what the file would look like:
Input embedding file:
q1 <embedding>
q2 <embedding>
q3 <embedding>
Output index_to_node file:
node1 label node2
0 index_to_node q1
1 index_to_node q2
2 index_to_node q3
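A minimal sketch of how such a mapping file could be produced, assuming each embedding row starts with the node name followed by whitespace-separated values, and that vectors are added to the index in file order so that row position equals the Faiss ID. The function name `write_index_to_node` is hypothetical and not taken from the PR.

```python
import io


def write_index_to_node(embedding_lines, out):
    """Write one KGTK edge per embedding row: Faiss ID -> node name.

    Assumes the first whitespace-separated token of each row is the node
    name and that rows are added to the index in this same order.
    """
    out.write("node1\tlabel\tnode2\n")  # KGTK edge file header
    faiss_id = 0
    for line in embedding_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines without consuming an ID
        out.write(f"{faiss_id}\tindex_to_node\t{fields[0]}\n")
        faiss_id += 1
    return faiss_id  # number of rows written


# Usage: write the mapping for three embedding rows into an in-memory buffer.
buffer = io.StringIO()
count = write_index_to_node(["q1 0.1 0.2\n", "q2 0.3 0.4\n", "q3 0.5 0.6\n"], buffer)
```

Keeping the counter separate from the loop position means a skipped blank line cannot shift the IDs away from the order in which vectors are added.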
There was a problem hiding this comment.
I suppose I could use the word ID rather than index here to avoid confusion. Then I could call this 'node_id_file_out' to avoid 'to'.
There was a problem hiding this comment.
I see, using the stop word "to" makes sense here. I would call this --faiss-id-to-node-mapping-file. As you can see, brevity is not the issue; a meaningful name is.
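For illustration, the naming convention agreed on in this thread might look like the sketch below. It uses the standard library's argparse as a stand-in for KGTK's own parser; the prog name, metavar, and help text are assumptions, while the option name and dest come from the discussion above.

```python
import argparse


def build_parser():
    # Plain argparse stands in for the KGTK parser here (assumption).
    parser = argparse.ArgumentParser(prog="build-faiss")
    parser.add_argument(
        "--faiss-id-to-node-mapping-file",  # hyphens in the user-facing option
        action="store",
        dest="index_to_node_file_out",      # underscores are fine in dest
        required=True,
        metavar="MAPPING_FILE",             # illustrative metavar
        help="Output KGTK file mapping each Faiss ID to its node.",
    )
    return parser


# Usage: the hyphenated option is read back through the underscore dest.
args = build_parser().parse_args(["--faiss-id-to-node-mapping-file", "mapping.tsv"])
```

Because dest is set explicitly, the function signature that consumes the parsed arguments does not have to change when the option is renamed.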
kgtk/cli/build_faiss.py
Outdated
| except SystemExit as e: | ||
| raise KGTKException("Exit requested") | ||
| except KGTKException as e: | ||
| raise |
| """ | ||
| Train and populate a faiss index that can compute nearest neighbors of given embeddings. | ||
| """ | ||
There was a problem hiding this comment.
Rename the command and the file in cli to build-faiss-index.
| def build_faiss(embeddings_file, embeddings_format, no_input_header, index_file_out, index_to_node_file_out, | ||
| max_train_examples, workers, index_string, metric_type, p=None, verbose=False): | ||
| # validate input file path |
There was a problem hiding this comment.
Use KGTKReader to read the input file, as it is an edge file. It handles a lot of exceptions and cases. Take a look at the cat command for an example.
kgtk/graph_embeddings/build_faiss.py
Outdated
| # validate metric type and translate to a faiss metric | ||
| metrics = { | ||
| "Inner_product": faiss.METRIC_INNER_PRODUCT, |
There was a problem hiding this comment.
Make sure this is case-insensitive.
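One way to make the lookup case-insensitive is to store lowercase keys and normalize the user's value before the lookup. The sketch below is an assumption about how that could look, not the PR's code: the string values are placeholders for the corresponding `faiss.METRIC_*` constants so that the example runs without Faiss installed, the `l2` entry is assumed, and ValueError stands in for KGTKException.

```python
# Placeholder values; the real code would use the faiss.METRIC_* constants.
METRICS = {
    "inner_product": "METRIC_INNER_PRODUCT",
    "l2": "METRIC_L2",  # assumed entry, not shown in the diff
    "lp": "METRIC_Lp",
}


def resolve_metric(metric_type):
    """Return the metric for metric_type, ignoring case and outer whitespace."""
    key = metric_type.strip().lower()
    if key not in METRICS:
        choices = " | ".join(METRICS)
        # ValueError stands in for KGTKException in this sketch.
        raise ValueError(
            f"Unrecognized value for metric_type parameter: {metric_type}. "
            f"Please choose one of {choices}."
        )
    return METRICS[key]


# Usage: mixed-case input resolves to the same entry as the lowercase key.
metric = resolve_metric("Inner_Product")
```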
kgtk/graph_embeddings/build_faiss.py
Outdated
| else: | ||
| raise KGTKException("Unrecognized value for metric_type parameter: {}.".format(metric_type) + | ||
| "Please choose one of {}.".format(" | ".join(list(metrics.keys())))) | ||
| if metric_type == "Lp" and p is None: |
There was a problem hiding this comment.
What is p? Please use a descriptive name.
There was a problem hiding this comment.
If the Lp distance metric is specified, then the user also needs to specify the value of p that they want to use, e.g. L1, L2, L...
I'll change it to metric_arg though since that is what Faiss calls it.
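With the rename, the check could read roughly as follows. This is a sketch only: the function name `check_metric_arg` is hypothetical, and ValueError stands in for KGTKException.

```python
def check_metric_arg(metric_type, metric_arg=None):
    """Require metric_arg (the value of p) when the Lp metric is selected."""
    if metric_type.lower() == "lp" and metric_arg is None:
        # ValueError stands in for KGTKException in this sketch.
        raise ValueError(
            "metric_type 'Lp' requires metric_arg, the value of p (for example 1 or 2)."
        )
    return metric_arg


# Usage: Lp with p = 3 passes; other metrics do not need metric_arg.
p_value = check_metric_arg("Lp", 3)
unused = check_metric_arg("inner_product")
```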
| print("Writing index-to-node file...") | ||
| with open(embeddings_file, 'r') as f_in: | ||
| with open(index_to_node_file_out, 'w+') as f_out: | ||
| f_out.write("node1\tlabel\tnode2\n") # header |
| # Load training examples for index training | ||
| train_vecs = [] | ||
| if verbose: | ||
| print("Loading training vectors...") |
There was a problem hiding this comment.
Print to an error_file (see the example in cat). Do not print to standard output.
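A minimal sketch of the requested pattern, assuming an `error_file` parameter that defaults to sys.stderr; the function name and the parsing of rows are illustrative, not the PR's code.

```python
import sys


def load_training_vectors(lines, verbose=False, error_file=sys.stderr):
    """Parse embedding rows, sending progress messages to error_file.

    Assumes each row is a node name followed by whitespace-separated floats.
    """
    if verbose:
        # Diagnostics go to error_file so standard output stays clean for data.
        print("Loading training vectors...", file=error_file, flush=True)
    return [[float(x) for x in line.split()[1:]] for line in lines if line.strip()]


# Usage: progress text goes to stderr by default, never to stdout.
vectors = load_training_vectors(["q1 0.5 1.5\n", "q2 2.0 3.0\n"], verbose=True)
```

Passing the stream as a parameter also makes the messages easy to capture in tests or redirect to a log file.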
@nicklein ETA on the requested changes?
New kgtk command for building a faiss index. Intended for use with graph embeddings.