```python
import math

import torch
from transformers import AutoTokenizer
from transformers_cfg.grammar_utils import IncrementalGrammarConstraint
from transformers_cfg.generation.logits_process import GrammarConstrainedLogitsProcessor

tokenizer = AutoTokenizer.from_pretrained(
    "NousResearch/Llama-2-7b-hf",
    token="hf_...",  # access token redacted
)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

with open("examples/grammars/geo_query.ebnf", "r") as file:
    grammar_str = file.read()

# Full query and three successively longer prefixes of it
query = "answer(population_1(cityid('austin', _)))"
query_prefix_1 = ""
query_prefix_2 = "answer"
query_prefix_3 = "answer("

grammar = IncrementalGrammarConstraint(grammar_str, "root", tokenizer)
# Parse from the first token
grammar_processor = GrammarConstrainedLogitsProcessor(grammar, 0)

encoded_1 = tokenizer(query_prefix_1, add_special_tokens=False, return_tensors="pt", padding=True)
encoded_2 = tokenizer(query_prefix_2, add_special_tokens=False, return_tensors="pt", padding=True)
encoded_3 = tokenizer(query_prefix_3, add_special_tokens=False, return_tensors="pt", padding=True)

scores_1 = grammar_processor.process_logits(encoded_1["input_ids"], torch.zeros(1, len(tokenizer)))
print(torch.nonzero(scores_1[0] != -math.inf).squeeze(dim=1))
# tensor([  273,   550, 12011, 29874])

scores_2 = grammar_processor.process_logits(encoded_2["input_ids"], torch.zeros(1, len(tokenizer)))
print(torch.nonzero(scores_2[0] != -math.inf).squeeze(dim=1))
# tensor([2])

grammar_processor.process_logits(encoded_3["input_ids"], torch.zeros(1, len(tokenizer)))
```
This raises the following error on the final `process_logits` call:

```
ValueError: All stacks are empty, so the only token accepted is EOS(2) but got 29898
```

The cause is that the Llama tokenizer inserts a dummy whitespace at the start of the sequence. When encoding `answer`, the tokenizer therefore returns token id 1234 rather than 12011:
```python
print(tokenizer.convert_ids_to_tokens([1234]))
# ['▁answer']
print(tokenizer.convert_ids_to_tokens([12011]))
# ['answer']
```
When parsing the first (empty) prefix, the set of allowed next tokens does not include 1234; it only includes ids of tokens without the prefix space. Parsing the second prefix therefore consumes a token the grammar rejects, after which the only accepted next token is EOS, so processing the third prefix raises the error above.
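To make this concrete, here is a small check continuing from the snippet above; the membership results follow directly from the allowed-token tensor already printed there:

```python
# 1234 ("▁answer") is what the tokenizer actually produces for "answer",
# but the grammar only allows 12011 ("answer" without the prefix space).
allowed = torch.nonzero(scores_1[0] != -math.inf).squeeze(dim=1).tolist()
print(1234 in allowed)   # False: "▁answer" is rejected
print(12011 in allowed)  # True: "answer" without the prefix space is allowed
```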
It would be very helpful if the library could handle this prefix-whitespace problem.
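As one possible direction, here is a minimal sketch of a common workaround for SentencePiece's dummy prefix space: tokenize the prefix behind a throwaway anchor string and slice the anchor's tokens off. The helper name and the choice of `"\n"` as anchor are hypothetical illustrations, not part of transformers-CFG:

```python
def encode_without_prefix_space(tokenizer, text):
    # Hypothetical helper: prepend an anchor so SentencePiece does not attach
    # its dummy "▁" to the first real token, then drop the anchor's ids.
    anchor = "\n"  # arbitrary anchor; its token ids are stripped below
    anchor_ids = tokenizer(anchor, add_special_tokens=False)["input_ids"]
    ids = tokenizer(anchor + text, add_special_tokens=False)["input_ids"]
    assert ids[: len(anchor_ids)] == anchor_ids, "anchor tokenized differently"
    return ids[len(anchor_ids):]

print(encode_without_prefix_space(tokenizer, "answer"))  # expected: [12011]
```

The assert guards against the anchor merging with the following text; if the chosen anchor ever tokenizes differently in context, a different anchor would be needed.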