```python
import math

import torch
from transformers import AutoTokenizer
from transformers_cfg.grammar_utils import IncrementalGrammarConstraint
from transformers_cfg.generation.logits_process import GrammarConstrainedLogitsProcessor

tokenizer = AutoTokenizer.from_pretrained(
    "NousResearch/Llama-2-7b-hf",
    token="hf_...",  # access token redacted
)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

with open("examples/grammars/geo_query.ebnf", "r") as file:
    grammar_str = file.read()

# Full query and three successively longer prefixes of it
query = "answer(population_1(cityid('austin', _)))"
query_prefix_1 = ""
query_prefix_2 = "answer"
query_prefix_3 = "answer("

grammar = IncrementalGrammarConstraint(grammar_str, "root", tokenizer)
# Parse from the first token
grammar_processor = GrammarConstrainedLogitsProcessor(grammar, 0)

encoded_1 = tokenizer(query_prefix_1, add_special_tokens=False, return_tensors="pt", padding=True)
encoded_2 = tokenizer(query_prefix_2, add_special_tokens=False, return_tensors="pt", padding=True)
encoded_3 = tokenizer(query_prefix_3, add_special_tokens=False, return_tensors="pt", padding=True)

scores_1 = grammar_processor.process_logits(encoded_1["input_ids"], torch.zeros(1, len(tokenizer)))
print(torch.nonzero(scores_1[0] != -math.inf).squeeze(dim=1))
# tensor([  273,   550, 12011, 29874])

scores_2 = grammar_processor.process_logits(encoded_2["input_ids"], torch.zeros(1, len(tokenizer)))
print(torch.nonzero(scores_2[0] != -math.inf).squeeze(dim=1))
# tensor([2])

grammar_processor.process_logits(encoded_3["input_ids"], torch.zeros(1, len(tokenizer)))
```
This raises the following error on the final `process_logits` call:

```
ValueError: All stacks are empty, so the only token accepted is EOS(2) but got 29898
```

The cause is that the Llama tokenizer inserts a dummy whitespace at the start of the sequence. When encoding `answer`, the tokenizer therefore returns token id 1234 rather than 12011:
```python
print(tokenizer.convert_ids_to_tokens([1234]))
# ['▁answer']
print(tokenizer.convert_ids_to_tokens([12011]))
# ['answer']
```
When parsing the first (empty) prefix, the set of allowed next tokens does not include 1234; it only includes ids of tokens without the prefix space. Parsing the second prefix therefore consumes a token the grammar rejects, after which the only accepted next token is EOS, so processing the third prefix raises the error above.
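To make this concrete, here is a small check continuing from the snippet above; the membership results follow directly from the allowed-token tensor already printed there:

```python
# 1234 ("▁answer") is what the tokenizer actually produces for "answer",
# but the grammar only allows 12011 ("answer" without the prefix space).
allowed = torch.nonzero(scores_1[0] != -math.inf).squeeze(dim=1).tolist()
print(1234 in allowed)   # False: "▁answer" is rejected
print(12011 in allowed)  # True: "answer" without the prefix space is allowed
```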
It would be very helpful if the library could handle this prefix-whitespace problem.
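As one possible direction, here is a minimal sketch of a common workaround for SentencePiece's dummy prefix space: tokenize the prefix behind a throwaway anchor string and slice the anchor's tokens off. The helper name and the choice of `"\n"` as anchor are hypothetical illustrations, not part of transformers-CFG:

```python
def encode_without_prefix_space(tokenizer, text):
    # Hypothetical helper: prepend an anchor so SentencePiece does not attach
    # its dummy "▁" to the first real token, then drop the anchor's ids.
    anchor = "\n"  # arbitrary anchor; its token ids are stripped below
    anchor_ids = tokenizer(anchor, add_special_tokens=False)["input_ids"]
    ids = tokenizer(anchor + text, add_special_tokens=False)["input_ids"]
    assert ids[: len(anchor_ids)] == anchor_ids, "anchor tokenized differently"
    return ids[len(anchor_ids):]

print(encode_without_prefix_space(tokenizer, "answer"))  # expected: [12011]
```

The assert guards against the anchor merging with the following text; if the chosen anchor ever tokenizes differently in context, a different anchor would be needed.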