Package to define, convert, encode and decode crystal structures into text representations.
xtal2txt is an important part of our MatText framework.
Note on SLICES: This version uses SLICES 2.x, which includes a metadata prefix in the output format (e.g., `o w b DOD c ODD d OOO o`). SLICES 1.x did not include this metadata prefix.
Requirements: Python 3.9-3.12 (Python 3.9 recommended for SLICES support)
```bash
pip install xtal2txt
```

For all features (local environment analysis):
```bash
# Ubuntu/Debian
sudo apt-get install openbabel libopenbabel-dev libfftw3-dev
pip install xtal2txt openbabel-wheel

# macOS
brew install open-babel fftw
pip install xtal2txt openbabel-wheel
```

Development:
```bash
git clone https://github.com/lamalab-org/xtal2txt.git
cd xtal2txt
uv sync --extra dev
pre-commit install --install-hooks
```

The TextRep class in xtal2txt.core facilitates the transformation of crystal structures into different text representations. Below is an example of its usage:
```python
from xtal2txt.core import TextRep
from pymatgen.core import Structure

# Load structure from a CIF file
from_file = "InCuS2_p1.cif"
structure = Structure.from_file(from_file, "cif")

# Initialize TextRep class
text_rep = TextRep.from_input(structure)

requested_reps = [
    "cif_p1",
    "slices",
    "atom_sequences",
    "atom_sequences_plusplus",
    "crystal_text_llm",
    "zmatrix",
]

# Get the requested text representations
requested_text_reps = text_rep.get_requested_text_reps(requested_reps)
```

By default, the tokenizer is initialized with \[CLS\] and \[SEP\] tokens. For an example, see the SliceTokenizer usage:
```python
from xtal2txt.tokenizer import SliceTokenizer

tokenizer = SliceTokenizer(
    model_max_length=512, truncation=True, padding="max_length", max_length=512
)
print(tokenizer.cls_token)  # returns [CLS]
```

You can access the \[CLS\] token using the [cls_token]{.title-ref} attribute of the tokenizer. During decoding, you can use the [skip_special_tokens]{.title-ref} parameter to skip these special tokens.
Decoding with skipping special tokens:

```python
tokenizer.decode(token_ids, skip_special_tokens=True)
```

In scenarios where the \[CLS\] token is not required, you can initialize the tokenizer with an empty special_tokens dictionary.
Initialization without \[CLS\] and \[SEP\] tokens:

```python
tokenizer = SliceTokenizer(
    model_max_length=512,
    special_tokens={},
    truncation=True,
    padding="max_length",
    max_length=512,
)
```

All Xtal2txtTokenizer instances inherit from PreTrainedTokenizer and accept arguments compatible with the Hugging Face tokenizer interface.
The special_num_token argument (False by default) can be set to True to tokenize numbers in the special way designed and implemented by RegressionTransformer.
```python
tokenizer = SliceTokenizer(
    special_num_token=True,
    model_max_length=512,
    special_tokens={},
    truncation=True,
    padding="max_length",
    max_length=512,
)
```

Contributions, whether filing an issue, making a pull request, or forking, are appreciated. See CONTRIBUTING.md for more information on getting involved.
The code in this package is licensed under the MIT License. See the Notice for imported LGPL code.
This project has been supported by the Carl Zeiss Foundation as well as Intel and Merck.
