Package to define, convert, encode and decode crystal structures into text representations.
xtal2txt is an important part of our MatText framework.
Note on SLICES: This version uses SLICES 2.x, which includes a metadata prefix in the output format (e.g., `o w b DOD c ODD d OOO o`). SLICES 1.x did not include this metadata prefix.
Requirements: Python 3.9-3.12 (Python 3.9 recommended for SLICES support)
```bash
pip install xtal2txt
```

For all features (local environment analysis):
```bash
# Ubuntu/Debian
sudo apt-get install openbabel libopenbabel-dev libfftw3-dev
pip install xtal2txt openbabel-wheel

# macOS
brew install open-babel fftw
pip install xtal2txt openbabel-wheel
```

Development:
```bash
git clone https://github.com/lamalab-org/xtal2txt.git
cd xtal2txt
uv sync --extra dev
pre-commit install --install-hooks
```

The TextRep class in xtal2txt.core facilitates the transformation of crystal structures into different text representations. Below is an example of its usage:
```python
from xtal2txt.core import TextRep
from pymatgen.core import Structure

# Load structure from a CIF file
from_file = "InCuS2_p1.cif"
structure = Structure.from_file(from_file, "cif")

# Initialize TextRep class
text_rep = TextRep.from_input(structure)

requested_reps = [
    "cif_p1",
    "slices",
    "atom_sequences",
    "atom_sequences_plusplus",
    "crystal_text_llm",
    "zmatrix",
]

# Get the requested text representations
requested_text_reps = text_rep.get_requested_text_reps(requested_reps)
```

By default, the tokenizer is initialized with \[CLS\] and \[SEP\] tokens. For an example, see the SliceTokenizer usage:
```python
from xtal2txt.tokenizer import SliceTokenizer

tokenizer = SliceTokenizer(
    model_max_length=512, truncation=True, padding="max_length", max_length=512
)
print(tokenizer.cls_token)  # returns [CLS]
```

You can access the \[CLS\] token using the [cls_token]{.title-ref} attribute of the tokenizer. During decoding, you can use the [skip_special_tokens]{.title-ref} parameter to skip these special tokens.
Decoding with skipping special tokens:

```python
tokenizer.decode(token_ids, skip_special_tokens=True)
```

In scenarios where the \[CLS\] token is not required, you can initialize the tokenizer with an empty special_tokens dictionary.
Initialization without \[CLS\] and \[SEP\] tokens:

```python
tokenizer = SliceTokenizer(
    model_max_length=512,
    special_tokens={},
    truncation=True,
    padding="max_length",
    max_length=512,
)
```

All Xtal2txtTokenizer instances inherit from PreTrainedTokenizer and accept arguments compatible with the Hugging Face tokenizer interface.
The special_num_token argument (False by default) can be set to True to tokenize numbers in the special way designed and implemented by RegressionTransformer.
```python
tokenizer = SliceTokenizer(
    special_num_token=True,
    model_max_length=512,
    special_tokens={},
    truncation=True,
    padding="max_length",
    max_length=512,
)
```

Contributions, whether filing an issue, making a pull request, or forking, are appreciated. See CONTRIBUTING.md for more information on getting involved.
The code in this package is licensed under the MIT License. See the Notice for imported LGPL code.
This project has been supported by the Carl Zeiss Foundation as well as Intel and Merck.
