Skip to content

Use int as basis for CvdumpTypeKey#315

Merged
disinvite merged 34 commits intoisledecomp:masterfrom
disinvite:strong-type-for-cvdump-key
Feb 13, 2026
Merged

Use int as basis for CvdumpTypeKey#315
disinvite merged 34 commits intoisledecomp:masterfrom
disinvite:strong-type-for-cvdump-key

Conversation

@disinvite
Copy link
Collaborator

Fixes #211.

This is a huge change and it could easily be split into a few smaller ones if needed.

Intro

Type keys are used throughout the TYPES section of cvdump, along with other sections that refer to a type, such as SYMBOLS and GLOBALS. They come in three forms:

  • Scalar types like T_NOTYPE(0000)
  • "Unknown" scalar types like ???(047C)
  • User-defined types like 0x1234

Until now we have used a string representation for this key, but the use of upper or lower-case hex digits is not consistent. We use the normalize_type_id function to handle this, but also manually convert using str.lower() in many spots. #203 added the CvdumpTypeKey annotation but this was just an alias for str so it was not checked with mypy.

The obvious improvement is to exploit the fact that scalar types occupy the range from 0x0000 to 0x1000, so the key could just be an int. The obstacle there is that we depend on the string version of the T_* scalar types to decide the type's footprint and other attributes. cvinfo.h contains bit-shifting macros that do this using integers. Considering that the range of scalar types is only 10% full (about 300 total) it is reasonable to use const data for this instead of calculating it on the fly.

This change

  • All scalar types now have CvInfoType tuples in a new file cvinfo.py, which includes the type's size and format chars for struct.unpack.

  • CvdumpTypeKey is now a class that extends int so mypy will type check it. We need to make sure it is an int when storing in the Entity DB and convert back to CvdumpTypeKey when reading data back. For now, this happens automatically, but we could use conversion methods too.

  • The parametrized tests in test_cvdump.py may not have much value now that we store the type attributes directly in cvinfo.py. One change to test data is that float and double are now considered unsigned to match cvinfo.h. The only impact was the choice of format char, but we now store this directly rather than computing it.

  • On the topic of struct.unpack format chars, I decided to use i/I for the INT types and l/L for LONG. We had previously used l/L for any 4-byte integer. This is more of a style thing and I don't expect any practical difference.

  • CvdumpTypeKey.from_str() replaces normalize_type_id and it's used everywhere we need to convert a type key. Similarly, the is_scalar() function replaces magic string checks for the "T_" prefix.

  • Because CvdumpTypeKey is now a class, we need to use it to initialize any int variable that will serve as a type key. This change is most prevalent in unit tests.

  • Similarly, whenever we need to refer to a specific type, there is an Enum based on CvdumpTypeKey. This is created dynamically with types.new_class to save space.

  • The Ghidra scripts use the string representation of scalar types to decide which Ghidra primitive to use, and to add a * for pointers. We can access the T_ string in the CvInfoType tuple so the conversion should be the same.

  • I fixed a bug with LF_ENUM where forward refs were not followed. Forward refs have an underlying type of T_NOTYPE(0000) but this slipped by unnoticed because we assumed a size of 4 most types.

  • Fixed another bug: we did not correctly handle type keys larger than 0xffff because they use more characters than expected.

Future

Some types might be handled incorrectly in the new system, but we won't know for sure until we see them in a PDB. I marked most scalar types with a weird attribute and the logger will display an info message if we see any.

It is probably time to separate cvdump text parsing from the general type database.

Not all data is exposed through the get() API and we need to access the keys dict directly. e.g. We don't store modifiers like const or volatile or the destination type of pointers. See #106.

test_cvdump_types.py is now over 1000 lines, so I added the pylint:disable option for now. The problem is that we need the huge blocks of text to set up the parser, but this could probably be broken out into tests for each leaf type.

@disinvite disinvite requested a review from jonschz February 7, 2026 23:05
Copy link
Collaborator

@jonschz jonschz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks very good, this is a big improvement to the code quality!

@disinvite disinvite merged commit 78fba75 into isledecomp:master Feb 13, 2026
15 checks passed
@disinvite disinvite deleted the strong-type-for-cvdump-key branch February 13, 2026 02:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Type checking for keys in cvdump TYPES parser

2 participants