-
Notifications
You must be signed in to change notification settings - Fork 201
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Consistently handle parsed enum values that are not listed in the enum #778
Comments
I've been hit by this as well today again and in my opinion there are three choices, which some KSY-author could choose amongst when defining an While all target languages need to implement all three choices in the end, it would be totally OK to simply throw exceptions in the compiler if some target language lacks support. This is sometimes done already and allows KSY-authors to either implement support or choose some other strategy already supported. In any case it's better than now, because authors would be made aware of some problem, while currently things might fail or behave unexpected in different target languages on runtime. Throw exceptions on unknown values.Sometimes unknown values are simply wrong from the point of view of an author of some format, but people could provide arbitrary wrong values. While it's sometimes possible to check those things with validations, it doesn't make much sense to force KSY-authors to do so when they have already decided about a whitelist of valid values. They wouldn't use an Would be important to provide a good error message, though, containing the invalid value and not only a message about "things" being unsupported or such. Some syntax to support this could simply be a specially named key, triggering KS to generate throwing code and maybe even provide a custom prefix for the error message. enums:
ip_protocol:
1: icmp
6: tcp
17: udp
throw|die: ""|"Unsupported protocol, details: " Map unknown values to a concrete
|
@ams-tschoening Thank you for the detailed write-up! While reporting the issue I also had thoughts about supporting switching between the different behaviors, but I couldn't get my thoughts properly organized, so I left that part out of my initial comment :) I agree that it would make sense to allow switching between the three different behaviors for handling unknown values (throw error, replace with single "unknown" value, or preserve exact value). But I'm not sure if this choice should be hardcoded into the spec (or at least not only in the spec), because different users of the same spec might want different behaviors for unknown enum values. As an example, say you have a TLV-style format, where you have blocks of data prefixed with a type code that defines the meaning of the following data. The format defines a few type codes, and those are the only allowed codes - use of any other type codes is invalid according to the format (or perhaps indicates a newer format version). As a spec writer, the first behavior (throw errors on unknown enum values) would be the correct one to choose here. However, as a user of the spec, any of the three behaviors might be the most suitable, depending on the use case:
On the other hand, I agree that the spec writer should also have some control over the behavior (at least the ability to set a default), because depending on the format, different behaviors are more likely to be the best for most people. For example, if a format spec says "these are the allowed type codes, all others are invalid and can never occur", then it would make sense to use the first behavior (throw errors) unless the user of the spec overrides it. On the other hand, if the format spec says "these are the standardized type codes, but others are valid and must be preserved when the data is modified", then the third behavior (preserve unknown values) would be required by most users. Although I suppose it would still be a valid solution to have the error behavior controlled only by the spec. This way, spec writers would choose the behavior that best matches the official definition of the format, and any users that need a different behavior for specialized use cases have to modify the spec. It might not be ideal for all users, but it would probably be the simplest solution to understand and implement. |
Also, on the topic of preserving unknown enum values in Java, I had another idea that would be less user-unfriendly than the wrapper class solution. Java enums are allowed to implement interfaces, so it would be possible to represent each KS enum as an interface with two implementations: the enum, which lists all known values, and a separate class that wraps unknown values. For example, if you have this KS enum: enums:
compression_algorithm:
0: uncompressed
1: gzip
2: bzip2
3: xz It might be compiled to this Java code: Generated Java code (example)public interface ICompressionAlgorithm {
int id();
static ICompressionAlgorithm byId(long id) {
CompressionAlgorithm known = CompressionAlgorithm.byId(id);
if (known != null) {
return known;
} else {
return UnknownCompressionAlgorithm.byId(id);
}
}
}
public enum CompressionAlgorithm implements ICompressionAlgorithm {
UNCOMPRESSED(0),
GZIP(1),
BZIP2(2),
XZ(3),
;
private final long id;
CompressionAlgorithm(long id) { this.id = id; }
@Override
public long id() { return id; }
private static final Map<Long, CompressionAlgorithm> byId = new HashMap<Long, CompressionAlgorithm>(4);
static {
for (CompressionAlgorithm e : CompressionAlgorithm.values())
byId.put(e.id(), e);
}
public static CompressionAlgorithm byId(long id) { return byId.get(id); }
}
// might not even need to be public
public class UnknownCompressionAlgorithm implements ICompressionAlgorithm {
private final long id;
private UnknownCompressionAlgorithm(long id) { this.id = id; }
@Override
public long id() { return id; }
public static UnknownCompressionAlgorithm byId(long id) {
// Optional: add memoization to avoid allocating multiple objects for the same ID.
return new UnknownCompressionAlgorithm(id);
}
// Insert equals, hashCode, toString, etc. boilerplate here,
// which need to be implemented manually, unlike in enums.
} Then a struct field // Check against known value
if (struct.algo() == CompressionAlgorithm.UNCOMPRESSED) {/* ... */}
// Extract raw value
System.out.println(struct.algo().id());
// (Note that the two examples above don't need any casts or type checks,
// because id() is part of the interface and works with known and unknown values.)
// Ensure that value is known
CompressionAlgorithm known_algo = (CompressionAlgorithm)struct.algo();
// Switch over known values
switch ((CompressionAlgorithm)struct.algo()) {
case UNCOMPRESSED: // ...
case GZIP: // ...
// ...
}
// Switch over known or unknown values (this is not very nice sadly)
if (struct.algo() instanceof CompressionAlgorithm) {
switch ((CompressionAlgorithm)struct.algo()) {/* like above */)}
} else {
handleUnknown(struct.algo().id());
}
// Alternative version of the above
// (allows single else case for both "known but unhandled value" and "unknown value")
if (struct.algo() == CompressionAlgorithm.UNCOMPRESSED) {
// ...
} else if (struct.algo() == CompressionAlgorithm.GZIP) {
// ...
} else {
handleUnknown(struct.algo().id());
} |
Makes sense, but things a lot more complicated as well I guess. Letting users decide sounds like something needed at runtime, which makes an approach like opaque types or custom processing routines come into my mind. Either of both would be needed to implement the lookup of decimals to Or KS would provide some interfaces in the runtime used during resolving I wouldn't give that any priority right now. Letting KSY-authors switch between the three different behaviours seems more important to me. The docs could simply mention that e.g. throwing an exception might not be the most compatible behaviour for users, so KSY-authors should really be sure. Especially the wrapper preserving original values would be really compatible for all use cases, so might additionally make sense to make that the default. |
No. Instead the exception should be specific enough and contain the info in the machine-readable way. Human-readable error messages is NOT KS was designed for. Only embedders of specs know the best way they should process the errors, not specs authors, not KSC.
For dynamically typed languages, or object oriented languages where enums are inherited from integers it is not needed. For C++ such a wrapper can be created, even with conversion operators, allowing the objects be used as numbers, requiring no changes in the code embedding the spec.
Both agree and disagree. Agree that it should be redefineable, finely-grained redefineable, and disagree that it shouldn't be hardcoded - it is format that define tue semantics. In TLV unknown items can usually be just skipped, as their size is known. In TV the size is defined by the enum value, so the failure to recognize the type will ruin parsing. Wnd the compiler can derive the default mode from the code itself following the rule:
And in some cases the default hook can be generated automatically by KSC, such as an array of items, each one having a common signature, where it is impossible or extremily unlikely that the signature is present in a valid item body. Then we can just scan for the signature. |
"Machine-readable way" means unnecessary format decisions about what is provided how for whom. Instead, one can easily focus on providing somewhat useful error messages telling what the problem is and for which data it occurred. A lot of implementers simply won't care about those exceptions anyway and simply let things bubble up the stack. More importantly, KS simply provides human-readable exception messages already in the validation infrastructure of the runtime.
That error message isn't very machine-readable as well, but totally fits its purpose: It's easy to understand for humans and if someone wants to parse it, it's not too difficult as well. So there's absolutely no need to raise the bar for no real benefit here and make things unnecessary difficult.
But simply not all target languages behave that way and not all use cases are compatible with that. I'm using Java mostly and have a lot of use cases in which I'm forwarding the NAME of an enum into some config file to get compared with other names. This improves maintenance and documentation of the config files and simply doesn't work anymore if decimals instead of names occur, because the interface is accepting strings only currently. Therefore I often prefer some fallback like It totally makes sense for KS to provide an own concept across all supported target languages. Each target language could easily add custom magic as needed in future to make life for those users easier, but the point is that such magic isn't available in all target languages anyway.
Keep in mind that sometimes enums are really only used to provide names to decimals and don't influence further parsing in any way. I do this a lot to make some data structure easier to understand. In those cases there won't ever be any parsing failure, it's only that users of specs sometimes run into |
Absolutely. And while reverse-engineering or debugging, we clearly should not fail on unknown enum values, if they don't affect parsing. And in production we usually shouldn't too, because these values may just mean unimplemented extensions which can be safely (for the purpose of parsing) skipped.
I completely agree that in these cases when a num doesn't match the enum, it must be replaced by the num rather the breaking behavior.
Completely true, without explicit things we cannot properly test implicit things that must result into the same code.
Instead it should provide them machine-readable. All the exceptions must be macuine-readable. Exceptions can be used for direct consumption of their string representation by a user in the case of program failure, but they must be machine-readable because their primary purpose is to be caught and processed.
I was speaking about the ones behaving the needed way. And I forgot to add - the langs with runtime type identification. So they don't need external bits as we will have to use in C++. BTW, for C++ IMHO we SHOULDN'T verify the enums at their parsing. It may carry a large overhead, depending on the structure of the enum, and in the most of cases is completely unnneeded. Let it be just a number (in a The problems are langs verifying them. Should we introduce own classes for the enums, like it is proposed to do in C++ for all of them, including python, allowing to avoid verification where it is unnneded? |
To clarify: with "users" I meant "users of the spec", i. e. developers who want to compile and use an existing spec (from KSF or another repo) that they aren't developing themselves. My thought was to maybe add a compiler flag for overriding the enum error behavior, to allow selecting a different behavior if the one chosen by the spec author doesn't fit your use case. But perhaps that really isn't worth the added complexity and we should just ask people to modify the spec if they need a different error behavior.
I would argue that if you need machine-readable error information, you should be using the behavior that preserves invalid values. This gives the code that uses the spec much more context than machine-readable information on an exception.
I'm not sure what the point is here? Different target languages can have different implementations of the same feature. So where Java would require a wrapper class or interface, Python could instead directly store either the enum value or raw integer, and C++ would simply cast the raw value to the enum type without checking (perhaps with an extra generated function to allow checking if a value is actually valid for the enum, which can be manually called by user code if it needs to know). |
That's how I understood you already, but a compiler flag 1. adds complexity as well and 2. isn't even sufficient in theory: Not all users compile from KSY-files themself, I e.g. have customers to which I deploy KSY-based C# and Ruby. While I could provide them KSY-files in theory, they don't want to take care about compiling those and to be honest, my company doesn't want to provide them the KSY-source for business reasons currently as well. So one would need something available on compilation results as well or we simply don't care too much currently. I prefer the latter, the mentioned 3 choices most likely fit for 9X% of the use cases already.
My argument is to really provide a KS-specific wrapper for all ENUMs if KSY-authors choose so, pretty much like One and the same ENUM might be used in different places, where it sometimes is OK to map to some |
Of course the original values should be preserved. They are always preserved in the cases they are valid, we should also preserve them for the runtimes cleaning them for the cases they are invalid. The exception should not carry a preformatted error message, but should carry the necessary info to be able to format it. Not
It is proposed to not to use the native enums directly, but to create a wrapper class for them, if it is needed. The wrapper class does the following:
|
Yes, KS should have the same functionality and parsing behavior in all languages, but that doesn't mean that the interface should be identical across all languages - it should still follow the conventions of the target language. Python is dynamically typed, so I think it's completely okay (and perhaps even expected) to have a field that can contain two possible types of values, and the developer is allowed to assume one or the other without explicitly checking/casting/unwrapping the value first. KS already behaves similarly in other cases, for example with
I don't think that's true in general - just because a spec allows unknown values in an enum field doesn't mean that all code that uses that spec needs to care about those values and should be forced to explicitly handle them. Someone who is writing a quick-and-dirty script that only needs to handle a limited subset of the format, or even working interactively in the REPL, might not care about handling (or explicitly not handling) every edge case. If you want to be really sure that all possible cases are handled correctly, then you should be using a statically typed language that has such checks built into the language, and not Python. |
This has been sitting for a while. Has there been any progress on this? My two cents: I think it should be part of the .KSY spec if an unknown value for an enum is "ok" or an error. There are valid reasons for both, and whoever is writing the spec should know, and the difference should be propagated to the runtime code. Examples: But there are protocols where identifying the type is the only way you know how to process the following data. You can not "skip". That is a catastrophic error. Same w/ just simple enums that don't have any effect on parsing. You have protocols like the automotive & marine "CAN bus" family, where manufacturers are always adding new enum value for their specific hardware. It's not an error to not understand them. This is really common when you only have a reversed engineered protocol. But for financial transactions, parsing an order that has a known side (e.g. buy/sell), that's a catastrophic error. The run-time should have to do work to choose to continue, the default should be to fail. I don't have an option about how to encode it in the .KSY file, but I think it needs to be there, and the concepts of "unknow ok" and "unknown error" enums need to be understood by the runtime. |
valid: enum but it is not yet implemented and authors currently have to repeat oneself by using |
|
#435 and #775 is the spec for it Just usage of I remember there was an example using |
@armijnhemel uses them extensively in his https://github.com/armijnhemel/binaryanalysis-ng repo, just search for
- id: crypt_method
type: u4
enum: crypt_methods
valid:
any-of:
- crypt_methods::no_encryption
- crypt_methods::aes |
According to the decision in kaitai-io/kaitai_struct#778 unknown values of enums still should be available
According to the decision in kaitai-io/kaitai_struct#778 unknown values of enums still should be available
According to the decision in kaitai-io/kaitai_struct#778 unknown values of enums still should be available
At the moment, Kaitai Struct doesn't consistently handle parsing of enum values that are not defined in the ksy. For example, with a spec like this:
When parsing the byte array
[0x03]
, i. e. with thevalue
field set to3
, which is not listed in thething
enum, it's not clear what's supposed to happen. The documentation doesn't say anything about this case, and the behavior is inconsistent across target languages. Here's a summary of how each target language currently handles this case (I don't know all of these languages very well, please correct me if any of this is wrong):null
/nil
. Enum values are objects and can only represent values defined in the enum. KS-generated parsers look up an enum value matching the read integer value, and the lookup silently returnsnull
/nil
if there is no matching enum value,In summary, most targets allow enum fields to have values that are not listed in the ksy, but some of the targets replace unknown values with
null
/nil
or treat them as an error. I think it would make sense to change the behavior of the remaining target languages to allow unknown values as well.The main problem with this is that unknown enum values are difficult to represent in languages like Java, which are strongly typed but don't have native support for something like strongly typed unions ("either" types) or case classes. Allowing unknown enum values would require defining/generating a custom wrapper class that can contain either a known enum object value or an unknown raw integer value, but this would make them less convenient to use in code that doesn't care about unknown enum values.
Related to #300 and #568, and somewhat related to #518.
The text was updated successfully, but these errors were encountered: