2.0.0 roadmap
This page is a work-in-progress plan for updates to the ASDF Standard that will be included in version 2.0.0.
A little bit of history of ASDF schema wanderings
In the beginning all schemas were in asdf-standard, including schemas currently in astropy and gwcs.
astropy.coordinates was changing rapidly and we could not keep up with those changes. This led to many bug reports related to coordinate frames. So we decided to move the "transform" and "coordinate" tag code to astropy. At some point some of the schemas were moved there too because IIRC they were considered "astropy" or "astronomy" specific (e.g., the coordinate schemas, but not the transform schemas). At that time WCS schemas were moved to gwcs.
Most problems came from adding/changing attributes in astropy classes. The idea was that by moving tags to astropy they will be easier to maintain because a failing tag test will alert us to problems. This sort of worked. Instead we could have improved testing with astropy dev in asdf to keep up with changes in astropy. Another perceived advantage was that an astropy release would be self-contained because it has versions of supported tags and schema with the code. Essentially this was a versioning problem. How to properly handle versioning was poorly understood at the time. Support has improved since then.
In retrospect, was the decision to move tags and schemas the right one? In hindsight, moving the tags was the right solution. Moving the schemas to astropy was probably not. Both problems were in a way management issues: a lack of resources to support the development. The first one could also have been solved or avoided with better testing. There was also the feeling that, until another language was supported, this would simplify management, and that once more languages were supported we would have to move the schemas. We did not think carefully about the mechanics of how that would happen.
- Perception of a more stable ASDF Standard
- Users may be skeptical of a file format that changes so often
- We may be discouraging other implementations by appearing unstable
- Keep very astronomy-specific stuff out of a more general standard for scientific data
- Much of the modeling stuff has potential users well outside of astronomy, but it really shouldn't be a main requirement for supporting ASDF
- An ASDF implementation that does not implement transforms would still be useful
- Developer convenience
- Updates to schemas will no longer require release of asdf and asdf-standard before they can be used in other packages
- The packages that actually implement the tags can directly include the schemas or have them as a dependency, and updates to those schemas can move in lockstep as the objects and the tags that serialize those objects are developed. No updates or new releases of asdf-standard or asdf are needed to achieve this. Think of this as something like the pytest and pytest plugin ecosystem.
- Does this raise a policy issue? When someone binds tags to an implementation, do we outline how they can be disassociated when someone else wants support in a different language? This is essentially what we are facing with transforms and LSST.
- A user will have to install more dependencies
- Can this inconvenience be solved by collecting the dependencies into a single meta-package that users can install?
- e.g. instead of "pip install asdf asdf-transform-schemas asdf-fits-schemas asdf-coordinate-schemas" users can "pip install asdf-astropy" and get everything they might need
- That's a Python solution, what about other languages? We ought to suggest solutions for common cases.
- Meta packages in Python can be a real pain
- Installation of more dependencies should be handled via standard package dependency management (hopefully)
- The ecosystem of ASDF repositories and packages will become more fragmented
- Increased maintenance overhead, more combinations of package versions to debug
- We may need some enhanced schema versioning management, e.g. WCS schemas X support transform schemas (or the equivalent of version-map) Y, coordinates schemas Z and unit schemas U.
- Astropy will not install these packages by default
- Ultimately we would like that to happen. But in the meantime, we should have a simple pip mechanism to install all that is needed in one command.
- Would be kind of neat for the code to do that when trying to use an astropy tool that needs it. XXX is not installed, type i to install it...
- Perhaps a summary for explicit use cases in a table is worthwhile
- Create repository with Python package that installs transform schemas.
- Astropy lists the new package as an additional optional dependency
- Users must install "all", or manually install both packages, or maybe astropy would be open to adding an "asdf" extras category
- Options for non-Python software
- Include as submodule and package with the software itself somehow
- Create repository without Python package
- Astropy incorporates the schemas as a submodule and the files are installed with astropy on the user's system
- Non-Python software also "vendorizes" the schemas, potential for multiple copies of the same schema to exist on a user's system
- Schemas not included with software but downloaded over http from some centralized "schema repository"
- Not desirable to require network connection to open files
- But software could still "vendorize" the schemas and only hit the http service as a backup
- Caching would also help
- Create schema package but also move astropy tag classes to a new package, asdf-astropy
- astropy users would only need to explicitly install a single dependency
- other packages (transform schemas, asdf itself) would be hard dependencies of asdf-astropy and would be installed automatically
- potential advantage to be able to work on schemas and tag code without waiting for astropy releases
- would need tests in astropy to encourage developers to maintain the tag code even if it lives in a separate package
- Keep schemas in the ASDF Standard
- Continue to endure pain around releasing both asdf-standard and asdf to make new schema material available to astropy
- dependency tree remains simple
- Consider moving all ASDF related packages to a dedicated github organization.
- Consider using pipfile and pipenv (although this may be some way in the future)
- How is the new package going to be installed in different environments? How is asdf-standard installed by asdf-cpp or in a C-only environment?
- Include the schema repo(s) as submodules
- Auto-generate header files that contain the schema content as a string?
- Alternatively install the schemas at some path in the user's system that the executable knows how to find
Remove the transform schemas from version_map-2.0.0.yaml. Transform schemas not included in ASDF Standard 1.x will only be available via a new dedicated repository, which will be installable as a Python package.
Consider adding additional transform attributes, serialized in the basic Transform tag, to the base Transform schema.
Things like bounding_box, equivalencies, .... One reason to have them in the schemas is that libraries implementing the standard in other languages are not aware that these attributes exist. However, they can change the behaviour of the deserialized object. One example is the bounding_box and its use in WCS. Currently it is written to file but is not in the schema. Which means the LSST asdf conversion code will not take it into account. That particular case may be OK but the general problem exists.
Some schemas already support this, for example affine.
Astropy normally writes these out as numpy arrays. However, a different library may write them out as plain YAML arrays.
Astropy should be able to read asdf files where values are written as YAML arrays instead of numpy arrays, if the schema validates the file.
For example, an affine transform written as YAML array, while valid to the affine schema, cannot be read by astropy.
These are the other (non-transform) ASDF Standard schemas, grouped by URI prefix:
core - schemas for essential asdf objects like ndarray, the top-level node, etc
fits - schema supporting nesting a FITS file inside of an ASDF file
unit - support for units and quantities
time - schema supporting time objects, with "special emphasis ... on supporting time scales that are used in astronomy"
Some of these (e.g., fits) may be candidates for moving out of the ASDF Standard and into their own satellite repositories.
Consider also moving schemas from the following packages into separate repositories:
- gwcs
- astropy.coordinates
Users have expressed some interest in the new features of JSON Schema draft-07, so we might take this opportunity to designate draft-07 as the ASDF Standard 2.0.0 schema format.
One downside of this change is that draft-07 and draft-04 schemas are mutually incompatible, so all current schemas would need to be updated before they could be used with an ASDF Standard 2.0.0 file.
A potentially troublesome JSON Schema change introduced in draft-06 is that the "integer" type now validates any number with a zero fractional part, so floats like 1.0 will begin validating where they did not before.
Is this a problem since it is a relaxation and presumably won't break reading old files? It is a reasonable interpretation; but does it make it difficult to support in our or other libraries?
Actually it will help with at least one issue we know of in the jwst pipeline, where a spectral order needs to be validated as an integer but comes out of a model as a float because modeling turns everything into floats.
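The difference between the two interpretations of "integer" can be sketched in a few lines. This is a hand-written illustration of the rule, not code from any validator library; the function names are made up for this example, and the draft-04 function reflects how Python validators have typically applied that draft.

```python
def is_integer_draft04(value):
    """Draft-04 style check: only true integers match; bool is excluded."""
    return isinstance(value, int) and not isinstance(value, bool)


def is_integer_draft06(value):
    """Draft-06 style check: a float with a zero fractional part also matches."""
    if is_integer_draft04(value):
        return True
    return isinstance(value, float) and value.is_integer()


# A spectral order that comes out of a model as a float.
spectral_order = 1.0
print(is_integer_draft04(spectral_order))  # False
print(is_integer_draft06(spectral_order))  # True
print(is_integer_draft06(1.5))             # False
```

Because the newer rule only accepts more values, files that validated under the older rule would still validate.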
We are currently maintaining two parallel URI schemes: http:// URIs that refer to schemas, and tag: URIs that refer to tagged YAML objects. There is a 1:1 mapping between the two sets of URIs. Since YAML supports http:// URIs as tags, we have the option to drop the tag: URIs entirely and just use http:// URIs everywhere. This would remove a source of confusion and mistakes.
The version_map-x.y.z.yaml files in the ASDF Standard would need to be changed in some way, as they currently refer to schemas by tag.
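To make the 1:1 mapping concrete, here is a minimal sketch of converting between the two URI forms. It assumes the prefix pattern used by the stsci.edu core schemas (tag:stsci.edu:... and http://stsci.edu/schemas/...); the helper names are hypothetical, and other organizations may follow a different pattern.

```python
# Assumed prefixes, based on the stsci.edu core schema URIs.
TAG_PREFIX = "tag:stsci.edu:"
SCHEMA_PREFIX = "http://stsci.edu/schemas/"


def tag_to_schema_uri(tag):
    """Map a tag: URI to the matching http:// schema URI."""
    if not tag.startswith(TAG_PREFIX):
        raise ValueError(f"unrecognized tag prefix: {tag}")
    return SCHEMA_PREFIX + tag[len(TAG_PREFIX):]


def schema_uri_to_tag(uri):
    """Map an http:// schema URI back to the matching tag: URI."""
    if not uri.startswith(SCHEMA_PREFIX):
        raise ValueError(f"unrecognized schema prefix: {uri}")
    return TAG_PREFIX + uri[len(SCHEMA_PREFIX):]


tag = "tag:stsci.edu:asdf/core/ndarray-1.0.0"
print(tag_to_schema_uri(tag))
# http://stsci.edu/schemas/asdf/core/ndarray-1.0.0
```

If only http:// URIs were used, both helpers and the bookkeeping around them would become unnecessary.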
It is useful to have an overarching version that ties together a group of related schemas – for example, software can read a list of schemas associated with that version and write objects to an ASDF file that validate against schemas in that particular set. For the ASDF core schemas, the ASDF Standard version provides that overarching version. For user-defined schemas, there is currently no solution, and no library support for selecting a particular version of user-defined schema on write.
Define a format for a schema collection manifest
This will be a YAML file, analogous to the existing version_map-x.y.z.yaml files, that defines a schema collection version. The file will need the following features:
- Unique id that defines the name and version
- Could be a similar URI to the schema ids, for example http://stsci.edu/schema_collections/core-1.0.0.yaml
- Would be used by implementations to allow users to select a particular version of a schema collection
- List of schemas in the "collection"
- A list of schema id URIs
If all we do is drop the transform schemas, this will simply contain all of the schema ids of non-transform schemas currently listed in version_map-1.5.0.yaml.
Since the version_map files aren't described by the spec, we are free to replace them with the new-style manifest files. This would simplify implementation. We have the choice of creating one manifest for all schemas in the ASDF Standard, or multiple manifests for each URI prefix like "core", "unit", etc.
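As a rough sketch of what such a manifest might hold, the following builds one as a plain Python dict from a list of schema ids, dropping the transform schemas. The key names ("id", "schemas"), the helper name, and the schema version numbers are all assumptions made for illustration; the actual manifest format is still to be defined.

```python
def build_manifest(manifest_id, schema_ids, exclude_prefixes=()):
    """Return a manifest dict listing every schema id not under an excluded prefix."""
    kept = [
        schema_id
        for schema_id in schema_ids
        if not any(schema_id.startswith(prefix) for prefix in exclude_prefixes)
    ]
    return {"id": manifest_id, "schemas": sorted(kept)}


# Illustrative schema ids; the version numbers are placeholders.
version_map_schemas = [
    "http://stsci.edu/schemas/asdf/core/ndarray-1.0.0",
    "http://stsci.edu/schemas/asdf/unit/quantity-1.1.0",
    "http://stsci.edu/schemas/asdf/transform/affine-1.2.0",
]

manifest = build_manifest(
    "http://stsci.edu/schema_collections/core-1.0.0.yaml",
    version_map_schemas,
    exclude_prefixes=("http://stsci.edu/schemas/asdf/transform/",),
)
print(manifest["schemas"])
```

Splitting into one manifest per URI prefix would amount to calling the same helper once per prefix with a different id.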
Similar to the existing extension metadata section, this would be a list of schema collections used when writing the file. Useful for debugging and providing warnings when support for a given collection is missing.
This is an experimental feature that sought to make serialization of subclasses more convenient by reusing the superclass's schema, with some additional metadata appended to inform the library of which subclass to instantiate. This feature has some drawbacks. For one, the name of a class or subclass is an implementation detail that is meaningless to other ASDF implementations. Another drawback is that by using a generic schema for multiple subclasses, we are not able to validate as strictly as we could with separate schemas – for example, if subclass A requires property "foo", but subclass B does not, we can't make the property required because both objects must validate against the same schema.
These drawbacks may be reason enough to remove subclass_metadata from the standard.
Replace the "extensions" section of the file history with a section for implementation-specific metadata
The "extensions" section of the history object contains a list of AsdfExtension class names used by the Python library when writing the file. This is useful when debugging issues with a file, and enables the Python library to issue warnings when an extension that was used to write the file is missing on read. Since the concept of an "extension" is not defined by the ASDF Standard and is an implementation detail of the Python library, it may not be reasonable to require that other implementations store their metadata in the same structure.
An alternative is to replace "extensions" with a new section for freeform implementation-specific metadata.
Perhaps use a standard convention for library specific metadata; e.g., some sort of standard prefix?
There's been discussion around supporting additional compression modes offered by the blosc library, particularly zstd with blosc's byte transposition filter. Supporting the transposition would require new field(s) in the ASDF block header that describes the compression block size and the fact that the bytes were transposed. We would also need to add a new 4-byte compression code for zstd.
Could we just create a block prefix area for extra metadata that is implicit for that compression scheme? Does it have to be explicit in the block fields? This would allow much more flexible additions in the future without having to keep changing the definition of the block structure.
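For reference, the fixed block header can be sketched with the struct module. The field order and sizes below follow my reading of the ASDF 1.x block header and should be treated as an assumption; the "zstd" code is the hypothetical new value discussed above, and the helper name is made up for this example.

```python
import struct

BLOCK_MAGIC = b"\xd3BLK"
# flags, compression, allocated_size, used_size, data_size, checksum (big-endian)
HEADER_FORMAT = ">I4sQQQ16s"


def pack_block_header(compression, data_size, used_size, flags=0):
    """Pack a block header with a 4-byte, NUL-padded compression code."""
    code = compression.encode("ascii")
    if len(code) > 4:
        raise ValueError("compression code must fit in 4 bytes")
    body = struct.pack(
        HEADER_FORMAT,
        flags,
        code.ljust(4, b"\0"),
        used_size,      # allocated_size: no extra space reserved in this sketch
        used_size,
        data_size,
        b"\0" * 16,     # checksum left empty
    )
    # header_size counts the bytes that follow the header_size field itself.
    return BLOCK_MAGIC + struct.pack(">H", len(body)) + body


header = pack_block_header("zstd", data_size=4096, used_size=1024)
print(len(header))  # 54
```

The layout has no slot for a compression block size or a transposition flag, so those would need either new header fields or a metadata prefix inside the block data.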
The ASDF Standard doesn't specify behavior around null values, but the Python library currently strips out any object key whose value is null. Some users would prefer that keys with null values be preserved. Regardless of which behavior we settle on, we should consider adding language to the ASDF Standard that defines how nulls are to be treated.
This needs some careful thought. There are cases where the absence of something should be taken to imply a certain mode. We probably have been misusing None. Getting rid of defaults probably makes this easier (e.g., a distinction between a missing attribute, which is handled by the tag code, and a None value, which is preserved in the tree). Yet this would raise the question of how extensions document the handling of missing attributes and their defaults, since we cannot use the schema directly (unless we have a special field that describes the behavior the library should have without actually enforcing it through schema validation tools). This matters because extensions ought to be language neutral in principle, and someone implementing the extension in a different language needs to know how to handle these issues without being an expert in the original implementation.
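The stripping behavior under discussion can be illustrated with a small recursive helper. This is a sketch of the idea, not the Python library's actual code, and the function name and sample tree are made up.

```python
def strip_nulls(node):
    """Recursively drop dict keys whose value is None."""
    if isinstance(node, dict):
        return {
            key: strip_nulls(value)
            for key, value in node.items()
            if value is not None
        }
    if isinstance(node, list):
        return [strip_nulls(item) for item in node]
    return node


tree = {"name": "detector", "bounding_box": None, "meta": {"note": None, "order": 1}}
print(strip_nulls(tree))  # {'name': 'detector', 'meta': {'order': 1}}
```

After stripping, a reader can no longer tell "bounding_box was explicitly set to null" apart from "bounding_box was never written", which is the distinction some users want preserved.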
The ASDF Standard doesn't provide explicit guidance on how the "default" annotation in the schemas is to be used. The Python library currently adds default values to the tree where missing on read, and removes values that match the default on write. This feature seems intended to reduce file size when many objects with default values are present. There are some downsides: the files when viewed independent of the schemas seem to be missing some of their data (including required fields), and it's not always possible to identify a single default value for objects that are validated against multiple schemas using combiners.
Regardless of which behavior we settle on, we should consider adding language to the ASDF Standard that defines how default values are to be treated.
We talked about the option of removing defaults from the ASDF standard. Can we give that serious consideration? This is linked a bit to the previous item regarding null values.
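The fill-on-read and remove-on-write behavior can be sketched for a flat object as follows. The helper names and the sample schema are hypothetical, and the sketch ignores nested objects and combiners.

```python
def fill_defaults(node, schema):
    """Read side: add the schema default for every missing property."""
    filled = dict(node)
    for name, subschema in schema.get("properties", {}).items():
        if name not in filled and "default" in subschema:
            filled[name] = subschema["default"]
    return filled


def remove_defaults(node, schema):
    """Write side: drop every property whose value equals its schema default."""
    properties = schema.get("properties", {})
    return {
        name: value
        for name, value in node.items()
        if not ("default" in properties.get(name, {})
                and properties[name]["default"] == value)
    }


schema = {
    "properties": {
        "degree": {"type": "integer", "default": 1},
        "name": {"type": "string"},
    }
}

on_disk = remove_defaults({"degree": 1, "name": "poly"}, schema)
print(on_disk)                         # {'name': 'poly'}
print(fill_defaults(on_disk, schema))  # {'name': 'poly', 'degree': 1}
```

The on-disk form shows the downside noted above: without the schema at hand, the file appears to be missing "degree" entirely.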
The YAML 1.1 spec permits object keys to themselves be objects or arrays, which isn't well supported by Python (since dicts and lists are not hashable). A more serious issue is that complex keys are not at all covered by JSON Schema, since JSON only supports string keys. Consider declaring in the ASDF Standard that object or array keys are not permitted.
Restricting our tree to a subset of YAML would also offer the benefit of a simplified implementation if we ever decide to write our own fast YAML parser.
One option is to require that complex keys be encoded as strings, perhaps with some specification of what is legal. Again, this would require some thought. We would like to stay away from an anything-goes policy for keys.
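As one possible reading of "encode complex keys as strings", the sketch below serializes non-string keys to compact JSON text. The encoding choice and the function name are assumptions for illustration only.

```python
import json


def encode_key(key):
    """Return a string form of a mapping key; non-strings become compact JSON."""
    if isinstance(key, str):
        return key
    return json.dumps(key, sort_keys=True, separators=(",", ":"))


# A list cannot be used directly as a dict key in Python.
try:
    {[1, 2]: "value"}
except TypeError as err:
    print("rejected:", err)

tree = {encode_key([1, 2]): "value", encode_key({"b": 1, "a": 2}): "other"}
print(sorted(tree))  # ['[1,2]', '{"a":2,"b":1}']
```

A real specification would also need to say how an ordinary string key that happens to look like "[1,2]" is told apart from an encoded list.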
Unlike semver for software, there is no clear winner among versioning strategies for schemas. SchemaVer (https://github.com/snowplow/iglu/wiki/SchemaVer) is one option. Review and consider revising the section in the ASDF Standard on schema versioning.
Revise section on "Handling version mismatches"
The ASDF Standard documentation recommends that libraries read later versions of objects than they actually support, for "future-proofing". This may be dangerous, because new data added in later versions of a schema might be discarded by the library if unrecognized, thus corrupting the file when written back out.
We may wish to revise this section to instead recommend against attempting to handle unrecognized versions.
This is where a real URL to refer to might be handy. Suppose your version of the library is trying to read a later version of an object. Maybe it would work, and maybe not. The library could retrieve information about the newer version to see which older versions it is compatible with, and use that to decide whether or not to fail. If that information is not accessible, it fails. Maybe this is a bit too fancy...
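The stricter policy, refusing versions the library does not recognize, could look roughly like this. The exception class and function names are hypothetical, and versions are assumed to be plain "major.minor.patch" strings.

```python
class UnsupportedVersionError(Exception):
    """Raised when a tag version is not one the library knows how to read."""


def parse_version(text):
    """Turn '1.2.0' into (1, 2, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def select_version(found, supported):
    """Return the found version if supported; otherwise refuse to read it."""
    if found in supported:
        return found
    newest = max(supported, key=parse_version)
    if parse_version(found) > parse_version(newest):
        raise UnsupportedVersionError(
            f"version {found} is newer than the latest supported version {newest}"
        )
    raise UnsupportedVersionError(f"version {found} is not supported")


print(select_version("1.1.0", ["1.0.0", "1.1.0"]))  # 1.1.0
```

Failing early avoids silently discarding data that a later schema version added, at the cost of rejecting some files that would have been read correctly.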
Our Python library always writes an ASDF_STANDARD comment near the top of each file with the version of the standard that was used. This comment, however, is not described in the ASDF Standard documentation. It is useful to inform implementations of the anticipated structure of the tree, particularly with regard to metadata.
We may wish to merge the ASDF_STANDARD version and the ASDF file format version – it may not be useful to maintain two separate version numbers. In that case this comment can be replaced with ASDF 2.0.0 for new files.
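A reader can recover both version numbers from the leading comment lines with a short scan. This sketch assumes the comments take the form "#ASDF x.y.z" and "#ASDF_STANDARD x.y.z" at the top of the file; the function name and sample lines are made up for illustration.

```python
import re

HEADER_COMMENT = re.compile(rb"^#(ASDF|ASDF_STANDARD) (\d+\.\d+\.\d+)\s*$")


def read_versions(first_lines):
    """Collect the version numbers found in the leading comment lines."""
    versions = {}
    for line in first_lines:
        match = HEADER_COMMENT.match(line)
        if match:
            name = match.group(1).decode("ascii")
            versions[name] = match.group(2).decode("ascii")
    return versions


sample = [b"#ASDF 1.0.0\n", b"#ASDF_STANDARD 1.5.0\n", b"%YAML 1.1\n"]
print(read_versions(sample))  # {'ASDF': '1.0.0', 'ASDF_STANDARD': '1.5.0'}
```

If the two version numbers were merged, the scan would only need to look for the single "#ASDF" line.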
Nadia pointed out that a schema that anyOf-combines number and quantity would be useful, since this is a common case, particularly in the transform schemas.
The current draft-01 yaml-schema metaschema includes three properties related to the style of the serialized YAML:
propertyOrder – specify the order in which object properties should be written
flowStyle – specify the YAML style for an array or an object
style - specify the YAML style for a string
Some of these may represent early ideas that did not turn out to be useful.
- We can probably drop support for the old history format.
- It seems odd to store these as strings instead of two numeric fields.
Used as a utility to indicate that value is a literal constant.
- ???
- Don't see any evidence of use.
- Propose to drop this schema for reasons described above
Allow referencing of array-like objects in external files. These files can be any type of file and in any absolute or relative location to the asdf file. Loading of these files into arrays is not handled by asdf.
- Is this useful?
- By definition the asdf library won't handle loading the external array, so custom code is always required.
- Why not keep the schema alongside that custom code?
- Propose to drop this schema for reasons described above
Defines a new unit... The new unit must be defined before any unit tags that use it.
- The tag class for this schema was never implemented
- What does it mean to "define before" in a tree?
- The current quantity schema permits either a single-number quantity or an array. In some cases (maybe most?) users are going to expect one or the other and not actually want both. It may be helpful to provide support for an array of quantities in a separate schema.