I'll open a PR on the RST file if I have time, but I'd like to quickly share a discovery about the Row Size subheader that should make everyone's life easier detecting compressed files and also pulling out the Creator strings.
Bytes 344|672 through 380|708 consist of 6-byte text references into Column Text! They have the same structure as the Column Name pointers, but are unpadded: 2 bytes for the index, 2 bytes for the offset, 2 bytes for the length.
Specifically:
Bytes 350|678 through 356|684: Text reference (index, offset, length) into Creator Software string
Bytes 362|690 through 368|696: Text reference (index, offset, length) into Compression string ("SASYZCRL" or "SASYZCR2")
Bytes 374|702 through 380|708: Text reference (index, offset, length) into Creator PROC step name
This should help get rid of the awkward heuristics around detecting data before the column names begin, since now we have exact offsets for these strings. This also helps explain why SASYZCRL appears where it does. (If the Compression string has an offset/length of 0, it means that the file is uncompressed.)
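To illustrate the layout described above, here is a minimal Python sketch that reads the three text references from a raw Row Size subheader buffer. It is not the ReadStat implementation. The names `TextRef`, `TEXT_REF_OFFSETS`, `read_text_refs` and `is_compressed` are hypothetical. The sketch assumes the byte positions are measured from the start of the subheader, that the first number in each pair applies to 32-bit files and the second to 64-bit files, and that the caller already knows the file's endianness. It also treats a zero length as the "uncompressed" marker, which is one reading of the offset/length rule.

```python
import struct
from typing import Dict, NamedTuple


class TextRef(NamedTuple):
    """Unpadded 6-byte reference into Column Text (hypothetical helper type)."""
    index: int   # which Column Text subheader
    offset: int  # offset of the string within that subheader
    length: int  # length of the string in bytes


# Start positions as (32-bit, 64-bit), taken from the byte ranges listed above.
# Assumption: positions are relative to the start of the Row Size subheader.
TEXT_REF_OFFSETS = {
    "creator_software": (350, 678),
    "compression": (362, 690),
    "creator_proc": (374, 702),
}


def read_text_refs(subheader: bytes, is_64bit: bool,
                   little_endian: bool = True) -> Dict[str, TextRef]:
    """Unpack the three (index, offset, length) triples, 2 bytes per field."""
    fmt = ("<" if little_endian else ">") + "HHH"
    refs = {}
    for name, (start32, start64) in TEXT_REF_OFFSETS.items():
        start = start64 if is_64bit else start32
        refs[name] = TextRef(*struct.unpack_from(fmt, subheader, start))
    return refs


def is_compressed(refs: Dict[str, TextRef]) -> bool:
    """Assumption: an empty Compression reference means an uncompressed file."""
    return refs["compression"].length != 0
```

Resolving a reference to the actual string still requires locating the Column Text subheader selected by `index`, which is left out here because the base that `offset` is measured from is not stated above.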
I've implemented this logic in ReadStat, and it allowed me to rip out several lines of code. So far it seems to work well with test files.
As I said, I will try to get around to writing this up more formally, but in the meantime I wanted others to benefit from this small bit of knowledge.