Notes on the Row Size subheader #14

evanmiller · 2020-11-30T18:49:01Z

I'll open a PR on the RST file if I have time, but I'd like to quickly share a discovery about the Row Size subheader that should make it easier to detect compressed files and to pull out the Creator strings.

Bytes 344|672 through 380|708 (32-bit|64-bit offsets) consist of six 6-byte text references into Column Text! They have the same structure as the Column Name pointers, but are unpadded: 2 bytes for the index, 2 bytes for the offset, 2 bytes for the length.

Specifically:

Bytes 350|678 through 356|684: Text reference (index, offset, length) into Creator Software string

Bytes 362|690 through 368|696: Text reference (index, offset, length) into Compression string ("SASYZCRL" or "SASYZCR2")

Bytes 374|702 through 380|708: Text reference (index, offset, length) into Creator PROC step name

This should help get rid of the awkward heuristics around detecting data before the column names begin, since now we have exact offsets for these strings. This also helps explain why SASYZCRL appears where it does. (If the Compression string has an offset/length of 0, it means that the file is uncompressed.)
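For anyone who wants to try this before the write-up lands, here is a minimal Python sketch of reading the three references listed above. It is not the ReadStat code. It assumes the byte positions are relative to the start of the Row Size subheader, that the file is little-endian unless told otherwise, and that each Column Text block is already available as a byte string whose offset base matches the one used by the Column Name pointers. The names `TextRef`, `read_text_refs`, and `resolve` are made up for illustration.

```python
import struct
from typing import Dict, List, NamedTuple, Optional


class TextRef(NamedTuple):
    index: int   # which Column Text block
    offset: int  # byte offset within that block
    length: int  # string length in bytes


# Start positions from the note above, as (32-bit, 64-bit) pairs.
REF_OFFSETS = {
    "creator_software": (350, 678),
    "compression": (362, 690),
    "creator_proc": (374, 702),
}


def read_text_refs(subheader: bytes, is_64bit: bool,
                   little_endian: bool = True) -> Dict[str, TextRef]:
    """Unpack the three unpadded 6-byte references (index, offset, length)."""
    # Assumption: positions are relative to the start of the subheader.
    fmt = ("<" if little_endian else ">") + "HHH"
    refs = {}
    for name, (off32, off64) in REF_OFFSETS.items():
        start = off64 if is_64bit else off32
        refs[name] = TextRef(*struct.unpack_from(fmt, subheader, start))
    return refs


def resolve(ref: TextRef, column_text_blocks: List[bytes]) -> Optional[bytes]:
    """Return the referenced bytes, or None when the reference is empty."""
    if ref.length == 0:
        # For the Compression reference this means the file is uncompressed.
        return None
    # Assumption: ref.offset uses the same base as column_text_blocks[i].
    block = column_text_blocks[ref.index]
    return block[ref.offset:ref.offset + ref.length]
```

With this, a compression check becomes `resolve(refs["compression"], blocks) is not None`, and the returned bytes say which of "SASYZCRL" or "SASYZCR2" is in use.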

I've implemented this logic in ReadStat, and it allowed me to rip out several lines of code. So far it seems to work well with test files.

As I said, I will try to get around to writing this up more formally, but in the meantime I wanted others to benefit from this small bit of knowledge.

BioStatMatt · 2020-12-02T15:02:55Z

Excellent. Thank you.

evanmiller mentioned this issue Dec 4, 2020

Row Size Text Pointers #15

Open
