Files json #1501
Conversation
Force-pushed from 137eab7 to 1cb086d
Looks like the same AppVeyor tests are failing on master 😞

You have some failures on AppVeyor besides the persistent ones. The only test that is known to be failing on master right now is …

No dice. Looks like something is getting truncated somehow, or is not quoted and is dropping at spaces. https://ci.appveyor.com/project/ContinuumAnalyticsFOSS/conda-build/build/1.0.957/job/txfsj9rp1puk3fb5#L1106

Hmmm 👀 but I can't reproduce this on my Windows VM 🤕
OK, no problem. Let's look at it together in the office tomorrow.
Travis CI is failing test: … AppVeyor is failing test: …

```python
            "sha256": sha256_checksum(path),
            "size_in_bytes": os.path.getsize(path),
            "file_type": getattr(file_type(path), "name"),
            "prefix_placeholder": prefix_placeholder,
```
`prefix_placeholder` should only exist in the dict if the `has_prefix` entry exists.

```python
            "size_in_bytes": os.path.getsize(path),
            "file_type": getattr(file_type(path), "name"),
            "prefix_placeholder": prefix_placeholder,
            "file_mode": file_mode,
```
Same as `prefix_placeholder`.
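A minimal sketch of one way to address both comments (names are taken from the surrounding diff; treating `None` as "no value" is an assumption):

```python
file_info = {
    "short_path": get_short_path(m, fi),
    "sha256": sha256_checksum(path),
    "size_in_bytes": os.path.getsize(path),
    "file_type": getattr(file_type(path), "name"),
}
# Only attach the prefix-related keys when a placeholder was actually
# detected for this file (assumes has_prefix returns (None, None) otherwise).
if prefix_placeholder is not None:
    file_info["prefix_placeholder"] = prefix_placeholder
    file_info["file_mode"] = file_mode
```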
For win32 py27: os.link is not available here, so skip tests that require it for setup.
New spec says:
- `prefix_placeholder`, `no_link`, and `file_mode` should only exist if they have values
- `inode_first_path` should instead be `inode_paths`, a list of the paths that share the file's inode
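For illustration, a hypothetical entry under that spec might look like this (all values invented; a hard-linked file, so the prefix-related keys are omitted and `inode_paths` lists the paths sharing the inode):

```json
{
  "short_path": "lib/libfoo.so.1",
  "sha256": "0a1b2c3d4e5f…",
  "size_in_bytes": 48216,
  "file_type": "hardlink",
  "inode_paths": ["lib/libfoo.so.1", "lib/libfoo.so"]
}
```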
Looks like the AppVeyor failures are the git_submodule test failing.

Looks good. Thanks for all the work on this.
```python
class FileType(Enum):
    softlink = "softlink"
    hardlink = "hardlink"
    directory = "directory"
```
We're going to have to do some magic on enum construction once we move this into conda. See https://github.com/conda/conda/blob/master/conda/base/constants.py#L63

Actually, according to that, let's also call it `LinkType`. Might just be better to copy that class here. It still needs more work, though.
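A rough sketch of what a copied class plus that construction magic could look like (the integer values and the `coerce` helper are assumptions for illustration, not the actual conda class):

```python
from enum import Enum

class LinkType(Enum):
    # Integer values are assumed stand-ins for conda's historical
    # link-type constants; the member names double as the text form.
    hardlink = 1
    softlink = 2
    copy = 3
    directory = 4

    @classmethod
    def coerce(cls, value):
        # The "magic on enum construction": accept a member, a name
        # string, or a raw integer when building from serialized data.
        if isinstance(value, cls):
            return value
        try:
            return cls[value]
        except KeyError:
            return cls(int(value))
```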
Would it make sense for conda-build to import `LinkType` through `conda-interface`, in order to decrease the amount of duplication? Though if we decide to go that route, `LinkType` will need to be included in the `4.2.x` branch of conda.
```diff
 ignore_files = m.ignore_prefix_files()
 ignore_types = set()
 if not hasattr(ignore_files, "__iter__"):
-    if ignore_files == True:
+    if ignore_files is True:
         ignore_types.update(('text', 'binary'))
```
`FileMode` is here: https://github.com/conda/conda/blob/master/conda/base/constants.py#L55. That should probably be used also (start by copying it into conda-build).
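For reference, a minimal sketch of such an enum (a guess at the shape of the conda class, not a verbatim copy):

```python
from enum import Enum

class FileMode(Enum):
    # Distinguishes how an embedded prefix gets replaced at install
    # time: plain text substitution vs. padded binary replacement.
    text = 'text'
    binary = 'binary'

    def __str__(self):
        return self.value
```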
```diff
     ignore_types.update(('text', 'binary'))
     ignore_files = []
 if not m.get_value('build/detect_binary_files_with_prefix', True):
     ignore_types.update(('binary',))
-ignore_files.extend([f[2] for f in files_with_prefix if f[1] in ignore_types and f[2] not in ignore_files])
+ignore_files.extend(
+    [f[2] for f in files_with_prefix if f[1] in ignore_types and f[2] not in ignore_files])
```
I don't think you need to make this a list. The generator expression alone should work:

```python
ignore_files.extend(f[2] for f in files_with_prefix
                    if f[1] in ignore_types and f[2] not in ignore_files)
```

It's just less efficient to make a full list object, only to immediately throw it away.
Also, can you add a short comment here about what is actually at each index in `files_with_prefix`?
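Something like the following might do (a sketch; the tuple layout is inferred from how `f[1]` and `f[2]` are used above and from `has_prefix` returning a placeholder and a mode):

```python
# Each entry in files_with_prefix is assumed to be a 3-tuple:
#   f[0] -> the prefix placeholder string embedded in the file
#   f[1] -> the file mode, 'text' or 'binary'
#   f[2] -> the file's path relative to the install prefix
```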
```diff
@@ -438,6 +454,99 @@ def create_info_files(m, files, config, prefix):
                      config.timeout)


+def get_short_path(m, target_file):
```
I'm still fine with calling this 'short_path' in the code. 👍
```python
    return target_file


def sha256_checksum(filename, buffersize=65536):
```
Cool implementation. Is `65536` the recommended default buffer size?
From what I read, yes!
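For context, a chunked implementation consistent with that signature might look like this (a sketch, not necessarily the exact code in the PR):

```python
import hashlib

def sha256_checksum(filename, buffersize=65536):
    # Hash the file in 64 KiB blocks so large files never have to be
    # read into memory all at once.
    sha256 = hashlib.sha256()
    with open(filename, "rb") as f:
        for block in iter(lambda: f.read(buffersize), b""):
            sha256.update(block)
    return sha256.hexdigest()
```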
```python
def create_info_files_json(m, info_dir, prefix, files, files_with_prefix):
    files_json_fields = ["short_path", "sha256", "size_in_bytes", "file_type", "file_mode",
                         "prefix_placeholder", "no_link", "inode_first_path"]
```
I guess here we're just calling it `path`, not `short_path`, now. For serializing the enums, let's chat tomorrow about what the serialized form should look like; I have split feelings. `inode_first_path` should now be `inode_paths`.
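One option for the serialized form, for the sake of that chat (a hypothetical helper, serializing members by name rather than by value):

```python
from enum import Enum

def json_safe(value):
    # Enum members serialize to their name (e.g. "hardlink");
    # everything else passes through untouched.
    return value.name if isinstance(value, Enum) else value
```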
```python
    return 1


def create_info_files_json(m, info_dir, prefix, files, files_with_prefix):
```
I would name this function `create_info_files_json_v1` and then hard-code the `1` int.
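A sketch of that suggestion (`files.json` and `build_info_files_json` come from this diff; the exact dict layout and the `version` key name are assumptions):

```python
import json
import os

def create_info_files_json_v1(m, info_dir, prefix, files, files_with_prefix):
    # The version is hard-coded to match the _v1 suffix; a future format
    # change would get a new function instead of mutating this one.
    files_json_info = {
        "paths": build_info_files_json(m, prefix, files, files_with_prefix),
        "version": 1,
    }
    with open(os.path.join(info_dir, "files.json"), "w") as fh:
        json.dump(files_json_info, fh, sort_keys=True, indent=2)
```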
```python
def build_info_files_json(m, prefix, files, files_with_prefix):
    no_link = m.get_value('build/no_link')
    files_json = []
    for fi in files:
```
Make sure `files` is sorted here. Might be better just to do it again:

```python
for fi in sorted(files):
```
```python
        prefix_placeholder, file_mode = has_prefix(fi, files_with_prefix)
        path = os.path.join(prefix, fi)
        file_info = {
            "short_path": get_short_path(m, fi),
```
^^^^ This is exactly why I want to call it `short_path`!!!!! Because two lines up there's a `path` that's completely different!!!!

```python
            "short_path": get_short_path(m, fi),
            "sha256": sha256_checksum(path),
            "size_in_bytes": os.path.getsize(path),
            "file_type": getattr(file_type(path), "name"),
```
Oh, here. We can talk about how to serialize tomorrow.
Having this be a … Maybe here's why: allowing this document to be JSON loosens up the extensibility requirements too much. You need a lot of extra information, like …
Oh, and I didn't realize this was merged already @msarahan :-/
This is a pretty insignificant metadata file. I think you're heavily overthinking it. Your additions have been good, but IMHO this is really not worth any more of anyone's time. We should discuss this in Thursday's meeting to put it to bed.
Mike, this is insignificant to you for conda-build. It's extremely …
Actually, this bothers me @msarahan. If it's so insignificant to you, then why has there been so much argument about it? It is significant to me. I wouldn't be pushing on it if it wasn't.
It bothers me that you have left this with zero input for 2 weeks - and longer than that since our discussion, which was arguably the shortest, most conclusive, least contentious meeting our group has ever had. If you really want something, you need to do a better job pushing for it consistently, and from the start. You at least need to communicate your concerns and interests, if you lack time to actually be involved.

I have pushed primarily against what I saw as a bizarre, esoteric file format (PSV). I saw that as a wart on conda-build. I am unconvinced by your technical arguments for choosing this format, especially after your earlier admission that you preferred the PSV file format primarily for readability. There are many ways to validate data (you've written quite a few), and I don't see how this is an exceptional challenge.
That's disingenuous. I've been active with discussions with Sophia. I just found out this was ready for review today, when Sophia pinged me on Flowdock. She actually pinged me on Friday too; I just looked back to check, because I miss things on Flowdock. But I wouldn't consider the Friday-to-Monday gap negligent.

As you're the tech lead of conda-build, and Sophia has been the primary implementer, I've made sure she's working closely with you. I hope you'd consider that at a minimum a courtesy, if not generally a requirement. You again assume that just because you and I haven't communicated more directly, I haven't been involved in the discussion. That really couldn't be further from the truth.
> yet
How can I, on one hand, be over-thinking it by trying to understand and explain why my subconscious is rebelling so strongly, but on the other be held to just my initial reasoning, which you found insufficient?
Which is it? Are you unconvinced by the technical arguments, or is the argument against because you think the file format is bizarre and esoteric? You're moving the goal line from one end of the field to the other. I guess let's address both again.

First, the PSV wart. If it's the …

And a technical reason: a tabular format (akin to a relational database table) enforces correctness much more than a throw-whatever-you-want-into-it document format. The tabular format has a more directly enforceable and verifiable schema, but correctness also comes more naturally in the tabular format's construction, so mistakes are less likely. Yes, when developing on conda, I can enforce and verify. I'd hope in conda-build you'd also want to do everything you can to ensure correctness, so that if/when I do verify and enforce, we can be as sure as possible that we're not building packages that aren't installable. It seems to me correctness is even more important for conda-build, since you don't want to have to ask users to rebuild packages to correct bugs.
That seems aggressive and dangerous given last Thursday's post-weekly discussion. It's still a good point. For basically a year now I've been fighting a code base that has prioritized implementation simplicity over correctness, defensiveness, and robustness. Simplicity is optimal for architectural-level design decisions. It's naive and dangerous at the implementation level. It leads to endless edge cases. I have over 1000 open issues--most bugs--to prove it. I'm sick of the fire hose of bug reports. I'm passionate about these "small" decisions I've been bringing to you because these are tools that are meaningful and helpful at an implementation level.
I agree that we need to communicate more.
Thanks for the links. The format is more common than I thought. I'm still not convinced. The technical argument that I find suspect is that leaving the door open to customization of the format is by definition evil and makes verification harder. You can write schemas for JSON that are equally restrictive to the constraints of a tabular format. Proliferation of file formats within conda-build (and conda for that matter) is something that I am very hostile to. I'm not arguing about most of your code improvements here; they pretty much universally improve code quality, and I appreciate them.

Just to pile on to the link party, here are several links about why one might choose JSON over *SV: https://www.reddit.com/r/golang/comments/46leew/csv_vs_json_which_is_better_to_read_data_from/

This one captures a bit of what I see in JSON: http://inquidia.com/news-and-info/hadoop-file-formats-its-not-just-csv-anymore (see the bit about JSON Records).

I do not want a willy-nilly-no-schema-free-for-all. I want greater freedom to adapt to needs as we encounter them. Ultimately, I care much more about not picking up a new file format than anything else. As you say, conda and conda-build are a lot of edge cases. One more file format is another edge case in my eyes.
I didn't say that at all. I think the format should be extensible. A tabular format allows that.
Yes, but nobody has done that here. A change to a tabular format would be less than five lines of additional code. And the schema is built in when a header is included. The JSON Record discussion you pointed to is NOT what we're doing here. JSON Records as described in your link have a full JSON Schema (http://json-schema.org) at the beginning of the file, and then each record is a full, valid JSON object, separated by a line break. Are you proposing we implement that here? Note that, with a JSON Record file, you can't just …
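To make the distinction concrete, a JSON Records file as described there might look something like this (hypothetical sketch: a schema line followed by one object per line, which is exactly why the file as a whole is not a single loadable JSON document):

```json
{"$schema": "http://json-schema.org/draft-04/schema#", "properties": {"short_path": {"type": "string"}}}
{"short_path": "bin/foo", "size_in_bytes": 1024}
{"short_path": "lib/libfoo.so", "size_in_bytes": 48216}
```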
I don't blame you. We're deprecating three various files here, and I'd be fine removing four eventually (the fourth being …). It strikes me here that we added …
Thanks for the acknowledgement :)
I didn't find anything I connected with there. Is there something in particular that you do? Other than the headline asking the question, this link seems to advocate for tabular formats over JSON/XML, e.g. …
I really don't think you're advocating implementation of the JSON Record format here. It WOULD be a different file type; using the … Of course, some type of binary format--either standardized or custom--would also just be adding another file type. From that article, you may have connected with this:
We DO NOT WANT flexibly structured data here. We want our data to have a structured, identifiable schema. That schema can evolve over time, but doing so is a deliberate effort, one that you employ schema migration strategies with. Not an "oh, just stick it there" strategy. That's not this file. (Although we do need a file like that, as I said above.) This structure is native to tabular data with a header. Doing it correctly with JSON takes a lot more logic and code.
In a lot of ways, without the proper controls around JSON (implementing a JSON Record format would definitely qualify, but that's not plain JSON), one JSON file format that can "just stick it there" and morph over time introduces many more opportunities for edge cases than a format with more inherent structure. For this particular data, we need to balance extensibility with structure. A tabular format strikes the right balance.
Hi there, thank you for your contribution! This pull request has been automatically locked because it has not had recent activity after being closed. Please open a new issue or pull request if needed. Thanks!
ref: #1486