
Files json #1501

Merged
merged 5 commits into from
Nov 14, 2016

Conversation

soapy1
Contributor

@soapy1 soapy1 commented Nov 1, 2016

ref: #1486

@soapy1 soapy1 force-pushed the files-json branch 2 times, most recently from 137eab7 to 1cb086d November 8, 2016 15:37
@soapy1 soapy1 changed the title from [WIP] Files json to Files json Nov 8, 2016
@soapy1
Contributor Author

soapy1 commented Nov 8, 2016

Looks like the same appveyor tests are failing on master 😞

@soapy1 soapy1 added the 4_Needs_Review [deprecated] use milestones/project boards label Nov 8, 2016
@msarahan
Contributor

msarahan commented Nov 9, 2016

You have some failures on appveyor besides the persistent ones. The only test that is known to be failing on master right now is test_relative_git_url_submodule_clone but you have a few others. Please take a look.

@msarahan
Contributor

msarahan commented Nov 9, 2016

no dice. Looks like something is getting truncated somehow, or is not quoted and is dropping at spaces. https://ci.appveyor.com/project/ContinuumAnalyticsFOSS/conda-build/build/1.0.957/job/txfsj9rp1puk3fb5#L1106

@soapy1
Contributor Author

soapy1 commented Nov 9, 2016

hmmm 👀 but I can't reproduce this on my windows vm 🤕

@msarahan
Contributor

msarahan commented Nov 9, 2016

OK, no problem. Let's look at it together in the office tomorrow.

@soapy1
Contributor Author

soapy1 commented Nov 11, 2016

TravisCI is failing tests tests/test_api_build.py::test_build_with_activate_does_activate, tests/test_api_build.py::test_dirty_variable_available_in_build_scripts, and tests/test_api_build.py::test_build_with_no_activate_does_not_activate for python2.7 on master as well 😞 (these look flaky and are related to filelock).

AppVeyor is failing tests tests/test_api_build.py::test_relative_git_url_submodule_clone on python3x_64 and tests/test_api_build.py::test_build_msvc_compiler[14.0] on python35_64. This is consistent with other runs except for the test_build_msvc_compiler test (though that looks like it may be a flaky error).

"sha256": sha256_checksum(path),
"size_in_bytes": os.path.getsize(path),
"file_type": getattr(file_type(path), "name"),
"prefix_placeholder": prefix_placeholder,
Contributor Author

prefix_placeholder should only exist in the dict if the has_prefix entry exists
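
A minimal sketch of that conditional-key behavior (values illustrative, not this PR's exact code):

short_path, sha256, size_in_bytes = "bin/foo", "0" * 64, 512
prefix_placeholder = None  # only set when has_prefix matched this file

file_info = {
    "short_path": short_path,
    "sha256": sha256,
    "size_in_bytes": size_in_bytes,
}
if prefix_placeholder:
    # the key exists only when there is a real value, per the new spec
    file_info["prefix_placeholder"] = prefix_placeholder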

@soapy1 soapy1 removed the 4_Needs_Review [deprecated] use milestones/project boards label Nov 11, 2016
"size_in_bytes": os.path.getsize(path),
"file_type": getattr(file_type(path), "name"),
"prefix_placeholder": prefix_placeholder,
"file_mode": file_mode,
Contributor Author

same as prefix_placeholder

For win32 py27:
os.link is not available here, so skip tests that require that for setup
New spec says:
- prefix_placeholder, no_link, and file_mode should only exist if they
  have values
- inode_first_path should instead be inode_paths, a list of the paths
  that share an inode with the file
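
A rough sketch of how inode_paths could be gathered (hypothetical helper, not necessarily this PR's implementation):

import os

def inode_paths(prefix, files, fi):
    # all paths in the package that share an inode with fi (i.e. hard links)
    st = os.lstat(os.path.join(prefix, fi))
    if st.st_nlink <= 1:
        return []
    return [f for f in files
            if os.lstat(os.path.join(prefix, f)).st_ino == st.st_ino]
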
@soapy1
Contributor Author

soapy1 commented Nov 14, 2016

Looks like the appveyor failures are the git_submodule test failing.
TravisCI python2.7 tests are failing the same way that master is failing.

@soapy1 soapy1 added the 4_Needs_Review [deprecated] use milestones/project boards label Nov 14, 2016
@msarahan
Contributor

Looks good. Thanks for all the work on this.

@msarahan msarahan merged commit ba8c7f3 into conda:master Nov 14, 2016
@msarahan msarahan added this to the 2.1.0 milestone Nov 14, 2016
@soapy1 soapy1 deleted the files-json branch November 14, 2016 18:12
class FileType(Enum):
    softlink = "softlink"
    hardlink = "hardlink"
    directory = "directory"
Contributor

We're going to have to do some magic on enum construction once we move this into conda. See https://github.com/conda/conda/blob/master/conda/base/constants.py#L63

Actually, according to that, let's also call it LinkType. Might just be better to copy that class here. It still needs more work though.
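
Illustrative only: one shape that construction magic could take, coercing ints and names to members without depending on conda internals (member values are guesses -- check the linked constants.py):

from enum import Enum

class LinkType(Enum):
    hardlink = 1
    softlink = 2
    copy = 3
    directory = 4

    @classmethod
    def coerce(cls, value):
        # accept an existing member, an int value, or a (case-insensitive) name
        if isinstance(value, cls):
            return value
        if isinstance(value, int):
            return cls(value)
        return cls[str(value).lower()]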

Contributor Author

Would it make sense for conda-build to import LinkType through conda-interface in order to decrease the amount of duplication?
Though if we decide to go that route, LinkType will need to be included in the 4.2.x branch of conda


ignore_files = m.ignore_prefix_files()
ignore_types = set()
if not hasattr(ignore_files, "__iter__"):
    if ignore_files is True:
        ignore_types.update(('text', 'binary'))
Contributor

FileMode is here: https://github.com/conda/conda/blob/master/conda/base/constants.py#L55

That should probably be used also (start by copying it in to conda-build).
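
For reference, the linked enum is small enough to copy in wholesale; roughly (verify against the linked constants.py revision):

from enum import Enum

class FileMode(Enum):
    text = 'text'
    binary = 'binary'

    def __str__(self):
        return "%s" % self.value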

ignore_types.update(('text', 'binary'))
ignore_files = []
if not m.get_value('build/detect_binary_files_with_prefix', True):
    ignore_types.update(('binary',))
ignore_files.extend(
    [f[2] for f in files_with_prefix if f[1] in ignore_types and f[2] not in ignore_files])
Contributor

I don't think you need to make this a list. The generator expression alone should work...

ignore_files.extend(f[2] for f in files_with_prefix
                    if f[1] in ignore_types and f[2] not in ignore_files)

It's just less efficient to make a full list object, only to immediately throw it away.

Contributor

Also, can you add a short comment here about what is actually at each index in files_with_prefix.

@@ -438,6 +454,99 @@ def create_info_files(m, files, config, prefix):
config.timeout)


def get_short_path(m, target_file):
Contributor

I'm still fine with calling this 'short_path' in the code. 👍

return target_file


def sha256_checksum(filename, buffersize=65536):
Contributor

Cool implementation. Is 65536 the recommended default buffer size?

Contributor Author

From what I read, yes!
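
For reference, the chunked-read pattern under discussion looks roughly like this; 64 KiB is a common compromise between per-read overhead and memory use:

import hashlib

def sha256_checksum(filename, buffersize=65536):
    sha256 = hashlib.sha256()
    with open(filename, 'rb') as f:
        # hash fixed-size blocks so large files never load into memory at once
        for block in iter(lambda: f.read(buffersize), b''):
            sha256.update(block)
    return sha256.hexdigest()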


def create_info_files_json(m, info_dir, prefix, files, files_with_prefix):
    files_json_fields = ["short_path", "sha256", "size_in_bytes", "file_type", "file_mode",
                         "prefix_placeholder", "no_link", "inode_first_path"]
Contributor

I guess here we're just calling it path, not short_path now. For serializing the enums, let's chat tomorrow about what the serialized form should look like; I have split feelings. inode_first_path should now be inode_paths.

return 1


def create_info_files_json(m, info_dir, prefix, files, files_with_prefix):
Contributor

I would name this function create_info_files_json_v1 and then hard-code the 1 int.
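
A minimal sketch of that suggestion (helpers from this PR assumed; the wrapper key names are illustrative):

import json
import os

def create_info_files_json_v1(m, info_dir, prefix, files, files_with_prefix):
    files_json_data = build_info_files_json(m, prefix, files, files_with_prefix)
    with open(os.path.join(info_dir, 'files.json'), "w") as fh:
        json.dump({"files": files_json_data, "files_json_version": 1},
                  fh, indent=2, sort_keys=True)
    return 1  # the hard-coded version this v1 writer emits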

def build_info_files_json(m, prefix, files, files_with_prefix):
    no_link = m.get_value('build/no_link')
    files_json = []
    for fi in files:
Contributor

make sure files is sorted here. Might be better just to do it again.

for fi in sorted(files):

prefix_placeholder, file_mode = has_prefix(fi, files_with_prefix)
path = os.path.join(prefix, fi)
file_info = {
    "short_path": get_short_path(m, fi),
Contributor

^^^^ this is exactly why I want to call it short_path!!!!! Because two lines up there's a path that's completely different!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

"short_path": get_short_path(m, fi),
"sha256": sha256_checksum(path),
"size_in_bytes": os.path.getsize(path),
"file_type": getattr(file_type(path), "name"),
Contributor

Oh, here. We can talk about how to serialize tomorrow.

@kalefranz kalefranz removed the 4_Needs_Review [deprecated] use milestones/project boards label Nov 14, 2016
@kalefranz
Contributor

Having this be a .json file still makes me uncomfortable.

Maybe here's why:

Allowing this document to be json loosens up the extensibility requirements too much. You need a lot of extra information, like version and fields, and then it still makes me nervous to trust that it's correct. It's easy to compare this to a document database record. You end up being forced into extra error checking and application-layer logic when you're worried about safety and correctness (not worrying about this in the past has led to unpredictable bugs in all sorts of places with conda). My gut/intuition is literally yelling at me that it's safer to think of this data (and file format) more like a relational database table, with a strictly defined schema. Yes, we give up the ability to use more advanced data structures like maps and lists--although those can still be serialized into a column if you really want to. At least it doesn't encourage it, though.

@kalefranz
Contributor

Oh, and I didn't realize this was merged already @msarahan :-/

@msarahan
Contributor

This is a pretty insignificant metadata file. I think you're heavily overthinking it. Your additions have been good, but IMHO this is really not worth any more of anyone's time.

We should discuss this in Thursday's meeting to put it to bed.

@kalefranz
Contributor

Mike, this is insignificant to you for conda-build. It's extremely significant to me for conda. Given that, I've been a little confused why you've been so opinionated about it.


@kalefranz
Contributor

Actually, this bothers me @msarahan. If it's so insignificant to you, then why has there been so much argument about it? It is significant to me. I wouldn't be pushing on it if it wasn't.

@msarahan
Contributor

It bothers me that you have left this with zero input for 2 weeks - and longer than that since our discussion, which was arguably the shortest, most conclusive, least contentious meeting our group has ever had. If you really want something, you need to do a better job pushing for it consistently, and from the start. You at least need to communicate your concerns and interests, if you lack time to actually be involved.

I have pushed primarily against what I saw as a bizarre, esoteric file format (PSV). I saw that as a wart on conda-build. I am unconvinced by your technical arguments for choosing this format, especially after your earlier admission that you preferred the PSV file format primarily for readability. There are many ways to validate data (you've written quite a few), and I don't see how this is an exceptional challenge.

@kalefranz
Contributor

It bothers me that you have left this with zero input for 2 weeks

That's disingenuous. I've been active in discussions with Sophia. I just found out this was ready for review when she pinged me on Flowdock today. She actually pinged me on Friday too; I only looked back to check because I miss things on Flowdock. But I wouldn't consider the Friday-to-Monday gap negligent.

You at least need to communicate your concerns and interests, if you lack time to actually be involved.

As you're the tech lead of conda-build, and Sophia has been the primary implementer, I've made sure she's working closely with you. I hope you'd consider that at a minimum a courtesy, if not generally a requirement. You again assume that just because you and I haven't communicated more directly, I haven't been involved in the discussion. That really couldn't be further from the truth.

I am unconvinced by your technical arguments for choosing this format, especially after your earlier admission that you preferred the PSV file format primarily for readability.

yet

This is a pretty insignificant metadata file. I think you're heavily overthinking it.

How can I, on one hand, be over-thinking it by trying to understand and explain why my subconscious is rebelling so strongly, but on the other be held to just my initial reasoning, which you found insufficient?

what I saw as a bizarre, esoteric file format (PSV). I saw that as a wart on conda-build.

Which is it? Are you unconvinced by the technical arguments, or is the argument against because you think the file format is bizarre and esoteric? You're moving the goal line from one end of the field to the other.

I guess let's address both again.

First the PSV wart. If it's the .psv extension itself, we could always leave the extension off. If it's the use of a pipe as a separator, I think google shows it's not as esoteric as you perceive. Even still, my comments in #1486 show that a tabular format has been important to me from the beginning. CSV would be fine too, but it seems dangerous.

And a technical reason: A tabular format (akin to a relational database table) enforces correctness much more than a throw-whatever-you-want-into-it document format. The tabular format has a more directly enforceable and verifiable schema, and correctness also comes more naturally in its construction, so mistakes are less likely. Yes, when developing on conda, I can enforce and verify. I'd hope in conda-build you'd also want to do everything you can to ensure correctness, so that if/when I do verify and enforce, we can be as sure as possible that we're not building uninstallable packages. It seems to me correctness is even more important for conda-build, since you don't want to have to ask users to rebuild packages to correct bugs.

If you really want something, you need to do a better job pushing for it consistently, and from the start.

That seems aggressive and dangerous given last Thursday's post-weekly discussion.

It's still a good point. For basically a year now I've been fighting a code base that has prioritized implementation simplicity over correctness, defensiveness, and robustness. Simplicity is optimal for architectural-level design decisions. It's naive and dangerous at the implementation level. It leads to endless edge cases. I have over 1000 open issues--most bugs--to prove it.

I'm sick of the fire hose of bug reports. I'm passionate about these "small" decisions I've been bringing to you because these are tools that are meaningful and helpful at an implementation level.

You at least need to communicate your concerns and interests

I agree that we need to communicate more.

@msarahan
Contributor

Thanks for the links. The format is more common than I thought. I'm still not convinced.

The technical argument that I find suspect is that leaving the door open to customization of the format is by definition evil and makes verification harder. You can write schemas for json that are equally restrictive to the constraints of a tabular format. Proliferation of file formats within conda-build (and conda for that matter) is something that I am very hostile to. I'm not arguing about most of your code improvements here - they pretty much universally improve code quality, and I appreciate them.
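
To make that concrete, a sketch using the third-party jsonschema library (schema abbreviated; field names from this PR):

import jsonschema

schema = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "short_path": {"type": "string"},
            "sha256": {"type": "string", "pattern": "^[0-9a-f]{64}$"},
            "size_in_bytes": {"type": "integer", "minimum": 0},
        },
        "required": ["short_path", "sha256", "size_in_bytes"],
        "additionalProperties": False,
    },
}

records = [{"short_path": "bin/foo", "sha256": "0" * 64, "size_in_bytes": 512}]
jsonschema.validate(records, schema)  # raises ValidationError on any deviation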

Just to pile on to the link party, here's several links about why one might choose JSON over *SV:

https://www.reddit.com/r/golang/comments/46leew/csv_vs_json_which_is_better_to_read_data_from/
https://news.ycombinator.com/item?id=7796333

This one captures a bit about what I see in JSON: http://inquidia.com/news-and-info/hadoop-file-formats-its-not-just-csv-anymore - see the bit about JSON Records.

I do not want a willy-nilly-no-schema-free-for-all. I want greater freedom to adapt to needs as we encounter them. Ultimately, I care much more about not picking up a new file format than anything else. As you say, conda and conda-build are a lot of edge cases. One more file format is another edge case in my eyes.

@kalefranz
Contributor

The technical argument that I find suspect is that leaving the door open to customization of the format is by definition evil and makes verification harder.

I didn't say that at all. I think the format should be extensible. A tabular format allows that.

You can write schemas for json that are equally restrictive to the constraints of a tabular format.

Yes, but nobody has done that here. A change to a tabular format would be less than five lines of additional code. And the schema is built in when a header is included. The JSON Record discussion you pointed to is NOT what we're doing here. JSON Records as described in your link have a full JSON Schema (http://json-schema.org) at the beginning of the file, and then each record is a full, valid json object, separated by a line break. Are you proposing we implement that here? Note that, with a JSON Record file, you can't just json.loads(file_handle.read()). The logic is dramatically more complicated. A valid JSON Record file IS NOT a valid .json file.
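
To make the parsing difference concrete (file names hypothetical):

import json

# Plain JSON: the whole file is one document, parsed in a single call
with open("files.json") as fh:
    data = json.load(fh)

# Line-delimited JSON Records: each line is its own document; the file as a
# whole is NOT valid JSON, so it must be parsed record by record
with open("files.jsonrecords") as fh:
    records = [json.loads(line) for line in fh if line.strip()]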

Proliferation of file formats within conda-build (and conda for that matter) is something that I am very hostile to.

I don't blame you. We're deprecating three files here, and I'd be fine eventually removing a fourth (info/files itself).

It strikes me here that we added noarch.json sort-of haphazardly. We had to add it because we didn't have any other place to put "other miscellaneous metadata" that doesn't belong in index.json but also doesn't belong on a per-file-in-package basis. Maybe THAT is something we should further generalize. I'm definitely open to changing noarch.json into something more general before it gets too far out in the wild. And in that case, the extreme extensibility of json makes sense. Not so much for the per-file-in-package information.

I'm not arguing about most of your code improvements here - they pretty much universally improve code quality, and I appreciate them.

Thanks for the acknowledgement :)

Just to pile on to the link party, here's several links about why one might choose JSON over *SV:
https://www.reddit.com/r/golang/comments/46leew/csv_vs_json_which_is_better_to_read_data_from/

I didn't find anything I connected with there. Is there something in particular that you do?

https://news.ycombinator.com/item?id=7796333

Other than the headline asking the question, this link seems to advocate for tabular format over json/xml. E.g.

  • "Unfortunately, the csv format is very easy for programs to write but it's very difficult for programs to properly read because of the tricky parsing." (This is precisely why I proposed a pipe delimiter rather than a comma.)
  • "If your data are rectangular and you care about performance, CSV is better than JSON just because it avoids repetitive key names everywhere."
  • "the moment you declare you handle json, people will send non-string data ("that is a number, of course it isn't quoted"), attempt to include nested data, leave out the opening and closing [](because people will grep a file with one array per line to filter a json file; that is no way robust, but people will do it, anyways)"
  • "The idea that JSON is the substitute made me chuckle. JSON is more verbose to boot. CSV is a poor format but JSON is not panacea, actually personally I'd never use it for anything that's not web (browser) related."
  • "CSV files are MUCH easier to search and inspect using tools like grep and less. It's the accounting people that want's Excel, but as a developer CSV is easier and more flexible."
  • "CSV is much simpler when the records are all of one type. It gets debatable which (XML vs CSV) is simpler when you get multiple record types dumping a hierarchical data structure." (multiple record types would be horrible in our particular situation)
  • "CSV is far, far more ubiquitous and much more usable in non-web settings. (e.g. desktop data analysis programs)"

http://inquidia.com/news-and-info/hadoop-file-formats-its-not-just-csv-anymore

I really don't think you're advocating implementation of the JSON Record format here. It WOULD be a different file type; using the .json extension would be disingenuous. It would also require either an additional library dependency in both conda and conda-build or a lot of logic we haven't yet built.

Of course, some type of binary format--either standardized or custom--would also just be adding another file type.

From that article, you may have connected with this:

Data with flexible structure can have fields added, changed or removed over time and even vary amongst concurrently ingested records. Many of the file format choices focus on managing structured and flexibly structured data. Our commentary will revolve on this assumption.

We DO NOT WANT flexibly structured data here. We want our data to have a structured, identifiable schema. That schema can evolve over time, but doing so is a deliberate effort. One that you employ schema migration strategies with. Not an "oh just stick it there" strategy. That's not this file. (Although we do need a file like that, as I said above.) This structure is native to tabular data with header. Doing it correctly with json takes a lot more logic and code.

Ultimately, I care much more about not picking up a new file format than anything else. As you say, conda and conda-build are a lot of edge cases. One more file format is another edge case in my eyes.

has_prefix is a csv file right now--just without the extension. Are you ok with csv?

In a lot of ways, without the proper controls around json (implementing a JSON Record format would definitely qualify, but that's not JSON), one json file format that can "just stick it there" and morph over time introduces many more opportunities for edge cases than a format with more inherent structure.

For this particular data, we need to balance extensibility with structure. A tabular format strikes the right balance.
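
For illustration, the tabular alternative can be written with nothing but the stdlib (rows hypothetical; the header row doubles as the schema):

import csv

fields = ["short_path", "sha256", "size_in_bytes", "file_type"]
rows = [{"short_path": "bin/foo", "sha256": "0" * 64,
         "size_in_bytes": 512, "file_type": "hardlink"}]

with open("files", "w") as fh:
    writer = csv.writer(fh, delimiter="|", lineterminator="\n")
    writer.writerow(fields)
    for row in rows:
        writer.writerow([row.get(f, "") for f in fields])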
