
Files json #1501

Merged
merged 5 commits into from
Nov 14, 2016

Conversation

soapy1
Contributor

@soapy1 soapy1 commented Nov 1, 2016

ref: #1486

@soapy1 soapy1 force-pushed the files-json branch 2 times, most recently from 137eab7 to 1cb086d November 8, 2016 15:37
@soapy1 soapy1 changed the title from [WIP] Files json to Files json Nov 8, 2016
@soapy1
Contributor Author

soapy1 commented Nov 8, 2016

Looks like the same appveyor tests are failing on master 😞

@soapy1 soapy1 added the 4_Needs_Review [deprecated] use milestones/project boards label Nov 8, 2016
@msarahan
Contributor

msarahan commented Nov 9, 2016

You have some failures on appveyor besides the persistent ones. The only test that is known to be failing on master right now is test_relative_git_url_submodule_clone but you have a few others. Please take a look.

@msarahan
Contributor

msarahan commented Nov 9, 2016

no dice. Looks like something is getting truncated somehow, or is not quoted and is dropping at spaces. https://ci.appveyor.com/project/ContinuumAnalyticsFOSS/conda-build/build/1.0.957/job/txfsj9rp1puk3fb5#L1106

@soapy1
Contributor Author

soapy1 commented Nov 9, 2016

hmmm 👀 but I can't reproduce this on my windows vm 🤕

@msarahan
Contributor

msarahan commented Nov 9, 2016

OK, no problem. Let's look at it together in the office tomorrow.

@soapy1
Contributor Author

soapy1 commented Nov 11, 2016

TravisCI is failing tests tests/test_api_build.py::test_build_with_activate_does_activate, tests/test_api_build.py::test_dirty_variable_available_in_build_scripts, and tests/test_api_build.py::test_build_with_no_activate_does_not_activate for python2.7 on master as well 😞 (these look flaky and are related to filelock).

AppVeyor is failing tests tests/test_api_build.py::test_relative_git_url_submodule_clone on python3x_64 and tests/test_api_build.py::test_build_msvc_compiler[14.0] on python35_64. This is consistent with other runs except for the test_build_msvc_compiler test (though that looks like it may be a flaky error).

"sha256": sha256_checksum(path),
"size_in_bytes": os.path.getsize(path),
"file_type": getattr(file_type(path), "name"),
"prefix_placeholder": prefix_placeholder,
Contributor Author

prefix_placeholder should only exist in the dict if the has_prefix entry exists
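
A minimal sketch of that conditional-key behavior (values illustrative, not this PR's exact code):

short_path, sha256, size_in_bytes = "bin/foo", "0" * 64, 512
prefix_placeholder = None  # only set when has_prefix matched this file

file_info = {
    "short_path": short_path,
    "sha256": sha256,
    "size_in_bytes": size_in_bytes,
}
if prefix_placeholder:
    # the key exists only when there is a real value, per the new spec
    file_info["prefix_placeholder"] = prefix_placeholder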

@soapy1 soapy1 removed the 4_Needs_Review [deprecated] use milestones/project boards label Nov 11, 2016
"size_in_bytes": os.path.getsize(path),
"file_type": getattr(file_type(path), "name"),
"prefix_placeholder": prefix_placeholder,
"file_mode": file_mode,
Contributor Author

same as prefix_placeholder

For win32 py27:
os.link is not available here, so skip tests that require that for setup
New spec says:
- prefix_placeholder, no_link, and file_mode should only exist if they
  have values
- inode_first_path should instead be inode_paths, a list of the paths
  that share an inode with the file
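
A rough sketch of how inode_paths could be gathered (hypothetical helper, not necessarily this PR's implementation):

import os

def inode_paths(prefix, files, fi):
    # all paths in the package that share an inode with fi (i.e. hard links)
    st = os.lstat(os.path.join(prefix, fi))
    if st.st_nlink <= 1:
        return []
    return [f for f in files
            if os.lstat(os.path.join(prefix, f)).st_ino == st.st_ino]
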
@soapy1
Contributor Author

soapy1 commented Nov 14, 2016

Looks like the appveyor failures are the git_submodule test failing.
TravisCI python2.7 tests are failing the same way that master is failing.

@soapy1 soapy1 added the 4_Needs_Review [deprecated] use milestones/project boards label Nov 14, 2016
@msarahan
Contributor

Looks good. Thanks for all the work on this.

@msarahan msarahan merged commit ba8c7f3 into conda:master Nov 14, 2016
@msarahan msarahan added this to the 2.1.0 milestone Nov 14, 2016
@soapy1 soapy1 deleted the files-json branch November 14, 2016 18:12
class FileType(Enum):
    softlink = "softlink"
    hardlink = "hardlink"
    directory = "directory"
Contributor

We're going to have to do some magic on enum construction once we move this into conda. See https://github.com/conda/conda/blob/master/conda/base/constants.py#L63

Actually, according to that, let's also call it LinkType. Might just be better to copy that class here. It still needs more work though.
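
Illustrative only: one shape that construction magic could take, coercing ints and names to members without depending on conda internals (member values are guesses -- check the linked constants.py):

from enum import Enum

class LinkType(Enum):
    hardlink = 1
    softlink = 2
    copy = 3
    directory = 4

    @classmethod
    def coerce(cls, value):
        # accept an existing member, an int value, or a (case-insensitive) name
        if isinstance(value, cls):
            return value
        if isinstance(value, int):
            return cls(value)
        return cls[str(value).lower()]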

Contributor Author

Would it make sense for conda-build to import LinkType through conda-interface in order to decrease the amount of duplication?
Though if we decide to go that route, LinkType will need to be included in the 4.2.x branch of conda


ignore_files = m.ignore_prefix_files()
ignore_types = set()
if not hasattr(ignore_files, "__iter__"):
    if ignore_files is True:
        ignore_types.update(('text', 'binary'))
Contributor

FileMode is here: https://github.com/conda/conda/blob/master/conda/base/constants.py#L55

That should probably be used also (start by copying it in to conda-build).
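
For reference, the linked enum is small enough to copy in wholesale; roughly (verify against the linked constants.py revision):

from enum import Enum

class FileMode(Enum):
    text = 'text'
    binary = 'binary'

    def __str__(self):
        return "%s" % self.value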

ignore_types.update(('text', 'binary'))
ignore_files = []
if not m.get_value('build/detect_binary_files_with_prefix', True):
    ignore_types.update(('binary',))
ignore_files.extend(
    [f[2] for f in files_with_prefix if f[1] in ignore_types and f[2] not in ignore_files])
Contributor

I don't think you need to make this a list. The generator expression alone should work...

ignore_files.extend(f[2] for f in files_with_prefix
                    if f[1] in ignore_types and f[2] not in ignore_files)

It's just less efficient to make a full list object, only to immediately throw it away.

Contributor

Also, can you add a short comment here about what is actually at each index in files_with_prefix.

@@ -438,6 +454,99 @@ def create_info_files(m, files, config, prefix):
config.timeout)


def get_short_path(m, target_file):
Contributor

I'm still fine with calling this 'short_path' in the code. 👍

return target_file


def sha256_checksum(filename, buffersize=65536):
Contributor

Cool implementation. Is 65536 the recommended default buffer size?

Contributor Author

From what I read, yes!
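
For reference, the chunked-read pattern under discussion looks roughly like this; 64 KiB is a common compromise between per-read overhead and memory use:

import hashlib

def sha256_checksum(filename, buffersize=65536):
    sha256 = hashlib.sha256()
    with open(filename, 'rb') as f:
        # hash fixed-size blocks so large files never load into memory at once
        for block in iter(lambda: f.read(buffersize), b''):
            sha256.update(block)
    return sha256.hexdigest()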


def create_info_files_json(m, info_dir, prefix, files, files_with_prefix):
    files_json_fields = ["short_path", "sha256", "size_in_bytes", "file_type", "file_mode",
                         "prefix_placeholder", "no_link", "inode_first_path"]
Contributor

I guess here we're just calling it path, not short_path now. For serializing the enums, let's chat tomorrow about what the serialized form should look like; I have split feelings. inode_first_path should now be inode_paths.

return 1


def create_info_files_json(m, info_dir, prefix, files, files_with_prefix):
Contributor

I would name this function create_info_files_json_v1 and then hard-code the 1 int.
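
A minimal sketch of that suggestion (helpers from this PR assumed; the wrapper key names are illustrative):

import json
import os

def create_info_files_json_v1(m, info_dir, prefix, files, files_with_prefix):
    files_json_data = build_info_files_json(m, prefix, files, files_with_prefix)
    with open(os.path.join(info_dir, 'files.json'), "w") as fh:
        json.dump({"files": files_json_data, "files_json_version": 1},
                  fh, indent=2, sort_keys=True)
    return 1  # the hard-coded version this v1 writer emits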

def build_info_files_json(m, prefix, files, files_with_prefix):
    no_link = m.get_value('build/no_link')
    files_json = []
    for fi in files:
Contributor

make sure files is sorted here. Might be better just to do it again.

for fi in sorted(files):

prefix_placeholder, file_mode = has_prefix(fi, files_with_prefix)
path = os.path.join(prefix, fi)
file_info = {
    "short_path": get_short_path(m, fi),
Contributor

^^^^ this is exactly why I want to call it short_path!!!!! Because two lines up there's a path that's completely different!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

"short_path": get_short_path(m, fi),
"sha256": sha256_checksum(path),
"size_in_bytes": os.path.getsize(path),
"file_type": getattr(file_type(path), "name"),
Contributor

Oh, here. We can talk about how to serialize tomorrow.

@kalefranz kalefranz removed the 4_Needs_Review [deprecated] use milestones/project boards label Nov 14, 2016
@kalefranz
Contributor

Having this be a .json file still makes me uncomfortable.

Maybe here's why:

Allowing this document to be json loosens up the extensibility requirements too much. You need a lot of extra information, like version and fields, and then it still makes me nervous to trust that it's correct. It's easy to compare this to a document database record. You end up being forced into extra error checking and application-layer logic when you're worried about safety and correctness (not worrying about this in the past has led to unpredictable bugs in all sorts of places with conda). My gut/intuition is literally yelling at me that it's safer to think of this data (and file format) more like a relational database table, with a strictly defined schema. Yes, we give up the ability to use more advanced data structures like maps and lists--although those can still be serialized into a column if you really want to. At least it doesn't encourage it, though.

@kalefranz
Contributor

Oh, and I didn't realize this was merged already @msarahan :-/

@msarahan
Contributor

This is a pretty insignificant metadata file. I think you're heavily overthinking it. Your additions have been good, but IMHO this is really not worth any more of anyone's time.

We should discuss this in Thursday's meeting to put it to bed.

@kalefranz
Contributor

Mike, this is insignificant to you for conda-build. It's extremely significant to me for conda. Given that, I've been a little confused why you've been so opinionated about it.


@kalefranz
Contributor

Actually, this bothers me @msarahan. If it's so insignificant to you, then why has there been so much argument about it? It is significant to me. I wouldn't be pushing on it if it wasn't.

@msarahan
Contributor

It bothers me that you have left this with zero input for 2 weeks - and longer than that since our discussion, which was arguably the shortest, most conclusive, least contentious meeting our group has ever had. If you really want something, you need to do a better job pushing for it consistently, and from the start. You at least need to communicate your concerns and interests, if you lack time to actually be involved.

I have pushed primarily against what I saw as a bizarre, esoteric file format (PSV). I saw that as a wart on conda-build. I am unconvinced by your technical arguments for choosing this format, especially after your earlier admission that you preferred the PSV file format primarily for readability. There are many ways to validate data (you've written quite a few), and I don't see how this is an exceptional challenge.

@kalefranz
Contributor

It bothers me that you have left this with zero input for 2 weeks

That's disingenuous. I've been active in discussions with Sophia. I just found out this was ready for review when she pinged me on Flowdock today. She actually pinged me on Friday too; I only looked back to check because I miss things on Flowdock. But I wouldn't consider the Friday-to-Monday gap negligent.

You at least need to communicate your concerns and interests, if you lack time to actually be involved.

As you're the tech lead of conda-build, and Sophia has been the primary implementer, I've made sure she's working closely with you. I hope you'd consider that at a minimum a courtesy, if not generally a requirement. You again assume that just because you and I haven't communicated more directly, I haven't been involved in the discussion. That really couldn't be further from the truth.

I am unconvinced by your technical arguments for choosing this format, especially after your earlier admission that you preferred the PSV file format primarily for readability.

yet

This is a pretty insignificant metadata file. I think you're heavily overthinking it.

How can I, on one hand, be over-thinking it by trying to understand and explain why my subconscious is rebelling so strongly, but on the other be held to just my initial reasoning, which you found insufficient?

what I saw as a bizarre, esoteric file format (PSV). I saw that as a wart on conda-build.

Which is it? Are you unconvinced by the technical arguments, or is the argument against because you think the file format is bizarre and esoteric? You're moving the goal line from one end of the field to the other.

I guess let's address both again.

First the PSV wart. If it's the .psv extension itself, we could always leave the extension off. If it's the use of a pipe as a separator, I think google shows it's not as esoteric as you perceive. Even still, my comments in #1486 show that a tabular format has been important to me from the beginning. CSV would be fine too, but it seems dangerous.

And a technical reason: A tabular format (akin to a relational database table) enforces correctness much more than a throw-whatever-you-want-into-it document format. The tabular format has a more directly enforceable and verifiable schema, and correctness also comes more naturally in its construction, so mistakes are less likely. Yes, when developing on conda, I can enforce and verify. I'd hope in conda-build you'd also want to do everything you can to ensure correctness, so that if/when I do verify and enforce, we can be as sure as possible that we're not building uninstallable packages. It seems to me correctness is even more important for conda-build, since you don't want to have to ask users to rebuild packages to correct bugs.

If you really want something, you need to do a better job pushing for it consistently, and from the start.

That seems aggressive and dangerous given last Thursday's post-weekly discussion.

It's still a good point. For basically a year now I've been fighting a code base that has prioritized implementation simplicity over correctness, defensiveness, and robustness. Simplicity is optimal for architectural-level design decisions. It's naive and dangerous at the implementation level. It leads to endless edge cases. I have over 1000 open issues--most bugs--to prove it.

I'm sick of the fire hose of bug reports. I'm passionate about these "small" decisions I've been bringing to you because these are tools that are meaningful and helpful at an implementation level.

You at least need to communicate your concerns and interests

I agree that we need to communicate more.

@msarahan
Contributor

Thanks for the links. The format is more common than I thought. I'm still not convinced.

The technical argument that I find suspect is that leaving the door open to customization of the format is by definition evil and makes verification harder. You can write schemas for json that are equally restrictive to the constraints of a tabular format. Proliferation of file formats within conda-build (and conda for that matter) is something that I am very hostile to. I'm not arguing about most of your code improvements here - they pretty much universally improve code quality, and I appreciate them.
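
To make that concrete, a sketch using the third-party jsonschema library (schema abbreviated; field names from this PR):

import jsonschema

schema = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "short_path": {"type": "string"},
            "sha256": {"type": "string", "pattern": "^[0-9a-f]{64}$"},
            "size_in_bytes": {"type": "integer", "minimum": 0},
        },
        "required": ["short_path", "sha256", "size_in_bytes"],
        "additionalProperties": False,
    },
}

records = [{"short_path": "bin/foo", "sha256": "0" * 64, "size_in_bytes": 512}]
jsonschema.validate(records, schema)  # raises ValidationError on any deviation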

Just to pile on to the link party, here's several links about why one might choose JSON over *SV:

https://www.reddit.com/r/golang/comments/46leew/csv_vs_json_which_is_better_to_read_data_from/
https://news.ycombinator.com/item?id=7796333

This one captures a bit about what I see in JSON: http://inquidia.com/news-and-info/hadoop-file-formats-its-not-just-csv-anymore - see the bit about JSON Records.

I do not want a willy-nilly-no-schema-free-for-all. I want greater freedom to adapt to needs as we encounter them. Ultimately, I care much more about not picking up a new file format than anything else. As you say, conda and conda-build are a lot of edge cases. One more file format is another edge case in my eyes.

@kalefranz
Contributor

The technical argument that I find suspect is that leaving the door open to customization of the format is by definition evil and makes verification harder.

I didn't say that at all. I think the format should be extensible. A tabular format allows that.

You can write schemas for json that are equally restrictive to the constraints of a tabular format.

Yes, but nobody has done that here. A change to a tabular format would be less than five lines of additional code. And the schema is built in when a header is included. The JSON Record discussion you pointed to is NOT what we're doing here. JSON Records as described in your link have a full JSON Schema (http://json-schema.org) at the beginning of the file, and then each record is a full, valid json object, separated by a line break. Are you proposing we implement that here? Note that, with a JSON Record file, you can't just json.loads(file_handle.read()). The logic is dramatically more complicated. A valid JSON Record file IS NOT a valid .json file.
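
To make the parsing difference concrete (file names hypothetical):

import json

# Plain JSON: the whole file is one document, parsed in a single call
with open("files.json") as fh:
    data = json.load(fh)

# Line-delimited JSON Records: each line is its own document; the file as a
# whole is NOT valid JSON, so it must be parsed record by record
with open("files.jsonrecords") as fh:
    records = [json.loads(line) for line in fh if line.strip()]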

Proliferation of file formats within conda-build (and conda for that matter) is something that I am very hostile to.

I don't blame you. We're deprecating three files here, and I'd be fine eventually removing a fourth (info/files itself).

It strikes me here that we added noarch.json sort-of haphazardly. We had to add it because we didn't have any other place to put "other miscellaneous metadata" that doesn't belong in index.json but also doesn't belong on a per-file-in-package basis. Maybe THAT is something we should further generalize. I'm definitely open to changing noarch.json into something more general before it gets too far out in the wild. And in that case, the extreme extensibility of json makes sense. Not so much for the per-file-in-package information.

I'm not arguing about most of your code improvements here - they pretty much universally improve code quality, and I appreciate them.

Thanks for the acknowledgement :)

Just to pile on to the link party, here's several links about why one might choose JSON over *SV:
https://www.reddit.com/r/golang/comments/46leew/csv_vs_json_which_is_better_to_read_data_from/

I didn't find anything I connected with there. Is there something in particular that you do?

https://news.ycombinator.com/item?id=7796333

Other than the headline asking the question, this link seems to advocate for tabular format over json/xml. E.g.

  • "Unfortunately, the csv format is very easy for programs to write but it's very difficult for programs to properly read because of the tricky parsing." (This is precisely why I proposed a pipe delimiter rather than a comma.)
  • "If your data are rectangular and you care about performance, CSV is better than JSON just because it avoids repetitive key names everywhere."
  • "the moment you declare you handle json, people will send non-string data ("that is a number, of course it isn't quoted"), attempt to include nested data, leave out the opening and closing [](because people will grep a file with one array per line to filter a json file; that is no way robust, but people will do it, anyways)"
  • "The idea that JSON is the substitute made me chuckle. JSON is more verbose to boot. CSV is a poor format but JSON is not panacea, actually personally I'd never use it for anything that's not web (browser) related."
  • "CSV files are MUCH easier to search and inspect using tools like grep and less. It's the accounting people that want's Excel, but as a developer CSV is easier and more flexible."
  • "CSV is much simpler when the records are all of one type. It gets debatable which (XML vs CSV) is simpler when you get multiple record types dumping a hierarchical data structure." (multiple record types would be horrible in our particular situation)
  • "CSV is far, far more ubiquitous and much more usable in non-web settings. (e.g. desktop data analysis programs)"

http://inquidia.com/news-and-info/hadoop-file-formats-its-not-just-csv-anymore

I really don't think you're advocating implementation of the JSON Record format here. It WOULD be a different file type; using the .json extension would be disingenuous. It would also require either an additional library dependency in both conda and conda-build or a lot of logic we haven't yet built.

Of course, some type of binary format--either standardized or custom--would also just be adding another file type.

From that article, you may have connected with this:

Data with flexible structure can have fields added, changed or removed over time and even vary amongst concurrently ingested records. Many of the file format choices focus on managing structured and flexibly structured data. Our commentary will revolve on this assumption.

We DO NOT WANT flexibly structured data here. We want our data to have a structured, identifiable schema. That schema can evolve over time, but doing so is a deliberate effort. One that you employ schema migration strategies with. Not an "oh just stick it there" strategy. That's not this file. (Although we do need a file like that, as I said above.) This structure is native to tabular data with header. Doing it correctly with json takes a lot more logic and code.

Ultimately, I care much more about not picking up a new file format than anything else. As you say, conda and conda-build are a lot of edge cases. One more file format is another edge case in my eyes.

has_prefix is a csv file right now--just without the extension. Are you ok with csv?

In a lot of ways, without the proper controls around json (implementing a JSON Record format would definitely qualify, but that's not JSON), one json file format that can "just stick it there" and morph over time introduces many more opportunities for edge cases than a format with more inherent structure.

For this particular data, we need to balance extensibility with structure. A tabular format strikes the right balance.
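
For illustration, the tabular alternative can be written with nothing but the stdlib (rows hypothetical; the header row doubles as the schema):

import csv

fields = ["short_path", "sha256", "size_in_bytes", "file_type"]
rows = [{"short_path": "bin/foo", "sha256": "0" * 64,
         "size_in_bytes": 512, "file_type": "hardlink"}]

with open("files", "w") as fh:
    writer = csv.writer(fh, delimiter="|", lineterminator="\n")
    writer.writerow(fields)
    for row in rows:
        writer.writerow([row.get(f, "") for f in fields])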
