
uploader: add --json flag to the list command #3480

Merged: 11 commits merged into tensorflow:master on Apr 7, 2020

Conversation

@caisq (Contributor, Author) commented Apr 5, 2020

  • Motivation for features / changes
    • Fulfill feature request bug b/153232102
  • Technical description of changes
    • Add the --json flag to the list subcommand of tensorboard dev.
    • If the flag is used, the experiments will be printed as a JSON object mapping experiment URLs to experiment data (name, description, runs, tags, etc.)
  • Screenshots of UI changes
    • ![image](https://user-images.githubusercontent.com/16824702/78626883-0f77f480-785e-11ea-88ca-b8d653d302c6.png)
  • Detailed steps to verify changes work correctly (as executed by you)
    • Manually ran tensorboard dev list --json (see screenshot above)
  • Alternate designs / implementations considered
    • Output a single big json array at the end:
      • Pro: may be easier to parse programmatically
      • Con: no streaming

@caisq caisq marked this pull request as ready for review April 5, 2020 17:50
@caisq caisq requested review from davidsoergel and wchargin April 6, 2020 13:08
@bileschi (Collaborator) commented Apr 6, 2020

Do we envision more formats being desired here? (csv?). Did you consider whether this should be a more general --output_format flag? Is there guidance or precedent we can follow?

@caisq (Contributor, Author) commented Apr 6, 2020

@bileschi Good question. I opted for the current approach (i.e., --json instead of --output_format=json) for the following reasons:

  1. I don't think it's a good idea to display the data in CSV format on screen in general. The experiment description can contain newline characters, leading to potentially messy and hard-to-read output (see https://stackoverflow.com/questions/566052/can-you-encode-cr-lf-in-into-csv-files). JSON handles this well by using the \n escape character in its string values (see the small sketch after this list).
  2. Given that --json will be supported, I don't see much extra value in additionally supporting CSV.
  3. Even if we do support CSV in the future, it would make sense to write the results directly to a file, in which case a separate arg such as --csv_file=... would be appropriate.
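
To illustrate the first point, a minimal sketch (not from the PR; the record is hypothetical) of how JSON keeps embedded newlines unambiguous:

import json

# A hypothetical description containing a newline: json.dumps escapes it as
# "\n" inside the string value, so the record stays on one line and remains
# easy to read and to parse, unlike a raw newline inside a CSV cell.
record = {"name": "my-experiment", "description": "first line\nsecond line"}
print(json.dumps(record))
# {"name": "my-experiment", "description": "first line\nsecond line"}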

@wchargin (Contributor) left a review:

Thanks for writing this! Appreciate it.

+1 to @bileschi’s question. In my experience, precedent is a bit mixed.
The JavaScript world often uses --json (flow, webpack, npm show,
jest, …), but JSON is obviously more common there (it’s the “JS” in
“JSON”, after all). Tools like gcloud have --format json, but tools
like grpc_cli have --json_output.

Not sure that I have a strong opinion here.

"""Constructor of _ListIntent.

Args:
json: If and ony if `True`, will print the list as a pretty-formatted
wchargin (Contributor):

sp.: “ony” → “only”

caisq (Contributor, Author):

Done.

Comment on lines 385 to 386
if self.json:
print(json.dumps(experiments_json, indent=2))
wchargin (Contributor):

So the output here looks like this:

[
  {
    "URL": "https://tensorboard.dev/experiment/abcdefghijklmnopqrstuv/",
    "Name": "[No Name]",
    "Description": "[No Description]",
    "Id": "abcdefghijklmnopqrstuv",
    "Created": "2019-10-25 07:22:30",
    "Updated": "2019-10-25 07:22:34",
    "Scalars": "3208",
    "Runs": "8",
    "Tags": "4"
  }
]

There are a few problems with this:

  1. The keys are the same as the human-readable form, and in particular
    are capitalized and (after #3464, “uploader: display binary object bytes in tensorboard dev list output”) may have whitespace.
  2. Numeric values like runs and tags are strings, not ints.
  3. Name and description have dummy values instead of proper ""s.
  4. The output is slurped and not streamed.

On (1): After #3464, this will have an extra "Binary object bytes"
key. But this is pretty unidiomatic. JSON keys do not typically have
whitespace; you can’t use such a key in a jq dot-selector:

$ printf '{"one": 1, "two three": 23}' | jq '.one'
1
$ printf '{"one": 1, "two three": 23}' | jq '.two three'
jq: error: syntax error, unexpected IDENT, expecting $end (Unix shell quoting issues?) at <top-level>, line 1:
.two three     
jq: 1 compile error

On (2): The integers are being converted to strings for human
presentation, but should be left as ints in the JSON output for
programmatic manipulation.

On (3): Similar to (2). The dummy values are just to prevent confusing
human-readable output with a blank space. There’s no ambiguity in using
JSON "".

On (4): I see that you considered a streaming response but didn’t like
that it was “less easy to parse programmatically”. Is it? You can
convert a streaming response to a buffered response by simply piping
through jq -s, but you can’t go the other way around without waiting
for the whole buffer. And as long as we make sure to print one record
per line, people using tools other than JQ should be able to deal with
it handily (for line in file: json.loads(line) works in Python, e.g.).
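
As a concrete sketch of that last point (assuming one JSON record per line; the subprocess invocation is illustrative, not from the PR):

import json
import subprocess

# Read `tensorboard dev list --json` line by line, parsing each record as it
# arrives rather than buffering the whole output.
proc = subprocess.Popen(
    ["tensorboard", "dev", "list", "--json"],
    stdout=subprocess.PIPE,
    universal_newlines=True,
)
for line in proc.stdout:
    line = line.strip()
    if not line:
        continue
    experiment = json.loads(line)
    print(experiment)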

One way to resolve these could be to change data from a 2-column table
to a 4-column table with human readable name, JSON key, raw value, and
formatter used when displaying to humans: e.g.,

            data = [
                ("Name", "name", experiment.name, lambda x: x or "[No Name]"),
                ("Description", "description", experiment.description, lambda x: x or "[No Description]"),
                ("Id", "id", experiment.experiment_id, str),
                ("Created", "created", util.format_time(experiment.create_time), str),
                ("Updated", "updated", util.format_time(experiment.update_time), str),
                ("Scalars", "scalars", experiment.num_scalars, str),
                ("Runs", "runs", experiment.num_runs, str),
                ("Tags", "tags", experiment.num_tags, str),
            ]

and project out the appropriate fields in the JSON and non-JSON cases;
of course feel free to pick a different implementation if you prefer.
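
For concreteness, a rough sketch (not the PR's code) of projecting such a 4-column table into the two output modes; `data` and `url` are assumed to be as above:

import collections
import json

def render(url, data, as_json):
    # `data` rows are (readable name, JSON key, raw value, human formatter).
    if as_json:
        obj = collections.OrderedDict([("url", url)])
        obj.update((json_key, value) for _, json_key, value, _ in data)
        return json.dumps(obj, indent=2)
    lines = [url]
    for readable_name, _, value, fmt in data:
        lines.append("\t%s %s" % (readable_name.ljust(12), fmt(value)))
    return "\n".join(lines)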

@caisq (Contributor, Author) commented Apr 7, 2020:

Thanks for the detailed comments and suggestions. I made the following changes to address them.

  1. The point that whitespace should be avoided in keys is taken. I created a Namedtuple called ExperimentMetadataField to hold the data for each row of the experiment metadata. It has the arg name for the name that's appropriate for a JSON key and the arg readable_name that holds the human-readable key.

  2. I also agree that numbers should be printed as numbers, not strings, in the JSON output. In this revision, this is done by using the aforementioned new ExperimentMetadataField namedtuple, in addition to the two refactored classes JsonFormatter and ReadableFormatter. Both have the format_experiment() method to convert a list of ExperimentMetadataFields into a (potentially multi-line) string. I also created a common abstract ancestor class for the two classes for clarity.

  3. Addressed by the above-described approach.

  4. The refactored and revised code now uses streaming consistently across the readable and json formats.

Also note:

  • The new helper classes ReadableFormatter and JsonFormatter are factored out into a new Python module, tensorboard/uploader/formatters.py (a rough sketch follows below).
  • A unit test for them is added in the new module tensorboard/uploader/formatters_test.py.
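
A rough sketch of the intermediate design described above (the field names match the diff quoted later in this thread; the example row is hypothetical):

import collections

ExperimentMetadataField = collections.namedtuple(
    "ExperimentMetadataField",
    ("json_key", "readable_name", "value", "formatter"),
)

# Example row: the raw int value is kept for JSON output, while the
# human-readable output goes through the per-field formatter.
runs_field = ExperimentMetadataField(
    json_key="runs", readable_name="Runs", value=8, formatter=str
)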

print("\t%s %s" % (name.ljust(12), value))
if self.json:
experiments_json.append(
collections.OrderedDict([("URL", url)] + data)
wchargin (Contributor):

Could we just pass sort_keys=True instead of using an OrderedDict?
The determinism is nice, but I don’t know that we really need to enforce
the same order as with the human-readable output; sorting keys is the
standard “stable stringify”.
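
A tiny sketch of the sort_keys alternative (values are illustrative):

import json

record = {"URL": "https://tensorboard.dev/experiment/abc/", "Runs": 8, "Tags": 4}
# sort_keys=True yields a deterministic key order ("stable stringify")
# without an OrderedDict, though the order differs from the readable output.
print(json.dumps(record, indent=2, sort_keys=True))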

caisq (Contributor, Author):

I prefer using an OrderedDict and making the key order consistent between the readable and json formats.

I find printed outputs such as the following slightly cognitively overloading, because it takes some effort

  • to see what the identifying fields are (url and id in this case),
  • to mentally associate the logically related but physically separated fields like name and description, and
  • to mentally sort the logically ordered but actually shuffled fields like runs, tags, and scalars:
{
  "created": "2020-03-26 10:23:53",
  "description": "",
  "id": "dpi2D3lPTbe84YPSLw0giw",
  "name": "",
  "runs": "2",
  "scalars": "6",
  "tags": "1",
  "updated": "2020-03-26 10:23:53",
  "url": "https://tensorboard.dev/experiment/dpi2D3lPTbe84YPSLw0giw/"
}

@caisq (Contributor, Author) left a review:

Thank you both for the insightful reviews!

"""Constructor of _ListIntent.

Args:
json: If and ony if `True`, will print the list as a pretty-formatted
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

tensorboard/uploader/uploader_main.py Show resolved Hide resolved
Comment on lines 385 to 386
if self.json:
print(json.dumps(experiments_json, indent=2))
Copy link
Contributor Author

@caisq caisq Apr 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed comments and suggestions. I made the following changes to address them.

  1. The point that whitespace should be avoided in keys is taken. I created a Namedtuple called ExperimentMetadataField to hold the data for each row of the experiment metadata. It has the arg name for the name that's appropriate for a JSON key and the arg readable_name that holds the human-readable key.

  2. I also agree that numbers should be printed as numbers, not strings in the JSON output. In this revision, this is done by using the aforementioned new ExperimentMetadataField namedtuple, in addition to the two refactored classes JsonFormatter and ReadableFormatter. Both have the format_experiment() method to convert a list of ExpeimentMetadataFields into a (potentially multi-line) string. I also created a common ancestor abstract class for the two classes for clarity.

  3. Addressed by the above-described approach.

  4. The refactored and revised code now uses streaming consistently across the readable and json formats.

Also note:

  • the new helper classes ReadableFormatter and JsonFormatter are refactored to a new Python module tensorboard/uploader/foramtters.py
  • A unit test is written for them in the newly-added unit test module tensorboard/uploader/formatters_test.py.

print("\t%s %s" % (name.ljust(12), value))
if self.json:
experiments_json.append(
collections.OrderedDict([("URL", url)] + data)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer using an OrderedDict and making the key order consistent between the readable and json formats.

I find printed outputs such the following slightly cognitive overloading because it takes some efforts

  • to see what the identifying fields are (url and id in this case),
  • to mentally associate the logically related by actually separated fields like name and description.
  • to mentally sort the logically sorted by actually shuffled fields like runs, tags and scalars
{
  "created": "2020-03-26 10:23:53",
  "description": "",
  "id": "dpi2D3lPTbe84YPSLw0giw",
  "name": "",
  "runs": "2",
  "scalars": "6",
  "tags": "1",
  "updated": "2020-03-26 10:23:53",
  "url": "https://tensorboard.dev/experiment/dpi2D3lPTbe84YPSLw0giw/"
}

@@ -177,6 +177,11 @@ def define_flags(parser):
"list", help="list previously uploaded experiments"
)
list_parser.set_defaults(**{SUBCOMMAND_FLAG: SUBCOMMAND_KEY_LIST})
list_parser.add_argument(
"--json",
caisq (Contributor, Author):

@bileschi As pointed out by @wchargin, "prior art" varies. So I opted to err on the side of conciseness. Also see my earlier comments regarding this.
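
The quoted diff is truncated above; a plausible completion of the flag definition, continuing from the `list_parser` in that diff (the help wording and defaults are assumptions, not copied from the PR):

list_parser.add_argument(
    "--json",
    action="store_true",  # boolean flag: present means "emit JSON"
    default=False,
    help="Print experiment metadata as JSON instead of the default "
    "human-readable format.",  # help wording is an assumption
)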

@caisq caisq requested review from wchargin and davidsoergel April 7, 2020 03:36
@davidsoergel (Member) left a review:

LGTM mod one remaining comment. Thanks!

EXPERIMENT_METADATA_URL_JSON_KEY = "url"
ExperimentMetadataField = collections.namedtuple(
"ExperimentMetadataField",
("json_key", "readable_name", "value", "formatter"),
davidsoergel (Member):

I was confused by the name collision of "formatter". Can this one be "field_formatter" or "to_readable" or similar?

Also, it still feels a bit intertwined to pass a lambda that is specifically used only in the ReadableFormatter case. I think you can remove the field entirely and automate this. In ReadableFormatter.format_experiment, you could:

if not metadata_field.value and not isinstance(metadata_field.value, int):
  readable_value = "[No %s]" % metadata_field.readable_name
else:
  readable_value = str(metadata_field.value)

The typecheck there is a little iffy (because, as written, None will trigger the substitution even for fields that should be int, and a future float-valued field containing 0.0 would trigger too). I guess ExperimentMetadataField could carry a Type enum, but that seems overwrought. I think it'd be fine to gloss over all that for now (maybe with a code comment about the limitations).
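
A sketch of how that could look inside ReadableFormatter, with the limitation noted in a comment as suggested (the helper name is hypothetical):

def _readable_value(metadata_field):
    # Substitute a "[No <Name>]" placeholder for empty, non-int values.
    # Known limitation, glossed over for now: a None in an int-typed field,
    # or a future float-typed field holding 0.0, would also hit this branch.
    if not metadata_field.value and not isinstance(metadata_field.value, int):
        return "[No %s]" % metadata_field.readable_name
    return str(metadata_field.value)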

Returns:
A string that represents the `experiment_metadata`.
"""
raise NotImplementedError()
wchargin (Contributor):

In Python it’s generally preferable to just `pass` in abstract base class
methods, because otherwise implementations cannot call the method on
`super()`, which precludes cooperative multiple inheritance.

If you want to ensure that all concrete subclasses actually implement
this, consider making this an actual abc.ABCMeta rather than just
calling it an abstract base class in the docstring.

caisq (Contributor, Author):

Done.

Comment on lines 109 to 120
lines = output.split("\n")
self.assertLen(lines, 8)
self.assertEqual(lines[0], "{")
self.assertEqual(
lines[1], ' "url": "http://tensorboard.dev/deadbeef",'
)
self.assertEqual(lines[2], ' "name": "",')
self.assertEqual(lines[3], ' "description": "",')
self.assertEqual(lines[4], ' "runs": 8,')
self.assertEqual(lines[5], ' "tags": 12,')
self.assertEqual(lines[6], ' "binary_object_bytes": 2000')
self.assertEqual(lines[7], "}")
wchargin (Contributor):

Can we just

expected_lines = [
    "{",
    '  "url": "http://tensorboard.dev/deadbeef",',
    # ...
    "}"
]
self.assertEqual(lines, expected_lines)

rather than asserting on each line individually? It’s easier to read,
wraps better, and gives better failure messages.

Alternatively, you could just json.loads the thing and verify that the
object is correct rather than worrying about the details of whitespace.
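
A sketch of the json.loads variant inside the same test (expected values taken from the assertions above):

import json

# Parse the formatter's output and compare objects, so indentation and
# whitespace details no longer matter.
actual = json.loads(output)
expected = {
    "url": "http://tensorboard.dev/deadbeef",
    "name": "",
    "description": "",
    "runs": 8,
    "tags": 12,
    "binary_object_bytes": 2000,
}
self.assertEqual(actual, expected)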

caisq (Contributor, Author):

Done.

Comment on lines 25 to 27
ExperimentMetadataField = collections.namedtuple(
"ExperimentMetadataField",
("json_key", "readable_name", "value", "formatter"),
wchargin (Contributor):

Adding a formatter interface sounds reasonable to me, but this doesn’t
really feel like the right structure. The ExperimentMetadataField
struct is tightly coupled to the two implementations of formatters. The
readable_name and formatter fields are only relevant to the
ReadableFormatter, and the json_key field is only relevant to the
JsonFormatter. And as written, although the name Experiment* is
everywhere, nothing actually knows about experiments.

Why introduce a new type with a big list of vague “fields” when we
already have the experiment proto? The interface could just be

class ExperimentFormatter(metaclass=abc.ABCMeta):
    def format(self, experiment):
        """Format an experiment.

        Args:
          experiment: An `experiment_pb2.Experiment` value.

        Returns:
          A string.
        """
        pass

The ReadableExperimentFormatter would encapsulate the readable field
names and how to format them, and the JsonExperimentFormatter would encapsulate
the JSON keys and how to order and indent them. The tests would be on
the same output that we would actually see from the binary rather than
just an arbitrary set of fields defined in the test case.

Then, you could remove the giant block of field definitions in
uploader_main.py, too: just print(formatter.format(experiment)).

caisq (Contributor, Author):

Done; I applied the refactoring you suggested. This revision pushes a lot of the formatting-related variables into formatters.py, which is indeed nicer than before.

Also note that I changed the signature of the format_experiment() method slightly so that it takes two input arguments: experiment (an experiment_pb2.Experiment proto) and experiment_url (a string). Getting the latter from an Experiment proto requires the server_info module, which I think shouldn't be known to the formatter module, since that module is purely concerned with string formatting. (A rough sketch of the resulting interface follows.)
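
A rough sketch of the resulting formatters module (the class names ReadableFormatter/JsonFormatter and the format_experiment signature are from this thread; the base-class name, the exact field set, and their ordering here are assumptions):

import abc
import collections
import json


class BaseExperimentFormatter(metaclass=abc.ABCMeta):
    """Abstract base class: formats an experiment as a string."""

    @abc.abstractmethod
    def format_experiment(self, experiment, experiment_url):
        """Formats an `experiment_pb2.Experiment` and its URL as a string."""
        pass


class ReadableFormatter(BaseExperimentFormatter):
    _NAME_COLUMN_WIDTH = 12

    def format_experiment(self, experiment, experiment_url):
        output = [experiment_url]
        data = [
            ("Name", experiment.name or "[No Name]"),
            ("Description", experiment.description or "[No Description]"),
            ("Runs", str(experiment.num_runs)),
            ("Tags", str(experiment.num_tags)),
        ]
        for name, value in data:
            output.append(
                "\t%s %s" % (name.ljust(self._NAME_COLUMN_WIDTH), value)
            )
        return "\n".join(output)


class JsonFormatter(BaseExperimentFormatter):
    _JSON_INDENT = 2

    def format_experiment(self, experiment, experiment_url):
        data = [
            ("url", experiment_url),
            ("name", experiment.name),
            ("description", experiment.description),
            ("runs", experiment.num_runs),  # ints stay ints in JSON
            ("tags", experiment.num_tags),
        ]
        return json.dumps(
            collections.OrderedDict(data), indent=self._JSON_INDENT
        )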

@caisq caisq requested a review from wchargin April 7, 2020 19:13
@wchargin (Contributor) left a review:

Looks good once tensorboard dev list --json is fixed.

Getting the [experiment URL] from an Experiment proto requires the
server_info module, which I think shouldn't be known to the
formatter module, which is purely concerned with string formatting.

Yep, agreed; sounds good to me.

@@ -348,23 +358,16 @@ def execute(self, server_info, channel):
)
gen = exporter_lib.list_experiments(api_client, fieldmask=fieldmask)
count = 0

if self.json:
formatter = formatters.JsonFormatterI()
wchargin (Contributor):

Presumably JsonFormatterI should be JsonFormatter?

This breaks when run:

    formatter = formatters.JsonFormatterI()
AttributeError: module 'tensorboard.uploader.formatters' has no attribute 'JsonFormatterI'

caisq (Contributor, Author):

Hmmm. I don't know how this got in. I'll be more careful with my keyboard setup. Thanks for catching this.

("scalars", experiment.num_scalars),
]
return json.dumps(
collections.OrderedDict(data), indent=self._JSON_INDENT,
wchargin (Contributor):

Optional, but: if you’re concerned about ease of parsing the JSON output
without jq, you could consider using indent=None (the default), so
that the output is one JSON object per line. This makes it easy to parse
with a parser that can only take a complete JSON string, like Python’s
json.loads or JS’s JSON.parse, because the user can easily identify
the framing boundaries (just split by newline). Otherwise, they have to
identify the actual object boundaries themselves, which is slightly less
trivial.

Up to you; just mentioning because you noted this concern in the
original PR description.

caisq (Contributor, Author):

I did a test with tensorboard dev list --json | jq -s under this PR. jq seems to be able to handle the indent=2 just fine. So I'll leave it as is.

wchargin (Contributor):

Right, yes, jq handles it fine, hence “[…] ease of parsing the JSON
output without jq.” This is fine with me.

caisq (Contributor, Author):

Ack. I think the slight difficulty of parsing with other tools should be acceptable.

import collections
import json

import six
wchargin (Contributor):

No need to support Python 2 in new srcs_version = "PY3" modules.

caisq (Contributor, Author):

Done. There are a few other modules in tensorboard/uploader that are still using six.add_metaclass(). Those can be corrected later.
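
For reference, a minimal sketch contrasting the six-based form still used elsewhere in tensorboard/uploader with the Python 3-only form appropriate for new srcs_version = "PY3" modules:

import abc

import six


@six.add_metaclass(abc.ABCMeta)
class LegacyBase(object):
    """Python 2/3-compatible abstract base (six-based form)."""


class ModernBase(metaclass=abc.ABCMeta):
    """Python 3-only equivalent; no six dependency needed."""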

Comment on lines 92 to 93
("created", util.format_time(experiment.create_time)),
("updated", util.format_time(experiment.update_time)),
@wchargin (Contributor) commented Apr 7, 2020:

It probably makes more sense to use format_time_absolute for the JSON
formatter? Since the point is to have a simple, parseable format,
outputs like "just now" probably aren’t what’s wanted.

caisq (Contributor, Author):

Done. Unit test is updated accordingly.

@caisq caisq requested a review from wchargin April 7, 2020 19:45

@@ -0,0 +1,99 @@
# Copyright 2019 The TensorFlow Authors. All Rights Reserved.
wchargin (Contributor):

nit: 2019 → 2020 (and test)

caisq (Contributor, Author):

Done here and in formatters_test.py.

@caisq caisq merged commit 2edd0ea into tensorflow:master Apr 7, 2020
bileschi pushed a commit to bileschi/tensorboard that referenced this pull request Apr 15, 2020
bileschi pushed a commit that referenced this pull request Apr 15, 2020