uploader: add `--json` flag to the `list` command (#3480)

Conversation
Do we envision more formats being desired here? (csv?) Did you consider whether this should be a more general […]

@bileschi Good question. I opted for the current approach (i.e., […]).
Thanks for writing this! Appreciate it.

+1 to @bileschi’s question. In my experience, precedent is a bit mixed. The JavaScript world often uses `--json` (flow, webpack, `npm show`, jest, …), but JSON is obviously more common there (it’s the “JS” in “JSON”, after all). Tools like `gcloud` have `--format json`, but tools like `grpc_cli` have `--json_output`.

Not sure that I have a strong opinion here.
"""Constructor of _ListIntent. | ||
|
||
Args: | ||
json: If and ony if `True`, will print the list as a pretty-formatted |
sp.: ony → only
Done.
```python
if self.json:
    print(json.dumps(experiments_json, indent=2))
```
So the output here looks like this:

```json
[
  {
    "URL": "https://tensorboard.dev/experiment/abcdefghijklmnopqrstuv/",
    "Name": "[No Name]",
    "Description": "[No Description]",
    "Id": "abcdefghijklmnopqrstuv",
    "Created": "2019-10-25 07:22:30",
    "Updated": "2019-10-25 07:22:34",
    "Scalars": "3208",
    "Runs": "8",
    "Tags": "4"
  }
]
```
There are a few problems with this:

1. The keys are the same as the human-readable form, and in particular are capitalized and (after “uploader: display binary object bytes in `tensorboard dev list` output” #3464) may have whitespace.
2. Numeric values like `runs` and `tags` are strings, not ints.
3. Name and description have dummy values instead of proper `""`s.
4. The output is slurped and not streamed.
On (1): After #3464, this will have an extra "Binary object bytes" key. But this is pretty unidiomatic. JSON keys do not typically have whitespace; you can’t use such a key in a `jq` dot-selector:

```
$ printf '{"one": 1, "two three": 23}' | jq '.one'
1
$ printf '{"one": 1, "two three": 23}' | jq '.two three'
jq: error: syntax error, unexpected IDENT, expecting $end (Unix shell quoting issues?) at <top-level>, line 1:
.two three
jq: 1 compile error
```
On (2): The integers are being converted to strings for human presentation, but should be left as ints in the JSON output for programmatic manipulation.

On (3): Similar to (2). The dummy values are just to prevent confusing human-readable output with a blank space. There’s no ambiguity in using JSON `""`.

On (4): I see that you considered a streaming response but didn’t like that it was “less easy to parse programmatically”. Is it? You can convert a streaming response to a buffered response by simply piping through `jq -s`, but you can’t go the other way around without waiting for the whole buffer. And as long as we make sure to print one record per line, people using tools other than jq should be able to deal with it handily (`for line in file: json.loads(line)` works in Python, e.g.).
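For example, a buffered consumer of such a one-record-per-line stream needs only a few lines of Python (the field names below are made up for illustration, not the actual output schema):

```python
import io
import json

# Simulated streamed output: one complete JSON record per line (NDJSON).
stream = io.StringIO(
    '{"id": "abc", "runs": 8}\n'
    '{"id": "def", "runs": 3}\n'
)

# Each line is a complete JSON document, so no special framing is needed.
experiments = [json.loads(line) for line in stream]
total_runs = sum(e["runs"] for e in experiments)
```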
One way to resolve these could be to change `data` from a 2-column table to a 4-column table with human-readable name, JSON key, raw value, and formatter used when displaying to humans: e.g.,

```python
data = [
    ("Name", "name", experiment.name, lambda x: x or "[No Name]"),
    ("Description", "description", experiment.description, lambda x: x or "[No Description]"),
    ("Id", "id", experiment.experiment_id, str),
    ("Created", "created", util.format_time(experiment.create_time), str),
    ("Updated", "updated", util.format_time(experiment.update_time), str),
    ("Scalars", "scalars", experiment.num_scalars, str),
    ("Runs", "runs", experiment.num_runs, str),
    ("Tags", "tags", experiment.num_tags, str),
]
```

and project out the appropriate fields in the JSON and non-JSON cases; of course feel free to pick a different implementation if you prefer.
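A rough sketch of how the two projections could fall out of such a 4-column table (the rows here are dummies, not the real experiment fields):

```python
import collections
import json

# Each row: (human-readable name, JSON key, raw value, readable formatter).
data = [
    ("Name", "name", "", lambda x: x or "[No Name]"),
    ("Runs", "runs", 8, str),
]

# JSON case: keep raw values, so ints stay ints and "" stays "".
as_json = json.dumps(
    collections.OrderedDict((key, value) for _, key, value, _ in data)
)

# Human-readable case: apply each row's formatter to its raw value.
as_readable = "\n".join(
    "\t%s %s" % (name.ljust(12), fmt(value)) for name, _, value, fmt in data
)
```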
Thanks for the detailed comments and suggestions. I made the following changes to address them.

1. The point that whitespace should be avoided in keys is taken. I created a namedtuple called `ExperimentMetadataField` to hold the data for each row of the experiment metadata. It has the arg `name` for the name that's appropriate for a JSON key and the arg `readable_name` that holds the human-readable key.
2. I also agree that numbers should be printed as numbers, not strings, in the JSON output. In this revision, this is done by using the aforementioned new `ExperimentMetadataField` namedtuple, in addition to the two refactored classes `JsonFormatter` and `ReadableFormatter`. Both have the `format_experiment()` method to convert a list of `ExperimentMetadataField`s into a (potentially multi-line) string. I also created a common abstract ancestor class for the two classes for clarity.
3. Addressed by the above-described approach.
4. The refactored and revised code now uses streaming consistently across the readable and JSON formats.

Also note:
- The new helper classes `ReadableFormatter` and `JsonFormatter` are refactored into a new Python module, `tensorboard/uploader/formatters.py`.
- A unit test is written for them in the newly-added unit test module `tensorboard/uploader/formatters_test.py`.
print("\t%s %s" % (name.ljust(12), value)) | ||
if self.json: | ||
experiments_json.append( | ||
collections.OrderedDict([("URL", url)] + data) |
Could we just pass `sort_keys=True` instead of using an `OrderedDict`? The determinism is nice, but I don’t know that we really need to enforce the same order as with the human-readable output; sorting keys is the standard “stable stringify”.
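For reference, a quick sketch contrasting the two options (record contents are made up):

```python
import collections
import json

record = {"name": "", "runs": 8, "url": "https://example.test/x"}

# Option A: sort_keys=True gives a deterministic order without
# the caller having to control insertion order.
sorted_out = json.dumps(record, sort_keys=True)

# Option B: an OrderedDict serializes keys in whatever order the
# caller chose (here, url first, matching the readable output).
ordered_out = json.dumps(
    collections.OrderedDict(
        [("url", record["url"]), ("name", record["name"]), ("runs", record["runs"])]
    )
)
```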
I prefer using an `OrderedDict` and making the key order consistent between the readable and JSON formats. I find printed outputs such as the following slightly cognitively overloading, because it takes some effort

- to see what the identifying fields are (`url` and `id` in this case),
- to mentally associate the logically related but actually separated fields like `name` and `description`,
- to mentally sort the logically sorted but actually shuffled fields like `runs`, `tags`, and `scalars`:

```json
{
  "created": "2020-03-26 10:23:53",
  "description": "",
  "id": "dpi2D3lPTbe84YPSLw0giw",
  "name": "",
  "runs": "2",
  "scalars": "6",
  "tags": "1",
  "updated": "2020-03-26 10:23:53",
  "url": "https://tensorboard.dev/experiment/dpi2D3lPTbe84YPSLw0giw/"
}
```
Thank you both for the insightful reviews!
"""Constructor of _ListIntent. | ||
|
||
Args: | ||
json: If and ony if `True`, will print the list as a pretty-formatted |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
if self.json: | ||
print(json.dumps(experiments_json, indent=2)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the detailed comments and suggestions. I made the following changes to address them.
-
The point that whitespace should be avoided in keys is taken. I created a
Namedtuple
calledExperimentMetadataField
to hold the data for each row of the experiment metadata. It has the argname
for the name that's appropriate for a JSON key and the argreadable_name
that holds the human-readable key. -
I also agree that numbers should be printed as numbers, not strings in the JSON output. In this revision, this is done by using the aforementioned new
ExperimentMetadataField
namedtuple, in addition to the two refactored classesJsonFormatter
andReadableFormatter
. Both have theformat_experiment()
method to convert a list ofExpeimentMetadataFields
into a (potentially multi-line) string. I also created a common ancestor abstract class for the two classes for clarity. -
Addressed by the above-described approach.
-
The refactored and revised code now uses streaming consistently across the readable and json formats.
Also note:
- the new helper classes
ReadableFormatter
andJsonFormatter
are refactored to a new Python moduletensorboard/uploader/foramtters.py
- A unit test is written for them in the newly-added unit test module
tensorboard/uploader/formatters_test.py
.
print("\t%s %s" % (name.ljust(12), value)) | ||
if self.json: | ||
experiments_json.append( | ||
collections.OrderedDict([("URL", url)] + data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer using an OrderedDict
and making the key order consistent between the readable and json formats.
I find printed outputs such the following slightly cognitive overloading because it takes some efforts
- to see what the identifying fields are (
url
andid
in this case), - to mentally associate the logically related by actually separated fields like
name
anddescription
. - to mentally sort the logically sorted by actually shuffled fields like
runs
,tags
andscalars
{
"created": "2020-03-26 10:23:53",
"description": "",
"id": "dpi2D3lPTbe84YPSLw0giw",
"name": "",
"runs": "2",
"scalars": "6",
"tags": "1",
"updated": "2020-03-26 10:23:53",
"url": "https://tensorboard.dev/experiment/dpi2D3lPTbe84YPSLw0giw/"
}
```
@@ -177,6 +177,11 @@ def define_flags(parser):
        "list", help="list previously uploaded experiments"
    )
    list_parser.set_defaults(**{SUBCOMMAND_FLAG: SUBCOMMAND_KEY_LIST})
    list_parser.add_argument(
        "--json",
```
LGTM mod one remaining comment. Thanks!
tensorboard/uploader/formatters.py (outdated)
```python
EXPERIMENT_METADATA_URL_JSON_KEY = "url"
ExperimentMetadataField = collections.namedtuple(
    "ExperimentMetadataField",
    ("json_key", "readable_name", "value", "formatter"),
```
I was confused by the name collision of "formatter". Can this one be "field_formatter" or "to_readable" or similar?

Also, it still feels a bit intertwined to pass a lambda that is specifically used only in the `ReadableFormatter` case. I think you can remove the field entirely and automate this. In `ReadableFormatter.format_experiment`, you could:

```python
if not metadata_field.value and not isinstance(metadata_field.value, int):
    readable_value = "[No %s]" % metadata_field.readable_name
else:
    readable_value = str(metadata_field.value)
```

The type check there is a little iffy (because, as written, `None` will trigger the substitution even for fields that should be int, and a future float-valued field containing `0.0` would trigger too). I guess `ExperimentMetadataField` could carry a type enum, but that seems overwrought. I think it'd be fine to gloss over all that for now (maybe with a code comment about the limitations).
tensorboard/uploader/formatters.py (outdated)
```python
        Returns:
          A string that represents the `experiment_metadata`.
        """
        raise NotImplementedError()
```
In Python it’s generally preferable to `pass` in abstract base class methods, because otherwise implementations cannot call the method via `super()`, which precludes cooperative multiple inheritance.

If you want to ensure that all concrete subclasses actually implement this, consider making this an actual `abc.ABCMeta` rather than just calling it an abstract base class in the docstring.
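A minimal sketch of both points together (class names here are illustrative, not the PR's):

```python
import abc

class BaseExperimentFormatter(abc.ABC):
    @abc.abstractmethod
    def format_experiment(self, experiment):
        """Format an experiment into a string."""
        pass  # `pass`, not `raise`, so implementations may call super()

class JsonFormatter(BaseExperimentFormatter):
    def format_experiment(self, experiment):
        # The cooperative call up the MRO is harmless because the base passes.
        super().format_experiment(experiment)
        return str(experiment)

# The abstractmethod decorator enforces implementation at instantiation time:
try:
    BaseExperimentFormatter()
except TypeError:
    pass  # cannot instantiate a class with unimplemented abstract methods
```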
Done.
```python
lines = output.split("\n")
self.assertLen(lines, 8)
self.assertEqual(lines[0], "{")
self.assertEqual(
    lines[1], '  "url": "http://tensorboard.dev/deadbeef",'
)
self.assertEqual(lines[2], '  "name": "",')
self.assertEqual(lines[3], '  "description": "",')
self.assertEqual(lines[4], '  "runs": 8,')
self.assertEqual(lines[5], '  "tags": 12,')
self.assertEqual(lines[6], '  "binary_object_bytes": 2000')
self.assertEqual(lines[7], "}")
```
Can we just

```python
expected_lines = [
    "{",
    '  "url": "http://tensorboard.dev/deadbeef",',
    # ...
    "}",
]
self.assertEqual(lines, expected_lines)
```

rather than asserting on each line individually? It’s easier to read, wraps better, and gives better failure messages.

Alternatively, you could just `json.loads` the thing and verify that the object is correct rather than worrying about the details of whitespace.
Done.
tensorboard/uploader/formatters.py (outdated)
```python
ExperimentMetadataField = collections.namedtuple(
    "ExperimentMetadataField",
    ("json_key", "readable_name", "value", "formatter"),
```
Adding a formatter interface sounds reasonable to me, but this doesn’t really feel like the right structure. The `ExperimentMetadataField` struct is tightly coupled to the two implementations of formatters. The `readable_name` and `formatter` fields are only relevant to the `ReadableFormatter`, and the `json_key` field is only relevant to the `JsonFormatter`. And as written, although the name `Experiment*` is everywhere, nothing actually knows about experiments.

Why introduce a new type with a big list of vague “fields” when we already have the experiment proto? The interface could just be

```python
class ExperimentFormatter(metaclass=abc.ABCMeta):
    def format(self, experiment):
        """Format an experiment.

        Args:
          experiment: An `experiment_pb2.Experiment` value.

        Returns:
          A string.
        """
        pass
```

The `ReadableExperimentFormatter` would encapsulate the readable field names and how to format them, and the `JsonExperimentFormatter` would encapsulate the JSON keys and how to order and indent them. The tests would be on the same output that we would actually see from the binary rather than just an arbitrary set of fields defined in the test case.

Then, you could remove the giant block of field definitions in `uploader_main.py`, too: just `print(formatter.format(experiment))`.
Done effecting the refactoring that you suggested. This revision has pushed a lot of the variables related to the formatting into `formatters.py`, which is indeed nicer than before.

Also note that I changed the signature of the `format_experiment()` method slightly so that it takes two input arguments: `experiment` (an `experiment_pb2.Experiment` proto) and `experiment_url` (a string). Getting the latter from an `Experiment` proto requires the `server_info` module, which I think shouldn't be known to the formatter module, which is purely concerned with string formatting.
Looks good once `tensorboard dev list --json` is fixed.

> Getting the [experiment URL] from an `Experiment` proto requires the `server_info` module, which I think shouldn't be known to the formatter module, which is purely concerned with string formatting.

Yep, agreed; sounds good to me.
```
@@ -348,23 +358,16 @@ def execute(self, server_info, channel):
        )
        gen = exporter_lib.list_experiments(api_client, fieldmask=fieldmask)
        count = 0

        if self.json:
            formatter = formatters.JsonFormatterI()
```
Presumably `JsonFormatterI` should be `JsonFormatter`?

This breaks when run:

```
formatter = formatters.JsonFormatterI()
AttributeError: module 'tensorboard.uploader.formatters' has no attribute 'JsonFormatterI'
```
Hmmm. I don't know how this got in. I'll be more careful with my keyboard setup. Thanks for catching this.
("scalars", experiment.num_scalars), | ||
] | ||
return json.dumps( | ||
collections.OrderedDict(data), indent=self._JSON_INDENT, |
Optional, but: if you’re concerned about ease of parsing the JSON output without `jq`, you could consider using `indent=None` (the default), so that the output is one JSON object per line. This makes it easy to parse with a parser that can only take a complete JSON string, like Python’s `json.loads` or JS’s `JSON.parse`, because the user can easily identify the framing boundaries (just split by newline). Otherwise, they have to identify the actual object boundaries themselves, which is slightly less trivial.

Up to you; just mentioning because you noted this concern in the original PR description.
I did a test with `tensorboard dev list --json | jq -s` under this PR. `jq` seems to be able to handle the `indent=2` just fine. So I'll leave it as is.
Right, yes, `jq` handles it fine, hence “[…] ease of parsing the JSON output without `jq`.” This is fine with me.
Ack. I think the slight difficulty of parsing with other tools should be acceptable.
tensorboard/uploader/formatters.py (outdated)
```python
import collections
import json

import six
```
No need to support Python 2 in new `srcs_version = "PY3"` modules.
Done. There are a few other modules in tensorboard/uploader that are still using `six.add_metaclass()`. Those can be corrected later.
tensorboard/uploader/formatters.py (outdated)
("created", util.format_time(experiment.create_time)), | ||
("updated", util.format_time(experiment.update_time)), |
It probably makes more sense to use `format_time_absolute` for the JSON formatter? Since the point is to have a simple, parseable format, outputs like `"just now"` probably aren’t what’s wanted.
Done. Unit test is updated accordingly.
("scalars", experiment.num_scalars), | ||
] | ||
return json.dumps( | ||
collections.OrderedDict(data), indent=self._JSON_INDENT, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, yes, jq
handles it fine, hence “[…] ease of parsing the JSON
output without jq
.” This is fine with me.
tensorboard/uploader/formatters.py (outdated)
```
@@ -0,0 +1,99 @@
# Copyright 2019 The TensorFlow Authors. All Rights Reserved.
```
nit: 2019 → 2020 (and test)
Done here and in `formatters_test.py`.
* Motivation for features / changes
  * Fulfill feature request bug b/153232102
* Technical description of changes
  * Add the `--json` flag to the `list` subcommand of `tensorboard dev`.
  * If the flag is used, the experiments will be printed as a JSON object mapping experiment URLs to experiment data (name, description, runs, tags, etc.)
* Screenshots of UI changes
  * ![image](https://user-images.githubusercontent.com/16824702/78626883-0f77f480-785e-11ea-88ca-b8d653d302c6.png)
* Detailed steps to verify changes work correctly (as executed by you)
  * Manually ran `tensorboard dev list --json` (see screenshot above)
* Alternate designs / implementations considered
  * Output a single big json array at the end:
    * Pro: may be easier to parse programmatically
    * Con: no streaming