Shard eval dataset and aggregate eval metrics #10
Conversation
Also, instead of calling `eval_loss.item()` every step, do the summation with tensors on device.
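For context, a minimal sketch of what device-side accumulation looks like (names like `device`, `model`, and `eval_dataloader` are illustrative, not the exact PR diff):

```python
import torch

# Keep the running eval loss as a tensor on the XLA/TPU device.
# Calling .item() every step forces a device-to-host transfer (and an
# XLA graph execution on TPU), so we sync only once, at the end.
eval_loss = torch.zeros(1, device=device)
nb_eval_steps = 0
for batch in eval_dataloader:
    with torch.no_grad():
        tmp_eval_loss = model(**batch)[0]
    eval_loss += tmp_eval_loss.detach()  # tensor-tensor add, stays on device
    nb_eval_steps += 1
results["eval_loss"] = (eval_loss / nb_eval_steps).item()  # single host sync
```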
did you test this e2e?
examples/run_glue_tpu.py (Outdated)

```
@@ -256,14 +257,29 @@ def evaluate(args, model, tokenizer, prefix="", disable_logging=False):
    preds = np.squeeze(preds)
    result = compute_metrics(eval_task, preds, out_label_ids)
    results.update(result)
    results['eval_loss'] = eval_loss.item()

    # Average all metrics from each shard
```
what are some of the metrics? does it make sense to avg all of them? some metrics may be additive.
f1, accuracy, eval_loss, and acc_and_f1 (the average of the two).
I checked; they all make sense averaged.
Yep, ignore that :D They're not additive, as discussed. Will update the PR.
I'd be surprised if global f1 == np.mean(local f1s). That's probably not true; let's verify on paper what the true formula for the global f1 is.
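For intuition, a quick numeric counterexample (made-up shard data, just to illustrate the point) confirms the two generally differ:

```python
import numpy as np
from sklearn.metrics import f1_score

# Two eval shards with different class balance and error patterns.
labels1, preds1 = np.array([1, 1, 0, 0]), np.array([1, 0, 0, 0])
labels2, preds2 = np.array([1, 0, 0, 0]), np.array([1, 1, 1, 0])

mean_f1 = (f1_score(labels1, preds1) + f1_score(labels2, preds2)) / 2
global_f1 = f1_score(np.concatenate([labels1, labels2]),
                     np.concatenate([preds1, preds2]))
print(mean_f1, global_f1)  # ~0.583 vs ~0.571: mean of shard F1s != global F1
```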
Updated to sync the pred/label tensors directly instead (these shouldn't be that big for fine-tuning tasks; a single integer 0/1 per example). This way we don't need custom aggregation logic per metric and don't touch upstream core code.
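A sketch of that approach, assuming `xm.mesh_reduce` from `torch_xla` is used for the sync (the actual PR code may differ in details):

```python
import numpy as np
import torch_xla.core.xla_model as xm

# Gather every shard's local predictions/labels: mesh_reduce collects the
# value from all ordinals into a list and applies the reduce function, so
# np.concatenate rebuilds the full eval set on every process.
preds = xm.mesh_reduce("eval_preds", preds, np.concatenate)
out_label_ids = xm.mesh_reduce("eval_out_label_ids", out_label_ids, np.concatenate)

# With the full pred/label arrays in hand, every metric (f1 included) is
# computed once over the whole dataset; no per-metric aggregation rules.
result = compute_metrics(eval_task, preds, out_label_ids)
```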
```
logger.info(" %s = %s", key, str(result[key]))
writer.write("%s = %s\n" % (key, str(result[key])))
tb_writer.add_scalar(key, result[key])
if xm.is_master_ordinal():
```
is everything being logged here already on CPU?
Yes.
maybe let's add a comment? It's a subtle point that can be missed by code readers.
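Something along these lines would do (illustrative wording, not the exact comment that landed):

```python
if xm.is_master_ordinal():
    # Note: compute_metrics runs on host, so every value in `result` is
    # already a plain Python/NumPy scalar on CPU; logging here triggers
    # no additional device-to-host transfers.
    for key in sorted(result.keys()):
        logger.info(" %s = %s", key, str(result[key]))
        writer.write("%s = %s\n" % (key, str(result[key])))
        tb_writer.add_scalar(key, result[key])
```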
Yeah, e2e tested.
As brought up during review, some metrics like f1 cannot be aggregated by averaging per-shard values. Which metric a GLUE task uses depends on the dataset, so instead we sync the prediction and label tensors across shards so that the metrics can be computed accurately on those.
* Initial commit to get BERT + run_glue.py on TPU
* Add README section for TPU and address comments.
* Cleanup TPU bits from run_glue.py (pytorch-tpu#3). TPU runner is currently implemented in: https://github.com/pytorch-tpu/transformers/blob/tpu/examples/run_glue_tpu.py. We plan to upstream this directly into `huggingface/transformers` (either the `master` or `tpu` branch) once it's been more thoroughly tested.
* No need to call `xm.mark_step()` explicitly (pytorch-tpu#4), since for gradient accumulation we're accumulating on batches from the `ParallelLoader` instance, which marks the step itself on next().
* Resolve R/W conflicts from multiprocessing (pytorch-tpu#5)
* Add XLNet to list of models for `run_glue_tpu.py` (pytorch-tpu#6)
* Add RoBERTa to list of models in TPU GLUE (pytorch-tpu#7)
* Add RoBERTa and DistilBert to list of models in TPU GLUE (pytorch-tpu#8)
* Use barriers to reduce duplicate work/resources (pytorch-tpu#9)
* Shard eval dataset and aggregate eval metrics (pytorch-tpu#10). Also, instead of calling `eval_loss.item()` every time, do the summation with tensors on device.
* Change defaultdict to float
* Reduce the pred/label tensors instead of metrics. As brought up during review, some metrics like f1 cannot be aggregated via averaging; which metric a GLUE task uses depends on the dataset, so instead we sync the prediction and label tensors so that the metrics can be computed accurately on those.
* Only use tb_writer from master (pytorch-tpu#11)
* Apply huggingface black code formatting
* Style
* Remove `--do_lower_case` as example uses cased
* Add option to specify tensorboard logdir. This is needed for our testing framework, which checks regressions against key metrics written by the summary writer.
* Using configuration for `xla_device`
* Prefix TPU specific comments.
* num_cores clarification and namespace eval metrics
* Cache features file under `args.cache_dir` instead of under `args.data_dir`. This is needed as our test infra uses a data_dir with a read-only filesystem.
* Rename `run_glue_tpu` to `run_tpu_glue`

Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>