From 1955c08f6c6821224f746c3841d365bdf81acc10 Mon Sep 17 00:00:00 2001 From: Zane Durante Date: Wed, 3 Jul 2019 13:53:08 -0700 Subject: [PATCH 01/11] added image captioning notebook for review --- .../notebooks/Image Captioning TF 2.0.ipynb | 1100 +++++++++++++++++ 1 file changed, 1100 insertions(+) create mode 100644 samples/notebooks/Image Captioning TF 2.0.ipynb diff --git a/samples/notebooks/Image Captioning TF 2.0.ipynb b/samples/notebooks/Image Captioning TF 2.0.ipynb new file mode 100644 index 00000000000..cf009482ce9 --- /dev/null +++ b/samples/notebooks/Image Captioning TF 2.0.ipynb @@ -0,0 +1,1100 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "

Image Captioning Using Tensorflow 2.0

" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This notebook modifies an example tensorflow 2.0 notebook from\n", + "[here](https://colab.sandbox.google.com/github/tensorflow/docs/blob/master/site/en/r2/tutorials/text/image_captioning.ipynb)\n", + "to work with kubeflow pipelines" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "

Download dataset and upload to GCS

" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "First, we have to download the [MS COCO dataset](http://cocodataset.org/#download). This sample uses both the 2014 train images and 2014 train/val annotations. If you downloaded and extracted the dataset on your local system, you can upload it to GCS using `gsutil -m cp -r path/to/dataset/ gs://[YOUR-BUCKET-ID]/ms-coco`." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "

Setup project info and imports

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Previously downloaded dataset and put onto GCS\n", + "GCS_DATASET_PATH = 'gs://artifacts.intro-to-kubeflow-1.appspot.com/ms-coco'" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Kubeflow project settings\n", + "EXPERIMENT_NAME = 'Image Captioning'\n", + "PROJECT_NAME = 'intro-to-kubeflow-1' \n", + "PIPELINE_STORAGE_PATH = 'gs://artifacts.intro-to-kubeflow-1.appspot.com/ms-coco/components' # path to save pipeline component images\n", + "BASE_IMAGE = 'tensorflow/tensorflow:2.0.0b0-py3' # using tensorflow 2.0.0" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import kfp\n", + "import kfp.dsl as dsl\n", + "from kfp import compiler\n", + "from kfp.gcp import use_gcp_secret" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "

(Optional) Execute components in notebook

" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Uncomment the following lines to install the required libraries if you want to run the components (python functions) within the notebook." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "# !pip install sklearn\n", + "# !pip install tensorflow==2.0.0-beta1\n", + "# !pip install matplotlib" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# import tensorflow as tf\n", + "# print(tf.executing_eagerly()) # Should print True if tf 2.0 is installed " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "

Create pipeline components

" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "

Data preprocessing component

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "@dsl.python_component(\n", + " name='img_data_preprocessing',\n", + " description='preprocesses images with inceptionV3',\n", + " base_image=BASE_IMAGE\n", + ")\n", + "def preprocess(dataset_path: str, num_examples: int, OUTPUT_DIR: str, \n", + " batch_size: int) -> str:\n", + " import json\n", + " import numpy as np\n", + " import tensorflow as tf\n", + " from tensorflow.python.lib.io import file_io\n", + " from sklearn.utils import shuffle\n", + " \n", + " if OUTPUT_DIR == 'default':\n", + " OUTPUT_DIR = dataset_path + '/preprocess/'\n", + " \n", + " annotation_file = dataset_path + '/annotations_trainval2014/annotations/captions_train2014.json'\n", + " PATH = dataset_path + '/train2014/train2014/'\n", + " \n", + " # Read the json file (CHANGED FROM open() TO file_io.FileIO)\n", + " with file_io.FileIO(annotation_file, 'r') as f:\n", + " annotations = json.load(f)\n", + "\n", + " # Store captions and image names in vectors\n", + " all_captions = []\n", + " all_img_name_vector = []\n", + "\n", + " for annot in annotations['annotations']:\n", + " caption = ' ' + annot['caption'] + ' '\n", + " image_id = annot['image_id']\n", + " full_coco_image_path = PATH + 'COCO_train2014_' + '%012d.jpg' % (image_id)\n", + "\n", + " all_img_name_vector.append(full_coco_image_path)\n", + " all_captions.append(caption)\n", + "\n", + " # Shuffle captions and image_names together\n", + " train_captions, img_name_vector = shuffle(all_captions,\n", + " all_img_name_vector,\n", + " random_state=1)\n", + "\n", + " # Select the first num_examples captions/imgs from the shuffled set\n", + " train_captions = train_captions[:num_examples]\n", + " img_name_vector = img_name_vector[:num_examples]\n", + " \n", + " # Preprocess the images before feeding into inceptionV3\n", + " def load_image(image_path):\n", + " img = tf.io.read_file(image_path)\n", + " img = tf.image.decode_jpeg(img, channels=3)\n", + " img = tf.image.resize(img, (299, 299))\n", + " img = tf.keras.applications.inception_v3.preprocess_input(img)\n", + " return img, image_path\n", + " \n", + " image_model = tf.keras.applications.InceptionV3(include_top=False,\n", + " weights='imagenet')\n", + " new_input = image_model.input\n", + " hidden_layer = image_model.layers[-1].output\n", + "\n", + " image_features_extract_model = tf.keras.Model(new_input, hidden_layer)\n", + " \n", + " # Save extracted features in GCS\n", + " # Get unique images\n", + " encode_train = sorted(set(img_name_vector))\n", + " \n", + " image_dataset = tf.data.Dataset.from_tensor_slices(encode_train)\n", + " image_dataset = image_dataset.map(\n", + " load_image, num_parallel_calls=tf.data.experimental.AUTOTUNE).batch(batch_size)\n", + " \n", + " for img, path in image_dataset:\n", + " batch_features = image_features_extract_model(img)\n", + " batch_features = tf.reshape(batch_features,\n", + " (batch_features.shape[0], -1, batch_features.shape[3]))\n", + "\n", + " for bf, p in zip(batch_features, path):\n", + " path_of_feature = p.numpy().decode(\"utf-8\")\n", + " # Save to a different location and as numpy array\n", + " path_of_feature = path_of_feature.replace('.jpg', '.npy')\n", + " path_of_feature = path_of_feature.replace(PATH, OUTPUT_DIR)\n", + " np.save(file_io.FileIO(path_of_feature, 'w'), bf.numpy())\n", + " \n", + " # Create array for locations of preprocessed images\n", + " preprocessed_imgs = [img.replace('.jpg', '.npy') for img in img_name_vector]\n", + " 
preprocessed_imgs = [img.replace(PATH, OUTPUT_DIR) for img in preprocessed_imgs]\n", + " \n", + " # Save train_captions and preprocessed_imgs to file\n", + " train_cap_path = OUTPUT_DIR + 'train_captions.npy' # array of captions\n", + " preprocessed_imgs_path = OUTPUT_DIR + 'preprocessed_imgs.py'# array of paths to preprocessed images\n", + " \n", + " train_captions = np.array(train_captions)\n", + " np.save(file_io.FileIO(train_cap_path, 'w'), train_captions)\n", + " \n", + " preprocessed_imgs = np.array(preprocessed_imgs)\n", + " np.save(file_io.FileIO(preprocessed_imgs_path, 'w'), preprocessed_imgs)\n", + " \n", + " return (train_cap_path, preprocessed_imgs_path)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "TARGET_IMAGE = 'gcr.io/%s/preprocessing:latest' % PROJECT_NAME\n", + "preprocessing_img_op = compiler.build_python_component(\n", + " component_func=preprocess,\n", + " staging_gcs_path=PIPELINE_STORAGE_PATH,\n", + " base_image=BASE_IMAGE,\n", + " dependency=[kfp.compiler.VersionedDependency(name='scikit-learn', version='0.21.2')],\n", + " target_image=TARGET_IMAGE)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Uncomment the following lines if you want to test the python function from the notebook\n", + "# num_examples = 100\n", + "# preprocess_output_dir = 'default' # Can change if you want a specific directory\n", + "# batch_size = 16\n", + "# preprocess_output = preprocess(GCS_DATASET_PATH, num_examples, preprocess_output_dir, batch_size)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
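One detail worth noting: the component is annotated `-> str` but returns a Python tuple of GCS paths. Kubeflow Pipelines hands that value to downstream steps as its string representation, which is why the later components parse their inputs with `ast.literal_eval` (imported as `make_tuple`). A minimal sketch of that round trip, using made-up paths:

```python
from ast import literal_eval as make_tuple

# What the preprocessing step conceptually emits (stringified by the pipeline):
preprocess_output = str(('gs://my-bucket/ms-coco/preprocess/train_captions.npy',
                         'gs://my-bucket/ms-coco/preprocess/preprocessed_imgs.npy'))

# What a consuming component does with the incoming string argument:
train_cap_path, preprocessed_imgs_path = make_tuple(preprocess_output)
print(train_cap_path)          # gs://my-bucket/ms-coco/preprocess/train_captions.npy
print(preprocessed_imgs_path)  # gs://my-bucket/ms-coco/preprocess/preprocessed_imgs.npy
```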

Tokenizing component

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "@dsl.python_component(\n", + " name='tokenize_captions',\n", + " description='Tokenize captions to create training data',\n", + " base_image=BASE_IMAGE\n", + ")\n", + "def tokenize_captions(dataset_path: str, preprocess_output: str, OUTPUT_DIR: str,\n", + " top_k: int) -> str:\n", + " import pickle\n", + " import tensorflow as tf\n", + " import numpy as np\n", + " from tensorflow.python.lib.io import file_io\n", + " from io import BytesIO\n", + " from ast import literal_eval as make_tuple\n", + " \n", + " \n", + " # Convert output from string to tuple and unpack\n", + " preprocess_output = make_tuple(preprocess_output)\n", + " train_caption_path = preprocess_output[0]\n", + " \n", + " if OUTPUT_DIR == 'default':\n", + " OUTPUT_DIR = dataset_path + '/tokenize/'\n", + " \n", + " tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=top_k,\n", + " oov_token=\"\",\n", + " filters='!\"#$%&()*+.,-/:;=?@[\\]^_`{|}~ ')\n", + " f = BytesIO(file_io.read_file_to_string(train_caption_path, \n", + " binary_mode=True))\n", + " train_captions = np.load(f)\n", + " \n", + " # Tokenize captions\n", + " tokenizer.fit_on_texts(train_captions)\n", + " train_seqs = tokenizer.texts_to_sequences(train_captions)\n", + " tokenizer.word_index[''] = 0\n", + " tokenizer.index_word[0] = ''\n", + " \n", + " cap_vector = tf.keras.preprocessing.sequence.pad_sequences(train_seqs, padding='post')\n", + " \n", + " # Find the maximum length of any caption in our dataset\n", + " def calc_max_length(tensor):\n", + " return max(len(t) for t in tensor)\n", + " \n", + " max_length = calc_max_length(train_seqs)\n", + " \n", + " # Save tokenizer\n", + " tokenizer_file_path = OUTPUT_DIR + 'tokenizer.pickle'\n", + " with file_io.FileIO(tokenizer_file_path, 'wb') as output:\n", + " pickle.dump(tokenizer, output, protocol=pickle.HIGHEST_PROTOCOL)\n", + " \n", + " # Save train_seqs\n", + " cap_vector_file_path = OUTPUT_DIR + 'cap_vector.npy'\n", + " np.save(file_io.FileIO(cap_vector_file_path, 'w'), cap_vector)\n", + " \n", + " \n", + " return str(max_length), tokenizer_file_path, cap_vector_file_path" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "TARGET_IMAGE = 'gcr.io/%s/tokenizer:latest' % PROJECT_NAME\n", + "tokenize_captions_op = compiler.build_python_component(\n", + " component_func=tokenize_captions,\n", + " staging_gcs_path=PIPELINE_STORAGE_PATH,\n", + " base_image=BASE_IMAGE,\n", + " target_image=TARGET_IMAGE)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Uncomment the following lines if you want to test the python function from the notebook\n", + "# num_examples = 100\n", + "# tokenizing_output_dir = 'default' # Can change if you want a specific directory\n", + "# vocab_size = 1000\n", + "# tokenizing_output = tokenize_captions(GCS_DATASET_PATH, str(preprocess_output), tokenizing_output_dir, vocab_size)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "

Component for training model (and saving it)

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "@dsl.python_component(\n", + " name='model_training',\n", + " description='Trains image captioning model',\n", + " base_image=BASE_IMAGE\n", + ")\n", + "def train_model(dataset_path: str, preprocess_output: str, \n", + " tokenizing_output: str, train_output_dir: str, valid_output_dir: str, \n", + " batch_size: int, embedding_dim: int, units: int, EPOCHS: int)-> str:\n", + " import time\n", + " import pickle\n", + " import numpy as np\n", + " import tensorflow as tf\n", + " from io import BytesIO\n", + " from sklearn.model_selection import train_test_split\n", + " from tensorflow.python.lib.io import file_io\n", + " from ast import literal_eval as make_tuple\n", + " \n", + " # Convert output from string to tuple and unpack\n", + " preprocess_output = make_tuple(preprocess_output)\n", + " tokenizing_output = make_tuple(tokenizing_output)\n", + " \n", + " # Unpack tuples\n", + " preprocessed_imgs_path = preprocess_output[1]\n", + " \n", + " tokenizer_path = tokenizing_output[1]\n", + " cap_vector_file_path = tokenizing_output[2]\n", + " \n", + " if valid_output_dir == 'default':\n", + " valid_output_dir = dataset_path + '/valid/'\n", + " \n", + " if train_output_dir == 'default':\n", + " train_output_dir = dataset_path + '/train/checkpoints/'\n", + " \n", + " # load img_name_vector\n", + " f = BytesIO(file_io.read_file_to_string(preprocessed_imgs_path, binary_mode=True))\n", + " img_name_vector = np.load(f)\n", + " \n", + " # Load cap_vector\n", + " f = BytesIO(file_io.read_file_to_string(cap_vector_file_path, binary_mode=True))\n", + " cap_vector = np.load(f)\n", + " \n", + " # Load tokenizer\n", + " with file_io.FileIO(tokenizer_path, 'rb') as src:\n", + " tokenizer = pickle.load(src)\n", + " \n", + " # Split data into training and testing\n", + " img_name_train, img_name_val, cap_train, cap_val = train_test_split(\n", + " img_name_vector,\n", + " cap_vector,\n", + " test_size=0.2,\n", + " random_state=0)\n", + " \n", + " # Create tf.data dataset for training\n", + " BUFFER_SIZE = 1000 # Common size used for shuffling dataset\n", + " vocab_size = len(tokenizer.word_index) + 1\n", + " num_steps = len(img_name_train) // batch_size\n", + " \n", + " # Shape of the vector extracted from InceptionV3 is (64, 2048)\n", + " features_shape = 2048\n", + " \n", + " # Load the numpy files\n", + " def map_func(img_name, cap):\n", + " f = BytesIO(file_io.read_file_to_string(img_name.decode('utf-8'), binary_mode=True))\n", + " img_tensor = np.load(f)\n", + " return img_tensor, cap\n", + " \n", + " dataset = tf.data.Dataset.from_tensor_slices((img_name_train, cap_train))\n", + "\n", + " # Use map to load the numpy files in parallel\n", + " dataset = dataset.map(lambda item1, item2: tf.numpy_function(\n", + " map_func, [item1, item2], [tf.float32, tf.int32]),\n", + " num_parallel_calls=tf.data.experimental.AUTOTUNE)\n", + "\n", + " # Shuffle and batch\n", + " dataset = dataset.shuffle(BUFFER_SIZE).batch(batch_size)\n", + " dataset = dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)\n", + " \n", + " # Create models\n", + " # Attention model\n", + " class BahdanauAttention(tf.keras.Model):\n", + " def __init__(self, units):\n", + " super(BahdanauAttention, self).__init__()\n", + " self.W1 = tf.keras.layers.Dense(units)\n", + " self.W2 = tf.keras.layers.Dense(units)\n", + " self.V = tf.keras.layers.Dense(1)\n", + " \n", + " def call(self, features, hidden):\n", + " # 
features(CNN_encoder output) shape == (batch_size, 64, embedding_dim)\n", + "\n", + " # hidden shape == (batch_size, hidden_size)\n", + " # hidden_with_time_axis shape == (batch_size, 1, hidden_size)\n", + " hidden_with_time_axis = tf.expand_dims(hidden, 1)\n", + "\n", + " # score shape == (batch_size, 64, hidden_size)\n", + " score = tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time_axis))\n", + "\n", + " # attention_weights shape == (batch_size, 64, 1)\n", + " # you get 1 at the last axis because you are applying score to self.V\n", + " attention_weights = tf.nn.softmax(self.V(score), axis=1)\n", + "\n", + " # context_vector shape after sum == (batch_size, hidden_size)\n", + " context_vector = attention_weights * features\n", + " context_vector = tf.reduce_sum(context_vector, axis=1)\n", + "\n", + " return context_vector, attention_weights\n", + " \n", + " # CNN Encoder model\n", + " class CNN_Encoder(tf.keras.Model):\n", + " # Since you have already extracted the features and dumped it using pickle\n", + " # This encoder passes those features through a Fully connected layer\n", + " def __init__(self, embedding_dim):\n", + " super(CNN_Encoder, self).__init__()\n", + " # shape after fc == (batch_size, 64, embedding_dim)\n", + " self.fc = tf.keras.layers.Dense(embedding_dim)\n", + "\n", + " def call(self, x):\n", + " x = self.fc(x)\n", + " x = tf.nn.relu(x)\n", + " return x\n", + " \n", + " # RNN Decoder model\n", + " class RNN_Decoder(tf.keras.Model):\n", + " def __init__(self, embedding_dim, units, vocab_size):\n", + " super(RNN_Decoder, self).__init__()\n", + " self.units = units\n", + "\n", + " self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)\n", + " self.gru = tf.keras.layers.GRU(self.units,\n", + " return_sequences=True,\n", + " return_state=True,\n", + " recurrent_initializer='glorot_uniform')\n", + " self.fc1 = tf.keras.layers.Dense(self.units)\n", + " self.fc2 = tf.keras.layers.Dense(vocab_size)\n", + "\n", + " self.attention = BahdanauAttention(self.units)\n", + "\n", + " def call(self, x, features, hidden):\n", + " # defining attention as a separate model\n", + " context_vector, attention_weights = self.attention(features, hidden)\n", + "\n", + " # x shape after passing through embedding == (batch_size, 1, embedding_dim)\n", + " x = self.embedding(x)\n", + "\n", + " # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)\n", + " x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)\n", + "\n", + " # passing the concatenated vector to the GRU\n", + " output, state = self.gru(x)\n", + "\n", + " # shape == (batch_size, max_length, hidden_size)\n", + " x = self.fc1(output)\n", + "\n", + " # x shape == (batch_size * max_length, hidden_size)\n", + " x = tf.reshape(x, (-1, x.shape[2]))\n", + "\n", + " # output shape == (batch_size * max_length, vocab)\n", + " x = self.fc2(x)\n", + "\n", + " return x, state, attention_weights\n", + "\n", + " def reset_state(self, batch_size):\n", + " return tf.zeros((batch_size, self.units))\n", + " \n", + " encoder = CNN_Encoder(embedding_dim)\n", + " decoder = RNN_Decoder(embedding_dim, units, vocab_size)\n", + " \n", + " optimizer = tf.keras.optimizers.Adam()\n", + " loss_object = tf.keras.losses.SparseCategoricalCrossentropy(\n", + " from_logits=True, reduction='none')\n", + " \n", + " # Create loss function\n", + " def loss_function(real, pred):\n", + " mask = tf.math.logical_not(tf.math.equal(real, 0))\n", + " loss_ = loss_object(real, pred)\n", + "\n", + " mask = tf.cast(mask, 
dtype=loss_.dtype)\n", + " loss_ *= mask\n", + "\n", + " return tf.reduce_mean(loss_)\n", + " \n", + " # Create check point for training model\n", + " ckpt = tf.train.Checkpoint(encoder=encoder,\n", + " decoder=decoder,\n", + " optimizer = optimizer)\n", + " ckpt_manager = tf.train.CheckpointManager(ckpt, train_output_dir, max_to_keep=5)\n", + " start_epoch = 0\n", + " if ckpt_manager.latest_checkpoint:\n", + " start_epoch = int(ckpt_manager.latest_checkpoint.split('-')[-1])\n", + " \n", + " # Create training step\n", + " loss_plot = []\n", + " @tf.function\n", + " def train_step(img_tensor, target):\n", + " loss = 0\n", + "\n", + " # initializing the hidden state for each batch\n", + " # because the captions are not related from image to image\n", + " hidden = decoder.reset_state(batch_size=target.shape[0])\n", + "\n", + " dec_input = tf.expand_dims([tokenizer.word_index['']] * batch_size, 1)\n", + "\n", + " with tf.GradientTape() as tape:\n", + " features = encoder(img_tensor)\n", + "\n", + " for i in range(1, target.shape[1]):\n", + " # passing the features through the decoder\n", + " predictions, hidden, _ = decoder(dec_input, features, hidden)\n", + "\n", + " loss += loss_function(target[:, i], predictions)\n", + "\n", + " # using teacher forcing\n", + " dec_input = tf.expand_dims(target[:, i], 1)\n", + "\n", + " total_loss = (loss / int(target.shape[1]))\n", + "\n", + " trainable_variables = encoder.trainable_variables + decoder.trainable_variables\n", + "\n", + " gradients = tape.gradient(loss, trainable_variables)\n", + "\n", + " optimizer.apply_gradients(zip(gradients, trainable_variables))\n", + "\n", + " return loss, total_loss\n", + " \n", + " # Create summary writers for plotting loss in tensorboard\n", + " train_summary_writer = tf.summary.create_file_writer('./logs/train/')\n", + " \n", + " # Train model\n", + " path_to_most_recent_ckpt = None\n", + " for epoch in range(start_epoch, EPOCHS):\n", + " start = time.time()\n", + " total_loss = 0\n", + "\n", + " for (batch, (img_tensor, target)) in enumerate(dataset):\n", + " batch_loss, t_loss = train_step(img_tensor, target)\n", + " total_loss += t_loss\n", + "\n", + " if batch % 100 == 0:\n", + " print ('Epoch {} Batch {} Loss {:.4f}'.format(\n", + " epoch + 1, batch, batch_loss.numpy() / int(target.shape[1])))\n", + " \n", + " # storing the epoch end loss value to plot in tensorboard\n", + " with train_summary_writer.as_default():\n", + " tf.summary.scalar('loss', total_loss / num_steps, step=epoch)\n", + "\n", + " if epoch % 5 == 0:\n", + " path_to_most_recent_ckpt = ckpt_manager.save()\n", + "\n", + " print ('Epoch {} Loss {:.6f}'.format(epoch + 1,\n", + " total_loss/num_steps))\n", + " print ('Time taken for 1 epoch {} sec\\n'.format(time.time() - start))\n", + " \n", + " # Could add plot of loss (loss_plot)?\n", + " # Adding tensorboard plot of loss function here (available in UI)\n", + " \n", + " # Save validation data to use for predictions\n", + " val_cap_path = valid_output_dir + 'captions.npy'\n", + " np.save(file_io.FileIO(val_cap_path, 'w'), cap_val)\n", + " \n", + " val_img_path = valid_output_dir + 'images.npy'\n", + " np.save(file_io.FileIO(val_img_path, 'w'), img_name_val)\n", + " \n", + " return path_to_most_recent_ckpt, val_cap_path, val_img_path" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "TARGET_IMAGE = 'gcr.io/%s/trainer:latest' % PROJECT_NAME\n", + "model_train_op = compiler.build_python_component(\n", + " 
component_func=train_model,\n", + " staging_gcs_path=PIPELINE_STORAGE_PATH,\n", + " base_image=BASE_IMAGE,\n", + " dependency=[kfp.compiler.VersionedDependency(name='scikit-learn', version='0.21.2')],\n", + " target_image=TARGET_IMAGE)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "# Uncomment the following lines if you want to test the python function from the notebook\n", + "# num_examples = 100\n", + "# train_output_dir = 'default' # Can change if you want a specific directory\n", + "# valid_output_dir = 'default' # Can change if you want a specific directory\n", + "# batch_size = 16\n", + "# embedding_dim = 256\n", + "# hidden_state_size = 512\n", + "# epochs = 20\n", + "\n", + "# train_output = train_model(GCS_DATASET_PATH, str(preprocess_output), str(tokenizing_output), \n", + "# train_output_dir, valid_output_dir, batch_size, embedding_dim,\n", + "# hidden_state_size, epochs)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
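The masked loss in the training component is the subtle part: every caption is zero-padded to a common length, so positions holding the padding id must not contribute to the gradient. A tiny numeric sketch of that masking, independent of the pipeline:

```python
import tensorflow as tf

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

real = tf.constant([5, 0])            # one real word id and one padding id
pred = tf.random.normal((2, 10))      # fake logits over a 10-word vocabulary

per_position = loss_object(real, pred)   # shape (2,): one loss value per position
mask = tf.cast(tf.math.logical_not(tf.math.equal(real, 0)), per_position.dtype)
masked = per_position * mask             # the padding position is zeroed out

print(per_position.numpy(), masked.numpy())
print(tf.reduce_mean(masked).numpy())    # the scalar used for gradients, as above
```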

Component for model prediction

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "@dsl.python_component(\n", + " name='model_predictions',\n", + " description='Predicts on images in validation set',\n", + " base_image=BASE_IMAGE\n", + ")\n", + "def predict(dataset_path: str, tokenizing_output: str, model_train_output: str, \n", + " preprocess_output_dir: str, embedding_dim: int, units: int):\n", + " import pickle\n", + " import matplotlib.pyplot as plt\n", + " import numpy as np\n", + " import tensorflow as tf\n", + " from math import ceil \n", + " from io import BytesIO\n", + " from tensorflow.python.lib.io import file_io\n", + " from ast import literal_eval as make_tuple\n", + " \n", + " tokenizing_output = make_tuple(tokenizing_output)\n", + " model_train_output = make_tuple(model_train_output)\n", + " \n", + " # Unpack tuples\n", + " max_length = int(tokenizing_output[0])\n", + " tokenizer_path = tokenizing_output[1]\n", + " model_path = model_train_output[0]\n", + " val_cap_path = model_train_output[1]\n", + " val_img_path = model_train_output[2]\n", + " \n", + " if preprocess_output_dir == 'default':\n", + " preprocess_output_dir = dataset_path + '/preprocess/'\n", + " \n", + " # Load tokenizer, model, test_captions, and test_imgs\n", + " \"\"\" CHANGE: don't reuse code here: not sure how though..? \"\"\"\n", + " # Attention model\n", + " class BahdanauAttention(tf.keras.Model):\n", + " def __init__(self, units):\n", + " super(BahdanauAttention, self).__init__()\n", + " self.W1 = tf.keras.layers.Dense(units)\n", + " self.W2 = tf.keras.layers.Dense(units)\n", + " self.V = tf.keras.layers.Dense(1)\n", + " \n", + " def call(self, features, hidden):\n", + " # features(CNN_encoder output) shape == (batch_size, 64, embedding_dim)\n", + "\n", + " # hidden shape == (batch_size, hidden_size)\n", + " # hidden_with_time_axis shape == (batch_size, 1, hidden_size)\n", + " hidden_with_time_axis = tf.expand_dims(hidden, 1)\n", + "\n", + " # score shape == (batch_size, 64, hidden_size)\n", + " score = tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time_axis))\n", + "\n", + " # attention_weights shape == (batch_size, 64, 1)\n", + " # you get 1 at the last axis because you are applying score to self.V\n", + " attention_weights = tf.nn.softmax(self.V(score), axis=1)\n", + "\n", + " # context_vector shape after sum == (batch_size, hidden_size)\n", + " context_vector = attention_weights * features\n", + " context_vector = tf.reduce_sum(context_vector, axis=1)\n", + "\n", + " return context_vector, attention_weights\n", + " \n", + " # CNN Encoder model\n", + " class CNN_Encoder(tf.keras.Model):\n", + " # Since you have already extracted the features and dumped it using pickle\n", + " # This encoder passes those features through a Fully connected layer\n", + " def __init__(self, embedding_dim):\n", + " super(CNN_Encoder, self).__init__()\n", + " # shape after fc == (batch_size, 64, embedding_dim)\n", + " self.fc = tf.keras.layers.Dense(embedding_dim)\n", + "\n", + " def call(self, x):\n", + " x = self.fc(x)\n", + " x = tf.nn.relu(x)\n", + " return x\n", + " \n", + " # RNN Decoder model\n", + " class RNN_Decoder(tf.keras.Model):\n", + " def __init__(self, embedding_dim, units, vocab_size):\n", + " super(RNN_Decoder, self).__init__()\n", + " self.units = units\n", + "\n", + " self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)\n", + " self.gru = tf.keras.layers.GRU(self.units,\n", + " return_sequences=True,\n", + " return_state=True,\n", + " 
recurrent_initializer='glorot_uniform')\n", + " self.fc1 = tf.keras.layers.Dense(self.units)\n", + " self.fc2 = tf.keras.layers.Dense(vocab_size)\n", + "\n", + " self.attention = BahdanauAttention(self.units)\n", + "\n", + " def call(self, x, features, hidden):\n", + " # defining attention as a separate model\n", + " context_vector, attention_weights = self.attention(features, hidden)\n", + "\n", + " # x shape after passing through embedding == (batch_size, 1, embedding_dim)\n", + " x = self.embedding(x)\n", + "\n", + " # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)\n", + " x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)\n", + "\n", + " # passing the concatenated vector to the GRU\n", + " output, state = self.gru(x)\n", + "\n", + " # shape == (batch_size, max_length, hidden_size)\n", + " x = self.fc1(output)\n", + "\n", + " # x shape == (batch_size * max_length, hidden_size)\n", + " x = tf.reshape(x, (-1, x.shape[2]))\n", + "\n", + " # output shape == (batch_size * max_length, vocab)\n", + " x = self.fc2(x)\n", + "\n", + " return x, state, attention_weights\n", + "\n", + " def reset_state(self, batch_size):\n", + " return tf.zeros((batch_size, self.units))\n", + " \n", + " # Load tokenizer\n", + " with file_io.FileIO(tokenizer_path, 'rb') as src:\n", + " tokenizer = pickle.load(src)\n", + " \n", + " vocab_size = len(tokenizer.word_index) + 1\n", + " \n", + " # Shape of the vector extracted from InceptionV3 is (64, 2048)\n", + " attention_features_shape = 64\n", + " features_shape = 2048\n", + " \n", + " encoder = CNN_Encoder(embedding_dim)\n", + " decoder = RNN_Decoder(embedding_dim, units, vocab_size)\n", + " \n", + " # load model from checkpoint (encoder, decoder)\n", + " optimizer = tf.keras.optimizers.Adam()\n", + " ckpt = tf.train.Checkpoint(encoder=encoder,\n", + " decoder=decoder, optimizer=optimizer)\n", + " ckpt.restore(model_path).expect_partial()\n", + " \n", + " # load test captions\n", + " f = BytesIO(file_io.read_file_to_string(val_cap_path, \n", + " binary_mode=True))\n", + " cap_val = np.load(f)\n", + " \n", + " # load test images\n", + " f = BytesIO(file_io.read_file_to_string(val_img_path, \n", + " binary_mode=True))\n", + " img_name_val = np.load(f)\n", + " \n", + " # To get original image locations, replace .npy extension with .jpg and \n", + " # replace preprocessed path with path original images\n", + " PATH = dataset_path + '/train2014/train2014/'\n", + " img_name_val = [img.replace('.npy', '.jpg') for img in img_name_val]\n", + " img_name_val = [img.replace(preprocess_output_dir, PATH) for img in img_name_val]\n", + " \n", + " image_model = tf.keras.applications.InceptionV3(include_top=False,\n", + " weights='imagenet')\n", + " new_input = image_model.input\n", + " hidden_layer = image_model.layers[-1].output\n", + "\n", + " image_features_extract_model = tf.keras.Model(new_input, hidden_layer)\n", + " \n", + " # Preprocess the images using InceptionV3\n", + " def load_image(image_path):\n", + " img = tf.io.read_file(image_path)\n", + " img = tf.image.decode_jpeg(img, channels=3)\n", + " img = tf.image.resize(img, (299, 299))\n", + " img = tf.keras.applications.inception_v3.preprocess_input(img)\n", + " return img, image_path\n", + " \n", + " # Run predictions\n", + " def evaluate(image):\n", + " attention_plot = np.zeros((max_length, attention_features_shape))\n", + "\n", + " hidden = decoder.reset_state(batch_size=1)\n", + "\n", + " temp_input = tf.expand_dims(load_image(image)[0], 0)\n", + " img_tensor_val = 
image_features_extract_model(temp_input)\n", + " img_tensor_val = tf.reshape(img_tensor_val, (img_tensor_val.shape[0], -1, img_tensor_val.shape[3]))\n", + "\n", + " features = encoder(img_tensor_val)\n", + "\n", + " dec_input = tf.expand_dims([tokenizer.word_index['']], 0)\n", + " result = []\n", + "\n", + " for i in range(max_length):\n", + " predictions, hidden, attention_weights = decoder(dec_input, features, hidden)\n", + "\n", + " attention_plot[i] = tf.reshape(attention_weights, (-1, )).numpy()\n", + "\n", + " predicted_id = tf.argmax(predictions[0]).numpy()\n", + " result.append(tokenizer.index_word[predicted_id])\n", + "\n", + " if tokenizer.index_word[predicted_id] == '':\n", + " return result, attention_plot\n", + "\n", + " dec_input = tf.expand_dims([predicted_id], 0)\n", + "\n", + " attention_plot = attention_plot[:len(result), :]\n", + " return result, attention_plot\n", + " \n", + " def plot_attention(image, result, attention_plot):\n", + " img = tf.io.read_file(image)\n", + " img = tf.image.decode_jpeg(img, channels=3)\n", + " \n", + " temp_image = np.array(img.numpy())\n", + "\n", + " fig = plt.figure(figsize=(10, 10))\n", + "\n", + " len_result = len(result)\n", + " for l in range(len_result):\n", + " temp_att = np.resize(attention_plot[l], (8, 8))\n", + " ax = fig.add_subplot(len_result//2, len_result//2, l+1)\n", + " ax.set_title(result[l])\n", + " img = ax.imshow(temp_image)\n", + " ax.imshow(temp_att, cmap='gray', alpha=0.6, extent=img.get_extent())\n", + "\n", + " plt.tight_layout()\n", + " plt.show()\n", + " \n", + " # captions on the validation set\n", + " rid = np.random.randint(0, len(img_name_val))\n", + " image = img_name_val[rid]\n", + " real_caption = ' '.join([tokenizer.index_word[i] for i in cap_val[rid] if i not in [0]])\n", + " result, attention_plot = evaluate(image)\n", + "\n", + " print ('Real Caption:', real_caption)\n", + " print ('Prediction Caption:', ' '.join(result))\n", + " plot_attention(image, result, attention_plot)\n", + " \n", + " # Is there a way to plot imgs in kubeflow ui?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "TARGET_IMAGE = 'gcr.io/%s/predict:latest' % PROJECT_NAME\n", + "predict_op = compiler.build_python_component(\n", + " component_func=predict,\n", + " staging_gcs_path=PIPELINE_STORAGE_PATH,\n", + " base_image=BASE_IMAGE,\n", + " dependency=[kfp.compiler.VersionedDependency(name='matplotlib', version='3.1.0')],\n", + " target_image=TARGET_IMAGE)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Uncomment the following lines if you want to test the python function from the notebook\n", + "# %matplotlib inline\n", + "# predict(GCS_DATASET_PATH, str(tokenizing_output), \n", + "# str(train_output), preprocess_output_dir, embedding_dim, hidden_state_size )" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
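Because the trained weights are persisted only as a `tf.train.Checkpoint` (not a SavedModel), the prediction component has to rebuild the encoder/decoder classes and restore into them. A stripped-down sketch of that pattern, assuming the `CNN_Encoder` and `RNN_Decoder` classes defined above are in scope and `ckpt_path` points at a checkpoint written by the training component (the values below are placeholders):

```python
import tensorflow as tf

ckpt_path = 'gs://my-bucket/ms-coco/train/checkpoints/ckpt-1'  # placeholder
embedding_dim, units, vocab_size = 256, 512, 5001              # placeholders

encoder = CNN_Encoder(embedding_dim)
decoder = RNN_Decoder(embedding_dim, units, vocab_size)
optimizer = tf.keras.optimizers.Adam()

ckpt = tf.train.Checkpoint(encoder=encoder, decoder=decoder, optimizer=optimizer)
# expect_partial() silences warnings about checkpointed objects that are never
# used at inference time; variable values are matched lazily, once the layers
# create their variables on the first call to encoder/decoder.
ckpt.restore(ckpt_path).expect_partial()
```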

Create and run pipeline

" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "

Create pipeline

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "@dsl.pipeline(\n", + " name='Image Captioning Pipeline',\n", + " description='A pipeline that trains a model to caption images'\n", + ")\n", + "def caption_pipeline(\n", + "dataset_path=GCS_DATASET_PATH,\n", + "num_training_examples=30000,\n", + "epochs=20,\n", + "training_batch_size=64,\n", + "hidden_state_size=512,\n", + "vocab_size=5000,\n", + "embedding_dim=256,\n", + "preprocessing_batch_size=16,\n", + "preprocessing_output_dir='default',\n", + "tokenizing_output_dir='default',\n", + "training_output_dir='default',\n", + "validation_output_dir='default',\n", + "): # have num_examples small for testing \n", + " \n", + " preprocessing_img_task = preprocessing_img_op(\n", + " dataset_path, \n", + " output_dir=preprocessing_output_dir,\n", + " batch_size=preprocessing_batch_size, \n", + " num_examples=num_training_examples).apply(\n", + " use_gcp_secret('user-gcp-sa'))\n", + " \n", + " tokenize_captions_task = tokenize_captions_op(\n", + " dataset_path, \n", + " preprocessing_img_task.output, \n", + " output_dir=tokenizing_output_dir, \n", + " top_k=vocab_size).apply(use_gcp_secret('user-gcp-sa'))\n", + " \n", + " model_train_task = model_train_op(\n", + " dataset_path, \n", + " preprocessing_img_task.output,\n", + " tokenize_captions_task.output,\n", + " train_output_dir=training_output_dir, \n", + " valid_output_dir=validation_output_dir,\n", + " batch_size=training_batch_size, \n", + " embedding_dim=embedding_dim, \n", + " units=hidden_state_size, epochs=epochs).apply(\n", + " use_gcp_secret('user-gcp-sa'))\n", + " \n", + " predict_task = predict_op(\n", + " dataset_path,\n", + " tokenize_captions_task.output, \n", + " model_train_task.output,\n", + " preprocess_output_dir=preprocessing_output_dir,\n", + " embedding_dim=embedding_dim,\n", + " units=hidden_state_size).apply(\n", + " use_gcp_secret('user-gcp-sa'))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pipeline_func = caption_pipeline" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pipeline_filename = pipeline_func.__name__ + '.pipeline.zip'\n", + "compiler.Compiler().compile(pipeline_func, pipeline_filename)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "client = kfp.Client()\n", + "experiment = client.create_experiment(EXPERIMENT_NAME)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "

Run pipeline

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Test run to make sure all parts of the pipeline are working properly\n", + "arguments = {\n", + " 'dataset_path': GCS_DATASET_PATH, \n", + " 'num_training_examples': 100, # Small test to make sure pipeline functions properly\n", + " 'training_batch_size': 16, # has to be smaller since only training on 80 examples \n", + "}\n", + "run_name = pipeline_func.__name__ + ' run'\n", + "run_result = client.run_pipeline(experiment.id, run_name, pipeline_filename,\n", + " params=arguments)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Model checkpoints are saved at train_output_dir, which is `GCS_DATASET_PATH/train/checkpoints/` by default." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.7" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} From 26da0d197ea7bffcab4bb9bce9b4d68c4279dcce Mon Sep 17 00:00:00 2001 From: Zane Durante Date: Wed, 3 Jul 2019 14:18:08 -0700 Subject: [PATCH 02/11] removed project id and storage bucket name --- samples/notebooks/Image Captioning TF 2.0.ipynb | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/samples/notebooks/Image Captioning TF 2.0.ipynb b/samples/notebooks/Image Captioning TF 2.0.ipynb index cf009482ce9..ef8630d874d 100644 --- a/samples/notebooks/Image Captioning TF 2.0.ipynb +++ b/samples/notebooks/Image Captioning TF 2.0.ipynb @@ -44,7 +44,7 @@ "outputs": [], "source": [ "# Previously downloaded dataset and put onto GCS\n", - "GCS_DATASET_PATH = 'gs://artifacts.intro-to-kubeflow-1.appspot.com/ms-coco'" + "GCS_DATASET_PATH = 'gs://[YOUR-BUCKET-NAME]/ms-coco'" ] }, { @@ -55,8 +55,8 @@ "source": [ "# Kubeflow project settings\n", "EXPERIMENT_NAME = 'Image Captioning'\n", - "PROJECT_NAME = 'intro-to-kubeflow-1' \n", - "PIPELINE_STORAGE_PATH = 'gs://artifacts.intro-to-kubeflow-1.appspot.com/ms-coco/components' # path to save pipeline component images\n", + "PROJECT_NAME = '[YOUR-PROJECT-ID]' \n", + "PIPELINE_STORAGE_PATH = 'gs://[YOUR-BUCKET-NAME]/ms-coco/components' # path to save pipeline component images\n", "BASE_IMAGE = 'tensorflow/tensorflow:2.0.0b0-py3' # using tensorflow 2.0.0" ] }, From 309a4b772373a56abf3f03d7048cdb87b3e4f030 Mon Sep 17 00:00:00 2001 From: Zane Durante Date: Wed, 10 Jul 2019 12:55:33 -0700 Subject: [PATCH 03/11] Updated notebook --- .../notebooks/Image Captioning TF 2.0.ipynb | 411 +++++++++++------- 1 file changed, 250 insertions(+), 161 deletions(-) diff --git a/samples/notebooks/Image Captioning TF 2.0.ipynb b/samples/notebooks/Image Captioning TF 2.0.ipynb index ef8630d874d..c23a4b577cc 100644 --- a/samples/notebooks/Image Captioning TF 2.0.ipynb +++ b/samples/notebooks/Image Captioning TF 2.0.ipynb @@ -4,16 +4,16 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "

Image Captioning Using Tensorflow 2.0

" + "

Image Captioning with Attention in TensorFlow 2.0

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "This notebook modifies an example tensorflow 2.0 notebook from\n", + "This notebook modifies an example Tensorflow 2.0 notebook from\n", "[here](https://colab.sandbox.google.com/github/tensorflow/docs/blob/master/site/en/r2/tutorials/text/image_captioning.ipynb)\n", - "to work with kubeflow pipelines" + "to work with kubeflow pipelines. " ] }, { @@ -27,7 +27,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "First, we have to download the [MS COCO dataset](http://cocodataset.org/#download). This sample uses both the 2014 train images and 2014 train/val annotations. If you downloaded and extracted the dataset on your local system, you can upload it to GCS using `gsutil -m cp -r path/to/dataset/ gs://[YOUR-BUCKET-ID]/ms-coco`." + "First, we have to download the [MS COCO dataset](http://cocodataset.org/#download). This sample uses both the 2014 train images and 2014 train/val annotations. If you downloaded and extracted the dataset on your local system, you can upload it to GCS using `gsutil -m cp -r path/to/dataset/ gs://[YOUR-BUCKET-NAME]/ms-coco`." ] }, { @@ -39,7 +39,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 58, "metadata": {}, "outputs": [], "source": [ @@ -49,64 +49,37 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 59, "metadata": {}, "outputs": [], "source": [ "# Kubeflow project settings\n", "EXPERIMENT_NAME = 'Image Captioning'\n", - "PROJECT_NAME = '[YOUR-PROJECT-ID]' \n", + "PROJECT_NAME = '[YOUR-PROJECT-NAME]' \n", "PIPELINE_STORAGE_PATH = 'gs://[YOUR-BUCKET-NAME]/ms-coco/components' # path to save pipeline component images\n", "BASE_IMAGE = 'tensorflow/tensorflow:2.0.0b0-py3' # using tensorflow 2.0.0" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 136, "metadata": {}, "outputs": [], "source": [ - "import kfp\n", - "import kfp.dsl as dsl\n", - "from kfp import compiler\n", - "from kfp.gcp import use_gcp_secret" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "

(Optional) Execute components in notebook

" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Uncomment the following lines to install the required libraries if you want to run the components (python functions) within the notebook." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": true - }, - "outputs": [], - "source": [ - "# !pip install sklearn\n", - "# !pip install tensorflow==2.0.0-beta1\n", - "# !pip install matplotlib" + "# Used to save tensorboard files to a different directory each run\n", + "RUN_NUMBER = 0 " ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 60, "metadata": {}, "outputs": [], "source": [ - "# import tensorflow as tf\n", - "# print(tf.executing_eagerly()) # Should print True if tf 2.0 is installed " + "import kfp\n", + "import kfp.dsl as dsl\n", + "from kfp import compiler\n", + "from kfp.gcp import use_gcp_secret" ] }, { @@ -125,7 +98,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 63, "metadata": {}, "outputs": [], "source": [ @@ -148,7 +121,7 @@ " annotation_file = dataset_path + '/annotations_trainval2014/annotations/captions_train2014.json'\n", " PATH = dataset_path + '/train2014/train2014/'\n", " \n", - " # Read the json file (CHANGED FROM open() TO file_io.FileIO)\n", + " # Read the json file (CHANGE open() TO file_io.FileIO to use GCS)\n", " with file_io.FileIO(annotation_file, 'r') as f:\n", " annotations = json.load(f)\n", "\n", @@ -203,6 +176,7 @@ "\n", " for bf, p in zip(batch_features, path):\n", " path_of_feature = p.numpy().decode(\"utf-8\")\n", + " \n", " # Save to a different location and as numpy array\n", " path_of_feature = path_of_feature.replace('.jpg', '.npy')\n", " path_of_feature = path_of_feature.replace(PATH, OUTPUT_DIR)\n", @@ -227,11 +201,38 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 64, "metadata": { "scrolled": true }, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "2019-07-10 17:32:56:INFO:Build an image that is based on tensorflow/tensorflow:2.0.0b0-py3 and push the image to gcr.io/intro-to-kubeflow-1/preprocessing:latest\n", + "2019-07-10 17:32:56:INFO:Checking path: gs://artifacts.intro-to-kubeflow-1.appspot.com/ms-coco/components...\n", + "2019-07-10 17:32:56:INFO:Generate entrypoint and serialization codes.\n", + "2019-07-10 17:32:56:INFO:Generate build files.\n", + "2019-07-10 17:32:56:INFO:Start a kaniko job for build.\n", + "2019-07-10 17:32:56:INFO:Cannot Find local kubernetes config. 
Trying in-cluster config.\n", + "2019-07-10 17:32:56:INFO:Initialized with in-cluster config.\n", + "2019-07-10 17:33:01:INFO:5 seconds: waiting for job to complete\n", + "2019-07-10 17:33:06:INFO:10 seconds: waiting for job to complete\n", + "2019-07-10 17:33:11:INFO:15 seconds: waiting for job to complete\n", + "2019-07-10 17:33:16:INFO:20 seconds: waiting for job to complete\n", + "2019-07-10 17:33:22:INFO:25 seconds: waiting for job to complete\n", + "2019-07-10 17:33:27:INFO:30 seconds: waiting for job to complete\n", + "2019-07-10 17:33:32:INFO:35 seconds: waiting for job to complete\n", + "2019-07-10 17:33:37:INFO:40 seconds: waiting for job to complete\n", + "2019-07-10 17:33:42:INFO:45 seconds: waiting for job to complete\n", + "2019-07-10 17:33:47:INFO:50 seconds: waiting for job to complete\n", + "2019-07-10 17:33:52:INFO:55 seconds: waiting for job to complete\n", + "2019-07-10 17:33:52:INFO:Kaniko job complete.\n", + "2019-07-10 17:33:52:INFO:Build component complete.\n" + ] + } + ], "source": [ "TARGET_IMAGE = 'gcr.io/%s/preprocessing:latest' % PROJECT_NAME\n", "preprocessing_img_op = compiler.build_python_component(\n", @@ -242,19 +243,6 @@ " target_image=TARGET_IMAGE)" ] }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Uncomment the following lines if you want to test the python function from the notebook\n", - "# num_examples = 100\n", - "# preprocess_output_dir = 'default' # Can change if you want a specific directory\n", - "# batch_size = 16\n", - "# preprocess_output = preprocess(GCS_DATASET_PATH, num_examples, preprocess_output_dir, batch_size)" - ] - }, { "cell_type": "markdown", "metadata": {}, @@ -264,7 +252,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 66, "metadata": {}, "outputs": [], "source": [ @@ -326,11 +314,37 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 67, "metadata": { "scrolled": true }, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "2019-07-10 17:33:52:INFO:Build an image that is based on tensorflow/tensorflow:2.0.0b0-py3 and push the image to gcr.io/intro-to-kubeflow-1/tokenizer:latest\n", + "2019-07-10 17:33:52:INFO:Checking path: gs://artifacts.intro-to-kubeflow-1.appspot.com/ms-coco/components...\n", + "2019-07-10 17:33:52:INFO:Generate entrypoint and serialization codes.\n", + "2019-07-10 17:33:52:INFO:Generate build files.\n", + "2019-07-10 17:33:52:INFO:Start a kaniko job for build.\n", + "2019-07-10 17:33:52:INFO:Cannot Find local kubernetes config. 
Trying in-cluster config.\n", + "2019-07-10 17:33:52:INFO:Initialized with in-cluster config.\n", + "2019-07-10 17:33:57:INFO:5 seconds: waiting for job to complete\n", + "2019-07-10 17:34:02:INFO:10 seconds: waiting for job to complete\n", + "2019-07-10 17:34:07:INFO:15 seconds: waiting for job to complete\n", + "2019-07-10 17:34:12:INFO:20 seconds: waiting for job to complete\n", + "2019-07-10 17:34:17:INFO:25 seconds: waiting for job to complete\n", + "2019-07-10 17:34:22:INFO:30 seconds: waiting for job to complete\n", + "2019-07-10 17:34:27:INFO:35 seconds: waiting for job to complete\n", + "2019-07-10 17:34:32:INFO:40 seconds: waiting for job to complete\n", + "2019-07-10 17:34:37:INFO:45 seconds: waiting for job to complete\n", + "2019-07-10 17:34:42:INFO:50 seconds: waiting for job to complete\n", + "2019-07-10 17:34:42:INFO:Kaniko job complete.\n", + "2019-07-10 17:34:43:INFO:Build component complete.\n" + ] + } + ], "source": [ "TARGET_IMAGE = 'gcr.io/%s/tokenizer:latest' % PROJECT_NAME\n", "tokenize_captions_op = compiler.build_python_component(\n", @@ -340,19 +354,6 @@ " target_image=TARGET_IMAGE)" ] }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Uncomment the following lines if you want to test the python function from the notebook\n", - "# num_examples = 100\n", - "# tokenizing_output_dir = 'default' # Can change if you want a specific directory\n", - "# vocab_size = 1000\n", - "# tokenizing_output = tokenize_captions(GCS_DATASET_PATH, str(preprocess_output), tokenizing_output_dir, vocab_size)" - ] - }, { "cell_type": "markdown", "metadata": {}, @@ -362,7 +363,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 137, "metadata": {}, "outputs": [], "source": [ @@ -373,7 +374,8 @@ ")\n", "def train_model(dataset_path: str, preprocess_output: str, \n", " tokenizing_output: str, train_output_dir: str, valid_output_dir: str, \n", - " batch_size: int, embedding_dim: int, units: int, EPOCHS: int)-> str:\n", + " batch_size: int, embedding_dim: int, units: int, EPOCHS: int, run_number : int)-> str:\n", + " import json\n", " import time\n", " import pickle\n", " import numpy as np\n", @@ -397,7 +399,7 @@ " valid_output_dir = dataset_path + '/valid/'\n", " \n", " if train_output_dir == 'default':\n", - " train_output_dir = dataset_path + '/train/checkpoints/'\n", + " train_output_dir = dataset_path + '/train/'\n", " \n", " # load img_name_vector\n", " f = BytesIO(file_io.read_file_to_string(preprocessed_imgs_path, binary_mode=True))\n", @@ -419,7 +421,7 @@ " random_state=0)\n", " \n", " # Create tf.data dataset for training\n", - " BUFFER_SIZE = 1000 # Common size used for shuffling dataset\n", + " BUFFER_SIZE = 1000 # common size used for shuffling dataset\n", " vocab_size = len(tokenizer.word_index) + 1\n", " num_steps = len(img_name_train) // batch_size\n", " \n", @@ -550,7 +552,7 @@ " ckpt = tf.train.Checkpoint(encoder=encoder,\n", " decoder=decoder,\n", " optimizer = optimizer)\n", - " ckpt_manager = tf.train.CheckpointManager(ckpt, train_output_dir, max_to_keep=5)\n", + " ckpt_manager = tf.train.CheckpointManager(ckpt, train_output_dir + 'checkpoints/', max_to_keep=5)\n", " start_epoch = 0\n", " if ckpt_manager.latest_checkpoint:\n", " start_epoch = int(ckpt_manager.latest_checkpoint.split('-')[-1])\n", @@ -589,8 +591,10 @@ "\n", " return loss, total_loss\n", " \n", - " # Create summary writers for plotting loss in tensorboard\n", - " train_summary_writer = 
tf.summary.create_file_writer('./logs/train/')\n", + " # Create summary writers and loss for plotting loss in tensorboard\n", + " tensorboard_dir = train_output_dir + 'logs' + str(run_number) + '/'\n", + " train_summary_writer = tf.summary.create_file_writer(tensorboard_dir)\n", + " train_loss = tf.keras.metrics.Mean('train_loss', dtype=tf.float32)\n", " \n", " # Train model\n", " path_to_most_recent_ckpt = None\n", @@ -601,15 +605,19 @@ " for (batch, (img_tensor, target)) in enumerate(dataset):\n", " batch_loss, t_loss = train_step(img_tensor, target)\n", " total_loss += t_loss\n", - "\n", + " train_loss(t_loss)\n", " if batch % 100 == 0:\n", " print ('Epoch {} Batch {} Loss {:.4f}'.format(\n", " epoch + 1, batch, batch_loss.numpy() / int(target.shape[1])))\n", " \n", - " # storing the epoch end loss value to plot in tensorboard\n", + " \n", + " \n", + " # Storing the epoch end loss value to plot in tensorboard\n", " with train_summary_writer.as_default():\n", - " tf.summary.scalar('loss', total_loss / num_steps, step=epoch)\n", - "\n", + " tf.summary.scalar('loss per epoch', train_loss.result(), step=epoch)\n", + " \n", + " train_loss.reset_states()\n", + " \n", " if epoch % 5 == 0:\n", " path_to_most_recent_ckpt = ckpt_manager.save()\n", "\n", @@ -617,8 +625,15 @@ " total_loss/num_steps))\n", " print ('Time taken for 1 epoch {} sec\\n'.format(time.time() - start))\n", " \n", - " # Could add plot of loss (loss_plot)?\n", - " # Adding tensorboard plot of loss function here (available in UI)\n", + " # Add plot of loss in tensorboard\n", + " metadata ={\n", + " 'outputs': [{\n", + " 'type': 'tensorboard',\n", + " 'source': tensorboard_dir,\n", + " }]\n", + " }\n", + " with open('/mlpipeline-ui-metadata.json', 'w') as f:\n", + " json.dump(metadata, f)\n", " \n", " # Save validation data to use for predictions\n", " val_cap_path = valid_output_dir + 'captions.npy'\n", @@ -632,11 +647,38 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 138, "metadata": { "scrolled": true }, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "2019-07-10 19:23:41:INFO:Build an image that is based on tensorflow/tensorflow:2.0.0b0-py3 and push the image to gcr.io/intro-to-kubeflow-1/trainer:latest\n", + "2019-07-10 19:23:41:INFO:Checking path: gs://artifacts.intro-to-kubeflow-1.appspot.com/ms-coco/components...\n", + "2019-07-10 19:23:41:INFO:Generate entrypoint and serialization codes.\n", + "2019-07-10 19:23:41:INFO:Generate build files.\n", + "2019-07-10 19:23:41:INFO:Start a kaniko job for build.\n", + "2019-07-10 19:23:41:INFO:Cannot Find local kubernetes config. 
Trying in-cluster config.\n", + "2019-07-10 19:23:41:INFO:Initialized with in-cluster config.\n", + "2019-07-10 19:23:46:INFO:5 seconds: waiting for job to complete\n", + "2019-07-10 19:23:51:INFO:10 seconds: waiting for job to complete\n", + "2019-07-10 19:23:56:INFO:15 seconds: waiting for job to complete\n", + "2019-07-10 19:24:01:INFO:20 seconds: waiting for job to complete\n", + "2019-07-10 19:24:06:INFO:25 seconds: waiting for job to complete\n", + "2019-07-10 19:24:11:INFO:30 seconds: waiting for job to complete\n", + "2019-07-10 19:24:16:INFO:35 seconds: waiting for job to complete\n", + "2019-07-10 19:24:21:INFO:40 seconds: waiting for job to complete\n", + "2019-07-10 19:24:26:INFO:45 seconds: waiting for job to complete\n", + "2019-07-10 19:24:31:INFO:50 seconds: waiting for job to complete\n", + "2019-07-10 19:24:36:INFO:55 seconds: waiting for job to complete\n", + "2019-07-10 19:24:36:INFO:Kaniko job complete.\n", + "2019-07-10 19:24:37:INFO:Build component complete.\n" + ] + } + ], "source": [ "TARGET_IMAGE = 'gcr.io/%s/trainer:latest' % PROJECT_NAME\n", "model_train_op = compiler.build_python_component(\n", @@ -647,28 +689,6 @@ " target_image=TARGET_IMAGE)" ] }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": true - }, - "outputs": [], - "source": [ - "# Uncomment the following lines if you want to test the python function from the notebook\n", - "# num_examples = 100\n", - "# train_output_dir = 'default' # Can change if you want a specific directory\n", - "# valid_output_dir = 'default' # Can change if you want a specific directory\n", - "# batch_size = 16\n", - "# embedding_dim = 256\n", - "# hidden_state_size = 512\n", - "# epochs = 20\n", - "\n", - "# train_output = train_model(GCS_DATASET_PATH, str(preprocess_output), str(tokenizing_output), \n", - "# train_output_dir, valid_output_dir, batch_size, embedding_dim,\n", - "# hidden_state_size, epochs)" - ] - }, { "cell_type": "markdown", "metadata": {}, @@ -678,7 +698,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 139, "metadata": {}, "outputs": [], "source": [ @@ -687,9 +707,12 @@ " description='Predicts on images in validation set',\n", " base_image=BASE_IMAGE\n", ")\n", - "def predict(dataset_path: str, tokenizing_output: str, model_train_output: str, \n", - " preprocess_output_dir: str, embedding_dim: int, units: int):\n", + "def predict(dataset_path: str, tokenizing_output: str, \n", + " model_train_output: str, preprocess_output_dir: str, \n", + " valid_output_dir: str, embedding_dim: int, units: int,\n", + " run_number: int):\n", " import pickle\n", + " import json\n", " import matplotlib.pyplot as plt\n", " import numpy as np\n", " import tensorflow as tf\n", @@ -711,6 +734,12 @@ " if preprocess_output_dir == 'default':\n", " preprocess_output_dir = dataset_path + '/preprocess/'\n", " \n", + " if valid_output_dir == 'default':\n", + " valid_output_dir = dataset_path + '/valid/'\n", + " \n", + " tensorboard_dir = valid_output_dir + 'logs' + str(run_number) + '/'\n", + " summary_writer = tf.summary.create_file_writer(tensorboard_dir)\n", + "\n", " # Load tokenizer, model, test_captions, and test_imgs\n", " \"\"\" CHANGE: don't reuse code here: not sure how though..? 
\"\"\"\n", " # Attention model\n", @@ -811,13 +840,13 @@ " encoder = CNN_Encoder(embedding_dim)\n", " decoder = RNN_Decoder(embedding_dim, units, vocab_size)\n", " \n", - " # load model from checkpoint (encoder, decoder)\n", + " # Load model from checkpoint (encoder, decoder)\n", " optimizer = tf.keras.optimizers.Adam()\n", " ckpt = tf.train.Checkpoint(encoder=encoder,\n", " decoder=decoder, optimizer=optimizer)\n", " ckpt.restore(model_path).expect_partial()\n", " \n", - " # load test captions\n", + " # Load test captions\n", " f = BytesIO(file_io.read_file_to_string(val_cap_path, \n", " binary_mode=True))\n", " cap_val = np.load(f)\n", @@ -879,45 +908,84 @@ " attention_plot = attention_plot[:len(result), :]\n", " return result, attention_plot\n", " \n", + " # Modified to plot images on tensorboard\n", " def plot_attention(image, result, attention_plot):\n", " img = tf.io.read_file(image)\n", " img = tf.image.decode_jpeg(img, channels=3)\n", - " \n", " temp_image = np.array(img.numpy())\n", - "\n", - " fig = plt.figure(figsize=(10, 10))\n", - "\n", + " \n", " len_result = len(result)\n", " for l in range(len_result):\n", " temp_att = np.resize(attention_plot[l], (8, 8))\n", - " ax = fig.add_subplot(len_result//2, len_result//2, l+1)\n", - " ax.set_title(result[l])\n", - " img = ax.imshow(temp_image)\n", - " ax.imshow(temp_att, cmap='gray', alpha=0.6, extent=img.get_extent())\n", - "\n", - " plt.tight_layout()\n", - " plt.show()\n", + " plt.title(result[l])\n", + " img = plt.imshow(temp_image)\n", + " plt.imshow(temp_att, cmap='gray', alpha=0.6, extent=img.get_extent())\n", + " \n", + " # Save plt to image to access in tensorboard\n", + " buf = BytesIO()\n", + " plt.savefig(buf, format='png')\n", + " buf.seek(0)\n", + " \n", + " final_im = tf.image.decode_png(buf.getvalue(), channels=4)\n", + " final_im = tf.expand_dims(final_im, 0)\n", + " with summary_writer.as_default():\n", + " tf.summary.image(\"attention\", final_im, step=l)\n", " \n", - " # captions on the validation set\n", + " # Select a random image to caption from validation set\n", " rid = np.random.randint(0, len(img_name_val))\n", " image = img_name_val[rid]\n", " real_caption = ' '.join([tokenizer.index_word[i] for i in cap_val[rid] if i not in [0]])\n", " result, attention_plot = evaluate(image)\n", - "\n", + " print ('Image:', image)\n", " print ('Real Caption:', real_caption)\n", " print ('Prediction Caption:', ' '.join(result))\n", " plot_attention(image, result, attention_plot)\n", " \n", - " # Is there a way to plot imgs in kubeflow ui?" 
+ " # Plot attention images on tensorboard\n", + " metadata = {\n", + " 'outputs': [{\n", + " 'type': 'tensorboard',\n", + " 'source': tensorboard_dir,\n", + " }]\n", + " }\n", + " with open('/mlpipeline-ui-metadata.json', 'w') as f:\n", + " json.dump(metadata, f)\n", + " " ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 140, "metadata": { "scrolled": true }, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "2019-07-10 19:24:37:INFO:Build an image that is based on tensorflow/tensorflow:2.0.0b0-py3 and push the image to gcr.io/intro-to-kubeflow-1/predict:latest\n", + "2019-07-10 19:24:37:INFO:Checking path: gs://artifacts.intro-to-kubeflow-1.appspot.com/ms-coco/components...\n", + "2019-07-10 19:24:37:INFO:Generate entrypoint and serialization codes.\n", + "2019-07-10 19:24:37:INFO:Generate build files.\n", + "2019-07-10 19:24:37:INFO:Start a kaniko job for build.\n", + "2019-07-10 19:24:37:INFO:Cannot Find local kubernetes config. Trying in-cluster config.\n", + "2019-07-10 19:24:37:INFO:Initialized with in-cluster config.\n", + "2019-07-10 19:24:42:INFO:5 seconds: waiting for job to complete\n", + "2019-07-10 19:24:47:INFO:10 seconds: waiting for job to complete\n", + "2019-07-10 19:24:52:INFO:15 seconds: waiting for job to complete\n", + "2019-07-10 19:24:57:INFO:20 seconds: waiting for job to complete\n", + "2019-07-10 19:25:02:INFO:25 seconds: waiting for job to complete\n", + "2019-07-10 19:25:07:INFO:30 seconds: waiting for job to complete\n", + "2019-07-10 19:25:12:INFO:35 seconds: waiting for job to complete\n", + "2019-07-10 19:25:17:INFO:40 seconds: waiting for job to complete\n", + "2019-07-10 19:25:22:INFO:45 seconds: waiting for job to complete\n", + "2019-07-10 19:25:27:INFO:50 seconds: waiting for job to complete\n", + "2019-07-10 19:25:27:INFO:Kaniko job complete.\n", + "2019-07-10 19:25:27:INFO:Build component complete.\n" + ] + } + ], "source": [ "TARGET_IMAGE = 'gcr.io/%s/predict:latest' % PROJECT_NAME\n", "predict_op = compiler.build_python_component(\n", @@ -928,18 +996,6 @@ " target_image=TARGET_IMAGE)" ] }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Uncomment the following lines if you want to test the python function from the notebook\n", - "# %matplotlib inline\n", - "# predict(GCS_DATASET_PATH, str(tokenizing_output), \n", - "# str(train_output), preprocess_output_dir, embedding_dim, hidden_state_size )" - ] - }, { "cell_type": "markdown", "metadata": {}, @@ -956,7 +1012,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 144, "metadata": {}, "outputs": [], "source": [ @@ -966,7 +1022,7 @@ ")\n", "def caption_pipeline(\n", "dataset_path=GCS_DATASET_PATH,\n", - "num_training_examples=30000,\n", + "num_examples=30000,\n", "epochs=20,\n", "training_batch_size=64,\n", "hidden_state_size=512,\n", @@ -977,13 +1033,14 @@ "tokenizing_output_dir='default',\n", "training_output_dir='default',\n", "validation_output_dir='default',\n", - "): # have num_examples small for testing \n", + "run_number=0,\n", + "): \n", " \n", " preprocessing_img_task = preprocessing_img_op(\n", " dataset_path, \n", " output_dir=preprocessing_output_dir,\n", " batch_size=preprocessing_batch_size, \n", - " num_examples=num_training_examples).apply(\n", + " num_examples=num_examples).apply(\n", " use_gcp_secret('user-gcp-sa'))\n", " \n", " tokenize_captions_task = tokenize_captions_op(\n", @@ -1000,7 +1057,9 @@ " 
valid_output_dir=validation_output_dir,\n", " batch_size=training_batch_size, \n", " embedding_dim=embedding_dim, \n", - " units=hidden_state_size, epochs=epochs).apply(\n", + " units=hidden_state_size, \n", + " epochs=epochs,\n", + " run_number=run_number).apply(\n", " use_gcp_secret('user-gcp-sa'))\n", " \n", " predict_task = predict_op(\n", @@ -1008,14 +1067,16 @@ " tokenize_captions_task.output, \n", " model_train_task.output,\n", " preprocess_output_dir=preprocessing_output_dir,\n", + " valid_output_dir=validation_output_dir,\n", " embedding_dim=embedding_dim,\n", - " units=hidden_state_size).apply(\n", + " units=hidden_state_size,\n", + " run_number=run_number).apply(\n", " use_gcp_secret('user-gcp-sa'))" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 145, "metadata": {}, "outputs": [], "source": [ @@ -1024,7 +1085,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 146, "metadata": {}, "outputs": [], "source": [ @@ -1034,11 +1095,24 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 147, "metadata": { "scrolled": true }, - "outputs": [], + "outputs": [ + { + "data": { + "text/html": [ + "Experiment link here" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], "source": [ "client = kfp.Client()\n", "experiment = client.create_experiment(EXPERIMENT_NAME)" @@ -1053,26 +1127,41 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 148, "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/html": [ + "Run link here" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], "source": [ "# Test run to make sure all parts of the pipeline are working properly\n", "arguments = {\n", " 'dataset_path': GCS_DATASET_PATH, \n", - " 'num_training_examples': 100, # Small test to make sure pipeline functions properly\n", + " 'num_examples': 100, # Small test to make sure pipeline functions properly\n", " 'training_batch_size': 16, # has to be smaller since only training on 80 examples \n", + " 'run_number': RUN_NUMBER,\n", "}\n", - "run_name = pipeline_func.__name__ + ' run'\n", + "run_name = pipeline_func.__name__ + ' run' + str(RUN_NUMBER)\n", "run_result = client.run_pipeline(experiment.id, run_name, pipeline_filename,\n", - " params=arguments)" + " params=arguments)\n", + "RUN_NUMBER += 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Model checkpoints are saved at train_output_dir, which is `GCS_DATASET_PATH/train/checkpoints/` by default." + "Model checkpoints are saved at training_output_dir, which is `GCS_DATASET_PATH/train/checkpoints/` by default." 
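If you want to confirm that a run actually wrote checkpoints, the default location can be inspected straight from the notebook. The sketch below is not part of the pipeline: the bucket path is a placeholder for the same `GCS_DATASET_PATH` used above, and it assumes the training component saved through the usual `tf.train.Checkpoint` machinery, which maintains the `checkpoint` state file that `tf.train.latest_checkpoint` reads.

```python
# Sketch: find the newest checkpoint written by the training component and list
# the variables it contains, without rebuilding the encoder/decoder.
import tensorflow as tf

checkpoint_dir = 'gs://[YOUR-BUCKET-ID]/ms-coco/train/checkpoints/'  # placeholder
latest = tf.train.latest_checkpoint(checkpoint_dir)
print('Latest checkpoint:', latest)

if latest is not None:
    for name, shape in tf.train.list_variables(latest):
        print(name, shape)
```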
] } ], From ec5ae5e80e07f15a1f9711505695abff47d86651 Mon Sep 17 00:00:00 2001 From: Zane Durante Date: Wed, 10 Jul 2019 16:08:37 -0700 Subject: [PATCH 04/11] Cleared outputs --- .../notebooks/Image Captioning TF 2.0.ipynb | 180 +++--------------- 1 file changed, 24 insertions(+), 156 deletions(-) diff --git a/samples/notebooks/Image Captioning TF 2.0.ipynb b/samples/notebooks/Image Captioning TF 2.0.ipynb index c23a4b577cc..98952b9234c 100644 --- a/samples/notebooks/Image Captioning TF 2.0.ipynb +++ b/samples/notebooks/Image Captioning TF 2.0.ipynb @@ -39,7 +39,7 @@ }, { "cell_type": "code", - "execution_count": 58, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -49,7 +49,7 @@ }, { "cell_type": "code", - "execution_count": 59, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -62,7 +62,7 @@ }, { "cell_type": "code", - "execution_count": 136, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -72,7 +72,7 @@ }, { "cell_type": "code", - "execution_count": 60, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -98,7 +98,7 @@ }, { "cell_type": "code", - "execution_count": 63, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -201,38 +201,11 @@ }, { "cell_type": "code", - "execution_count": 64, + "execution_count": null, "metadata": { "scrolled": true }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "2019-07-10 17:32:56:INFO:Build an image that is based on tensorflow/tensorflow:2.0.0b0-py3 and push the image to gcr.io/intro-to-kubeflow-1/preprocessing:latest\n", - "2019-07-10 17:32:56:INFO:Checking path: gs://artifacts.intro-to-kubeflow-1.appspot.com/ms-coco/components...\n", - "2019-07-10 17:32:56:INFO:Generate entrypoint and serialization codes.\n", - "2019-07-10 17:32:56:INFO:Generate build files.\n", - "2019-07-10 17:32:56:INFO:Start a kaniko job for build.\n", - "2019-07-10 17:32:56:INFO:Cannot Find local kubernetes config. 
Trying in-cluster config.\n", - "2019-07-10 17:32:56:INFO:Initialized with in-cluster config.\n", - "2019-07-10 17:33:01:INFO:5 seconds: waiting for job to complete\n", - "2019-07-10 17:33:06:INFO:10 seconds: waiting for job to complete\n", - "2019-07-10 17:33:11:INFO:15 seconds: waiting for job to complete\n", - "2019-07-10 17:33:16:INFO:20 seconds: waiting for job to complete\n", - "2019-07-10 17:33:22:INFO:25 seconds: waiting for job to complete\n", - "2019-07-10 17:33:27:INFO:30 seconds: waiting for job to complete\n", - "2019-07-10 17:33:32:INFO:35 seconds: waiting for job to complete\n", - "2019-07-10 17:33:37:INFO:40 seconds: waiting for job to complete\n", - "2019-07-10 17:33:42:INFO:45 seconds: waiting for job to complete\n", - "2019-07-10 17:33:47:INFO:50 seconds: waiting for job to complete\n", - "2019-07-10 17:33:52:INFO:55 seconds: waiting for job to complete\n", - "2019-07-10 17:33:52:INFO:Kaniko job complete.\n", - "2019-07-10 17:33:52:INFO:Build component complete.\n" - ] - } - ], + "outputs": [], "source": [ "TARGET_IMAGE = 'gcr.io/%s/preprocessing:latest' % PROJECT_NAME\n", "preprocessing_img_op = compiler.build_python_component(\n", @@ -252,7 +225,7 @@ }, { "cell_type": "code", - "execution_count": 66, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -314,37 +287,11 @@ }, { "cell_type": "code", - "execution_count": 67, + "execution_count": null, "metadata": { "scrolled": true }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "2019-07-10 17:33:52:INFO:Build an image that is based on tensorflow/tensorflow:2.0.0b0-py3 and push the image to gcr.io/intro-to-kubeflow-1/tokenizer:latest\n", - "2019-07-10 17:33:52:INFO:Checking path: gs://artifacts.intro-to-kubeflow-1.appspot.com/ms-coco/components...\n", - "2019-07-10 17:33:52:INFO:Generate entrypoint and serialization codes.\n", - "2019-07-10 17:33:52:INFO:Generate build files.\n", - "2019-07-10 17:33:52:INFO:Start a kaniko job for build.\n", - "2019-07-10 17:33:52:INFO:Cannot Find local kubernetes config. 
Trying in-cluster config.\n", - "2019-07-10 17:33:52:INFO:Initialized with in-cluster config.\n", - "2019-07-10 17:33:57:INFO:5 seconds: waiting for job to complete\n", - "2019-07-10 17:34:02:INFO:10 seconds: waiting for job to complete\n", - "2019-07-10 17:34:07:INFO:15 seconds: waiting for job to complete\n", - "2019-07-10 17:34:12:INFO:20 seconds: waiting for job to complete\n", - "2019-07-10 17:34:17:INFO:25 seconds: waiting for job to complete\n", - "2019-07-10 17:34:22:INFO:30 seconds: waiting for job to complete\n", - "2019-07-10 17:34:27:INFO:35 seconds: waiting for job to complete\n", - "2019-07-10 17:34:32:INFO:40 seconds: waiting for job to complete\n", - "2019-07-10 17:34:37:INFO:45 seconds: waiting for job to complete\n", - "2019-07-10 17:34:42:INFO:50 seconds: waiting for job to complete\n", - "2019-07-10 17:34:42:INFO:Kaniko job complete.\n", - "2019-07-10 17:34:43:INFO:Build component complete.\n" - ] - } - ], + "outputs": [], "source": [ "TARGET_IMAGE = 'gcr.io/%s/tokenizer:latest' % PROJECT_NAME\n", "tokenize_captions_op = compiler.build_python_component(\n", @@ -363,7 +310,7 @@ }, { "cell_type": "code", - "execution_count": 137, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -647,38 +594,11 @@ }, { "cell_type": "code", - "execution_count": 138, + "execution_count": null, "metadata": { "scrolled": true }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "2019-07-10 19:23:41:INFO:Build an image that is based on tensorflow/tensorflow:2.0.0b0-py3 and push the image to gcr.io/intro-to-kubeflow-1/trainer:latest\n", - "2019-07-10 19:23:41:INFO:Checking path: gs://artifacts.intro-to-kubeflow-1.appspot.com/ms-coco/components...\n", - "2019-07-10 19:23:41:INFO:Generate entrypoint and serialization codes.\n", - "2019-07-10 19:23:41:INFO:Generate build files.\n", - "2019-07-10 19:23:41:INFO:Start a kaniko job for build.\n", - "2019-07-10 19:23:41:INFO:Cannot Find local kubernetes config. 
Trying in-cluster config.\n", - "2019-07-10 19:23:41:INFO:Initialized with in-cluster config.\n", - "2019-07-10 19:23:46:INFO:5 seconds: waiting for job to complete\n", - "2019-07-10 19:23:51:INFO:10 seconds: waiting for job to complete\n", - "2019-07-10 19:23:56:INFO:15 seconds: waiting for job to complete\n", - "2019-07-10 19:24:01:INFO:20 seconds: waiting for job to complete\n", - "2019-07-10 19:24:06:INFO:25 seconds: waiting for job to complete\n", - "2019-07-10 19:24:11:INFO:30 seconds: waiting for job to complete\n", - "2019-07-10 19:24:16:INFO:35 seconds: waiting for job to complete\n", - "2019-07-10 19:24:21:INFO:40 seconds: waiting for job to complete\n", - "2019-07-10 19:24:26:INFO:45 seconds: waiting for job to complete\n", - "2019-07-10 19:24:31:INFO:50 seconds: waiting for job to complete\n", - "2019-07-10 19:24:36:INFO:55 seconds: waiting for job to complete\n", - "2019-07-10 19:24:36:INFO:Kaniko job complete.\n", - "2019-07-10 19:24:37:INFO:Build component complete.\n" - ] - } - ], + "outputs": [], "source": [ "TARGET_IMAGE = 'gcr.io/%s/trainer:latest' % PROJECT_NAME\n", "model_train_op = compiler.build_python_component(\n", @@ -698,7 +618,7 @@ }, { "cell_type": "code", - "execution_count": 139, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -955,37 +875,11 @@ }, { "cell_type": "code", - "execution_count": 140, + "execution_count": null, "metadata": { "scrolled": true }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "2019-07-10 19:24:37:INFO:Build an image that is based on tensorflow/tensorflow:2.0.0b0-py3 and push the image to gcr.io/intro-to-kubeflow-1/predict:latest\n", - "2019-07-10 19:24:37:INFO:Checking path: gs://artifacts.intro-to-kubeflow-1.appspot.com/ms-coco/components...\n", - "2019-07-10 19:24:37:INFO:Generate entrypoint and serialization codes.\n", - "2019-07-10 19:24:37:INFO:Generate build files.\n", - "2019-07-10 19:24:37:INFO:Start a kaniko job for build.\n", - "2019-07-10 19:24:37:INFO:Cannot Find local kubernetes config. 
Trying in-cluster config.\n", - "2019-07-10 19:24:37:INFO:Initialized with in-cluster config.\n", - "2019-07-10 19:24:42:INFO:5 seconds: waiting for job to complete\n", - "2019-07-10 19:24:47:INFO:10 seconds: waiting for job to complete\n", - "2019-07-10 19:24:52:INFO:15 seconds: waiting for job to complete\n", - "2019-07-10 19:24:57:INFO:20 seconds: waiting for job to complete\n", - "2019-07-10 19:25:02:INFO:25 seconds: waiting for job to complete\n", - "2019-07-10 19:25:07:INFO:30 seconds: waiting for job to complete\n", - "2019-07-10 19:25:12:INFO:35 seconds: waiting for job to complete\n", - "2019-07-10 19:25:17:INFO:40 seconds: waiting for job to complete\n", - "2019-07-10 19:25:22:INFO:45 seconds: waiting for job to complete\n", - "2019-07-10 19:25:27:INFO:50 seconds: waiting for job to complete\n", - "2019-07-10 19:25:27:INFO:Kaniko job complete.\n", - "2019-07-10 19:25:27:INFO:Build component complete.\n" - ] - } - ], + "outputs": [], "source": [ "TARGET_IMAGE = 'gcr.io/%s/predict:latest' % PROJECT_NAME\n", "predict_op = compiler.build_python_component(\n", @@ -1012,7 +906,7 @@ }, { "cell_type": "code", - "execution_count": 144, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -1076,7 +970,7 @@ }, { "cell_type": "code", - "execution_count": 145, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -1085,7 +979,7 @@ }, { "cell_type": "code", - "execution_count": 146, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -1095,24 +989,11 @@ }, { "cell_type": "code", - "execution_count": 147, + "execution_count": null, "metadata": { "scrolled": true }, - "outputs": [ - { - "data": { - "text/html": [ - "Experiment link here" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "client = kfp.Client()\n", "experiment = client.create_experiment(EXPERIMENT_NAME)" @@ -1127,22 +1008,9 @@ }, { "cell_type": "code", - "execution_count": 148, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "Run link here" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "# Test run to make sure all parts of the pipeline are working properly\n", "arguments = {\n", @@ -1181,7 +1049,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.7" + "version": "3.7.3" } }, "nbformat": 4, From 51e323ca31cbbd6e82b4a4c68eaf723bd406a781 Mon Sep 17 00:00:00 2001 From: Zane Durante Date: Mon, 15 Jul 2019 16:23:55 -0700 Subject: [PATCH 05/11] refactored notebook and added image src --- .../Image Captioning TF 2.0.ipynb | 256 ++++-------------- .../notebooks/image-captioning-gcp/README.md | 22 ++ .../image-captioning-gcp/src/Dockerfile | 2 + .../image-captioning-gcp/src/models.py | 102 +++++++ 4 files changed, 172 insertions(+), 210 deletions(-) rename samples/notebooks/{ => image-captioning-gcp}/Image Captioning TF 2.0.ipynb (76%) create mode 100644 samples/notebooks/image-captioning-gcp/README.md create mode 100644 samples/notebooks/image-captioning-gcp/src/Dockerfile create mode 100644 samples/notebooks/image-captioning-gcp/src/models.py diff --git a/samples/notebooks/Image Captioning TF 2.0.ipynb b/samples/notebooks/image-captioning-gcp/Image Captioning TF 2.0.ipynb similarity index 76% rename from samples/notebooks/Image Captioning TF 2.0.ipynb rename to samples/notebooks/image-captioning-gcp/Image Captioning TF 2.0.ipynb index 
98952b9234c..f13be31b3c7 100644 --- a/samples/notebooks/Image Captioning TF 2.0.ipynb +++ b/samples/notebooks/image-captioning-gcp/Image Captioning TF 2.0.ipynb @@ -11,9 +11,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "This notebook modifies an example Tensorflow 2.0 notebook from\n", - "[here](https://colab.sandbox.google.com/github/tensorflow/docs/blob/master/site/en/r2/tutorials/text/image_captioning.ipynb)\n", - "to work with kubeflow pipelines. " + "This notebook modifies the [Image Captioning with Attention Tensorflow 2.0 notebook](https://colab.sandbox.google.com/github/tensorflow/docs/blob/master/site/en/r2/tutorials/text/image_captioning.ipynb)\n", + "to work with kubeflow pipelines." ] }, { @@ -57,17 +56,7 @@ "EXPERIMENT_NAME = 'Image Captioning'\n", "PROJECT_NAME = '[YOUR-PROJECT-NAME]' \n", "PIPELINE_STORAGE_PATH = 'gs://[YOUR-BUCKET-NAME]/ms-coco/components' # path to save pipeline component images\n", - "BASE_IMAGE = 'tensorflow/tensorflow:2.0.0b0-py3' # using tensorflow 2.0.0" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Used to save tensorboard files to a different directory each run\n", - "RUN_NUMBER = 0 " + "BASE_IMAGE = 'gcr.io/intro-to-kubeflow-1/img-cap:latest' # using image created in README instructions" ] }, { @@ -154,11 +143,11 @@ " img = tf.keras.applications.inception_v3.preprocess_input(img)\n", " return img, image_path\n", " \n", + " # Create model for processing images \n", " image_model = tf.keras.applications.InceptionV3(include_top=False,\n", " weights='imagenet')\n", " new_input = image_model.input\n", " hidden_layer = image_model.layers[-1].output\n", - "\n", " image_features_extract_model = tf.keras.Model(new_input, hidden_layer)\n", " \n", " # Save extracted features in GCS\n", @@ -243,7 +232,6 @@ " from io import BytesIO\n", " from ast import literal_eval as make_tuple\n", " \n", - " \n", " # Convert output from string to tuple and unpack\n", " preprocess_output = make_tuple(preprocess_output)\n", " train_caption_path = preprocess_output[0]\n", @@ -281,7 +269,6 @@ " cap_vector_file_path = OUTPUT_DIR + 'cap_vector.npy'\n", " np.save(file_io.FileIO(cap_vector_file_path, 'w'), cap_vector)\n", " \n", - " \n", " return str(max_length), tokenizer_file_path, cap_vector_file_path" ] }, @@ -321,13 +308,15 @@ ")\n", "def train_model(dataset_path: str, preprocess_output: str, \n", " tokenizing_output: str, train_output_dir: str, valid_output_dir: str, \n", - " batch_size: int, embedding_dim: int, units: int, EPOCHS: int, run_number : int)-> str:\n", + " batch_size: int, embedding_dim: int, units: int, EPOCHS: int)-> str:\n", " import json\n", " import time\n", " import pickle\n", + " import models\n", " import numpy as np\n", " import tensorflow as tf\n", " from io import BytesIO\n", + " from datetime import datetime\n", " from sklearn.model_selection import train_test_split\n", " from tensorflow.python.lib.io import file_io\n", " from ast import literal_eval as make_tuple\n", @@ -338,7 +327,6 @@ " \n", " # Unpack tuples\n", " preprocessed_imgs_path = preprocess_output[1]\n", - " \n", " tokenizer_path = tokenizing_output[1]\n", " cap_vector_file_path = tokenizing_output[2]\n", " \n", @@ -392,94 +380,9 @@ " dataset = dataset.shuffle(BUFFER_SIZE).batch(batch_size)\n", " dataset = dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)\n", " \n", - " # Create models\n", - " # Attention model\n", - " class BahdanauAttention(tf.keras.Model):\n", - " def __init__(self, 
units):\n", - " super(BahdanauAttention, self).__init__()\n", - " self.W1 = tf.keras.layers.Dense(units)\n", - " self.W2 = tf.keras.layers.Dense(units)\n", - " self.V = tf.keras.layers.Dense(1)\n", - " \n", - " def call(self, features, hidden):\n", - " # features(CNN_encoder output) shape == (batch_size, 64, embedding_dim)\n", - "\n", - " # hidden shape == (batch_size, hidden_size)\n", - " # hidden_with_time_axis shape == (batch_size, 1, hidden_size)\n", - " hidden_with_time_axis = tf.expand_dims(hidden, 1)\n", - "\n", - " # score shape == (batch_size, 64, hidden_size)\n", - " score = tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time_axis))\n", - "\n", - " # attention_weights shape == (batch_size, 64, 1)\n", - " # you get 1 at the last axis because you are applying score to self.V\n", - " attention_weights = tf.nn.softmax(self.V(score), axis=1)\n", - "\n", - " # context_vector shape after sum == (batch_size, hidden_size)\n", - " context_vector = attention_weights * features\n", - " context_vector = tf.reduce_sum(context_vector, axis=1)\n", - "\n", - " return context_vector, attention_weights\n", - " \n", - " # CNN Encoder model\n", - " class CNN_Encoder(tf.keras.Model):\n", - " # Since you have already extracted the features and dumped it using pickle\n", - " # This encoder passes those features through a Fully connected layer\n", - " def __init__(self, embedding_dim):\n", - " super(CNN_Encoder, self).__init__()\n", - " # shape after fc == (batch_size, 64, embedding_dim)\n", - " self.fc = tf.keras.layers.Dense(embedding_dim)\n", - "\n", - " def call(self, x):\n", - " x = self.fc(x)\n", - " x = tf.nn.relu(x)\n", - " return x\n", - " \n", - " # RNN Decoder model\n", - " class RNN_Decoder(tf.keras.Model):\n", - " def __init__(self, embedding_dim, units, vocab_size):\n", - " super(RNN_Decoder, self).__init__()\n", - " self.units = units\n", - "\n", - " self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)\n", - " self.gru = tf.keras.layers.GRU(self.units,\n", - " return_sequences=True,\n", - " return_state=True,\n", - " recurrent_initializer='glorot_uniform')\n", - " self.fc1 = tf.keras.layers.Dense(self.units)\n", - " self.fc2 = tf.keras.layers.Dense(vocab_size)\n", - "\n", - " self.attention = BahdanauAttention(self.units)\n", - "\n", - " def call(self, x, features, hidden):\n", - " # defining attention as a separate model\n", - " context_vector, attention_weights = self.attention(features, hidden)\n", - "\n", - " # x shape after passing through embedding == (batch_size, 1, embedding_dim)\n", - " x = self.embedding(x)\n", - "\n", - " # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)\n", - " x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)\n", - "\n", - " # passing the concatenated vector to the GRU\n", - " output, state = self.gru(x)\n", - "\n", - " # shape == (batch_size, max_length, hidden_size)\n", - " x = self.fc1(output)\n", - "\n", - " # x shape == (batch_size * max_length, hidden_size)\n", - " x = tf.reshape(x, (-1, x.shape[2]))\n", - "\n", - " # output shape == (batch_size * max_length, vocab)\n", - " x = self.fc2(x)\n", - "\n", - " return x, state, attention_weights\n", - "\n", - " def reset_state(self, batch_size):\n", - " return tf.zeros((batch_size, self.units))\n", - " \n", - " encoder = CNN_Encoder(embedding_dim)\n", - " decoder = RNN_Decoder(embedding_dim, units, vocab_size)\n", + " # get models from models.py\n", + " encoder = models.CNN_Encoder(embedding_dim)\n", + " decoder = models.RNN_Decoder(embedding_dim, 
units, vocab_size)\n", " \n", " optimizer = tf.keras.optimizers.Adam()\n", " loss_object = tf.keras.losses.SparseCategoricalCrossentropy(\n", @@ -539,7 +442,7 @@ " return loss, total_loss\n", " \n", " # Create summary writers and loss for plotting loss in tensorboard\n", - " tensorboard_dir = train_output_dir + 'logs' + str(run_number) + '/'\n", + " tensorboard_dir = train_output_dir + 'logs/' + datetime.now().strftime(\"%Y%m%d-%H%M%S\")\n", " train_summary_writer = tf.summary.create_file_writer(tensorboard_dir)\n", " train_loss = tf.keras.metrics.Mean('train_loss', dtype=tf.float32)\n", " \n", @@ -629,14 +532,14 @@ ")\n", "def predict(dataset_path: str, tokenizing_output: str, \n", " model_train_output: str, preprocess_output_dir: str, \n", - " valid_output_dir: str, embedding_dim: int, units: int,\n", - " run_number: int):\n", + " valid_output_dir: str, embedding_dim: int, units: int):\n", " import pickle\n", " import json\n", + " import models\n", " import matplotlib.pyplot as plt\n", " import numpy as np\n", " import tensorflow as tf\n", - " from math import ceil \n", + " from datetime import datetime\n", " from io import BytesIO\n", " from tensorflow.python.lib.io import file_io\n", " from ast import literal_eval as make_tuple\n", @@ -657,95 +560,10 @@ " if valid_output_dir == 'default':\n", " valid_output_dir = dataset_path + '/valid/'\n", " \n", - " tensorboard_dir = valid_output_dir + 'logs' + str(run_number) + '/'\n", + " tensorboard_dir = valid_output_dir + 'logs' + datetime.now().strftime(\"%Y%m%d-%H%M%S\")\n", " summary_writer = tf.summary.create_file_writer(tensorboard_dir)\n", "\n", " # Load tokenizer, model, test_captions, and test_imgs\n", - " \"\"\" CHANGE: don't reuse code here: not sure how though..? \"\"\"\n", - " # Attention model\n", - " class BahdanauAttention(tf.keras.Model):\n", - " def __init__(self, units):\n", - " super(BahdanauAttention, self).__init__()\n", - " self.W1 = tf.keras.layers.Dense(units)\n", - " self.W2 = tf.keras.layers.Dense(units)\n", - " self.V = tf.keras.layers.Dense(1)\n", - " \n", - " def call(self, features, hidden):\n", - " # features(CNN_encoder output) shape == (batch_size, 64, embedding_dim)\n", - "\n", - " # hidden shape == (batch_size, hidden_size)\n", - " # hidden_with_time_axis shape == (batch_size, 1, hidden_size)\n", - " hidden_with_time_axis = tf.expand_dims(hidden, 1)\n", - "\n", - " # score shape == (batch_size, 64, hidden_size)\n", - " score = tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time_axis))\n", - "\n", - " # attention_weights shape == (batch_size, 64, 1)\n", - " # you get 1 at the last axis because you are applying score to self.V\n", - " attention_weights = tf.nn.softmax(self.V(score), axis=1)\n", - "\n", - " # context_vector shape after sum == (batch_size, hidden_size)\n", - " context_vector = attention_weights * features\n", - " context_vector = tf.reduce_sum(context_vector, axis=1)\n", - "\n", - " return context_vector, attention_weights\n", - " \n", - " # CNN Encoder model\n", - " class CNN_Encoder(tf.keras.Model):\n", - " # Since you have already extracted the features and dumped it using pickle\n", - " # This encoder passes those features through a Fully connected layer\n", - " def __init__(self, embedding_dim):\n", - " super(CNN_Encoder, self).__init__()\n", - " # shape after fc == (batch_size, 64, embedding_dim)\n", - " self.fc = tf.keras.layers.Dense(embedding_dim)\n", - "\n", - " def call(self, x):\n", - " x = self.fc(x)\n", - " x = tf.nn.relu(x)\n", - " return x\n", - " \n", - " # RNN Decoder 
model\n", - " class RNN_Decoder(tf.keras.Model):\n", - " def __init__(self, embedding_dim, units, vocab_size):\n", - " super(RNN_Decoder, self).__init__()\n", - " self.units = units\n", - "\n", - " self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)\n", - " self.gru = tf.keras.layers.GRU(self.units,\n", - " return_sequences=True,\n", - " return_state=True,\n", - " recurrent_initializer='glorot_uniform')\n", - " self.fc1 = tf.keras.layers.Dense(self.units)\n", - " self.fc2 = tf.keras.layers.Dense(vocab_size)\n", - "\n", - " self.attention = BahdanauAttention(self.units)\n", - "\n", - " def call(self, x, features, hidden):\n", - " # defining attention as a separate model\n", - " context_vector, attention_weights = self.attention(features, hidden)\n", - "\n", - " # x shape after passing through embedding == (batch_size, 1, embedding_dim)\n", - " x = self.embedding(x)\n", - "\n", - " # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)\n", - " x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)\n", - "\n", - " # passing the concatenated vector to the GRU\n", - " output, state = self.gru(x)\n", - "\n", - " # shape == (batch_size, max_length, hidden_size)\n", - " x = self.fc1(output)\n", - "\n", - " # x shape == (batch_size * max_length, hidden_size)\n", - " x = tf.reshape(x, (-1, x.shape[2]))\n", - "\n", - " # output shape == (batch_size * max_length, vocab)\n", - " x = self.fc2(x)\n", - "\n", - " return x, state, attention_weights\n", - "\n", - " def reset_state(self, batch_size):\n", - " return tf.zeros((batch_size, self.units))\n", " \n", " # Load tokenizer\n", " with file_io.FileIO(tokenizer_path, 'rb') as src:\n", @@ -757,8 +575,8 @@ " attention_features_shape = 64\n", " features_shape = 2048\n", " \n", - " encoder = CNN_Encoder(embedding_dim)\n", - " decoder = RNN_Decoder(embedding_dim, units, vocab_size)\n", + " encoder = models.CNN_Encoder(embedding_dim)\n", + " decoder = models.RNN_Decoder(embedding_dim, units, vocab_size)\n", " \n", " # Load model from checkpoint (encoder, decoder)\n", " optimizer = tf.keras.optimizers.Adam()\n", @@ -927,7 +745,6 @@ "tokenizing_output_dir='default',\n", "training_output_dir='default',\n", "validation_output_dir='default',\n", - "run_number=0,\n", "): \n", " \n", " preprocessing_img_task = preprocessing_img_op(\n", @@ -941,7 +758,8 @@ " dataset_path, \n", " preprocessing_img_task.output, \n", " output_dir=tokenizing_output_dir, \n", - " top_k=vocab_size).apply(use_gcp_secret('user-gcp-sa'))\n", + " top_k=vocab_size).apply(\n", + " use_gcp_secret('user-gcp-sa'))\n", " \n", " model_train_task = model_train_op(\n", " dataset_path, \n", @@ -952,8 +770,7 @@ " batch_size=training_batch_size, \n", " embedding_dim=embedding_dim, \n", " units=hidden_state_size, \n", - " epochs=epochs,\n", - " run_number=run_number).apply(\n", + " epochs=epochs).apply(\n", " use_gcp_secret('user-gcp-sa'))\n", " \n", " predict_task = predict_op(\n", @@ -963,8 +780,7 @@ " preprocess_output_dir=preprocessing_output_dir,\n", " valid_output_dir=validation_output_dir,\n", " embedding_dim=embedding_dim,\n", - " units=hidden_state_size,\n", - " run_number=run_number).apply(\n", + " units=hidden_state_size).apply(\n", " use_gcp_secret('user-gcp-sa'))" ] }, @@ -1017,20 +833,40 @@ " 'dataset_path': GCS_DATASET_PATH, \n", " 'num_examples': 100, # Small test to make sure pipeline functions properly\n", " 'training_batch_size': 16, # has to be smaller since only training on 80 examples \n", - " 'run_number': RUN_NUMBER,\n", "}\n", - 
"run_name = pipeline_func.__name__ + ' run' + str(RUN_NUMBER)\n", + "run_name = pipeline_func.__name__ + ' run'\n", "run_result = client.run_pipeline(experiment.id, run_name, pipeline_filename,\n", - " params=arguments)\n", - "RUN_NUMBER += 1" + " params=arguments)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Model checkpoints are saved at training_output_dir, which is `GCS_DATASET_PATH/train/checkpoints/` by default." + "Model checkpoints are saved at training_output_dir, which is `[GCS_DATASET_PATH]/train/checkpoints/` by default." ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "

Steps taken to convert the original TF 2.0 notebook

\n", + "\n", + "\n", + "1. Componentize notebook to run in different steps, and not linearly.\n", + "2. Store the dataset in GCS to make it easily accessible in Kubeflow.\n", + "3. Use `file_io.FileIO()` instead of `open()` when loading files from GCS.\n", + "4. To pass multiple outputs downstream, pass them as a tuple of strings. Kubeflow converts this tuple to a string when you pass it downstream. So, you have to convert it from a string back to a tuple in the downstream component to get the multiple outputs.\n", + "5. To pass many numpy arrays to downstream components, first save them on GCS. Put the paths to the saved numpy files in a new array, and then save that array on GCS as well. Pass the path to this array to the downstream components.\n", + "6. Use `tf.io.read_file` and then `tf.image.decode_jpeg` instead of `PIL.Image` to be compatible with GCS" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] } ], "metadata": { diff --git a/samples/notebooks/image-captioning-gcp/README.md b/samples/notebooks/image-captioning-gcp/README.md new file mode 100644 index 00000000000..3f41796eda4 --- /dev/null +++ b/samples/notebooks/image-captioning-gcp/README.md @@ -0,0 +1,22 @@ +# Image Captioning TF 2.0 + +## About +This notebook is an example of how to convert an existing Tensorflow notebook into a Kubeflow pipeline using jupyter notebook. Specifically, this notebook takes an example tensorflow notebook, [image captioning with attention](https://colab.sandbox.google.com/github/tensorflow/docs/blob/master/site/en/r2/tutorials/text/image_captioning.ipynb), and creates a kubeflow pipeline. + +## Setup + +### Setup notebook server +This pipeline requires you to [setup a notebook server](https://www.kubeflow.org/docs/notebooks/setup/) in the Kubeflow UI. After you are setup, upload this notebook and then run it in the notebook server. + +### Create a GCS bucket +This pipeline requires a GCS bucket. If you haven't already, [create a GCS bucket](https://cloud.google.com/storage/docs/creating-buckets) to run the notebook. Make sure to create the storage bucket in the same project that you are running Kubeflow on to have the proper permissions by default. You can also create a GCS bucket by running `gsutil mb -p gs://`. + +### Upload the notebook in the Kubeflow UI +In order to run this pipeline, make sure to upload the notebook to your notebook server in the Kubeflow UI. You can clone this repo in the Jupyter notebook server by connecting to the notebook server and then selecting New > Terminal. In the terminal type `git clone https://github.com/kubeflow/pipelines.git`. + +## Outputs +Below are some screenshots of the final pipeline and the model outputs. 
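Step 4 of the conversion steps listed earlier (passing multiple outputs as a stringified tuple) is the easiest one to trip over, so a small self-contained illustration may help. Nothing in it touches the pipeline: the GCS paths are placeholders, and `make_tuple` is just the `ast.literal_eval` alias the components use.

```python
# Sketch of the tuple round-trip between pipeline components.
from ast import literal_eval as make_tuple

# An upstream component returns a tuple of strings; Kubeflow passes it downstream
# as one plain string.
upstream_output = str(('gs://[YOUR-BUCKET-NAME]/ms-coco/captions.npy',
                       'gs://[YOUR-BUCKET-NAME]/ms-coco/img_paths.npy'))

# The downstream component rebuilds the tuple before using the paths.
caption_path, img_paths_path = make_tuple(upstream_output)
print(caption_path)
print(img_paths_path)
```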
+ +![pipeline-screenshot](https://user-images.githubusercontent.com/17008638/61160416-41694f80-a4b4-11e9-9317-5a92f625c173.png) + +![attention-screenshot](https://user-images.githubusercontent.com/17008638/61160441-59d96a00-a4b4-11e9-809b-f3df7cbe0dae.PNG) \ No newline at end of file diff --git a/samples/notebooks/image-captioning-gcp/src/Dockerfile b/samples/notebooks/image-captioning-gcp/src/Dockerfile new file mode 100644 index 00000000000..58ae295df92 --- /dev/null +++ b/samples/notebooks/image-captioning-gcp/src/Dockerfile @@ -0,0 +1,2 @@ +FROM tensorflow/tensorflow:2.0.0b0-py3 +ADD models.py /ml/ diff --git a/samples/notebooks/image-captioning-gcp/src/models.py b/samples/notebooks/image-captioning-gcp/src/models.py new file mode 100644 index 00000000000..73109d5e6cb --- /dev/null +++ b/samples/notebooks/image-captioning-gcp/src/models.py @@ -0,0 +1,102 @@ +# Copyright 2019 Google Inc. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""This file contains the models used in the image captioning pipeline""" + +import tensorflow as tf + + +class BahdanauAttention(tf.keras.Model): + def __init__(self, units): + super(BahdanauAttention, self).__init__() + self.W1 = tf.keras.layers.Dense(units) + self.W2 = tf.keras.layers.Dense(units) + self.V = tf.keras.layers.Dense(1) + + def call(self, features, hidden): + # features(CNN_encoder output) shape == (batch_size, 64, embedding_dim) + + # hidden shape == (batch_size, hidden_size) + # hidden_with_time_axis shape == (batch_size, 1, hidden_size) + hidden_with_time_axis = tf.expand_dims(hidden, 1) + + # score shape == (batch_size, 64, hidden_size) + score = tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time_axis)) + + # attention_weights shape == (batch_size, 64, 1) + # you get 1 at the last axis because you are applying score to self.V + attention_weights = tf.nn.softmax(self.V(score), axis=1) + + # context_vector shape after sum == (batch_size, hidden_size) + context_vector = attention_weights * features + context_vector = tf.reduce_sum(context_vector, axis=1) + + return context_vector, attention_weights + +# CNN Encoder model +class CNN_Encoder(tf.keras.Model): + # Since you have already extracted the features and dumped it using pickle + # This encoder passes those features through a Fully connected layer + def __init__(self, embedding_dim): + super(CNN_Encoder, self).__init__() + # shape after fc == (batch_size, 64, embedding_dim) + self.fc = tf.keras.layers.Dense(embedding_dim) + + def call(self, x): + x = self.fc(x) + x = tf.nn.relu(x) + return x + +# RNN Decoder model +class RNN_Decoder(tf.keras.Model): + def __init__(self, embedding_dim, units, vocab_size): + super(RNN_Decoder, self).__init__() + self.units = units + + self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim) + self.gru = tf.keras.layers.GRU(self.units, + return_sequences=True, + return_state=True, + recurrent_initializer='glorot_uniform') + self.fc1 = tf.keras.layers.Dense(self.units) + self.fc2 = tf.keras.layers.Dense(vocab_size) + + 
self.attention = BahdanauAttention(self.units) + + def call(self, x, features, hidden): + # defining attention as a separate model + context_vector, attention_weights = self.attention(features, hidden) + + # x shape after passing through embedding == (batch_size, 1, embedding_dim) + x = self.embedding(x) + + # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size) + x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1) + + # passing the concatenated vector to the GRU + output, state = self.gru(x) + + # shape == (batch_size, max_length, hidden_size) + x = self.fc1(output) + + # x shape == (batch_size * max_length, hidden_size) + x = tf.reshape(x, (-1, x.shape[2])) + + # output shape == (batch_size * max_length, vocab) + x = self.fc2(x) + + return x, state, attention_weights + + def reset_state(self, batch_size): + return tf.zeros((batch_size, self.units)) \ No newline at end of file From 65a3275ec71bda9ab0c04b240e3f9a5c80c759db Mon Sep 17 00:00:00 2001 From: Zane Durante Date: Thu, 18 Jul 2019 12:13:19 -0700 Subject: [PATCH 06/11] Updated changes except base image location --- .../Image Captioning TF 2.0.ipynb | 232 ++++++++++++------ .../notebooks/image-captioning-gcp/README.md | 24 +- .../image-captioning-gcp/src/models.py | 3 +- 3 files changed, 174 insertions(+), 85 deletions(-) diff --git a/samples/notebooks/image-captioning-gcp/Image Captioning TF 2.0.ipynb b/samples/notebooks/image-captioning-gcp/Image Captioning TF 2.0.ipynb index f13be31b3c7..0cb45c59ab6 100644 --- a/samples/notebooks/image-captioning-gcp/Image Captioning TF 2.0.ipynb +++ b/samples/notebooks/image-captioning-gcp/Image Captioning TF 2.0.ipynb @@ -4,7 +4,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "

Image Captioning with Attention in Tensorflow 2.0

" + "# Image Captioning with Attention in Tensorflow 2.0" ] }, { @@ -12,28 +12,55 @@ "metadata": {}, "source": [ "This notebook modifies the [Image Captioning with Attention Tensorflow 2.0 notebook](https://colab.sandbox.google.com/github/tensorflow/docs/blob/master/site/en/r2/tutorials/text/image_captioning.ipynb)\n", - "to work with kubeflow pipelines." + "to work with kubeflow pipelines. This pipeline creates a model that can caption an image." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "

Download dataset and upload to GCS

" + "#### Install Kubeflow pipelines\n", + "Install the `kfp` package if you haven't already." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!pip3 install kfp --upgrade" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Activate service account credentials\n", + "This allows for using `gsutil` from the notebook to upload the dataset to GCS." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!gcloud auth activate-service-account --key-file=${GOOGLE_APPLICATION_CREDENTIALS}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "First, we have to download the [MS COCO dataset](http://cocodataset.org/#download). This sample uses both the 2014 train images and 2014 train/val annotations. If you downloaded and extracted the dataset on your local system, you can upload it to GCS using `gsutil -m cp -r path/to/dataset/ gs://[YOUR-BUCKET-NAME]/ms-coco`." + "### Download dataset and upload to GCS " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "

Setup project info and imports

" + "First, we have to download the [MS COCO dataset](http://cocodataset.org/#download). This sample uses both the 2014 train images and 2014 train/val annotations. The following cells download a small subset (<1000 imgs) of the dataset and the annotations to the GCS bucket specified below with `GCS_DATASET_PATH`." ] }, { @@ -42,10 +69,58 @@ "metadata": {}, "outputs": [], "source": [ - "# Previously downloaded dataset and put onto GCS\n", + "# Location to download dataset and put onto GCS (should be associated\n", + "# with Kubeflow project)\n", "GCS_DATASET_PATH = 'gs://[YOUR-BUCKET-NAME]/ms-coco'" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Download images\n", + "Downloads images to `${GCS_DATASET_PATH}/train2014/train2014`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Download images (we use -x to ignore ~99% of images)\n", + "!gsutil -m rsync -x \".*0\\.jpg|.*1\\.jpg|.*2\\.jpg|.*3\\.jpg|.*4\\.jpg|.*5\\.jpg|.*6\\.jpg|.*7\\.jpg|.*8\\.jpg|.*09\\.jpg|.*19\\.jpg|.*29\\.jpg|.*39\\.jpg|.*49\\.jpg|.*59\\.jpg|.*69\\.jpg|.*79\\.jpg|.*89\\.jpg\" gs://images.cocodataset.org/train2014 {GCS_DATASET_PATH}/train2014/train2014" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Download annotations\n", + "For some reason MS COCO blocks using `gsutil` with the annotations (GitHub issue [here](https://github.com/cocodataset/cocoapi/issues/216)). We can work around this by downloading it, and then uploading it to GCS." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Download to local, upload to GCS, then delete local download\n", + "!wget http://images.cocodataset.org/annotations/annotations_trainval2014.zip\n", + "!unzip annotations_trainval2014.zip -d annotations_trainval2014\n", + "!gsutil -m cp -r annotations_trainval2014 {GCS_DATASET_PATH}\n", + "!rm -r annotations_trainval2014\n", + "!rm annotations_trainval2014.zip" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Setup project info and imports" + ] + }, { "cell_type": "code", "execution_count": null, @@ -56,7 +131,13 @@ "EXPERIMENT_NAME = 'Image Captioning'\n", "PROJECT_NAME = '[YOUR-PROJECT-NAME]' \n", "PIPELINE_STORAGE_PATH = 'gs://[YOUR-BUCKET-NAME]/ms-coco/components' # path to save pipeline component images\n", - "BASE_IMAGE = 'gcr.io/intro-to-kubeflow-1/img-cap:latest' # using image created in README instructions" + "BASE_IMAGE = 'gcr.io/intro-to-kubeflow-1/img-cap:latest' # using image created in README instructions\n", + "\n", + "# Target images for creating components\n", + "PREPROCESS_IMG = 'gcr.io/%s/ms-coco/preprocess:latest' % PROJECT_NAME\n", + "TOKENIZE_IMG = 'gcr.io/%s/ms-coco/tokenize:latest' % PROJECT_NAME\n", + "TRAIN_IMG = 'gcr.io/%s/ms-coco/train:latest' % PROJECT_NAME\n", + "PREDICT_IMG = 'gcr.io/%s/ms-coco/predict:latest' % PROJECT_NAME" ] }, { @@ -75,14 +156,15 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "

Create pipeline components

" + "### Create pipeline components" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "

Data preprocessing component

" + "#### Data preprocessing component\n", + "This component takes `num_examples` images from `dataset_path` and feeds them through the deep CNN inceptionV3 (without the head). The model outputs a tensor of shape `(64 x 2048)` that represents (2048) features obtained after dividing the image into an 8x8 (64) grid. The resulting model outputs are stored in `OUTPUT_DIR`." ] }, { @@ -109,6 +191,7 @@ " \n", " annotation_file = dataset_path + '/annotations_trainval2014/annotations/captions_train2014.json'\n", " PATH = dataset_path + '/train2014/train2014/'\n", + " files_downloaded = tf.io.gfile.listdir(PATH)\n", " \n", " # Read the json file (CHANGE open() TO file_io.FileIO to use GCS)\n", " with file_io.FileIO(annotation_file, 'r') as f:\n", @@ -117,14 +200,17 @@ " # Store captions and image names in vectors\n", " all_captions = []\n", " all_img_name_vector = []\n", - "\n", + " \n", + " print('Determining which images are in storage...')\n", " for annot in annotations['annotations']:\n", " caption = ' ' + annot['caption'] + ' '\n", " image_id = annot['image_id']\n", - " full_coco_image_path = PATH + 'COCO_train2014_' + '%012d.jpg' % (image_id)\n", - "\n", - " all_img_name_vector.append(full_coco_image_path)\n", - " all_captions.append(caption)\n", + " img_name = 'COCO_train2014_' + '%012d.jpg' % (image_id)\n", + " full_coco_image_path = PATH + img_name\n", + " \n", + " if img_name in files_downloaded: # Only have subset\n", + " all_img_name_vector.append(full_coco_image_path)\n", + " all_captions.append(caption)\n", "\n", " # Shuffle captions and image_names together\n", " train_captions, img_name_vector = shuffle(all_captions,\n", @@ -135,6 +221,8 @@ " train_captions = train_captions[:num_examples]\n", " img_name_vector = img_name_vector[:num_examples]\n", " \n", + "\n", + " \n", " # Preprocess the images before feeding into inceptionV3\n", " def load_image(image_path):\n", " img = tf.io.read_file(image_path)\n", @@ -151,6 +239,8 @@ " image_features_extract_model = tf.keras.Model(new_input, hidden_layer)\n", " \n", " # Save extracted features in GCS\n", + " print('Extracting features from images...')\n", + " \n", " # Get unique images\n", " encode_train = sorted(set(img_name_vector))\n", " \n", @@ -196,20 +286,26 @@ }, "outputs": [], "source": [ - "TARGET_IMAGE = 'gcr.io/%s/preprocessing:latest' % PROJECT_NAME\n", "preprocessing_img_op = compiler.build_python_component(\n", " component_func=preprocess,\n", " staging_gcs_path=PIPELINE_STORAGE_PATH,\n", " base_image=BASE_IMAGE,\n", " dependency=[kfp.compiler.VersionedDependency(name='scikit-learn', version='0.21.2')],\n", - " target_image=TARGET_IMAGE)" + " target_image=PREPROCESS_IMG)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Tokenizing component" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "

Tokenizing component

" + "This component takes the training captions from the previous step and tokenizes them to convert them into numerical values so that they can be fed into the model as input. It outputs the tokenized captions in `OUTPUT_DIR`." ] }, { @@ -280,19 +376,25 @@ }, "outputs": [], "source": [ - "TARGET_IMAGE = 'gcr.io/%s/tokenizer:latest' % PROJECT_NAME\n", "tokenize_captions_op = compiler.build_python_component(\n", " component_func=tokenize_captions,\n", " staging_gcs_path=PIPELINE_STORAGE_PATH,\n", " base_image=BASE_IMAGE,\n", - " target_image=TARGET_IMAGE)" + " target_image=TOKENIZE_IMG)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Component for training model (and saving it)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "

Component for training model (and saving it)

" + "This component trains the model by creating a `tf.data.Dataset` from the captions and preprocessed images. The trained model is saved in `train_output_dir/checkpoints/`. The training loss is plotted in tensorboard. There are various parameters of the model(s) that can be tuned, but we use the default values from the original notebook. " ] }, { @@ -503,13 +605,12 @@ }, "outputs": [], "source": [ - "TARGET_IMAGE = 'gcr.io/%s/trainer:latest' % PROJECT_NAME\n", "model_train_op = compiler.build_python_component(\n", " component_func=train_model,\n", " staging_gcs_path=PIPELINE_STORAGE_PATH,\n", " base_image=BASE_IMAGE,\n", " dependency=[kfp.compiler.VersionedDependency(name='scikit-learn', version='0.21.2')],\n", - " target_image=TARGET_IMAGE)" + " target_image=TRAIN_IMG)" ] }, { @@ -519,6 +620,13 @@ "

Component for model prediction

" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This component uses the model to predict on a new image. It prints the predicted caption in the logs and outputs the attention images in tensorboard." + ] + }, { "cell_type": "code", "execution_count": null, @@ -653,7 +761,7 @@ " temp_image = np.array(img.numpy())\n", " \n", " len_result = len(result)\n", - " for l in range(len_result):\n", + " for l in range(min(len_result, 10)): # Tensorboard only supports 10 imgs\n", " temp_att = np.resize(attention_plot[l], (8, 8))\n", " plt.title(result[l])\n", " img = plt.imshow(temp_image)\n", @@ -699,27 +807,20 @@ }, "outputs": [], "source": [ - "TARGET_IMAGE = 'gcr.io/%s/predict:latest' % PROJECT_NAME\n", "predict_op = compiler.build_python_component(\n", " component_func=predict,\n", " staging_gcs_path=PIPELINE_STORAGE_PATH,\n", " base_image=BASE_IMAGE,\n", " dependency=[kfp.compiler.VersionedDependency(name='matplotlib', version='3.1.0')],\n", - " target_image=TARGET_IMAGE)" + " target_image=PREDICT_IMG)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "

Create and run pipeline

" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "

Create pipeline

" + "### Create and run pipeline\n", + "#### Create pipeline" ] }, { @@ -733,19 +834,19 @@ " description='A pipeline that trains a model to caption images'\n", ")\n", "def caption_pipeline(\n", - "dataset_path=GCS_DATASET_PATH,\n", - "num_examples=30000,\n", - "epochs=20,\n", - "training_batch_size=64,\n", - "hidden_state_size=512,\n", - "vocab_size=5000,\n", - "embedding_dim=256,\n", - "preprocessing_batch_size=16,\n", - "preprocessing_output_dir='default',\n", - "tokenizing_output_dir='default',\n", - "training_output_dir='default',\n", - "validation_output_dir='default',\n", - "): \n", + " dataset_path=GCS_DATASET_PATH,\n", + " num_examples=30000,\n", + " epochs=20,\n", + " training_batch_size=64,\n", + " hidden_state_size=512,\n", + " vocab_size=5000,\n", + " embedding_dim=256,\n", + " preprocessing_batch_size=16,\n", + " preprocessing_output_dir='default',\n", + " tokenizing_output_dir='default',\n", + " training_output_dir='default',\n", + " validation_output_dir='default',\n", + " ): \n", " \n", " preprocessing_img_task = preprocessing_img_op(\n", " dataset_path, \n", @@ -790,16 +891,7 @@ "metadata": {}, "outputs": [], "source": [ - "pipeline_func = caption_pipeline" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "pipeline_filename = pipeline_func.__name__ + '.pipeline.zip'\n", + "pipeline_filename = caption_pipeline.__name__ + '.pipeline.zip'\n", "compiler.Compiler().compile(pipeline_func, pipeline_filename)" ] }, @@ -819,7 +911,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "

Run pipeline

" + "#### Run pipeline" ] }, { @@ -834,7 +926,7 @@ " 'num_examples': 100, # Small test to make sure pipeline functions properly\n", " 'training_batch_size': 16, # has to be smaller since only training on 80 examples \n", "}\n", - "run_name = pipeline_func.__name__ + ' run'\n", + "run_name = caption_pipeline.__name__ + ' run'\n", "run_result = client.run_pipeline(experiment.id, run_name, pipeline_filename,\n", " params=arguments)" ] @@ -843,30 +935,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Model checkpoints are saved at training_output_dir, which is `[GCS_DATASET_PATH]/train/checkpoints/` by default." + "Model checkpoints are saved at `training_output_dir`, which is `[GCS_DATASET_PATH]/train/checkpoints/` by default." ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "

Steps taken to convert the original TF 2.0 notebook

\n", - "\n", - "\n", - "1. Componentize notebook to run in different steps, and not linearly.\n", - "2. Store the dataset in GCS to make it easily accessible in Kubeflow.\n", - "3. Use `file_io.FileIO()` instead of `open()` when loading files from GCS.\n", - "4. To pass multiple outputs downstream, pass them as a tuple of strings. Kubeflow converts this tuple to a string when you pass it downstream. So, you have to convert it from a string back to a tuple in the downstream component to get the multiple outputs.\n", - "5. To pass many numpy arrays to downstream components, first save them on GCS. Put the paths to the saved numpy files in a new array, and then save that array on GCS as well. Pass the path to this array to the downstream components.\n", - "6. Use `tf.io.read_file` and then `tf.image.decode_jpeg` instead of `PIL.Image` to be compatible with GCS" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] } ], "metadata": { diff --git a/samples/notebooks/image-captioning-gcp/README.md b/samples/notebooks/image-captioning-gcp/README.md index 3f41796eda4..9232cd3a83c 100644 --- a/samples/notebooks/image-captioning-gcp/README.md +++ b/samples/notebooks/image-captioning-gcp/README.md @@ -1,7 +1,14 @@ # Image Captioning TF 2.0 ## About -This notebook is an example of how to convert an existing Tensorflow notebook into a Kubeflow pipeline using jupyter notebook. Specifically, this notebook takes an example tensorflow notebook, [image captioning with attention](https://colab.sandbox.google.com/github/tensorflow/docs/blob/master/site/en/r2/tutorials/text/image_captioning.ipynb), and creates a kubeflow pipeline. +This notebook is an example of how to convert an existing Tensorflow notebook into a Kubeflow pipeline using jupyter notebook. Specifically, this notebook takes an example tensorflow notebook, [image captioning with attention](https://colab.sandbox.google.com/github/tensorflow/docs/blob/master/site/en/r2/tutorials/text/image_captioning.ipynb), and creates a kubeflow pipeline. This pipeline produces a model that can generate captions for images. + +### Example generated captions +The following example captions were created when using `num_examples = 30000`. + +![bus-output](https://user-images.githubusercontent.com/17008638/61419442-17989a00-a8b3-11e9-9ab3-a5a304ff96d0.PNG) + +![sandwich-output](https://user-images.githubusercontent.com/17008638/61419487-44e54800-a8b3-11e9-9b7f-68ccc970c10d.PNG) ## Setup @@ -14,9 +21,20 @@ This pipeline requires a GCS bucket. If you haven't already, [create a GCS buck ### Upload the notebook in the Kubeflow UI In order to run this pipeline, make sure to upload the notebook to your notebook server in the Kubeflow UI. You can clone this repo in the Jupyter notebook server by connecting to the notebook server and then selecting New > Terminal. In the terminal type `git clone https://github.com/kubeflow/pipelines.git`. -## Outputs +### Download dataset +To download the dataset, run the first few cells in the notebook. + +## Artifacts Below are some screenshots of the final pipeline and the model outputs. 
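Several of the conversion steps listed at the end of this README come down to one recurring pattern: arrays are staged on GCS and read back with `file_io` instead of `open()`. A minimal sketch of that read path, with a placeholder object path, looks like this.

```python
# Sketch: read a NumPy array back from GCS the way the pipeline components do.
from io import BytesIO

import numpy as np
from tensorflow.python.lib.io import file_io

array_path = 'gs://[YOUR-BUCKET-NAME]/ms-coco/tokenize/cap_vector.npy'  # placeholder
buf = BytesIO(file_io.read_file_to_string(array_path, binary_mode=True))
cap_vector = np.load(buf)
print(cap_vector.shape)
```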
![pipeline-screenshot](https://user-images.githubusercontent.com/17008638/61160416-41694f80-a4b4-11e9-9317-5a92f625c173.png) -![attention-screenshot](https://user-images.githubusercontent.com/17008638/61160441-59d96a00-a4b4-11e9-809b-f3df7cbe0dae.PNG) \ No newline at end of file +![attention-screenshot](https://user-images.githubusercontent.com/17008638/61160441-59d96a00-a4b4-11e9-809b-f3df7cbe0dae.PNG) + +## Steps taken to convert the original TF 2.0 notebook +1. Componentize notebook to run in different steps, and not linearly. +2. Store the dataset in GCS to make it easily accessible in Kubeflow. +3. Use `file_io.FileIO()` instead of `open()` when loading files from GCS. +4. To pass multiple outputs downstream, pass them as a tuple of strings. Kubeflow converts this tuple to a string when you pass it downstream. So, you have to convert it from a string back to a tuple in the downstream component to get the multiple outputs. +5. To pass many numpy arrays to downstream components, first save them on GCS. Put the paths to the saved numpy files in a new array, and then save that array on GCS as well. Pass the path to this array to the downstream components. +6. Use `tf.io.read_file` and then `tf.image.decode_jpeg` instead of `PIL.Image` to be compatible with GCS \ No newline at end of file diff --git a/samples/notebooks/image-captioning-gcp/src/models.py b/samples/notebooks/image-captioning-gcp/src/models.py index 73109d5e6cb..b085bd645d8 100644 --- a/samples/notebooks/image-captioning-gcp/src/models.py +++ b/samples/notebooks/image-captioning-gcp/src/models.py @@ -99,4 +99,5 @@ def call(self, x, features, hidden): return x, state, attention_weights def reset_state(self, batch_size): - return tf.zeros((batch_size, self.units)) \ No newline at end of file + return tf.zeros((batch_size, self.units)) + \ No newline at end of file From 5113fe83f35f6773707f9c63a0d34ef3106f7136 Mon Sep 17 00:00:00 2001 From: Zane Durante Date: Mon, 22 Jul 2019 13:58:07 -0700 Subject: [PATCH 07/11] Minor revisions --- .../Image Captioning TF 2.0.ipynb | 20 +++++++++++-------- .../image-captioning-gcp/src/models.py | 1 - 2 files changed, 12 insertions(+), 9 deletions(-) diff --git a/samples/notebooks/image-captioning-gcp/Image Captioning TF 2.0.ipynb b/samples/notebooks/image-captioning-gcp/Image Captioning TF 2.0.ipynb index 0cb45c59ab6..d6d751237b8 100644 --- a/samples/notebooks/image-captioning-gcp/Image Captioning TF 2.0.ipynb +++ b/samples/notebooks/image-captioning-gcp/Image Captioning TF 2.0.ipynb @@ -60,7 +60,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "First, we have to download the [MS COCO dataset](http://cocodataset.org/#download). This sample uses both the 2014 train images and 2014 train/val annotations. The following cells download a small subset (<1000 imgs) of the dataset and the annotations to the GCS bucket specified below with `GCS_DATASET_PATH`." + "First, you have to download the [MS COCO dataset](http://cocodataset.org/#download). This sample uses both the 2014 train images and 2014 train/val annotations. The following cells download a small subset (<1000 imgs) of the dataset and the annotations to the GCS bucket specified below with `GCS_DATASET_PATH`." 
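Not part of the original patch, just an editorial sketch: after the download cells have run, it can be worth confirming that the image subset and the captions file are actually visible from the bucket before launching the pipeline. This assumes `GCS_DATASET_PATH` is set as in the notebook and that the annotations were uploaded under an `annotations/` prefix; adjust both to your own layout.

```python
import tensorflow as tf

# Placeholder bucket path, following the notebook's GCS_DATASET_PATH convention.
GCS_DATASET_PATH = 'gs://[YOUR-BUCKET-NAME]/ms-coco'

# List the images copied into the bucket by the filtered `gsutil rsync` cell.
images = tf.io.gfile.glob(GCS_DATASET_PATH + '/train2014/train2014/*.jpg')
print('images in bucket:', len(images))

# Assumed annotations location; change this if you uploaded the captions elsewhere.
captions_json = GCS_DATASET_PATH + '/annotations/captions_train2014.json'
print('captions file present:', tf.io.gfile.exists(captions_json))
```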
] }, { @@ -88,8 +88,11 @@ "metadata": {}, "outputs": [], "source": [ - "# Download images (we use -x to ignore ~99% of images)\n", - "!gsutil -m rsync -x \".*0\\.jpg|.*1\\.jpg|.*2\\.jpg|.*3\\.jpg|.*4\\.jpg|.*5\\.jpg|.*6\\.jpg|.*7\\.jpg|.*8\\.jpg|.*09\\.jpg|.*19\\.jpg|.*29\\.jpg|.*39\\.jpg|.*49\\.jpg|.*59\\.jpg|.*69\\.jpg|.*79\\.jpg|.*89\\.jpg\" gs://images.cocodataset.org/train2014 {GCS_DATASET_PATH}/train2014/train2014" + "# Download images (use -x to ignore ~99% of images)\n", + "!gsutil -m rsync -x \".*0\\.jpg|.*1\\.jpg|.*2\\.jpg|.*3\\.jpg|.*4\\.jpg|.*5\\.jpg|.*6\\.jpg|.*7\\.jpg|.*8\\.jpg|.*09\\.jpg|.*19\\.jpg|.*29\\.jpg|.*39\\.jpg|.*49\\.jpg|.*59\\.jpg|.*69\\.jpg|.*79\\.jpg|.*89\\.jpg\" gs://images.cocodataset.org/train2014 {GCS_DATASET_PATH}/train2014/train2014\n", + "\n", + "# To download the entire dataset uncomment and use the following command instead\n", + "# !gsutil -m rsync gs://images.cocodataset.org/train2014 {GCS_DATASET_PATH}/train2014/train2014" ] }, { @@ -97,7 +100,7 @@ "metadata": {}, "source": [ "#### Download annotations\n", - "For some reason MS COCO blocks using `gsutil` with the annotations (GitHub issue [here](https://github.com/cocodataset/cocoapi/issues/216)). We can work around this by downloading it, and then uploading it to GCS." + "For some reason MS COCO blocks using `gsutil` with the annotations (GitHub issue [here](https://github.com/cocodataset/cocoapi/issues/216)). You can work around this by downloading it locally, and then uploading it to GCS." ] }, { @@ -394,7 +397,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "This component trains the model by creating a `tf.data.Dataset` from the captions and preprocessed images. The trained model is saved in `train_output_dir/checkpoints/`. The training loss is plotted in tensorboard. There are various parameters of the model(s) that can be tuned, but we use the default values from the original notebook. " + "This component trains the model by creating a `tf.data.Dataset` from the captions and preprocessed images. The trained model is saved in `train_output_dir/checkpoints/`. The training loss is plotted in tensorboard. There are various parameters of the model(s) that can be tuned, but it uses the values from the original notebook by default. " ] }, { @@ -624,7 +627,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "This component uses the model to predict on a new image. It prints the predicted caption in the logs and outputs the attention images in tensorboard." + "This component uses the model to predict on a new image. It prints the predicted and real caption in the logs and outputs the first 10 attention images with captions in tensorboard. (Currently Kubeflow [only supports up to 10 outputs](https://github.com/kubeflow/pipelines/issues/1641) Tensorboard)" ] }, { @@ -820,7 +823,8 @@ "metadata": {}, "source": [ "### Create and run pipeline\n", - "#### Create pipeline" + "#### Create pipeline\n", + "The pipeline parameters are specified below in the `caption pipeline` function signature. Using the value `'default'` for the output directories saves them in a subdirectory of `GCS_DATASET_PATH`. Use `use_gcp_secret('user-gcp-sa')` to give read/write permissions for the storage buckets. 
" ] }, { @@ -924,7 +928,7 @@ "arguments = {\n", " 'dataset_path': GCS_DATASET_PATH, \n", " 'num_examples': 100, # Small test to make sure pipeline functions properly\n", - " 'training_batch_size': 16, # has to be smaller since only training on 80 examples \n", + " 'training_batch_size': 16, # has to be smaller since only training on 80/100 examples \n", "}\n", "run_name = caption_pipeline.__name__ + ' run'\n", "run_result = client.run_pipeline(experiment.id, run_name, pipeline_filename,\n", diff --git a/samples/notebooks/image-captioning-gcp/src/models.py b/samples/notebooks/image-captioning-gcp/src/models.py index b085bd645d8..0496dc4af5e 100644 --- a/samples/notebooks/image-captioning-gcp/src/models.py +++ b/samples/notebooks/image-captioning-gcp/src/models.py @@ -100,4 +100,3 @@ def call(self, x, features, hidden): def reset_state(self, batch_size): return tf.zeros((batch_size, self.units)) - \ No newline at end of file From f003d0532b67f870a91306cc37a72fe97d3e276f Mon Sep 17 00:00:00 2001 From: Zane Durante Date: Tue, 23 Jul 2019 11:06:51 -0700 Subject: [PATCH 08/11] Updated with base image setup --- .../Image Captioning TF 2.0.ipynb | 5 ++++- samples/notebooks/image-captioning-gcp/README.md | 15 ++++++++++++++- 2 files changed, 18 insertions(+), 2 deletions(-) diff --git a/samples/notebooks/image-captioning-gcp/Image Captioning TF 2.0.ipynb b/samples/notebooks/image-captioning-gcp/Image Captioning TF 2.0.ipynb index d6d751237b8..816aed2b907 100644 --- a/samples/notebooks/image-captioning-gcp/Image Captioning TF 2.0.ipynb +++ b/samples/notebooks/image-captioning-gcp/Image Captioning TF 2.0.ipynb @@ -19,6 +19,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ + "### Before running notebook:\n", + "Make sure you completed the setup instructions in the README (including creating the base image).\n", + "\n", "#### Install Kubeflow pipelines\n", "Install the `kfp` package if you haven't already." ] @@ -134,7 +137,7 @@ "EXPERIMENT_NAME = 'Image Captioning'\n", "PROJECT_NAME = '[YOUR-PROJECT-NAME]' \n", "PIPELINE_STORAGE_PATH = 'gs://[YOUR-BUCKET-NAME]/ms-coco/components' # path to save pipeline component images\n", - "BASE_IMAGE = 'gcr.io/intro-to-kubeflow-1/img-cap:latest' # using image created in README instructions\n", + "BASE_IMAGE = 'gcr.io/[PROJECT-ID]/img-cap:latest' # using image created in README instructions\n", "\n", "# Target images for creating components\n", "PREPROCESS_IMG = 'gcr.io/%s/ms-coco/preprocess:latest' % PROJECT_NAME\n", diff --git a/samples/notebooks/image-captioning-gcp/README.md b/samples/notebooks/image-captioning-gcp/README.md index 9232cd3a83c..7308befa590 100644 --- a/samples/notebooks/image-captioning-gcp/README.md +++ b/samples/notebooks/image-captioning-gcp/README.md @@ -1,6 +1,6 @@ # Image Captioning TF 2.0 -## About +## Overview This notebook is an example of how to convert an existing Tensorflow notebook into a Kubeflow pipeline using jupyter notebook. Specifically, this notebook takes an example tensorflow notebook, [image captioning with attention](https://colab.sandbox.google.com/github/tensorflow/docs/blob/master/site/en/r2/tutorials/text/image_captioning.ipynb), and creates a kubeflow pipeline. This pipeline produces a model that can generate captions for images. ### Example generated captions @@ -21,6 +21,19 @@ This pipeline requires a GCS bucket. If you haven't already, [create a GCS buck ### Upload the notebook in the Kubeflow UI In order to run this pipeline, make sure to upload the notebook to your notebook server in the Kubeflow UI. 
You can clone this repo in the Jupyter notebook server by connecting to the notebook server and then selecting New > Terminal. In the terminal type `git clone https://github.com/kubeflow/pipelines.git`.
 
+### Create base image
+In order to run this pipeline, you need to first build the docker base image and upload it to a container registry. This can be done with the following commands:
+
+`git clone https://github.com/kubeflow/pipelines.git`
+
+`cd pipelines/samples/notebooks/image-captioning-gcp/src`
+
+`docker build -t img-cap .`
+
+`docker tag img-cap gcr.io/[PROJECT-ID]/img-cap:latest`
+
+`docker push gcr.io/[PROJECT-ID]/img-cap:latest`
+
 ### Download dataset
 To download the dataset, run the first few cells in the notebook.
 
From b70a0407c69ccf46db7e51cd49f5fcd0ba7512e3 Mon Sep 17 00:00:00 2001
From: Zane Durante
Date: Tue, 23 Jul 2019 11:08:56 -0700
Subject: [PATCH 09/11] nit

---
 samples/notebooks/image-captioning-gcp/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/samples/notebooks/image-captioning-gcp/README.md b/samples/notebooks/image-captioning-gcp/README.md
index 7308befa590..7f364c9f4c2 100644
--- a/samples/notebooks/image-captioning-gcp/README.md
+++ b/samples/notebooks/image-captioning-gcp/README.md
@@ -22,7 +22,7 @@ This pipeline requires a GCS bucket. If you haven't already, [create a GCS buck
 In order to run this pipeline, make sure to upload the notebook to your notebook server in the Kubeflow UI. You can clone this repo in the Jupyter notebook server by connecting to the notebook server and then selecting New > Terminal. In the terminal type `git clone https://github.com/kubeflow/pipelines.git`.
 
 ### Create base image
-In order to run this pipeline, you need to first build the docker base image and upload it to a container registry. This can be done with the following commands:
+In order to run this pipeline, you need to first build the docker base image and upload it to a container registry. 
This can be done locally with the following commands: `git clone https://github.com/kubeflow/pipelines.git` From c89251ffa676a21c919ca726c602be89e17d4306 Mon Sep 17 00:00:00 2001 From: Zane Durante Date: Tue, 23 Jul 2019 13:26:29 -0700 Subject: [PATCH 10/11] updated with suggestions --- .../image-captioning-gcp/Image Captioning TF 2.0.ipynb | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/samples/notebooks/image-captioning-gcp/Image Captioning TF 2.0.ipynb b/samples/notebooks/image-captioning-gcp/Image Captioning TF 2.0.ipynb index 816aed2b907..b4f63d7431f 100644 --- a/samples/notebooks/image-captioning-gcp/Image Captioning TF 2.0.ipynb +++ b/samples/notebooks/image-captioning-gcp/Image Captioning TF 2.0.ipynb @@ -74,7 +74,8 @@ "source": [ "# Location to download dataset and put onto GCS (should be associated\n", "# with Kubeflow project)\n", - "GCS_DATASET_PATH = 'gs://[YOUR-BUCKET-NAME]/ms-coco'" + "GCS_BUCKET = 'gs://[YOUR-BUCKET-NAME]'\n", + "GCS_DATASET_PATH = GCS_BUCKET + '/ms-coco'" ] }, { @@ -136,8 +137,8 @@ "# Kubeflow project settings\n", "EXPERIMENT_NAME = 'Image Captioning'\n", "PROJECT_NAME = '[YOUR-PROJECT-NAME]' \n", - "PIPELINE_STORAGE_PATH = 'gs://[YOUR-BUCKET-NAME]/ms-coco/components' # path to save pipeline component images\n", - "BASE_IMAGE = 'gcr.io/[PROJECT-ID]/img-cap:latest' # using image created in README instructions\n", + "PIPELINE_STORAGE_PATH = GCS_BUCKET + '/ms-coco/components' # path to save pipeline component images\n", + "BASE_IMAGE = 'gcr.io/%s/img-cap:latest' % PROJECT_NAME # using image created in README instructions\n", "\n", "# Target images for creating components\n", "PREPROCESS_IMG = 'gcr.io/%s/ms-coco/preprocess:latest' % PROJECT_NAME\n", From 1d96e96ac201a488c2f5eb470848b1f49953ed25 Mon Sep 17 00:00:00 2001 From: Zane Durante Date: Tue, 23 Jul 2019 15:31:59 -0700 Subject: [PATCH 11/11] enlarged font --- .../Image Captioning TF 2.0.ipynb | 30 +++++++++---------- 1 file changed, 15 insertions(+), 15 deletions(-) diff --git a/samples/notebooks/image-captioning-gcp/Image Captioning TF 2.0.ipynb b/samples/notebooks/image-captioning-gcp/Image Captioning TF 2.0.ipynb index b4f63d7431f..4381501cac5 100644 --- a/samples/notebooks/image-captioning-gcp/Image Captioning TF 2.0.ipynb +++ b/samples/notebooks/image-captioning-gcp/Image Captioning TF 2.0.ipynb @@ -19,10 +19,10 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Before running notebook:\n", + "## Before running notebook:\n", "Make sure you completed the setup instructions in the README (including creating the base image).\n", "\n", - "#### Install Kubeflow pipelines\n", + "### Install Kubeflow pipelines\n", "Install the `kfp` package if you haven't already." ] }, @@ -39,7 +39,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "#### Activate service account credentials\n", + "### Activate service account credentials\n", "This allows for using `gsutil` from the notebook to upload the dataset to GCS." 
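The credential-activation cell itself is not shown in this hunk; as a hedged sketch (not taken from the patch), one common pattern on a Kubeflow notebook server looks roughly like the following. The key-file path is an assumption: `GOOGLE_APPLICATION_CREDENTIALS` is normally populated when the `user-gcp-sa` secret is mounted, so substitute whatever applies to your environment.

```python
import os

# Assumed key location; on many Kubeflow setups this variable is already set
# by the mounted user-gcp-sa secret.
key_file = os.environ.get('GOOGLE_APPLICATION_CREDENTIALS', '/path/to/your-sa-key.json')

# IPython cell syntax: activate the service account so that gsutil and gcloud
# calls issued from the notebook can read and write the GCS bucket.
!gcloud auth activate-service-account --key-file={key_file}
```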
] }, @@ -56,7 +56,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Download dataset and upload to GCS " + "## Download dataset and upload to GCS " ] }, { @@ -82,7 +82,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "#### Download images\n", + "### Download images\n", "Downloads images to `${GCS_DATASET_PATH}/train2014/train2014`" ] }, @@ -103,7 +103,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "#### Download annotations\n", + "### Download annotations\n", "For some reason MS COCO blocks using `gsutil` with the annotations (GitHub issue [here](https://github.com/cocodataset/cocoapi/issues/216)). You can work around this by downloading it locally, and then uploading it to GCS." ] }, @@ -125,7 +125,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Setup project info and imports" + "## Setup project info and imports" ] }, { @@ -163,14 +163,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Create pipeline components" + "## Create pipeline components" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "#### Data preprocessing component\n", + "### Data preprocessing component\n", "This component takes `num_examples` images from `dataset_path` and feeds them through the deep CNN inceptionV3 (without the head). The model outputs a tensor of shape `(64 x 2048)` that represents (2048) features obtained after dividing the image into an 8x8 (64) grid. The resulting model outputs are stored in `OUTPUT_DIR`." ] }, @@ -305,7 +305,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "#### Tokenizing component" + "### Tokenizing component" ] }, { @@ -394,7 +394,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "#### Component for training model (and saving it)" + "### Component for training model (and saving it)" ] }, { @@ -624,7 +624,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "
<h4>Component for model prediction</h4>
" + "### Component for model prediction" ] }, { @@ -826,8 +826,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Create and run pipeline\n", - "#### Create pipeline\n", + "## Create and run pipeline\n", + "### Create pipeline\n", "The pipeline parameters are specified below in the `caption pipeline` function signature. Using the value `'default'` for the output directories saves them in a subdirectory of `GCS_DATASET_PATH`. Use `use_gcp_secret('user-gcp-sa')` to give read/write permissions for the storage buckets. " ] }, @@ -919,7 +919,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "#### Run pipeline" + "### Run pipeline" ] }, {