This guide outlines the steps to create a Cloud Function that downloads a dataset from Kaggle, processes it, and uploads it to Google Cloud Storage.
- Google Cloud Platform account with billing enabled
- Kaggle account with API access
- Google Cloud Shell
All commands below are run in the Cloud Shell terminal.
First, set up the necessary environment variables in Cloud Shell:
> export PROJECT_ID=steam-link-443214-f2
> export REGION=us-central1
> export BUCKET=dataproc-composer
> export STAGING_BUCKET=dataproc-composer-staging
> export PHS_CLUSTER_NAME=phs-my
export
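As a quick sanity check before continuing, the variables can be verified from Python. This is a minimal sketch: the helper name `missing_env_vars` and the `REQUIRED_VARS` list are illustrative assumptions, and the list only includes names that appear in the export commands above.

```python
import os

# Names taken from the export commands above (assumed to be the ones that matter;
# the Cloud Function code itself only reads PROJECT_ID and BUCKET).
REQUIRED_VARS = ["PROJECT_ID", "REGION", "BUCKET"]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars(REQUIRED_VARS)
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

Passing the environment as a parameter keeps the helper easy to test without touching the real shell environment.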
- Go to Kaggle Settings
- Click on "Create New API Token" to download kaggle.json
- Create a new kaggle.json file in Cloud Shell
> nano kaggle.json
Paste the contents of the kaggle.json file downloaded from the Kaggle website into the kaggle.json file in Cloud Shell.
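Because the Cloud Function later reads the `username` and `key` fields from this file, it can help to confirm that the pasted content is valid JSON with both fields before storing it as a secret. The sketch below is one possible check; the function name `validate_kaggle_credentials` and the sample values are hypothetical.

```python
import json

# The two fields the Cloud Function reads from the stored credentials.
REQUIRED_KEYS = ("username", "key")


def validate_kaggle_credentials(text):
    """Parse kaggle.json content and return it if both expected fields are present."""
    creds = json.loads(text)  # raises ValueError (JSONDecodeError) on malformed JSON
    if not isinstance(creds, dict):
        raise ValueError("kaggle.json must contain a JSON object")
    missing = [k for k in REQUIRED_KEYS if not creds.get(k)]
    if missing:
        raise ValueError(f"kaggle.json is missing: {', '.join(missing)}")
    return creds


if __name__ == "__main__":
    # Made-up sample content; replace with the text of your own kaggle.json.
    sample = '{"username": "example-user", "key": "example-key"}'
    print(sorted(validate_kaggle_credentials(sample)))
```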
> gcloud services enable \ \ \ \ \ \
- Create a secret for Kaggle credentials
> gcloud secrets create kaggle-json --data-file=kaggle.json
- Grant access to the service account:
> gcloud projects add-iam-policy-binding ${PROJECT_ID} \
--member=serviceAccount:${SERVICE_ACCOUNT_EMAIL} \
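The function later looks up this secret by its full resource name, `projects/<project>/secrets/<name>/versions/latest`. The sketch below shows how that name is assembled and why a missing `PROJECT_ID` is worth catching early: with `os.getenv` returning `None`, the f-string would silently produce `projects/None/...`. The helper name `secret_version_path` is an assumption for illustration.

```python
def secret_version_path(project_id, secret_name, version="latest"):
    """Build the Secret Manager resource name for a given secret version."""
    if not project_id:
        # Without this check, a missing project ID would be formatted as "None".
        raise ValueError("project_id is required (is PROJECT_ID set?)")
    return f"projects/{project_id}/secrets/{secret_name}/versions/{version}"


if __name__ == "__main__":
    # "my-project" is a placeholder project ID, not the one used in this guide.
    print(secret_version_path("my-project", "kaggle-json"))
```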
- Create a directory for the Cloud Function
> mkdir -p cf_scripts
- Create requirements.txt
> nano cf_scripts/requirements.txt
- Add the following dependencies:
- Create
> nano cf_scripts/
- Add the Python code for the Cloud Function (as below)
import os
import json

import pandas as pd
import functions_framework
from flask import jsonify
from google.cloud import secretmanager
from google.cloud import storage


def get_secret(secret_name):
    """Return the latest version of a Secret Manager secret as text."""
    client = secretmanager.SecretManagerServiceClient()
    project_id = os.getenv('PROJECT_ID')
    secret_name_path = f"projects/{project_id}/secrets/{secret_name}/versions/latest"
    response = client.access_secret_version(name=secret_name_path)
    return response.payload.data.decode('UTF-8')


@functions_framework.http
def download_and_upload(request):
    """HTTP entry point: download the dataset, rename its headers, upload it to GCS."""
    try:
        # Validate environment variables
        bucket_name = os.getenv('BUCKET')
        if not bucket_name:
            return jsonify({'error': 'BUCKET environment variable not set'}), 500

        dataset_name = 'karomatovdovudkhon/co2-emissions-canada'
        local_zip_path = '/tmp/'
        csv_filename = 'CO2 Emissions_Canada.csv'
        local_csv_path = os.path.join('/tmp', csv_filename)
        gcs_path = 'dataset/co2_emissions_canada.csv'

        # Define new header names
        new_headers = [
            "make", "model", "vehicle_class", "engine_size", "cylinders", "transmission",
            "fuel_type", "fuel_consumption_city", "fuel_consumption_hwy", "fuel_consumption_comb_Lkm",
            "fuel_consumption_comb_mpg", "co2_emissions"
        ]

        # Get Kaggle credentials
        try:
            kaggle_creds = json.loads(get_secret("kaggle-json"))
            os.environ['KAGGLE_USERNAME'] = kaggle_creds['username']
            os.environ['KAGGLE_KEY'] = kaggle_creds['key']
            # Import kaggle after setting credentials (the package authenticates on import)
            import kaggle
        except Exception as e:
            return jsonify({'error': f'Failed to get Kaggle credentials: {str(e)}'}), 500

        # Download from Kaggle
        try:
            print("Starting dataset download...")
            kaggle.api.dataset_download_files(dataset_name, path='/tmp', unzip=True)
            print("Dataset downloaded successfully")

            # List files in /tmp for debugging, then check that the CSV was extracted
            print("Files in /tmp:", os.listdir('/tmp'))
            if not os.path.exists(local_csv_path):
                return jsonify({'error': f'{csv_filename} not found in {local_zip_path} after extraction'}), 500

            # Read the CSV and update headers
            df = pd.read_csv(local_csv_path)

            # Check that the column count matches the new header list
            if len(df.columns) != len(new_headers):
                return jsonify({'error': f"CSV file has {len(df.columns)} columns, expected {len(new_headers)}"}), 500

            # Rename the columns
            df.columns = new_headers

            # Save the modified CSV
            df.to_csv(local_csv_path, index=False)
            print(f"Headers updated in {local_csv_path}")
        except Exception as e:
            return jsonify({'error': f'Failed to download or modify the CSV: {str(e)}'}), 500

        # Upload to GCS
        try:
            client = storage.Client()
            bucket = client.get_bucket(bucket_name)
            blob = bucket.blob(gcs_path)

            # Upload the modified CSV file
            blob.upload_from_filename(local_csv_path)
            print(f"Dataset uploaded to gs://{bucket_name}/{gcs_path}")
        except Exception as e:
            return jsonify({'error': f'Failed to upload to GCS: {str(e)}'}), 500

        return jsonify({
            'status': 'success',
            'message': 'Download, modification, and upload completed successfully',
            'destination': f'gs://{bucket_name}/{gcs_path}'
        }), 200

    except Exception as e:
        return jsonify({'error': f'Unexpected error: {str(e)}'}), 500
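The header-renaming step can be tried in isolation before deploying. The sketch below reproduces the same logic (check the column count, then replace the header row), but it swaps pandas for the standard library `csv` module so that it runs without extra dependencies. The function name `rename_headers` and the sample row are made up for illustration.

```python
import csv
import io


def rename_headers(csv_text, new_headers):
    """Return csv_text with its header row replaced by new_headers."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("CSV is empty")
    if len(rows[0]) != len(new_headers):
        # Same guard as in the Cloud Function: refuse to rename on a mismatch.
        raise ValueError(
            f"CSV file has {len(rows[0])} columns, expected {len(new_headers)}"
        )
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(new_headers)   # new header row
    writer.writerows(rows[1:])     # data rows are copied unchanged
    return out.getvalue()


if __name__ == "__main__":
    # Two-column sample with placeholder values, not real dataset rows.
    sample = "Make,Model\nExampleMake,ExampleModel\n"
    print(rename_headers(sample, ["make", "model"]))
```

Checking the column count first matters because assigning a header list of the wrong length would otherwise either fail late or mislabel columns.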
- Deploy the function with the following configuration:
> gcloud functions deploy download_and_upload \
--runtime python310 \
--trigger-http \
--allow-unauthenticated \
--region ${REGION} \
--source ./cf_scripts \
--timeout=540s \
Deployment takes roughly 30 seconds to 1 minute. Once the function is deployed, you can trigger it as follows:
> curl "https://${REGION}-${PROJECT_ID}${BUCKET}"
If you open the link directly in your browser, you will see the function's JSON response.
When you check your Cloud Storage bucket, you will see the uploaded dataset at dataset/co2_emissions_canada.csv.
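The JSON returned by the function uses the `status`, `message`, and `destination` keys on success and an `error` key on failure, as defined in the function code. A small helper like the one below could turn a response into a one-line summary in a script. This is a sketch only: `summarize_response` is a hypothetical name and `my-bucket` is a placeholder bucket.

```python
import json


def summarize_response(status_code, body):
    """Summarize the Cloud Function's JSON response as a single line of text."""
    payload = json.loads(body)
    if status_code == 200 and payload.get("status") == "success":
        return f"OK: {payload['destination']}"
    # Error responses from the function carry an 'error' key.
    return f"FAILED ({status_code}): {payload.get('error', 'unknown error')}"


if __name__ == "__main__":
    ok_body = json.dumps({
        "status": "success",
        "message": "Download, modification, and upload completed successfully",
        "destination": "gs://my-bucket/dataset/co2_emissions_canada.csv",
    })
    print(summarize_response(200, ok_body))
    print(summarize_response(500, '{"error": "BUCKET environment variable not set"}'))
```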
The Cloud Function performs the following operations:
- Authenticates with Kaggle using stored credentials
- Downloads the specified dataset
- Updates the CSV headers to standardized names
- Uploads the processed file to Google Cloud Storage
You can monitor the function's execution using Cloud Logging:
> gcloud functions logs read download_and_upload
