
Performing Exploratory Data Analysis (EDA) using PySpark, pandas, matplotlib, and scikit-learn for data manipulation, visualization, and pattern identification.


EDA for Multi-Class Prediction of Cirrhosis Outcomes (Kaggle dataset)

  • Gain domain knowledge
  • Check for missing values
  • Check for duplicates
  • Distribution of categorical features
  • Association between categorical features (chi-square test)
  • Distribution of numerical features (histograms, box plots, violin plots)
  • Correlation between numerical features
  • Transformation of numerical features and normality tests (log transformation, QuantileTransformer, Box-Cox transformation, Kolmogorov-Smirnov test, Q-Q plots)
  • Encoding values (ordinal_encoder, label_encoder, one_hot_encoding)
  • Correlation between all features
  • PCA (explained variance, cumulative variance, and loadings)
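Two of the steps above, the chi-square association test and the Kolmogorov-Smirnov normality check before and after a log transformation, can be sketched as follows. This is a minimal illustration on synthetic data, not the notebook's actual code: the helper names `chi_square_association` and `ks_normality` and the columns `cat_a`, `cat_b`, and `skewed` are hypothetical and do not come from the cirrhosis dataset.

```python
import numpy as np
import pandas as pd
from scipy import stats


def chi_square_association(df, col_a, col_b):
    """Chi-square test of independence between two categorical columns."""
    table = pd.crosstab(df[col_a], df[col_b])  # contingency table of counts
    chi2, p_value, dof, _expected = stats.chi2_contingency(table)
    return chi2, p_value, dof


def ks_normality(values):
    """Kolmogorov-Smirnov test of standardized values against N(0, 1)."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std(ddof=1)
    return stats.kstest(z, "norm")


# Synthetic stand-in data (hypothetical columns, for illustration only).
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "cat_a": rng.choice(["x", "y"], size=n),
    "skewed": rng.lognormal(mean=0.0, sigma=1.0, size=n),
})
# cat_b copies cat_a about 80% of the time, so the two are associated.
noise = rng.choice(["x", "y"], size=n)
toy["cat_b"] = np.where(rng.random(n) < 0.8, toy["cat_a"], noise)

chi2, p_assoc, dof = chi_square_association(toy, "cat_a", "cat_b")
ks_raw = ks_normality(toy["skewed"])          # right-skewed input
ks_log = ks_normality(np.log(toy["skewed"]))  # after log transformation

print(f"chi2={chi2:.1f}, dof={dof}, p={p_assoc:.3g}")
print(f"KS statistic raw={ks_raw.statistic:.3f}, log={ks_log.statistic:.3f}")
```

Because the mean and standard deviation are estimated from the same sample, the K-S p-value here is only approximate; comparing the statistic before and after a transformation, alongside a Q-Q plot, is usually the more informative reading.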

purchase_analysis_pyspark (Rahnema College)

  • Which five products have the largest difference between their Popularity Index and Return Rate?
  • Which Supplier ID has the highest percentage of sales using the Shipping Method: Standard?
  • Compare the average Shipping Cost across different Category values.
  • Which Category has the highest number of sold products with a Popularity Index between 50 and 70?
  • Report the top ten Supplier ID values with the highest net sales amount (after applying Discount and Tax Rate) along with their net sales amount.
  • In which city did individuals under 35 years old pay, on average, the highest total cost for purchasing and receiving products?
  • Which continent reported the highest Stock Level, and which reported the lowest?
  • Sort cities based on their average Shipping Cost.
  • Which Category shows the most variation in the Price of sold products?
  • What percentage of products have a Popularity Index of less than 80?
  • In which Category have suppliers (Supplier ID) applied the lowest percentage of Discount?
  • Which three products have the highest Popularity Index among the Customer Age Group above 55 years?

