This repository contains experiments on data wrangling techniques, focusing on methods for handling missing values, filtering, aggregation, and more.
Python is a high-level, interpreted programming language widely used in data science for data manipulation, analysis, and visualization. Libraries such as Pandas and NumPy provide powerful tools for data wrangling, including handling missing values, filtering, and reshaping datasets.
```
Data-Wrangling/
│
├── Experiment 1 - Handling Missing Values/
│   ├── Handling_Missing_Values.ipynb
│
├── Experiment 2 - Data Filtering/
│   ├── Data_Filtering.ipynb
│   ├── Experiment 2 Document.docx
│
├── Experiment 3 - Data Aggregation/
│   ├── Data_Aggregation.ipynb
│   ├── Experiment 3 Document.docx
│
├── Experiment 4 - Data Concatenation/
│   ├── Data_Concatenation.ipynb
│
├── Experiment 5 - Data Reshaping/
│   ├── Data_Reshaping.ipynb
│
├── Experiment 6 - Data Sampling/
│   ├── Data_Sampling.ipynb
│
├── Experiment 7 - Data Conversion/
│   ├── Data_Conversion.ipynb
│
└── README.md
```
1. Handling Missing Values
Identify and fill missing values in a dataset using methods such as mean imputation or forward/backward filling to ensure data completeness and accuracy.
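A minimal pandas sketch of these strategies; the column names are illustrative, not taken from the notebooks:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"temperature": [21.0, np.nan, 23.5, np.nan, 25.1]})

# Mean imputation: replace NaN with the column mean
df["mean_filled"] = df["temperature"].fillna(df["temperature"].mean())

# Forward fill: propagate the last observed value downward
df["ffilled"] = df["temperature"].ffill()

# Backward fill: propagate the next observed value upward
df["bfilled"] = df["temperature"].bfill()

print(df)
```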
2. Data Filtering
Filter rows or columns based on specified criteria, such as removing outliers or selecting data within a certain range, to refine datasets for analysis.
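A short pandas sketch of boolean-mask and query-based filtering; the DataFrame and thresholds are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"score": [55, 72, 98, 12, 85],
                   "group": ["A", "B", "A", "B", "A"]})

# Boolean-mask filtering: keep rows whose score falls within a range
in_range = df[(df["score"] >= 50) & (df["score"] <= 90)]

# query() expresses the same kind of filter as a readable string
group_a = df.query("group == 'A' and score > 60")

print(in_range)
print(group_a)
```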
3. Data Aggregation
Aggregate data by grouping rows based on specific attributes and computing summary statistics, such as mean, median, count, or sum. This helps to summarize large datasets for easier analysis.
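A quick groupby sketch in pandas; the city/sales data is illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Pune", "Pune", "Mumbai", "Mumbai"],
    "sales": [100, 150, 200, 250],
})

# Group rows by city and compute several summary statistics at once
summary = df.groupby("city")["sales"].agg(["mean", "median", "count", "sum"])
print(summary)
```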
4. Data Concatenation
Concatenate multiple datasets either along rows or columns to create a unified dataset. This method is useful when merging datasets from different sources or appending new data to an existing dataset.
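A minimal sketch of row-wise and column-wise concatenation with `pd.concat`; the quarterly DataFrames are assumptions for illustration:

```python
import pandas as pd

q1 = pd.DataFrame({"id": [1, 2], "sales": [10, 20]})
q2 = pd.DataFrame({"id": [3, 4], "sales": [30, 40]})

# Row-wise concatenation (axis=0): append new records
combined_rows = pd.concat([q1, q2], ignore_index=True)

# Column-wise concatenation (axis=1): place datasets side by side
extra = pd.DataFrame({"region": ["N", "S"]})
combined_cols = pd.concat([q1, extra], axis=1)

print(combined_rows)
print(combined_cols)
```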
5. Data Reshaping
Reshape data by pivoting, stacking, or unstacking to convert between wide and long formats. This technique allows for better organization and analysis of data with multiple variables.
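A short sketch of converting long data to wide with `pivot` and back with `melt`; the column names are illustrative:

```python
import pandas as pd

long_df = pd.DataFrame({
    "date": ["2024-01", "2024-01", "2024-02", "2024-02"],
    "metric": ["sales", "cost", "sales", "cost"],
    "value": [100, 60, 120, 70],
})

# Long -> wide: pivot each metric into its own column
wide = long_df.pivot(index="date", columns="metric", values="value")

# Wide -> long: melt column names back into a variable column
back_to_long = wide.reset_index().melt(id_vars="date",
                                       var_name="metric",
                                       value_name="value")

print(wide)
print(back_to_long)
```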
6. Data Sampling
Randomly sample rows or columns from a dataset to create a smaller subset for analysis. Sampling is useful for exploratory data analysis, testing models, or handling large datasets efficiently.
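A minimal sampling sketch with `DataFrame.sample`; `random_state` is set only to make the draw reproducible:

```python
import pandas as pd

df = pd.DataFrame({"value": range(100)})

# Sample a fixed number of random rows
subset = df.sample(n=10, random_state=42)

# Or sample a fraction of the data (here 5%)
frac_subset = df.sample(frac=0.05, random_state=42)

print(subset.shape, frac_subset.shape)
```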
7. Data Conversion
Convert data types of columns, such as changing categorical variables to numerical representations or converting numerical values into categories, enabling better processing and analysis of the data.
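A short sketch of both directions of conversion in pandas; the column names and bin edges are illustrative assumptions:

```python
import pandas as pd

df = pd.DataFrame({"grade": ["low", "high", "medium", "low"],
                   "age": [23, 35, 41, 52]})

# Categorical -> numerical: encode categories as integer codes
df["grade_code"] = df["grade"].astype("category").cat.codes

# Numerical -> categorical: bin ages into labeled ranges
df["age_band"] = pd.cut(df["age"], bins=[0, 30, 45, 100],
                        labels=["young", "mid", "senior"])

print(df.dtypes)
print(df)
```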
8. Text Cleaning
Clean and preprocess text data by removing punctuation and stopwords and performing tokenization. This standardizes the text, making it ready for further analysis such as natural language processing (NLP) or text mining. Tokenization splits text into words or phrases, which can then be analyzed or converted into numerical representations for machine learning models.
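A minimal, dependency-free sketch of these steps; the stopword list here is a tiny illustrative set, whereas real pipelines typically use NLTK's or spaCy's lists:

```python
import re
import string

# Tiny illustrative stopword set (an assumption, not a standard list)
STOPWORDS = {"the", "is", "a", "and", "for", "of"}

def clean_and_tokenize(text):
    """Lowercase, strip punctuation, tokenize on whitespace, drop stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = re.split(r"\s+", text.strip())
    return [t for t in tokens if t and t not in STOPWORDS]

print(clean_and_tokenize("The quick, brown fox is ready for NLP!"))
# ['quick', 'brown', 'fox', 'ready', 'nlp']
```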
9. Datetime Operations
Extract date or time components from datetime columns and perform operations such as calculating time differences or aggregating data by time intervals. This allows for efficient analysis of time series data and helps in understanding trends over different time periods. Techniques include extracting year, month, and day, and calculating durations between timestamps.
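A short pandas sketch using the `.dt` accessor; the timestamps and the monthly grouping are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-05 08:00",
                                 "2024-01-20 14:30",
                                 "2024-02-03 09:15"]),
    "orders": [3, 5, 2],
})

# Extract components via the .dt accessor
df["year"] = df["timestamp"].dt.year
df["month"] = df["timestamp"].dt.month
df["day"] = df["timestamp"].dt.day

# Duration between consecutive timestamps (a timedelta column)
df["gap"] = df["timestamp"].diff()

# Aggregate by month by grouping on a monthly period
monthly = df.groupby(df["timestamp"].dt.to_period("M"))["orders"].sum()

print(df)
print(monthly)
```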
10. Data Merging
Merge two or more datasets based on common keys or indices to combine information from different sources. This process is essential for creating comprehensive datasets that capture all relevant data points across different tables. Techniques include inner joins, outer joins, left joins, and right joins to ensure that data relationships are properly maintained during the merging process.
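A minimal sketch of the common join types with `DataFrame.merge`; the key column `cust_id` and both tables are made up for illustration:

```python
import pandas as pd

customers = pd.DataFrame({"cust_id": [1, 2, 3],
                          "name": ["Ana", "Ben", "Cal"]})
orders = pd.DataFrame({"cust_id": [2, 3, 4],
                       "amount": [250, 100, 75]})

# Inner join: keep only keys present in both tables
inner = customers.merge(orders, on="cust_id", how="inner")

# Left join: keep all customers; missing order info becomes NaN
left = customers.merge(orders, on="cust_id", how="left")

# Outer join: union of keys from both tables
outer = customers.merge(orders, on="cust_id", how="outer")

print(inner, left, outer, sep="\n\n")
```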
- Drop a 🌟 if you find this repository useful.
- If you have any doubts or suggestions, feel free to reach out.
- 📫 Contribute and discuss: feel free to open issues 🐛, submit pull requests 🛠️, or start discussions 💬 to help improve this repository!