Convert reading_level into dummy variables #43
Walkthrough
The changes in this pull request focus on modifying the data preprocessing logic within the
Changes
Assessment against linked issues
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (1)
pmml/step1_prepare/step1_2_preprocess_data.py (1)
49-54: Add documentation for the dummy variable format
Consider adding a comment block explaining:
- The structure of the resulting dummy variables
- The meaning of each binary column
- The expected values in each column
+# Convert reading_level into dummy variables (one-hot encoding)
+# Example: reading_level '2' becomes:
+#   reading_level_1: 0
+#   reading_level_2: 1
+#   reading_level_3: 0
+#   reading_level_4: 0
+#   reading_level_5: 0
 storybooks_dataframe = pandas.concat(
# Convert 'reading_level' into dummy variables
storybooks_dataframe = pandas.concat(
    [storybooks_dataframe.drop(columns=['reading_level']),  # Drop the original reading_level column
     pandas.get_dummies(storybooks_dataframe['reading_level'], prefix='reading_level')],  # Add dummy columns
    axis=1
)
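To make the requested documentation concrete, here is a minimal sketch of what the conversion above produces. The sample rows and the `word_count` column are hypothetical and are not taken from the project's CSV files.

```python
import pandas

# Hypothetical sample data; the real storybooks CSV has different columns and rows.
storybooks_dataframe = pandas.DataFrame({
    'word_count': [12, 85, 240],
    'reading_level': ['1', '2', '4'],
})

# One binary column per reading level that actually occurs in the data
dummies = pandas.get_dummies(storybooks_dataframe['reading_level'], prefix='reading_level')

# Drop the original column and append the dummy columns
encoded_dataframe = pandas.concat(
    [storybooks_dataframe.drop(columns=['reading_level']), dummies],
    axis=1
)

print(list(encoded_dataframe.columns))
```

With this sample, only `reading_level_1`, `reading_level_2`, and `reading_level_4` are created, because `get_dummies` emits a column only for values present in the input. Each row has exactly one of these columns set.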
🛠️ Refactor suggestion
Add validation and ensure consistent dummy variables across runs
While the implementation correctly converts reading levels to dummy variables, there are several improvements needed for robustness:
- Validate reading levels before conversion
- Ensure consistent dummy columns across different runs
- Consider memory efficiency
Consider this improved implementation:
 # Convert 'reading_level' into dummy variables
+# Validate reading levels
+valid_levels = ['1', '2', '3', '4', '5']  # Define expected levels
+invalid_levels = set(storybooks_dataframe['reading_level']) - set(valid_levels)
+if invalid_levels:
+    raise ValueError(f"Unexpected reading levels found: {invalid_levels}")
+
+# Create dummy variables with explicit columns to ensure consistency
 storybooks_dataframe = pandas.concat(
-    [storybooks_dataframe.drop(columns=['reading_level']),  # Drop the original reading_level column
-     pandas.get_dummies(storybooks_dataframe['reading_level'], prefix='reading_level')],  # Add dummy columns
+    [storybooks_dataframe.drop(columns=['reading_level']),
+     pandas.get_dummies(
+         storybooks_dataframe['reading_level'],
+         prefix='reading_level',
+         columns=valid_levels
+     )],
     axis=1
 )
+
+# Ensure all expected dummy columns exist
+expected_columns = [f'reading_level_{level}' for level in valid_levels]
+missing_columns = set(expected_columns) - set(storybooks_dataframe.columns)
+for col in missing_columns:
+    storybooks_dataframe[col] = 0
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
# Convert 'reading_level' into dummy variables
storybooks_dataframe = pandas.concat(
    [storybooks_dataframe.drop(columns=['reading_level']),  # Drop the original reading_level column
     pandas.get_dummies(storybooks_dataframe['reading_level'], prefix='reading_level')],  # Add dummy columns
    axis=1
)

# Convert 'reading_level' into dummy variables
# Validate reading levels
valid_levels = ['1', '2', '3', '4', '5']  # Define expected levels
invalid_levels = set(storybooks_dataframe['reading_level']) - set(valid_levels)
if invalid_levels:
    raise ValueError(f"Unexpected reading levels found: {invalid_levels}")
# Create dummy variables with explicit columns to ensure consistency
storybooks_dataframe = pandas.concat(
    [storybooks_dataframe.drop(columns=['reading_level']),
     pandas.get_dummies(
         storybooks_dataframe['reading_level'],
         prefix='reading_level',
         columns=valid_levels
     )],
    axis=1
)
# Ensure all expected dummy columns exist
expected_columns = [f'reading_level_{level}' for level in valid_levels]
missing_columns = set(expected_columns) - set(storybooks_dataframe.columns)
for col in missing_columns:
    storybooks_dataframe[col] = 0
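One caveat on the suggestion above: as far as the pandas documentation describes it, the `columns` argument of `get_dummies` names DataFrame columns to encode, so it appears to have no effect when a Series is passed. An alternative sketch for getting a fixed set of dummy columns is to cast the column to a categorical dtype with explicit categories first. The sample data, the `word_count` column, and the cast to `str` (in case the levels are read from CSV as integers) are assumptions, not part of this PR.

```python
import pandas

# Assumed set of expected levels, as in the suggestion above
valid_levels = ['1', '2', '3', '4', '5']

# Hypothetical sample data; levels 3 and 5 are deliberately absent
storybooks_dataframe = pandas.DataFrame({
    'word_count': [12, 85, 240],
    'reading_level': ['1', '2', '4'],
})

# Assumption: normalize to strings so integer levels from a CSV also match
levels = storybooks_dataframe['reading_level'].astype(str)

# Validate before encoding; unknown values would otherwise become NaN
invalid_levels = set(levels) - set(valid_levels)
if invalid_levels:
    raise ValueError(f"Unexpected reading levels found: {invalid_levels}")

# A categorical dtype with fixed categories yields one dummy column per category,
# including categories that do not occur in the data
categorical_levels = levels.astype(pandas.CategoricalDtype(categories=valid_levels))
dummies = pandas.get_dummies(categorical_levels, prefix='reading_level')

storybooks_dataframe = pandas.concat(
    [storybooks_dataframe.drop(columns=['reading_level']), dummies],
    axis=1
)

print(list(storybooks_dataframe.columns))
```

With this approach the dummy columns come out in category order on every run, so a separate loop that back-fills missing columns should not be needed.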
💡 Codebase verification
Dummy variable conversion is incompatible with downstream usage
The code analysis reveals a critical issue: The preprocessing step converts reading_level into dummy variables, but downstream files expect it as a single column:
- Training scripts (step2_1, step2_2) try to select reading_level as a single target column
- Validation script (step3_2) directly accesses reading_level for computing mean absolute error
- The dummy variable conversion will break the pipeline as the column name 'reading_level' won't exist after preprocessing
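The incompatibility described above can be illustrated with a small sketch. It shows that the `reading_level` column is gone after encoding, and one possible way a downstream script could rebuild a single column from the dummies. The sample data and the `word_count` column are hypothetical, and whether the project should recover the column this way (rather than keeping `reading_level` as a single target column) is a design decision this sketch does not settle.

```python
import pandas

# Hypothetical output of the preprocessing step after dummy encoding
encoded_dataframe = pandas.DataFrame({
    'word_count': [12, 85, 240],
    'reading_level_1': [1, 0, 0],
    'reading_level_2': [0, 1, 0],
    'reading_level_4': [0, 0, 1],
})

# Downstream code that selects 'reading_level' would fail with a KeyError
has_target_column = 'reading_level' in encoded_dataframe.columns
print(has_target_column)

# One possible recovery: take the name of the column holding the 1 in each row,
# then strip the prefix to get the level back as an integer
dummy_columns = [
    column for column in encoded_dataframe.columns
    if column.startswith('reading_level_')
]
reading_level = (
    encoded_dataframe[dummy_columns]
    .idxmax(axis=1)
    .str.replace('reading_level_', '', regex=False)
    .astype(int)
)

print(reading_level.tolist())
```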
🔗 Analysis chain
Verify data distribution and downstream impact
Please ensure that the dummy variable conversion preserves the data distribution and doesn't impact downstream analysis.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check reading level distribution and downstream dependencies
# Check reading level distribution in input file
echo "Reading level distribution in input:"
cat step1_1_storybooks.csv | awk -F',' '{print $2}' | sort | uniq -c
# Check for downstream dependencies
echo "Files potentially affected by reading_level changes:"
rg -l "reading_level" --type py
Length of output: 640
Script:
#!/bin/bash
# Let's analyze how reading_level is used in downstream files
# Check reading_level usage in preprocessing
echo "=== Preprocessing (step1_2) ==="
rg "reading_level" pmml/step1_prepare/step1_2_preprocess_data.py -B 2 -A 2
# Check how reading_level is used in training
echo -e "\n=== Training (step2_1) ==="
rg "reading_level" pmml/step2_train/step2_1_train_model_pkl.py -B 2 -A 2
echo -e "\n=== Training PMML (step2_2) ==="
rg "reading_level" pmml/step2_train/step2_2_train_model_pmml.py -B 2 -A 2
# Check prediction and validation usage
echo -e "\n=== Prediction (step3_1) ==="
rg "reading_level" pmml/step3_predict/step3_1_predict.py -B 2 -A 2
echo -e "\n=== Validation (step3_2) ==="
rg "reading_level" pmml/step3_predict/step3_2_validate.py -B 2 -A 2
Length of output: 4602
# Convert 'reading_level' into dummy variables
storybooks_dataframe = pandas.concat(
    [storybooks_dataframe.drop(columns=['reading_level']),  # Drop the original reading_level column
     pandas.get_dummies(storybooks_dataframe['reading_level'], prefix='reading_level')],  # Add dummy columns
    axis=1
)
@zakroum-hicham If you run your updated code, the CSV data output should be modified as well, but this PR didn't seem to contain any modifications to the CSV files?
https://github.com/elimu-ai/ml-storybook-reading-level?tab=readme-ov-file#run
Issue Number
reading_level variable into 0/1 variables #39
Purpose
reading_level column into dummy variables (binary columns) using pandas.get_dummies, while retaining all other columns in the DataFrame.
Technical Details
Testing Instructions
Screenshots
Summary by CodeRabbit
New Features
reading_level column into dummy variables for improved analysis.
Bug Fixes