A Python tool for detecting and anonymizing privacy-sensitive information in open-ended Dutch survey responses or other open answers.
SEAA helps identify and anonymize potentially privacy-sensitive information in text responses, and is particularly useful for processing survey data; any CSV file with open answers can be processed. It uses dictionary-based matching, refined through user interaction, to:
- Detect unknown words that might contain private information
- Flag known privacy-sensitive terms (names, medical conditions, etc.)
- Replace sensitive information with category markers (e.g., [NAME], [ILLNESS])
- Allow users to expand the whitelist/blacklist of words through interactive review
- Feed user decisions back into the dictionaries for use in future analyses

Originally developed for National Student Survey (NSE) data, the tool has since been extended to handle other CSV files with open answers.
NOTE: this tool can only be used for Dutch text.
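The core matching idea, in miniature: each word in an answer is looked up in the dictionaries, flagged words are replaced by their category marker, and words found in no dictionary are marked as unknown. A minimal sketch with toy word lists and a hypothetical `censor` helper, not the tool's actual code:

```python
import re

# Toy dictionaries; the real tool loads these from files in dict/
blacklist = {"peter": "[NAME]", "astma": "[ILLNESS]"}
known_words = {"mijn", "docent", "heeft", "mij", "geholpen"}

def censor(answer: str) -> tuple[str, list[str], list[str]]:
    """Replace blacklisted words with category markers and
    collect words that appear in neither list as unknown."""
    flagged, unknown, censored = [], [], []
    for word in re.findall(r"\w+", answer.lower()):
        if word in blacklist:
            flagged.append(word)
            censored.append(blacklist[word])
        elif word not in known_words:
            unknown.append(word)
            censored.append("[UNKOWN]")  # marker spelling as in SEAA output
        else:
            censored.append(word)
    return " ".join(censored), flagged, unknown

text, flagged, unknown = censor("Mijn docent Peter heeft mij geholpen")
```

Here `censor` returns the censored text plus the flagged and unknown word lists, mirroring the columns SEAA writes to its output file.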
```mermaid
%%{init: {'sequence': {'theme': 'hand'}}}%%
sequenceDiagram
    participant Input as Input Files
    participant Trans as NSE Transform
    participant SEAA as SEAA Process
    participant Dict as Dictionaries
    participant User as User Review
    participant Out as Output Files

    alt NSE Data
        Input->>Trans: Raw NSE CSV
        Trans->>Trans: Transform wide to long format
        Note over Trans: Convert:<br/>Q1 Q2 Q3<br/>to<br/>Answer Question_id
        Trans->>SEAA: nse_transformed.csv
    else Regular Data
        Input->>SEAA: Standard CSV
    end

    activate SEAA
    SEAA->>SEAA: Load & Clean Text
    loop Word Check
        SEAA->>Dict: Check against dictionaries
        Dict-->>SEAA: Return matches
    end
    SEAA->>Out: Write SEAA_output.csv
    SEAA->>Out: Write unknown_words.csv
    deactivate SEAA

    loop For each unknown word
        Out->>User: Present word
        User->>Dict: Add to whitelist/blacklist
    end
    Dict->>Dict: Update dictionaries
```
- Clone the repository:

```shell
git clone https://github.com/uashogeschoolutrecht/SEAA.git
cd SEAA
```

- Install the required dependencies:

```shell
pip install -r requirements.txt
```
Your input CSV file must:
- Use a semicolon (`;`) as the separator
- Contain these columns, in this order:
  - `respondent_id`: Unique identifier for each respondent
  - `Answer`: The text response to analyze
  - `question_id`: Identifier for the question being answered
Example input CSV format:

```
respondent_id;Answer;question_id
1001;"Mijn docent Peter heeft mij enorm geholpen";Q1
1002;"Ik had moeite met concentratie tijdens de lessen";Q1
```
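Such a file can be read with pandas, remembering the semicolon separator (a sketch; SEAA itself handles the loading):

```python
import io

import pandas as pd

# Inline stand-in for an input file in the documented format
csv_text = (
    "respondent_id;Answer;question_id\n"
    '1001;"Mijn docent Peter heeft mij enorm geholpen";Q1\n'
    '1002;"Ik had moeite met concentratie tijdens de lessen";Q1\n'
)

# SEAA expects a semicolon as the separator
df = pd.read_csv(io.StringIO(csv_text), sep=";")
```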
- Place your input CSV file in your working directory
- Update the path and filename in your script:
```python
# Set your file path and name
path = r'C:\Your\Path\Here'
input_file = 'your_input_file.csv'

# Run the main function
main(path, input_file=input_file)
```
If you're processing NSE data, use:

```python
path = r'C:\Your\Path\Here'
transform_nse = "nse2023.csv"  # Your NSE file
input_file = None

main(path, transform_nse=transform_nse, input_file=input_file)
```
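The wide-to-long step that the NSE transform performs (one column per question becoming one row per answer, as in the diagram above) can be approximated with pandas `melt`; SEAA's actual transform may differ in details such as column names:

```python
import pandas as pd

# Wide NSE-style data: one column per question
wide = pd.DataFrame({
    "respondent_id": [1001, 1002],
    "Q1": ["antwoord a", "antwoord b"],
    "Q2": ["antwoord c", "antwoord d"],
})

# Melt question columns into (respondent_id, question_id, Answer) rows
long = wide.melt(
    id_vars="respondent_id",
    var_name="question_id",
    value_name="Answer",
)
```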
The tool generates several output files:

- `SEAA_output.csv`: Main analysis results containing:
  - Original text
  - Censored text
  - Privacy flags
  - Detected sensitive words
- `avg_words_count.csv`: List of unknown words for review
- Updated dictionary files in the `dict/` folder:
  - `whitelist.txt`: Safe words
  - `blacklist.txt`: Privacy-sensitive words
The `SEAA_output.csv` contains the following columns:

- `respondent_id`: Original respondent identifier
- `Answer`: Original text response
- `question_id`: Original question identifier
- `answer_clean`: Cleaned version of the text (lowercase, normalized)
- `contains_privacy`: Binary flag (1/0) indicating whether privacy-sensitive content was detected
- `unknown_words`: Words not found in the dictionary or whitelist
- `flagged_words`: Words matched against the privacy-sensitive dictionaries
- `answer_censored`: Text with privacy-sensitive words replaced by category markers (e.g., [NAME], [ILLNESS]) and unknown words replaced by [UNKOWN]
- `total_word_count`: Total number of words in the response
- `unknown_word_count`: Number of words not found in dictionaries (still to be reviewed)
- `flagged_word_count`: Number of privacy-sensitive words detected
- `unknown_words_not_flagged`: Unknown words that are not in the dictionaries
- `flagged_word_type`: Categories of privacy-sensitive content found (e.g., "name, illness")
- `language`: Detected language of the response (e.g., 'nl' for Dutch, 'en' for English)
Example row:

```
respondent_id;Answer;question_id;answer_clean;contains_privacy;unknown_words;flagged_words;answer_censored;total_word_count;unknown_word_count;flagged_word_count;unknown_words_not_flagged;flagged_word_type;language
1;"Mijn docent Peter heeft mij geholpen met mijn loopbaanbegleidingstraject";"Q1";"mijn docent peter heeft mij geholpen met mijn loopbaanbegleidingstraject";1;"";peter;"Mijn docent [NAME] heeft mij geholpen met mijn [UNKOWN]";10;0;1;;"name";"nl"
```
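For downstream analysis, the output can be loaded with pandas and filtered on the `contains_privacy` flag. A sketch; the abbreviated column set here is for illustration only:

```python
import io

import pandas as pd

# Two rows in the documented output format (columns abbreviated)
output_csv = (
    "respondent_id;contains_privacy;flagged_word_type;language\n"
    "1;1;name;nl\n"
    "2;0;;nl\n"
)

df = pd.read_csv(io.StringIO(output_csv), sep=";")

# Keep only answers in which privacy-sensitive content was detected
flagged = df[df["contains_privacy"] == 1]
```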
The tool will present unknown words for review, allowing you to:
- Add words to the whitelist (safe words)
- Add words to the blacklist (privacy-sensitive words)
- Skip words for later review
Example interaction (the tool prompts in Dutch; `j` = yes, `n` = no):

```
"docent" kwam 45 keer voor in de open antwoorden.
Wil je dit woord toevoegen aan de whitelist? (j/n/blacklist): j
Woord "docent" is toegevoegd aan de whitelist

"janssen" kwam 12 keer voor in de open antwoorden.
Wil je dit woord toevoegen aan de whitelist? (j/n/blacklist): blacklist
Woord "janssen" is toegevoegd aan de blacklist
```
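The effect of each review decision can be sketched as follows (a simplified, non-interactive stand-in for the Dutch prompt loop; `review` and the `decisions` mapping are hypothetical, not part of SEAA's API):

```python
def review(unknown_words, decisions, whitelist, blacklist):
    """Apply review decisions: 'j' -> whitelist, 'blacklist' -> blacklist,
    anything else -> skip the word for later review."""
    skipped = []
    for word in unknown_words:
        choice = decisions.get(word, "n")
        if choice == "j":
            whitelist.add(word)
        elif choice == "blacklist":
            blacklist.add(word)
        else:
            skipped.append(word)
    return skipped

whitelist, blacklist = set(), set()
skipped = review(
    ["docent", "janssen", "loopbaantraject"],
    {"docent": "j", "janssen": "blacklist"},
    whitelist,
    blacklist,
)
```

In the real tool the decisions come from `input()` prompts like the ones shown above, and the updated lists are written back to the files in `dict/`.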
The tool uses several dictionary files in the `dict/` folder:

- `wordlist.txt`: Base dictionary of common words
- `whitelist.txt`: User-approved safe words
- `blacklist.txt`: Known privacy-sensitive words
- `illness.txt`: Medical conditions
- `studiebeperking.txt`: Study limitations
- `names.txt`: Common first names plus some last names
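If you need to inspect a dictionary yourself, the files are plain text, presumably one word per line (an assumption about the file format). A minimal loader sketch:

```python
import tempfile
from pathlib import Path

def load_wordlist(path: Path) -> set[str]:
    """Read a one-word-per-line dictionary file into a lowercase set,
    skipping blank lines."""
    return {
        line.strip().lower()
        for line in path.read_text(encoding="utf-8").splitlines()
        if line.strip()
    }

# Temporary file standing in for dict/whitelist.txt
with tempfile.TemporaryDirectory() as d:
    sample = Path(d) / "whitelist.txt"
    sample.write_text("Docent\ncollege\n\nstage\n", encoding="utf-8")
    words = load_wordlist(sample)
```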
The tool automatically detects the language of responses and processes Dutch text. Non-Dutch responses are flagged in the output.
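How the language detection works internally is not documented here; as a rough illustration only, a stopword-overlap heuristic can separate Dutch from English responses (toy stopword sets, not SEAA's actual method):

```python
# Tiny illustrative stopword sets; real detectors use much larger models
DUTCH_STOPWORDS = {"de", "het", "een", "ik", "mijn", "met", "en", "van"}
ENGLISH_STOPWORDS = {"the", "a", "i", "my", "with", "and", "of"}

def guess_language(text: str) -> str:
    """Return 'nl' or 'en' depending on which stopword set
    overlaps more with the words in the text."""
    words = set(text.lower().split())
    nl_hits = len(words & DUTCH_STOPWORDS)
    en_hits = len(words & ENGLISH_STOPWORDS)
    return "nl" if nl_hits >= en_hits else "en"
```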
- The tool is optimized for Dutch-language text
- The dictionary-based approach may miss complex or context-dependent privacy information
- Regular maintenance of the dictionaries is recommended for optimal performance