Goal: Analyze venue visitation patterns and design a scalable processing pipeline
Duration: 45 minutes to 1 hour
You have data capturing anonymous individuals visiting specific venues. The data is provided in daily batches (CSV) available in the ./daily_data/
folder with the following structure:
Data Structure
The files correspond to visitation data for individual days, stored as follows:
daily_data/
├── 20241028.csv
├── 20241029.csv
├── 20241030.csv
...
Data Schema
{
'venue_id': string, # Venue identifier
'visitor_id': string, # Anonymous visitor identifier
'visit_start_time': datetime, # Visit start date and time
'visit_end_time': datetime, # Visit end date and time
'venue_type': string # Type of venue
}
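For orientation, one daily batch could be loaded like this. This is a minimal sketch assuming pandas; the file name follows the daily_data/ layout and the column names follow the schema above:

import pandas as pd

# Load a single daily batch; parse the two timestamp columns on read.
# The concrete file name (20241028.csv) is just an example taken from
# the folder listing above.
df = pd.read_csv(
    "daily_data/20241028.csv",
    dtype={"venue_id": "string", "visitor_id": "string", "venue_type": "string"},
    parse_dates=["visit_start_time", "visit_end_time"],
)
print(df.dtypes)
print(df.head())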
Quick Data Analysis
- Load and inspect the provided dataset.
- Familiarize yourself with the data structure.
- Identify potential data quality issues (see the example checks after this list).
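By way of illustration, a few checks one might run for quality issues. This assumes the DataFrame df loaded in the sketch above, and the specific checks (nulls, negative durations, duplicates) are examples rather than an exhaustive list:

# Illustrative data quality checks; real data may need more.
issues = {
    # Missing values anywhere in the frame
    "null_values": int(df.isna().sum().sum()),
    # Visits that end before they start
    "negative_durations": int((df["visit_end_time"] < df["visit_start_time"]).sum()),
    # Fully duplicated rows (e.g., double-logged events)
    "duplicate_rows": int(df.duplicated().sum()),
}
print(issues)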
Simple Processing Pipeline
- Build a basic pipeline to process the data on a daily basis:
- Load the corresponding batch for each day.
- Handle data inconsistencies.
- For each venue and day, perform the following:
- Compute the daily visitor count (unique visitors and total).
- Predict the next day's visitor count (based on historical values).
- Append the processed data to a CSV file named daily_visitation_summary.csv, in the structure described below:
{
'date': date, # Processing date
'venue_id': string, # Venue identifier
'visitor_count_unique': integer, # Number of unique visitors
'visitor_count_total': integer, # Total number of visits (including repeated visits)
'visitor_count_total_prediction': float, # Prediction for the next day's total visit count
}
- Focus on pipeline structure and error handling rather than prediction accuracy (a minimal pipeline sketch follows this list).
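One possible shape for such a pipeline, as a minimal sketch: the cleaning rules, the 3-day rolling-mean prediction, and the helper names (load_batch, summarize, add_prediction) are illustrative assumptions, not part of the task.

import glob
import os
import pandas as pd

SUMMARY_PATH = "daily_visitation_summary.csv"

def load_batch(path):
    """Load one daily CSV and drop obviously inconsistent rows."""
    df = pd.read_csv(path, parse_dates=["visit_start_time", "visit_end_time"])
    df = df.dropna(subset=["venue_id", "visitor_id"])
    df = df[df["visit_end_time"] >= df["visit_start_time"]]  # example rule
    return df.drop_duplicates()

def summarize(df, date):
    """Per-venue daily counts in the output schema described above."""
    out = df.groupby("venue_id").agg(
        visitor_count_unique=("visitor_id", "nunique"),
        visitor_count_total=("visitor_id", "size"),
    ).reset_index()
    out.insert(0, "date", date)
    return out

def add_prediction(history):
    """Naive next-day prediction: 3-day rolling mean of each venue's totals."""
    history = history.sort_values(["venue_id", "date"])
    history["visitor_count_total_prediction"] = (
        history.groupby("venue_id")["visitor_count_total"]
        .transform(lambda s: s.rolling(3, min_periods=1).mean())
    )
    return history

def run():
    history = []
    for path in sorted(glob.glob("daily_data/*.csv")):  # one batch per day
        date = os.path.basename(path).removesuffix(".csv")  # YYYYMMDD
        history.append(summarize(load_batch(path), date))
        # Recompute predictions over all history, append only today's rows.
        full = add_prediction(pd.concat(history, ignore_index=True))
        full[full["date"] == date].to_csv(
            SUMMARY_PATH, mode="a", index=False,
            header=not os.path.exists(SUMMARY_PATH),
        )

if __name__ == "__main__":
    run()

Recomputing the rolling mean over the full history each day is wasteful but keeps the sketch short; a production version would carry forward only the per-venue state it needs.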
Answer briefly:
- How would you scale this pipeline to handle:
- 20 million locations and 20 million users
- Real-time updates
- How would you store the data?
- How would you monitor data quality?
- What would your daily orchestration look like?
Deliverables:
Python Notebook or Script
- Data processing code.
- Basic prediction pipeline.
- Clear comments.
- Instructions on how to run the code.
Report with a Brief Summary
- Key findings.
- Answers to the questions above.
- Any assumptions made.
Evaluation criteria:
- Code structure and clarity.
- Data pipeline design.
- Error handling.
- Scaling considerations.
- For simplicity, the example includes a small amount of dummy data. Consider how your solution would work in a real production environment at scale.
- Focus on writing production-ready code over complex algorithms.
- You can use any standard Python data libraries.
How to submit:
- Upload your solution to a GitHub repository and share the project with jan.benetka@unacast.com.
Next steps:
- A 15-20 minute discussion of your solution with one or more people from Unacast.
We know assignments like this take time, and we really appreciate the effort you're putting into it. It’s all about making sure our team is packed with the best and brightest, and honestly, we’re rooting for you to be one of us. We really hope this isn’t the last time we get to write some code together!
🚀 Thanks for your time and effort – we hope you’ll join us! 🚀