MoviesMatchDataGenerator

Data generator code for Movies Match.
Generates fictitious user ratings.

The code is meant as a baseline and does not include code to stream the data to an actual system.

Run Locally

Clone the project

  git clone https://github.com/Webiks/MoviesMatchDataGenerator

Go to the project directory

  cd MoviesMatchDataGenerator

Create a virtual environment

  python -m venv .venv
  source .venv/bin/activate

Install dependencies

  pip install -r requirements.txt

Create a local .env file

  cp .env.example .env

Run the generator

  python -m generator

Generated data

The code generates a DataFrame with user_id, movie_id and rating columns.
The user_id and movie_id values match IDs from Kaggle's The Movies Dataset and can include new IDs, depending on configuration.
Rating values range from 0.5 to 5 (inclusive) in increments of 0.5.

Example of generated data (in CSV format):

user_id,movie_id,rating
300082,4,3.0
300199,2,4.0
47427,9,3.0
210960,2,3.0
300100,8,4.5
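
As a quick sanity check on this format, the sample rows above can be parsed and validated with the standard library alone. This is a minimal sketch, not part of the generator: the helper name `is_valid_rating` is hypothetical, and the sample is embedded as a string rather than read from a real output file.

```python
import csv
import io

# The example rows shown above, embedded so the sketch is self-contained.
SAMPLE = """user_id,movie_id,rating
300082,4,3.0
300199,2,4.0
47427,9,3.0
210960,2,3.0
300100,8,4.5
"""


def is_valid_rating(rating: float) -> bool:
    """Return True for 0.5, 1.0, ..., 5.0 (hypothetical helper)."""
    doubled = rating * 2
    return doubled == int(doubled) and 1 <= doubled <= 10


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
ratings = [float(row["rating"]) for row in rows]
all_valid = all(is_valid_rating(r) for r in ratings)
print(len(rows), all_valid)
```

Doubling the rating turns the half-step check into an integer check, which avoids a floating-point modulo.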

Configuration

Environment Variables

Environment variables can be set via a .env file.
The variables to configure are explained below:

MOVIES_MATCH_LOG_LEVEL: Log level, default is INFO
MOVIES_MATCH_GENERATE_INTERVAL_SEC: Interval in seconds to trigger data generation
MOVIES_MATCH_GENERATE_INTERVAL_JITTER_SEC: Jitter in seconds applied to the interval to simulate delays.
- For example, with an interval of 60 seconds and a jitter of 10 seconds, the actual interval will be in the range [50-70] seconds.
MOVIES_MATCH_RECORDS_TO_GENERATE: How many records to generate per interval
MOVIES_MATCH_RECORDS_TO_GENERATE_JITTER: Jitter applied to the record count to simulate varying workloads.
- For example, with 250 records to generate and a jitter of 140, each interval will generate [110-390] records.
MOVIES_MATCH_EXISTING_USERS_FIRST_ID: First ID of existing users, defaults to 1
MOVIES_MATCH_EXISTING_USERS_LAST_ID: Last ID of existing users, defaults to 270896
MOVIES_MATCH_NEW_USERS_FIRST_ID: First ID of new users, defaults to 300001
MOVIES_MATCH_NEW_USERS_LAST_ID: Last ID of new users, defaults to 300300
MOVIES_MATCH_NEW_USER_RATING_PROB: The probability of generating a record for a new user
- Set to 0 to generate data for existing users only
- Set to 1 to generate data for new users only
- The user to generate a rating for is selected at random
MOVIES_MATCH_MOVIES_RATINGS_DISTRIBUTION_FILE: Path to distribution file, defaults to movies_ratings_distribution.csv
- See details in section below
MOVIES_MATCH_RANDOM_SEED: Optional random seed, so that re-running the generator produces the same records.
- Defaults to 0, which means the seed is ignored, i.e., each run produces different records
- Mainly used in development to re-stream predictable data
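
The jitter, new-user probability and seed settings above can be read as the following sketch. It is an illustration under stated assumptions, not the generator's actual code: the function names are hypothetical, values are passed in directly rather than read from the environment, and a uniform draw over the jitter range is assumed because the description only gives the range.

```python
import random
from typing import Tuple


def make_rng(seed: int = 0) -> random.Random:
    """Seed 0 means "ignore": fall back to an unseeded generator."""
    return random.Random(seed) if seed != 0 else random.Random()


def jittered(base: int, jitter: int, rng: random.Random) -> int:
    """Pick a value in [base - jitter, base + jitter].

    The uniform draw is an assumption; only the range is documented.
    """
    return rng.randint(base - jitter, base + jitter)


def pick_user_id(
    new_user_prob: float,
    existing_range: Tuple[int, int],
    new_range: Tuple[int, int],
    rng: random.Random,
) -> int:
    """Choose the new-user ID range with probability new_user_prob,
    then pick an ID at random inside the chosen range."""
    use_new = rng.random() < new_user_prob
    first, last = new_range if use_new else existing_range
    return rng.randint(first, last)


rng = make_rng(42)
interval_sec = jittered(60, 10, rng)    # falls in [50, 70]
record_count = jittered(250, 140, rng)  # falls in [110, 390]
user_ids = [
    pick_user_id(0.3, (1, 270896), (300001, 300300), rng)
    for _ in range(record_count)
]
print(interval_sec, record_count, len(user_ids))
```

Because `rng.random()` returns a value in [0, 1), a probability of 0 never selects the new-user range and a probability of 1 always does, which matches the two edge cases listed above.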

Movies Rating Distribution CSV

The data generator loads per-movie rating distributions from a file, so that new ratings follow each movie's existing distribution instead of being purely random.
The file is also used as the source for movie IDs.
Update the file, or create a new one with new movies, to better simulate a real-world environment where movies and ratings are added constantly.

The file is in CSV format and includes the following columns:

(index): Running index
movieId: Movie's unique id
rating: Movie's rating ([0.5-5] in increments of 0.5)
prob: Movie's rating probability

The rating probabilities of each movie must sum to 1 to form a valid distribution.

Example:

150,16,0.5,0.0033630647755519814
151,16,1.0,0.012867378271677147
152,16,1.5,0.004484086367402642
153,16,2.0,0.037578593361602575
154,16,2.5,0.02349271335965297
155,16,3.0,0.20422088999366378
156,16,3.5,0.09772383876785105
157,16,4.0,0.34498220987473804
158,16,4.5,0.10142808402787933
159,16,5.0,0.1698591411999805
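
To illustrate how such a file could be checked and used, the sketch below parses the example rows, verifies that each movie's probabilities sum to 1, and draws one rating from the distribution. It is a hedged illustration rather than the generator's implementation: the function names are hypothetical, the rows are embedded without a header line exactly as in the excerpt (a real file may carry one), and the tolerance value is an arbitrary choice.

```python
import csv
import io
import math
import random
from collections import defaultdict

# The example rows shown above: index, movieId, rating, prob.
ROWS = """150,16,0.5,0.0033630647755519814
151,16,1.0,0.012867378271677147
152,16,1.5,0.004484086367402642
153,16,2.0,0.037578593361602575
154,16,2.5,0.02349271335965297
155,16,3.0,0.20422088999366378
156,16,3.5,0.09772383876785105
157,16,4.0,0.34498220987473804
158,16,4.5,0.10142808402787933
159,16,5.0,0.1698591411999805
"""


def load_distributions(text: str) -> dict:
    """Group rows into {movie_id: {rating: prob}}."""
    dist = defaultdict(dict)
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        _index, movie_id, rating, prob = row
        dist[int(movie_id)][float(rating)] = float(prob)
    return dict(dist)


def check_distributions(distributions: dict, tol: float = 1e-6) -> None:
    """Raise ValueError if any movie's probabilities do not sum to 1."""
    for movie_id, probs in distributions.items():
        total = sum(probs.values())
        if not math.isclose(total, 1.0, abs_tol=tol):
            raise ValueError(f"movie {movie_id}: probabilities sum to {total}")


def sample_rating(distributions: dict, movie_id: int, rng: random.Random) -> float:
    """Draw one rating weighted by the movie's distribution."""
    probs = distributions[movie_id]
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]


distributions = load_distributions(ROWS)
check_distributions(distributions)
rating = sample_rating(distributions, 16, random.Random(1))
print(rating)
```

Comparing the sum with a tolerance rather than exact equality avoids rejecting valid files because of floating-point rounding.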
