- Architecture Diagram
- https://excalidraw.com/#json=6200156642869248,aQsoVHVum1FL9nK1G4MVwg
- A document explaining your design decisions
- README.md
- All supporting schema / queries / code
- Link to a git repository with instructions on how to set it up.
- SQS Client
- stores influencer id to be processed next
- read from SQS
- write to SQS
- MockstagramDataProvider
- Connects with mockstagram and fetches data for influencer
- Seeder
- rake script to seed sqs with influencer ids
- Influencer Analytics Service
- Fetches mockstagram data from MockstagramDataProvider
- transforms the data
- write back to SQS
- Data Pipeline
- Fetch data from Mockstagram
- rake script to read messages from
influencer_id_store
SQS and forward toInfluencer Analytics Service
- Forwards influencer id to Influencer Analytics Service to be pushed again to queue for re-processing at certain interval
- rake script to read messages from
- Ingestion to DB
- rake script to read messages from
influencer_data_points
SQS and sink to DB
- rake script to read messages from
- Fetch data from Mockstagram
- API Layer
- API to fetch and display data
- Ruby on Rails
- RSpec for Unit Tests
- Faraday gem as HTTP client
- Parallel gem for parallel + concurrent processing
- TimeScale DB as timeseries database
- Mock AWS SQS for stream processing
- Docker for dependencies
For this problem statement, where in we have to store some kind of metrics for each influencer over a time period with some resolution I understand TimeSeries Database would be the best fit.
Upon reading of different TSDB, I shortlisted 3 popular dbs.
- Influx DB
- Prometheus DB
- TimeScale DB
While all 3 support timeseries based data storage and retrieval. Influx DB would have been the best for this use case, I chose to go with TimeScale DB
Reasons:
- Given it's built on top of PSQL and I have familiarity with it.
- If I had chosen InfluxDB, there would be some learning curve to understand it's query language but with my understanding with PSQL, it was easy to get started while retaining benefits of TSDB.
- Since I used Rails to create this project, ActiveRecord ORM works pretty well with PSQL and thus TimeScale is compatible with it.
Based on above reasons and given the time, this was a tradeoff I made.
My understanding for the basic flow was we have to maintain the granularity of data points resolution. To be able to achieve that with fail safe mechanism I need to make sure even if a request fails, it can be retried ASAP.
Here in, I started with a queue that stores influencer_id and a consumer which consumes from the queue, fetches data from Mockstagram and stores to DB. At the end of processing each influencer_id, consumer make sures that it pushes back the same influencer_id to queue for re-processing in a certain time interval.
Here, I realized DB would be the bottleneck which will impact resolution in case it goes down. To decouple the behaviour, I introduced another queue which stores the data points retrieved from Mockstagram and allow another consumer to do a batch insertion to reduce load on DB yet maintaining a suitable latency for data to be available.
Now, even if DB goes down for some reason data won't be lost and we still can retain the intended granularity.
GET /influencer_analytics/<influencer_id>?start_time=<epoch_in_seconds>&end_time=<epoch_in_seconds>