* Chicago DB Use Cases
** Key Value Store
*** Write Guarantees
**** Always available writes
**** Very low latency writes
***** Time from issuing a write request to receiving an ack from N quorum nodes is less than 10 ms
**** Always replicated: if you receive an ack, the write was replicated to N quorum nodes
**** Failure modes are somewhat understood; recovery is still lacking
*** Read Guarantees
**** If at least 1 of the N quorum nodes is available, you will get a response
**** If none of the N quorum nodes is available, you will get a failure
**** If you wrote the key and received a successful response, the value will be available immediately
*** Replication Guarantees
**** For a given quorum setting, key/values will be written to N quorum nodes.
**** If a node joins/leaves the cluster, the key/values in the cluster will be redistributed across the remaining nodes, such that each key/value exists on N quorum nodes.
**** Redistribution of data: as a client, this means that if N nodes hold my key/value and one of them leaves the cluster, N-1 nodes will hold my key/value until the key/values are redistributed and once again N nodes hold my key/value
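The sketch below restates these guarantees as client-side code. =ChicagoClient= and its =write=/=read= methods are hypothetical names invented for illustration; only the timing, quorum, and read-your-writes behavior is taken from the notes above.

#+begin_src java
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical client interface; the real Chicago client API may differ.
interface ChicagoClient {
    // Completes only once N quorum nodes have acked the write.
    CompletableFuture<Void> write(byte[] key, byte[] value);
    // Returns a value if at least 1 of the N quorum nodes is reachable.
    Optional<byte[]> read(byte[] key);
}

class QuorumGuaranteesDemo {
    static void demo(ChicagoClient client) throws Exception {
        byte[] key = "user:42".getBytes(StandardCharsets.UTF_8);
        byte[] value = "hello".getBytes(StandardCharsets.UTF_8);

        // Write guarantee: the ack should arrive in under 10 ms, and it
        // implies the value now lives on N quorum nodes.
        client.write(key, value).get(10, TimeUnit.MILLISECONDS);

        // Read guarantee: after a successful write ack, the value is
        // immediately visible to the writer.
        byte[] got = client.read(key).orElseThrow(IllegalStateException::new);
        assert "hello".equals(new String(got, StandardCharsets.UTF_8));
    }
}
#+end_src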
** Streaming Time Series
*** Logging
**** Writes
- System-level logs and PPFE logs need to be published to Chicago through a client.
- Writes need to be asynchronous and best-effort (see the sketch after this list).
- Each write operation should take no more than 10 ms (not sure if that is achievable).
- Incremental reads by timestamp, offset, or byte size; do not resend logs previously piped to Kafka.
- Restrict the size of logs uploaded to Kafka via Chicago to < 20 MB per update, to avoid clogging both the host -> Chicago and Chicago -> Kafka paths.
- The Chicago module should include a HEADER in the logs for each task being logged.
- Create a consistent format for logging.
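A minimal sketch of what "asynchronous, best-effort" could mean on the client side: callers enqueue and never block, and entries are dropped rather than retried when the queue is full. =ChicagoLogClient.publish= is an assumed stand-in for the real client call.

#+begin_src java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Assumed stand-in for the real Chicago client publish call.
interface ChicagoLogClient { void publish(byte[] logLine); }

class BestEffortLogWriter {
    private final BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(10_000);

    BestEffortLogWriter(ChicagoLogClient client) {
        Thread drainer = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                byte[] line;
                try {
                    line = queue.take();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                client.publish(line); // target: < 10 ms per write
            }
        }, "chicago-log-drainer");
        drainer.setDaemon(true);
        drainer.start();
    }

    // Best effort: offer() returns false instead of blocking when the queue
    // is full, so a slow Chicago node cannot back-pressure the application.
    boolean submit(byte[] logLine) {
        return queue.offer(logLine);
    }
}
#+end_src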
**** Reads
- A consumer reading from Chicago should be able to fetch messages by topic name within milliseconds.
- Chicago should upload to an existing Kafka topic.
- Consistent update format, with a HEADER on each update.
- Consumers should be able to filter the Kafka topic by task information (see the sketch below).
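For the task-filter requirement, a plain Kafka consumer can already do this if each record carries its task name as a record header. The broker address, topic name (=chicago-logs=), and header key (=task=) below are assumptions for illustration.

#+begin_src java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.header.Header;

public class TaskFilteringConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "chicago-log-readers");
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("chicago-logs")); // assumed topic name
            while (true) {
                for (ConsumerRecord<String, String> rec :
                         consumer.poll(Duration.ofMillis(100))) {
                    // The HEADER written per task becomes a Kafka record header
                    // that consumers can filter on without parsing the payload.
                    Header task = rec.headers().lastHeader("task");
                    if (task != null && "ppfe".equals(new String(task.value()))) {
                        System.out.printf("offset=%d %s%n", rec.offset(), rec.value());
                    }
                }
            }
        }
    }
}
#+end_src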
*** PPFE
**** Block DDoS attacks
- When a particular IP sends millions of requests in a short period, PPFE needs to block that IP address by reusing analysis of the Chicago stream logs.
- The logs sent to Chicago need to be consumed within milliseconds.
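One way a stream consumer could flag abusive IPs, sketched with made-up numbers: the 1 s window and 50,000-requests threshold are assumptions, not figures from these notes, and a single-threaded consumer is assumed for the window roll.

#+begin_src java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

class IpRateDetector {
    private static final long WINDOW_MS = 1_000;  // assumed window size
    private static final long THRESHOLD = 50_000; // assumed requests per window

    private final ConcurrentHashMap<String, AtomicLong> counts = new ConcurrentHashMap<>();
    private long windowStart = System.currentTimeMillis();

    /** Returns true when the IP should be handed to PPFE for blocking. */
    boolean record(String ip, long nowMs) {
        if (nowMs - windowStart >= WINDOW_MS) { // roll to a fresh window
            counts.clear();
            windowStart = nowMs;
        }
        long n = counts.computeIfAbsent(ip, k -> new AtomicLong()).incrementAndGet();
        return n > THRESHOLD;
    }
}
#+end_src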
**** Block bad requests
- When certain bad requests that can be parsed in a known way are sent to PPFE, we need to send the logs to Chicago.
- A stream consumer can run algorithms that detect such requests and decide to block them without manual intervention, as in the sketch below.
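A sketch of the parse-and-detect step; the signature patterns here are generic examples, not rules from these notes.

#+begin_src java
import java.util.List;
import java.util.regex.Pattern;

class BadRequestDetector {
    // Example signatures only; real patterns would come from the stream analysis.
    private static final List<Pattern> SIGNATURES = List.of(
        Pattern.compile("(?i)union\\s+select"), // SQL injection probe
        Pattern.compile("\\.\\./\\.\\./"),      // path traversal
        Pattern.compile("(?i)<script[^>]*>"));  // reflected XSS attempt

    /** True if a logged request line matches a known bad pattern. */
    static boolean isBad(String requestLine) {
        return SIGNATURES.stream().anyMatch(p -> p.matcher(requestLine).find());
    }
}
#+end_src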
*** ProdNG
- Accommodate custom requirements in the Agent
- TODO grab Big Brother use cases
- TODO grab Zabbix use cases
- Chicago module to update logs
- Create a source list and format patterns by application, system log monitoring, and custom tasks
- Accommodate custom operations (ad hoc scripts, app logs)
- The Agent should be able to run scripts on demand and channel logs via Chicago to a Kafka topic
- Log retention time
- Report health
- Capture all stack traces
- Memory/CPU report parameters
- Log zombie processes
- Machine health reporting: space, load, free memory, app running stats (see the sketch at the end of this list)
- sshd logs
- Separate the logging filter from ProdNG for the Kafka topic
- Identify delays in log consumption
- Alerting on Chicago server(s)
- Break updates into shards per topic / break down for filtering
- Log rotation and reporting
- init.d service monitoring
- Create a Chicago module for transferring data from the host to Chicago
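For the machine-health item above, a JVM-based agent could gather the four stats roughly as below; =com.sun.management.OperatingSystemMXBean= is a JDK extension, and the thread count is only a stand-in for "app running stats".

#+begin_src java
import java.io.File;
import java.lang.management.ManagementFactory;

class MachineHealthReport {
    /** One-line snapshot: space, load, free memory, app running stats. */
    static String snapshot() {
        com.sun.management.OperatingSystemMXBean os =
            (com.sun.management.OperatingSystemMXBean)
                ManagementFactory.getOperatingSystemMXBean();
        long freeDiskBytes = new File("/").getFreeSpace();  // a. space
        double loadAvg = os.getSystemLoadAverage();         // b. load
        long freeMemBytes = os.getFreePhysicalMemorySize(); // c. free memory
        int threads = ManagementFactory.getThreadMXBean().getThreadCount(); // d. stand-in
        return String.format("space=%dB load=%.2f free_mem=%dB threads=%d",
                freeDiskBytes, loadAvg, freeMemBytes, threads);
    }
}
#+end_src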
*** Kafka Setup
- Cluster availability and maintenance
Requirements:
- Very fast asynchronous writes (see the producer sketch after this list).
- Ordered reads (not sure if we need ordering) based on a topic/column family, as soon as they are written.
- Consumers to tail the logs from Chicago and push them to the respective topics in Kafka for all other clients, like Splunk and dashboards, to consume.
- Replication of messages to a configurable number of brokers.
- Fault tolerant: writes and reads are not affected by a node in the cluster going down, as long as the quorum is maintained.
- Retention of messages in RocksDB based on size or time period, or deletion as soon as they are pushed to Kafka.
- Configure the size of log updates, by interval and by log size.
- Logs should be truncated after each update and rolled into an "old" file.
- Offset filter per host/topic.
- Offset filter for Big Brother / Zabbix.
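A sketch of the fast asynchronous write side using the stock Kafka producer; the broker address and topic name are assumptions. =acks=1= waits for the partition leader only, and =send()= returns immediately with a callback on ack, which matches the best-effort requirement.

#+begin_src java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class FastAsyncProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("acks", "1");      // fast path: wait for the leader only
        props.put("linger.ms", "5"); // small batching window for throughput
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // send() is asynchronous: it returns immediately and the callback
            // fires when the broker acks (or the send fails).
            producer.send(new ProducerRecord<>("chicago-logs", "host-1", "log line"),
                (metadata, e) -> {
                    if (e != null) e.printStackTrace(); // best effort: log and move on
                });
        }
    }
}
#+end_src

On the open questions above: Kafka only guarantees ordering within a partition, so keying records by host (as here) preserves per-host order without needing global ordering, and the broker replication count is set per topic at creation time (e.g. =kafka-topics.sh --create --replication-factor 3 ...=).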