Download Pinot Distribution from http://pinot.apache.org/download/
$ export PINOT_VERSION=0.7.0
$ tar -xvf apache-pinot-incubating-${PINOT_VERSION}-bin.tar.gz
$ cd apache-pinot-incubating-${PINOT_VERSION}-bin
bin/pinot-admin.sh StartZookeeper
bin/pinot-admin.sh StartServiceManager -zkAddress localhost:2181 -clusterName pinot-quickstart -port -1 -bootstrapConfigPaths ${PINOT_DIR}/config/pinot-controller.conf ${PINOT_DIR}/config/pinot-broker.conf ${PINOT_DIR}/config/pinot-server.conf
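The files passed to -bootstrapConfigPaths are plain key=value properties files that you provide yourself. As a rough sketch (the keys are standard Pinot controller settings, but the values here are assumptions for this quickstart), pinot-controller.conf could look like:

controller.helix.cluster.name=pinot-quickstart
controller.zk.str=localhost:2181
controller.port=9000
controller.data.dir=/tmp/pinot/controller-data

pinot-broker.conf and pinot-server.conf follow the same format, with broker- and server-specific ports and directories.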
# Optional: start a Pinot Minion as well
bin/pinot-admin.sh StartMinion -clusterName pinot-quickstart -zkAddress localhost:2181 -configFileName ${PINOT_DIR}/config/pinot-minion.conf
bin/kafka-topics.sh --create --bootstrap-server localhost:19092 --replication-factor 1 --partitions 1 --topic transcript-topic
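If you want to double-check that the topic exists, listing the topics on the same broker works:

bin/kafka-topics.sh --list --bootstrap-server localhost:19092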
bin/pinot-admin.sh AddTable \
  -tableConfigFile ${PINOT_DIR}/transcript-table-offline.json \
  -schemaFile ${PINOT_DIR}/transcript-schema.json -exec

bin/pinot-admin.sh AddTable \
  -schemaFile ${PINOT_DIR}/transcript-schema.json \
  -tableConfigFile ${PINOT_DIR}/transcript-table-realtime.json \
  -exec
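Both commands reference the same schema file. A minimal sketch of what transcript-schema.json could contain, assuming the transcript fields from the Pinot getting-started example (adjust names and types to your own data):

{
  "schemaName": "transcript",
  "dimensionFieldSpecs": [
    { "name": "studentID", "dataType": "INT" },
    { "name": "firstName", "dataType": "STRING" },
    { "name": "lastName", "dataType": "STRING" },
    { "name": "gender", "dataType": "STRING" },
    { "name": "subject", "dataType": "STRING" }
  ],
  "metricFieldSpecs": [
    { "name": "score", "dataType": "FLOAT" }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "timestampInEpoch",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:MILLISECONDS"
    }
  ]
}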
bin/pinot-admin.sh LaunchDataIngestionJob \
  -jobSpecFile ${PINOT_DIR}/batch-job-spec.yml
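The job spec is another file you supply. A minimal standalone sketch of batch-job-spec.yml, assuming CSV input and local paths (both are assumptions here, not values from this post):

executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: '/tmp/pinot-quick-start/rawdata/'   # assumed input location
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: '/tmp/pinot-quick-start/segments/' # assumed output location
overwriteOutput: true
pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
tableSpec:
  tableName: 'transcript'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'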
Pinot only allows adding new columns to a schema. To drop a column, or to change a column's name or data type, a new table has to be created. Backfill jobs must run at the same granularity as the daily job, for a specific date, and we need to point the backfill job at its own input folder. The backfill job then generates segments with the same names as the original job (but containing the new data). When those segments are uploaded, the controller replaces the old segments with the new ones one by one, since segment names act as primary keys within Pinot. If the original input directory produced multiple segments and the backfill input directory contains a single file, only the matching segment is updated and the remaining segments stay unchanged. However, if the raw data is modified in such a way that the original time bucket now has fewer input files than the first ingestion run, the backfill will fail.
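For instance, a backfill run can reuse the ingestion command above with a job spec whose inputDirURI points at the corrected raw data for that date; the spec file name below is hypothetical:

bin/pinot-admin.sh LaunchDataIngestionJob \
  -jobSpecFile ${PINOT_DIR}/backfill-job-spec.yml

Because the generated segment names match the originals, pushing the output replaces the old segments in place.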
bin/kafka-console-producer.sh \
  --broker-list localhost:19092 \
  --topic transcript-topic < ${PINOT_DIR}/upsert_transcript.json
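upsert_transcript.json holds one JSON record per line; an illustrative record matching the transcript schema (all values made up) would be:

{"studentID": 205, "firstName": "Natalie", "lastName": "Jones", "gender": "Female", "subject": "Maths", "score": 3.8, "timestampInEpoch": 1571900400000}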
Upsert poses the additional requirement that all segments of the same partition must be served from the same server to ensure data consistency across the segments. Accordingly, it requires using strictReplicaGroup as the routing strategy.
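In the realtime table config, this strategy is selected in the routing section:

"routing": {
  "instanceSelectorType": "strictReplicaGroup"
}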
There are some limitations for upsert Pinot tables. First, the high-level consumer is not allowed for input stream ingestion, which means stream.kafka.consumer.type must be lowLevel. Second, the star-tree index cannot be used, as the star-tree index performs pre-aggregation during ingestion.
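Put together, the upsert-relevant fragments of the realtime table config look roughly like this (the primary key itself is declared in the schema's primaryKeyColumns, and the many other required streamConfigs entries are omitted):

"upsertConfig": {
  "mode": "FULL"
},
"streamConfigs": {
  "streamType": "kafka",
  "stream.kafka.consumer.type": "lowLevel",
  "stream.kafka.topic.name": "transcript-topic"
}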