QOLDEV-863 Fix solr HA #454

Merged: 4 commits, merged on Aug 7, 2024
files/default/solr-sync.sh (20 changes: 9 additions & 11 deletions)

@@ -7,7 +7,7 @@ set -x
BACKUP_NAME="$CORE_NAME-$(date +'%Y-%m-%dT%H:%M')"
SNAPSHOT_NAME="snapshot.$BACKUP_NAME"
LOCAL_SNAPSHOT="$LOCAL_DIR/$SNAPSHOT_NAME"
SYNC_SNAPSHOT="$SYNC_DIR/$SNAPSHOT_NAME"
SYNC_SNAPSHOT="$SYNC_DIR/${SNAPSHOT_NAME}.tgz"
MINUTE=$(date +%M)

function set_dns_primary () {
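
For orientation, the naming scheme above expands roughly as follows for a hypothetical core named mycore (every value below is invented for illustration; only the pattern matters):

BACKUP_NAME="mycore-2024-08-07T10:35"
SNAPSHOT_NAME="snapshot.mycore-2024-08-07T10:35"
LOCAL_SNAPSHOT="/var/solr/data/snapshot.mycore-2024-08-07T10:35"
SYNC_SNAPSHOT="/mnt/efs/sync/snapshot.mycore-2024-08-07T10:35.tgz"   # now a tarball path, per this PR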
@@ -52,18 +52,18 @@ function export_snapshot () {
if [ "$REPLICATION_STATUS" != "0" ]; then
return $REPLICATION_STATUS
fi
-sudo -u solr sh -c "$LUCENE_CHECK $LOCAL_SNAPSHOT && rsync -a --delete $LOCAL_SNAPSHOT/ $SYNC_SNAPSHOT/" || return 1
+sh -c "$LUCENE_CHECK $LOCAL_SNAPSHOT && sudo -u solr tar --force-local --exclude=write.lock -czf $SYNC_SNAPSHOT -C $LOCAL_SNAPSHOT ." || return 1
Member commented:

I've forgotten why we didn't go with snapshot over backup. I know that a backup is full rather than partial, but it's also more disk/resource-intensive.

Are we still running this every 2 minutes, or did we slow it down to every 10?

ThrawnCA (Contributor, Author) commented on Aug 7, 2024:

I can't see any distinction in the docs between snapshot and backup. The commands are just 'backup' and 'restore'.

It runs every 5 minutes.
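
For reference, the replication-handler calls behind this script look like the following (host, port, and names here are illustrative; backup, restore, and restorestatus are the standard Solr commands, and the script's own curl at the old line 85 uses the same pattern):

# trigger a named, full backup of the core's index
curl "http://localhost:8983/solr/mycore/replication?command=backup&location=/data&name=mybackup"

# restore that backup by name, as import_snapshot previously did
curl "http://localhost:8983/solr/mycore/replication?command=restore&location=/data&name=mybackup"

# poll the restore's progress
curl "http://localhost:8983/solr/mycore/replication?command=restorestatus"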

}

function import_snapshot () {
# Give the master time to update the sync copy
for i in $(eval echo "{1..40}"); do
if [ -f "$SYNC_SNAPSHOT/write.lock" ]; then
sudo -u solr rm -r $LOCAL_DIR/snapshot.$CORE_NAME-*
sudo -u solr rsync -a --delete "$SYNC_SNAPSHOT/" "$LOCAL_SNAPSHOT/" || exit 1
rm $LOCAL_SNAPSHOT/write.lock
curl "$HOST/$CORE_NAME/replication?command=restore&location=$LOCAL_DIR&name=$BACKUP_NAME"
return 1
if [ -f "$SYNC_SNAPSHOT" ]; then
sudo service solr stop
sudo -u solr mkdir $LOCAL_DIR/index
Member commented:

Should we have a -p for safety?
ThrawnCA (Contributor, Author) replied:

Nah, line 80 already does that.
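
For readers following along, the -p question in miniature, using a throwaway path:

mkdir /tmp/index-demo      # succeeds on the first run
mkdir /tmp/index-demo      # fails: File exists
mkdir -p /tmp/index-demo   # a no-op if the directory already exists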

+rm $LOCAL_DIR/index/* && sudo -u solr tar -xzf "$SYNC_SNAPSHOT" -C $LOCAL_DIR/index || exit 1
+sudo systemctl start solr
+return 0
else
sleep 5
fi
@@ -100,9 +100,7 @@ if (/usr/local/bin/pick-solr-master.sh); then

# Hourly backup to S3
if [ "$MINUTE" = "00" ]; then
cd "$LOCAL_DIR"
tar --force-local -czf "$SNAPSHOT_NAME.tgz" "$SNAPSHOT_NAME"
aws s3 mv "$SNAPSHOT_NAME.tgz" "s3://$BUCKET/solr_backup/$CORE_NAME/" --expires $(date -d '30 days' --iso-8601=seconds)
aws s3 cp "$SYNC_SNAPSHOT" "s3://$BUCKET/solr_backup/$CORE_NAME/" --expires $(date -d '30 days' --iso-8601=seconds)
fi
else
# make traffic come to this instance only as a backup option
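Since the hourly job now just copies the already-built tarball, pulling one back for a manual restore is symmetric. A sketch, assuming the bucket layout from the diff and an invented snapshot name:

# list the hourly backups kept for a core
aws s3 ls "s3://$BUCKET/solr_backup/$CORE_NAME/"

# fetch one and unpack it the same way import_snapshot does
aws s3 cp "s3://$BUCKET/solr_backup/$CORE_NAME/snapshot.mycore-2024-08-07T10:00.tgz" /tmp/
tar -xzf "/tmp/snapshot.mycore-2024-08-07T10:00.tgz" -C "$LOCAL_DIR/index"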
recipes/ckanbatch-configure.rb (7 changes: 7 additions & 0 deletions)

@@ -60,6 +60,13 @@
group "root"
end

file "/etc/cron.daily/prune-health-checks" do
content "/usr/local/bin/pick-job-server.sh && find /data -maxdepth 1 -name '*-healthcheck_*' -mmin '+60' -execdir rm '{}' ';' >/dev/null 2>&1\n"
mode "0755"
owner "root"
group "root"
end

file "/etc/cron.d/ckan-worker" do
content "*/5 * * * * root /usr/local/bin/pick-job-server.sh && /usr/local/bin/ckan-monitor-job-queue.sh >/dev/null 2>&1\n"
mode '0644'
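Before trusting the new daily prune job above with rm, it can be dry-run by swapping -execdir rm for -print (same predicate, same /data layout as the cron line):

find /data -maxdepth 1 -name '*-healthcheck_*' -mmin '+60' -print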
recipes/solr-deploy.rb (9 changes: 7 additions & 2 deletions)

@@ -267,10 +267,15 @@
action [:stop]
end
bash "Copy latest index from EFS" do
+user account_name
code <<-EOS
rsync -a --delete #{efs_data_dir}/ #{real_data_dir}/
-LATEST_INDEX=`ls -dtr #{efs_data_dir}/data/#{core_name}/data/snapshot.* |tail -1`
-rsync $LATEST_INDEX/ #{real_data_dir}/data/#{core_name}/data/index/
+CORE_DATA="#{real_data_dir}/data/#{core_name}/data"
Member commented:

How many snapshots do we keep on the EFS? Could we move the full file off EFS and use an S3 pointer file instead, to reduce EFS costs?

ThrawnCA (Contributor, Author) replied:

The sync script, on export, removes all snapshots except the current one (solr-sync.sh line 90).

We can probably just drop EFS and use S3 without too much trouble. I didn't do it here because it wasn't needed, but it should be fairly straightforward. We don't use EFS for anything that demands high I/O performance; it's just putting timestamps in heartbeat files and passing snapshots in the background.
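
A rough sketch of the pointer-file idea floated above, purely hypothetical and not part of this PR (the pointer filename and S3 key scheme are invented):

# exporter: push the tarball to S3 and leave only a small pointer on EFS
aws s3 cp "$SYNC_SNAPSHOT" "s3://$BUCKET/solr_sync/$CORE_NAME/$SNAPSHOT_NAME.tgz"
echo "s3://$BUCKET/solr_sync/$CORE_NAME/$SNAPSHOT_NAME.tgz" > "$SYNC_DIR/latest.pointer"

# importer: read the pointer, then fetch the real payload from S3
SNAPSHOT_URI=$(cat "$SYNC_DIR/latest.pointer")
aws s3 cp "$SNAPSHOT_URI" "$LOCAL_DIR/"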

+LATEST_INDEX=`ls -dtr $CORE_DATA/snapshot.* |tail -1`
+if (echo "$LATEST_INDEX" |grep "[.]tgz$" >/dev/null 2>&1); then
+mkdir -p "$CORE_DATA/index"
+rm -f $CORE_DATA/index/*; tar -xzf "$LATEST_INDEX" -C $CORE_DATA/index
+fi
EOS
only_if { ::File.directory? efs_data_dir }
end
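
The selection logic in that block can be exercised standalone. A dry run with made-up snapshot names (the grep branch only fires for the new .tgz archives, so a legacy directory-style snapshot falls through untouched):

CORE_DATA=/tmp/demo-core && mkdir -p "$CORE_DATA"
touch "$CORE_DATA/snapshot.mycore-2024-08-07T09:00"       # legacy directory-style name
touch "$CORE_DATA/snapshot.mycore-2024-08-07T10:00.tgz"   # new tarball-style name
LATEST_INDEX=$(ls -dtr $CORE_DATA/snapshot.* | tail -1)   # newest by mtime
echo "$LATEST_INDEX" | grep -q "[.]tgz$" && echo "would unpack" || echo "would fall through"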