Chaos-Monkey-and-Doctor-Monkey

Resilience testing confirms that the system can recover from expected or unexpected events without loss of data or functionality.

To perform Resilience testing, we have implemented Chaos Monkey.

Chaos Monkey

Chaos Monkey is a service that randomly terminates one or more instances in a group. Various companies run Chaos Monkey in a controlled manner during working hours (generally 9 a.m. to 3 p.m., Monday to Friday), letting it randomly disrupt different groups. The goal is to detect and handle failures while engineers are at work, rather than have a client discover the problem at some other time.

To implement Chaos Monkey, we have used the aws-sdk npm package. It provides methods to stop, terminate, and list instances, among others.

Two random numbers are generated: the first selects how many instances to stop, and the second selects which instance to stop on each iteration.

var number = Math.floor(random(1, instances.length - 1));         // random number of instances to shut down
console.log("Shutting down " + number + " instances - Chaos Monkey");

for (var i = 0; i < number; i++)                                  // declare i so it does not leak as a global
{
  var position = Math.floor(random(0, instances.length - 1));     // random instance (the same one may be picked twice)
  stopInstance(instances[position]);
  console.log(instances[position]);
}
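
Because the loop above draws each position independently, it can pick the same instance more than once. The following is a minimal sketch of one way to pick distinct instances, using a partial Fisher-Yates shuffle. The helpers pickDistinct and chooseVictims are hypothetical names, not part of the project, and the rule "leave at least one instance running" is an assumption about the intent of the instances.length - 1 bound.

```javascript
// Hypothetical helper: returns `count` distinct items chosen uniformly at
// random from `items`. Works on a copy, so `items` itself is not modified.
function pickDistinct(items, count) {
  var pool = items.slice();
  var n = Math.min(count, pool.length);
  for (var i = 0; i < n; i++) {
    // Math.random() is in [0, 1), so j is an integer in [i, pool.length - 1].
    var j = i + Math.floor(Math.random() * (pool.length - i));
    var tmp = pool[i];
    pool[i] = pool[j];
    pool[j] = tmp;
  }
  return pool.slice(0, n);
}

// Hypothetical helper: chooses between 1 and instances.length - 1 victims,
// which (by assumption) always leaves at least one instance running.
function chooseVictims(instances) {
  if (instances.length < 2) {
    return [];
  }
  var count = 1 + Math.floor(Math.random() * (instances.length - 1));
  return pickDistinct(instances, count);
}
```

Each element returned by chooseVictims(instances) could then be passed to stopInstance in turn.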

The method ec2.stopInstances({ InstanceIds: [instanceId] }) is used to stop the instance.
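
As an illustration, a stopInstance helper built on that call might look roughly like the sketch below. The signature is an assumption: the project's own stopInstance takes only the instance ID, whereas this version also receives the EC2 client and a completion callback so that it can be exercised without AWS credentials. With aws-sdk v2 the client would typically be created as new AWS.EC2({ region: ... }).

```javascript
// Sketch only. `ec2` is any object exposing the aws-sdk v2 style method
// stopInstances(params, callback); `done` is a Node-style callback.
// Both extra parameters are assumptions made for this example.
function stopInstance(ec2, instanceId, done) {
  var params = { InstanceIds: [instanceId] };
  ec2.stopInstances(params, function (err, data) {
    if (err) {
      console.log("Could not stop " + instanceId + ": " + err.message);
      return done(err);
    }
    console.log("Stop requested for " + instanceId);
    done(null, data);
  });
}
```

Passing the client in as a parameter also makes it straightforward to substitute a stub in tests, so that a test run never stops a real instance.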

Chaos Monkey to test backup

Chaos Monkey can also be used to check that the backup servers work and come up correctly when needed. To demonstrate this, we have added a list of backup servers. A heartbeat checks the number of application servers available; if this value falls below a certain threshold, it alerts the user and adds the backup servers to the list.
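
The heartbeat check described above could be sketched as follows. The function name ensureCapacity, the minServers parameter, and the choice to promote backups one at a time until the minimum is met are all assumptions made for illustration; the project may simply add every backup server at once.

```javascript
// Hypothetical sketch: if fewer than `minServers` application servers are
// available, move backup servers into the active list until the minimum is
// met or the backups run out. Returns the servers that were promoted.
function ensureCapacity(active, backups, minServers) {
  var promoted = [];
  while (active.length < minServers && backups.length > 0) {
    var server = backups.shift();
    active.push(server);
    promoted.push(server);
  }
  if (promoted.length > 0) {
    console.log("ALERT: capacity low, added backup servers: " + promoted.join(", "));
  }
  return promoted;
}
```

Calling this function from the heartbeat on every tick keeps the check cheap: when capacity is sufficient, it returns an empty array and changes nothing.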

The load balancer forwards each request to an application server. If an error is received, it redirects the request back to the proxy so that it is forwarded to the next server in the list.

request(options, function(error, response, body)      // request passes (error, response, body) to its callback
{
	if(!error)
	{
		console.log("Success");
		instances.push(TARGET);                  // Add the server back to the last position in the list
	}
	else
	{
		console.log("Redirecting...");
		res.redirect('http://localhost:3000');   // Redirect back to the proxy to be handled by the next server
	}
}).pipe(res);
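
The rotation implied by the snippet above (a server is put back at the end of the list only after it answers successfully) can be sketched as below. It assumes that the target is taken from the front of the list before the request is made, which the original snippet does not show; takeTarget and returnTarget are hypothetical names.

```javascript
// Assumed step: take the server at the front of the list as the target
// for this request. Returns undefined when the list is empty.
function takeTarget(instances) {
  return instances.shift();
}

// After a successful response, put the server at the back of the list.
// A server that failed is not returned, so it drops out of rotation.
function returnTarget(instances, target) {
  instances.push(target);
}
```

With this scheme a stopped server is removed from rotation the first time a request to it fails, and the retried request reaches the next server in line.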

Doctor Monkey

The job of the Doctor Monkey is to perform health checks on instances. If any unhealthy instances are detected, they are reported and removed from service.

We have used the following approach:

Each instance is running a small application which monitors its health and maintains a counter.

For every 'ALERT' status this counter is incremented, and for every 'OK' status it is decremented (never below zero). Thus a couple of ALERTs will not cause the machine to be terminated. Only if several ALERTs occur in a short span of time, pushing the counter above 5, is the instance terminated.

function statusVal() 
{
	if(loadpercent < 50 && cpuload < 50)
	{
		counter--;
		if(counter<0)
			counter = 0;
		return "OK";
	}
	else
	{
		counter++;
		if(counter>5)
		{
			stopInstance(instanceId);
			return "Terminated";
		}
		return "ALERT";
	}
}
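
To see how the counter behaves over time, here is a small sketch that restates the same update rule as a pure function and replays a sequence of readings. The names LIMIT, step, and replay are introduced only for this example, and stopInstance is deliberately left out so the snippet runs on its own.

```javascript
var LIMIT = 5; // same threshold as in statusVal()

// Pure version of the counter update: takes the current counter and whether
// the latest reading was healthy, and returns the new counter and status.
function step(counter, healthy) {
  if (healthy) {
    return { counter: Math.max(0, counter - 1), status: "OK" };
  }
  var next = counter + 1;
  return { counter: next, status: next > LIMIT ? "Terminated" : "ALERT" };
}

// Replays a sequence of readings (true = healthy) starting from counter 0
// and collects the status reported for each reading.
function replay(readings) {
  var counter = 0;
  return readings.map(function (healthy) {
    var result = step(counter, healthy);
    counter = result.counter;
    return result.status;
  });
}

console.log(replay([false, false, true, true, false]));
console.log(replay([false, false, false, false, false, false]));
```

In the first sequence the two OK readings bring the counter back to zero, so the instance survives. In the second, the sixth consecutive unhealthy reading pushes the counter to 6 and the status becomes "Terminated".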

The Doctor node runs a heartbeat that checks each instance every 2 seconds and displays the results on the index.html page.

The output is color coded as follows:

Cool (Blue) - not monitored/connected.
Ok (Light Blue) - working fine.
Alert (Orange) - Threshold crossed.
Terminated (Red) - Instance terminated.
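
The color coding above could be expressed as a simple lookup, sketched below. The status keys follow the strings returned by statusVal(), the fallback to "Cool" for any unknown status is an assumption, and the CSS color names are illustrative rather than the exact shades used in index.html.

```javascript
// Illustrative mapping from status to a CSS color name.
var STATUS_COLORS = {
  "Cool": "blue",          // not monitored/connected
  "OK": "lightblue",       // working fine
  "ALERT": "orange",       // threshold crossed
  "Terminated": "red"      // instance terminated
};

// Returns the color for a status, falling back to "Cool" (an assumption)
// when the status is missing or unknown.
function statusColor(status) {
  if (Object.prototype.hasOwnProperty.call(STATUS_COLORS, status)) {
    return STATUS_COLORS[status];
  }
  return STATUS_COLORS["Cool"];
}
```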
