Description of noisy neighbor problem #6 lacks some depth.
AWS noisy neighbors problem is very often misunderstood. CPU steal time under linux does NOT mean that somebody is stealing your CPU. It simply means that you wanted to use CPU and hypervisor has given it to another instance. This may happen because you have exceeded your quota or scheduling algorithm selected another pending instance at this very moment and it would give CPU back to you a bit later. In the end in both cases your instance gets fair share.
Why does killing and restarting instance help? It likely moves instance to different hardware node with less active neighbors. When your neighbor is not active your instance can use CPU idle cycles of your neighbor! You sort of become the noisy one. Still hypervisor would prevent it once neighbor starts to fully utilize his CPU quota and you are back to square one.
Amazon specifically states that t1.micro instances do not guarantee CPU performance:
"Micro instances are a very low-cost instance option, providing a small amount of CPU resources. Micro instances may opportunistically increase CPU capacity in short bursts when additional cycles are available. They are well suited for lower throughput applications and websites that require additional compute cycles periodically, but are not appropriate for applications that require sustained CPU performance."
While CPU sharing is pretty well documented noisy neighbor problem still exists for network and disk resources being shared by multiple instances on the same hardware node. The only way to detect these problems is to track network throughput/loss rate for network and IO stats for disk.
"You are guaranteed to avoid noisy neighbors CPU problem by using AWS dedicated instances"
This is incorrect. You simply ensure that the only ec2 instances that can be a noisy neighbor are other instances of yours. Since most users investing in dedicated instances tend to put their instances to work, this sometimes has the net effect of reducing performance.
There was recently [a paper][1] published in ACM Transactions on Computer Systems where the argument was that if your instance is scheduled out by the hypervisor during TCP traffic, the latency in ACKing packets would reduce the network throughput of instances, because of the slow start mechanism.
Given that distributed network rely heavily on network latency and capacity, I found it very interesting to see the effect of busy CPU propagate to slower network IO.
If you're sharing a CPU with a neighbour and you're using 3% and the neighbour is at 100%, the hypervisor needs to somehow squash that 103% into the 100% it has available - which means you lose a little bit of the scheduled time slots that otherwise would have been yours. If your neighbour drops down to 50%, there's now enough CPU time to handle both of you, so there's no 'steal'.
>Why does this happen even when the instance isn't using all of its CPU?
Probably hypervisor does not assign CPU every time it is requested but it still manages to assign as much as needed because in the end there is some idle time left.
We've seen for example, if we have 5 instances running the same app, one instance potentially uses 20% more CPU with tons of CPU steal time, but none are using 100%, the "normal" ones use ~40% and the one with tons of steal time spikes between 60% - 70%.
But rebuilding the instance brings them all in line.
AWS noisy neighbors problem is very often misunderstood. CPU steal time under linux does NOT mean that somebody is stealing your CPU. It simply means that you wanted to use CPU and hypervisor has given it to another instance. This may happen because you have exceeded your quota or scheduling algorithm selected another pending instance at this very moment and it would give CPU back to you a bit later. In the end in both cases your instance gets fair share.
Great detailed explanations of steal time: https://support.cloud.engineyard.com/entries/22806937-Explan... and: http://www.stackdriver.com/understanding-cpu-steal-experimen.... The latter article is mentioned by OP but seems to be not fully read/understood.
Why does killing and restarting instance help? It likely moves instance to different hardware node with less active neighbors. When your neighbor is not active your instance can use CPU idle cycles of your neighbor! You sort of become the noisy one. Still hypervisor would prevent it once neighbor starts to fully utilize his CPU quota and you are back to square one.
Amazon does not oversubscribe CPU according to their CTO: http://itknowledgeexchange.techtarget.com/cloud-computing/am...
Amazon specifically states that t1.micro instances do not guarantee CPU performance: "Micro instances are a very low-cost instance option, providing a small amount of CPU resources. Micro instances may opportunistically increase CPU capacity in short bursts when additional cycles are available. They are well suited for lower throughput applications and websites that require additional compute cycles periodically, but are not appropriate for applications that require sustained CPU performance."
While CPU sharing is pretty well documented noisy neighbor problem still exists for network and disk resources being shared by multiple instances on the same hardware node. The only way to detect these problems is to track network throughput/loss rate for network and IO stats for disk.
You are guaranteed to avoid noisy neighbors CPU problem by using AWS dedicated instances: http://aws.amazon.com/dedicated-instances/
I work for APM ( Application Performance Management) vendor , I have no business praising AWS.
[Edit:spelling and clarity]
[Edit2: changed CPU scheduling description]