UPDATE: Maybe you are interesting how the story went on. Have a look at Natural borders of Cloud Queue: Taking the step to Service Bus.
Programming for the cloud is always experimental. This can be quite scary when unexpected things happen in a production environment. Why? It is pretty hard to understand all the environment details plus all the implementation details of used services.
Last Tuesday we had to publish a new version to the cloud. We communicated a maintenance window, informed the customer to be aware of a short outage, but nothing that should have affected the system too much. Actually half an hour later everything should have been done, as the code changes have been pretty small.
These have been the changes:
- shorten log messages due to we created a lot in Stackify (over 300 gb of log messages within 2 months)
- Correct one small issues: Missing try catch in catching the exit of a process will lead to restart the worker. Okay, actually that doesn’t sound small but restarting the whole worker when one process disappears can be considered to be a valid fix. It is not optimal because of the time that is necessary to process.
- Add some good guess in code. Yes, I really hate such kind of programming, just implementing something because something happened that wasn’t detectable and understandable fully, so the best guess is implemented to hope and believe
Let’s go into more depth about this entry, I also created a StackOverflow question for this: Scalability of Azure Cloud Queue. Unfortunately this doesn’t solve it, got additional information, though.
Shortly repeating the content of the question:
In current project we currently use 8 worker role machines side by side that actually work a little different than azure may expect it.Short outline of the system:
- each worker start up to 8 processes that actually connect to cloud queue and processes messages
- each process accesses three different cloud queues for collecting messages for different purposes (delta recognition, backup, metadata)
- each message leads to a WCF call to an ERP system to gather information and finally add retreived response in an ReDis cache
- this approach has been chosen over many smaller machines due to costs and performance. While 24 one-core machines would perform by 400 calls/s to the ERP system, 8 four-core machines with 8 processes do over 800 calls/s.
Now to the question: when even increasing the count of machines to increase performance to 1200 calls/s, we experienced outages of Cloud Queue. In same moment of time, 80% of the machines‘ processes don’t process messages anymore.
Here we have two problems:
- Remote debugging is not possible for these processes, but it was possible to use dile to get some information out.
- We use GetMessages method of Cloud Queue to get up to 4 messages from queue. Cloud Queue always answers with 0 messages. Reconnect the cloud queue does not help.
Restarting workers does help, but shortly lead to same problem. Are we hitting the natural borders of scalability of Cloud Queue and should switch to Service Bus?
So how did we get there?
Certainly the changes in the program have not been exactly what I communicated to the customer. „Just a few more changes“ the developer told me. Some code refactoring, some improvements, a little beautifying. No of these changes should lead to the problem we had, but it certainly added some more complexity, as the system was more different than we thought.
The most important change, that had been introduced to the system was something completely different like mentioned above. When deploying, the developer decided to increase the size of the machines. We experienced over 95% cpu usage on these machines, so it sounded natural to the developer to just increase the memory, due to he had the impression that memory was the cause of the usage.
He was right.
As stated before, we communicated 800 calls/s to the customer. We are not the only guys using the ERP system, currently ca. 2.5 billion calls are done to that ERP monthly. So it really matters how many calls we do, due to we don’t know how much load is put from the other consumers, so we could break it.
But the additional memory lead to 1200calls/s using tons of TCP connections. So far so good. It worked fine, lightning speed, but … only for ca. 2 hours. Then the phenomen arises that I mentioned in the StackOverflow entry.
- All of the machines‘ processes stopped in the same moment of time. That really catches my attention, due to it was machine border comprehensive.
- Some of the processes stayed active, but with unpredictable performance. Some have been very slow, others worked reliable like before
It was Thursday evening, 10:00pm and the workers just had been restarted. And that happened again. What now? We considered all the code. The „good guess“ we implemented didn’t change anything. We just decided to put code that will reconnect to Cloud Queue after a certain time frame when no message has been received. That doesn’t help at all.
As time went by and we didn’t want to stay up another night, we needed a decision. This was the production system. What now?
The most obvious change has been the increase of memory, after all. We remotely connected to one of these machines and checked the count of TCP connections being done. There have been hundreds. Certainly not only the ones to the ERP system, but also to Cloud Queue, Azure storage accounts, etc. That really makes a different when the CPU is not busy with doing System.IO due to having too less memory.
It could now completely focus on TCP connections and the workload.
We deployed again and went back to 7 Gb of memory and reduced the processes of one machine while adding another worker. Instead of 7 machines with 8 processes we then had 8 machines with 7 processes. And watched the system. Today’s Monday and it worked pretty well, no outages like this anymore.
There are still some questions that remain:
- When we add more machines due to necessary performance, will that happen again?
- Why didn’t we get any exceptions, just zero messages from GetMessages method of the Cloud Queue?
- Cloud Queue promises up to 2000 messages/s, we never hit that amount. But we surely put some serious load.
- From my point of view, we really hit the maximum count of TCP connections of the OS, but why the outage affected all the machines instead of the machines themselves then?
- Why isn’t it possible to remotely debug processes beside worker/ web roles themselves?
Couldn’t find anything about the maximum count of connections that Cloud Queue offers – that would be pretty interesting 😉
We experienced this behavior more often. The solution to fix this issue seems to be to just switch to another storage account. Every account garantuees to scale up to max. 20000 operations/s. We do have three cloud queues on one storage where one is heavily and steadily used while the others are used for different tasks that occur at least once a day or once a week. Beside that we use the storage account for creating backups from ReDis cache and another functionality that also heavily uses the storage account once on hour.
It looks like when there is heavy traffic on the hourly task, it breaks. What does „it breaks“ mean? As described above we don’t get any exceptions from the Cloud Queue, we just don’t retrieve any messages even when recreating the connection.
I’ll update this entry when we deployed new solution and add some results.
Now we did do the changes. First a summarization of the usage of the storage account:
- We had six queues running on one storage account.
- We used the blob storage, once a day pretty heavily.
- The „normal“ diagonistics that Azure provides out of the box also used the same storage account.
- Some controlling processes used small tables to store and read information once an hour for ca. 20 minutes
- There may be up to 800 calls/s that try to increase a number to count calls to an ERP system.
When recognizing that the storage account is put under heavy load we split it up.
- Now there are three physical storage accounts heaving 2 queues.
- The original one still keeps up to 800/s calls for increasing counters
- Diagnositics are still on the original one
- Controlling information has been also moved
The system runs now for 2 weeks, working like a charm. There are several things we learned from that:
- No, the infrastructure is „not just there“ and it doesn’t scale endlessly.
- Even if we thought we didn’t use „that much“ summarized we used quite heavily and uncontrolled.
- There is no „best practices“ anywhere in the net that tells the complete story. Esp. when start working with the storage account a guide from MS would be quite helpful
- Exception handling in storage is quite bad. Even if the storage account is overused, I would expect some kind of exception and not just returning zero message without any surrounding information