We found a way to cut the time it takes for agents to notice your command by a factor of 30. Let’s dive into how it used to work and why there was room to make it faster.
How command notifications used to work
Until today, the P9 Agent talked to the P9 Cloud using a pull model. The cloud cannot contact the P9 Agent; the Agent installed on the computer or server must initiate the communication. Most of the time the P9 Agent is sleeping, waiting for an event or a scheduled task. When it wakes up, it collects information and posts it to the P9 Cloud for further processing. The communication runs over HTTPS for security, and both the client and the server must present certificates to verify authenticity.
One such scheduled task checks every 10 minutes whether a new command exists. Commands include things like restarting a service, displaying a message, or rebooting the device. A command is created the moment you, as a user, request that a device perform a job. You then have to wait up to 10 minutes before the device notices your request. This can be torture if you have screaming users expecting a troubled service to be restarted.
Can we do it better?
Our overall goal was to bring this wait down to something close to instant; anything below one minute from the moment a new command is submitted until it is picked up by the P9 Agent would be satisfying. Needless to say, it must be done without taxing the CPU or network, and it must work across the WAN and with thousands of endpoints, some of which run 24/7. This wasn’t an easy task we gave ourselves. The current HTTPS solution requires about 5000 bytes for handshake and certificates per check. Going from the current six checks per hour to, for example, 180 checks would raise the network traffic to ~1MB/h per agent (180 × 5000 bytes = 900,000 bytes). Multiply that by the number of clients on the network, and we have way too much traffic.
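To make the arithmetic concrete, here is a small back-of-the-envelope calculation in Python. It uses only the figures quoted above (about 5000 bytes per HTTPS check, six versus 180 checks per hour). The fleet size of 1000 agents is a made-up number for illustration, not a measurement.

```python
# Approximate cost of one HTTPS check (handshake + certificates), from the text.
HTTPS_BYTES_PER_CHECK = 5000


def hourly_traffic_bytes(checks_per_hour: int,
                         bytes_per_check: int = HTTPS_BYTES_PER_CHECK) -> int:
    """Traffic one agent generates per hour for a given check frequency."""
    return checks_per_hour * bytes_per_check


current = hourly_traffic_bytes(6)      # one check every 10 minutes
proposed = hourly_traffic_bytes(180)   # one check every 20 seconds

# Hypothetical fleet size, used only to show how the per-agent cost scales.
ASSUMED_AGENTS = 1000
fleet_total = proposed * ASSUMED_AGENTS

print(f"current:  {current:,} bytes/h per agent")
print(f"proposed: {proposed:,} bytes/h per agent")
print(f"fleet:    {fleet_total:,} bytes/h for {ASSUMED_AGENTS} agents")
```

Per agent this is a jump from 30,000 to 900,000 bytes per hour, and with the assumed 1000 agents it adds up to roughly 900 MB per hour.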
The usual approach would be for the P9 Agent to create a persistent connection. It requires only one handshake and one exchange of certificates to establish the connection. This would significantly reduce the network traffic, and clients could be notified about new commands instantly. In reality, firewalls would be an issue. Most firewalls will terminate the connection after some time and require the P9 Agent to re-establish it. Furthermore, the P9 Cloud must be able to handle all the incoming connections, which requires CPU and memory.
DNS to the rescue
A DNS query is typically done using the simplest transport layer communication available in the TCP/IP protocol suite: User Datagram Protocol (UDP). This requires very little overhead and minimal network traffic. We updated the P9 Agent so it now checks every 20 seconds whether a dynamic DNS record exists. When the query returns the record, the agent immediately schedules a check for new commands. Each DNS query requires only about 137 bytes (slightly more if the record exists, i.e., if a command is waiting to be downloaded). This means that every P9 Agent generates only ~0.02MB of extra traffic per hour. Yes, it means extra work for your DNS servers, but only a fraction of the workload they are designed to handle.
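The idea can be sketched in a few lines of Python. This is a minimal illustration, not the P9 Agent's actual code. The record name, the function names, and the use of `socket.gethostbyname` are all assumptions. The system resolver may also cache answers, so a real agent would more likely send its own UDP queries.

```python
import socket
import time
from typing import Callable, Optional

# Hypothetical record name; the real naming scheme is not described here.
SIGNAL_HOSTNAME = "agent-1234.commands.example.com"


def record_exists(hostname: str,
                  resolve: Callable[[str], object] = socket.gethostbyname) -> bool:
    """Return True if the name resolves, False if the lookup fails."""
    try:
        resolve(hostname)
        return True
    except OSError:
        # socket.gaierror (a subclass of OSError) signals a failed lookup.
        return False


def poll_for_signal(hostname: str,
                    on_signal: Callable[[], None],
                    interval_seconds: float = 20.0,
                    max_checks: Optional[int] = None,
                    resolve: Callable[[str], object] = socket.gethostbyname,
                    sleep: Callable[[float], None] = time.sleep) -> int:
    """Check for the record every interval; call on_signal when it exists.

    Returns how many times on_signal was called. With max_checks=None the
    loop runs forever, as a background agent would.
    """
    checks = 0
    signals = 0
    while max_checks is None or checks < max_checks:
        if record_exists(hostname, resolve):
            on_signal()  # e.g. trigger the regular HTTPS command check now
            signals += 1
        checks += 1
        sleep(interval_seconds)
    return signals
```

An agent would call something like `poll_for_signal(SIGNAL_HOSTNAME, fetch_commands)`, where `fetch_commands` stands in for the existing HTTPS command check. The expensive handshake then happens only when the cheap DNS lookup says there is something to fetch.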
The result is that you can expect the P9 Agent to notice your command quickly, with almost no extra load on your server(s).
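As a quick sanity check, the headline numbers follow directly from the two polling intervals and the approximate DNS query size. The snippet below only redoes that arithmetic. It compares worst-case time until the agent notices a command and does not include the time the follow-up command check itself takes.

```python
OLD_INTERVAL_S = 10 * 60   # HTTPS command check every 10 minutes
NEW_INTERVAL_S = 20        # DNS check every 20 seconds
DNS_QUERY_BYTES = 137      # approximate size of one query, from the text


def speedup(old_interval_s: int, new_interval_s: int) -> float:
    """Ratio of worst-case notification delays between two intervals."""
    return old_interval_s / new_interval_s


def checks_per_hour(interval_s: int) -> int:
    """Number of checks that fit into one hour at the given interval."""
    return 3600 // interval_s


dns_bytes_per_hour = checks_per_hour(NEW_INTERVAL_S) * DNS_QUERY_BYTES

print(speedup(OLD_INTERVAL_S, NEW_INTERVAL_S))  # 30.0
print(dns_bytes_per_hour)                       # 24660
```

That is 180 DNS queries and about 24,660 bytes per hour per agent, in line with the roughly 0.02 MB figure.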
The new communication model is being rolled out today, and as a Panorama9 user, you will see the Agent notice your commands 30 times faster than before. We hope this will make your network administration life a bit easier. As always, your feedback and comments are welcome. If you are not already using Panorama9, you can request a free trial here.