Lab performance is a frequently misunderstood concept. If you deploy labs, I’m sure you’ve heard the comment, “This lab is slow.” What does that mean and relative to what?
We’ll endeavor to peel back the proverbial onion and explain factors that impact lab performance and how you can create labs that perform well on our platform, Skillable Studio.
Typically, when we discuss performance, we first narrow down to one of three specific areas:
- The time it takes to launch a lab (“start”). This is the time from when the user hits the Launch button until they can effectively use the lab.
- The performance of the application or services inside the lab. When I click a button or take an action, does it happen as quickly as I expect?
- The performance of the UI in the lab. This applies to Virtual Machines (VMs) only and generally means that the mouse and keyboard are responsive – or not.
Let’s break down these three types of performance and look at what goes into achieving “good” performance. To have good performance overall, a lab must do well in all three. When troubleshooting performance issues, it’s equally critical to identify which of the three types you are troubleshooting.
Launch times are the length of time it takes your lab to start. This can range from a few seconds to many, many minutes and the factors change based on the type of lab you have. While the lab type can vary, the bigger and more complex the lab, the longer it will take to start. Managing startup time is more about managing experience and expectations than true performance optimization.
Cloud Labs (Azure/AWS).
The launch time of cloud labs is determined by how quickly the selected cloud provider can provision the environment you requested. Azure and AWS labs can each leverage a deployment template: ARM for Azure and CloudFormation for AWS. If that template contains a large and complex deployment, your lab will take a long time to start. Optimization of this has nothing to do with Skillable Studio. The deployment time will be the same inside Skillable Studio as outside. Skillable Studio has built-in retries in the event a deployment fails, so if you experience a very long launch time, one possible cause is that a deployment failed and was retried. If your deployment is fast in a native cloud portal, it will be just as fast in Skillable Studio.
Performance Optimization Tip: Use simple deployment templates.
Container Labs (Docker).
A container launch is generally very quick. The container does not need to “boot,” so simple Linux containers deploy in seconds. Within Skillable Studio you can provision container startup scripts and those scripts, depending on what they do, may start services or perform other functions that can impact startup time. Additionally, you can choose to present either the container console, or if the container supports it, navigate directly to a web page inside the container. In those situations, you may be waiting for the website to be ready before the container “appears” ready for the user.
Performance Optimization Tip: Present the console and then in the lab, have the user navigate to any service web pages.
VM Labs (Hyper-V/VMware/Azure IaaS/AWS EC2).
This is where you have the most control. It’s also the easiest to understand.
How do you make a computer boot quickly?
- Disable services that are not needed
- Apply adequate RAM
- Only install software that is required
Performance optimization for VMs is a bit of an art form. The same things you would do to your own computer will apply to a VM used for a lab. You should take time to tune and optimize individual operating systems to create the lowest possible boot times.
One very good practice is to follow optimization guidance for Virtual Desktop Infrastructure (VDI). Lab deployment for VMs and VDI share many common architectural elements and the patterns and practices for VDI universally apply to lab VMs. One good reference is https://docs.microsoft.com/en-us/windows-server/remote/remote-desktop-services/rds-vdi-recommendations-1803. There are also a number of tools, including the VMware OSOT (https://flings.vmware.com/vmware-os-optimization-tool) and Citrix Optimizer (https://support.citrix.com/article/CTX224676?download), which implement many best practices for you. As with anything, read and consider the license agreements for any tool you use and fully test your lab after any optimizations.
Performance Optimization Tip: Use the Task Manager to reduce the number of services on startup for faster start times.
Pro Tip: Use a network monitor or tracing tool to see if your lab is attempting to contact external services during startup and find a way to block those requests.
Start State and VMs.
For more complex labs, you may implement “Start states.” Start states for Hyper-V allow the lab to be resumed instead of booted. In many cases, this can greatly reduce your start times, especially for large and complex labs. Start states come with some drawbacks: they restrict your ability to update your lab without additional steps, and some software does not work well when restored from a Start state. You should fully test your lab after implementing Start states.
Pro Tip: Do not implement Start states until you have fully validated your lab is working and you anticipate no changes.
The decision to use a Start state vs naturally booting the VMs can have a huge impact on the perception of launch performance. If you have a complex lab, your boot sequence must account for dependencies between lab components. For example, do you need to start a server VM before a client VM to ensure the client VM can correctly interact with the server VM? Skillable Studio enables you to sequence the startup of VMs by controlling which VMs auto-start (and which do not), as well as introducing startup delays.
For more complex lab environments, the time to correctly start and stabilize services may be far longer than the actual boot time of the VMs. We have had labs that take nearly 45 minutes to be “ready for use” due to complex dependencies between VMs, the need to slowly bring various services online and the need to let those services stabilize after boot. In these cases, using a Start state yields a better launch time. Even when using Start states, sequencing can be important as larger RAM VMs (usually servers) will take longer to restore than lower RAM VMs.
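The sequencing idea described above can be sketched in a few lines of Python. This is only an illustration of dependency ordering and startup delays, not how Skillable Studio implements them. The VM names (`dc01`, `sql01`, `app01`, `client01`), the settle times and both function names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def startup_order(depends_on):
    """Return VM names ordered so every VM comes after the VMs it depends on."""
    return list(TopologicalSorter(depends_on).static_order())


def startup_delays(depends_on, settle_seconds):
    """Return the earliest start time (seconds after launch) for each VM.

    A VM may start once each of its dependencies has started and has had
    its own settle time to bring services online.
    """
    start_at = {}
    for vm in startup_order(depends_on):
        start_at[vm] = max(
            (start_at[dep] + settle_seconds.get(dep, 0)
             for dep in depends_on.get(vm, ())),
            default=0,
        )
    return start_at


# Hypothetical lab: each VM maps to the VMs that must be running first.
LAB = {
    "dc01": [],
    "sql01": ["dc01"],
    "app01": ["dc01", "sql01"],
    "client01": ["app01"],
}
# Hypothetical settle times, in seconds, measured by the lab author.
SETTLE = {"dc01": 120, "sql01": 90, "app01": 60}

if __name__ == "__main__":
    for vm, delay in startup_delays(LAB, SETTLE).items():
        print(f"{vm}: start {delay}s after launch")
```

Working out the delays this way also gives you a realistic “ready for use” estimate to share with users, which is usually far longer than the raw boot time of any single VM.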
In these situations, it becomes more important to manage the users’ expectations with some of the techniques I’ll share shortly.
Performance inside the lab.
When your lab is running, do the applications and tools respond as quickly as they should? For example, if you open an application, does it open as you would expect? We are going to focus exclusively on VMs because, again, this is where you have control. Cloud labs and Docker containers will perform as fast as the underlying fabric. If Azure or AWS is running slowly, so will your lab. In both these cases, there is little you can do as the author to design for performance.
Performance in VMs comes down to the usual suspects: Disks, Network, Memory and CPU. If you have inadequate resources allocated, you will see poor performance.
Let’s look at how we should evaluate each of these in a lab:
Disk storage is managed by Skillable Studio on flash and SSD storage. Limits (aka throttling) are not placed on disk input/output (IO). In general, we do not see performance bottlenecks in our disk subsystems and we constantly monitor and upgrade our disks. As a lab developer, you do not gain performance by using larger or multiple disks. In most cases, the simplest disk configurations are best, and the key is to reduce disk activity as much as possible.
Pro Tip: Most desktop variants of Windows and Linux have scheduled disk maintenance tasks and software update tasks. These tasks can incur massive disk IO and can be triggered every time you launch your lab. Many labs have been viewed as underperforming only to discover that scheduled virus scans, defragmentations or update installations were running in the background. Because labs are launched from a fixed state, these will repeat for every lab launch.
Network performance has two considerations: Internet access and VM-to-VM communication. Internet access is governed by a series of firewall devices and is throttled, yet adequate for most labs. VM-to-VM communication is governed by the underlying hypervisor. There are multiple types of network adapters for Hyper-V and VMware. Ensure you select the Enhanced network adapter to allow the fastest VM-to-VM performance. Legacy or emulated adapters should only be used for compatibility reasons. Wherever possible, use static addresses, including TCP/IP and MAC.
Pro Tip: Do not rely on lab users to download items from the Internet. Pre-download and stage any files. If you think the files will change often, stage them on an attached ISO which is easy to swap out and update versus in the VM, which is not as easy.
Memory is where you have the most impact. The number one cause of performance issues inside labs is inadequate RAM. Inadequate RAM causes disk paging, which ultimately slows performance. Measure your lab to ensure that you have enough memory and avoid disk paging. This will have the single biggest impact on overall lab performance.
Pro Tip: Do not add more RAM than you need as it drives up cost. Do not skimp on RAM as it kills performance. Find the balance.
Labs typically do not handle large datasets and are not compute intensive, so you do not need large quantities of CPUs. In most cases, regardless of the RAM configured, four (4) CPUs is adequate. Again, test your lab and understand how the software in your lab uses CPUs. Adding more CPUs typically does not help performance and in some cases, can hurt performance.
Pro Tip: One (1) CPU for every 4 GB of RAM is a good rule of thumb. Any more than eight (8) CPUs is overkill.
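As a minimal sketch, the rule of thumb above can be written as a small Python helper. The function name and its parameters are hypothetical. Rounding up to the next whole CPU and enforcing a minimum of one CPU are assumptions, since the rule does not say how to treat in-between RAM sizes.

```python
import math


def suggested_cpus(ram_gb, gb_per_cpu=4, max_cpus=8):
    """Suggest a CPU count from RAM: one CPU per 4 GB, capped at eight.

    Rounding up and the floor of one CPU are assumptions for illustration.
    """
    if ram_gb <= 0:
        raise ValueError("ram_gb must be positive")
    return max(1, min(max_cpus, math.ceil(ram_gb / gb_per_cpu)))


if __name__ == "__main__":
    for ram in (2, 4, 6, 16, 64):
        print(f"{ram} GB RAM -> {suggested_cpus(ram)} CPU(s)")
```

Treat the result as a starting point only. Measuring how the software in your lab actually uses CPU remains the deciding factor.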
To measure performance when your lab is running, you should focus on a few key metrics:
- CPU and Disk utilization – Lower is better.
- Disk queue length – Lower is better, generally five (5) or less.
- Page file usage and Hard Page Faults – Lower is better. These indicate the page file was needed on the disk.
These metrics can be accessed through Windows Task Manager or a variety of Linux tools. Remember to measure these metrics for the entire duration of your lab. If your lab is being booted (versus restored from a Start state), these numbers will likely be very high initially. For your users to have a good experience, you also need to understand how long it takes for your VMs to “settle down” after starting, with all services returned to an idle state.
Pro Tip: Regardless of what you are running in your lab, there is probably an optimization guide for that software. Optimization methods for software running in a lab do not differ from those for a production environment.
A common mistake is not managing user expectations. Almost all applications and services require a small warmup time when they are first started and almost all modern software assumes Internet access. Users often have unrealistic expectations about lab environments and expect them to perform better than the real world. The opposite is usually true because of shared infrastructure between users. Using scripts and actions to pre-launch software, or stating that a given task may take a few minutes, goes a long way toward managing expectations and ultimately the perception of performance.
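One way to estimate that “settle down” time is to record the metrics at a fixed interval after launch and look for the first sustained quiet period. The Python sketch below works on samples you have already collected with whatever tool you prefer. The `Sample` type, the function names, the 20% CPU threshold and the three-sample run length are all assumptions for illustration. Only the disk queue limit of five comes from the guidance above.

```python
from typing import NamedTuple, Optional, Sequence


class Sample(NamedTuple):
    t: float            # seconds since lab launch
    cpu_percent: float  # CPU utilization, 0-100
    disk_queue: float   # disk queue length


def is_quiet(sample, max_cpu=20.0, max_queue=5.0):
    """True when a sample is at or below the (assumed) idle thresholds."""
    return sample.cpu_percent <= max_cpu and sample.disk_queue <= max_queue


def settle_time(samples: Sequence[Sample], quiet_run: int = 3) -> Optional[float]:
    """Return the timestamp that begins the first run of `quiet_run`
    consecutive quiet samples, or None if the VM never settles."""
    run_start = None
    run_len = 0
    for sample in samples:
        if is_quiet(sample):
            if run_len == 0:
                run_start = sample.t
            run_len += 1
            if run_len >= quiet_run:
                return run_start
        else:
            run_len = 0
    return None


# Hypothetical measurements taken every 10 seconds after a cold boot.
BOOT_SAMPLES = [
    Sample(0, 95, 12), Sample(10, 80, 9), Sample(20, 15, 3),
    Sample(30, 60, 7), Sample(40, 12, 2), Sample(50, 8, 1),
    Sample(60, 5, 0),
]

if __name__ == "__main__":
    print("Settled at (seconds):", settle_time(BOOT_SAMPLES))
```

Requiring several quiet samples in a row avoids mistaking a brief lull during boot, such as the dip at 20 seconds in the sample data, for a truly idle VM.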
An often-overlooked element of performance is application tuning, ensuring that:
- Any applications running on your VMs are configured to run as efficiently as possible
- Unnecessary add-ons are avoided
- Integrated help and tips are turned off
- Updates are disabled, especially for server-based applications
- Networks are correctly configured (If you are using Start states, capture the Start state with all assorted configuration consoles pre-launched and open)
Pro Tip: Many modern apps expect that they can reach the Internet to check for updates. They can take a very long time to start if the Internet is not available, as they attempt to connect, retry and eventually fail without you knowing.
User interface performance.
The final element of performance has nothing to do with the lab and everything to do with your connection to the lab. Users expect the screen to respond with little delay when they move their mouse or type. You can have the best technically performing lab in the world, with excellent startup times, but if the user has a poor connection, the complaint will often be, “the lab is slow.” Usually this one is out of your control.
Types of user interfaces.
There are three user interfaces that may be exposed in your lab:
- VM Desktop – A “remote desktop” style connection to the VM on Hyper-V or VMware that leverages our HTML RDP Gateway.
- SSH Console – An HTTP-based text console used in Docker containers and available on some Linux VMs.
- Portal – A redirect to a public portal such as the Azure or AWS portal.
User interface performance as described below is only a valid consideration for VM remote desktop connections. The other two rely on standard HTTP traffic, which works very well even on very poor connections. Put another way, if you can effectively browse the web, you can effectively use our SSH Console and any Portal we present.
Factors affecting user interface performance.
We provide a Connection Assessment Test (CAT) on our website which provides a quality report for users’ connections by running a series of tests against our datacenters. These tests provide a reasonable benchmark but are not a guarantee of performance. There are three primary factors that impact performance:
- Latency – The time to send data packets from the user to the server hosting the accessed VM. Latency has a direct impact on the response of the mouse and keyboard. The higher the latency, the more delay can appear. Generally, you are looking for latency values of 250 ms or less to have a good experience. Latencies above 350 ms are usable but can be frustrating at times. Latencies above 500 ms can be unusable.
- Jitter – The stability and consistency of the connection.
- Error rate – The number of data packet errors that occur.
An unstable connection with low latency but high jitter and errors will be frustrating to use. This combination might be found on public Wi-Fi or a connection that is very heavily used. Conversely, high latency combined with low jitter and errors, on a very stable connection, creates a suitable experience. We have effectively run labs on tethered cell phones with 500+ ms latency but very stable connections.
When evaluating a location for its lab suitability, it’s important to conduct the evaluation at the time of day when the lab will be used. For example, evaluating a hotel conference room in the middle of the night will produce a very different result than in the middle of the day when more users are active. Many lab hosts over the years have been burned by assuming that all times of day perform equally at a given site.
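To make the latency and jitter discussion concrete, here is a minimal Python sketch that rates a latency value against the thresholds above and estimates jitter from a series of round-trip times. It does not reflect how the Connection Assessment Test computes its report. The function names and sample values are hypothetical. The “acceptable” band between 250 ms and 350 ms is an assumption, since the guidance above does not describe that range. Jitter is calculated here as the mean absolute difference between consecutive samples, which is one common simplification.

```python
from statistics import mean


def rate_latency(latency_ms):
    """Map a latency in milliseconds to a rough experience rating."""
    if latency_ms <= 250:
        return "good"
    if latency_ms <= 350:
        return "acceptable"  # assumed band, not stated in the guidance
    if latency_ms <= 500:
        return "usable but frustrating"
    return "can be unusable"


def jitter_ms(rtts):
    """Mean absolute difference between consecutive round-trip times."""
    if len(rtts) < 2:
        return 0.0
    return mean(abs(b - a) for a, b in zip(rtts, rtts[1:]))


# Hypothetical round-trip samples in milliseconds.
TETHERED_PHONE = [510, 505, 515, 508]   # high latency, very stable
PUBLIC_WIFI = [40, 180, 35, 220]        # low latency, very unstable

if __name__ == "__main__":
    for name, rtts in (("tethered phone", TETHERED_PHONE),
                       ("public Wi-Fi", PUBLIC_WIFI)):
        print(f"{name}: average {mean(rtts):.0f} ms "
              f"({rate_latency(mean(rtts))}), jitter {jitter_ms(rtts):.1f} ms")
```

The two sample sets mirror the contrast described above: the tethered phone rates poorly on latency alone but has very low jitter, while the public Wi-Fi connection has a low average latency but swings widely between samples.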
Skillable Studio features automatic geo-location of lab sessions. At lab launch, your last known public IP address is evaluated to determine your location. Based on this, we attempt to launch your lab from the nearest possible datacenter or associated vPOP which contains a replica of your lab.
Note that if you use a VPN or a corporate network in which your egress point to the public Internet is not anywhere near your physical location, you may be geo-located to a more distant datacenter. The geo-location is not the position of the user, but the location of the point at which the user enters the public Internet. Your public IP is either determined automatically by your lab client or passed in manually using an API launch. Additionally, for users of our API, “hints” can be included that force geo-location to a specific datacenter. While geo-location does a reasonable job of ensuring that a user is attaching to the “closest” datacenter, it may not always choose the lowest latency datacenter and it may be manually overridden by the system which is launching the lab.
Optimizing labs for user interface performance.
There are several things you can do to ensure your lab performs well even on high-latency connections. These come down to generating as little traffic as possible as the user navigates the interface, which means producing fewer UI updates.
For example, use a solid color background wallpaper instead of a complex bitmap. Disabling visual effects such as animations and fade effects goes a long way. One setting in Windows allows you to adjust the visual effects for “best performance” instead of “best appearance.” This one setting makes a massive difference. As with startup optimization, many best practices, guides and optimizations for VDI can be applied here to great effect.
- Use a smaller screen resolution in the VM for higher-latency connections.
- Instead of navigating menus, use shortcut keys or commands to avoid UI updates on poor connections.
- Implement Type Text in IDLX to eliminate typing by the user.
Balancing configuration, performance and experience.
Creating a great lab experience does not always mean that you focus exclusively on performance. The very nature of “slow” is subjective. To a Formula 1 driver, a sports car is slow. To an SUV driver, it might be very fast. Managing expectations and controlling the experience go a long way toward creating the impression of “good performance.”
Below are some tips for helping to manage user expectations regarding performance:
- If you know a lab takes a long time to start (such as a complex multi-VM infrastructure or an elaborate cloud template), you can use the time during startup for other purposes. Skillable Studio includes the notion of a startup video or URL which can be displayed while the lab is starting. This could be a lab overview, tips and tricks or some resource to pass the time. One customer even used this time to display a small game.
- For VM labs, the default behavior is to not show the desktop of the VMs until they are fully started. In a complex lab with domain controllers and other services, this may take several minutes. You can modify this default, which allows the students to see the boot process and gives them access to the lab manual early. Alternatively, you can choose to not boot VMs by default and allow the student to boot manually, engaging them earlier.
- Cloud deployments have a similar function which directs the system to deploy the cloud templates in the background. This allows you to get the user into the lab early and perform some tasks, all while longer deployments run in the background. Automatic notifications inform the user when the background deployments are completed.
Setting expectations is key. If you tell a user a deployment will take five minutes because that is the average startup time you’ve measured and it ends up taking three, they will be impressed.
If you tell them nothing and it takes three minutes, “This lab is slow.”
Want to optimize your labs?
Our lab platform team can analyze your lab series for inefficiencies and then work with our lab developers to optimize your labs.