Editorial note: This article is a follow-up to Corey Hynes’ blog on Optimizing Lab Performance. We recommend reading that article first to review the factors that impact lab performance and how you can create labs that perform well on our lab development platform, Skillable Studio.
Now that you have optimized your lab, you can start to analyze the recorded performance data over time. To effectively do this, you need to understand what you can analyze and how the data is captured.
Lab Start Times
The lab “start time” is defined as the time between the user clicking “Launch” and the lab being ready for the user. The definition of “Ready” can vary depending on the lab and its configuration.
A virtual machine (VM) is considered “Ready” when the heartbeat from the hypervisor is accessible to Skillable Studio. This is controlled through the “Wait for heartbeat before displaying to user” configuration setting. If the VM operating system does not have a set of integration components or VM additions, this setting should be unchecked, or Skillable Studio will wait a long time before moving on, which will result in very long startup times. Unchecking this setting will always produce very short startup times.
Pro Tip: To produce the most realistic startup times, check this box for any VMs that are hypervisor aware (have supported VM additions or integration components) and uncheck it for any that do not.
Note that even though a heartbeat may be present, that is not an indication that your lab is “Ready” for the user. There may still be many background services or processes that need to run.
For cloud labs, any template deployment configured to run in the foreground will count towards your lab startup time. To exclude a template from the lab startup time calculation, configure it to deploy in background as shown below.
Pro Tip: Measuring performance with templates set to deploy in background is almost pointless as all start times will be recorded in seconds, which is the time it takes to provision a cloud slice, not the time it takes to prepare the user environment.
When analyzing lab start times, you only need three values:
- The lab name or ID
- The start date of the lab instance
- The startup duration of the lab in seconds
Using these three values you can graph, over time, the startup performance of your lab.
Consider the following example of a single lab:
This data shows very long startup times in the second half of 2019 and very short startup times in 2020. Analysis of the lab notes indicates that the checkbox to “Wait for heartbeat…” was selected on a VM that did not contain a set of integration components, which created very long startup times in 2019. The actual underlying performance did not change.
Pro Tip: Never average the lab startup times across two labs. Focus on one lab at a time and analyze over time.
By correlating changes in lab configuration with startup times over time, we can get a clear picture of whether a lab’s performance is increasing, decreasing, or remaining constant.
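As a minimal sketch of this kind of analysis, the Python below groups the three values (lab ID, start date, startup duration in seconds) by month for a single lab. The record layout, the lab IDs, and the sample numbers are hypothetical and are not an export format from Skillable Studio. The median is used here as one reasonable summary; it is a choice made for this sketch, not a method prescribed by the platform.

```python
from collections import defaultdict
from datetime import date
from statistics import median

# Hypothetical records: (lab_id, start_date, startup_seconds).
launches = [
    ("lab-101", date(2019, 9, 3), 840),
    ("lab-101", date(2019, 9, 17), 910),
    ("lab-101", date(2020, 2, 4), 35),
    ("lab-101", date(2020, 2, 18), 41),
    ("lab-202", date(2020, 2, 5), 120),
]

def monthly_startup(records, lab_id):
    """Return {(year, month): median startup seconds} for one lab only."""
    by_month = defaultdict(list)
    for lab, started, seconds in records:
        if lab == lab_id:  # keep a single lab per series; never mix labs
            by_month[(started.year, started.month)].append(seconds)
    return {month: median(values) for month, values in sorted(by_month.items())}

trend = monthly_startup(launches, "lab-101")
for (year, month), seconds in trend.items():
    print(f"{year}-{month:02d}: {seconds:.0f}s")
```

A sharp step between months, like the one in this made-up data, is the cue to check the lab notes for a configuration change before concluding that the underlying performance changed.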
Latency
Latency is measured from your user’s computer to the datacenter hosting the user’s lab. If your lab is hosted in more than one datacenter, geolocation will identify the closest physical datacenter to host your lab session.
To effectively measure latency, you need the following data:
- Geolocation (city, region, country)
- Lab start time
- Average latency
- Lab session duration
You will note that the lab is not included in this analysis. The actual lab does not matter. Any labs can be used, and you can mix labs in this analysis.
Note the following when measuring latency:
- Depending on where the user is located, it may not be possible to get an accurate measurement due to network traffic restrictions.
- For most users, latency variance occurs in the first few hops from the user’s computer, so you should focus your analysis on specific users or cities.
- As most latency variance occurs close to the user, the user’s local time of day is an important factor to consider.
- Geolocation does not take latency into consideration. Geolocation uses geographic data to identify the closest datacenter physically, which may not always be the closest datacenter electronically.
- The IP address used to calculate geolocation is the public IP of the device where the user entered the public internet. In some cases, this may not be representative of the user’s physical location and may contribute to high latency. There are known cases of users exiting private networks in different countries – or even continents – than where they physically sit.
- Average latency is a straight average of all samples collected during the lab session and can be adversely affected by anomalies.
- Latency is only valid for labs that have VMs and do not use the HTML Console connection method.
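To illustrate why a straight average can be misleading, the sketch below compares the mean and the median of one session's latency samples. The sample values are invented, and the sketch assumes you have access to individual samples, which may not be the case if only the session average is available to you.

```python
from statistics import mean, median

# Hypothetical latency samples for one lab session, in milliseconds.
# A single spike (2400 ms) stands in for a network anomaly.
samples_ms = [80, 85, 90, 82, 88, 2400]

average_ms = mean(samples_ms)    # pulled upward by the one spike
median_ms = median(samples_ms)   # stays close to the typical sample

print(f"average: {average_ms:.1f} ms, median: {median_ms:.1f} ms")
```

Here one anomalous sample out of six moves the average far above what the user experienced for most of the session, which is worth remembering when a single session looks unusually slow.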
Pro Tip: Latency values under 250ms should never produce the perception of UI lag. Above 250ms, the experience will vary, but it generally starts to deteriorate at around 500ms.
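These thresholds can be turned into a small helper for labeling sessions during analysis. This is only a sketch: the function name and the band labels are hypothetical, and the cutoffs are the approximate figures from the tip above rather than hard limits.

```python
def latency_band(latency_ms):
    """Label an average latency using the rough thresholds from the tip above."""
    if latency_ms < 250:
        return "no perceived lag expected"
    if latency_ms < 500:
        return "experience varies"
    return "experience likely deteriorating"

for value in (120, 300, 650):
    print(value, "->", latency_band(value))
```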
The example below shows the global latency for a lab sample set for the US-East datacenter displayed as a map. In this example, we are excluding all lab sessions shorter than 10 minutes and with a latency greater than 600ms.
While this example does not illustrate it, shading can be applied to each region to indicate the overall performance of the region.
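A filter like the one described for the map can be sketched in a few lines of Python. The session tuples and city names below are hypothetical, and the sketch assumes that a session is excluded if it is either shorter than 10 minutes or has an average latency above 600ms.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sessions: (city, average_latency_ms, duration_minutes).
sessions = [
    ("Paris", 110, 45),
    ("Paris", 130, 60),
    ("Paris", 900, 50),   # dropped: latency above 600 ms
    ("Sydney", 240, 30),
    ("Sydney", 260, 5),   # dropped: shorter than 10 minutes
]

def latency_by_city(rows, min_minutes=10, max_latency_ms=600):
    """Average the per-session latency for each city after filtering."""
    kept = defaultdict(list)
    for city, latency_ms, minutes in rows:
        if minutes >= min_minutes and latency_ms <= max_latency_ms:
            kept[city].append(latency_ms)
    return {city: mean(values) for city, values in kept.items()}

per_city = latency_by_city(sessions)
print(per_city)
```

The per-city values are what you would feed into a mapping or shading tool of your choice.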
Considering that VMs generally lose interactive response when latency exceeds 600ms, we can make the following assertions:
- Locations with a very high latency and low lab duration indicate that the lab experience is problematic.
- Locations with a very high latency and high lab duration indicate that the lab experience is acceptable, but latency cannot be effectively measured (or is not providing a good indication of experience) as users are still completing the lab.
- Locations that are seeing connections to multiple datacenters with varying latencies indicate that content is not fully functional in all datacenters.
- Locations with low latency and a low lab duration indicate that the lab is being abandoned early, which could indicate that issues other than latency exist.
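The latency and duration combinations above can be expressed as a simple decision function. In this sketch, the 600ms cutoff comes from the text, while the 10-minute cutoff for a "low" lab duration is a hypothetical value you would tune per lab. The multiple-datacenter case is not covered, and the final branch (low latency, long duration) is not one of the assertions above; it is included only so the function always returns a label.

```python
def interpret_location(avg_latency_ms, avg_duration_min,
                       high_latency_ms=600, short_session_min=10):
    """Map a location's average latency and lab duration to a reading."""
    high_latency = avg_latency_ms > high_latency_ms
    short_session = avg_duration_min < short_session_min  # hypothetical cutoff

    if high_latency and short_session:
        return "lab experience is problematic"
    if high_latency:
        return "experience acceptable; latency is not a good indicator here"
    if short_session:
        return "labs abandoned early; look for issues other than latency"
    # Not one of the article's assertions: nothing is flagged by these measures.
    return "no issue indicated by latency or duration"

print(interpret_location(900, 4))
print(interpret_location(900, 55))
print(interpret_location(80, 4))
```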
Examples of Analyzing Latency
The next few examples take a production dataset and perform some sample analytics.
It is important to note that these examples do not take lab session duration into consideration, as we are analyzing the performance of “launched labs” not “completed labs.” If I were to focus on completed labs, I would factor in time spent in the lab to remove any sessions that were too short to have been completed.
The example below shows latency for a single user over time.
You can see that the user sessions are clustered in the same timeframe and latency is consistent regardless of time. For this user, there are no concerns on latency, regardless of the datacenter to which they connect. This user likely took three classes, a few weeks apart.
This next example shows latency for a city over the last 18 months, by datacenter.
In this case, the city is Paris and data is shown for 18 months for the US East, US West, and EU North datacenters.
The sample is limited to all sessions with 1000ms or less average latency to allow us to see the outliers.
We’ll now take a sample of three unique IP addresses from the above set and look at how they reported latency to EU North.
From looking at this, we can interpret the vertical groups as “events” where a lot of users did labs around the same time, likely in a class. We can also see that in those events there was some variance in latency.
Overall, the location indicated by grey saw consistent latency, except when these larger events occurred, indicating perhaps the local connection was not sufficient.
Finally, let us look at our largest sample, the grey sample, and see how time of day affects latency for this location.
This data would support the conclusion that this is likely a corporate office or training center and that they generally have good overall latency.
They have times during peak office hours where the connection seems to struggle as load increases and perhaps might be unusable at times.
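A time-of-day view like this one can be approximated by grouping sessions by the hour in which they started. The sketch below uses invented timestamps and latencies for a single location and assumes the timestamps are already expressed in the user's local time. The median per hour is a choice made for this sketch to limit the effect of outliers.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median

# Hypothetical sessions for one location: (local start time, avg latency ms).
sessions = [
    (datetime(2020, 3, 2, 9, 5), 95),
    (datetime(2020, 3, 2, 9, 40), 105),
    (datetime(2020, 3, 2, 14, 10), 480),
    (datetime(2020, 3, 3, 14, 30), 520),
    (datetime(2020, 3, 3, 20, 0), 90),
]

def latency_by_hour(rows):
    """Return {hour of day: median latency in ms} for the given sessions."""
    by_hour = defaultdict(list)
    for started_local, latency_ms in rows:
        by_hour[started_local.hour].append(latency_ms)
    return {hour: median(values) for hour, values in sorted(by_hour.items())}

hourly = latency_by_hour(sessions)
for hour, latency_ms in hourly.items():
    print(f"{hour:02d}:00  {latency_ms:.0f} ms")
```

In this made-up data the mid-afternoon hour stands out, which is the kind of pattern that would point toward load on the local connection during office hours.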
The above examples demonstrate how to think about analyzing lab performance and how understanding the data elements can vastly impact the analysis approach.
The conclusions drawn here are speculation only but provide a good indication of where the next steps might be taken to identify a potential bottleneck affecting latency.