Balancing Productivity and Cost in Cloud-Based Remote Desktop (Part Two)
The `stop-if-inactive.sh` Script Revised to Withstand Temporary Network Glitches
The updated version of the `stop-if-inactive.sh` script saves me 1 to 2 hours of productivity daily. Here’s why.
Previously, the script used the Linux `shutdown <timeout>` command to stop inactive VM instances. This command prevented disrupted network connections from being restored. Network disruptions occur 2-4 times daily and take about 5 minutes each to recover from. Because every disruption also breaks mental concentration, each one resulted in about 30 minutes of productivity loss.
This publication follows up on my previous discussions about the AWS Multi-Account/Multi-Platform/Multi-User (MAPU) environment architecture and SSH protocol tunneling, focused on the cloud-based remote desktop operation at scale as a subtle balance between security, cost, and productivity.
The third article in this series introduced an automatic solution for stopping a virtual machine instance when no user activity is detected. While this solution worked on average, it had the flaw mentioned above: it caused productivity loss whenever a client application, such as VSCode Remote, attempted to reconnect a disrupted SSH session.
In this publication, I will present a modification of the original solution that eliminates this problem without a significant cost increase. While the final solution is fairly simple, understanding the original problem and the rationale behind the selected solution requires a good understanding of how Linux and the SSH protocol work. Before diving into technical details, let me recall why remote desktops are essential.
Why Remote Desktops?
Performing all programming activities using a remote desktop is critical for anyone working with multiple software technologies. With the vast array of programming languages, runtime environments, and libraries, maintaining and updating them on a local computer is nearly impossible: one quickly loses mental control over multiple configurations.
This challenge leads engineers to stick with the single environment they initially started with, making them reluctant to experiment with alternatives.
While vendors might be content with this status quo, the industry suffers. This limitation also narrows the spectrum of hardware in use. For example, if a local laptop or desktop computer is based on Intel or AMD CPUs, supporting ARM-based CPUs might not be considered, despite their price, performance, and environmental advantages.
The same logic applies to various combinations of GPU, FPGA, secure enclaves, and other potentially advantageous hardware components. While additional hardware can usually be purchased, it will occupy extra space and become outdated too quickly.
With this understanding of overall motivation, let’s define the problem more precisely.
Problem Statement
Periodically, SSH sessions were disrupted by network glitches or other causes, and VSCode attempted to reconnect. These attempts never succeeded, forcing the user to wait until the VM instance entered the `stopping` state before trying to reconnect. Each disruption took about 5-10 minutes, which seriously disrupted the user's workflow.
The root cause of this issue was that the `stop-if-inactive.sh` script invoked the Linux `shutdown $SHUTDOWN_DELAY` command when it detected a closed SSH connection. During the `$SHUTDOWN_DELAY` period, the SSH daemon refused incoming connections, so the script could not detect a new connection attempt and cancel the `shutdown` command.
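One likely mechanism behind this behavior, assuming a systemd-based distribution, is the `/run/nologin` marker file: a scheduled `shutdown` creates it five minutes before the shutdown time (or immediately, if the delay is shorter), and PAM refuses logins from non-root users while it exists. The sketch below checks for that condition. The function name `logins_blocked` is hypothetical and not part of the original script.

```shell
#!/usr/bin/env bash
# Sketch: detect whether a pending shutdown is blocking new logins.
# Assumption: a systemd-based distro where a scheduled shutdown creates /run/nologin.

# Returns 0 (true) if the nologin marker file exists.
# The path is a parameter only so that the function can be tried on a test file.
logins_blocked() {
    local nologin_file="${1:-/run/nologin}"
    [ -e "$nologin_file" ]
}

if logins_blocked; then
    echo "A pending shutdown is blocking new non-root logins."
else
    echo "No pending shutdown is blocking logins."
fi
```

If this is indeed the cause, no amount of client-side retrying helps until the shutdown is cancelled or completes, which matches the behavior described above.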
As a result, when VSCode tried to reconnect, the initial attempt always failed, requiring the user to wait for the VM to stop and reconnect manually, leading to significant productivity loss.
Solution Overview
The solution overview is presented in the diagram below.
The only difference from the previous version is that the `stop-if-inactive.sh` script now stops the instance by calling the AWS EC2 service instead of using the Linux `shutdown $SHUTDOWN_DELAY` command.
Script Logic
Here is a simplified description of the script logic.
- Read Configuration: Load configuration values from an external file.
- Monitor Activity: Regularly check for file changes and user command submissions in the terminal.
- Check Sessions: Determine if there are any active SSH or Tmux sessions.
- Disable SSH: If no user activity is detected, close all SSH sessions and temporarily block new SSH connections.
- Stop Instance: If inactivity persists, unblock new SSH connection requests, and stop the EC2 instance.
The new version of the `stop-if-inactive.sh` script is presented below:
#!/bin/bash
set -euo pipefail

# Read configuration values
CONFIG=$(cat /root/autoshutdown-configuration)

# Assuming the configuration file is in the format:
# TIMEOUT_NO_ACTIVITY=<value>    (minutes)
# POLLING_SLEEP=<value>          (minutes)
# NO_CONNECTION_RETRIES=<value>
# INSTANCE_ID=<value>
eval "$CONFIG"

USER=ec2-user
WATCH_DIR=/home/$USER

# Check for any file changes
has_any_file_changed() {
    if [[ $(find "$WATCH_DIR" -type f -mmin "-$TIMEOUT_NO_ACTIVITY" ! -path '*/vscode.lock' | wc -l) -gt 0 ]]; then
        return 0 # True - at least one file changed
    else
        logger -t autoshutdown "No file was changed during the last ${TIMEOUT_NO_ACTIVITY} minutes."
        return 1 # False - no files changed
    fi
}

# Check for user activity in terminal
was_any_command_typed() {
    local current_time=$(date +%s)
    # Pseudo-terminals are character devices: ls prints major and minor numbers
    # instead of a size, so the modification time (epoch seconds) is field 7
    local last_activity_time=$(ls -l --time-style=+%s /dev/pts | grep "$USER" | awk '{print $7}' | sort -nr | head -n1)
    if [[ -z "$last_activity_time" ]]; then
        logger -t autoshutdown "No terminal activity for the ${USER} user detected."
        return 1 # False - no user activity detected
    fi
    local time_diff=$((current_time - last_activity_time))
    if [[ $time_diff -lt $((TIMEOUT_NO_ACTIVITY * 60)) ]]; then
        return 0 # True - user was active recently
    else
        logger -t autoshutdown "No user activity during the last ${TIMEOUT_NO_ACTIVITY} minutes."
        return 1 # False - no recent user activity
    fi
}

# Check whether the user is active
is_user_active() {
    has_any_file_changed || was_any_command_typed
}

# Check whether any SSH session is active
is_ssh_active() {
    if ss -t -a | grep -q 'ESTAB.*:ssh'; then
        return 0 # True - at least one SSH session active
    else
        return 1 # False - no SSH session active
    fi
}

# Check whether any Tmux session is active
is_tmux_active() {
    # The exit status of tmux is the return value of the function
    tmux list-sessions > /dev/null 2>&1
}

# Stop instance function
stop_instance() {
    logger -t autoshutdown "Stopping instance..."
    aws ec2 stop-instances --instance-ids "$INSTANCE_ID"
}

# Disable SSH to prevent VSCode reconnection attempts
disable_ssh() {
    logger -t autoshutdown "Killing the SSH process..."
    # Terminate all processes of the user, which closes their SSH sessions.
    # pkill returns 1 when nothing matched; do not let 'set -e' abort on that.
    pkill -U "$USER" || true
    iptables -A INPUT -p tcp --dport 22 -j REJECT # refuse input connections
    sleep "${POLLING_SLEEP}m"
    iptables -D INPUT -p tcp --dport 22 -j REJECT # remove the rule
}

main() {
    # Main monitoring loop
    local retries_counter=0
    while true; do
        sleep "${POLLING_SLEEP}m"
        if is_ssh_active; then
            if ! is_user_active; then
                logger -t autoshutdown "SSH Active. No user activity during the last $TIMEOUT_NO_ACTIVITY minutes detected."
                disable_ssh
                break
            fi
            retries_counter=0 # reset retries counter
        elif is_tmux_active; then
            if ! has_any_file_changed; then
                logger -t autoshutdown "Tmux Active. No file change during the last $TIMEOUT_NO_ACTIVITY minutes detected."
                break
            fi
        # Arithmetic context: inside [[ ]], '>' would compare strings, not numbers
        elif (( ++retries_counter > NO_CONNECTION_RETRIES )); then
            logger -t autoshutdown "No active SSH or Tmux session detected after $NO_CONNECTION_RETRIES retries."
            break
        fi
    done
    stop_instance
}

main
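Because the configuration file is evaluated as shell code, any stray line in it would be executed with the privileges of the script. A more defensive variant, sketched below, accepts only the four expected keys and validates their values. The function name `load_config` and the validation rules are assumptions for illustration, not part of the original script.

```shell
#!/usr/bin/env bash
# Sketch: parse KEY=VALUE lines without eval, accepting only known keys.

load_config() {
    local config_file="$1" key value
    while IFS='=' read -r key value || [ -n "$key" ]; do
        case "$key" in
            ''|'#'*) continue ;;    # skip blank lines and comments
        esac
        case "$key" in
            TIMEOUT_NO_ACTIVITY|POLLING_SLEEP|NO_CONNECTION_RETRIES)
                # Numeric settings: reject anything that is not a plain integer
                case "$value" in
                    ''|*[!0-9]*) echo "Invalid number for $key: $value" >&2; return 1 ;;
                esac ;;
            INSTANCE_ID)
                # Assumed shape of an EC2 instance ID: starts with "i-"
                case "$value" in
                    i-*) ;;
                    *) echo "Invalid instance ID: $value" >&2; return 1 ;;
                esac ;;
            *)
                echo "Unknown key: $key" >&2; return 1 ;;
        esac
        export "$key=$value"
    done < "$config_file"
}

# Example usage:
#   load_config /root/autoshutdown-configuration || exit 1
```

The trade-off is a few more lines of code in exchange for a configuration file that can only ever set values, never run commands.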
The Solution Logic Under the Hood
While the `stop-if-inactive.sh` script does not look very complex, the reasoning behind its organization is non-trivial.
To understand this logic, we need a deeper understanding of how the system works end-to-end.
In the case of VSCode Remote, five components potentially affect the SSH session stability:
- Local `~/.ssh/config` configuration file
- Remote `/etc/ssh/sshd_config` configuration file on the VM instance
- Local VSCode User Settings file
- Local `awsssh.sh` script
- Remote `stop-if-inactive.sh` script on the VM instance
Let’s briefly review each one.
Local `~/.ssh/config` Configuration File
This file may contain two configuration parameters that potentially affect the SSH session stability:
- ServerAliveInterval: Specifies the interval for sending `keepalive` messages to the server to detect whether the server has crashed or the network has gone down.
- ServerAliveCountMax: Sets the number of `keepalive` messages that may be sent without receiving any messages back from the server. When this threshold is reached, the client terminates the session.
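Together, the two parameters bound how long the client tolerates an unresponsive server: roughly `ServerAliveInterval × ServerAliveCountMax` seconds. The sketch below computes this window from the output of `ssh -G <host>`, which prints the effective client configuration with lowercase keys. The function name `client_dead_peer_window` and the host alias in the usage comment are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: estimate how long the SSH client tolerates an unresponsive server.

# Reads `ssh -G <host>` output on stdin and prints the window in seconds.
# An interval of 0 means keepalive messages are disabled, so 0 is printed.
client_dead_peer_window() {
    awk '
        $1 == "serveraliveinterval" { interval = $2 }
        $1 == "serveralivecountmax" { count = $2 }
        END { print interval * count }
    '
}

# Example usage ("my-dev-vm" is a hypothetical host alias from ~/.ssh/config):
#   ssh -G my-dev-vm | client_dead_peer_window
```

For example, an interval of 15 seconds with a count of 3 means the client gives up after about 45 seconds of silence.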
Remote `/etc/ssh/sshd_config` Configuration File
This file may contain two configuration parameters conceptually parallel to those from the local `~/.ssh/config` file:
- ClientAliveInterval: Sets a timeout interval in seconds after which, if no data has been received from the client, `sshd` sends a message to request a response from the client.
- ClientAliveCountMax: Sets the number of messages that may be sent without receiving any messages from the client. If this threshold is reached while client-alive messages are being sent, `sshd` will disconnect the client, terminating the session.
Local VSCode User Settings File
The safest way to modify this file is via the VSCode Settings menu. While the `Remote.SSH` section contains multiple configuration parameters, we will focus on only two:
- Connection Timeout: Specifies the timeout in seconds used for the SSH command that connects to the remote.
- Max Reconnection Attempts: The maximum number of times to attempt reconnection. Use 0 to disallow reconnection after the first attempt, and `null` to use a maximum of 8.
To make the system work properly, setting the following VSCode preferences is critical: Connection Timeout set to 120 and Max Reconnection Attempts set to 0.
Local `awsssh.sh` Script
This script was described in the previous publication. It is responsible for starting the VM instance if it is not running and initiating the SSH session with this VM instance over the AWS EIC Endpoint. If the VM instance is in the `stopping` state, the script waits for it to stop completely and then restarts it.
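The waiting logic described above can be sketched with standard AWS CLI commands. This is an illustration only, not the `awsssh.sh` script from the previous publication. The function name `ensure_instance_running` is hypothetical, and the sketch assumes the AWS CLI is configured with permissions to describe and start the instance.

```shell
#!/usr/bin/env bash
# Sketch: make sure an EC2 instance is running before opening an SSH session.

ensure_instance_running() {
    local instance_id="$1" state
    state=$(aws ec2 describe-instances --instance-ids "$instance_id" \
        --query 'Reservations[0].Instances[0].State.Name' --output text) || return 1

    case "$state" in
        running)
            return 0 ;;
        pending)
            # Already starting: just wait until it is up
            aws ec2 wait instance-running --instance-ids "$instance_id"
            return ;;
        stopping)
            # An instance cannot be started while it is still stopping
            aws ec2 wait instance-stopped --instance-ids "$instance_id" || return 1 ;;
        stopped)
            ;;
        *)
            echo "Unexpected instance state: $state" >&2
            return 1 ;;
    esac

    aws ec2 start-instances --instance-ids "$instance_id" > /dev/null || return 1
    aws ec2 wait instance-running --instance-ids "$instance_id"
}

# Example usage (the instance ID is a placeholder):
#   ensure_instance_running i-0123456789abcdef0 || exit 1
```

The `stopping` branch is the one that makes the connection timeout matter: the client has to wait for a full stop followed by a full start before the SSH session can be opened.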
Remote `stop-if-inactive.sh` Script
The script is responsible for stopping the VM instance if no SSH or Tmux session is detected or when there is no visible user activity for a prolonged time (e.g. 30 minutes).
The original version of this script was described in the previous article and a brief description of the new version was presented above. Here, we will dive one inch deeper.
Having so many sources of potential problems can be overwhelming. If set incorrectly, each element can impact the SSH session stability. Moreover, different parameters should be configured in concert and support each other. While this script culminates the end-to-end solution, it relies on proper definitions made elsewhere.
Contrary to popular belief, the `keepalive` message settings in the `~/.ssh/config` and `/etc/ssh/sshd_config` configuration files do not define an effective timeout for user inactivity; they only come into play when one of the sides or the network crashes.
Sending `keepalive` messages, however, is an effective measure to prevent a premature timeout initiated by the AWS websockets client used for the SSH session tunneling over EIC.
While it could be configured on either side, I chose the server side to rely on the client-side configuration as little as possible. This configuration is provided via the `/etc/ssh/sshd_config.d/keepalive.conf` file, which is automatically included by `/etc/ssh/sshd_config`.
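A drop-in file of this kind could look like the sketch below. The interval and count values are assumptions chosen for illustration, since the article does not state them; the interval only needs to be shorter than the idle timeout of the tunnel. The function name `write_keepalive_conf` is hypothetical, and the target directory is a parameter so the snippet can be tried without touching the real configuration.

```shell
#!/usr/bin/env bash
# Sketch: write a server-side keepalive drop-in for sshd.
# The values 60 and 3 are hypothetical, not taken from the article.

write_keepalive_conf() {
    local conf_dir="${1:-/etc/ssh/sshd_config.d}"
    cat > "$conf_dir/keepalive.conf" <<'EOF'
# Ask the client for a response after 60 seconds without incoming data
ClientAliveInterval 60
# Disconnect after 3 consecutive unanswered requests
ClientAliveCountMax 3
EOF
}

# Example usage (as root; the service name may differ between distributions):
#   write_keepalive_conf && sshd -t && systemctl reload sshd
```

Running `sshd -t` before reloading checks the configuration for syntax errors, which avoids locking yourself out of a remote machine with a broken file.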
The `awsssh.sh` script needs enough time to wait for a VM instance to stop, restart it, and initiate SSH tunneling over AWS EIC. In the worst case, this might take more than one minute. Therefore, setting the SSH connection timeout to 120 seconds (2 minutes) provides enough safety margin. While this parameter can also be specified in the local `~/.ssh/config` configuration file, VSCode overrides it via a command-line argument, so VSCode User Settings is the right place to specify it.
The second VSCode User Settings parameter, “Max Reconnection Attempts”, set to zero, reflects more subtle logic. It tells VSCode: “If you detect a network problem, try to reconnect automatically, but only once”. In the case of an accidental network glitch, that would be enough. If the VM instance is going to be stopped due to user inactivity, the `stop-if-inactive.sh` script will block this attempt and thus prevent the VM instance from being automatically restarted by the `awsssh.sh` script.
To prevent stopping VM instances prematurely in the case of a temporary network glitch, the `stop-if-inactive.sh` script needs to check the SSH session availability more than once (normally, two attempts will be sufficient).
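The retry logic can be isolated into a small sketch. The function name `should_stop` and the sample limit are hypothetical. Note that the comparison has to be numeric: inside `[[ ]]`, the `>` operator compares strings lexicographically (so `10 > 2` is false), whereas `-gt` compares integers.

```shell
#!/usr/bin/env bash
# Sketch: tolerate a limited number of consecutive polls without any session.

NO_CONNECTION_RETRIES=2   # hypothetical limit
retries_counter=0

# Call once per polling cycle with "yes" if an SSH session was seen, "no" otherwise.
# Returns 0 (true) when the instance should be stopped.
should_stop() {
    if [ "$1" = "yes" ]; then
        retries_counter=0    # a session is back: forget earlier misses
        return 1
    fi
    retries_counter=$((retries_counter + 1))
    [ "$retries_counter" -gt "$NO_CONNECTION_RETRIES" ]
}
```

With a limit of 2, a glitch spanning one or two polling cycles is tolerated, and only the third consecutive miss triggers the stop.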
Overall, the `stop-if-inactive.sh` script logic ensures a smooth user experience even during temporary network glitches, without extra charges for VM instances left running when the user is no longer active but forgot to close the client.
Acknowledgments
While preparing this publication, I used several key tools. The article draft was prepared using the free Notion subscription.
I used the free version of Grammarly for grammar review, eliminating most basic spelling and grammar mistakes.
The stylistic finesse and coherence of the writing followed suggestions from the paid version of ChatGPT 4.0o, which was also instrumental in developing the new version of the `stop-if-inactive.sh` script. Though the process was not always smooth, including occasional interruptions, the final result was much better than anything I could develop alone.
With all these advanced techniques employed, it’s important to emphasize that concepts, solutions, and final decisions presented in this article are entirely my own, and I bear full responsibility for them.