Azure Linux Agent – Repair

Recently, I was working with a customer who makes extensive use of Linux workloads in Azure deploying many heavily customized virtual machines (VM) into Azure.  After some reboots, installing packages, and performing some system configuration tasks the customer ran into an issue where the Azure Linux agent was unable to communicate with the Azure Fabric to report the health and availability of the operating system.  While the cause of what interactions caused this, the customer needed to redeploy the agent onto this ‘broken’ VM that would not boot properly in Azure.  The customer was able to revive this machine by mounting the image onto another VM running the same Linux distribution and reinstall the agent to which this guide walks through.

Did you know Azure supports an exhaustive list of Open Source technologies and Operating Systems?!

 

Quick Background:

The VM agents in Azure provide health information to the Azure fabric in addition to providing the ability to surface diagnostic information such as capturing the serial/console stream and surfacing it to the portal for troubleshooting.  This is provisioned automatically on newly created VMs in Azure and has a number of additional configuration options such as providing the ability to control whether to enable the swap/paging file on the temporary disk.  Additionally, this extension is used to deploy additional extensions to the VM such as the custom script extension to invoke scripts on machines booting in Azure.

Additional details on the Linux Agent is available here and, if you’re interested to see exactly how it works, the source code for the agent is available on GitHub.

This agent can be installed/upgraded with standard RPM or DEB Linux package management utilities such as YUM.  They can be also be manually deployed as outlined in the details above using a Python install script.

In almost all cases, the agent can be updated using these same package management utilities and this agent is like the old Ronco commercials – ‘set it and forget it’.

 

Repair Instructions:

Repairing a VM with an agent in a corrupt state looks painful; but is not difficult….here’s how to get started:

Note: This is assuming you’re using unmanaged disks in Azure — see addendum below for some additional steps required to obtain access to the underlying VM’s virtual hard disk.  If you’re unsure how to validate if your VM is using unmanaged disks, click the screenshot below:

Managed Disk Detection in the Portal
Managed Disks

Steps:

  1. Shut-down the VM that is having issues reporting health through the agent to the portal
  2. Provision a new VM that is running the same version/distro of Linux – ideally in the same region as the source VHD
  3. Find the storage account associated with the impacted VM and make note of the storage account name and the container the VHD resides in (applicable to unmanaged disks only) as shown in the image for step 3 below
  4. Download a utility such as the freely available Azure Storage Explorer (http://storageexplorer.com) and _copy_ the impacted VHD to the storage account the new VM resides in so you can attach this drive as a data drive on the new VM
  5. Attach the copied disk to the new VM following the steps 4 and 5 images below
UnmanagedVM
Locate the Impacted VM’s Storage Account (Step 3)

 

AddDataDisk1
Add the Impacted Disk as a Data Disk to the New VM (Step 4)

 

VMDataDisk2
Add the Impacted VM’s Boot Disk as a Data Disk to the new VM (Step 4/5)

 

5.  Boot the new VM and change to a root user (using “su –” if the root password is known or “sudo su –” if this image was created fro an Azure VM gallery image)

6.  Create a mount point for the new data disk such as: mkdir /repair

7.  Mount the data disk to this folder: mount /dev/sdc1 /repair

8.  Change to the mount point: cd /repair

9.  Run the following commands to create a bind mount to pull in specific folders to reinstall the agent and use chroot to in essence   allow commands that are subsequently to occur on the newly attached data disk getting you console access to this disk

  • mount -t proc proc proc
  • mount -t sysfs sys sys/
  • mount -o bind /dev dev/
  • mount -o bind /dev/pts dev/pts/
  • chroot /repair

10.  Re-install the wagent using package manager tools such as yum: yum install WALinuxAgent  or apt-get install WALinuxAgent

11.  Check /etc/resolve.conf and ensure that the Azure DNS server is properly defined if necessary to allow it to communicate once booted

Resolve.conf should have an entry for a nameserver as follows:

  • nameserver 168.63.129.16

12. Run the following commands to unmounts the repaired disk:

  • cd /repair
  • umount /repair/proc/
  • umount /repair/sys/
  • umount /repair/dev/pts
  • umount /repair/dev/
  • umount /repair

13.  Shutdown the temporary VM and remove the datadisk from the portal — now we need to re-associate this disk with the broken VM — we can easily do this using the CloudShell built into the portal to open a new PowerShell prompt:

 

 

  • $vm=Get-AzureRMVM -ResourceGroupName RGOfBadVM -Name NameOfBadVM
  • $vm.StorageProfile.OsDisk.Vhd.Uri = ‘https://storageaccount.blob.core.windows.net/vhds/CopiedAndRepairedxxxxxxxxxx.vhd’
  • Update-AzureRmVM -ResourceGroupName RGOfBadVM -VM $vm

 

Hopefully that will get you back up and running allowing the VM to boot up and properly report it’s health!

About the Author

Ryan Berry

Hello! I am a technology professional with over 20 years of successfully selling, designing, and implementing IT solutions of all sizes. I have a sincere passion for technology both inside and outside of work involving myself with community events such as .NET user groups, FIRST robotics and in helping college bound high schoolers prepare for success through mock technology interviews. It’s easy to be successful in a role that you love and coming to work for me means I get to play with new and innovative technologies to bring bold new ideas to customers solving their business challenges. In my current role, I take the complex world of the cloud and Microsoft Azure, a portfolio of capabilities consisting of over 100 products, and help customers in properly selecting capabilities, tooling, and technical frameworks to product results. My vast technical background spanning technologies such as embedded development, Windows debugging and C# to open-source technologies such as PHP, Python, MySQL and Linux aids allows me to quickly dive into a problem and identify solutions producing quick results using the power of the cloud!

%d bloggers like this: