Friday, August 20, 2004

The Case of the Disappearing DLL

Although this is not quite the Sherlock Holmes and Dr. Watson type of mystery, this bug was an interesting one to troubleshoot. After reading the bug report, I suspected what the problem was, but could not reproduce it on purpose in the lab without some hassle.

The bug report stated that some of the DLLs required by our product were deleting themselves. After installation, everything was fine. Some weeks later, a DLL or two mysteriously vanished.

The culprit? During an uninstall of the old version of the software, a DLL was in use, so its removal was scheduled for the next reboot. After the uninstall ran, the user promptly ignored the reboot request and installed the new version of the product. The DLL was no longer in use at that point, so the installation succeeded without prompting for a reboot. The next time the user rebooted, the pending removals left over from the old uninstall ran against the freshly installed files - blammo - bye, bye DLLs!

I'm not going to give away my solution for this one (other than to say that it involved fixing the uninstall process going forward), but one possible way to prevent this is to disallow installations if there are pending file removals or replacements. You can check the registry value HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\PendingFileRenameOperations to see whether any file operations are scheduled for the next reboot. There is a tool that dumps this information and another one that captures that plus some other post-reboot activities. One of these tools should be part of every setup developer's toolbox.
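As a rough illustration of that check, here is a minimal Python sketch. It assumes the value is a REG_MULTI_SZ holding alternating source and destination strings, where an empty destination means "delete", paths carry a `\??\` prefix, and a leading `!` on the destination marks a replace. The function names are made up for this example, and the registry read only does anything on Windows.

```python
import sys

# Registry location named in the post; the value is assumed to be a REG_MULTI_SZ.
KEY_PATH = r"SYSTEM\CurrentControlSet\Control\Session Manager"
VALUE_NAME = "PendingFileRenameOperations"
NT_PREFIX = "\\??\\"


def _clean(path):
    """Drop the leading '!' flag and the NT path prefix, if present."""
    if path.startswith("!"):
        path = path[1:]
    if path.startswith(NT_PREFIX):
        path = path[len(NT_PREFIX):]
    return path


def parse_pending_operations(entries):
    """Pair up the flat string list into (source, destination) tuples.

    A destination of None means the source is scheduled for deletion.
    """
    operations = []
    for i in range(0, len(entries), 2):
        source = _clean(entries[i])
        destination = _clean(entries[i + 1]) if i + 1 < len(entries) else ""
        if source:
            operations.append((source, destination or None))
    return operations


def read_pending_operations():
    """Return the pending operations, or an empty list if there are none."""
    if sys.platform != "win32":
        return []  # The registry only exists on Windows.
    import winreg

    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH) as key:
            value, _value_type = winreg.QueryValueEx(key, VALUE_NAME)
    except FileNotFoundError:
        return []  # Value is absent: nothing is scheduled.
    return parse_pending_operations(list(value))
```

An installer bootstrapper could call `read_pending_operations()` and refuse to continue, or at least warn, when the list is not empty.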

1 comment:

Steven Bone said...

Since I handwaved on the solution before, and it is pertinent now (I need to address the comment in the pet-peeves post), here is how I handled it.

Note that I did not use the suggested solution (and using my suggestion violates a pet peeve of mine and of others).

The uninstall bug was related to a COM+ application that was ultimately called by an IIS web application. When removing the product, IIS was stopped, which prevented any further use of the COM+ application. However, COM+ components remain active (in use) for a configurable timeout even when they are no longer doing anything productive.

When the uninstall ran, the components in use at the time were scheduled for deletion at reboot. The install was MSI-based and written by my predecessor. My first job was to rewrite the MSI, as the old one broke so many MSI rules that it was non-patchable and non-upgradeable. I added support for removing the old MSI using the Upgrade table, but most people installing the new version uninstalled the old version manually beforehand, as the old MSI had required this. Like I said, it was really bad.

In the new MSI, I realized that in-use files were a real possibility on uninstall, so after StopServices, a custom action runs that calls Shutdown() on the COM+ application. Since nothing new can start the COM+ application at that point, the files can safely be removed. That addressed the "going forward" fix.
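For readers who want to see what such a shutdown call can look like, here is a small Python sketch. It is not the custom action from the installer described above. It assumes the COM+ administration catalog (`COMAdmin.COMAdminCatalog`) and its `ShutdownApplication` method, reached through the third-party pywin32 package on Windows, and the application name is hypothetical.

```python
def shut_down_com_plus_application(app_name, catalog=None):
    """Ask COM+ to shut down the named application so its DLLs are released.

    `catalog` can be injected for testing; by default the real COM+
    administration catalog is created, which needs Windows and pywin32.
    """
    if catalog is None:
        import win32com.client  # third-party: pywin32

        catalog = win32com.client.Dispatch("COMAdmin.COMAdminCatalog")
    # ShutdownApplication accepts the application's name or its ID.
    catalog.ShutdownApplication(app_name)


# Hypothetical usage on a Windows machine with pywin32 installed:
# shut_down_com_plus_application("MyProduct.Application")
```

The ordering is what matters here: the shutdown only sticks if whatever activates the application (IIS in this case) has already been stopped.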

To fix the old problem, I added a custom action that runs after the RemoveExistingProducts action. It checks whether the old version of the product was removed and, if so, checks the PendingFileRenameOperations value. If there is anything in it, I set a property that is used to trigger a ForceReboot action.
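The decision that custom action makes can be sketched in a few lines of Python. This only illustrates the logic, not the original custom action code. The function name is invented, the entries are assumed to alternate source and destination strings, and the optional `install_dir` filter is an extra refinement that is not part of the fix described above.

```python
NT_PREFIX = "\\??\\"


def should_force_reboot(old_version_removed, pending_entries, install_dir=None):
    """Decide whether to set the property that conditions ForceReboot.

    pending_entries is the flat list of strings read from
    PendingFileRenameOperations: source, destination, source, destination...
    """
    if not old_version_removed:
        return False
    # Every other entry, starting with the first, is a source path.
    sources = [entry for entry in pending_entries[::2] if entry]
    if install_dir is None:
        # Behaviour described above: anything pending forces a reboot.
        return bool(sources)
    # Optional refinement (an assumption): only react to files under install_dir.
    prefix = install_dir.rstrip("\\").lower() + "\\"
    for source in sources:
        path = source[len(NT_PREFIX):] if source.startswith(NT_PREFIX) else source
        if path.lower().startswith(prefix):
            return True
    return False
```

In an MSI, the result would be written to a property, and the ForceReboot action would be conditioned on that property.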

Since I was using Wise for Windows Installer 4.x, the incorrectly conditioned default dialog sequences (yep, a Wise bug, probably still there) created a bunch of work for me to get the installation to resume correctly after the reboot. If I remember correctly, I also needed to adjust the placement of one or two of the sequences.

Although the installation took into consideration all of these potential issues, it really came down to a documentation and training issue that can be summed up in a few words: "If it tells you to reboot, you should reboot before doing anything else."