Photoshop pushes a machine harder than most programs. It’s part of squeezing out as much performance from a machine as possible. Sometimes that means marginal hardware can show problems that look an awful lot like software bugs but aren’t. Or, even worse, cause a machine to mysteriously reboot (it’s happened). In other words, we take performance seriously, but that can sometimes tickle machine weaknesses.
So, as we’ve heard the occasional reports of mysterious pauses or bad slowdowns with CS2, we have really been paying attention. Trying to find a reproducible case can be very hard, though – and it’s hard to fix a performance issue if you can’t reproduce it (or at least have enough information to simulate it). And sometimes, through the course of dealing with a helpful user, it can become clear that there is something very weird or particular about the machine they have, and boy, at that point, do you really want to get your hands on that machine and take a good, deep look.
Well, through a series of events, we got lucky enough to find a user in the local area with a slowdown issue. So Seetha and I packed up his laptop in the car and headed out yesterday afternoon. When we got there, our user was so happy to have us.
We started with a demonstration of the problem. It took a while to happen. But when things started going “funny”, it was pretty clear that something was not normal. Ah, good, we thought. Not intermittent or one of those funny bugs that can simply disappear when two senior engineers walk in the room (you’d be surprised at how often *that* happens), but a repeatable problem that just happened to be tied to a specific machine.
I had us do the thing I always do when starting to try and figure out a bug – we pulled up Performance Monitor (on a Windows box – I pull up Activity Monitor on Macs, or install X Resource Monitor) and repeated the problem. Our user once again showed us the problem as Performance Monitor was watching CPU, disk activity, and free RAM. I have a set of colors I tend to use for these things, so that I can glance at Performance Monitor and know quickly what is going on. So we tried to reproduce the problem quickly. And the lines in Performance Monitor looked OK. So we sat back to think about what things might be going on – and after a short bit, I noticed that the blue line I use for free RAM was making a steady march downwards. With Photoshop idle. Ugh. And we watched the memory go to near-zero, at which point disk activity started going up, and we started poking around, and it now became clear why performance on this machine was getting so bad so quickly – a memory leak. A bad one. One that we had never seen before. So now the hunt was on for why…
We started up Task Manager and went to the Performance tab. Yup, available RAM was sitting at 3MB. On Windows XP, if your machine ever has less than about 15MB of free RAM, you know you’re going to be in for trouble performance-wise. Ugh. So we quit Photoshop – which took a bit, since there was so much of Photoshop paged out due to the leak. I switched Task Manager to the Processes tab, added the Virtual Memory Size column, and started Photoshop, and didn’t touch it. Yup, there, plain as day. Munch, munch, munch went memory.
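The "start it, don't touch it, watch the number" check is easy to automate once the readings are in hand. Here is a minimal Python sketch of that idea, not anything that was run that day: it takes memory readings sampled at a fixed interval while the app sits idle and flags a steady climb. The function name, the thresholds, and the sample numbers are all made up for illustration.

```python
def looks_like_idle_leak(samples_mb, min_total_growth_mb=50.0,
                         min_rising_fraction=0.9):
    """Return True if memory readings taken while the app is idle keep climbing.

    samples_mb: process memory sizes in MB, sampled at a fixed interval.
    The thresholds are illustrative assumptions, not tuned values.
    """
    if len(samples_mb) < 3:
        return False  # too few readings to call it a trend
    # Difference between each reading and the one before it.
    steps = [b - a for a, b in zip(samples_mb, samples_mb[1:])]
    rising = sum(1 for s in steps if s > 0)
    total_growth = samples_mb[-1] - samples_mb[0]
    return (rising / len(steps) >= min_rising_fraction
            and total_growth >= min_total_growth_mb)


# Hypothetical readings taken with the app untouched.
leaking = [210, 228, 247, 265, 290, 312, 335]
healthy = [210, 214, 211, 213, 210, 212, 211]
print(looks_like_idle_leak(leaking))
print(looks_like_idle_leak(healthy))
```

Requiring both "almost every step goes up" and "the total growth is large" keeps normal idle jitter from tripping the check.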
We did some first-step debugging – restarting without plug-ins, without scripting. No good. Still leaking memory. With all the easy answers gone, we hooked Seetha’s laptop up to our user’s network, installed the remote debugging stubs on the user’s machine, and started up a debug build. When the debug build started up, with all of its extra asserts and checks and paranoia built in, the smoking gun showed up right away. Aha. Bad font. And the light bulb went off in my head… The font preview menu. You see, I implemented the font preview menu to take absolutely zero extra time at startup. It generates all the previews on the fly, in the background, at idle time. Works really well, especially as most machines can read in a font and generate a preview just about as fast as they could read in a cache. We can usually generate all your previews in the first 20 seconds or so while the app is up, while some other programs that have WYSIWYG menus take more than twice as long to read in their cache.
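For readers who have not seen the pattern, idle-time generation can be pictured roughly like this. It is a toy Python sketch of the general technique only and says nothing about how Photoshop is actually written: the class name, the `on_idle` hook, the font list, and the fake renderer are all assumptions for illustration.

```python
from collections import deque


class IdlePreviewBuilder:
    """Builds font previews one at a time during idle ticks (illustrative only)."""

    def __init__(self, font_names, render):
        self.pending = deque(font_names)  # fonts still waiting for a preview
        self.previews = {}                # font name -> rendered preview
        self.render = render              # callable: font name -> preview

    def on_idle(self):
        """Do one small unit of work; return True if more work remains."""
        if self.pending:
            name = self.pending.popleft()
            self.previews[name] = self.render(name)
        return bool(self.pending)


def fake_render(name):
    # Stand-in for a real rasterizer: just returns a label.
    return f"<preview of {name}>"


# Constructing the builder does no rendering, so startup pays nothing.
builder = IdlePreviewBuilder(["Minion", "Myriad", "Courier"], fake_render)

# Each idle tick handles exactly one font, so the app stays responsive.
ticks = 1
while builder.on_idle():
    ticks += 1
print(ticks, sorted(builder.previews))
```

The point of doing one font per tick is that user input never waits on more than a single preview.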
Anyway, it matched the idle-time symptom of the memory leak we saw. And it struck me that if – **if** – a font failed in a certain way that caused our font rasterizer component to let the failure leak out (which isn’t supposed to happen) we could not only drop our memory on the floor but not remember that we tried that font. Bad news. So, we tried the workaround of turning off the preview menu, and of course, no memory leak. Now, both Seetha and I are eager to figure out which font it is, so we can bring the sucker back with us and dissect what went wrong. So, we bring up the debug build again, catch the problems, read the indicators, and get the font name. Cool. We turn on the preview menus and remove those fonts from the system, and start Photoshop and watch. Shoot. The memory leak is still happening. Seetha posits that it could be more than one font. So, we start a full-on search for the bad font (yup, the very same binary search-for-the-bad-font that we ask users to do when we suspect the problem is related to a bad font). Sure enough, with nothing but the minimal font set installed, Photoshop starts up fine – no leak. So we start moving fonts back, chunks at a time. No leak. We move more. No leak. We get back down to the one font we thought was bad, and move it back. No leak. Huh?
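The suspected failure mode (a failure escapes, the memory is dropped, and the font is never marked as tried, so the next idle tick tries it again) can be shown with a toy model. This Python sketch is only a guess at the shape of such a bug, not the real code: the font names, the `allocated` counter standing in for scratch memory, and every function here are hypothetical.

```python
class RasterizerError(Exception):
    """Stand-in for a failure escaping the font rasterizer."""


FONTS = ["GoodFont", "BadFont", "OtherFont"]  # hypothetical font list


def rasterize(name):
    if name == "BadFont":
        raise RasterizerError(name)
    return f"<preview of {name}>"


def buggy_idle_step(fonts, attempted, previews, stats):
    """One idle tick. Bug: a failure skips both cleanup and bookkeeping."""
    for name in fonts:
        if name in attempted:
            continue
        stats["allocated"] += 1           # grab scratch memory
        previews[name] = rasterize(name)  # may raise
        stats["allocated"] -= 1           # never reached on failure
        attempted.add(name)               # never reached on failure
        return


def fixed_idle_step(fonts, attempted, previews, stats):
    """One idle tick that records the attempt first and always cleans up."""
    for name in fonts:
        if name in attempted:
            continue
        attempted.add(name)               # remember the try, even if it fails
        stats["allocated"] += 1
        try:
            previews[name] = rasterize(name)
        except RasterizerError:
            pass                          # no preview for this font; move on
        finally:
            stats["allocated"] -= 1       # release scratch memory either way
        return


def run(step, ticks=5):
    attempted, previews, stats = set(), {}, {"allocated": 0}
    for _ in range(ticks):
        try:
            step(FONTS, attempted, previews, stats)
        except RasterizerError:
            pass  # the idle loop carries on as if nothing happened
    return stats["allocated"], sorted(attempted)


print(run(buggy_idle_step))  # leaked count grows on every tick after BadFont
print(run(fixed_idle_step))  # nothing leaked, every font attempted once
```

In the buggy version the leak grows with every idle tick for as long as the app sits there, which is the kind of steady march a free-RAM graph would show.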
At this point, we’ve actually completely solved the user’s problem with no workaround (all their fonts are back in place and the program is performing great). And we have enough information to simulate the bug back in the lab (which I’ve now done). So we pack up and head out, thanking our user for their time (I hope we conveyed how immensely we appreciated it – we really did). Our best guess is that some pre-caching or performance mechanism in the operating system somehow got a bad shadow copy of the font and that bad shadow copy was causing the issue, and that re-installing all the fonts (which is what we essentially ended up doing) forced the bad shadow copy to be refreshed from the good copy. But that’s a guess.
Now, while I’d really like to know what actually went wrong, we ended the day with a happy user, information that might help some other users, and a cornered bug to be fixed for the next release. And both Seetha and I reflected on our way back to Adobe about all the bad software we deal with and programmers who wouldn’t even think of going to a user’s site to debug an issue (and certainly not 2 senior engineers like us).
A good day.
-Scott