31 January 2024

Apple Vision Pro, eye tracking, and the cursors of the future

I am fascinated by how the Apple Vision Pro identifies where the user is looking, treating that locus of attention much like the cursor used on the Mac and other desktop computers; one “clicks” with hand gestures. This is a cunning way to make desktop software usable on this very different platform, and discerning this by watching eye movements is an astonishing technological feat. It is not just a matter of precisely detecting where the eye is pointed, which would be hard enough; our eyes constantly jitter around in saccades, so the Vision Pro has to deduce from this complex unconscious movement where the user has their attention in their subjective experience.

Modifying desktop computer interfaces

It is fun to think about exotic alternatives to the conventional mouse/trackpad & cursor combination. The big gestural interfaces seen in science fiction movies mostly turn out to be a bad idea — Tom Cruise was exhausted after fifteen minutes of just pretending to use them in Minority Report — but I believe that there are opportunities for innovation. Clayton Miller’s 10/gui considers ways we might take advantage of a huge multi-touch surface instead of a little trackpad. Bruce Tognazzini’s Starfire from 1994 is still ahead of available technology, bursting with both good & bad ideas for combining direct manipulation with touchscreens & styluses together with indirect manipulation using a mouse or trackpad. Devices like the iPad have begun to unlock the promise of distinguishing fingers from styluses to create more graceful, complex interaction idioms by combining the two; a few specialists use stylus input tools like Wacom tablets at the desktop, and I feel an itch that more people might benefit from integration of stylus input into their desktop experience.

So might we just replace the mouse/trackpad & cursor with eye tracking? No. I cannot imagine that it could ever provide the fine precision of the mouse/trackpad (or a stylus). But I think eye tracking could combine well with those input tools to make some things more graceful. It would not require fine precision, just the ability to register which window the user is currently looking at.

Discussion with Raghav Agrawal underlines that I am proposing something I hope would deliver a fundamentally different experience from the Apple Vision Pro. A user of the Vision Pro feels that they control the system with their gaze. A user of the desktop should still feel that they control the system with the mouse, with the system aware of their gaze and using that awareness to ensure that it Just Does The Right Thing.

Solving some multi-monitor challenges

I think this will prove especially valuable if one has multiple big screens, which I expect more and more people to do as screens get better and cheaper. I am a lunatic who uses a big wide monitor, a big tall monitor, my laptop’s 16" display, and a little teleprompter display at my desk. I love being able to look around at several open windows, and expect that many people will discover how good this is for work.

But using existing mouse-cursor-window interfaces with multiple big screens does come with disadvantages. Dragging-and-dropping across expansive screens gets clumsy. One can lose track of the cursor in all that space; even wiggling the cursor does not always make it easy to find. With a lot of windows open, one can easily lose track of which one is currently selected.

A radical proposal for multiple cursors

Rather than drive the cursor to appear at the point of one’s visual focus — one does not want the cursor racing back and forth across the screen every time one glances at information on another screen — I think it would work to have a cursor in each window, with mouse/trackpad actions affecting only the window one is currently looking at. When one looks away from a window, its cursor stays where one left it.

This puts a cursor within view wherever one looks, easy to find. Maybe on a big window, if one has not looked at it in a while, the cursor returns to the center or gets a little momentary flash of emphasis when one looks back at that window.

The Mac puts the Menu Bar at the top of the screen because the edge prevents overshooting, making it easier to mouse decisively to an element there. Keeping the cursor confined to the boundaries of each window makes all four window edges this kind of convenient interface territory.

Integrating eye tracking also eliminates the need to keep track of a selected window. In existing systems, an action like using the mouse scroll wheel can produce an awkward surprise when it does not affect the document in view, instead disrupting the content of a window one has forgotten remains selected. With eye tracking, user actions can always just affect the thing one has in view, eliminating that problem. (I will get to one important exception to this pattern in a moment.)
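
To make this concrete, here is a minimal sketch in Swift of how such routing might work. Everything in it is hypothetical; TrackedWindow and GazeInputRouter are names invented for illustration, not any real windowing API. Each window keeps its own cursor, confined to its bounds, and pointer and scroll input goes to whichever window the eye tracker reports one is looking at.

    struct Point { var x: Double; var y: Double }
    struct Rect  { var x: Double; var y: Double; var width: Double; var height: Double }

    final class TrackedWindow {
        let id: Int
        var frame: Rect
        var cursor: Point            // this window's own cursor; it stays put when one looks away

        init(id: Int, frame: Rect) {
            self.id = id
            self.frame = frame
            // Start each window's cursor at its centre, a reasonable default.
            self.cursor = Point(x: frame.width / 2, y: frame.height / 2)
        }

        // Confine the cursor to the window's bounds, so every edge becomes the kind of
        // overshoot-proof target the Mac menu bar enjoys at the top of the screen.
        func moveCursor(dx: Double, dy: Double) {
            cursor.x = min(max(cursor.x + dx, 0), frame.width)
            cursor.y = min(max(cursor.y + dy, 0), frame.height)
        }
    }

    final class GazeInputRouter {
        var windows: [TrackedWindow] = []
        var gazedWindowID: Int?      // assumed to be updated continuously by the eye tracker

        var gazedWindow: TrackedWindow? {
            windows.first { $0.id == gazedWindowID }
        }

        // Mouse/trackpad deltas move only the cursor of the window one is looking at;
        // every other window's cursor stays exactly where it was left.
        func pointerMoved(dx: Double, dy: Double) {
            gazedWindow?.moveCursor(dx: dx, dy: dy)
        }

        // Scrolling always affects the document in view, so there is no forgotten
        // "selected" window to disrupt by mistake.
        func scrolled(by dy: Double) {
            guard let window = gazedWindow else { return }
            print("scroll window \(window.id) by \(dy)")
        }
    }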

Acting across multiple windows

Confining input effects within individual windows seems like it would break a lot of interaction gestures that require moving across multiple windows, but I think everything one must do that way now can work at least as well in my proposal.

Again, we do not need to move the cursor across windows to select one; attention tracking eliminates the need for a selected window.

One need not move the cursor across windows to do window management. The edges of windows remain drag handles for resizing them and moving them around, and as I said above, with the cursor confined to the window, these become easier targets. One can combine this with the buttons and other controls I envision putting at those edges: drag to affect the window, click to use the control. I am a crank who prefers a tiled display to overlapping windows, but handling overlapping windows is fine: look at the protruding bit and click to pop it to the front.

Drag-and-drop across windows would require a bit of an adjustment, but eye tracking enables an improvement. One starts dragging an object in one window — turns to the other window — and that window’s cursor is right there with the object attached, responding to mouse movements. This will be more graceful, with less mouse movement and less risk of dropping onto the wrong window when dragging between windows on separate screens.
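
Sketching that hand-off in the same invented Swift terms as above (DragSession is, again, a hypothetical name, not a real API): the drag session simply re-attaches its payload to whichever window the gaze lands on, and the drop uses that window's own cursor position.

    // Continues the hypothetical TrackedWindow and Point types from the earlier sketch.
    final class DragSession {
        let payload: String                         // stand-in for whatever is being dragged
        private(set) var hostWindow: TrackedWindow  // the window whose cursor carries the payload

        init(payload: String, from window: TrackedWindow) {
            self.payload = payload
            self.hostWindow = window
        }

        // Called whenever the eye tracker reports the gaze has moved to another window:
        // the dragged object re-attaches to that window's cursor, already in view.
        func gazeMoved(to window: TrackedWindow) {
            guard window !== hostWindow else { return }
            hostWindow = window
        }

        // Dropping lands at the gazed window's own cursor, so there is no long mouse
        // journey across screens and no risk of releasing over the wrong window.
        func drop() -> (windowID: Int, location: Point) {
            (hostWindow.id, hostWindow.cursor)
        }
    }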

Imagine working with two text documents, referencing an old one while authoring a new one, bouncing back-and-forth between the two. Turning from the new document to the old one briefly, one might scroll to the next page in the old document, use the cursor in that document to select something one wants to quote, copy it, then turn back to the new document to find the cursor waiting right where one left it, ready to paste in the quote.

Plain text as the input exception

Keyboard shortcuts would act on the window one is looking at, just like mouse movement and clicks. But plain text is a little trickier.

In the new-and-old document example above, one may well want to type into the new document while looking at the old one, and there are a lot of situations like that. Text input boxes need a standard interface element allowing one to lock a box as the target of plain text input from the keyboard; while that lock is active, other text input boxes show as unavailable. One need not hunt down the locked text input box to unlock it: an unavailable text box would offer a control to release the old lock, allowing text input to go where one is looking ... or to immediately make that box the new locked input target.
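
A rough sketch of how that lock might route keyboard text, again with invented types (TextBox, TextInputRouter) rather than any real toolkit: text goes to the locked box when one exists, and otherwise to whatever text box sits in the window one is looking at.

    final class TextBox {
        let id: Int
        var contents = ""
        init(id: Int) { self.id = id }
    }

    final class TextInputRouter {
        // Set when the user activates the lock control on a text box; nil means
        // keyboard text simply follows the gaze.
        private(set) var lockedTarget: TextBox?

        func lock(_ box: TextBox)  { lockedTarget = box }
        func unlock()              { lockedTarget = nil }

        // While a lock is held, other boxes show as unavailable; clicking one could
        // either release the old lock or become the new locked target (UI not shown).
        func typed(_ text: String, gazedBox: TextBox?) {
            if let locked = lockedTarget {
                locked.contents += text      // the locked box wins, wherever one is looking
            } else {
                gazedBox?.contents += text   // otherwise text goes where one is looking
            }
        }
    }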

Having proposed this interface idiom, I am tempted to want this ability to lock the text input target, overriding the selected window, in the conventional systems we have now!
