Explaining X11 for the rest of us


Last time, we got more intimately familiar with the technical details of the window, one of X11's basic building blocks for building interactive user interfaces. We learned that windows can be composed and nested, which allows for complex event delivery and simplified code on the part of the toolkit, at the expense of added latency and graphical glitches.

It might seem like the vision of the desktop the X11 protocol had in mind doesn't match how users actually used it and how user interfaces, as a whole, evolved. One could even say that the window tree expected composing user interfaces in a way that doesn't really match reality.

I, in perhaps an ill-fated attempt to remain impartial, won't give my opinion on this, specifically, but today I want to talk about another feature commonly seen in modern user interface design, and how it offers challenges with the drawing model X11 specified and expects clients to use. I am, specifically, talking about alpha blending and transparency, or more specifically, "compositing". We'll take a look at what the X COMPOSITE extension does, how it works, and how it can be used together with other technologies to build a modern looking graphics system, all while remaining complete backwards compatibility with older systems. It's quite clever!

I don't want to remain too harsh on the X Window System. I can't really fault it for the lack of alpha blending. The system was first designed at MIT in 1984, the same exact year that Thomas Porter and Tom Duff released Compositing Digital Images, the seminal paper from which the mathematical underpinnings of alpha blending and alpha compositing are taken. Computer graphics was a rapidly-innovating field in that time, so for its day, it did quite well.

A different direction

I've long said that windows, in the X11 sense, aren't a tangible thing. They simply make their mark on the X server's front buffer, as a series of pixels which they own. The X server keeps track of which pixels each window owns, and sends them Expose events to let them know when to paint to these pixels.

Another way to think about this is that every pixel inside the server is owned by exactly one window. At any given point in time, you can point to a pixel, and the server can tell you which window is responsible for painting it. We can even visualize this. Given the simple moving kittens from the first article, we can just give every window a color, and with that, visualize who is responsible for each pixel.

Here, the blue color represents the root window, the yellow as the top kitten, and the red the bottom kitten.

Even when we're shaping the window, we still have the restriction that a pixel is owned by one window. The issue with this approach is that there's no way to blend multiple windows together, or letting one window be transparent atop one another. This isn't because it's expensive: blending is a simple enough operation. The issue is that when the bottommost kitten is being clipped away, we don't have access to its pixels at all; they simply don't exist anymore. The topmost kitten window owns that pixel.

In the early 90s, transparent terminals dominated Linux desktops with cool Linux graphics. But their mechanism was a simple optical illusion: when they drew, they picked up the system's wallpaper, and blended their terminal contents against that, leading to punch-through effects. Some more clever programmers found other interesting tricks to allow them to gain actual transparency. I won't detail these hacks in depth, because it is 2014, not 1996.

So, the first step to getting working compositing would be to have some way of keeping the pixels that a window has drawn. This is where the X COMPOSITE extension first comes into play.


It turns out the solution here is astonishingly simple. If we'd prefer that all drawing a window does instead goes to some place in memory, we simply create some place in memory and redirect all drawing the window to point there instead. And indeed, with the RedirectWindow request, we can do exactly that: mark a window so that all its drawing operations are indeed redirected so they instead go to some place in memory.

But it gets more interesting. As I mentioned way back in the first article, we already have a standard to create and represent "images" inside the X server: Pixmaps. I also mentioned that there was this concept called Drawable attached to all the drawing operations, which represented either a Window, or a Pixmap. So this makes our redirection operation quite simple: if the window is redirected, then simply swap out the Drawable the user is drawing to with a secret Pixmap that we created which contains the window's contents. For clarity purposes, this secret Pixmap we created is called the window's backing pixmap, since it backs the pixels of the window.

At this point, you might have another epiphany to bring it all together. I've been using the term "front buffer" when talking about where windows draw to. "Yes, yes, it's the place where windows borrow their pixels from, and where nothing lasts for ever. We get it already" I hear you cry. But the front buffer also has to exist somewhere in the server's memory. So when we're drawing to the front buffer, we're actually drawing to some set of pixels somewhere in the server's memory as well. It turns out that the front buffer has another name: the "root pixmap". I'll let that sink in.

Bear with me, because this next point is fairly confusing. Redirected windows send their drawing away from the front buffer and into the window's backing pixmap. But the root window already has backing pixmap: the front buffer! So, in a sense, we can say that the root window was redirected, even before the COMPOSITE extension was invented!

We also know that, in the normal drawing case, children of normal windows "borrow pixels" from the front buffer. With COMPOSITE, children of redirected windows actually borrow pixels from their redirected parent's backing pixmap instead. It's as if every redirected window has its own front buffer that children borrow pixels from.

Without this history and this context, the choice of word "redirect" feels fairly bizarre. Normally, the word "redirect" means "things get sent elsewhere". But what you see if you look at this with a fresh perspective is that windows that have been "redirected" draw to a thing they own, while windows that aren't are instead sending their drawing to some other window's backing pixmap. It's backwards! This very real confusion is one of the reasons that history and context are very important when trying to understand X11: the legacy behaviors are embedded so deeply into the system and the engineers working on it that we arrive at weird definitions like these. And if you've worked on X11 for a long time, it all seems so natural.

Phew. OK, enough talk. Let's take the kitten demo from the first article and redirect the window on the top to see what redirecting does.

... where did the other kitten go? It was redirected to a Pixmap, of course, but we can't actually see it. You see, while the root pixmap is just like any other pixmap, it turns out that it also has a special feature: it's also the front buffer. That is, it's also the buffer that we show to you on your monitor. All other pixmaps are off-screen storage. You can't see them unless you draw them to the front buffer.

By the way, I didn't cheat or anything and never create the window for the kitten or never map it or something like that. This is just how redirection behaves. If you click in the vicinity of where the kitten was in the original demo, you can even drag the kitten around. Redirection doesn't affect input... but we'll talk more about that later.

Drawing it all together

So, now that we have the window redirected and rendering to its own pixmap, now what? How do we turn that into transparency? Well, the answer is as simple as it sounds: something else is what takes all those windows and paints them together. You see, the way the X Window System was designed has been based on a principle of "mechanism, not policy". The X Window System gives application and desktop authors a giant toolbox with lots of cool tools to build fun systems, but never tries to dictate how that system could work. As such, rendering of all the composited windows is left up to a third-party client called a "compositing manager".

The compositing manager has one job to do: draw the backing pixmaps of the windows together to form a pretty picture as your output. Technically, the compositing manager could not draw any windows, or draw some windows more than once, or wobble them around or spin them on a cube, but unfortunately, there's no way for the compositing manager to influence how input is sent to individual windows, meaning that the compositing manager has no choice but to draw the windows so that they appear in the same place X thinks they are, otherwise you might find that you can drag invisible windows around, or clicking on a button is offset by a few pixels. Some of the more popular compositing managers might implement effects like wobbly windows, but input isn't actually morphed in that case. The effect simply snaps back fast enough before you have the chance to click on anything else.

But before we go any further, let's talk about the steps a compositing manager would take to put the windows on the screen. Now that we've redirected the windows, we can use the NameWindowPixmap request to fetch the window's backing pixmap. This Pixmap acts just like any other Pixmap on the system, so we can continue to use the standard drawing requests we learned about in the first article, like CopyArea. Unfortunately, this won't do if we want to do true alpha blending, since CopyArea and the rest of the core drawing requests were, again, invented before the widespread availability and understanding of alpha compositing.

In order to get a graphics API that supports alpha blending, we could opt to use OpenGL or fall back to software rendering, perhaps with some state-of-the-art hand-written MMX code, both of which are valid options, but thankfully another X11 extension can come to our rescue. The X RENDER extension, which supports a variety of advanced vector graphics including matrix transforms and gradients, provides to us an operation that can do alpha blending for us, which is also (confusingly) named Composite. Using this request to the X server instead of CopyArea, we can finally apply our hard-earned alpha channel and draw our windows semi-transparent.

Since we've redirected all windows on the desktop away to be able to blend them, the root window, and correspondingly, the root pixmap, are actually entirely blank. We can take this opportunity to just claim the root window as our own. So, when we paint our beautiful final image, we can actually just paint it directly to the root window and it will immediately be visible on the display.

And after all that work, we finally have transparency.

The intangible image

I want to make you sure that you realize here that the image painted with COMPOSITE is really just whatever the user wants it to be. This image doesn't actually get input: it just reacts to whatever else happens on screen. In order for input to make sense, however, the windows have to line up perfectly to whatever else is on the screen: much like in the "naive redirection" example above where the kitten just turned invisible, it's still a window that reacts to input, it just doesn't have any output on the screen itself. The compositor itself has to take special care to line the windows up and draw them where they normally belong.

Here, I'm using OpenGL to render the scene using OpenGL, and added a few fancy features like wobbling windows and a spinning 3D triangle in the middle. Go on, drag the top kitten around, and you'll notice it wobbling in real-time.

To do that, I import the backing window pixmap as a GL texture. On real hardware, like your Linux system, both backing window pixmaps from the X server and GL textures are just buffers of memory managed by your GPU, so we can sample from them directly. The OpenGL extension known as texture_from_pixmap, sometimes referred to as "TFP", lets you take an X11 Pixmap and convert it to an OpenGL texture. If we hook all the pieces up, that means that we redirect a window to its own backing pixmap, then retrieve the pixmap with NameWindowPixmap, then take that Pixmap and plug it into the TFP extension to retrieve a GL texture object that we can texture from just like it was a normal 2D texture.

Note that none of this works when the X server is on a different machine than your compositor or your GPU. Compositors, to be fast, require that you be on the same machine, so we can use the GPU directly.

This demo, however, is using a combination of HTML5 <canvas> and WebGL. Specifically, every Pixmap, including the root pixmap, is backed by a <canvas> stuffed somewhere in the DOM where it's not visible. WebGL also has the convenient ability to use HTML <canvas> elements when specifying a texture. Whether this is zero-copy or not depends on the platform, and I'm not sure myself, but regardless, the textures we're using in this demo are small, and copies are fast.

Wait a minute...

At this point, you might be thinking something's amiss. We have a window tree, but we shouldn't use it. Windows are clipped sections of the front buffer, but in order to have a modern, compelling desktop environment with transparency, we need to redirect windows away from the front buffer into their own pixmaps. Another external process, the compositor, takes these Pixmaps and draws its own scene with it. So, what exactly is the X server doing? Um, well, I can't really tell you. It's fielding rendering requests from clients, and it also has the responsibility of delivering input to clients, but in a lot of cases, these are things the compositor might be better off doing.

The fact that the compositor can't control pointer input is unfortunate. There are several practical cases where compositors would like that ability: the biggest such case, right now, is for scaling up legacy applications on high-DPI displays. We can scale the window to 2x, but since we don't control the input, the application doesn't behave properly.

Also, I haven't shown it off yet, but when you throw resizing into the mix, it gets even worse. When resizing a window, the X server will first fill the window with its background pattern, then fire off an Expose event to the window. In a composited world, this means that we resize the backing Pixmap of the window, fill it with a color, then wait for the application to draw into it, which leads to flickering and misfortune. Really, it would be great if the application could just draw into a Pixmap, and then hand that to the X server and say "here's my new window, all fixed up for the new size", avoiding the mess of invisible Pixmaps and resizing windows.

All of this is separated from when the compositor actually uses the Pixmap to draw into it, meaning that the user might see quite ugly window contents, simply because the X server is thinking in terms of a window tree, which is then redirected to a backing Pixmap, without regard for what the Pixmap ever contains or who will use it to paint.

This, more than anything, just setting up the story for newer display servers like Wayland and Mir. These projects started after X11 engineers asked themselves the very same questions, and built something better. We'll look at those in due time, but for now, we have to work our way through X11.

Coming up...

I know that last time, I wrote that we'd see "Expert Window Techniques". I'm usually writing two or three articles at a time, and it's simply down to how much fun I'm having writing the code and the article to determine which one I release next. I wasn't happy with "Expert Window Techniques", so I put that one off while I worked on this one instead.

I'm still not happy with "Expert Window Techniques", so next time we'll instead do something a little different. Next time, we're going to take a deep dive into one of the data structures and algorithms in the X server, the region. I'll see you next time, in "Regional Geometry".