Explanations

This article relies on a lot of interactive demos that use JavaScript to show how the X Window System works. Unfortunately, it won't work properly without them, so if you feel like you're missing out, try enabling JavaScript.

Before we begin...

Since this article contains a lot of interactive demos relying on fairly modern browser technology, let's make sure that everything is OK before continuing.

If you can see the stipple pattern above, that means that your browser is modern enough to see the interactive demos.

You might have noticed that when you ran your mouse over the stipple, your cursor changed. That's because this isn't just any old stipple image: that stipple is actually the background of a full X server session running in your browser using HTML5 canvas. All of the interactive demos will use this framework to explain what's going on under the hood.

Basic Architecture

Although it may sound a bit stilted, notice how I keep saying "the X Window System" instead of the more traditional shorthands "X", "X11", or "Xorg"? I want to be very careful to separate the ideas and design of the system from its component parts.

The X Window System is a networked display system. A server component, the X server, is responsible for coordinating between all of the connected clients, taking input from the mouse and keyboard, and pushing pixels to the output. The most popular X server implementation is the Xorg X server, developed by the X.Org Foundation and community. There are other X server implementations: you might remember that Xorg was forked from XFree86 a decade ago, and that Sun Microsystems has had several X server implementations, including Xsun and XNeWS. Today, Xorg is the dominant X server implementation and gets most of the development, but back in the day, multiple competing implementations existed.

X servers and X clients all speak a standardized network protocol to each other, known as X11. This protocol is well-specified, from the wire format to the semantics of every request. The protocol documentation is invaluable for any hacker who wants to learn more about this stuff.

Applications and toolkits don't write the wire format onto the socket directly, however. They often use client libraries that implement the protocol, like the traditional Xlib library, or the somewhat newer xcb.

I'm going to try and be precise in my nomenclature in this article.

When I talk about features or design decisions of the overall system, I will try to call it the X Window System, even if it sounds a bit verbose, e.g. The X Window System provides us with Pixmaps, which are images in the server's memory.

When I talk about features or details of the network protocol, I will talk about the X11 protocol, e.g. The X11 protocol provides a generic extension mechanism, which allows new features to be implemented in a forward-compatible way without having to redesign older parts of X11.

When I talk about the behavior of a client or a server, I'll say X client or X server, e.g. Using the MIT-SHM extension, X clients can pass memory buffers to the X server using System V shared memory, which avoids sending the data over the network connection and avoids large copies.

When I talk about features or the architecture of the Xorg X server implementation, I'll mention it explicitly as Xorg or the Xorg X server, e.g. In order to accelerate drawing calls, Xorg video drivers can provide hardware-accelerated versions of certain drawing primitives through EXA.

If I ever say "such-and-such is a feature of X", it's a bug.

Requests and events

As mentioned above, X clients connect to an X server and speak the X11 protocol. In more detail, clients can send requests to ask the X server to do something. A simple example of a request is CreateWindow, which tells the X server to "create a window". We'll learn more about windows in a bit.

If something interesting happens inside the X server (for instance, a window was created), the X server can send X clients an event. To keep the network from being flooded with traffic, X clients need to tell the X server which events they're interested in. The network side of it is a bit complicated (it's ugly, let's not get into it), but programs using Xlib can tell the X server that they want to listen for specific events using the XSelectInput function call.
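
To make the idea of selecting events concrete, here is a minimal sketch of a toy server that only delivers events a client has asked for. This is not real Xlib or wire-protocol code: the ToyServer class and its select_input and emit methods are hypothetical names made up for illustration. The mask values, however, are the core protocol's event mask bits.

```python
# Toy model of event selection; this is not a real X11 client.
# The mask values match the core X11 protocol's event mask bits.
KEY_PRESS_MASK = 1 << 0
BUTTON_PRESS_MASK = 1 << 2
EXPOSURE_MASK = 1 << 15
STRUCTURE_NOTIFY_MASK = 1 << 17

# Which mask bit gates each event type (a small, illustrative subset).
EVENT_TO_MASK = {
    "KeyPress": KEY_PRESS_MASK,
    "ButtonPress": BUTTON_PRESS_MASK,
    "Expose": EXPOSURE_MASK,
    "MapNotify": STRUCTURE_NOTIFY_MASK,
}


class ToyServer:
    """Hypothetical stand-in for the server's event bookkeeping."""

    def __init__(self):
        # (client, window) -> event mask that client selected on that window
        self.selected = {}
        self.delivered = []

    def select_input(self, client, window, mask):
        """Roughly what an XSelectInput call asks the server to remember."""
        self.selected[(client, window)] = mask

    def emit(self, window, event_type):
        """Deliver an event only to clients whose mask includes it."""
        for (client, win), mask in self.selected.items():
            if win == window and mask & EVENT_TO_MASK[event_type]:
                self.delivered.append((client, event_type))


server = ToyServer()
server.select_input("kitten-viewer", window=1, mask=EXPOSURE_MASK | KEY_PRESS_MASK)
server.emit(1, "Expose")       # selected, so it is delivered
server.emit(1, "ButtonPress")  # not selected, so nothing is sent
print(server.delivered)
```

The point of the mask is that the uninteresting ButtonPress never reaches the client at all, which is what keeps traffic down.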

Let's go

Let's start super simple. Here's a simple X server with two windows in it. They don't have any title bars, and you can't drag them around because I haven't launched a window manager yet. There are just two windows, each showing a kitten.

You might notice the "i" button in the top right of the demo above. Click on it, and the inspector will pop open. On the left side of the inspector is a list of windows, and on the right side are the properties and attributes for the selected window.

This lets you dig into all the demos in this article in detail, showing how things are constructed. So, if you ever find yourself not quite understanding something I'm saying, playing around with the inspector can often help.

A window, in the X11 protocol, is a structure that allows an X client connected to the X server to display something on the screen, and take input as well. Windows are fairly simple: they have an X, a Y, a width, and a height. This forms a rectangle which is known as the window's bounding rectangle. The window occupies this space. Windows also have a defined stacking order, which means that some windows can be above other windows. If a window is higher in the stacking order, it occludes the windows below it.

For historical reasons related to some initial implementations, showing a window in the X11 protocol is called mapping a window, and hiding a window is called unmapping. Windows, when initially created, are unmapped (or hidden). Clients have to map windows by sending a MapWindow request to the X server, and clients can later unmap windows using UnmapWindow. Note that unmapping a window doesn't destroy it; doing so simply hides it. The window can then be mapped again later. It's more like minimizing a window (in fact, that's how minimization is implemented in most window managers).

So, we know what windows are. But how are those kittens getting on the screen?

Exposing historical baggage

In the late 80s, when the X Window System was designed, RAM was a costly and scarce resource. If we stored the contents of every window in system memory, a single maximized window would take, well, 1 byte * (800 * 600 pixels) = almost half a megabyte! The user couldn't open more than 10 maximized windows before exhausting the 5MB in their workstation, and with 16-bit True Color around the corner, they couldn't fit more than 5! No, no, this can't possibly scale.
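
The arithmetic above is easy to check. This small sketch assumes the 5MB means five decimal megabytes and that nothing else on the machine uses any RAM, which is generous; the function names are made up for illustration.

```python
WIDTH, HEIGHT = 800, 600
RAM_BYTES = 5 * 1000 * 1000  # the 5MB workstation, taken as decimal megabytes


def backing_store_bytes(bytes_per_pixel):
    """Memory needed to keep one maximized window's pixels around."""
    return WIDTH * HEIGHT * bytes_per_pixel


def max_windows(bytes_per_pixel):
    """How many maximized windows fit if RAM held nothing else."""
    return RAM_BYTES // backing_store_bytes(bytes_per_pixel)


for bytes_per_pixel in (1, 2):  # 8-bit and 16-bit pixels
    print(bytes_per_pixel, backing_store_bytes(bytes_per_pixel), max_windows(bytes_per_pixel))
```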

So, if we can't store window pixels in system memory, where can we store them? They have to exist somewhere, right?

Nope. The trick the X Window System authors realized is that the pixels for a window don't have to exist at all. We only have one giant buffer of pixels for the entire screen, the front buffer, and windows borrow pixels to draw to.

The demo above shows two windows, with one window occluding the other. The window underneath moves from side to side, and you can see that when it moves, the window blanks out for a moment before redrawing itself.

The window on top, marked as kitten1.png in the inspector, owns a rectangle in the center of the screen. The window below, kitten2.png, owns an "L" shape slightly below and to the left.

When the X server needs pixels from a window, it tells the window to redraw the area it's missing pixels for using an Expose event. The window then responds by submitting drawing commands back to the X server. The X server then processes all these drawing commands, touching pixels on the front buffer where the window is.
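
Here is a minimal sketch of that round trip, using a tiny text-mode "screen" in place of real pixels. Everything in it (the owner, draw, and on_expose names, the window tuples) is made up for illustration and is not how any real X server is structured; it only shows that a single shared buffer plus "redraw when asked" is enough.

```python
# Toy model: one shared front buffer, no per-window pixel storage.
W, H = 12, 4
front = [["." for _ in range(W)] for _ in range(H)]

# Stacking order, bottom to top: (name, x, y, width, height).
windows = [("b", 0, 0, 6, 4), ("a", 4, 1, 6, 2)]


def owner(px, py):
    """Return the topmost window whose rectangle contains the pixel."""
    for name, x, y, w, h in reversed(windows):
        if x <= px < x + w and y <= py < y + h:
            return name
    return None


def draw(name, pixels):
    """Server side: apply a client's drawing, clipped to pixels it owns."""
    for px, py in pixels:
        if owner(px, py) == name:
            front[py][px] = name


def on_expose(name, x, y, w, h):
    """Client side: answer an Expose by redrawing the whole area."""
    draw(name, [(px, py) for py in range(y, y + h) for px in range(x, x + w)])


# Server side: every pixel is missing at first, so expose every window.
for name, x, y, w, h in windows:
    on_expose(name, x, y, w, h)

rows = ["".join(row) for row in front]
print("\n".join(rows))
```

Window "b" tries to paint its whole rectangle, but the pixels underneath "a" never change, because "b" doesn't own them.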

You can also drag kitten1.png around. Try it, and see if you can figure out how this behaves. Does it seem familiar? The authors of Windows chose the same design when they wrote their display server; however, they called their equivalent of the Expose event WM_PAINT instead.

Windows of all shapes and sizes

I said above that windows are rectangles. In the above demo, you see a circular window, so it's quite obvious that I lied. You can still drag the top window around, but it might slow your browser down. Sorry, the math here is a lot more computationally expensive, especially in JavaScript.

Internally, the X server keeps a record of all the pixels that are currently visible for every window, containing the part of the window's bounding rectangle that is currently showing on the screen. It's calculated by taking the window's overall bounding rectangle and then subtracting out the bounding rectangles of all the windows above it. It's somewhat like a simple 1-bit alpha mask.

This data structure is called the clip list in the Xorg codebase. As the word "list" in the name might tell you, it's not actually a 1-bit alpha mask. That would waste too much memory. Again, for a full-screen 800x600 window, you don't want a giant alpha mask in the server's memory telling it that it's mostly visible, or mostly obscured. Instead, the X server stores a more compact version of the same thing, as a list of rectangles containing the areas where the window is visible. For an 800x600 window that's not occluded, we've gone from a 60 KB bitmap mask to a single rectangle containing four 32-bit numbers, for a total of 16 bytes. Quite a savings!
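
As a rough sketch of how such a list of rectangles can be computed, the code below subtracts the rectangles of the windows above from a window's bounding rectangle. It is a simplified illustration under my own assumptions, not the algorithm the Xorg codebase or pixman uses: rectangles are (x1, y1, x2, y2) tuples with exclusive right and bottom edges, and the subtract and clip_list names are hypothetical.

```python
def subtract(rect, hole):
    """Return rect minus hole as a list of non-overlapping rectangles."""
    x1, y1, x2, y2 = rect
    hx1, hy1, hx2, hy2 = hole
    # Clamp the hole to the rectangle; if they don't overlap, nothing is cut.
    hx1, hy1 = max(hx1, x1), max(hy1, y1)
    hx2, hy2 = min(hx2, x2), min(hy2, y2)
    if hx1 >= hx2 or hy1 >= hy2:
        return [rect]
    pieces = [
        (x1, y1, x2, hy1),    # band above the hole
        (x1, hy1, hx1, hy2),  # strip left of the hole
        (hx2, hy1, x2, hy2),  # strip right of the hole
        (x1, hy2, x2, y2),    # band below the hole
    ]
    # Drop pieces that ended up with zero width or height.
    return [(a, b, c, d) for a, b, c, d in pieces if a < c and b < d]


def clip_list(bounds, windows_above):
    """Visible part of a window: its bounds minus every window above it."""
    region = [bounds]
    for hole in windows_above:
        region = [piece for r in region for piece in subtract(r, hole)]
    return region


def area(region):
    return sum((x2 - x1) * (y2 - y1) for x1, y1, x2, y2 in region)


below = (0, 0, 800, 600)
above = (600, 400, 1000, 800)  # covers the bottom-right corner
visible = clip_list(below, [above])
print(visible, area(visible))
```

An unoccluded window comes back as a single rectangle, and covering one corner only costs one more.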

This data structure is seen throughout X11 programming, and it's known as a region. If you've ever used the cairo graphics library, it has an implementation of regions, called cairo_region_t. (Actually, the implementation in the Xorg codebase and the one in cairo are the same code: they both use the pixman library underneath.)

However, like any data structure, it isn't efficient in all use cases. A simple example here is a one-pixel checkerboard pattern: instead of one bit per pixel, now we have a 16-byte rectangle for every visible pixel! This is much more computationally expensive to work with, and takes a lot more memory. Thankfully, not a lot of software will use a checkerboard region.
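
To put numbers on the checkerboard case, here is a quick back-of-the-envelope calculation. It assumes a one-pixel checkerboard on an 800x600 window and the same 16 bytes per rectangle as before, and it ignores any per-region overhead a real implementation would add.

```python
WIDTH, HEIGHT = 800, 600
BYTES_PER_RECT = 4 * 4  # four 32-bit numbers per rectangle

# A 1-bit mask costs one bit per pixel no matter what the pattern is.
bitmap_bytes = WIDTH * HEIGHT // 8

# In a one-pixel checkerboard, no visible pixel touches another visible
# pixel horizontally or vertically, so each needs its own 1x1 rectangle.
visible_pixels = sum((x + y) % 2 == 0 for y in range(HEIGHT) for x in range(WIDTH))
region_bytes = visible_pixels * BYTES_PER_RECT

print(bitmap_bytes, region_bytes, region_bytes // bitmap_bytes)
```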

Ahem, sorry. Enough reminiscing. Anyway, the story goes that when people working on the X server source code were doing anything with windows, they had code that looked like this:

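/* Pseudocode for the pattern that kept showing up. */
void do_something_with_window(Window window, Window other_window)
{
    /* Start with a region holding just the window's bounding rectangle. */
    Region region = new Region();
    region.add_rectangle(window.bounding_rectangle);

    /* Combine it with some other geometry, then use the result. */
    region.intersect_rectangle(other_window.whatever);
    draw_some_nonsense(region);
}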

That is, the code was almost always taking the window's bounding rectangle, and then converting it into a region to use elsewhere. So, they said to themselves, "Hey, why don't we let the user set any arbitrary region that will be used instead of the bounding rectangle?" And thus, the X SHAPE Extension was born. The X SHAPE Extension allows the user to tell the X server to use a bounding region instead of a bounding rectangle.

This is how we get a circular window, as above: we construct a region of a circle, and then set the window's bounding region to be that circle. This is also how the classic xeyes and oclock get their classic cutout shapes.
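
One way to "construct a region of a circle" is to emit one thin rectangle per row of pixels. The sketch below does that; it is only an illustration of the idea, and I'm not claiming it is how xeyes, oclock, or the demo above build their shapes. Rectangles are (x1, y1, x2, y2) with exclusive right and bottom edges, and circle_region is a hypothetical helper.

```python
import math


def circle_region(cx, cy, radius):
    """Approximate a filled circle as one rectangle per row of pixels."""
    rects = []
    for y in range(cy - radius, cy + radius):
        # Vertical distance from the center to the middle of this row.
        dy = (y + 0.5) - cy
        half_width = math.sqrt(radius * radius - dy * dy)
        x1, x2 = round(cx - half_width), round(cx + half_width)
        if x1 < x2:
            rects.append((x1, y, x2, y + 1))
    return rects


region = circle_region(50, 50, 50)
region_area = sum((x2 - x1) * (y2 - y1) for x1, y1, x2, y2 in region)
print(len(region), region_area)
```

A circle of radius 50 needs 100 rectangles, which is still far cheaper than the checkerboard, since each row collapses into a single span.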

The inspector will show the bounding shape region of a window in yellow.

There are two more notes I want to make here. Although this might seem like it would allow for some windows to be semitransparent by poking holes in them, it's still all-or-nothing: each pixel either belongs to the window entirely, or it doesn't belong to it at all. That is, if you take your finger and point to any pixel on the above display, I can do some math and tell you the exact window that will paint to that pixel at any point in time. The X SHAPE Extension doesn't change this; it just makes the logic for figuring out which window "owns" a specific pixel more complicated than testing against simple rectangles. In order to allow for true semitransparent windows, we'll have to somehow figure out a way to blend between the pixels that windows draw. We'll explore that another time.

Additionally, setting a bounding region that's larger than the window's width/height can't actually make the window own some pixels that it wouldn't own otherwise. The X SHAPE Extension only allows a window to carve away from where it would normally paint and give that space to the windows underneath.
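
Both notes can be captured in a few lines. In this sketch, a pixel belongs to the topmost window that contains it in both its bounding rectangle and its shape, so a shape can only take pixels away. The data layout is my own simplification: shapes are given in screen coordinates as lists of (x1, y1, x2, y2) rectangles, and None means the window is unshaped.

```python
def contains(region, px, py):
    """True if any rectangle in the region covers the pixel."""
    return any(x1 <= px < x2 and y1 <= py < y2 for x1, y1, x2, y2 in region)


def owner(stack, px, py):
    """stack is ordered bottom to top; entries are (name, bounds, shape)."""
    for name, bounds, shape in reversed(stack):
        inside_bounds = contains([bounds], px, py)
        inside_shape = shape is None or contains(shape, px, py)
        # The pixel must be inside both: a shape can only carve pixels away.
        if inside_bounds and inside_shape:
            return name
    return None


stack = [
    ("below", (0, 0, 100, 100), None),
    # The second shape rectangle deliberately sticks out past the bounds.
    ("top", (20, 20, 60, 60), [(30, 30, 50, 50), (0, 0, 200, 10)]),
]

print(owner(stack, 40, 40))  # inside top's bounds and shape
print(owner(stack, 25, 25))  # carved out of top, so it falls through
print(owner(stack, 10, 5))   # in top's oversized shape, but outside its bounds
```

Every pixel gets exactly one answer, which is the all-or-nothing property in action.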

Pixmaps

The more attentive of you playing around with that last demo might have noticed something special when poking around in kittencircle.png with the inspector. In the Attributes section of the inspector, you might have noticed a background-pixmap attribute, and hovering over it shows the circle kitten image! That raises a few questions: first of all, what is a Pixmap? Why didn't the other windows have a background-pixmap attribute? They seemed to have a background-pixel attribute instead. What's with that?

You might have been wondering how it could have been efficient to keep transferring missing pixels for the kitten images at 60 frames per second, over a network connection in the 80s. The answer is that we're not. Instead, when we create the window and load the image in, the code creates a Pixmap, which allows us to have memory-backed pixel storage on the server. We then upload the pixels for the kitten image to the Pixmap once, using a PutImage request.

Whenever we want to draw to the window from an Expose event, we simply tell the X server to copy from the kitten pixels it already has in its memory space. To do that, we send a CopyArea request from the Pixmap to the Window. The pixels never have to cross the network again.
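
Here is a toy model of why this is cheap. It counts only the pixel bytes that cross the connection and ignores request headers; the ToyServer class and its put_image and copy_area methods are hypothetical stand-ins for the PutImage and CopyArea requests, not a real client library.

```python
class ToyServer:
    """Hypothetical server that keeps pixmaps in its own memory."""

    def __init__(self):
        self.pixmaps = {}        # pixmap id -> pixel values held server-side
        self.bytes_received = 0  # pixel bytes that crossed the connection

    def put_image(self, pixmap_id, pixels):
        """Upload pixels into a server-side pixmap (stands in for PutImage)."""
        self.pixmaps[pixmap_id] = list(pixels)
        self.bytes_received += len(pixels)

    def copy_area(self, pixmap_id, window_pixels):
        """Copy a pixmap into a window, entirely inside the server."""
        window_pixels[:] = self.pixmaps[pixmap_id]


server = ToyServer()
kitten = bytes(range(256)) * 16  # a fake 64x64, 1-byte-per-pixel image
window = [0] * len(kitten)

server.put_image(1, kitten)      # the only time pixels are uploaded
for _ in range(60):              # sixty redraws, e.g. one second of Exposes
    server.copy_area(1, window)

print(server.bytes_received)
```

Sixty redraws later, the byte counter still shows only the single upload.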

You might have noticed the word Drawable in the protocol documentation. A Drawable is something that the user can draw to, which is either a Pixmap or a Window. A Pixmap draws to its own memory storage, but a Window draws to the pixels on the front buffer which it owns.

OK, so, then what's this about background-pixel and background-pixmap?

When the X server sends Expose events to a window, keep in mind that means that there are pixels "missing" from the front buffer that need to be redrawn. It needs to fill in the missing pixels with something, and the X server provides a window with three options:

It can fill the pixels with a color. This is what the background-pixel attribute specifies.
It can fill the pixels with the contents of a pixmap. This is what the background-pixmap attribute specifies.
It can do nothing, and simply leave whatever pixels were there before, and wait for the application to redraw. This is also the default, and it's what you get when there are no explicit background-pixel or background-pixmap attributes. This is how Windows works, and why you see the repeated "IE6 crashed" window when iexplore.exe hangs: it can't respond to the WM_PAINT events, so the old pixel contents stay on the screen!
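
The three options can be sketched as one small function. This is a simplified model under my own assumptions: the function name and arguments are made up, the "front buffer" is a nested list, coordinates are treated as window-relative, and the pixmap is tiled from the top-left corner.

```python
def fill_exposed(front, exposed, background_pixel=None, background_pixmap=None):
    """Fill exposed (x, y) pixels before the client gets a chance to redraw.

    At most one of the two background arguments is expected to be set.
    """
    for x, y in exposed:
        if background_pixmap is not None:
            # Option 2: tile the pixmap's contents across the window.
            rows, cols = len(background_pixmap), len(background_pixmap[0])
            front[y][x] = background_pixmap[y % rows][x % cols]
        elif background_pixel is not None:
            # Option 1: fill with a solid color.
            front[y][x] = background_pixel
        # Option 3: neither is set, so the stale pixels stay put.


exposed = [(0, 0), (1, 0)]

solid = [["old", "old"], ["old", "old"]]
fill_exposed(solid, exposed, background_pixel="blue")

tiled = [["old", "old"], ["old", "old"]]
fill_exposed(tiled, exposed, background_pixmap=[["a", "b"]])

stale = [["old", "old"], ["old", "old"]]
fill_exposed(stale, exposed)

print(solid[0], tiled[0], stale[0])
```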

Coming up...

What is that mysterious "Root Window" we saw in the inspector? How do desktop environments set it up so that windows can be dragged and resized? Why do I have to use GtkEventBox in order to make my widgets respond to input?

All those questions, and more, will be answered... next time! In "Advanced Window Techniques"!