Basics for a *Good* Image Protocol
[ I wrote this document about half a year ago, after 2+ years of thinking about what it takes to have a good image protocol, one that addresses my then-posted concerns at VTE #729204. I publish it now unchanged (except for a few small changes, and this very paragraph) in the hope that it's useful, but – as per an announcement I'm just about to make in VTE's issue tracker – I won't be following the comments, engaging in discussions, or participating in the actual design of the protocol. ]
The discussion at #12 is a nightmare.
Many people have expressed fundamental technical concerns about introducing pixel size. On June 30, 2019 a comment said:
I think we already buried the "Pixels to the terminals!" idea like 30 posts ago?
And then on October 29, about 50 posts later, another comment said:
Imho its a big mistake to put a notion of pixels into a terminal image spec.
Like, what? Are we just going around in circles?
It doesn't matter who wrote these comments; I'm not picking on anyone (in fact, I agree, except for the soft choice of wording). However, if something is agreed on, and months later it is still a discussion topic with "humble opinions", we are in serious trouble.
It's impossible to design something if we can't lay down firm principles and can't make decisions which we stick to and build upon.
There's also a recent concrete proposal for images in that thread which mentions only a few principles (by far not everything that matters) and quickly jumps into escape sequences. How could we jump to details like the choice of escape sequence before we agree on the conceptual basics??
Another comment clarifies that that proposal doesn't aim to address all the problems and doesn't intend to be a replacement for the existing ones. Then what's the point, really? We already have at least five different image protocols (sixel, regis, iterm2, kitty, terminology); terminals don't know which one to implement, and neither do apps. Adding a sixth such protocol (and then a seventh? eighth? hundredth?) clearly could not improve anything.
What the terminal world needs is not more and more competing protocols, but one properly designed protocol that we can all migrate to. What the terminal world needs right now is not a simple image protocol, but a good one (where one aspect of being good is being as simple as is reasonable, within the laid down principles and goals).
Let's start this over from scratch. Let's design a good image protocol, that is suitable for all terminals and applications to adopt. It needs to go along the following lines.
Must Have
A good image protocol MUST build upon the following key observations and key conclusions, some of which are detailed in #25 (make sure you familiarize yourself with that document):
Headless, multi-headed
As per that other doc. It MUST address headless, multi-headed terminals, detach-attach operations, font size changes etc.
Deterministic emulation
As per that other doc. A good image protocol MUST NOT let the behavior, as in which character cell ends up holding what character, depend on the font size or other similar property.
No pixel size
A consequence of the previous:
The only unit we can have is character cell size. The concept of pixel, as a unit, can not be introduced.
(Apologies for the loose wording: by "pixel size" I mean the size of a character in pixels, and not the size of a pixel in inches (like DPI) – although that one cannot be introduced either, for the same reasons.)
In many cases this concept just doesn't exist. In other cases it has multiple values, or changes over time, or exists but is unknown by the application.
The protocol thus MUST look along the lines of: "here's an image, display it in the following W×H character cells".
An operation like "display this image in the next 100 pixel rows, or at its 1:1 size (i.e. the height in pixels is taken from the image) and continue printing text in the next character row" would necessarily break the principle of deterministic, font-agnostic emulation.
Don't worry: Pixel-perfect rendering will be possible. Stay tuned. It's just not something the specification can revolve around.
Sixel is broken
A consequence of the previous ones: Sixel is broken.
Sixel is broken because the emulation behavior depends on the font size, thus theoretically can't be supported by terminals that don't have the concept of pixel size (e.g. a detached tmux).
Sixel is broken because it cannot be supported by tmux with side-by-side panes.
Sixel's palette-based approach is also a terrible legacy, practically unable to transfer photos.
Do not ever mention Sixel as a reference while designing a good image protocol, unless you bring it up as an example of how not to do it.
No asynchronous operation
As per that other doc. Important use cases, such as listing files along with thumbnails, are incompatible with the unreliability and slowness of asynchronous operation.
Okay to resize (1:1 is a myth)
It's absolutely okay, inevitable, and even desired for the terminal to resize the image.
The concept of 1:1 size is a myth. You hardly ever get it. When you look at photos (from your camera, a website, etc.) you never get it. The resolutions of cameras and monitors keep increasing. Look at the same picture on your TV, on your computer screen, on your mobile: it's sized differently on each of them, and that's okay. What's a pixel on a Retina / HiDPI screen anyway?
You look at a graph and want crisp edges? It's not that bad after rescaling either. What's the next thing you do with that graph? Place it in a word document or a spreadsheet, upload it to a web page? The 1:1 size is gone, the crisp edges are gone.
Or you want a sophisticated graph in order to examine what is going on? Wouldn't you prefer an interactive application then where you can zoom in along the axes, etc.?
That being said, there will be ways that give you 1:1 size, see later.
Keep the aspect ratio
While resizing is okay, stretching is not okay. An image MUST NOT be automatically stretched, breaking its aspect ratio. A circle MUST remain a circle, and a square MUST remain a square by default.
(There might be an optional feature to stretch the image.)
Gap above/below, or on the sides
Since the emitter cannot know the aspect ratio of character cells, there might be a gap either above and below, or on the two sides of the image, and the emitter application cannot know which of these will happen.
This is not a problem. Think of television, or movies on the computer. TV sets, monitors have different aspect ratios, and the programme also has varying ones, depending on the content being broadcast, streamed, played locally. Sometimes you have black stripes at the top and bottom. Sometimes you have them on the two sides. Same story when you look at your photos fullscreen. It is okay.
Every character cell has a background color, character cells containing a bit of the image should not be exceptions either. This background color shines through translucent pixels of the image, and will also be visible at these gaps.
No client-side stitching
A consequence of the previous point:
The emitter application cannot tell which character cell contains which exact part of an image. E.g. if an image is displayed in 3×3 characters, the top left cell might contain the upper quarter or so, and left third of the image, and some gap on the top. More on this below at "Overall architecture".
Thus a client application usually cannot stitch an image together from smaller bits. The image has to be transferred in its entirety, along with instructions on how to crop it by the terminal if necessary.
Arbitrary cell-based masking
Scrolling in tmux panes, arbitrary placement of free-flowing sub-windows along character boundaries, and various other legitimate use cases one might think of (e.g. "mc" presenting a popup dialog while an image is displayed in one of its panels) require that images can be masked (cropped) arbitrarily along character edges.
In the most extreme case, if an image is requested to occupy W×H character cells, a W×H sized bitmap (or some equivalent structure) must be transferable to specify which cells will display a part of this image, and which cells to skip.
It is not okay to demand that the image is displayed without cropping, with the undesired parts subsequently overwritten by the desired text. This could result in unacceptable flicker. The image has to be displayed with the required masking straight away.
Overall architecture
A consequence of all of the above:
Terminal emulators would remain cell-based. A cell could contain either whatever it can already contain (a letter, a half of a double-wide letter, the "erased" state etc.), or a picture fragment.
If the terminal supports changing the font, there would be no way to precompute once and for all the exact tiny picture a character cell contains: it would change as the aspect ratio changes and the gap perhaps moves from the top and bottom to the sides, or similar.
One particular character cell would contain something like: "There's that PNG. Imagine that it's scaled to W=30 × H=18 character cells to fit (or fill, etc.) that area, aligned to the top-center (or wherever else), then sliced to W×H (30×18) equal parts. I hold position (X=25, Y=15) of that".
The next character cell could easily point to the same PNG, with the same sizing/scaling/aligning parameters, and be the (26, 15) fragment of that. In this case they would line up perfectly, with no user-visible stitching. Or the next character cell could, of course, contain something totally different.
Terminal emulators could (and should) cache the scaled/aligned variant of the picture, and recompute it every time the font size changes.
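To make this concrete, here is a minimal sketch of the kind of per-cell record a terminal could store, and of how the corresponding source rectangle could be recomputed whenever the cell's pixel size changes with the font. All names are invented, and "resize to fit" with center alignment is assumed:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ImageFragment:
    """Hypothetical per-cell record: "I am fragment (x, y) of the placement of
    image `image_id`, which is laid out over grid_w x grid_h character cells"."""
    image_id: int   # reference into the terminal's image pool
    grid_w: int     # designated area width in character cells, e.g. 30
    grid_h: int     # designated area height in character cells, e.g. 18
    x: int          # this cell's column within the area, 0 .. grid_w - 1
    y: int          # this cell's row within the area,    0 .. grid_h - 1

def source_rect(frag, img_w, img_h, cell_w_px, cell_h_px):
    """Which part of the image (in image pixels) this one cell shows, assuming
    "resize to fit" and center alignment.  Recomputed whenever cell_w_px or
    cell_h_px changes, i.e. whenever the font changes."""
    area_w = frag.grid_w * cell_w_px
    area_h = frag.grid_h * cell_h_px
    scale = min(area_w / img_w, area_h / img_h)        # fit: scale uniformly
    off_x = (area_w - img_w * scale) / 2               # centering: the gap ends up
    off_y = (area_h - img_h * scale) / 2               # at the sides or top/bottom
    x0 = (frag.x * cell_w_px - off_x) / scale          # cell's top-left corner,
    y0 = (frag.y * cell_h_px - off_y) / scale          # in image pixel coordinates
    # Coordinates outside 0..img_w / 0..img_h mean the cell (partly) falls in
    # the gap, where only the cell's background color is painted.
    return x0, y0, cell_w_px / scale, cell_h_px / scale
```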
A few words about implementation
This architecture, and not introducing the new unit of "pixel", is the direct consequence of the principles laid down, in order to come up with a protocol that can be supported in popular setups such as working under tmux.
However, this architecture also has the great advantage of being much simpler (in both specification and implementation) than anything that tries to convert the terminal into a mixture of character cells and pixels.
The overall behavior, namely how data is stored, how the scrollback is handled, how hundreds of escape sequences operate on the buffer can all remain unchanged.
As for storage: In terminals, a character cell contains a Unicode character (at least 21 bits, potentially more if combining accents are supported), and attributes such as foreground color (at least 25 bits if truecolor + old palette are supported), bold, italic etc. that are unnecessary for picture fragments. That is, with the addition of 1 new bit (picture or not) and some bitpacking, terminals already have at least 50-ish bits free for picture fragments. This should be enough to encode an identifier for an image along with its sizing-stretching-aligning properties, and the coordinates within.
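As a back-of-the-envelope sketch of how little space this needs (the field widths below are arbitrary assumptions, and the sizing/aligning options are assumed to live in the pooled placement record rather than being repeated in every cell):

```python
# Illustrative bit budget only; real field widths are an implementation choice.
ID_BITS, X_BITS, Y_BITS = 24, 12, 12                  # 48 bits in total

def pack_fragment(placement_id, x, y):
    """Pack a picture-fragment reference into an integer that fits in the
    roughly 50 spare bits of a cell."""
    assert placement_id < (1 << ID_BITS) and x < (1 << X_BITS) and y < (1 << Y_BITS)
    return (placement_id << (X_BITS + Y_BITS)) | (x << Y_BITS) | y

def unpack_fragment(packed):
    return (packed >> (X_BITS + Y_BITS),               # placement_id
            (packed >> Y_BITS) & ((1 << X_BITS) - 1),  # x
            packed & ((1 << Y_BITS) - 1))              # y
```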
There needs to be a pool to store the images, and some kind of reference counting, and forced eviction if the pool gets full. This is the trickiest bit. Something similar is already implemented in VTE for the pool of explicit hyperlinks.
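A rough sketch of such a pool, with invented names and a deliberately naive eviction policy:

```python
import collections

class ImagePool:
    """Hypothetical pool of uploaded blobs, refcounted by the cells that point
    at them, with a size cap and forced eviction of unreferenced blobs."""
    def __init__(self, max_bytes=64 * 1024 * 1024):
        self.max_bytes = max_bytes
        self.used = 0
        self.blobs = collections.OrderedDict()   # blob_id -> (data, refcount)

    def add(self, blob_id, data):
        self.blobs[blob_id] = (data, 0)
        self.used += len(data)
        self._evict()

    def ref(self, blob_id):                      # a cell starts pointing at it
        data, rc = self.blobs[blob_id]
        self.blobs[blob_id] = (data, rc + 1)

    def unref(self, blob_id):                    # a cell stops pointing at it
        data, rc = self.blobs[blob_id]
        self.blobs[blob_id] = (data, rc - 1)

    def _evict(self):
        # Naive policy: drop the oldest unreferenced blobs until under the cap.
        for blob_id, (data, rc) in list(self.blobs.items()):
            if self.used <= self.max_bytes:
                break
            if rc == 0:
                self.used -= len(data)
                del self.blobs[blob_id]
```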
For the emulation handling component, a new escape sequence needs to be added that places the image fragments in the cells. This sounds straightforward.
Implementing the rendering component sounds very easy, too.
The required changes are limited to three-four different areas of the terminal, and are reasonably sized each.
Outline of a good protocol
Two-phase escape sequence
The idea has been raised to have two separate steps:
- Upload the image
- Display (parts of) the image
I really love and endorse this idea, or rather a more generic variant of it:
Generic blob uploading
Multiple different new protocols can be imagined that require transferring a somewhat larger amount of data. In addition to the current topic, a few possible ones:
- Icon to show in desktop notification (#13)
- Favicon for the given tab (Windows Terminal #1868, GNOME Terminal #142)
- Upload a background image
- Play some short sound (#14)
- Upload a font file to be used
- etc.
Instead of uploading a picture for this new protocol, we should design a way to upload an arbitrary binary resource (probably along with an optional MIME type), which, in turn, could be used by all of these protocols. This keeps things much simpler: we don't have to figure this out over and over again, and we won't end up with a different solution for each feature.
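Purely to illustrate the shape such an upload could take: the OSC number, the parameter names and the framing below are all invented for this sketch; nothing here is specified.

```python
import base64, sys

def upload_blob(data, name=None, mime=None, out=sys.stdout):
    """Emit a *hypothetical* "upload resource" sequence.  OSC 7777 is an
    invented number and the key=value parameter syntax is just a sketch."""
    params = ["encoding=base64"]
    if name:
        params.append("name=" + name)        # e.g. "com.example.logo"
    if mime:
        params.append("mime=" + mime)        # e.g. "image/png"
    payload = base64.b64encode(data).decode("ascii")
    out.write("\033]7777;" + ";".join(params) + ";" + payload + "\033\\")
    out.flush()
```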
Uploading a picture is an operation orders of magnitude more expensive than printing text, due to the amount of data. It's a great win to save significantly on network traffic, and not have to retransfer the entire image every time someone e.g. scrolls by a line in tmux over ssh, or every time someone wants to reset the favicon from the shell prompt, etc.
During the discussion of the desktop notification icon, it was mentioned that a cap is desired on the length of escape sequences. This might be required by some standard (I'm not sure), and is a good protection against runaway escape sequences. I don't know what the right answer is here, whether the image should be transferred in multiple smaller escape sequences or in a single arbitrarily large one. But whatever we agree on should be consistent across these possible new extensions.
We should study existing file transfer protocols (x/y/zmodem, kermit, ...?). I'm not familiar with any of them. If any of them is suitable for us (allows naming the resource, attaching a MIME type, uses an escape sequence within the terminal line rather than an external channel, needs one-way traffic only, is 7-bit clean...) then probably we should use that.
If a new protocol is designed, it should use some widely available transfer encoding. It might be a good idea for future extendability to mandate a content-transfer-encoding field in the protocol. I vote for either requiring base64 support, or requiring both base64 and base85 support to be implemented by supporting terminals. Base64 has the advantage that it's universally and easily available, even from shell scripts. Base85 reduces traffic to 125% of the payload, compared to base64's 133.3%, and thus is a better choice for more sophisticated apps that wish to display photos. We might even be thinking about inventing base192 (at least 105.47%, avoids C0 and C1 controls) or base224 (at least 102.46%, avoids C0 controls) or base253(-ish) (at least 100.21%, avoids CR, LF and the terminator byte), or some kind of backslash-escaping (varying size), or even transferring the picture as a raw bytestream (with content-length sent in advance), bearing in mind the gotchas mentioned in #25.
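The overhead figures quoted above are easy to verify, e.g. with Python's base64 module, which provides both encodings:

```python
import base64, os

raw = os.urandom(3 * 4 * 1000)        # arbitrary binary payload
b64 = base64.b64encode(raw)           # 4 output bytes per 3 input bytes
b85 = base64.b85encode(raw)           # 5 output bytes per 4 input bytes
print(len(b64) / len(raw))            # 1.333...  (~133.3 % of the payload)
print(len(b85) / len(raw))            # 1.25      (125 % of the payload)
```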
Named / unnamed
Many utilities just want to print an image once – like the Sixel protocol does. Life should be kept fairly simple for them. So there should be a way to transfer an unnamed one-shot resource, which then can be used once (by the following escape sequence actually displaying it).
For more complex cases, when an image can be reused, it should be transferred as a named resource. I'd like to propose the Java-style naming convention "com.yourdomain.whatever" to avoid accidental collisions, assuming fair play from the apps.
There needs to be a way to "unref" the named resource, that is, tell the terminal that it won't be referenced again (at least not until it's reuploaded). The terminal can then free up the blob as soon as it's no longer contained in any of its character cells. Maybe there's a need to increase/decrease the refcount as well (e.g. for nested shells)? We might also consider two different sets of resources, one for the normal screen and one for the alternate screen, and automatically unref all of the alternate screen's resources when switching away from it.
When a named resource is redefined to a different value, terminals should make sure that wherever it was already printed, the previous contents stay. They can do this by some internal name mangling, versioning or the like.
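One possible way to satisfy the "previous contents stay" requirement is internal versioning, roughly like this (a sketch with invented names, reusing the pool idea from earlier):

```python
class NamedResources:
    """Map a client-visible name to an internal, versioned pool id, so that
    cells printed before a redefinition keep showing the old contents."""
    def __init__(self, pool):
        self.pool = pool                 # e.g. the ImagePool sketched earlier
        self.current = {}                # "com.example.logo" -> (name, version)

    def define(self, name, data):
        version = self.current.get(name, (name, 0))[1] + 1
        internal_id = (name, version)    # old versions stay valid in old cells
        self.current[name] = internal_id
        self.pool.add(internal_id, data)
        return internal_id               # what newly printed cells reference

    def unref_name(self, name):
        # The app promises not to reference the name again; the blob itself is
        # freed by the pool once no cell points at it anymore.
        self.current.pop(name, None)
```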
Positioning, cropping
There are several quite different scenarios to keep in mind.
Note that for the following bullet points, I assume that the picture is uploaded in a separate step. If that proposal is turned down, things become more complicated here. (This is another reason for having a separate upload phase.)
1. Some utilities just want to display a picture, and nothing next to it. These apps may or may not care about the terminal's width, but probably don't care about the terminal's height, or whether the picture fits or the scrollbar has to be used. Printing an image has to work even in the extreme case where the terminal is only 1 or 2 lines tall but the picture is much taller. So it's desirable to have an escape sequence that simply "just prints" the image from the cursor, causing scrolling if necessary. This resembles the Sixel experience, but as we've seen, the desired number of character columns and character rows would need to be specified by the emitter application.
2. Some utilities want to "just print" something, but this something has a bit more complex layout. Think of a multi-column "ls" boosted with thumbnail support: multiple small pieces of picture and text next to each other, maybe with thumbnails that are 2–3 character rows tall. Or "curl wttr.in" using actual pictures. Still, just like in the previous case, these utilities probably don't care about the terminal's height (the user can just scroll back if the output was too long), and have to work even with extremely short terminals. For this, it's desirable to have an escape sequence that prints one character row of an image, and advances the cursor just by that much (as if simple text was printed). The emitting utility can then loop through the rows, if it wants to display multi-row pictures, and always decide what to print next to a given character row of the image.
3. For apps that control the entire canvas, e.g. tmux, w3m, ranger, mc etc., there needs to be a way to fully control the placement and cropping of the image, along character edges. This basically means: scale the image to W×H cells, place it at (X,Y) (possibly partially outside of the viewport in any of the four directions) (warning: needs to support negative numbers!), and print parts of the image according to the given W×H bitmap (or some equivalent structure – for example a list of rectangles might be more convenient). As opposed to the first two cases, this method of printing should not scroll the terminal, and thus not affect the scrollback. Alternatively (or in addition), there could be an escape sequence which places a single picture fragment in a particular character cell, and then applications can loop through all the desired cells; although this might have a higher network and performance cost.
Strictly speaking, 2 isn't needed because it can be emulated using 3's sequences, and 1 isn't needed because it can be emulated by repeatedly (for each row) using 2's sequences. The exact details depend on the choices yet to be made. However, providing dedicated escape sequences for 2 and 1 makes an emitter application's life easier by not having to loop over each cell or each row, significantly reduces the network traffic, and likely makes the overall user experience noticeably snappier.
Providing direct support for use cases 1 and 2 also has the advantage that the image resource can remain unnamed (for use case 1, as well as use case 2 with single-row tall images), rather than having to ref/unref the image. (A sketch of what use case 3's sequence might look like follows below.)
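To make use case 3 a bit more tangible, here is what the emitter's side could look like. The OSC number, the key=value parameters and the row-by-row bitmap syntax are all invented purely for illustration; nothing here is a specified sequence:

```python
import sys

def place_image(name, cols, rows, x, y, mask, out=sys.stdout):
    """Hypothetical "place without scrolling" sequence for use case 3.
    `mask` is a rows x cols grid of booleans: True = show this cell's fragment,
    False = leave the cell alone."""
    bitmap = "/".join("".join("1" if shown else "0" for shown in row)
                      for row in mask)
    out.write("\033]7778;name=%s;cols=%d;rows=%d;x=%d;y=%d;mask=%s\033\\"
              % (name, cols, rows, x, y, bitmap))
    out.flush()

# E.g. a 4x3 thumbnail at cell (10, 5), leaving the bottom-right cell uncovered
# because a popup dialog's corner is there:
place_image("com.example.thumb", 4, 3, 10, 5,
            [[True] * 4, [True] * 4, [True, True, True, False]])
```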
Placement, scaling
How to position the image within its designated area? This should be controlled by options along these lines (a sketch of the corresponding geometry follows after the lists below).
- Resize to fit (most likely leaving a gap either at top and bottom, or at the two sides) [default]
- Resize to fill (most likely chopping off either the top and bottom, or the two sides)
- Halfway between "resize to fit" and "resize to fill" – Does it make sense? Probably not.
- Stretch to fill (altering the aspect ratio)
Also:
- Disallow enlarging (but still allow shrinking)
- Disallow shrinking (but still allow enlarging)
- (The combination of these two would result in 1:1 size!)
Also:
- Align to 9 different positions: top-start, top-center, top-end, middle-start, middle-center [default], middle-end, bottom-start, bottom-center, bottom-end.
Also:
- Maybe allow repeated (tiled) layout – Probably not.
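The geometry behind these options is straightforward; a minimal sketch, assuming "fit" as the default and the no-enlarge/no-shrink combination yielding 1:1:

```python
def scaled_size(img_w, img_h, area_w, area_h,
                mode="fit", allow_enlarge=True, allow_shrink=True):
    """Displayed image size in pixels within an area of area_w x area_h pixels
    (the designated cells times the current cell size)."""
    if mode == "stretch":
        return area_w, area_h                    # aspect ratio deliberately broken
    ratios = (area_w / img_w, area_h / img_h)
    scale = min(ratios) if mode == "fit" else max(ratios)   # "fill" = max
    if not allow_enlarge:
        scale = min(scale, 1.0)
    if not allow_shrink:
        scale = max(scale, 1.0)                  # both disallowed => scale == 1.0
    return round(img_w * scale), round(img_h * scale)

def align_offset(scaled, area, where):
    """Offset along one axis for "start" / "center" / "end" alignment."""
    return {"start": 0, "center": (area - scaled) // 2, "end": area - scaled}[where]
```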
Image formats
The protocol should state that every popular image format (png, jpg, svg etc.) that is supported by the underlying system is accepted.
In addition to the standard formats, the idea has been raised to come up with a simple format that is super-duper easy to generate from shell scripts, without introducing dependency on any image library. I do endorse this idea. (I'm wondering if there's any pseudo-standard for this.) I'm thinking along the lines of spelling out each pixel's color in hex (3/4/6/8 characters), separating or terminating values within a row with a comma, and separating or terminating rows with a semicolon. It's yet to be designed exactly. Also, this needs to come along with a "trivial" pass-thru content-transfer-encoding.
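Just to illustrate how trivially such a format could be produced from a script, assuming the comma/semicolon scheme sketched above (which, again, is explicitly not designed yet):

```python
def to_trivial_format(rows):
    """rows: a list of pixel rows, each a list of hex color strings such as
    'f00' or 'ffffff80' (3/4/6/8 hex digits).  Values within a row are joined
    with commas, rows with semicolons, per the sketch above."""
    return ";".join(",".join(row) for row in rows)

# A 2x2 image: red, green / blue, semi-transparent white
print(to_trivial_format([["f00", "0f0"], ["00f", "ffffff80"]]))
# -> f00,0f0;00f,ffffff80
```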
Attributes
Bold/bright (SGR 1), faint (SGR 2), italic (SGR 3), reverse (SGR 7) etc. should be ignored for cells that contain an image. The colors should not be tampered with, and bold or italic don't make sense.
Underline (SGR 4), strikethrough (SGR 9), overline (SGR 53) might make some sense for 1-line images containing some symbol, e.g. a replacement for something missing from Unicode. I'd leave it up for implementations to decide.
Blink (SGR 5) and conceal (SGR 8) should be respected if the terminal supports them for text.
The foreground color might be used when the picture is highlighted with the mouse. It's also needed for underline, strikethrough, overline.
No image mangling
I'm pretty sure that sooner or later we'll get requests for new image mangling attributes, e.g. to mirror, rotate, blur, dim, change to black and white etc.
The ones that by design cannot be done by the client application, e.g. scaling, aligning, have to be part of the good terminal protocol.
The rest has to be the task of the client application. We must consistently reject requests to add anything that a client app can do itself. Implementing any of them opens up an endless can of worms: there's nowhere to draw the line, and terminals would also struggle to catch up with new ones. We're not designing a GIMP or ImageMagick built into terminals. The app needs to give us the picture it wants us to display.
RTL, BiDi
A good image protocol must be RTL/BiDi-friendly. This means that in RTL context, the mirrored overall layout should be presented automatically, without the application having to have a different business logic. The picture itself obviously must not be mirrored.
As per the BiDi document, the emulation (model) layer needs to operate without knowing anything about the presentation (view) layer. We've seen that the image escape sequences, at least in some cases, will automatically place lower numbered columns of the picture in lower numbered columns of the emulation buffer, and higher numbered columns of the picture in higher numbered columns of the emulation buffer. With BiDi, columns of the emulation buffer might be displayed from right to left.
The only way this can result in the original picture in RTL context, without breakage, is if columns of the picture are also counted from the right. So the behavior should be:
Each character cell that contains an image fragment is, as per Unicode's recommendation, replaced by U+FFFC "object replacement character" for the purpose of running the BiDi algorithm. The logical cell's visual location is determined accordingly. If such a cell ends up having a resolved RTL directionality, the "column" within the image (e.g. 25 in an earlier example) has to be interpreted from the right edge of the image.
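In code, the required mirroring is a one-liner (a sketch; `resolved_rtl` stands for whatever the BiDi algorithm decided for that cell):

```python
def effective_column(x, grid_w, resolved_rtl):
    """Interpret the fragment's column from the right edge in RTL context, so
    that the picture itself is not mirrored when the cells are laid out
    right-to-left.  E.g. x=25 in a 30-cell-wide placement maps to column 4."""
    return grid_w - 1 - x if resolved_rtl else x
```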
For the 9 possible alignment positions it should be decided whether to go with "top-left", "top-right" etc. (i.e. always "left" and "right", independent of the directionality), or "top-start", "top-end" etc. (i.e. auto-swapping left/right for RTL). I'm pretty confident that "top-start" etc. are the better choice. Or we could require supporting both variants in parallel.
Ncurses compatibility
Even before ncurses adds support for such an image protocol, developers of ncurses apps might want to find some workaround and offer this feature to their users.
Ncurses paints the screen only at times controlled by the developer. It's possible to manually (i.e. without the library's help) (re)display the images afterwards, every single time. The only drawback is a possible flicker. (Again, assuming that we have a separate upload and display phase, and named resources, so that we don't have to retransfer the image itself every time something changes on the screen, which would be another giant drawback. This is yet another reason for separating these two steps.)
I think the following could be a nice way to stop the flicker. Introduce a new DEC private mode, disabled by default, in which printing (or repeating via the REP sequence) the U+FFFC "object replacement character" does not place that character in the buffer; instead, it leaves the character or picture fragment underneath unchanged (but otherwise moves the cursor as if a letter had been printed). Then enable this mode from the app, and fill up ncurses's canvas with this character exactly where you'll overpaint with an image.
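A sketch of how an app could use such a mode; the mode number is invented, since no such mode exists yet:

```python
import sys

OBJ = "\ufffc"     # U+FFFC OBJECT REPLACEMENT CHARACTER
MODE = 7779        # invented DEC private mode number, purely illustrative

def enable_fffc_passthrough(out=sys.stdout):
    """While enabled, printing U+FFFC would leave the cell's previous contents
    (e.g. an image fragment placed earlier) untouched, only moving the cursor."""
    out.write("\033[?%dh" % MODE)

def disable_fffc_passthrough(out=sys.stdout):
    out.write("\033[?%dl" % MODE)

# Usage sketch: enable the mode once, place the image once (upload + display),
# and make ncurses believe the covered cells contain OBJ characters.  Every
# subsequent ncurses refresh then repaints those cells with U+FFFC, which the
# terminal leaves alone, so the image survives without being retransferred and
# without flicker.
```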
Rewrapping
This is probably the only minor weak point of this proposal.
Due to how certain terminals rewrap their lines, the character based approach easily results in the image falling apart when the terminal window is narrowed, as every character row folds into two.
This is a compromise we can easily live with. After all, it's the same with any tabular data or pseudographics that you print now in your terminal (e.g. output of SQL queries, or a QR code printed using block characters). They also fall apart the same way when you narrow the terminal. So there's no regression to speak of.
The desired behavior of rewrapping on resize isn't specified anywhere. Terminals might even come up with improved algorithms, e.g. skipping the rewrapping of lines that contain an image.
More on 1:1 size
Let me add another chapter for this, because I know many people are concerned about it; in fact, this concern misleads many into designing an image protocol that revolves around 1:1 size, causing much bigger fundamental problems elsewhere.
Please remember: the 1:1 size is a myth. Maybe you think you need it, but you don't. You just don't.
But I do want 1:1 size
There are multiple ways you can get it.
There are two ways an app can try to figure out the character size in pixels. See e.g. VTE #782576 for details. Of course, you won't always get a response, in particular because a correct response often does not exist, but sometimes you do get a value.
Your app can try these methods, and use the result if available.
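For instance, one of these methods is the TIOCGWINSZ ioctl with its ws_xpixel/ws_ypixel fields (also mentioned below); a minimal best-effort query:

```python
import fcntl, struct, sys, termios

def cell_size_px(fd=None):
    """Best-effort character cell size in pixels via TIOCGWINSZ; returns None
    when the terminal reports 0 for ws_xpixel/ws_ypixel (as many do)."""
    if fd is None:
        fd = sys.stdout.fileno()
    # struct winsize: ws_row, ws_col, ws_xpixel, ws_ypixel (unsigned shorts)
    rows, cols, xpix, ypix = struct.unpack(
        "HHHH", fcntl.ioctl(fd, termios.TIOCGWINSZ, b"\0" * 8))
    if rows and cols and xpix and ypix:
        return xpix // cols, ypix // rows   # (cell width, cell height)
    return None

print(cell_size_px())
```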
But these two methods aren't reliable
Work on improving them. (I mean the synchronous one, of course.)
Work on ws_[xy]pixel getting implemented in as many graphical terminals as possible. Work on convincing terminals that the value SHOULD NOT include the padding. In fact, we could demand these as part of the good image protocol.
Work on making sure that a change in these values is forwarded across any channel that forwards a window size change. Work on making sure that SIGWINCH is always triggered when these values change. Work on standardizing it in POSIX. Etc...
Or work on another even better brand new solution, orthogonal to the good image protocol. One that maybe reports all the sizes a multi-headed terminal has. Or whatever.
But I still don't like that
You can modify the application to accept a command line flag or an environment variable as a (somewhat unreliable) hint for the character size.
But it's not good enough for me either
The good image protocol offers a mode where the image is displayed at its original 1:1 size. (Its viewport, however, isn't determined by the image size.) All you need to do is make sure the application allocates a big enough character region for displaying it.
But I want even better
Any terminal emulator can implement a behavior where hovering over the image, or Ctrl+clicking, right-clicking etc. opens it in a new window at 1:1 size. Absolutely no support from any image protocol is required to do this. Ask your favorite terminal to implement it, or implement it yourself.
But I'm still not happy
There's a great world outside terminals. There are great graphical environments, great picture viewer apps. Use them whenever they are better suited for your task than the terminal!
But I want it as natively as e.g. on the web, without drawbacks or compromises
Do you want native 1:1 size, and the designated area to be determined based on the picture size and the font size, without any of the aforementioned tiny compromises? Whether you understand the reasons or not, you need to accept that it's impossible to have that without breaking the existing great features and great power of terminals (e.g. having tmux) – some of which you may not need, but millions of other users certainly do.
You can still get this behavior in some terminals, or go ahead and implement it in your own. But you cannot expect any such solution to become widely accepted by terminals, because, as shown above, it would necessarily be incompatible with existing great features, e.g. with the ability to use tmux.
The good image protocol doesn't force a concept onto a component that does not necessarily have it. The good image protocol leaves the decision to the application, which can make a good decision about what to do if the concept of pixel size doesn't exist, rather than forcing the terminal to do something arbitrary and meaningless. And, by the way, such a good image protocol is much easier to formalize in a specification and much easier for everyone to implement than one that would introduce pixels as a new unit, and will thus lead to much quicker adoption of the feature.
A good image protocol gives you the best it can, in the cleanest possible way, and without breaking what we already have. And this "best", in fact, includes everything you might need, even the 1:1 pixel size and the "natural" designated area; all it takes is one extra step from the emitter application.