It would be useful to standardize an escape sequence for displaying images inline in the terminal output. Prior art that I know of includes iterm2 (https://www.iterm2.com/documentation-images.html). There is the sixel format but that is long obsolete. DomTerm has a more general escape sequence where you can send arbitrary (sanitized) HTML: to display an image, just send an <img ..> element. Any others? For interoperability we should focus on png, jpeg, and gif (including animated) images with the data encoded in base64.
Basic support could be restricted to images starting at the left column, with no further output to the right of the image. (I.e. only block-level images. Inline images could be a less-portable extension.) We can also make it unspecified how many lines are taken by the image: Some terminals may create an image as a single logical line, with a single extra-large character cell. Other terminals may allocate as many lines and columns as needed for the image to fit. Both should be allowed when using default options. However, using an extra-large character cell is preferable, as it does not require padding to an integer number of lines.
The portable way to erase or replace a previously-drawn image would be to save-cursor before writing the image, and then restore-cursor before replacing it. This may need some special care (hacks?) in case the image caused scrolling.
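A minimal sketch of the save/restore idea, assuming the widely implemented DECSC (ESC 7) and DECRC (ESC 8) controls. The `image_sequence` argument is a placeholder for whatever image escape sequence ends up being standardized, and the function names are made up for illustration. It does not address the scrolling problem mentioned above.

```python
ESC = "\x1b"
SAVE_CURSOR = ESC + "7"     # DECSC: save cursor position
RESTORE_CURSOR = ESC + "8"  # DECRC: restore cursor position


def draw_replaceable(image_sequence: str) -> str:
    """Save the cursor, then emit an already-encoded image sequence."""
    return SAVE_CURSOR + image_sequence


def replace_previous(image_sequence: str) -> str:
    """Jump back to the saved position and draw a new image there.

    Assumption: the first image did not scroll the screen; if it did,
    the saved position no longer points at the image.
    """
    return RESTORE_CURSOR + image_sequence
```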
Is it important to use an escape sequence that will safely fall back to plain text on terminals that don't support images? (After all - you probably wouldn't want to send a large image to a terminal that can't handle it.) If so, we need two escape sequences: An initial OSC sequence, followed by optional plain text, followed by a final OSC sequence. The plain text is used on terminals that don't understand the image. It can also be used by screen readers, and as hover text. Some terminals may allow user switching between the text and the image.
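To make the two-sequence framing concrete, here is a rough sketch. The OSC numbers 9999 and 9998 and the field layout are invented purely for illustration; nothing here is a proposed or existing assignment. A terminal that ignores unknown OSC sequences would show only the fallback text.

```python
import base64

OSC = "\x1b]"   # Operating System Command introducer
ST = "\x1b\\"   # String Terminator

IMAGE_BEGIN = 9999  # hypothetical OSC number for "image starts"
IMAGE_END = 9998    # hypothetical OSC number for "image ends"


def image_with_fallback(data: bytes, mime: str, fallback: str) -> str:
    """Frame an image as: begin-OSC, plain fallback text, end-OSC."""
    payload = base64.b64encode(data).decode("ascii")
    begin = f"{OSC}{IMAGE_BEGIN};{mime};{payload}{ST}"
    end = f"{OSC}{IMAGE_END}{ST}"
    # A supporting terminal would hide (or reuse as hover text) everything
    # between the two sequences; other terminals just print the fallback.
    return begin + fallback + end
```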
The format should be extensible. The iterm2 sequence supports key=value pairs, which is somewhat extensible, but the format of values is restricted. It would be better to use JSON for options.
We could standardize on the iterm2 1337 sequence, but it has some problems. A minor awkwardness is that inline=0 is the default. Another is that extensibility is limited because of the lack of quoting and escaping. A third is the lack of robustness (fallback) on non-supporting terminals.
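For reference, a sketch of how an application might emit the iterm2 1337 sequence, based on my reading of the iterm2 documentation linked above (`File=` followed by `;`-separated key=value arguments, a colon, the base64 data, and BEL). The function name is made up; note that `inline=1` has to be given explicitly and that the only way to carry an arbitrary string (the file name) is to base64-encode it.

```python
import base64


def iterm2_inline_image(data: bytes, name: str = "") -> str:
    """Build an iterm2-style OSC 1337 inline-image sequence."""
    args = []
    if name:
        # The name argument is itself base64, since values cannot be quoted.
        encoded = base64.b64encode(name.encode("utf-8")).decode("ascii")
        args.append("name=" + encoded)
    args.append(f"size={len(data)}")
    args.append("inline=1")  # the default, inline=0, means "download"
    payload = base64.b64encode(data).decode("ascii")
    return "\x1b]1337;File=" + ";".join(args) + ":" + payload + "\x07"
```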
"The file formats being used should not be bound by the specifications."
The specification should not limit the possible file formats, and should use MIME. But I think it is reasonable in this day and age to assume that "basic support" for the feature should handle at least png, gif, and jpeg. A terminal can always just display the plain-text if the format isn't supported.
Some terminals might support TIFF, or SVG "images", or even general HTML, but I wouldn't consider that "basic".
Do you have any feedback or response to any of my actual suggestions? I have given more thought to this than you know, and certainly more than covered by this initial posting. This was clearly not a complete specification, and is focused on minimal least-common-denominator functionality (for maximizing browser buy-in) rather than maximum flexibility and functionality.
I apologize for forgetting about Kitty and not mentioning it as prior art. But clearly the Kitty specification goes far beyond portable least-common-denominator functionality. Both "Should allow specifying graphics to be drawn at individual pixel positions" and "The graphics should integrate with the text, in particular it should be possible to draw graphics below as well as above the text, with alpha blending" are more advanced than needed for basic functionality, may be difficult for applications to make good use of, and may be difficult to implement on some platforms. I do agree that "The graphics should also scroll with the text, automatically" is essential basic functionality.
As mentioned before, I'm not thrilled by Kitty's overly-terse single-letter keys and magic values. I'm ok with it if other terminal emulators implement it, but from previous discussions there seems to be little enthusiasm for Kitty's option style. I think JSON is more extensible and flexible.
That is a comprehensive, well thought out, feedback based, performant, portable, battle tested implementation of graphics (as opposed to just images) in terminals. It should serve as the starting point for any specification.
I am not going to waste my time responding to your half baked proposal just because you "may have thought about it more than I know" but have not actually bothered to post those thoughts.
JSON is a very common and flexible format for data interchange. It is very easy to parse, and there are parsers for almost every programming language: the domterm command uses json-c.
An application may want to pass various kinds of options and meta-data to a terminal. Some options may be strings with arbitrary characters (in which case you might want quoting and escape sequences). Some options may be lists or more complicated structures (for example setting an application menu). Using JSON makes a lot of sense.
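As a small illustration of that point, the sketch below serializes options containing an awkward string and a list. The option names (`mime`, `hover`, `menu`) are invented for the example. One relevant property of JSON is that control characters inside strings are always escaped, so the encoded text cannot contain a raw ESC or BEL that would terminate an OSC sequence early.

```python
import json


def encode_options(options: dict) -> str:
    """Serialize options compactly; control characters come out as \\uXXXX."""
    return json.dumps(options, separators=(",", ":"))


# Hypothetical option names, chosen only for illustration.
example = {
    "mime": "image/png",
    "hover": 'Plot of "f(x)"; x = 0..1',
    "menu": ["Save", "Copy"],
}
encoded = encode_options(example)
```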
I'm afraid I disagree. JSON is just pointless overhead. If you need to pass arbitrary strings, pass them as base64 encoded data directly, as the graphics protocol I linked to does.
Base64 can work, though it is mostly used for binary data, and for strings you'd also have to specify the string encoding (presumably UTF-8). Base64 has some disadvantages when used for strings: It is not human-readable; for a program to output it requires an extra encoding step; it often uses (slightly) more space than quoted strings; it does not deal with more complex data types (lists or structures).
Base64 also forces an awkward either-or: A field's values are either restricted to a simple string (word) with a limited character set, or all values have to be base64. This is somewhat awkward. Though one idea to allow flexibility: a parameter value can either be a simple "word" (no special characters), or '=' followed by a base64-encoded value. (If a value is specified as always base64, then the '=' prefix is not needed, of course.)
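The word-or-'='-prefix convention could look roughly like this. The exact character set allowed in a "word" is my assumption here (letters, digits, `_`, `.`, `-`), as is the choice of UTF-8. Because a word can never start with '=', the prefix is unambiguous even though base64 padding also uses '='.

```python
import base64
import re

# Assumed "simple word" character set; a real spec would have to pin this down.
WORD = re.compile(r"[A-Za-z0-9_.-]+")


def encode_value(text: str) -> str:
    """Send simple words as-is; anything else as '=' plus base64(UTF-8)."""
    if WORD.fullmatch(text):
        return text
    return "=" + base64.b64encode(text.encode("utf-8")).decode("ascii")


def decode_value(field: str) -> str:
    """Inverse of encode_value."""
    if field.startswith("="):
        return base64.b64decode(field[1:]).decode("utf-8")
    return field
```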
I prefer JSON, but I have no strong objection to base64 (especially if using the convention in the previous paragraph), if that is the general preference.
"because you chose to implement a terminal in a browser, of all things."
DomTerm is not "implemented in a browser", strictly speaking. More accurately, it is (mostly) implemented in JavaScript, and uses the DOM/HTML/JS/CSS platform (a browser engine) as a GUI toolkit. Usually, you would use the browser engine of Electron or Qt, not use a web browser per se (that works, but is not as nice as Electron or Qt). The domterm application is written in C and uses WebSockets to communicate with the browser engine.
It works very well. There is a lot of functionality. Vttest compatibility is excellent (better than xterm.js or kitty). Raw speed of rendering cat large-file is ok but not great, but it's pretty fast for what you need to do. The xterm.js project shows that a browser-based terminal can be quite fast, and once the WebGL renderer is enabled will be one of the fastest terminals around. (DomTerm does not use xterm.js, except for an experimental build option, because some useful DomTerm features are difficult to implement using xterm.js, but that may change.)
A simple convention is enough to allow both enums and b64 encoded strings. Although in the actual graphics protocol, only one arbitrary string is required, a filename, the rest are all enums or numbers, so even that is not required. Anyway, the format of the escape code is a relatively minor issue, let's not get hung up on it. I suggest we first nail down the larger issues of what you want from a graphics spec for terminals first. What graphics formats to support, what drawing primitives to support, performance considerations for local programs and networks, etc.
As for using a browser engine as a GUI toolkit, seems like way overkill to me. There is no way it can ever compete with a rendering engine specialised for rendering terminals. And if you are going to be using WebGL anyway (which means a rendering engine specialised for terminals), what's the point of the rest of the browser stack? Regardless, feel free to use whatever tech stack you like for your terminal, my point was the format for escape codes should not be dictated by a specific implementation. JSON is unnecessary for the actual protocol, which as I said requires one arbitrary string otherwise numbers, bools and enums. Your terminal appears to pass far more data around via escape codes, and for that no doubt JSON makes sense. It does not however make sense for this.
"I suggest we first nail down the larger issues of what you want from a graphics spec for terminals first. What graphics formats to support, what drawing primitives to support, performance considerations for local programs and networks, etc."
That was what I was trying to start with my initial message ... Maybe you want to take another look at it?
"As for using a browser engine as a GUI tookit, seems like way overkill to me. There is no way it can ever compete with a rendering engine specialised for rendering terminals."
Feature-wise a browser engine is clear winner, IMO. (For example DomTerm uses the GoldenLayout library to implement very nice draggable tiles and tabs.) Performance-wise, it is quite zippy for what I want to do, at least on my laptop.
"JSON is unnecessary for the actual protocol, which as I said requires one arbitrary string otherwise numbers, bools and enums."
For basic functionality, maybe. For plausible extensions, you might certainly want other string-valued options. One example that comes to mind: Hover text.
That is an example of why we should have an extensible protocol, where one can specify other options or meta-data, which may not be supported by all terminals. That is where JSON works really well, though base64 can work as long as one can add new keywords.
Umm pretty much any graphics toolkit from the last 40 years can do draggable tiles and tabs. There really is no need to use a browser for that. But anyway, as I said, feel free to use whatever floats your boat, I don't care, as long as your choices don't dictate escape code formats. I'm sure using a browser engine has some advantages that made you choose it, I personally don't see that, but I am willing to concede I don't see everything :)
Your initial message mentions support for "simple" images. Why would we create such a half-assed spec? Once somebody commits to implementing support for drawing images in a terminal, asking them to also allow placing those images at controllable locations is hardly an undue burden.
The question of fallbacks and erasing images and detecting support and a few other considerations you did not mention in your post such as network vs local transport and image formats have all been comprehensively answered in the kitty protocol. I am saying that should form a starting point for graphics support. If you or anyone else objects to parts of that protocol, I will be happy to debate it.
And the kitty escape code format can perfectly well handle arbitrary data, in its base64 portion, just use the enum keys in the first part to control how that data is interpreted. But again, I really don't want to debate escape code formats. Let's please table that until we have consensus on the larger issues.
"Your initial message mentions support for "simple" images. Why would we create such a half-assed spec? Once somebody commits to implementing support for drawing images in a terminal, asking them to also allow placing those images at controllable locations is hardly an undue burden."
What are the primary use-cases? I see the primary use-case as a REPL: The user runs a command, which prints some output, which may be graphical. A simple image, with no mixing of text or graphics on the same line, covers much of that. It is nice if the amount of vertical space used is just the number of pixel lines needed, without rounding up to a whole number of lines, which means adjusting the character grid spacing. However, this should not be required.
The next level of functionality and complexity is mixing text and graphics on the same line. An example use is a file listing with images. That is more complicated, and requires the application to at least know the relationship between image and character sizes and positions. This is more difficult to use for applications, and may be more difficult to implement for terminals. It is useful, but not quite as useful.
The next level is more general random-access positioning, with overlapping and blending of text and characters. That is still more complicated both to use and to implement. This is useful too, but I think even less of a priority.
Also keep in mind how screen resize is handled: The first level of the hierarchy handles reflow naturally - no image-specific reflow, but the vertical position is adjusted as needed to fit. The second level can also handle reflow, but it's a bit more complicated to specify and implement. The third level probably doesn't handle reflow at all - the application that drew the output must repaint.
I think it makes sense for a terminal to implement just lower levels of this hierarchy, and for a specification to allow that. For example DomTerm currently only supports the first two levels (and the second with some limitations). I probably wouldn't implement the third level until other terminal emulators do and I see applications that make use of it.
"Umm pretty much any graphics toolkit from the last 40 years can do draggable tiles and tabs."
Well, I don't see many applications that do it in a general way - Kitty doesn't seem to. Emacs has tiles, but they can't be dragged - only resized. Common browsers only have tabs. Except for applications using browser technology (such as Atom) the only one I'm familiar with is the Netbeans IDE (and to a less general extent Eclipse, IIRC). I looked at how to do it using Qt, and it doesn't seem to handle it directly (though I think it could be done with some effort).
My primary use case is replacing https://github.com/kovidgoyal/iv with a pure terminal application (i.e. an image browsing application). That means it needs to: 1) position images arbitrarily 2) draw text over images 3) have good performance so it can display animated images 4) allow control of the lifetime of image data in the terminal to minimize resource consumption while optimizing efficiency.
I'm afraid any spec that does not allow even such a basic application, and your proposal is one such, is not going to pass muster with me.
Umm if by tiles you mean multiple sub-windows inside a top level window, kitty most definitely has them, along with multiple programmable layouts for them. They are not draggable, but that is because the idea of using a mouse to arrange tiles horrifies me, not because it is hard to implement. And this is all without even using a graphics toolkit, let alone a browser.
"My primary use case is replacing an [image browsing application] with a pure terminal application/"
Why? I think very few people would find that a compelling use-case. I can see two minor advantages: Remote browsing of images, and re-using the terminal window (thus reduced need for mouse or context switching). Still, almost everyone else would just use a graphical image browser.
Of course that sort of thing is relatively natural and easy to do with DomTerm. One way is to use a script like:
where start-my-image-server is a simple http server that displays the images and uses JavaScript to manage the UI. The domterm --tab browse URL command creates a new browser window (an iframe) in a new tab.
DomTerm also allows you to create an iframe in the current window, by just "printing" the appropriate iframe html.
And of course you don't have to use an iframe or http server at all: You can just "print" the image grid directly. DomTerm doesn't currently have commands to replace previously-written html output (except erase line/display), but I'm just waiting for a suitable application before adding that.
Perhaps using a browser to implement a terminal isn't so silly after all?
"the idea of using a mouse to arrange tiles horrifies me"
The idea of arranging tiles without a mouse horrifies me. (Having keyboard shortcuts is great, but a mouse is much more natural and discoverable to most of us.)
This is going nowhere. Feel free to propose whatever half-assed spec you like, I will feel free to ignore it. If any application I care about ever adds support for it, it will be trivial to support it in kitty on top of its much more capable graphics system.
I just have one question for you: "Why are you developing a terminal emulator?" If you are so fond of mice and graphical UIs and browsers, why not just use them?
There's a lot of prior art to study – for all of us, including myself, as I'm not familiar with them – before moving forward. You guys have mentioned Kitty and iTerm2, also Sixel vaguely which is implemented by Xterm and perhaps a few others. You have at least failed to mention Terminology.
And if I may call it so, the VTE 729204: Sixel discussion, which has stalled for now, is also some kind of prior art, at least pointing out some core issues you haven't mentioned. So is my comment at Austin Group 1151. I'll briefly summarize these here.
These emulators all seem to be going in their own individual ways. Which suggests that probably there's a fundamental issue that no one has figured out how to handle.
In terminal emulation, currently the only unit is the cell size. Every escape sequence moves by integer multiples of the cell width or height and positions things accordingly. The first call to make is whether we'd still obey this, or break out of this limitation.
(1) If we still obey this constraint, we cannot offer a means to display an image at its natural size. All that an image protocol can offer is something like "display this image in these 10x15 character cells, using this-and-that scaling-stretching-cropping-tiling-scrolling-aligning method". We need to study if keeping this constraint is acceptable, or whether apps outputting sharp lines (e.g. Gnuplot) would be really unhappy about it.
(2) If we break out from this constraint and want to be able to show an image at its natural size, we have to introduce the concept of "pixel", whose relation (ratio) to the "cell" is unknown, and in some cases, not definable. This, in turn, breaks current principles (invariants) of terminal emulation. For example, you output a few lines of text, then a certain image, then move the cursor up by 10 rows, then print something. Will it print into the picture, or in the line just above it, or one even further up? It will depend on the font size! Currently no behavior of the terminal emulator depends on the font size, but this one would. This not only results in unexpected user experience (e.g. if someone temporarily enlarges the font, then prints this image, then shrinks it back then the result will differ from as if he hadn't tampered with the font size), but is outright problematic when the concept of font size doesn't even exist.
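A small numeric illustration of the font-size dependence. It assumes a terminal that pads a natural-size image up to a whole number of rows; the 120px image height, the 5 rows of preceding text and the cell heights are arbitrary numbers picked for the example.

```python
import math


def landing_row(text_rows: int, image_height_px: int,
                cell_height_px: int, rows_up: int) -> int:
    """Zero-origin row where the cursor ends up after: text, image, cursor-up."""
    image_rows = math.ceil(image_height_px / cell_height_px)
    cursor_row = text_rows + image_rows  # first row below the image
    return max(0, cursor_row - rows_up)


# Same byte stream, three font sizes (cell heights 12, 16 and 24 px):
small = landing_row(5, 120, 12, 10)   # image spans rows 5..14
medium = landing_row(5, 120, 16, 10)  # image spans rows 5..12
large = landing_row(5, 120, 24, 10)   # image spans rows 5..9
```

With the small font the cursor lands on the first row of the image, with the medium font it lands in the text two rows above the image, and with the large font it lands on the very first row of text.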
Yes, there are circumstances when there's no such concept, including the libvterm headless terminal emulation library, screen or tmux in detached mode, screen or tmux when attached to multiple terminal emulators at once (which could have different font sizes), konsole with split view and different font size in each, gnome-terminal if we ever get to implement VTE 103770: Model/view split etc. Such a protocol would probably be unextendable in these directions, and I'm really reluctant to design anything that a headless emulator or screen/tmux or a future VTE cannot theoretically support. (And this dilemma is where current VTE progress has stalled, at least for now.)
So, (2) is technically a heavily problematic approach, even though chances are that some use cases would require it. (1) is a simpler game, although still has some issues.
Modern word processors, browsers etc. don't suffer from this problem because there the DOM is the source and all modifications are defined in the DOM; they don't define operations like "move the cursor upwards by 10 visual lines". This kind of cursor movement is so essential in terminal emulation that we have to live with it, and it seems to me that it's pretty much mutually exclusive to approach (2).
If it was my call, I'd say that we should forget (2) and go for (1). We need to make a sacrifice, and being able to display images at 1:1 is going to be this sacrifice, in order to retain the character cell based model without having to have any concept of font size. To mitigate the problem, terminal emulators are free to implement a popup (tooltip) or similar widget that shows the image at 1:1 size (temporarily obscuring the rest of the contents) on mouseover.
Even though I'm a heavy user of terminal emulators and a developer of one of them, I myself don't see it as a good idea to shoehorn everything but the kitchen sink into terminal emulators. Sticking to its very basics, the grid nature, is more important than adding new features. If some new feature proposal is an easy fit, so be it. If some is technically problematic, let's just leave it for tools that are better at them. If the task of displaying images is delegated to dedicated image viewers because it's problematic in terminal emulators, I don't see anything wrong with it.
That being said, (1) is an approach that I might see reasonable to further discuss and agree on some standard. However, there are plenty of issues to consider and agree on the behavior, including, but not limited to:
What if the image overwrites existing text
What if the image intersects with another existing image
What if text is printed over the image later on
What if other escape sequences operate on the image in all sorts of weird ways (e.g. ICH/DCH to horizontally scroll the right part of one character row of the image; SU/SD with a scroll region to vertically scroll a part of the image; etc.)
What if the terminal emulator rewraps its contents on resize, what should happen to the image then
One mode, which I will call "big-cell image", makes it easy for applications by treating an image as a pseudo-character: It takes one cell in the logical character grid, but visually it takes as much space as needed. There could be other modes that allow finer control: In another possible mode ("cell-overlay") the cell is zero-width, but an application can specify relative offset and size, including overlapping (blending).
In this message I'll focus on "big-cell image" mode. In the common case (of no other text or image on the same line) the application does not need to know about pixels or alignment. The escape sequence may specify scaling or position adjustment, but by default the image is displayed at "natural size" (using "CSS pixels", which need not be hardware pixels). The application emits the escape sequence for the image (on a fresh line) followed by cr-nl. Cursor addressing works just as if there was a space in the cell. Deleting, erasing, or overwriting the cell with a character (or another image) deletes the image. The image automatically scrolls. If the zoom factor of the window changes, the image can be scaled without problem.
The unusual aspect of this is that you will no longer be able to fit N rows in an N-row window: The "home" position may be scrolled out if the height of an image is more than the text height. Hence this mode is most useful for the normal (not alternate) buffer, in a REPL/shell, with scrolling, though it is not prohibited when using the alternate buffer.
Now look at the more complex case when text and images are mixed on the same line. There will be both row/column coordinates, and pixel x/y coordinates. The start x/y coordinate of a cell is that of the top-left corner of the cell. Printing a character adjusts x by the character width - and may cause line-wrapping. When an image whose scaled size is w*h is printed, an automatic (soft) line-break is inserted before the image if the current x>0 and x+w is greater than the window inner width. The height of a line is the maximum of the heights of the text character and image cells on that line after line-breaking. The cells are (by default) aligned by their top edges.
It is preferable that an application can print images without having to know the size in pixels of a text cell. The above line-breaking rules do that, but there is one detail: Any line breaks inserted after or just before an image "don't count" in terms of cursor addressing. For example, assume an application prints 100 characters, a 40-character-wide image, 10 characters, and another 40-character-wide image, starting with the top row of an 80-column window. Then we get a line break after character 80, and before the second image, so 3 lines visually. However, in terms of cursor addressing this would be only 2 lines. To erase the second image, move the cursor to line 2 column 32 (1-origin), or line 1 column 31 (zero-origin). The 31 skips past the 20 characters wrapped from the first line, the 1 column for the first image, then 10 columns for the next 10 characters, moving the cursor just before the second image.
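The worked example can be checked with a toy layout model. This is only a sketch of the rules as described: the `Layout` class and its field names are invented, a cell width of 10px is an arbitrary choice, and only the soft break inserted just before an image is treated as "not counting" (breaks in text that follows an image on the same logical line are not modeled).

```python
from dataclasses import dataclass, field


@dataclass
class Layout:
    width_px: int          # window inner width
    cell_w: int            # width of one text cell in pixels
    x: int = 0             # current pixel x position
    visual_line: int = 0   # lines as seen on screen
    row: int = 0           # line number used for cursor addressing
    col: int = 0           # column number used for cursor addressing
    images: list = field(default_factory=list)

    def text(self, count: int) -> None:
        for _ in range(count):
            if self.x + self.cell_w > self.width_px:
                # Ordinary text wrap: counts for cursor addressing.
                self.x, self.col = 0, 0
                self.visual_line += 1
                self.row += 1
            self.x += self.cell_w
            self.col += 1

    def image(self, w_px: int) -> None:
        if self.x > 0 and self.x + w_px > self.width_px:
            # Soft break before an image: visual only, row/col unchanged.
            self.x = 0
            self.visual_line += 1
        self.images.append((self.row, self.col))  # zero-origin cell of image
        self.x += w_px
        self.col += 1  # an image occupies exactly one logical cell


lay = Layout(width_px=800, cell_w=10)  # an 80-column window
lay.text(100)
lay.image(400)   # "40 characters wide"
lay.text(10)
lay.image(400)
```

Running this gives 3 visual lines, 2 addressable lines, and the second image at zero-origin line 1, column 31, matching the numbers in the example.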
I will discuss "cell-overlay" mode in a subsequent message.
"Cell-overlay" mode is similar in some ways to "big-cell" mode, but the application has a lot of flexibility in positioning the image, including overlay/blending with text. Like "big-cell" mode, the image is the "value" in a character cell, and when it comes to cursor addressing the cell is a single column position. This means you can use regular cursor addressing to get to an image cell, and you can erase/delete the image by delete-character/decrease-chaarcter or overwrite. While the cell uses one cell in the character grid, it takes no horizonal space, has zero height, does not cause the x-position to increment, and does not cause line-wrapping. Instead, the applications has to manually adjust the cursor position afterwards. The Escape sequence specifies how big the image is (in pixels or characters), how much its origin (top-left) is offset relative to the character-cell's origin, and its alpha (unless using an image format with alpha).
Using a dummy character-cell makes scrolling work, and in general the image "follows" the adjoining text. You can position images arbitrarily, including on top of each other, or blended with text (depending on alpha).
"Big-cell" mode is simple for the application, which does not require knowledge of character sizes, is best with the normal buffer, and is great for a REPL to "print an image. The image is not padded to a whole number of rows. "Cell-overlay" mode is more complex for an application, but allows arbitrary positioning and overlapping of images. The character grid is not adjusted: The application has to manually position both text and images. This makes sense for a "grpical" "curses-style" application using the alternate buffer.
Answering Egmont's questions:
What if the image overwrites existing text
Since the image is a single zero-width character cell, it can "overwrite" a single character. However, you can overlay text and characters by specifying an appropriate offset on the image and/or writing text after writing the image.
What if the image intersects with another existing image
Whichever image is later (in row/column order) "wins", but you can specify an alpha to combine with previous image or text.
What if other escape sequences operate on the image in all sorts of weird ways
If a character is inserted/removed before the image cell, then the image position will be shifted right/left, just like a normal character. This is because image offsets are relative to the image's pseudo-character-cell.
What if the terminal emulator rewraps its contents on resize, what should happen to the image then
The image's position is always relative to the image cell, using the offset specified in the escape sequence.
The two modes (big-cell/cell-overlay) I mentioned above are special cases of a more general mode. The image is controlled by parameter pairs using the following options. (In the following I use the syntax key=value. Using JSON it would be "key": "value".)
A length is a decimal number followed by an optional unit.
The unit can be px (pixel, the default) or c (character - line-height or column-width). (A pixel is not necessarily a hardware raster pixel, but rather is a unit that corresponds to a "CSS pixel".)
An slength is an optional sign followed by a length.
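A rough parser for the two definitions above. The `Length` tuple and function names are hypothetical, and I am assuming that "decimal number" allows an optional fractional part. The sign is kept separate from the value because, for the space option, a signed value is interpreted differently from an unsigned one.

```python
import re
from typing import NamedTuple


class Length(NamedTuple):
    value: float
    unit: str  # "px" (the default) or "c"
    sign: str  # "", "+" or "-"


_SLENGTH = re.compile(r"([+-]?)(\d+(?:\.\d+)?)(px|c)?")


def parse_slength(text: str) -> Length:
    """Parse an optional sign, a decimal number and an optional unit."""
    match = _SLENGTH.fullmatch(text)
    if match is None:
        raise ValueError(f"not a valid slength: {text!r}")
    sign, number, unit = match.groups()
    return Length(float(number), unit or "px", sign)


def parse_length(text: str) -> Length:
    """Like parse_slength, but a sign is not allowed."""
    length = parse_slength(text)
    if length.sign:
        raise ValueError(f"a length takes no sign: {text!r}")
    return length
```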
offset=_y-slength_/_x-slength_
Adjust image origin relative to starting cell (top-left) origin.
The default is 0px/0px.
size=_y-length_/_x-length_
Scale image to the specified size. If one of y-length or x-length is unspecified, scale proportionally. Default is the natural size.
maxsize=_y-length_/_x-length_
Same as size, but is the largest that will fit in a y-length and x-length rectangle while being ratio-preserving. Default for x-length is available width.
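The size and maxsize rules amount to a little arithmetic, sketched below with all values already converted to pixels. The function names are made up, and whether maxsize may scale an image up beyond its natural size is not stated above; this sketch assumes it may.

```python
from typing import Optional, Tuple


def scale_to_size(nat_h: float, nat_w: float,
                  h: Optional[float] = None,
                  w: Optional[float] = None) -> Tuple[float, float]:
    """size=: a missing dimension is scaled proportionally to the other."""
    if h is None and w is None:
        return nat_h, nat_w          # default: natural size
    if h is None:
        h = nat_h * w / nat_w
    elif w is None:
        w = nat_w * h / nat_h
    return h, w


def fit_within(nat_h: float, nat_w: float,
               max_h: float, max_w: float) -> Tuple[float, float]:
    """maxsize=: largest ratio-preserving size inside max_h x max_w."""
    scale = min(max_h / nat_h, max_w / nat_w)
    return nat_h * scale, nat_w * scale
```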
space=_y-slength_/_x-slength_
How much space to allocate for the image. This is the allocated size of the character cell: The cursor (top-left of next character or image) is incremented by x-slength. The height of the containing line is at least y-slength. The default is the calculated size from size or maxsize. If a sign is specified for y-slength or x-slength then that is added to the calculated size. The allocated space may be smaller or larger than the image size: For example to allocate 3px padding on all sides:
offset=+3px/+3px;space=+6px/+6px
If the allocated space is smaller than the image size, text or another image may be overlaid. "cell-overlay" mode corresponds to:
space=0px/0px
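The space rule for one dimension can be sketched as follows. The helper name `allocated` is hypothetical and, to keep it short, only pixel values are handled (no c unit). A signed value is added to the calculated image size, an unsigned value replaces it.

```python
from typing import Optional


def allocated(calculated_px: float, spec: Optional[str]) -> float:
    """Allocated cell size in one dimension for a space= component.

    spec is None when the option is absent, otherwise a non-empty string
    such as "+6px", "-4px" or "0px" (pixel values only in this sketch).
    """
    if spec is None:
        return calculated_px              # default: the calculated size
    number = spec[:-2] if spec.endswith("px") else spec
    value = float(number)                 # float() accepts a leading sign
    if spec[0] in "+-":
        return calculated_px + value      # signed: relative to image size
    return value                          # unsigned: absolute size
```

For a 100px by 200px image, the padding example gives an allocated cell of 106px by 206px with the image drawn 3px in from the top-left, while space=0px/0px allocates nothing.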
"Big-cell" mode (which is the default) corresponds to: