simple image display

The file formats being used should not be bound by the specifications.

"The file formats being used should not be bound by the specifications."

The specification should not limit the possible file formats, and should use MIME. But I think it is reasonable in this day-and-age to assume that "basic support" for the feature should handle at least png, gif, and jpeg. A terminal can always just display the plain-text if the format isn't supported.

Some terminals might support TIFF, or SVG "images", or even general HTML, but I wouldn't consider that "basic".

Please devote more thought to this, at least read: https://sw.kovidgoyal.net/kitty/graphics-protocol.html

"Please devote more thought to this"

Do you have any feedback or response to any of my actual suggestions? I have given more thought to this than you know, and certainly more than covered by this initial posting. This was clearly not a complete specification, and is focused on minimal least-common-denominator functionality (for maximizing browser buy-in) rather than maximum flexibility and functionality.

I apologize for forgetting about Kitty and not mentioning it as prior art. But clearly the Kitty specification goes far beyond portable least-common-denominator functionality. Both "Should allow specifying graphics to be drawn at individual pixel positions" and "The graphics should integrate with the text, in particular it should be possible to draw graphics below as well as above the text, with alpha blending" are more advanced than needed for basic functionality, may be difficult for applications to make good use of, and may be difficult to implement on some platforms. I do agree that "The graphics should also scroll with the text, automatically" is essential basic functionality.

As mentioned before, I'm not thrilled by Kitty's overly-terse single-letter keys and magic values. I'm ok with it if other terminal emulators implement it, but from previous discussions there seems to be little enthusiasm for Kitty's option style. I think JSON is more extensible and flexible.

I absolutely refuse to add a JSON parser to kitty just because you chose to implement a terminal in a browser, of all things.

As for a response to your actual suggestions, it is in https://sw.kovidgoyal.net/kitty/graphics-protocol.html

That is a comprehensive, well thought out, feedback based, performant, portable, battle tested implementation of graphics (as opposed to just images) in terminals. It should serve as the starting point for any specification.

I am not going to waste my time responding to your half baked proposal just because you "may have thought about it more than I know" but have not actually bothered to post those thoughts.

JSON is a very common and flexible format for data interchange. While very easy to parse, there are parsers for almost every programming language: The domterm command uses json-c.

An application may want to pass various kinds of options and meta-data to a terminal. Some options may be strings with arbitrary characters (in which case you might want quoting and escape sequences). Some options may be lists or more complicated structures (for example setting an application menu). Using JSON makes a lot of sense.

I'm afraid I disagree. JSON is just pointless overhead. If you need to pass arbitrary strings, pass them as base64 encoded data directly, as the graphics protocol I linked to does.

Base64 can work, though is mostly used for binary data, and for strings you'd also have to specify the string encoding (presumably Utf8). Base64 has some disadvantages when used for strings: It is not human-readable; for a program to output it requires an extra encoding step; it often uses (slightly) more space than quoted strings; it does not deal with more complex data types (lists or structures).
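To make the size and readability point concrete, here is a small sketch that encodes the same option string both ways and compares the byte counts. The helper name `encoded_sizes` and the sample string are illustrative only and are not part of any proposal in this thread.

```python
import base64
import json


def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (base64 length, JSON quoted-string length) in bytes for `text`."""
    raw = text.encode("utf-8")
    as_base64 = base64.b64encode(raw).decode("ascii")
    # ensure_ascii=False keeps non-ASCII characters as UTF-8 instead of \uXXXX escapes.
    as_json = json.dumps(text, ensure_ascii=False)
    return len(as_base64), len(as_json.encode("utf-8"))


print(encoded_sizes("Hover text"))  # (16, 12)
```

Base64 always expands the payload by roughly a third, while a quoted string adds two quote characters plus one byte per escaped character, so the comparison can go the other way for strings that need many escapes.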

Base64 also forces an awkward either-or: A field's values are either restricted to a simple string (word) with a limited character set, or all values have to be base64. This is somewhat awkward. Though one idea to allow flexibility: a parameter value can either be a simple "word" (no special characters), or '=' followed by a base64-encoded value. (If a value is specified as always base64, then the '=' prefix is not needed, of course.)
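A minimal sketch of the word-or-base64 convention described above. The exact character set allowed in a "simple word" is not specified in this thread; the set used here (ASCII letters, digits, `_`, `.`, `-`) is an assumption, as are the function names.

```python
import base64
import re

# Assumption: a "simple word" is a non-empty run of these ASCII characters.
# '=' is deliberately excluded so the prefix is unambiguous.
_WORD = re.compile(r"[A-Za-z0-9_.-]+")


def encode_value(text: str) -> str:
    """Emit simple words as-is; everything else as '=' plus base64 of the UTF-8 bytes."""
    if _WORD.fullmatch(text):
        return text
    return "=" + base64.b64encode(text.encode("utf-8")).decode("ascii")


def decode_value(raw: str) -> str:
    """Inverse of encode_value."""
    if raw.startswith("="):
        return base64.b64decode(raw[1:]).decode("utf-8")
    return raw


print(encode_value("png"))                       # png
print(decode_value(encode_value("hover text")))  # hover text
```

An enum-like value such as `png` stays human-readable, while a value with spaces or punctuation pays the base64 cost only when it has to.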

I prefer JSON, but I have no strong objection to base64 (especially if using the convention in the previous paragraph), if that is the general preference.

"because you chose to implement a terminal in a browser, of all things."

DomTerm is not "implemented in a browser", strictly speaking. More accurately, it is (mostly) implemented in JavaScript, and uses the DOM/HTML/JS/CSS platform (a browser engine) as a GUI toolkit. Usually, you would use the browser engine of Electron or Qt, not use a web browser per se (that works, but is not as nice as Electron or Qt). The domterm application is written in C and uses WebSockets to communicate with the browser engine.

It works very well. There is a lot of functionality. Vttest compatibility is excellent (better than xterm.js or kitty). Raw speed of rendering cat large-file is ok but not great, but it's pretty fast for what you need to do. The xterm.js project shows that a browser-based terminal can be quite fast, and once the WebGL renderer is enabled will be one of the fastest terminals around. (DomTerm does not use xterm.js, except for an experimental build option, because some useful DomTerm features are difficult to implement using xterm.js, but that may change.)

A simple convention is enough to allow both enums and b64 encoded strings. Although in the actual graphics protocol, only one arbitrary string is required, a filename, the rest are all enums or numbers, so even that is not required. Anyway, the format of the escape code is a relatively minor issue, let's not get hung up on it. I suggest we first nail down the larger issues of what you want from a graphics spec for terminals first. What graphics formats to support, what drawing primitives to support, performance considerations for local programs and networks, etc.

As for using a browser engine as a GUI toolkit, seems like way overkill to me. There is no way it can ever compete with a rendering engine specialised for rendering terminals. And if you are going to be using WebGL anyway (which means a rendering engine specialised for terminals), what's the point of the rest of the browser stack? Regardless, feel free to use whatever tech stack you like for your terminal, my point was the format for escape codes should not be dictated by a specific implementation. JSON is unnecessary for the actual protocol, which as I said requires one arbitrary string otherwise numbers, bools and enums. Your terminal appears to pass far more data around via escape codes, and for that no doubt JSON makes sense. It does not however make sense for this.

"I suggest we first nail down the larger issues of what you want from a graphics spec for terminals first. What graphics formats to support, what drawing primitives to support, performance considerations for local programs and networks, etc."

That was what I was trying to start with my initial message ... Maybe you want to take another look at it?

"As for using a browser engine as a GUI toolkit, seems like way overkill to me. There is no way it can ever compete with a rendering engine specialised for rendering terminals."

Feature-wise a browser engine is a clear winner, IMO. (For example DomTerm uses the GoldenLayout library to implement very nice draggable tiles and tabs.) Performance-wise, it is quite zippy for what I want to do, at least on my laptop.

"JSON is unnecessary for the actual protocol, which as I said requires one arbitrary string otherwise numbers, bools and enums."

For basic functionality, maybe. For plausible extensions, you might certainly want other string-valued options. One example that comes to mind: Hover text.

That is an example of why we should have an extensible protocol, where one can specify other options or meta-data, which may not be supported by all terminals. That is where JSON works really well, though base64 can work as long as one can add new keywords.

Umm pretty much any graphics toolkit from the last 40 years can do draggable tiles and tabs. There really is no need to use a browser for that. But anyway, as I said, feel free to use whatever floats your boat, I don't care, as long as your choices don't dictate escape code formats. I'm sure using a browser engine has some advantages that made you choose it, I personally don't see that, but I am willing to concede I don't see everything :)

Your initial message mentions support for "simple" images. Why would we create such a half-assed spec? Once somebody commits to implementing support for drawing images in a terminal, asking them to also allow placing those images at controllable locations is hardly an undue burden.

The question of fallbacks and erasing images and detecting support and a few other considerations you did not mention in your post such as network vs local transport and image formats have all been comprehensively answered in the kitty protocol. I am saying that should form a starting point for graphics support. If you or anyone else objects to parts of that protocol, I will be happy to debate it.

And the kitty escape code format can perfectly well handle arbitrary data, in its base64 portion, just use the enum keys in the first part to control how that data is interpreted. But again, I really don't want to debate escape code formats. Let's please table that until we have consensus on the larger issues.

"Your initial message mentions support for "simple" images. Why would we create such a half-assed spec? Once somebody commits to implementing support for drawing images in a terminal, asking them to also allow placing those images at controllable locations is hardly an undue burden."

What are the primary use-cases? I see the primary use-case as a REPL: The user runs a command, which prints some output, which may be graphical. A simple image, with no mixing of text or graphics on the same line, covers much of that. It is nice if the amount of vertical space used is just the number of pixel lines needed, without rounding up to a whole number of lines, which means adjusting the character grid spacing. However, this should not be required.

The next level of functionality and complexity is mixing text and graphics on the same line. An example use is a file listing with images. That is more complicated, and requires the application to at least know the relationship between image and character sizes and positions. This is more difficult to use for applications, and may be more difficult to implement for terminals. It is useful, but not quite as useful.

The next level is more general random-access positioning, with overlapping and blending of text and characters. That is still more complicated both to use and to implement. This is useful too, but I think even less of a priority.

Also keep in mind how screen resize is handled: The first level of the hierarchy handles reflow naturally - no image-specific reflow, but the vertical position is adjusted as needed to fit. The second level can also handle reflow, but it's a bit more complicated to specify and implement. The third level probably doesn't handle reflow at all - the application that drew the output must repaint.

I think it makes sense for a terminal to implement just lower levels of this hierarchy, and for a specification to allow that. For example DomTerm currently only supports the first two levels (and the second with some limitations). I probably wouldn't implement the third level until other terminal emulators do and I see applications that make use of it.

"Umm pretty much any graphics toolkit from the last 40 years can do draggable tiles and tabs."

Well, I don't see many applications that do it in a general way - Kitty doesn't seem to. Emacs has tiles, but they can't be dragged - only resized. Common browsers only have tabs. Except for applications using browser technology (such as Atom) the only one I'm familiar with is the Netbeans IDE (and to a less general extent Eclipse, IIRC). I looked at how to do it using Qt, and it doesn't seem to handle it directly (though I think it could be done with some effort).

My primary use case is replacing https://github.com/kovidgoyal/iv with a pure terminal application (i.e. an image browsing application). That means it needs to: 1) position images arbitrarily 2) draw text over images 3) have good performance so it can display animated images 4) allow control of the lifetime of image data in the terminal to minimize resource consumption while optimizing efficiency.

I'm afraid any spec that does not allow even such a basic application, and your proposal is one such, is not going to pass muster with me.

Umm if by tiles you mean multiple sub-windows inside a top level window, kitty most definitely has them, along with multiple programmable layouts for them. They are not draggable, but that is because the idea of using a mouse to arrange tiles horrifies me, not because it is hard to implement. And this is all without even using a graphics toolkit, let alone a browser.

"My primary use case is replacing an [image browsing application] with a pure terminal application."

Why? I think very few people would find that a compelling use-case. I can see two minor advantages: Remote browsing of images, and re-using the terminal window (thus reduced need for mouse or context switching). Still, almost everyone else would just use a graphical image browser.

Of course that sort of thing is relatively natural and easy to do with DomTerm. One way is to use a script like:

DIR=$1
PORT=$(start-my-image-server "$DIR")
domterm --tab browse "http://localhost:$PORT"

where start-my-image-server is a simple http server that displays the images and uses JavaScript to manage the UI. The domterm --tab browse URL command creates a new browser window (an iframe) in a new tab.

DomTerm also allows you to create an iframe in the current window, by just "printing" the appropriate iframe html.

And of course you don't have to use an iframe or http server at all: You can just "print" the image grid directly. DomTerm doesn't currently have commands to replace previously-written html output (except erase line/display), but I'm just waiting for a suitable application before adding that.

Perhaps using a browser to implement a terminal isn't so silly after all?

"the idea of using a mouse to arrange tiles horrifies me"

The idea of arranging tiles without a mouse horrifies me. (Having keyboard shortcuts is great, but a mouse is much more natural and discoverable to most of us.)

This is going nowhere. Feel free to propose whatever half-assed spec you like, I will feel free to ignore it. If any application I care about ever adds support for it, it will be trivial to support it in kitty on top of its much more capable graphics system.

I just have one question for you: "Why are you developing a terminal emulator?" If you are so fond of mice and graphical UIs and browsers, why not just use them?

There's a lot of prior art to study – for all of us, including myself, as I'm not familiar with all of it – before moving forward. You guys have mentioned Kitty and iTerm2, and also, vaguely, Sixel, which is implemented by Xterm and perhaps a few others. You have at least neglected to mention Terminology.

And if I may call it so, the VTE 729204: Sixel discussion, which has stalled for now, is also some kind of prior art, at least pointing out some core issues you haven't mentioned. So is my comment at Austin Group 1151. I'll briefly summarize these here.

These emulators all seem to be going in their own individual ways. Which suggests that probably there's a fundamental issue that no one has figured out how to handle.

In terminal emulation, currently the only unit is the cell size. Every escape sequence moves with integer multiples of the cell width or height and positions things accordingly. The first call to make is whether we'd still obey this, or break out of this limitation.

(1) If we still obey this constraint, we cannot offer a means to display an image at its natural size. All that an image protocol can offer is something like "display this image in these 10x15 character cells, using this-and-that scaling-stretching-cropping-tiling-scrolling-aligning method". We need to study if keeping this constraint is acceptable, or whether apps outputting sharp lines (e.g. Gnuplot) would be really unhappy about it.

(2) If we break out from this constraint and want to be able to show an image at its natural size, we have to introduce the concept of "pixel", whose relation (ratio) to the "cell" is unknown, and in some cases, not definable. This, in turn, breaks current principles (invariants) of terminal emulation. For example, you output a few lines of text, then a certain image, then move the cursor up by 10 rows, then print something. Will it print into the picture, or in the line just above it, or one even further up? It will depend on the font size! Currently no behavior of the terminal emulator depends on the font size, but this one would. This not only results in an unexpected user experience (e.g. if someone temporarily enlarges the font, then prints this image, then shrinks it back, the result will differ from what it would have been had he not tampered with the font size), but is outright problematic when the concept of font size doesn't even exist.
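The font-size dependence can be shown with a few lines of arithmetic. This is only a sketch under one assumption that is not stated above: an image shown at its natural pixel size is padded up to a whole number of text rows. The function names and the sample numbers are made up for illustration.

```python
def rows_for_image(image_height_px: int, cell_height_px: int) -> int:
    """Rows covered by the image, rounded up to whole rows (assumed behaviour)."""
    return -(-image_height_px // cell_height_px)  # integer ceiling division


def landing_zone(text_rows: int, image_height_px: int,
                 cell_height_px: int, rows_up: int) -> str:
    """Where the cursor ends up after text, an image, and 'cursor up N rows'."""
    image_rows = rows_for_image(image_height_px, cell_height_px)
    cursor_row = text_rows + image_rows          # first row below the image
    target_row = max(0, cursor_row - rows_up)
    if target_row >= text_rows:
        return "inside the image"
    return "in the text above the image"


# Same output stream (5 text rows, a 200px-tall image, cursor up 10), two font sizes:
print(landing_zone(5, 200, cell_height_px=16, rows_up=10))  # inside the image
print(landing_zone(5, 200, cell_height_px=32, rows_up=10))  # in the text above the image
```

With 16px cells the image covers 13 rows and the cursor lands inside it; with 32px cells it covers 7 rows and the cursor lands in the text above, which is exactly the invariant violation described.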

Yes, there are circumstances when there's no such concept, including the libvterm headless terminal emulation library, screen or tmux in detached mode, screen or tmux when attached to multiple terminal emulators at once (which could have different font sizes), konsole with split view and different font size in each, gnome-terminal if we ever get to implement VTE 103770: Model/view split etc. Such a protocol would probably be unextendable in these directions, and I'm really reluctant to design anything that a headless emulator or screen/tmux or a future VTE cannot theoretically support. (And this dilemma is where current VTE progress has stalled, at least for now.)

So, (2) is technically a heavily problematic approach, even though chances are that some use cases would require it. (1) is a simpler game, although still has some issues.

Modern word processors, browsers etc. don't suffer from this problem because there the DOM is the source and all modifications are defined in the DOM; they don't define operations like "move the cursor upwards by 10 visual lines". This kind of cursor movement is so essential in terminal emulation that we have to live with it, and it seems to me that it's pretty much mutually exclusive with approach (2).

If it was my call, I'd say that we should forget (2) and go for (1). We need to make a sacrifice, and being able to display images at 1:1 is going to be this sacrifice, in order to retain the character cell based model without having to have any concept of font size. To mitigate the problem, terminal emulators are free to implement a popup (tooltip) or similar widget that shows the image at 1:1 size (temporarily obscuring the rest of the contents) on mouseover.

Even though I'm a heavy user of terminal emulators and a developer of one of them, I myself don't see it as a good idea to shoehorn everything but the kitchen sink into terminal emulators. Sticking to its very basics, the grid nature, is more important than adding new features. If some new feature proposal is an easy fit, so be it. If some is technically problematic, let's just leave it for tools that are better at them. If the task of displaying images is delegated to dedicated image viewers because it's problematic in terminal emulators, I don't see anything wrong with it.

That being said, (1) is an approach that I might find reasonable to discuss further, with the aim of agreeing on some standard. However, there are plenty of issues where the behavior needs to be considered and agreed on, including, but not limited to:

What if the image overwrites existing text
What if the image intersects with another existing image
What if text is printed over the image later on
What if other escape sequences operate on the image in all sorts of weird ways (e.g. ICH/DCH to horizontally scroll the right part of one character row of the image; SU/SD with a scroll region to vertically scroll a part of the image; etc.)
What if the terminal emulator rewraps its contents on resize, what should happen to the image then

One mode, which I will call "big-cell image", makes things easy for applications by treating an image as a pseudo-character: It takes one cell in the logical character grid, but visually it takes as much space as needed. There could be other modes that allow finer control: In another possible mode ("cell-overlay") the cell is zero-width, but an application can specify relative offset and size, including overlapping (blending).

In this message I'll focus on "big-cell image" mode. In the common case (of no other text or image on the same line) the application does not need to know about pixels or alignment. The escape sequence may specify scaling or position adjustment, but by default the image is displayed at "natural size" (using "CSS pixels" which need not be hardware pixels). The application emits the escape sequence for the image (on a fresh line) followed by cr-nl. Cursor addressing works just as if there was a space in the cell. Deleting, erasing, or overwriting the cell with a character (or another image) deletes the image. The image automatically scrolls. If the zoom factor of the window changes, the image can be scaled without problem.

The unusual aspect of this is that you will no longer be able to fit N rows in an N-row window: The "home" position may be scrolled out if the height of an image is more than the text height. Hence this mode is most useful for normal (not alternate) buffer, in a REPL/shell, with scrolling, though it is not prohibited when using the alternate-buffer.

Now look at the more complex case when text and images are mixed on the same line. There will be both rows/column coordinates, and pixel x/y coordinates. The start x/y coordinate of a cell is that of the top-left corner of a cell. Printing a character adjusts x by the character width - and may cause line-wrapping. When an image whose scaled size is w*h is printed, an automatic (soft) line-break is inserted before the image if the current x>0 and x+w is greater than the window inner width. The height of a line is the maximum of the heights of the text character and image cells on that line after line-breaking. The cells are (by default) aligned by their top edges.
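The line-breaking and line-height rules above can be sketched as a small layout routine. This is an illustration, not DomTerm's implementation: the `Image` class and `layout` function are hypothetical, and applying the same "break if it does not fit" test to ordinary characters is an assumption (the text only says that printing a character "may cause line-wrapping").

```python
from dataclasses import dataclass


@dataclass
class Image:
    width: int   # scaled width in pixels
    height: int  # scaled height in pixels


def layout(items, window_width, cell_width, cell_height):
    """Split items (1-char strings or Image objects) into visual lines.

    Returns a list of (line_height_px, items_on_line) tuples.
    """
    lines = []
    current, x, height = [], 0, 0
    for item in items:
        if isinstance(item, Image):
            w, h = item.width, item.height
        else:
            w, h = cell_width, cell_height
        # Soft line break: only when something is already on the line (x > 0)
        # and the new item would extend past the window's inner width.
        if x > 0 and x + w > window_width:
            lines.append((height, current))
            current, x, height = [], 0, 0
        current.append(item)
        x += w
        height = max(height, h)  # line height is the tallest cell on the line
    if current:
        lines.append((height, current))
    return lines


items = list("a" * 100) + [Image(400, 300)] + list("b" * 10) + [Image(400, 150)]
lines = layout(items, window_width=800, cell_width=10, cell_height=20)
print([h for h, _ in lines])  # [20, 300, 150]
```

With 10px-wide cells in an 800px window this gives three visual lines: 80 characters, then 20 characters plus the first image plus 10 characters, then the second image alone.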

It is preferable that an application can print images without having to know the size in pixels of a text cell. The above line-breaking rules do that, but there is one detail: Any line breaks inserted after or just before an image "don't count" in terms of cursor addressing. For example, assume an application prints 100 characters, a 40-character-wide image, 10 characters, and another 40-character-wide image, starting with the top row of an 80-column window. Then we get a line break after character 80, and before the second image, so 3 lines visually. However, in terms of cursor addressing this would be only 2 lines. To erase the second image, move the cursor to line 2 column 32 (1-origin), or line 1 column 31 (zero-origin). The 31 skips past the 20 characters wrapped from the first line, the 1 column for the first image, then 10 characters for the next 10 characters, moving the cursor just before the second image.
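The worked example can be checked mechanically. The sketch below tracks both the visual line and the logical (cursor-addressing) position of every item. It encodes one reading of the "don't count" rule, namely that a soft break is counted only when it falls between two ordinary characters; that reading, the function name, and the item encoding (strings for characters, integers for image widths in columns) are assumptions made for illustration.

```python
def logical_positions(items, columns):
    """Return (logical_line, logical_column, visual_line) per item, zero-origin.

    items: 1-character strings for text, ints for images (visual width in columns).
    Every item, including an image, occupies exactly one logical column.
    """
    positions = []
    line = col = 0        # logical position used for cursor addressing
    x = 0                 # visual x position, in column units
    visual_line = 0
    prev_is_image = False
    for item in items:
        is_image = isinstance(item, int)
        width = item if is_image else 1
        if x > 0 and x + width > columns:
            visual_line += 1
            x = 0
            # Breaks just before or after an image do not start a new logical line.
            if not (is_image or prev_is_image):
                line += 1
                col = 0
        positions.append((line, col, visual_line))
        col += 1
        x += width
        prev_is_image = is_image
    return positions


items = ["x"] * 100 + [40] + ["y"] * 10 + [40]
positions = logical_positions(items, columns=80)
print(positions[-1])  # (1, 31, 2)
```

The second image sits on the third visual line (index 2) but is addressed as line 1, column 31 in zero-origin terms, matching the numbers in the example.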

I will discuss "cell-overlay" mode in a subsequent message.

"Cell-overlay" mode is similar in some ways to "big-cell" mode, but the application has a lot of flexibility in positioning the image, including overlay/blending with text. Like "big-cell" mode, the image is the "value" in a character cell, and when it comes to cursor addressing the cell is a single column position. This means you can use regular cursor addressing to get to an image cell, and you can erase/delete the image by delete-character/erase-character or overwrite. While the cell uses one cell in the character grid, it takes no horizontal space, has zero height, does not cause the x-position to increment, and does not cause line-wrapping. Instead, the application has to manually adjust the cursor position afterwards. The escape sequence specifies how big the image is (in pixels or characters), how much its origin (top-left) is offset relative to the character-cell's origin, and its alpha (unless using an image format with alpha).

Using a dummy character-cell makes scrolling work, and in general the image "follows" the adjoining text. You can position images arbitrarily, including on top of each other, or blended with text (depending on alpha).

"Big-cell" mode is simple for the application, which does not need to know character sizes; it is best with the normal buffer, and is great for a REPL to "print" an image. The image is not padded to a whole number of rows. "Cell-overlay" mode is more complex for an application, but allows arbitrary positioning and overlapping of images. The character grid is not adjusted: The application has to manually position both text and images. This makes sense for a "graphical" "curses-style" application using the alternate buffer.

Answering Egmont's questions:

What if the image overwrites existing text

Since the image is a single zero-width character cell, it can "overwrite" a single character. However, you can overlay text and images by specifying an appropriate offset on the image and/or writing text after writing the image.

What if the image intersects with another existing image

Whichever image is later (in row/column order) "wins", but you can specify an alpha to combine with previous image or text.

What if other escape sequences operate on the image in all sorts of weird ways

If a character is inserted/removed before the image cell, then the image position will be shifted right/left, just like a normal character. This is because image offsets are relative to the image's pseudo-character-cell.

What if the terminal emulator rewraps its contents on resize, what should happen to the image then

The image's position is always relative to the image cell, using the offset specified in the escape sequence.

The two modes (big-cell/cell-overlay) I mentioned above are special cases of a more general mode. The image is controlled by parameter pairs using the following options. (In the following I use the syntax key=value. Using JSON it would be "key": "value".)

A length is a decimal number followed by an optional unit. The unit can be px (pixel, the default) or c (character - line-height or column-width). (A pixel is not necessarily a hardware raster pixel, but rather is a unit that corresponds to a "CSS pixel".)

An slength is an optional sign followed by a length.
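As a rough illustration of the two definitions above, a parser for length and slength values might look like this. The grammar details (no whitespace, an optional fractional part) and all names here are assumptions; nothing in the thread fixes them.

```python
import re
from typing import NamedTuple


class Length(NamedTuple):
    value: float   # negative when written with a '-' sign
    unit: str      # "px" (the default) or "c"
    signed: bool   # True when an explicit '+' or '-' was written


_LENGTH = re.compile(r"([+-]?)(\d+(?:\.\d+)?)(px|c)?")


def parse_slength(text: str) -> Length:
    """Parse an optional sign, a decimal number, and an optional unit."""
    match = _LENGTH.fullmatch(text)
    if match is None:
        raise ValueError(f"not a valid slength: {text!r}")
    sign, number, unit = match.groups()
    value = -float(number) if sign == "-" else float(number)
    return Length(value, unit or "px", sign != "")


def parse_length(text: str) -> Length:
    """Like parse_slength, but an explicit sign is rejected."""
    length = parse_slength(text)
    if length.signed:
        raise ValueError(f"a length may not carry a sign: {text!r}")
    return length


def to_pixels(length: Length, cell_px: float) -> float:
    """Convert to pixels; cell_px is the line height or column width for unit 'c'."""
    return length.value * cell_px if length.unit == "c" else length.value


print(parse_slength("+3px"))                # Length(value=3.0, unit='px', signed=True)
print(to_pixels(parse_length("2c"), 18.0))  # 36.0
```

Keeping the `signed` flag separate from the value matters for the options that follow, where `+0px` and `0px` mean different things.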

offset=_y-slength_/_x-slength_

Adjust image origin relative to starting cell (top-left) origin. The default is 0px/0px.

size=_y-length_/_x-length_

Scale image to the specified size. If one of y-length or x-length is unspecified, scale proportionally. Default is the natural size.

maxsize=_y-length_/_x-length_

Same as size, but is the largest that will fit in a y-length and x-length rectangle while being ratio-preserving. Default for x-length is available width.

space=_y-slength_/_x-slength_

How much space to allocate for the image. This is the allocated size of the character cell: The cursor (top-left of next character or image) is incremented by x-slength. The height of the containing line is at least y-slength. The default is the calculated size from size or maxsize. If a sign is specified for y-slength or x-slength then that is added to the calculated size. The allocated space may be smaller or larger than the image size: For example to allocate 3px padding on all sides:

offset=+3px/+3px;space=+6px/+6px

If the allocated space is smaller than the image size, text or another image may be overlaid. "cell-overlay" mode corresponds to:

space=0px/0px

"Big-cell" mode (which is the default) corresponds to:

space=+0px/+0px
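The sketch below shows how a terminal might resolve maxsize and space into pixel values, following the rules above. To stay short it accepts pixel units only, and the function names are hypothetical. Whether maxsize should ever scale an image up is not stated above; this sketch takes "the largest that will fit" literally and allows it.

```python
import re

_COMPONENT = re.compile(r"([+-]?)(\d+(?:\.\d+)?)(?:px)?")


def fit_within(natural_h, natural_w, max_h, max_w):
    """maxsize: largest ratio-preserving (height, width) inside max_h x max_w."""
    scale = min(max_h / natural_h, max_w / natural_w)
    return natural_h * scale, natural_w * scale


def _resolve(text, calculated):
    """One component of space=: signed values adjust the calculated size."""
    match = _COMPONENT.fullmatch(text)
    if match is None:
        raise ValueError(f"unsupported value in this sketch: {text!r}")
    sign, number = match.groups()
    value = float(number)
    if sign == "+":
        return calculated + value
    if sign == "-":
        return calculated - value
    return value  # unsigned: an absolute size


def allocated_space(space, image_h, image_w):
    """Resolve 'Y/X' (y first, as in the option syntax) to (height, width) in pixels."""
    y_text, x_text = space.split("/")
    return _resolve(y_text, image_h), _resolve(x_text, image_w)


print(fit_within(100, 200, 50, 400))             # (50.0, 100.0)
print(allocated_space("+6px/+6px", 100, 200))    # (106.0, 206.0) 3px padding per side
print(allocated_space("0px/0px", 100, 200))      # (0.0, 0.0)     cell-overlay
print(allocated_space("+0px/+0px", 100, 200))    # (100.0, 200.0) big-cell
```

The last two calls show why the sign has to be preserved during parsing: `0px/0px` allocates nothing, while `+0px/+0px` allocates exactly the image size.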

"big-cell image" [...] It takes one cell in the logical character grid, but visually it takes as much space as needed [...] The unusual aspect of this you will no longer be able to fit N rows in an N-row window: The "home" position may be scrolled out if the height of an image is more than the text height.

"Cell-overlay" [...] While the cell uses one cell in the character grid, it takes no horizonal space, has zero height, does not cause the x-position to increment

These approaches all take us extremely far from what most of the terminal emulators are, namely, a strict grid of cells.

Remember: you have an HTML rendering engine underneath your terminal emulator. Most emulators don't. For you, implementing such a behavior might be quite easy. For most other terminals, it's an enormous amount of work which its developers are pretty unlikely to invest. One thing I can guarantee to you is that I'm definitely not going to voluntarily implement anything like this in VTE/gnome-terminal. Such a change would require us to refactor most of the fundamentals that this software is currently built upon. It's absolutely unrealistic.

(On a side note: VTE 769440 requested another feature where the concept of logical line and visual line would need to deviate, and we rejected that request too.)

The way I could imagine image support is: The strict grid remains, but similarly to how a CJK character occupies 2×1 cells, an image would occupy W×H, as these parameters are specified in the escape sequence. The image is somehow transferred, and then displayed in this region. Since the grid remains (and an image occupies many grid cells), cursor positioning and similar operations would work unaltered on the grid.

Then one possible approach to continue is to say that any time something happens to the image's contents, e.g. any of its cells is overwritten, the entire image disappears. This is what happens to CJK, too. Rewrapping would still be problematic, though.
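A toy model of this first approach: the image owns a W×H block of ordinary grid cells, and touching any one of them removes the whole image. The `Grid` class is purely illustrative, does no bounds checking, and assumes each placed image has a unique id.

```python
class Grid:
    """Strict cell grid where an image occupies a rectangle of cells."""

    def __init__(self, rows, cols):
        self.cells = [[" "] * cols for _ in range(rows)]
        self.owner = [[None] * cols for _ in range(rows)]  # image id or None

    def _evict(self, row, col):
        """If (row, col) belongs to an image, remove that image everywhere."""
        image_id = self.owner[row][col]
        if image_id is None:
            return
        for owner_row in self.owner:
            for c, owner in enumerate(owner_row):
                if owner == image_id:
                    owner_row[c] = None

    def place_image(self, image_id, row, col, height, width):
        """Occupy height x width cells starting at (row, col)."""
        region = [(r, c) for r in range(row, row + height)
                  for c in range(col, col + width)]
        for r, c in region:          # first clear anything already there
            self._evict(r, c)
            self.cells[r][c] = " "
        for r, c in region:          # then claim the cells
            self.owner[r][c] = image_id

    def put_char(self, row, col, ch):
        """Writing a character into any image cell makes the image disappear."""
        self._evict(row, col)
        self.cells[row][col] = ch


grid = Grid(rows=5, cols=10)
grid.place_image("foo.png", row=1, col=2, height=2, width=3)
grid.put_char(1, 2, "x")
print(any(owner for line in grid.owner for owner in line))  # False
```

Cursor addressing is untouched because the grid dimensions never change; only the per-cell ownership does.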

Another approach is that each cell lives its own life independently from the others, and each cell stores the information like "I'm cell (3,5) of foo.png split into 12×10 cells" (probably even caching the exact bitmap according to the current font size) and displays accordingly. Subsequent updates to cells could mangle the image, but that's not a huge problem. Rewrapping to a narrower viewport than the image would also mangle it but then widening back the terminal would reconstruct it.
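For this second approach, each cell only needs enough information to find its own slice of the source image. A minimal sketch follows, assuming that "(3,5)" means zero-based (column, row) and that "12×10" means 12 columns by 10 rows; both readings, and the names used, are assumptions.

```python
from typing import NamedTuple


class CellSlice(NamedTuple):
    image: str   # identifier of the source image, e.g. a file name
    col: int     # zero-based column of this cell within the image
    row: int     # zero-based row of this cell within the image
    cols: int    # number of columns the image was split into
    rows: int    # number of rows the image was split into


def source_rect(cell: CellSlice, image_width: int, image_height: int):
    """Return (left, top, right, bottom) in source pixels for this cell.

    Integer arithmetic makes adjacent slices share edges, so the slices tile
    the image exactly even when its size is not a multiple of the cell count.
    """
    left = cell.col * image_width // cell.cols
    right = (cell.col + 1) * image_width // cell.cols
    top = cell.row * image_height // cell.rows
    bottom = (cell.row + 1) * image_height // cell.rows
    return left, top, right, bottom


cell = CellSlice("foo.png", col=3, row=5, cols=12, rows=10)
print(source_rect(cell, image_width=1200, image_height=1000))  # (300, 500, 400, 600)
```

Because each cell can recompute its rectangle independently, a cell that survives an overwrite or a rewrap still knows what to draw, which is what lets the image reassemble when the terminal is widened again.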

Of course, as said above, the image couldn't retain its natural dimensions. The original size could be displayed by the emulator on mouseover, this doesn't need to be part of the specification.

"The way I could imagine image support is: The strict grid remains, but similarly to how a CJK character occupies 2×1 cells, an image would occupy W×H, as these parameters are specified in the escape sequence."

My main quibble is I would like an option to leave W and H unspecified in the escape sequence: The terminal would use as many cells as needed for the image to fit. (Terminals that support variable-size cells might use a single extra-tall/-short cell.) This would simplify applications that only want to print a simple image without having to know pixels-per-character metrics. The other advantage is that a terminal can handle resizing better, especially in the common case of an image with no other text or images on the same line. Finally, it allows images to be the natural size without padding on terminals that support it.

"most ... terminal emulators are ... a strict grid of cells"

But does each line of cells have to be the same height? It would be some work to remove that restriction, but I don't see it as a fundamental rewrite: If your terminal positions and writes each character individually, then it's (relatively) easy. If you use some kind of text widget, then you're presumably not using a "plain text" widget (since you need to style characters with colors, bold, etc). Which means you're using a "rich text" widget, and all the ones I have heard of support variable-height lines, with font changes and images. (I guess there are some text widgets written specifically for terminal emulators that support styled monospace-only text.)

Regardless, note that my "cell-overlay" mode or the more general model in my previous comment does not change the grid model, as long as the size parameters are an integer number of lines/columns. (Zero columns could be slightly more complicated.)

On Terminology, each line of cells has the same height. It would be a lot of work to remove that restriction. It is using a special widget (a cell grid) written in the toolkit. Any change needed there would need to be handled in the widget library. Then Terminology would have to handle the 2 versions of the widget.

"My main quibble is I would like an option to leave W and H unspecified in the escape sequence: The terminal would use as many cells as needed for the image to fit."

I, on the other hand, object to introducing the cell size as something that influences the behavior. I would like the behavior to be independent of the cell size. I would like it not to make any difference whether the font size is altered before or after printing an image and various other output including cursor moving escape sequences; the end result should be identical. I want to keep the possibility of having no font size at all, or multiple font sizes at once, as we already have these without any problem.

This requirement conflicts with yours.

"But does each line of cells have to be the same height? It would be some work to remove that restriction, but I don't see it as a fundamental rewrite: If your terminal positions and writes each character individually, then it's (relatively) easy."

How far would you go here?

Would you allow entire lines to have some different height than the normal one (and cells within that line keep their regular width)?

How would you define the reported window size (e.g. stty size) in relation to the window's physical size? Would, for example, there still be a "normal" height for text, and irregular ones only for images?

So, if a row is taller than the rest, content area above the normally visible region becomes addressable with cursor moving operations and alterable. Doesn't it introduce usability problems? Or in the other direction: what if a line is shorter (vertically) than the rest, would you have a peek into lines of the read-only scrollback contents?

What to do on the alternate screen where most terminals don't have a scrollbar? Would you add a scrollbar if there's a taller line, and would you show empty contents on the top if there's a shorter line? Some terminals still allow you to scroll back into the read-only scrollback buffer of the normal screen from the alternate one, would the scrollback contents peek through in this case if there's a shorter line?

What would Shift+PageUp and other typical scrolling keycombos do?

In VTE we currently have an obvious 1:1 mapping between the scrollbar position and the logical position in our content buffer. This would no longer be the case; we'd need to add complexity to locate how the scrollbar's position maps to the lines to display and vice versa (and recompute this mapping on font size change). You may get this out of the box in a browser (and it may get noticeably slow relatively soon, as I experience in my browser with infinite scrolling pages), whereas in VTE scrolling remains blazingly fast even with millions of lines in the scrollback.

But there's another much bigger problem: if you start in this direction, you'll no longer be able to have tmux with split panes. (And by this I don't mean tmux in particular, and especially not its current implementation. What I mean is that it would be theoretically impossible for a tmux-like software to operate as desired.) E.g. how would you place two panes next to each other when one has a row of unusual height and the other one doesn't? See VTE 587049 for a somewhat related, rejected feature request. Or how would you handle two panes underneath each other, if e.g. at the bottom of the top pane you have to display a partial text line due to an image above it?

Or, in order to address the tmux issue, would you start going in the direction where the terminal emulator is a canvas with free pixel-grained movement, and you can place letters anywhere? Define when a letter overwrites a previous one (e.g. if their regular-cell-sized bounding boxes overlap by at least one pixel). Define new escape sequences to move the cursor to arbitrary pixel positions. Define new sequences to print cropped letters, as tmux will need it. Figure out what to do on a font size change. Figure out how to rewrap the contents on resize if the concept of "text row" can no longer be defined. Figure out in which order to copy-paste the letters. Figure out how to do BiDi on that. Etc…

"but I don't see it as a fundamental rewrite"

The more we talk about it, the more I'm certain that any approach that brings in pixel size as a concept is so incompatible with the functionality and invariants that we have now that it needs a fundamental redesign and rewrite of most of the things we have now, with tons of arbitrary decisions. I find it absolutely unreasonable to go in this direction, and see no real chance that any suggestion along these lines would attract terminal emulator developers and get adopted.

If anything, what could make sense to me is to redesign everything from scratch in a way that we abandon the concept of visual cursor movement, and instead build up a world where e.g. launching a command from the shell means that the command's output will be placed in a new <div> in the HTML DOM, allowing sub-elements; editing a command would operate in the DOM and the canvas would always be updated accordingly… Apps would no longer focus on outputting the desired visuals (within the given constraints, as they do now), but they'd focus on the data, plus output visual cues (like HTML + CSS does). This would also allow proportional fonts and complex text rendering, something that Indic, Arabic etc. scripts would be thankful for, and would have many other advantages. But I believe it's out of the scope of this Terminal WG.

"If you use some kind of text widget [...] Which means you're using a "rich text" widget"

Not sure about other terminals. In VTE we're not using any kind of widget in this sense. We have a cairo canvas, and paint letters on it using pango/pangocairo and "manually" paint additional stuff like underlines. In the object hierarchy, VTE is next to the text widgets, we paint the text just like they do.

The use of any kind of "plain text widget" or "rich text widget" would require that we pass the entire text (including the scrollback) to that, replacing the entire text any time something changes. Such widgets were not built to handle millions of lines efficiently; we'd get terrible performance pretty soon. Plus we'd need a lot of other controls that we don't have, like having exact control of the scrollbar position, knowing which letter the mouse is over (and potentially underline it then) etc.

The biggest problem is still not how to print in bigger font, bold, colorblended, rotated, whatnot (although it would sure be cumbersome). The biggest problem is defining what needs to be shown in response to any combination of data received (text and escape sequences, including pictures) and user actions (font size change, window resize, tmux detach etc.). The moment you try to add to the current terminal emulation world something that depends on the font size, plenty of things that we currently have suddenly fall apart.

Let's get back to the use cases.

If a utility just wants to show an image, and it's important to show it at 1:1 size (note that this is already troublesome to define with HiDPI scaling), I still don't see why it needs to do it inside the terminal emulator. Just opening an image viewer app sounds perfect to me. (ssh, I know.)

Alternatively, it could print it, asking for the terminal's entire width (which it can easily query), and it could even make a guess on the number of desired rows assuming a reasonable 1:2 aspect ratio for the cells. This way, depending on the actual aspect ratio, there could remain a gap either on the sides or above+below the picture, or the picture getting slightly stretched (subject to a corresponding parameter in the escape sequence). (This would, however, behave badly after narrowing the terminal window. But again, open an image viewer if this is a concern.)
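As a worked example of that guess, here is a small Python sketch. The function name `estimate_rows` and its `cell_aspect` parameter (cell width divided by cell height) are illustrative assumptions, not part of any proposed sequence.

```python
import math


def estimate_rows(image_w, image_h, columns, cell_aspect=0.5):
    """Guess how many rows an image needs when scaled to `columns` cells wide.

    cell_aspect is cell width / cell height; 0.5 is the "reasonable 1:2" guess.
    Measured in cell widths, the scaled image is columns * image_h / image_w tall,
    and each row is 1 / cell_aspect cell widths tall.
    """
    rows = columns * (image_h / image_w) * cell_aspect
    return max(1, math.ceil(rows))


# An 800x600 image across an 80-column terminal:
# 80 * 0.75 * 0.5 = 30 rows.
rows = estimate_rows(800, 600, 80)
```

If the real cells are not exactly 1:2, the emulator either leaves a gap or stretches the picture slightly, as described above.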

The terminal emulator could also conveniently show the 1:1 size on mouseover, or offer other means (e.g. ctrl+click, right-click menu entry etc.) to open the image in a designated viewer software.

Other use cases, such as showing a tiny preview (see tyls inside Terminology) or a file type icon, or showing a larger preview in some designated area (e.g. ranger or mc), however, require the other approach, the one that's orders of magnitude more easily doable in terminal emulation, without having to design everything from the ground up: to automatically scale the image into a defined number of character cells.

This latter approach is feasible to standardize across terminal emulators, is reasonably implementable, and is theoretically easily tmuxable (if the escape sequence is designed properly, allowing cropping of an image).

"any approach that brings in pixel size as a concept"

(Just to make sure there is no confusion: I want to clarify that we're talking about "size of the image in pixels" and/or "size of a character cell in pixels", not the size of each pixel.)

I also want to emphasize that we're talking about inline images, in the same scrollable buffer as the regular text. "Just opening an image viewer app" or "preview with popup on mouseover" are perfectly reasonable things to do, but that is not what I'm hoping to specify. I think there are two main sub-use-cases: (1) Simple image "printing" without text and images on the same line and without needing precise positioning; and (2) applications needing finer control over where images are positioned and scaled.

In use-case (2) the application can request: "scale this image to fit in X columns and Y rows" (optionally preserving aspect ratio). But the application can do a better job if it knows the size in pixels of a character cell or at least the window width in pixels. (If nothing else, it can use that to avoid sending a needlessly high-resolution image.) I think we can agree on this.

I'm suggesting that for sub-use-case (1) it would be convenient for an application to leave the X*Y unspecified, and let the emulator calculate it. In that case the application doesn't need to worry about size in pixels. I also suggested that it should be allowable (but not required) for the emulator to "calculate" X*Y to be 1*1, but using an extra-large character cell, if the terminal can support that. You're free to think that's a bad idea which can cause all kinds of problems, but all I'm asking is for a specification that allows this: A request to display an image with unspecified X*Y will work just fine on a terminal that doesn't support extra-large character cells. DomTerm does support the extra-large character cell model, and I'll respond in more detail to your concerns shortly. The point is: you and vte don't need to care.

[This is a point-by-point response to #12 (comment 140808) focusing on the possibility of variable-height lines. Feel free to ignore it if not interested.]

Note that generally these "answers" are implemented or planned for DomTerm. (And of course DomTerm does have bugs.)

"Would you allow entire lines to have some different height than the normal one (and cells within that line keep their regular width)?"

Yes. The height of a line is the maximum height of the cells in it (except for empty lines which have the normal height).

"How would you define the reported window size (e.g. stty size) in relation to the window's physical size? Would, for example, there still be a "normal" height for text, and irregular ones only for images?"

There is a "normal text height" used for regular characters written directly to the terminal. But note that DomTerm allows "printing" of general HTML including text which by default is variable-width and can change fonts.

The terminal window's size in pixels (as reported by TIOCSWINSZ) is the size in "css/dom pixels" (which may differ from hardware pixels on a HIDPI screen), without borders, padding, scrollbar, and gutters. Let us define num_rows and num_columns as the number of rows and characters that can fit using the default monospace font. The default cell size is the size of such a character. The home line is by definition num_rows-1 above the bottom row (unless that would be negative).
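A rough Python sketch of these definitions follows. The function names are invented for illustration and this is only one reading of the rules as stated here, not DomTerm's actual code.

```python
def grid_size(win_w_px, win_h_px, cell_w_px, cell_h_px):
    """Return (num_rows, num_columns) that fit using the default cell size."""
    return win_h_px // cell_h_px, win_w_px // cell_w_px


def line_height(cell_heights_px, normal_height_px):
    """A line is as tall as its tallest cell; an empty line has the normal height."""
    return max(cell_heights_px) if cell_heights_px else normal_height_px


def home_line(total_lines, num_rows):
    """Index of the home line: num_rows-1 above the bottom line, never negative."""
    bottom = total_lines - 1
    return max(0, bottom - (num_rows - 1))


# A 640x384 pixel window with 8x16 pixel default cells.
num_rows, num_columns = grid_size(640, 384, 8, 16)
```

Note that `num_rows` here is a count of default-height rows; with a tall image cell in view, fewer actual lines fit on screen.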

"So, if a row is taller than the rest, content area above the normally visible region becomes addressable with cursor moving operations and alterable." Yes. "Doesn't it introduce usability problems?" Could be. However, an application would have to "want" to do this, and hopefully do what it is doing for good reason.

"Or in the other direction: what if a line is shorter (vertically) than the rest, would you have a peek into lines of the read-only scrollback contents?" Depends on how things are scrolled. They may be visible but are not addressable using normal cursor commands, since you can't address above the "home" line. It could be useful to have cursor motion commands that could go above the home line, but I have not tried that.

"What to do on the alternate screen where most terminals don't have a scrollbar?"

Most terminals don't have variable-height lines. DomTerm treats the normal and alternate screens as a single buffer, and you can scroll above the alternate screen. I think this approach works well. There are other possible approaches. One is to disallow or discourage variable-size cells in the alternate screen.

"Would you add a scrollbar if there's a taller line, and would you show empty contents on the top if there's a shorter line?" You would show empty contents on the bottom, like you normally do when the screen isn't full. "Some terminals still allow you to scroll back into the read-only scrollback buffer of the normal screen from the alternate one, would the scrollback contents peek through in this case if there's a shorter line?" Not unless explicitly scrolled.

"What would Shift+PageUp and other typical scrolling keycombos do?"

I think scrolling by pixel lines (i.e. a variable number of text lines, possibly partial) makes sense in this case.

"But there's another much bigger problem: if you start in this direction, you'll no longer be able to have tmux with split panes."

First, DomTerm has tmux functionality built in (i.e. multiple panes and persistent sessions), with each pane a much more functional terminal emulator than tmux.

"(And by this I don't mean tmux in particular, and especially not its current implementation. What I mean is that it would be theoretically impossible for a tmux-like software to operate as desired.) E.g. how would you place two panes next to each other when one has a row of unusual height and the other one doesn't?"

Huh? If you're running a tmux-like program, then either you're emulating a traditional tmux-like terminal (with no variable-height lines), in which case the question is moot, or each pane has a particular height (in pixels) and you manage the lines for each pane separately.

Emacs handles this fine: You can have two panes side-by-side, with one pane a regular monospace-font editor, and the other with an eww browser with variable-height lines and images.

To support tmux-like multiple panes you really need to handle the panes in the terminal emulator, like emacs does. Session management can still be handled by a separate program.

"Or, in order to address the tmux issue, would you start going in the direction where the terminal emulator is a canvas with free pixel-grained movement, and you can place letters anywhere?"

One could, but there is no need to do that for tmux or for images.

"If anything, what could make sense to me is to redesign everything from scratch in a way that we abandon the concept of visual cursor movement, and instead build up a world where e.g. launching a command from the shell means that the command's output will be placed in a new <div> in the HTML DOM, allowing sub-elements; editing a command would operate in the DOM and the canvas would always be updated accordingly…"

Well, yes - that is exactly what DomTerm does. It just does it in a way that it also supports xterm-style escape sequences and features in an integrated way.

"Apps would no longer focus on outputting the desired visuals (within the given constraints, as they do now), but they'd focus on the data, plus output visual cues (like HTML + CSS does)."

That is exactly what DomTerm is trying to do. I just see no need to give up compatibility with traditional terminal emulators.

"But I believe it's out of the scope of this Terminal WG."

Agreed - but I would like to define an escape sequence that solves part of the goal, and that can also be implemented on less radical terminals.

"Just to make sure there is no confusion: I want to clarify that we're talking about "size of the image in pixels" and/or "size of a character cell in pixels", not the size of each pixel."

Indeed, thanks for the clarification!

"I also want to emphasize that we're talking about inline images"

We're talking about inline images in the sense that this is what you want to have, and what I don't.

Inline sixel images are already supported by xterm and a few emulators. Is there really a need for something new here?

I've taken a quick look at xterm's sixel. There's no way for the emitter to know upfront how many character rows a picture will take. As such, it's useful for utilities that "just print" the image, but unusable for anyone who wants to control the entire canvas. The contents fall apart on font size change. It's theoretically impossible to implement support for this in screen/tmux.

"The point is: you and vte don't need to care."

And likely I won't. The question is: What about other terminals, and what about applications? Is the developer of any other terminal interested in this, so that this becomes a cross-emulator feature, what this Terminal WG site is for? Will application developers be interested? Will they be interested in a feature that by its design cannot work within screen/tmux commands (and I'm not talking about screen/tmux integration of an emulator)?

I am personally not at all interested in such a feature and thus cannot endorse it, but of course I can't stop people from doing it.

I have come to the firm conclusion that this is a wrong approach.

In my firm opinion any image protocol (or any other terminal emulator protocol extension) has to fulfill these requirements:

Given an initial state (dimension in characters, cursor position etc.), and a fixed sequence that is a mixture of incoming data (text, escape sequences including pictures) and user interactions (e.g. window resize), the end result (as in what character appears in which cell, and where an image shows up) has to be independent of any timing, as well as any font change that the user might be doing any time during the process (including the possibility of multiple concurrent font sizes due to multiple views, or no concept of font due to a detached screen/tmux).
Without any explicit tmux integration on the terminal emulator's side, a tmux-like utility must be able to support the feature as expected by the user. In other words: Whatever visual contents the terminal emulator is able to achieve on its entire canvas, it should also be able to achieve on any smaller rectangular area of cells, leaving the rest of the contents unchanged.

A consequence of these is that displaying an image inlined at its natural size won't be possible. This is by far the smallest price to pay. Displaying at its natural size could be possible in a popped out window, or alike. In addition to defining a new escape sequence, all the existing behavior of terminals remains well defined, the basics of terminal emulation will stay intact, and tmux can add support if they wish to. In this direction I find it much more likely to have buy-in from multiple terminal emulators and applications, and we won't lose tmux users. This is the only way forward that I'm interested in.

Since you opened this issue with the title "simple image display", I guess this means that I have to open a new one for my desired feature, and leave this very conversation.

I do not consider the Sixel format obsolete (even though the format may be technically old-fashioned) because it is supported for example by gnuplot, so gnuplot can nicely display inline graphics in mintty and xterm. Furthermore, xterm supports a slightly more modern "REGIS" format, but I don't know whether any application generates it (gnuplot does not).

For parameters, I agree with Kovid that handling something like JSON is inappropriate overhead for a terminal.

Adding to the "What if" issues list, it's important to handle font zooming as Per mentioned. This is nicely solved for mintty (implemented by Hayaki Saito).

I'd appreciate additional implementation of existing formats rather than inventing new ones.

I agree, no JSON.

Surprisingly the Kitty protocol actually isn't too bad. However, its reliance on knowing the pixel size of the terminal makes it unworkable for tmux; it would be much better to work in terms of cells and I think that would lose little for most applications. I haven't really looked at SIXEL.

If scrolling is supported, tmux would need to be able to redraw only part of an image if it has been partly scrolled off screen or the terminal resizes to partly cover it. However TBH it might be enough just to remove the image if that happens.

I also started to tackle possible image support for xterm.js and want to share a few aspects/thoughts.

strict text cell grid alignment vs. grid breakage

I kinda second Egmont's perspective - emulators' "holy grail" is the text grid with no notion of fonts, font-sizes ... whatsoever. If someone wants to go that path be aware that you basically reinvent a web 0.5 browser or a reduced text processor interface with image caps. Thus I would favor to stick strictly to the text grid with tiles of the image covering the target cells. This also answers a few questions:

What if the image overwrites existing text

Text cell content gets deleted, an image tile is placed there instead.

What if the image intersects with another existing image

Just like any text can overwrite previous content, the image tiles will overwrite other image tiles.

What if text is printed over the image later on

The tile in question gets deleted/replaced by the new content.

What if other escape sequences operate on the image in all sorts of weird ways (e.g. ICH/DCH to horizontally scroll the right part of one character row of the image; SU/SD with a scroll region to vertically scroll a part of the image; etc.)

Like with a block of text it would allow any manipulation with sequences that are allowed there.

What if the terminal emulator rewraps its contents on resize, what should happen to the image then

My current dumb implementation just applies this as well leading to weird "image line reflowing". This needs further investigation how to deal with that and all possible intersections with an image with text right beside it (floating text around the image? - Eww, welcome to browser 0.5 yet again). Any pointers to possible solutions/further discussions on that would be very helpful.

term multiplexer

Imho the xpixel/ypixel in TIOCGWINSZ is the way to go here, it has been there for a long time and finally got some usage. If the parent emulator updates this correctly, a multiplexer can always request those values and apply them accordingly to the sub ptys.
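For reference, this is roughly how an application or multiplexer could read those fields from Python on a Unix system. The helper names are made up; the layout is the usual `struct winsize` of four unsigned shorts (rows, columns, xpixel, ypixel), and many terminals still report zero for the pixel fields, so callers need a fallback.

```python
import fcntl
import struct
import termios


def query_winsize(fd):
    """Ask the tty behind fd for its struct winsize (raw 8 bytes)."""
    return fcntl.ioctl(fd, termios.TIOCGWINSZ, b"\0" * 8)


def parse_winsize(buf):
    """Unpack struct winsize into (rows, cols, xpixel, ypixel)."""
    return struct.unpack("HHHH", buf)


def cell_size(buf):
    """Return (cell_width_px, cell_height_px), or None if pixels are not reported."""
    rows, cols, xpixel, ypixel = parse_winsize(buf)
    if 0 in (rows, cols, xpixel, ypixel):
        return None  # pixel fields unset: the caller has to guess
    return xpixel / cols, ypixel / rows
```

On a real terminal one would call `cell_size(query_winsize(sys.stdout.fileno()))`; a multiplexer would set the same fields on its sub ptys with the corresponding TIOCSWINSZ call.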

sequence

Although SIXEL gets the job done and has been around for so long, it's quite limited specwise. Guess we should find a new sequence based on George's and Kovid's work? About entering APC realms - isn't OSC or DCS better suited to transmit chunkified content? Imho APC and PM should stick to private caps of a particular emulator, thus I would prefer a sequence in OSC or DCS space. I second the "no json" part as it seems to introduce way too much overhead that's not really needed. BASE64 seems appropriate as content encoding as it is widely available. Things like the temp file creation kitty supports should not be part of an official sequence (if we ever manage to get there lol) as it introduces outer scope dependencies.

Thanks for your comment, it is good that people are working on this.

For tmux - xpixel/ypixel is not quite simple. What if the same image needs to be drawn on separate terminals at the same time - what is the right xpixel/ypixel to use?

What if a pty is detached and attached again to a terminal with a different font size? The application that drew the image has exited, how is it adapted to the new xpixel/ypixel?

If the entire image size and position is expressed in cells, these can probably be worked around by either remembering the original xpixel/ypixel (or always providing the same values regardless of the terminal outside) then scaling the image. It would be nice if the terminal could do this for me :-).

Another problem is - if an application draws an image, then writes over part of it with text or an erase or delete or whatnot, how do I redisplay that once the application has exited? It would mean tracking a lot of extra state with the image. Personally I think it might be better to say that any modifications to cells under the image removes the whole thing, or at least leave that as an option.

I agree that cells are much better for measurement than pixels wherever possible. Also Base64 is a good option for encoding.

The only thing that would be a complete killer for the idea in tmux would be requiring new external dependencies, I'm not going to link against libpng or something.

Actually I was still sort of thinking of images as single entities but it seems like you are saying images would be split into cells (before or after being sent) - so each cell of the image could be handled separately. If so then the overwriting isn't an issue because an individual cell could be erased or overwritten without problems. Likewise clipping the image would just work. So that is a good way to do it.

I would still need to scale each cell if xpixel/ypixel changed but that would be workable I think.

Of course I would need to be able to redraw an image as individual cells in that case, say someone draws a 10x10 image then erases the top 9 lines, I'd need to be able to draw the bottom line alone to show it correctly if tmux was detached and reattached.

"What if the terminal emulator rewraps its contents on resize, what should happen to the image then"

In my opinion, an absolute requirement is the following: If the image starts at the left-most column and there is no text to the right of it, then the image must be preserved as a unit on window resize. The image stays at the left edge; no characters appear to the right of it; there is no breaking apart or re-shuffling of tiles; the image scaling is not changed (unless font/text scaling is also done). If the window width is smaller than the image width, it is truncated; otherwise background spacing is added on the right.

(I'm assuming left-to-right text. We can adjust this as appropriate for right-to-left or bi-directional text.)

Any specification or implementation that does not have this property is just plain broken.

"If the window width is smaller than the image width, it is truncated;"

It might be reasonable to allow the image to be scaled to fit if the window is narrower than the image's "natural size" but this seems contrary to the tiling model people want.

(I don't think the character-cell-tiling model is the best, just that it might be the easiest to specify and implement for most terminal emulators. Thus I'm ok with a specification based on a character-cell-tiling model, as long as basic reflow works.)

"I would still need to scale each cell if xpixel/ypixel changed but that would be workable I think."

Hmm not sure if you will get away with no further image handling yourself (no libpng and such). My idea for xterm.js currently involves at least to be able to rescale an image that was already embedded to meet later changes to text cell height/width. This way an embedded image can be stable in cell coverage, though the image might change in aspect ratio itself. Imho this is a matter of favoring image metrics or text grid metrics over the other, I tend to see the text grid as source of truth here (also changes to cell metrics are very unlikely in the long run for a single session).

You might have to tile an incoming image yourself, not sure yet if we can find a way to avoid that for a multiplexer by delegating all the image tile handling to the embedding emulator and go instead with some placeholder cell content in the multiplexer (that can be re-applied later on as well). Maybe Kovid's image storage engine might come handy, rough idea:

pty sends an image to cover x * y cells
multiplexer holds image data as black box data and also forwards the data to the embedding emulator
embedding emulator stores the image data and does whatever is needed like tiling
now multiplexer can ref image content by placing some tile refs into text cells as needed
both (multiplexer + embedding emulator) hold image data until either going out of scope (dropping off the scrollbuffer) or explicitly removed/overwritten
if a multiplexer session has to be restored from scratch the image data can be reuploaded to the embedding emulator and needed tiles can be ref'ed again by the multiplexer as needed

Such a system would be dynamic enough to cover most things even with multiplexers. If we allow to rescale the tiles (btw. this should be done from the original image, tested it with lousy results with tons of artefacts) it would even keep replayed sessions in sync across different xy-pixels.
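The rough idea above could look something like this in Python. Everything here (`TileRef`, `Emulator`, `Multiplexer`, the method names) is a hypothetical stand-in to show the data flow, not an existing API of any multiplexer or emulator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TileRef:
    """Placeholder cell content: "tile (col,row) of image image_id"."""
    image_id: int
    col: int
    row: int


class Emulator:
    """Stand-in for the embedding emulator: stores image data and does the tiling."""
    def __init__(self):
        self.images = {}

    def upload(self, image_id, data, cols, rows):
        self.images[image_id] = (data, cols, rows)


class Multiplexer:
    def __init__(self, emulator):
        self.emulator = emulator
        self.images = {}  # image data kept as an opaque black box
        self.cells = {}   # (x, y) -> TileRef

    def on_image(self, image_id, data, cols, rows, x, y):
        """The pty sent an image covering cols x rows cells at (x, y)."""
        self.images[image_id] = (data, cols, rows)
        self.emulator.upload(image_id, data, cols, rows)
        for r in range(rows):
            for c in range(cols):
                self.cells[(x + c, y + r)] = TileRef(image_id, c, r)

    def restore(self, emulator):
        """Session reattached to a fresh emulator: re-upload, the tile refs stay valid."""
        self.emulator = emulator
        for image_id, (data, cols, rows) in self.images.items():
            emulator.upload(image_id, data, cols, rows)
```

The multiplexer never decodes the image, so it needs no image library; it only tracks which cells reference which tile.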

"In my opinion, an absolute requirement is the following: If the image starts at the left-most column and there is no text to the right of it, then the image must be preserved as a unit on window resize."

Yes I second that, we should try to preserve an image as a whole whenever possible. That's why I am unhappy with my current dumb implementation which always applies reflow to the tiles. Started to investigate on area sequences and wonder if we could come up with a sequence that marks some rect on the grid as non reflowable (like an extension to any reflow caps). Not sure yet, how to cover this reliably.

Still there are more open questions, to name one - I currently have no clue what mark + c&p should do with image tile content.

Ideal for me would be a protocol where the image data was given as a block of data and a size in cells, then allowed all or part of it to be drawn at any character position on screen. So if an application gave me a 20x20 cells image at 0,0 but its pty is positioned at -5,5 I could ask the terminal emulator to draw only the right 15x20 cells of the image at 0,5.
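The clipping arithmetic in that example can be sketched in Python as follows. The function name, the argument order and the returned dictionary keys are assumptions made for illustration; the screen size of 80x40 is chosen so that nothing is clipped at the bottom.

```python
def visible_part(img_x, img_y, img_cols, img_rows,
                 pane_x, pane_y, screen_cols, screen_rows):
    """Which part of an image (positioned in pane coordinates) lands on screen.

    Returns None if nothing is visible, else the source tile offset inside the
    image, the size of the visible part in cells, and where to draw it on screen.
    """
    sx, sy = pane_x + img_x, pane_y + img_y  # image origin in screen cells
    left, top = max(sx, 0), max(sy, 0)
    right = min(sx + img_cols, screen_cols)
    bottom = min(sy + img_rows, screen_rows)
    if right <= left or bottom <= top:
        return None
    return {
        "src_col": left - sx, "src_row": top - sy,
        "cols": right - left, "rows": bottom - top,
        "screen_x": left, "screen_y": top,
    }


# A 20x20 cell image at 0,0 in a pane positioned at -5,5 on an 80x40 screen:
# only the right 15x20 cells are visible, drawn at 0,5.
part = visible_part(0, 0, 20, 20, -5, 5, 80, 40)
```

With a draw request that accepts a source offset and size in cells, a multiplexer could forward exactly this rectangle without touching the image data.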

If it was also possible to give it the image's xpixel,ypixel and the terminal emulator would do the Right Thing (scale or whatever) if it is different from the current then that would be great.

Personally I think all this stuff with rewrapping and resizing and whatnot isn't that important, full screen applications will need to redraw including images on resize anyway and for stuff like image cat a best effort to preserve the image would be fine.

"Started to investigate on area sequences and wonder if we could come up with a sequence that marks some rect on the grid as non reflowable (like an extension to any reflow caps)."

One could use an explicit "don't-reflow" sequence. Or one could say that if a line contains part of one or more images, then the line is non-reflowable. However, the terminal can't discard any part of the line that gets truncated, since it needs to be available in case the screen is made wider.
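A small Python sketch of the second rule, under the assumption that each logical line is stored whole together with a flag saying whether it contains image tiles (the function name and the use of strings for cells are illustrative only):

```python
def reflow(lines, width):
    """Compute display rows for a given window width.

    lines is a list of (cells, has_image) pairs. Text lines are wrapped;
    lines with image tiles are only truncated for display, while the stored
    line keeps all of its cells.
    """
    out = []
    for cells, has_image in lines:
        if has_image:
            out.append(cells[:width])  # a view only: `lines` is not modified
        else:
            for i in range(0, max(len(cells), 1), width):
                out.append(cells[i:i + width])
    return out


# "I" stands for an image tile here.
lines = [("abcdefgh", False), ("IIIIII", True)]
narrow = reflow(lines, 4)
wide = reflow(lines, 8)
```

Since the stored line is never shortened, widening the window again brings back the truncated tiles.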

An (optional) nicety would be to bring up a horizontal scroll bar if a non-reflowable line is wider than the window and the line is visible.

DomTerm does handle re-flow of lines with images, but it handles variable-size cells, and treats an image as a single cell. I still think this is a useful feature, but it doesn't have to be part of the initial specification.

"Still there are more open questions, to name one - I currently have no clue what mark + c&p should do with image tile content."

For copy to text/plain, just replace each image tile with either a space or nothing.

For copy to text/html we'd like the image to become a single <img src="data:MIMETYPE;base64,DATA"> element, at least in the case I mentioned above: image at the left edge, with no text or other images to its right. (The DomTerm model where an image is just an extra-large character cell does handle the general case better.)
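
A rough sketch of both copy variants, only to illustrate the idea. The cell representation (`None` marks an image tile) and the function names are made up for this example.

```python
import base64
from typing import List, Optional


def row_to_plain_text(cells: List[Optional[str]], image_fill: str = " ") -> str:
    """Copy one grid row as text/plain.

    `cells` holds one character per cell, with None standing in for an
    image tile (a hypothetical representation). Each tile becomes
    `image_fill`: a space by default, or "" to drop it entirely.
    """
    return "".join(image_fill if cell is None else cell for cell in cells)


def image_to_html(mime_type: str, data: bytes) -> str:
    """Copy a whole image as a single <img> element with a data: URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return f'<img src="data:{mime_type};base64,{encoded}">'


print(repr(row_to_plain_text(["a", None, None, "b"])))
print(image_to_html("image/png", b"abc"))
```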

Thanks for your valuable input (@nicm and @Per_Bothner). I think I need several more brainstorming iterations with you to grasp what is / should be doable and what might be beyond terminal image support.

I see a common pattern/wish for image usage in terminals - REPLs that support some image plotting while staying in the REPL loop. Integrating this right into emulators would give great benefit to data scientists. The typical use case here is quite simple - plots/images could be added left aligned at the bottom creating new lines if needed, no specific sequence magic is needed. That's goal number one for me - just make this possible. As a second goal I have a strong emphasis on multiplexers - they are a great way to host multiple terminal instances in just one embedding "text grid canvas". We should not break those by any means, instead make it possible for them to use it as well.

I think so far we all agree. But let's look at the possible implications:

image origin must support arbitrary row/col offset

That's a requirement following from "not breaking multiplexers" - we cannot simply put any image data left aligned on the global text grid as the multiplexer might have split the window vertically. Current SIXEL implementations deal with that by placing the image origin at the current text cursor position. Imho a must-have.
image should not overdraw "foreign regions"

I think that's what @nicm meant in the last comment with the negative pty offsets. Basically we need a method to address parts of an image to be rendered into xy cols/rows at a certain position. It also emphasizes the tiling model, as the text grid is the only legit coord system we have (still the problem of pixels to cells transition remains). Imho there are several ways to address this:
- multiplexer does all the image handling
  
  The multiplexer does all the dirty image handling on its own and just emits tile data to the embedding emulator. Note there is already a SIXEL charmap replacement sequence (dynamically redefinable character sets - DRCS) for vtXXX which could be abused for this to work (or something similar could be built). Alternatively the multiplexer could just output the whole image and repair any foreign region afterwards (not a good solution imho, as it would require fixing/redrawing the rest of the text grid canvas after the image data was placed).
- multiplexer forwards image handling to embedding emulator
  
  The multiplexer does not know anything about image type, pixel size etc. and just forwards its handling. The embedding emulator stores the image, does the tiling and returns a col X row size of the image to the multiplexer by some report functionality. The multiplexer now can freely request image tiles to be inserted at current text cursor position. Not sure yet, if we could skip the report thingy (as it is always cumbersome with emulators) by enforcing the initial image sequence to contain a certain rows X cols size instead of pixel size (this will def. not work with SIXEL as the sequence has no notion of cell coverage).
line level attributes will not work

In general line level attributes are not a good idea as soon as multiplexers are involved (see xterm and these double line height sequences). They are likely to mess up any vertically split grid canvas. If we really want to cope with reflowing issues from the beginning, the only chance I see is a "non reflowing rect". Not sure yet if we need yet another sequence for that or just go with reflow impl recommendations.
auto insert of tiles to the right on resize

It seems pretty logical to have this one for the embedding emulator. Will there be pitfalls for multiplexers here? (Kinda depends on the question whether the multiplexer always orchestrates ALL cells in the viewport).
c&p behavior

Yes, for text c&p filling up image tiles with whitespace would work I guess. Imho any c&p thingy beyond text only should be left to emulators (Spitting out the whole image for HTML c&p might get cumbersome for overwritten image tiles).

Last but not least some notes regarding things that are already implemented/supported by other emulators. Kitty already has really elaborate image support. I think we can learn from it, still there are several things I want to discuss before going down the rabbit hole:

addressing different output layers

Kitty supports a z-index and alpha blending on images and text. This pulls advanced image processing into an emulator. Since we try to find a working model for most emulators/multiplexers with least changes (and hopefully high adoption rate) I think we should not deal with that for now. Furthermore my idea just deals with image data as foreground content, while Kitty also can send it straight to the background. I don't see much usage for this either and would like to skip this (maybe I just don't get the point due to the chicken-and-egg problem).
explicit image storage

Kitty supports separate image uploads before actually showing something. I think any emulator going to support images will have to deal with storing them somehow, so this seems a very handy way to orchestrate it. Also it might help to overcome issues with multiplexers by giving them a way to ref into uploaded image data. Imho worth adopting to some degree.

Would be great if you find some time and give feedback/ideas on the things above to make sure we are on the same page. If we manage to get some consensus here we could get into sequence business soon.

@Per_Bothner Sorry for hijacking your thread like this. If it is a problem for you just gimme a hint and I will create a new one.

... image should not overdraw "foreign regions"... Basically we need a method to address parts of an image to be rendered into xy cols/rows at a certain position

Yes - an application running inside tmux may see a pty of 20x30 at 1,1 but in fact that pty may be positioned somewhere else on the terminal or may be partly offscreen. So if an application sends eg \033[5;5Hxyz then xyz may be drawn at 5,5 or at 10,10 or only partly printed (eg only the "yz" or "y" or not printed at all). For full support, I would need the same moving and cropping ability for images.
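
For plain text this translate-and-clip step is easy to sketch. This is only an illustration with made-up names; it clips against the outer screen only and assumes one cell per character (no wide characters), which a real multiplexer cannot assume.

```python
from typing import Optional, Tuple


def place_text(text: str, row: int, col: int,
               pane_row: int, pane_col: int,
               screen_rows: int, screen_cols: int) -> Optional[Tuple[int, int, str]]:
    """Translate text written at (row, col) inside a pane to the real screen.

    Coordinates are 0-based cells. (pane_row, pane_col) is the pane's
    top-left corner on the screen and may be negative. Returns
    (screen_row, screen_col, visible_text), or None if nothing is visible.
    """
    screen_row = pane_row + row
    start = pane_col + col
    if not 0 <= screen_row < screen_rows:
        return None
    first = max(start, 0)
    last = min(start + len(text), screen_cols)
    if last <= first:
        return None
    # Keep only the characters that fall inside the screen.
    return screen_row, first, text[first - start:last - start]


print(place_text("xyz", 4, 4, 5, 5, 24, 80))   # moved as a whole
print(place_text("xyz", 4, 4, 0, -5, 24, 80))  # only partly printed
```

An image needs exactly the same two operations (move and crop), just in two dimensions.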

I suspect other applications would need the same abilities - imagine emacs showing an image in a buffer, the buffer could be any size or position on the terminal and may not be able to show the entire image.

... The embedding emulator stores the image, does the tiling and returns a col X row size of the image to the multiplexer by some report functionality.

Not entirely sure what you would see me doing with this.

TBH I don't think there is a perfect solution for tmux mapping image sizes to character cells, because there is no way for tmux or the application inside tmux to reliably know the font size it will be using later.

This is why I would quite like some way for tmux to say to the terminal "fit this image into these rows and columns however you think best". That way, tmux could use some reasonable default font size (or configurable) for it (or the application inside) to work out the image size in character cells, then if that font size didn't actually match the terminal could scale the image.
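
A small sketch of that idea, under stated assumptions: the 8x16 pixel default cell size and both function names are invented for the example, and the terminal side is reduced to picking a scaled pixel size that keeps the aspect ratio.

```python
import math
from typing import Tuple


def cells_for_image(width_px: int, height_px: int,
                    cell_w_px: int = 8, cell_h_px: int = 16) -> Tuple[int, int]:
    """Multiplexer side: guess (cols, rows) from an assumed font cell size."""
    return math.ceil(width_px / cell_w_px), math.ceil(height_px / cell_h_px)


def scale_to_fit(width_px: int, height_px: int, cols: int, rows: int,
                 cell_w_px: int, cell_h_px: int) -> Tuple[int, int]:
    """Terminal side: scale the image into cols x rows real cells,
    preserving the aspect ratio."""
    box_w = cols * cell_w_px
    box_h = rows * cell_h_px
    scale = min(box_w / width_px, box_h / height_px)
    return round(width_px * scale), round(height_px * scale)


cols, rows = cells_for_image(640, 480)  # guessed with the 8x16 default
print(cols, rows)
# The terminal's real cells turn out to be 10x20 pixels, so it scales up.
print(scale_to_fit(640, 480, cols, rows, 10, 20))
```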

However, I realise this would be adding complexity to the protocol and the terminal for a feature that only tmux would really need, so I wouldn't mind doing something else.

explicit image storage

I think splitting upload and draw is a good idea.

auto insert of tiles to the right on resize

tmux is a full screen program, so it will be redrawing the entire terminal including images on resize; what the outside terminal does isn't important in that respect.

I think ideally reflow would wrap around images, but that would be complicated to implement so I would suggest allowing some easier solution at least initially (eg not reflowing, removing the image, or replacing it with spaces).

addressing different output layers

Yes I think by this point you should be writing a GUI application not a terminal application.

If we need to support general positioning and cropping of images, then I think that needs to be motivated by a use case that is not a multiplexer. The point of a traditional multiplexer is that it works on a generic terminal emulator. Once you add extensions like image support that require compatible support from both the multiplexer and emulated terminal then it makes more sense to build the multiplexer into the terminal emulator itself, IMO (like DomTerm does). That is more flexible: you can have per-tile scroll bars, nicer-looking separators, different font sizes in different tiles, and a much better UI (such as integrated menus, or using a non-conflicting command key/modifier, or mouse drag of sub-windows and separators). This is also more efficient and reduces implementation complexity. Or go the other way: Integrate with the window manager. (The Sway tiling window manager looks very promising.)

To continue using tmux/screen I think it would make more sense to define a "sub-terminal" protocol: An escape sequence to specify a rectangular region of the screen, and a sequence to select a specific region. Once a region is selected, output is restricted to that region which is treated as a smaller window. The terminal emulator should handle line-breaking, wrapping, scrolling, and positioning.

I'm sure there are "real" (non-multiplexer) applications for explicit image positioning and cropping. If so, let's support those. But I think terminal multiplexers should not have to deal with translating and cropping images - that is better done by the terminal emulator itself.

That isn't really the point, tmux and screen /work/ on any terminal, but that still means they can support features that only some terminals offer. There is a catalogue of these features already, from colour onward.

Anyway, both will need explicit support for any image protocol to support detaching and reattaching and scrollback even if you take pane management out of the equation.

And of course they might want to use images for their own UI.

tmux and screen are widely used and creating an image protocol that they can't usefully support is shortsighted and doing a disservice to that userbase.

"tmux and screen are widely used and creating an image protocol that they can't usefully support is shortsighted and doing a disservice to that userbase."

What I was trying to say was something different: complicating the image protocol for the sake of screen/tmux pane management seems unfortunate, and a more general "pane protocol" seems a better option.

Short-term supporting translation+cropping in the image protocol is probably the simplest approach for screen+tmux - but as the previous discussion has shown it's still somewhat complicated. A "pane protocol" might not be that much more work initially, and I suspect it would be cleaner and would do more to enable future screen/tmux improvements.

"both will need explicit support for any image protocol to support detaching and reattaching and scrollback even if you take pane management out of the equation."

DomTerm supports detaching and reattaching - with images.

OK I see what you mean, thanks. I would be keen to see other improvements that would make tmux work better with terminals - copying across pane borders and synchronising scrollback are obvious ones and perhaps a pane protocol is a way to go with that.

I agree I wouldn't care to overcomplicate the image protocol trying to make it work with tmux. I'm pointing out what /would/ be needed for full support by tmux :-).

I do think positioning and clipping and even scaling would be useful for any application beyond simple image cat.

The problems of multiple terminals and font size are less useful to other applications but I would be happy enough if I could bodge them in some way (perhaps I just treat images as transient and remove or ignore them unless I unambiguously know the font size).

Reflow is traditionally "unspecified". But I think images don't reflow well anyway. I don't think having the right half of an image display below the left half is very useful for example.

(Read the following with a grain of salt, because I still think that reflow is a fragile thing anyway, so this might be an oversimplistic view.) A reasonable baseline strategy for reflow could be to simply clip images when the terminal is resized smaller. Rows where an image tile is in the left-most column would always retain at least the first image tile. Image tiles in other columns might get removed and lost. If the terminal is resized wider and the right-most column contains an image tile, the terminal either a) adds new image tiles for the same line if the image tile was not the right-most one the underlying image has, or b) adds spaces / empty characters.
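
A minimal sketch of this clip-and-regrow strategy for a single row. The `Tile` record and the row representation (a list holding either a character or a `Tile`) are assumptions for illustration, not a proposed storage model.

```python
from typing import List, NamedTuple, Union


class Tile(NamedTuple):
    image_id: int
    image_cols: int  # full width of the underlying image, in cells
    col: int         # which column of the image this tile shows


Cell = Union[str, Tile]


def resize_row(row: List[Cell], new_width: int) -> List[Cell]:
    """Resize one row without reflowing it."""
    if new_width <= len(row):
        # Narrower: clip. Tiles beyond the new width are dropped.
        return row[:new_width]
    out = list(row)
    while len(out) < new_width:
        last = out[-1] if out else " "
        if isinstance(last, Tile) and last.col + 1 < last.image_cols:
            # a) the image has more columns: regrow the next tile
            out.append(Tile(last.image_id, last.image_cols, last.col + 1))
        else:
            # b) otherwise pad with empty cells
            out.append(" ")
    return out


row = [Tile(1, 4, c) for c in range(4)]  # a 4-cell-wide image
narrow = resize_row(row, 2)              # keeps image columns 0 and 1
print(resize_row(narrow, 6))             # columns 0..3 again, then 2 spaces
```

Note the row only ever stores cells for the current width; the missing tiles are regenerated from the image id and column, which is why the image itself has to stay available.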

This would trivially block using cropping features for image atlas use cases, but I think we can live with that. Also this might lose images that are not completely to the left.

This simple strategy would allow keeping the storage model (i.e. one that only needs to know all cells for the current terminal width) and also would satisfy not losing an image that covers the left-most columns (the usual repl case) when resizing. The whole image can be recovered by making the terminal wide again or terminals could offer a popup or scrollbar based solution for those.

As reflow is unspecified, terminals can always come up with something better.

I think support for tmux like use cases is essential. I think client side terminal libraries that support any kind of overlapping regions, or even clipping will need those too. I don't think adding a multiplexer protocol is a good solution, because that would make multiplexers a special kind of terminal client, but really other use cases have very similar features that are needed. (i.e. line number displays in editors, overlapping alerts, splits in an editor, ...)

I like that the discussion seems to converge on a tile based model. Having named image uploads seems attractive, esp. with redrawing (moving an image -- for scrolling or otherwise interactively -- would be much too expensive otherwise). But it's a whole can of worms with regard to lifetimes. On the other hand (maybe) only full screen applications need to care about lifetimes and those are at least theoretically able to handle a report that the referenced image is no longer available and reupload. (The spec could say the last image uploaded is guaranteed to be available until the next line break or something like that.)

I think the image draw sequence needs to be able to apply clipping to allow refreshing tiles without having to do image processing at every layer again and again.

In abstract drawing could look like (using a syntax that does not invite to bike shed sequence encoding): draw_image(image id, image height in rows, image width in columns, number of lines to cover, number of columns to cover, starting row in image, starting column in image)

(possibly moving image height in rows, image width in columns to the upload sequence)

This would allow addressing each tile generated from an image for multiplexers and full screen applications/libraries while still keeping the terminal coordinate system based only on cells. (Applications only need to know about pixels to avoid blurry rescaling when using images close to the final rendering resolution)

I think this might even allow multiplexers to work without doing actual image decoding/processing. Terminals, full screen libraries and multiplexers would need to semantically keep for each cell: the image id, the size of the image in cells and a coordinate in cells into the scaled image.
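
As a sketch of what the abstract draw_image above would leave behind in the grid: each covered cell just records the image id, the image size in cells and its coordinate inside the image. The sparse dict grid, the `ImageCell` record and drawing at an explicit cursor position are assumptions for this example, not part of the proposal.

```python
from typing import Dict, NamedTuple, Tuple


class ImageCell(NamedTuple):
    image_id: int
    image_rows: int  # size of the scaled image, in cells
    image_cols: int
    row: int         # coordinate of this cell inside the image
    col: int


Grid = Dict[Tuple[int, int], ImageCell]  # (screen row, screen col) -> cell


def draw_image(grid: Grid, cursor_row: int, cursor_col: int,
               image_id: int, image_rows: int, image_cols: int,
               rows_to_cover: int, cols_to_cover: int,
               start_row: int, start_col: int) -> None:
    """Fill grid cells starting at the cursor with references into the image."""
    for dr in range(rows_to_cover):
        for dc in range(cols_to_cover):
            src_row, src_col = start_row + dr, start_col + dc
            if src_row >= image_rows or src_col >= image_cols:
                continue  # requested area reaches past the image
            grid[(cursor_row + dr, cursor_col + dc)] = ImageCell(
                image_id, image_rows, image_cols, src_row, src_col)


grid: Grid = {}
# Right 15x20 cells of a 20x20 image (id 7), drawn at row 5, column 0.
draw_image(grid, 5, 0, 7, 20, 20, 20, 15, 0, 5)
print(len(grid), grid[(5, 0)])
```

No pixel data is touched here, which is the point: a multiplexer could keep and replay this per-cell state without decoding the image.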

It seems we all favor the explicit storage/cache idea, as it avoids re-sending image data over and over. As @textshell already said, this has several disadvantages we have to overcome:

life cycle of cached images

I guess we all agree that this needs some measures to free image data at some point and/or limit the total amount of cached image data. I also think that the details should be left to emulator implementations, they may have different constraints and find better solutions for their particular needs. Still it will need some specification of key aspects to be expected from application side later on (including multiplexers). Aspects that come to my mind:
- single upload might fail if the image data exceeds a certain limit
- image upload is identified by a cache resource id (slot number? freely assignable from application side?)
- cache resource can be deleted explicitly
- last cache resource is guaranteed to be available until next image upload (unless it got explicitly deleted) - Other ideas here?
- older cache resources might get removed at any time
- older cache resources might be held up to a certain hard limit (like kitty's quota)
- emulator might hold on to cache resources for own needs (like scrollback) unless explicitly deleted
- certain sequences shall be extended to also clear the image storage (reset should also clear images and such)
- caching strategy is not further specified (up to the emulator to use FIFO/LRU whatsoever)
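
To show how a few of these aspects could fit together, here is a toy cache. Everything in it is an assumption for illustration: the class and method names, the byte-based limits and the FIFO eviction (the list above deliberately leaves the strategy open).

```python
from collections import OrderedDict


class ImageCache:
    """Toy image store: per-upload limit, total quota, FIFO eviction."""

    def __init__(self, quota_bytes: int, max_image_bytes: int) -> None:
        self.quota_bytes = quota_bytes
        self.max_image_bytes = max_image_bytes
        self._images: "OrderedDict[int, bytes]" = OrderedDict()  # oldest first

    def upload(self, resource_id: int, data: bytes) -> bool:
        if len(data) > self.max_image_bytes:
            return False  # a single upload may fail if it is too large
        self._images.pop(resource_id, None)  # re-upload replaces the old data
        self._images[resource_id] = data
        # Older resources may go at any time; the newest one always stays.
        while self._used() > self.quota_bytes and len(self._images) > 1:
            self._images.popitem(last=False)
        return True

    def delete(self, resource_id: int) -> None:
        self._images.pop(resource_id, None)  # explicit delete

    def clear(self) -> None:
        self._images.clear()  # e.g. on a terminal reset

    def has(self, resource_id: int) -> bool:
        return resource_id in self._images

    def _used(self) -> int:
        return sum(len(data) for data in self._images.values())


cache = ImageCache(quota_bytes=10, max_image_bytes=8)
cache.upload(1, b"aaaaaa")
cache.upload(2, b"bbbbbb")           # 12 bytes in total, so id 1 is evicted
print(cache.has(1), cache.has(2))
```
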
access to cached images

Advanced access to image data is granted by the resource id. A sequence has to be found that allows to crop into an image by row X col offsets and sizes. Furthermore we might want to have a "fire and forget" sequence version, that allows simplified usage for scrollbuffer REPLs (basically does the upload and drawing in one step).

Downside of the image cache storage is the need of back reporting, kinda every access would have to report whether the action was successful or not. We might get away with a single report if we guarantee, that a cache resource will only invalidate on certain actions (like delete or a new upload). This way an application (including multiplexers) can request the cache resource status and decide beforehand whether to reupload an image before doing several drawing actions. The status report should include row/col size of the image so the application can do the cropping maths.
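
The "query once, then draw many times" flow could look roughly like this from the application side. `FakeTerminal` is a stand-in written for this example; its methods are hypothetical and do not correspond to any existing sequence.

```python
from typing import Dict, List, Optional, Tuple


class FakeTerminal:
    """Stand-in for an emulator with an image cache (all names hypothetical)."""

    def __init__(self) -> None:
        self.images: Dict[int, Tuple[int, int]] = {}  # id -> (rows, cols)
        self.uploads = 0
        self.drawn: List[Tuple[int, int, int]] = []

    def query(self, resource_id: int) -> Optional[Tuple[int, int]]:
        # Status report: size in cells if still cached, else None.
        return self.images.get(resource_id)

    def upload(self, resource_id: int, rows: int, cols: int) -> None:
        self.images[resource_id] = (rows, cols)
        self.uploads += 1

    def draw_tile(self, resource_id: int, row: int, col: int) -> None:
        self.drawn.append((resource_id, row, col))


def redraw_image_column(term: FakeTerminal, resource_id: int,
                        rows: int, cols: int, col: int) -> None:
    """Check the cache once, re-upload if needed, then draw several tiles."""
    size = term.query(resource_id)
    if size is None:
        term.upload(resource_id, rows, cols)
        size = (rows, cols)
    for row in range(size[0]):
        term.draw_tile(resource_id, row, col)


term = FakeTerminal()
redraw_image_column(term, 1, 3, 4, 2)
redraw_image_column(term, 1, 3, 4, 3)  # still cached, no second upload
print(term.uploads, len(term.drawn))
```

This only works if the resource cannot vanish between the query and the draws, which is exactly the invalidation guarantee discussed above.
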
security aspects

Let's not forget about security here. An image uploaded as root will prolly survive the next user context switch. Imho there is no further harm from this, as most tools like su/sudo and such don't clear the text history/scrollbuffer either (kinda the same "vector"). Still this should find some place in a formal specification as paranoid tools might want to proactively clear the image storage. Furthermore we should take care that sequences typically used to clear/reset parts of the terminal (like DECSTR / RIS / buffer switch + erase) do likewise for images. Imho cached raw image data shall not be re-accessible from application side by a sequence directly. Not sure if HTML / rich-text c&p would impose a threat here.

Edit: Btw, if we go with a model close to this, SIXEL could be handled pretty close to this and also utilize the image storage.

"It seems we all favor the explicit storage/cache idea,"

I'm not sure I do. Or rather: I'd like to support a simple escape sequence that includes image data inline in the escape sequence (like SIXEL). The life-time should be as long as all or part of the image is visible or can be made visible through scrolling (or switching from alternate back to main buffer).

A detached session needs to save image data received inline.

If the image is specified with a URL then I think the default behavior should be for the terminal to download the image when it has received the URL, and then keep it as above. If the terminal is resource constrained, it is reasonable to allow the terminal to free currently-non-visible images and re-download when needed, possibly controlled by a user preference and/or an option in the escape sequence. This also applies to detached sessions.

I think this is how a REPL would "print" an image, using inline image data or a URL, and "automatic lifetime".

I'm not opposed to an option for more explicit image management, but I would prioritize a simple "dumb" model.

@Per_Bothner You basically describe what I meant with a "fire and forget" sequence. Without additional image handling capabilities multiplexer (or any advanced TUI app) would have a hard time to deal with these alone. That's at least what I get from the discussion above.

It kinda boils down to the question whether to put the tiling work load on the emulator or on the application side. Imho it does not belong on the app side - simple apps never really work with explicit cursor sequences and thus never get in touch with the text grid itself. It would be an unbearable burden for those and would likely not be adopted. But if we move it to the terminal side without any cropping functionality, more advanced apps like vim/emacs/ncurses apps and multiplexers fall short - they have no way to deal with the partial image data output needed for their "canvas drawing" without reuploading data over and over. Furthermore they would have to pretile the images on their own, thus we would force them to do image processing to get something done the emulator has to do anyway.

I think a cache model which also covers the simple "dumb" use case fits both expectations without doubling the workload. A multiplexer for example would have to store the image data in case the session gets reattached, to re-upload things once, but does not need to know anything about the image internals besides col/row size.

I don't quite get your URL idea, if you propose to embed some curl functionality into an emulator I hear several security bells ringing. Imho a no-go. Def. nothing to be discussed as a sidenote in a thread about image support.

"Without additional image handling capabilities multiplexer (or any advanced TUI app) would have a hard time to deal with these alone."

Note I have no objection to a more complex model, with explicitly managed images, translation, and cropping. However, there should also be a simple "fire and forget" model that applications can use - and that would be what a typical REPL would use to print images, and that would be the recommended programming model for simple applications.

A multiplexer may need to translate a simple "fire and forget" image and send to the underlying terminal a more complex "managed image". I don't see any difficulty with that. The multiplexer has to know what is visible in every pane (and detached session). It receives the data, and then translates/crops it in the relevant pane. When the pane no longer needs the image (because it is erased or scrolled out) then the multiplexer frees the image.

"I don't quite get your URL idea, if you propose to embed some curl functionality into an emulator I hear several security bells ringing"

Well, if you want to restrict it to filenames (or file: URLs) that's OK with me. However, I have no idea what "security bells" you're thinking of. What is your threat model - one that wouldn't also apply to inline images? Remember we're talking about terminal emulators, where applications can trivially fork off a curl job.

A multiplexer may need to translate a simple "fire and forget" image and send to the underlying terminal a more complex "managed image".

Yepp, that's how it could be done. This would even still work with multiple multiplexer wraps (terminal in terminal in ...), though every multiplexer layer would have to slightly translate incoming image sequences to fit its output model (from innermost up to the outermost being the embedding emulator).

About the URL thingy: In my world a terminal emulator is still an output sink that connects only to ONE data source - the application on the pty slave side. There is no side comm at all. That's a fundamental principle that ppl have learnt to be trustworthy. Any deviation from this by additional side comm channels needs a very good excuse/benefit, otherwise it is a failure right from the start. That's what I mean by "... nothing to be discussed as a sidenote ...". And you already noted:

Remember we're talking about terminal emulators, where applications can trivially fork off a curl job.

Exactly. But that's controlled by the user/application side, not by the emulator. My solution to this does not include URLs at all (so no file:// protocol either), the data has to come through the one and only data source - the application side.
