...
 
Commits (15)
Framebuffer memory notes
=========================
The framebuffer is BGRA8888 format. It likely uses 16x16 tiles (TODO:
verify). There is a stride of zeroes every 1856 bytes for unknown
reasons.
---
Zeroes at:
1345548
1347404
1349260
1351116
Deltas of 1856 between zero regions; groups of 1804 valid pixels in
between
2048 = 1856 + 4 * 48?
Coarse job descriptor memory map
================================
E000 (refreshed -- 0x340)
E040 (descriptor)
E060 (VERTEX)
E120 (referenced in VERTEX)
E140 (referenced in TILER)
E170 (referenced in VERTEX + TILER)
E180 (descriptor)
E1A0 (TILER)
E280 (refreshed -- 0x80)
E300 (descriptor set)
E320 (SET_VALUE)
E340 (refreshed -- 0x80)
E380 (descriptor)
E3A0 (FRAGMENT)
E3C0 (soft job chain, refreshed -- 0x28)
Conclusions:
FRAGMENT <= 32 bytes
SET_VALUE <= 32 bytes
VERTEX <= 192 bytes
TILER <= 224 bytes
This is a job-based architecture. All interesting behaviour (shaders,
rendering) is a result of jobs. Each job is sent from the driver across
the shim to the GPU. The job is encoded as a special data structure in
GPU memory.
There are two families of jobs, hardware jobs and software jobs.
Hardware jobs interact with the GPU directly. Software jobs are used to
manipulate the driver. Software jobs set BASE_JD_REQ_SOFT_JOB in the
.core_req field of the atom.
Hardware jobs contain the jc pointer into GPU memory. This points to the
job descriptor. All hardware jobs begin with the job descriptor header
which is found in the shim headers. The header contains a field,
job_type, which must be set according to the job type:
Byte | Job type
----- | ---------
0 | Not started
1 | Null
2 | Set value
3 | Cache flush
4 | Compute
5 | Vertex
6 | (none)
7 | Tiler
8 | Fused
9 | Fragment
This header contains a pointer to the next job, forming sequences of
hardware jobs.
The header also denotes the type of job (vertex, fragment, or tiler).
After the header there is simple type specific information.
Set value jobs follow:
struct tentative_set_value {
uint64_t write_iut; /* Maybe vertices or shader? */
uint64_t unknown1; /* Trace has it set at 3 */
}
Fragment jobs follow:
struct tentative_fragment {
tile_coord_t min_tile_coord;
tile_coord_t max_tile_coord;
uint64_t fragment_fbd;
};
tile_coord_t is an ordered pair for specifying a tile. It is encoded as
a uint32_t where bits 0-11 represent the X coordinate and bits 16-27
represent the Y coordinate.
Tiles are 16x16 pixels. This can be concluded from max_tile_coord in
known render sizes.
Fragment jobs contain an external resource, the framebuffer (in shared
memory / UMM). The framebuffer is in BGRA8888 format.
Vertex and tiler jobs follow the same structure (pointers are 32-bit to
GPU memory):
struct tentative_vertex_tiler {
uint32_t block1[11];
uint32_t addresses1[4];
tentative_shader *shaderMeta;
attribute_buffer *vb[];
attribute_meta_t *attribute_meta[];
uint32_t addresses2[5];
tentative_fbd *fbd;
uint32_t addresses3[1];
uint32_t block2[36];
}
In tiler jobs, block1[8] encodes the drawing mode used:
Byte | Mode
----- | -----
0x01 | GL_POINTS
0x02 | GL_LINES
0x08 | GL_TRIANGLES
0x0A | GL_TRIANGLE_STRIP
0x0C | GL_TRIANGLE_FAN
The shader metadata follows a (very indirect) structure:
struct tentative_shader {
uint64_t shader; /* ORed with 16 bytes of flags */
uint32_t block[unknown];
}
Shader points directly to the compiled instruction stream. For vertex
jobs, this is the vertex shader. For tiler jobs, this is the fragment
shader.
Shaders are 128-bit aligned. The lower 128-bits of the shader metadata
pointer contains flags. Bit 0 is set for all shaders. Bit 2 is set for
vertex shaders. Bit 3 is set for fragment shaders.
The attribute buffers encode each attribute (like vertex data) specified
in the shader.
struct attribute_buffer {
float *elements;
size_t element_size; /* sizeof(*elements) * component_count */
size_t total_size; /* element_size * num_vertices */
}
The attribute buffers themselves have metadata in attribute_meta_t,
which is a uint64_t internally. The lowest byte of attribute_meta_t is
the corresponding attribute number. The rest appears to be flags
(0x2DEA2200). After the final attribute, the metadata will be zero,
acting as a null terminator for the attribute list.
Comparing a simple triangle sample with a texture mapped sample, the
following differences appear between the fragment jobs:
- Different shaders (obviously)
- Texture mapped attributes aren't decoding at all (vec5?)
- Null tripped!
null1: B291E720
null2: B291E700
(null 3 is still NULL)
Notice that these are both SAME_VA addresses 32 bytes apart.
null1 contains texture metadata. In this case, the 0x20 buffer is:
23 00 00 00 00 00 01 00 88 e8 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
null2 contains a(nother) list of addresses, like how attributes are
encoded. In this case, the buffer is:
c0 81 02 02 01 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
It appears null1 is a zero-terminated array of metadata and null2 is the
corresponding zero-terminated array of textures themselves.
- Different uniforms (understandable)
- Shader metadata is different:
No texture map: 00 00 00 00 00 00 08 00 02 06 22 00
^ mask: 01 00 01 00 00 00 0F 00 00 08 20 00
Texture mapped: 01 00 01 00 00 00 07 00 02 0e 02 00
- Addr[8] is a little different, but not necessarily related
- Addr[9] is one byte different (ditto)
This diff is collapsed.
/*
* © Copyright 2017-2018 The Panfrost Community
*
* This program is free software and is provided to you under the terms of the
* GNU General Public License version 2 as published by the Free Software
* Foundation, and any use by you of this program is subject to the terms
* of such GNU license.
*
* A copy of the licence is included with the program, and can also be obtained
* from Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor,
* Boston, MA 02110-1301, USA.
*
*/
#ifndef __MALI_PROPS_H__
#define __MALI_PROPS_H__
#include "mali-ioctl.h"
#define MALI_GPU_NUM_TEXTURE_FEATURES_REGISTERS 3
#define MALI_GPU_MAX_JOB_SLOTS 16
#define MALI_MAX_COHERENT_GROUPS 16
/* Capabilities of a job slot as reported by JS_FEATURES registers */
#define JS_FEATURE_NULL_JOB (1u << 1)
#define JS_FEATURE_SET_VALUE_JOB (1u << 2)
#define JS_FEATURE_CACHE_FLUSH_JOB (1u << 3)
#define JS_FEATURE_COMPUTE_JOB (1u << 4)
#define JS_FEATURE_VERTEX_JOB (1u << 5)
#define JS_FEATURE_GEOMETRY_JOB (1u << 6)
#define JS_FEATURE_TILER_JOB (1u << 7)
#define JS_FEATURE_FUSED_JOB (1u << 8)
#define JS_FEATURE_FRAGMENT_JOB (1u << 9)
struct mali_gpu_core_props {
/**
* Product specific value.
*/
u32 product_id;
/**
* Status of the GPU release.
* No defined values, but starts at 0 and increases by one for each
* release status (alpha, beta, EAC, etc.).
* 4 bit values (0-15).
*/
u16 version_status;
/**
* Minor release number of the GPU. "P" part of an "RnPn" release
* number.
* 8 bit values (0-255).
*/
u16 minor_revision;
/**
* Major release number of the GPU. "R" part of an "RnPn" release
* number.
* 4 bit values (0-15).
*/
u16 major_revision;
u16 :16;
/**
* @usecase GPU clock speed is not specified in the Midgard
* Architecture, but is <b>necessary for OpenCL's clGetDeviceInfo()
* function</b>.
*/
u32 gpu_speed_mhz;
/**
* @usecase GPU clock max/min speed is required for computing
* best/worst case in tasks as job scheduling ant irq_throttling. (It
* is not specified in the Midgard Architecture).
*/
u32 gpu_freq_khz_max;
u32 gpu_freq_khz_min;
/**
* Size of the shader program counter, in bits.
*/
u32 log2_program_counter_size;
/**
* TEXTURE_FEATURES_x registers, as exposed by the GPU. This is a
* bitpattern where a set bit indicates that the format is supported.
*
* Before using a texture format, it is recommended that the
* corresponding bit be checked.
*/
u32 texture_features[MALI_GPU_NUM_TEXTURE_FEATURES_REGISTERS];
/**
* Theoretical maximum memory available to the GPU. It is unlikely
* that a client will be able to allocate all of this memory for their
* own purposes, but this at least provides an upper bound on the
* memory available to the GPU.
*
* This is required for OpenCL's clGetDeviceInfo() call when
* CL_DEVICE_GLOBAL_MEM_SIZE is requested, for OpenCL GPU devices. The
* client will not be expecting to allocate anywhere near this value.
*/
u64 gpu_available_memory_size;
};
struct mali_gpu_l2_cache_props {
u8 log2_line_size;
u8 log2_cache_size;
u8 num_l2_slices; /* Number of L2C slices. 1 or higher */
u64 :40;
};
struct mali_gpu_tiler_props {
u32 bin_size_bytes; /* Max is 4*2^15 */
u32 max_active_levels; /* Max is 2^15 */
};
struct mali_gpu_thread_props {
u32 max_threads; /* Max. number of threads per core */
u32 max_workgroup_size; /* Max. number of threads per workgroup */
u32 max_barrier_size; /* Max. number of threads that can
synchronize on a simple barrier */
u16 max_registers; /* Total size [1..65535] of the register
file available per core. */
u8 max_task_queue; /* Max. tasks [1..255] which may be sent
to a core before it becomes blocked. */
u8 max_thread_group_split; /* Max. allowed value [1..15] of the
Thread Group Split field. */
enum {
MALI_GPU_IMPLEMENTATION_UNKNOWN = 0,
MALI_GPU_IMPLEMENTATION_SILICON = 1,
MALI_GPU_IMPLEMENTATION_FPGA = 2,
MALI_GPU_IMPLEMENTATION_SW = 3,
} impl_tech :8;
u64 :56;
};
/**
* @brief descriptor for a coherent group
*
* \c core_mask exposes all cores in that coherent group, and \c num_cores
* provides a cached population-count for that mask.
*
* @note Whilst all cores are exposed in the mask, not all may be available to
* the application, depending on the Kernel Power policy.
*
* @note if u64s must be 8-byte aligned, then this structure has 32-bits of
* wastage.
*/
struct mali_ioctl_gpu_coherent_group {
u64 core_mask; /**< Core restriction mask required for the
group */
u16 num_cores; /**< Number of cores in the group */
u64 :48;
};
/**
* @brief Coherency group information
*
* Note that the sizes of the members could be reduced. However, the \c group
* member might be 8-byte aligned to ensure the u64 core_mask is 8-byte
* aligned, thus leading to wastage if the other members sizes were reduced.
*
* The groups are sorted by core mask. The core masks are non-repeating and do
* not intersect.
*/
struct mali_gpu_coherent_group_info {
u32 num_groups;
/**
* Number of core groups (coherent or not) in the GPU. Equivalent to
* the number of L2 Caches.
*
* The GPU Counter dumping writes 2048 bytes per core group,
* regardless of whether the core groups are coherent or not. Hence
* this member is needed to calculate how much memory is required for
* dumping.
*
* @note Do not use it to work out how many valid elements are in the
* group[] member. Use num_groups instead.
*/
u32 num_core_groups;
/**
* Coherency features of the memory, accessed by @ref gpu_mem_features
* methods
*/
u32 coherency;
u32 :32;
/**
* Descriptors of coherent groups
*/
struct mali_ioctl_gpu_coherent_group group[MALI_MAX_COHERENT_GROUPS];
};
/**
* A complete description of the GPU's Hardware Configuration Discovery
* registers.
*
* The information is presented inefficiently for access. For frequent access,
* the values should be better expressed in an unpacked form in the
* base_gpu_props structure.
*
* @usecase The raw properties in @ref gpu_raw_gpu_props are necessary to
* allow a user of the Mali Tools (e.g. PAT) to determine "Why is this device
* behaving differently?". In this case, all information about the
* configuration is potentially useful, but it <b>does not need to be processed
* by the driver</b>. Instead, the raw registers can be processed by the Mali
* Tools software on the host PC.
*
*/
struct mali_gpu_raw_props {
u64 shader_present;
u64 tiler_present;
u64 l2_present;
u64 stack_present;
u32 l2_features;
u32 suspend_size; /* API 8.2+ */
u32 mem_features;
u32 mmu_features;
u32 as_present;
u32 js_present;
u32 js_features[MALI_GPU_MAX_JOB_SLOTS];
u32 tiler_features;
u32 texture_features[3];
u32 gpu_id;
u32 thread_max_threads;
u32 thread_max_workgroup_size;
u32 thread_max_barrier_size;
u32 thread_features;
/*
* Note: This is the _selected_ coherency mode rather than the
* available modes as exposed in the coherency_features register.
*/
u32 coherency_mode;
};
struct mali_ioctl_gpu_props_reg_dump {
union mali_ioctl_header header;
struct mali_gpu_core_props core;
struct mali_gpu_l2_cache_props l2;
u64 :64;
struct mali_gpu_tiler_props tiler;
struct mali_gpu_thread_props thread;
struct mali_gpu_raw_props raw;
/** This must be last member of the structure */
struct mali_gpu_coherent_group_info coherency_info;
} __attribute__((packed));
ASSERT_SIZEOF_TYPE(struct mali_ioctl_gpu_props_reg_dump, 536, 536);
#define MALI_IOCTL_GPU_PROPS_REG_DUMP (_IOWR(0x82, 14, struct mali_ioctl_gpu_props_reg_dump))
#endif
......@@ -10,6 +10,7 @@ conf_data.set(
'IS_MMAP64_SEPERATE_SYMBOL',
cc.links(
'''
#undef _FILE_OFFSET_BITS
#include <sys/mman.h>
void *mmap64(void *addr, size_t length, int prot, int flags, int fd,
......@@ -25,6 +26,7 @@ conf_data.set(
'IS_OPEN64_SEPERATE_SYMBOL',
cc.links(
'''
#undef _FILE_OFFSET_BITS
#include <sys/stat.h>
#include <fcntl.h>
......
This diff is collapsed.
......@@ -341,13 +341,8 @@ PANLOADER_CONSTRUCTOR {
panwrap_log("void main(void) {\n");
panwrap_indent++;
panwrap_log("int fd = pandev_raw_open(), rc;\n");
panwrap_log("int fd = pandev_open(), rc;\n");
panwrap_log("\n");
panwrap_log("if (fd < 0) {\n");
panwrap_indent++;
panwrap_log("printf(\"Error opening kernel\\n\");\n");
panwrap_indent--;
panwrap_log("}\n");
panwrap_log("\n");
}
......