Saturday June 7, 2025

WebGPU/WGSL Tutorials and Articles

IT Crowd TV series: the famous and funny quote where Jen says not to talk about memory or RAM, and Moss laughs and jokes that RAM is memory. Jen is head of IT but has no idea how computers work.

Memory Array - Alignment Buffers and Pains

This article talks about memory and how it isn't always what you think it is, especially for long sequential blocks of data!

For example, the implementation below copies from `array0` to `array1`. What could be simpler? Both are just arrays of floats. However, in WGSL `array0` is declared as `array< vec3<f32> >`, which is common as you might want to group every 3 values as positions or vectors (x,y,z). What you might not be aware of is how it's stored in memory on the GPU!

A `vec3<f32>` holds 12 bytes of data but is aligned to a 16-byte boundary, so every element of `array< vec3<f32> >` takes up 16 bytes (3 floats plus 4 bytes of padding).

If the memory definitions are like this:

@group(0) @binding(0) var<storage, read> array0 : array< vec3<f32> >;
@group(0) @binding(1) var<storage, read_write> array1 : array< f32 >;

And the WGSL code to copy `array0` to `array1` is:

    // vec1
    array1[ global_id.x+0 ] = array0[ global_id.x ].x;
    array1[ global_id.x+1 ] = array0[ global_id.x ].y;
    array1[ global_id.x+2 ] = array0[ global_id.x ].z;

    // vec2
    array1[ global_id.x+3 ] = array0[ global_id.x+1 ].x;
    array1[ global_id.x+4 ] = array0[ global_id.x+1 ].y;
    array1[ global_id.x+5 ] = array0[ global_id.x+1 ].z;

We get this:

log:["res:",[1,2,3,5,6,0]]

Instead of this:

log:["res:",[1,2,3,4,5,6]]

All because of the internal alignment/padding - which can be a real pain if you're not careful.
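The effect can be reproduced on the CPU without touching the GPU at all. The sketch below is only a simulation: `readVec3` is a made-up helper, and it assumes that reads past the end of the buffer come back as 0, which is what the output above suggests rather than something to rely on.

```javascript
// CPU-side simulation only; no WebGPU calls are made here.
// Assumption: reads past the end of the data yield 0 (matches the output shown above).
function readVec3(floats, index, strideInFloats) {
    const base = index * strideInFloats;
    const at = (i) => (i < floats.length ? floats[i] : 0);
    return [at(base), at(base + 1), at(base + 2)];
}

const packed = new Float32Array([1, 2, 3, 4, 5, 6]);

// What array< vec3<f32> > does: 16-byte stride = 4 floats per element.
const gpuView = [...readVec3(packed, 0, 4), ...readVec3(packed, 1, 4)];

// What we expected: tightly packed, 3 floats per element.
const expectedView = [...readVec3(packed, 0, 3), ...readVec3(packed, 1, 3)];

console.log('stride 4:', gpuView);      // [1, 2, 3, 5, 6, 0]
console.log('stride 3:', expectedView); // [1, 2, 3, 4, 5, 6]
```

The second element starts at float index 4 instead of 3, so the value 4 is skipped and the last read falls off the end of the data.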

The complete implementation is given further down, with a link to the WebGPU Lab at the bottom so you can test out the code.

How to Fix?

You can pad the data by adding an extra 0 after every 3 floating point values in the array, so that each group of 3 values occupies 4 floats (16 bytes). Both arrays can then use `vec3` in the WGSL code.
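As a rough sketch of the padding approach on the JavaScript side, the helpers below insert and strip the extra float. The names `padVec3` and `unpadVec3` are hypothetical and not part of the original demo.

```javascript
// Hypothetical helpers (not part of the original demo).
// Insert one padding float after every 3 values so each vec3 fills 16 bytes.
function padVec3(packed) {
    const count = Math.floor(packed.length / 3);
    const padded = new Float32Array(count * 4); // zero-initialised, so padding is 0
    for (let i = 0; i < count; i++) {
        padded.set(packed.subarray(i * 3, i * 3 + 3), i * 4);
    }
    return padded;
}

// Reverse step: drop every 4th float after reading results back.
function unpadVec3(padded) {
    const count = Math.floor(padded.length / 4);
    const packed = new Float32Array(count * 3);
    for (let i = 0; i < count; i++) {
        packed.set(padded.subarray(i * 4, i * 4 + 3), i * 3);
    }
    return packed;
}

const paddedData = padVec3(new Float32Array([1, 2, 3, 4, 5, 6]));
console.log(Array.from(paddedData));            // [1, 2, 3, 0, 4, 5, 6, 0]
console.log(Array.from(unpadVec3(paddedData))); // [1, 2, 3, 4, 5, 6]
```

The cost is one wasted float per vector (a third more memory), and the GPU buffers need to be sized from the padded array rather than the original.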

However, a more compact and simpler way is to declare both arrays as `array< f32 >` in WGSL. That way the input and output layouts are the same, and inside the WGSL code you just build the `vec3` values from the array manually.

// Convert the flat array of f32 into vec3s
let v0 = vec3<f32>(array0[ global_id.x + 0 ],
                   array0[ global_id.x + 1 ],
                   array0[ global_id.x + 2 ] );

let v1 = vec3<f32>(array0[ global_id.x + 3 ],
                   array0[ global_id.x + 4 ],
                   array0[ global_id.x + 5 ] );

// Then use them as normal
// vec1
array1[ global_id.x+0 ] = v0.x;
array1[ global_id.x+1 ] = v0.y;
array1[ global_id.x+2 ] = v0.z;

// vec2
array1[ global_id.x+3 ] = v1.x;
array1[ global_id.x+4 ] = v1.y;
array1[ global_id.x+5 ] = v1.z;

The above gives the following output:

log:["res:",[1,2,3,4,5,6]]

Implementation

The full code can be run from an `index.js` file loaded as a module (it uses top-level `await`). It includes all the initialization, the copying to/from the GPU and the compute shader.

const adapter = await navigator.gpu.requestAdapter();
const device  = await adapter.requestDevice();

const array0 = new Float32Array( [1, 2, 3, 4, 5, 6] );
const array1 = new Float32Array(  array0.byteLength/4    );

var array0Buffer = device.createBuffer({ size: array0.byteLength, usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST | GPUBufferUsage.COPY_SRC } );
var array1Buffer = device.createBuffer({ size: array1.byteLength, usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST | GPUBufferUsage.COPY_SRC } );

device.queue.writeBuffer(array0Buffer, 0, array0);

const shaderCode = `

@group(0) @binding(0) var<storage, read>       array0 : array< vec3<f32> >;
@group(0) @binding(1) var<storage, read_write> array1 : array< f32 >;

@compute @workgroup_size(1)
fn main(@builtin(global_invocation_id) global_id : vec3<u32>) {

    // vec1
    array1[ global_id.x+0 ] = array0[ global_id.x ].x;
    array1[ global_id.x+1 ] = array0[ global_id.x ].y;
    array1[ global_id.x+2 ] = array0[ global_id.x ].z;

    // vec2
    array1[ global_id.x+3 ] = array0[ global_id.x+1 ].x;
    array1[ global_id.x+4 ] = array0[ global_id.x+1 ].y;
    array1[ global_id.x+5 ] = array0[ global_id.x+1 ].z;
}
`;

const pipeline = device.createComputePipeline({
    layout: 'auto',
    compute: {
        module: device.createShaderModule({
            code: shaderCode
        }),
        entryPoint: 'main'
    }
});

const bindGroup = device.createBindGroup({
    layout: pipeline.getBindGroupLayout(0),
    entries: [
        { binding: 0, resource: { buffer: array0Buffer } },
        { binding: 1, resource: { buffer: array1Buffer } },
    ]
});

const commandEncoder = device.createCommandEncoder();
const passEncoder = commandEncoder.beginComputePass();

passEncoder.setPipeline(pipeline);
passEncoder.setBindGroup(0, bindGroup);
passEncoder.dispatchWorkgroups( 1 , 1, 1);
passEncoder.end(); // end() is synchronous, nothing to await
device.queue.submit([commandEncoder.finish()]);

await device.queue.onSubmittedWorkDone();

// -------------------------------

// All the compute is done - just a matter of copying the data back from the array for analysis.
// -------------------------------

// Write a small helper function
async function getGPUBuffer( buf, siz, msg )
{
// Note this buffer is not linked to the 'STORAGE' compute (used to bring the data back to the CPU)
const gbufferTmp = device.createBuffer({ size:  siz, usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ});

const commandEncoder = device.createCommandEncoder();
// Encode commands for copying buffer to buffer.
commandEncoder.copyBufferToBuffer(
    buf,           // source buffer
    0,                  // source offset
    gbufferTmp,           // destination buffer
    0,                  // destination offset
    siz  // size
);

// Submit GPU commands.
const gpuCommands = commandEncoder.finish();
device.queue.submit([gpuCommands]); // submit() returns nothing; mapAsync below does the waiting

// Read buffer.
await gbufferTmp.mapAsync(GPUMapMode.READ);
const arrayBuffer = gbufferTmp.getMappedRange();
const arr = Array.from( new Float32Array(arrayBuffer) );
gbufferTmp.unmap();
//log(msg + 'array contents:', arr);
return arr;
}

// Copy array1 back and print it to the output
let res = await getGPUBuffer( array1Buffer, array1.byteLength, 'array1 ' );

console.log('res:', res );

Alignment with the `@align` Attribute in Structures

There is the `@align` attribute, which you can add to structure members to force alignment. However, be aware that the alignment must be a power of 2.

We can do a simple example: create a structure, feed in some input data and copy 2 array items. The first field (`data1`) is fine; the problem only shows up in the values copied after it.

We use an input array with a sequence of numbers from 1 to 12, and the output is given below:

log:["res:",[1,2,3,5,6,7,9,10,11,0,0,0]]

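As you can see, the output isn't a sequential count: 4 is missing, 8 is missing, and so on. The fields are aligned, so our tightly packed sequential data in memory doesn't line up with the WGSL structure. The trailing zeros appear because the later fields of the second item fall beyond the 12 input values.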

struct AlignedData {
    // vec3<f32> is 16-byte aligned on its own, so 16 is the smallest value to use here
    @align(16)
    data1: vec3<f32>,

    // This field is aligned to 8 bytes
    @align(8)
    data2: vec2<f32>,

    // This field has the default alignment
    data3: f32,
};

@group(0) @binding(0) var<storage, read>       array0 : array< AlignedData >;
@group(0) @binding(1) var<storage, read_write> array1 : array< f32 >;

@compute @workgroup_size(1)
fn main(@builtin(global_invocation_id) global_id : vec3<u32>) {

var c = global_id.x;

// copy 2 array items
for (var i=0; i<2; i++)
{
    array1[ c ] = array0[i].data1.x;  c++;
    array1[ c ] = array0[i].data1.y;  c++;
    array1[ c ] = array0[i].data1.z;  c++;

    array1[ c ] = array0[i].data2.x;  c++;
    array1[ c ] = array0[i].data2.y;  c++;

    array1[ c ] = array0[i].data3;  c++;
}

}// end main
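To see where the gaps come from, the layout can be worked out by hand on the CPU. The following is a minimal sketch of the usual rule (each member starts at the next multiple of its alignment, and the struct size is rounded up to the largest alignment). `layoutStruct` is a hypothetical helper, and `data1` is treated as 16-byte aligned, which is `vec3<f32>`'s own alignment.

```javascript
// CPU-side sketch of the struct layout rule; sizes and alignments are in bytes.
const roundUp = (align, n) => Math.ceil(n / align) * align;

function layoutStruct(members) {
    let offset = 0;
    let structAlign = 1;
    const offsets = {};
    for (const m of members) {
        offset = roundUp(m.align, offset);   // each member starts on its alignment
        offsets[m.name] = offset;
        offset += m.size;
        structAlign = Math.max(structAlign, m.align);
    }
    // The struct size (and so the array stride) is rounded up to the struct alignment.
    return { offsets, align: structAlign, size: roundUp(structAlign, offset) };
}

// AlignedData: vec3<f32> is 12 bytes (align 16), vec2<f32> is 8 bytes (align 8),
// f32 is 4 bytes (align 4).
const layout = layoutStruct([
    { name: 'data1', size: 12, align: 16 },
    { name: 'data2', size: 8,  align: 8 },
    { name: 'data3', size: 4,  align: 4 },
]);
console.log(layout); // offsets 0, 16, 24; size 32

// Float indices the shader touches for the first two array items.
const floatsPerField = { data1: 3, data2: 2, data3: 1 };
const touched = [];
for (let item = 0; item < 2; item++) {
    for (const [name, n] of Object.entries(floatsPerField)) {
        const first = (item * layout.size + layout.offsets[name]) / 4;
        for (let k = 0; k < n; k++) touched.push(first + k);
    }
}
console.log(touched); // [0, 1, 2, 4, 5, 6, 8, 9, 10, 12, 13, 14]
```

Float indices 3, 7 and 11 are never read, which is why 4, 8 and 12 are missing from the output for an input of 1 to 12, and indices 12 to 14 lie past the end of that input.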

Larger offsets

For example, if we increase the test data to 32 sequential values (0 to 31 in the run below) and make the alignments larger, you'll get:

struct AlignedData {
    // This field is aligned to 16 bytes
    @align(16)
    data1: vec3<f32>,

    // This field is aligned to 32 bytes
    @align(32)
    data2: vec2<f32>,

    // This field has the default alignment
    data3: f32,
};

Output

log:["res:",[0,1,2,8,9,10,16,17,18,24,25,26,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]]

You can't use `@align` to pack everything onto 1-byte boundaries: the alignment of each field must be a power of 2 and can't drop below the alignment its type already requires. Hence you'll always get gaps, but the `@align` attribute gives you a bit more control over where they fall.
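Since the gaps can't be removed, the other option is to write the CPU data into the gaps' layout before uploading it. The sketch below does that for the `@align(16)` / `@align(32)` version of `AlignedData`. The helper `packAlignedData` and the hard-coded offsets (0, 32 and 40 bytes, with a 64-byte stride) are assumptions worked out from the output above, not something taken from the original demo.

```javascript
// Hypothetical packing helper; offsets are hard-coded for the AlignedData struct
// with @align(16) on data1 and @align(32) on data2 (64 bytes per array item).
const STRIDE = 64, OFF_DATA1 = 0, OFF_DATA2 = 32, OFF_DATA3 = 40;

function packAlignedData(items) {
    const bytes = new ArrayBuffer(items.length * STRIDE); // gaps stay zero
    const f32 = new Float32Array(bytes);
    items.forEach((item, i) => {
        const base = i * STRIDE;
        f32.set(item.data1, (base + OFF_DATA1) / 4); // 3 floats
        f32.set(item.data2, (base + OFF_DATA2) / 4); // 2 floats
        f32[(base + OFF_DATA3) / 4] = item.data3;    // 1 float
    });
    return f32;
}

const packedItems = packAlignedData([
    { data1: [1, 2, 3], data2: [4, 5],   data3: 6 },
    { data1: [7, 8, 9], data2: [10, 11], data3: 12 },
]);
console.log(packedItems.length);                   // 32 floats = 2 items * 64 bytes
console.log(Array.from(packedItems.slice(0, 11))); // [1, 2, 3, 0, 0, 0, 0, 0, 4, 5, 6]
```

The resulting `Float32Array` can be passed to `device.queue.writeBuffer` in place of the flat array, provided the storage buffer is created with `packedItems.byteLength` bytes.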

Resources & Links

• Memory Arrays and Vec3 (WebGPU Lab Live Demo)

Copyright (c) 2002-2025 xbdev.net - All rights reserved.
Designated articles, tutorials and software are the property of their respective owners.