Vertex Buffers - Allegro Primitives mach_msg_trap

no-reply@allegro.cc (rhuanjl) — Sat, 16 Sep 2017 14:54:40 +0000

(sorry this will be slightly long...)

Brief background
I'm trying to help optimise an Allegro program that uses the primitives addon for drawing. All the desired output is displayed; the problem is performance, not functionality.

Profiling showed that the key slowdown was in a function that creates a vertex buffer. It calls al_create_vertex_buffer (providing no initial data, just the number of vertices), then al_lock_vertex_buffer, then writes the vertex data to the buffer, and finally calls al_unlock_vertex_buffer.

The call to al_unlock_vertex_buffer had a cost 8000 times that of anything else according to my profiler (Instruments on macOS) and accounted for approximately 80% of our total execution time. Unfolding the stack trace showed that the OpenGL calls underneath it always ended up at mach_msg_trap, which was where all the time was being spent.

Initial fix attempted
I tried rewriting the function to write the vertex data to an array of ALLEGRO_VERTEX structs on the stack and then pass that array to al_create_vertex_buffer, so that the lock and unlock are no longer needed.

This took away all the delay from al_create_vertex_buffer BUT added that same delay to the first time the buffer was used for a drawing operation (only the first time though; the speed was fine for subsequent draws). This time the profiler showed the first draw operation, rather than the buffer creation, ending up in mach_msg_trap.

Next idea
I'll be honest, I know almost nothing about using OpenGL. I tried reading a basic tutorial and it mentioned using glEnableClientState(GL_VERTEX_ARRAY) after creating and before using a buffer. As an experiment I tried adding this into prim_opengl.c on line 670: the slowdown vanished almost entirely and mach_msg_trap no longer appeared in the profile. However, each draw call took about 10% longer than it had without this command. (Total execution time was still less than half what it had been before, but it didn't look right that the draws were taking longer.)

A bit more reading tells me that glEnableClientState is deprecated and is meant to have been replaced by glEnableVertexAttribArray, which is called before the drawing operation by setup_state within prim_opengl.c.

Current thoughts
It seems to me that, for whatever reason, the first time glEnableVertexAttribArray is called for a given array it doesn't work properly in this context. Some printfs showed me that the correct/expected path was being taken through setup_state.

Questions
1. Has anyone seen an issue like this before? (Google found me nothing)
2. Any ideas why this may be happening?

(If relevant, I'm doing this testing on a MacBook Pro with Intel Iris Pro graphics)

Code
Original version of vbo_upload function

bool
vbo_upload(vbo_t* it)
{
    ALLEGRO_VERTEX_BUFFER* buffer;
    ALLEGRO_VERTEX*        entries;
    vertex_t*              vertex;

    iter_t iter;

    // release any previously uploaded buffer
    if (it->buffer != NULL) {
        al_destroy_vertex_buffer(it->buffer);
        it->buffer = NULL;
    }

    // create the vertex buffer object (no initial data, only the vertex count)
    if (!(buffer = al_create_vertex_buffer(NULL, NULL, vector_len(it->vertices), ALLEGRO_PRIM_BUFFER_STATIC)))
        return false;

    // upload vertices to the GPU
    if (!(entries = al_lock_vertex_buffer(buffer, 0, vector_len(it->vertices), ALLEGRO_LOCK_WRITEONLY))) {
        al_destroy_vertex_buffer(buffer);
        return false;
    }
    iter = vector_enum(it->vertices);
    while (iter_next(&iter)) {
        vertex = iter.ptr;
        entries[iter.index].x = vertex->x;
        entries[iter.index].y = vertex->y;
        entries[iter.index].z = vertex->z;
        entries[iter.index].u = vertex->u;
        entries[iter.index].v = vertex->v;
        entries[iter.index].color = nativecolor(vertex->color);
    }
    al_unlock_vertex_buffer(buffer); //<-all delay was here

    it->buffer = buffer;
    return true;
}

Re-written vbo_upload (defers delay to first draw):

bool
vbo_upload(vbo_t* it)
{
    ALLEGRO_VERTEX_BUFFER* buffer;
    vertex_t*              vertex;

    iter_t iter;

    // variable-length array on the stack; assumes the vertex count is small enough to fit
    ALLEGRO_VERTEX vertices[vector_len(it->vertices)];

    // fill the staging array before touching the GPU buffer
    iter = vector_enum(it->vertices);
    while (iter_next(&iter)) {
        vertex = iter.ptr;
        vertices[iter.index].x = vertex->x;
        vertices[iter.index].y = vertex->y;
        vertices[iter.index].z = vertex->z;
        vertices[iter.index].u = vertex->u;
        vertices[iter.index].v = vertex->v;
        vertices[iter.index].color = nativecolor(vertex->color);
    }

    // release any previously uploaded buffer
    if (it->buffer != NULL) {
        al_destroy_vertex_buffer(it->buffer);
        it->buffer = NULL;
    }

    // create the vertex buffer object with its initial data, so no lock/unlock is needed
    if (!(buffer = al_create_vertex_buffer(NULL, vertices, vector_len(it->vertices), ALLEGRO_PRIM_BUFFER_STATIC)))
        return false;

    it->buffer = buffer;
    return true;
}

The draw is done using:
al_draw_vertex_buffer(vbo_buffer(shape->vbo), bitmap, 0, num_vertices, draw_mode);

num_vertices is the number of vertices used when creating the buffer, bitmap is a separately specified image to texture the shape with, and vbo_buffer simply returns the relevant buffer.

The buffer is not edited by anything else.

no-reply@allegro.cc (beoran) — Sat, 16 Sep 2017 21:45:48 +0000

ALLEGRO_PRIM_BUFFER_STATIC doesn't seem right to me, ALLEGRO_PRIM_BUFFER_STREAM or the other flags might work better.

no-reply@allegro.cc (rhuanjl) — Sat, 16 Sep 2017 22:45:50 +0000

beoran said:

ALLEGRO_PRIM_BUFFER_STATIC doesn't seem right to me, ALLEGRO_PRIM_BUFFER_STREAM or the other flags might work better.

Thanks for the suggestion. I've just tried it; unfortunately, changing the flags did not seem to produce any gain.

I note that the intention is to write to any given buffer only once (in the vbo_upload function shown above) but then to draw it many times; hence the initial choice of ALLEGRO_PRIM_BUFFER_STATIC.

no-reply@allegro.cc (beoran) — Sun, 17 Sep 2017 11:06:50 +0000

Hmmm, it seems like this could be an Allegro performance bug on OS X. Perhaps the OpenGL solution you worked out could be helpful in fixing this.

However, the mach_msg_trap seems to be a bit of a red herring:
https://stackoverflow.com/questions/1488601/how-to-find-out-what-mach-msg-trap-waits-for
https://stackoverflow.com/questions/7945016/how-to-optimize-mach-msg-trap

no-reply@allegro.cc (rhuanjl) — Sun, 17 Sep 2017 13:08:20 +0000

beoran: thanks for looking at this for me.

Some further googling and reading around suggests that I'm not the only person who's had problems with OpenGL code that uses glEnableVertexAttribArray on macOS, and it does sound like it's probably a macOS-specific issue.

But what I can't see anywhere is a solution. Thankfully the code runs, and as it's only one delay per VBO it's not disastrous: I can still create a 4-vertex VBO in 0.15 milliseconds, I just think that it should take more like 0.04 or so. I should probably test higher vertex count cases and see if it becomes a more significant issue.

And I suppose if I want a fix I need to read some macOS-specific OpenGL 3/4 guides.

no-reply@allegro.cc (SiegeLord) — Mon, 18 Sep 2017 04:23:09 +0000

I am a bit confused by your setup. Are you continually creating vertex buffers? They are meant to be created once and reused multiple times.

no-reply@allegro.cc (rhuanjl) — Mon, 18 Sep 2017 05:47:16 +0000

@SiegeLord: I wouldn't normally be continually creating VBOs; I'm well aware that they're designed to be made once and used many times.

For testing performance I had the code create 20,000 VBOs and then draw each of them 10 times, which is where the time measurements I've mentioned come from.
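For anyone wanting to reproduce this kind of measurement, here is a minimal sketch of such a test loop in plain C. It is not the code I actually ran: bench_fn, bench_result, run_bench and the counting stubs are hypothetical names, the callbacks stand in for whatever wraps vbo_upload and al_draw_vertex_buffer, and it assumes the draws are issued in passes over all buffers so that the first draw of each buffer can be timed separately. It uses a wall clock (C11 timespec_get) because a CPU-time clock may not count time a thread spends blocked; in an Allegro program al_get_time() would be the natural substitute.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type: in a real test these would wrap
   vbo_upload() and al_draw_vertex_buffer() for buffer number `index`. */
typedef void (*bench_fn)(size_t index, void *user);

typedef struct {
    double create_secs;     /* time to create all buffers */
    double first_draw_secs; /* time for the first draw of every buffer */
    double later_draw_secs; /* time for all remaining draws */
} bench_result;

/* Wall-clock seconds (C11 timespec_get). */
static double wall_secs(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Create `buffers` buffers, then draw each one `draws_per_buffer` times,
   timing creation, the first pass of draws, and the later passes separately. */
bench_result run_bench(size_t buffers, size_t draws_per_buffer,
                       bench_fn create, bench_fn draw, void *user)
{
    bench_result r = {0.0, 0.0, 0.0};
    double start;
    size_t i, pass;

    start = wall_secs();
    for (i = 0; i < buffers; i++)
        create(i, user);
    r.create_secs = wall_secs() - start;

    for (pass = 0; pass < draws_per_buffer; pass++) {
        start = wall_secs();
        for (i = 0; i < buffers; i++)
            draw(i, user);
        if (pass == 0)
            r.first_draw_secs = wall_secs() - start;
        else
            r.later_draw_secs += wall_secs() - start;
    }
    return r;
}

/* Stand-in workloads that only count calls, to check the loop structure. */
typedef struct { size_t creates, draws; } bench_counts;

void count_create(size_t index, void *user)
{
    (void)index;
    ((bench_counts *)user)->creates++;
}

void count_draw(size_t index, void *user)
{
    (void)index;
    ((bench_counts *)user)->draws++;
}
```

Calling run_bench(20000, 10, count_create, count_draw, &counts) with a zeroed bench_counts performs 20,000 create calls and 200,000 draw calls, which matches the shape of the test described above.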

On the MacBook Pro I'm using, creating the 20,000 VBOs takes 3-3.5 seconds with the original version of the function and around 0.5 seconds with the edited version.

With the original version the draw operations take about 1 second (all 200,000 draws), with the edited version the first draw operation for each VBO (i.e. the first 20,000 operations) collectively take 3-4 seconds, with the remaining 180,000 taking < 1 second.

Conversely, when I added glEnableClientState(GL_VERTEX_ARRAY) to the relevant line within al_create_vertex_buffer as well as swapping to the alternate loading function, the creation process dropped to 0.5 seconds and all the draws together took around 1.1 seconds.
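To put the totals above on a per-operation footing, the arithmetic can be sketched as below. per_op_ms and print_per_op are hypothetical helper names used only for illustration; the inputs are the rounded totals quoted in this thread, so the results are averages and only approximate.

```c
#include <stdio.h>

/* Average cost of a single operation, in milliseconds. */
double per_op_ms(double total_secs, long ops)
{
    return total_secs * 1000.0 / (double)ops;
}

/* Print one line of the comparison, e.g. "create (original): 0.1500 ms". */
void print_per_op(const char *label, double total_secs, long ops)
{
    printf("%s: %.4f ms\n", label, per_op_ms(total_secs, ops));
}
```

With these figures, 3 seconds over 20,000 creations is 0.15 ms per VBO (the number mentioned earlier in the thread), 0.5 seconds over 20,000 is 0.025 ms, and 1 second over 200,000 draws is 0.005 ms per draw. The 3-4 seconds for the first 20,000 draws in the edited version works out to 0.15-0.2 ms each, i.e. roughly the same per-buffer cost as the original unlock.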