OK, so after a whole day of profiling, digging and fiddling with the code, I achieved a 2.5x speed improvement in the rendering method (the call to al_draw_tinted_scaled_rotated_bitmap_region).
Now I can go up to 40k sprites while keeping 60 FPS, and it stops there only because sprite preparation and sorting for the render is now the slowest part, so after some other changes I think it could go to 60k or more.
Some of these optimisations were very custom and the result of tighter integration of our rendering method with Allegro, but a big part of them could be applied to Allegro itself to make it generally faster, I believe.
I might propose a patch later.
These changes were the most important:
The backup could be easily removed as long as I initialised the identity transform in drawing of primitives, but the overall gain is big
In the D3D drawing, it checks whether the VERTICAL/HORIZONTAL flip flag is active, but that is already dealt with in the Allegro method (and the flag is turned off), so these ifs are never taken and can be removed
The blender was applied on every sprite. I removed it completely because of the integration, but even a simple condition that checks whether it needs to be applied would help (it takes time)
The internal quad drawing called al_get_current_transform and the color converting functions 4 times in a row, while it could just get the values once and reuse them (this really speeds things up a lot)
The internal quad function could be integrated into the d3d drawing function
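To illustrate the point about fetching the transform and converting the color only once per quad, here is a minimal sketch in plain C. It does not use Allegro's internals: Transform2D, Point, Vertex, get_current_transform, pack_color and both build_quad_* functions are hypothetical stand-ins I made up for the example, and the lookup counter exists only to show how many times the state is fetched.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; these are not Allegro types. */
typedef struct { float m[2][3]; } Transform2D;   /* 2x3 affine matrix */
typedef struct { float x, y; } Point;
typedef struct { float x, y; unsigned int color; } Vertex;

static Transform2D g_transform = { { {1.0f, 0.0f, 0.0f}, {0.0f, 1.0f, 0.0f} } };
static int g_lookup_count = 0;   /* counts state lookups, for demonstration only */

/* Stand-in for a per-call state lookup such as al_get_current_transform. */
static const Transform2D *get_current_transform(void)
{
    g_lookup_count++;
    return &g_transform;
}

/* Clamp a 0..1 float and scale it to 0..255. */
static unsigned int to_byte(float v)
{
    if (v < 0.0f) v = 0.0f;
    if (v > 1.0f) v = 1.0f;
    return (unsigned int)(v * 255.0f + 0.5f);
}

/* Stand-in for the color conversion: pack floats into 0xAARRGGBB. */
static unsigned int pack_color(float r, float g, float b, float a)
{
    return (to_byte(a) << 24) | (to_byte(r) << 16) | (to_byte(g) << 8) | to_byte(b);
}

static void transform_point(const Transform2D *t, Point p, Vertex *out)
{
    out->x = t->m[0][0] * p.x + t->m[0][1] * p.y + t->m[0][2];
    out->y = t->m[1][0] * p.x + t->m[1][1] * p.y + t->m[1][2];
}

/* Before: the lookup and the conversion are repeated for each of the 4 corners. */
static void build_quad_naive(Vertex out[4], const Point corners[4],
                             float r, float g, float b, float a)
{
    assert(out != NULL && corners != NULL);
    for (int i = 0; i < 4; i++) {
        transform_point(get_current_transform(), corners[i], &out[i]);
        out[i].color = pack_color(r, g, b, a);
    }
}

/* After: both values are fetched once and reused for all 4 corners. */
static void build_quad_hoisted(Vertex out[4], const Point corners[4],
                               float r, float g, float b, float a)
{
    assert(out != NULL && corners != NULL);
    const Transform2D *t = get_current_transform();
    unsigned int color = pack_color(r, g, b, a);
    for (int i = 0; i < 4; i++) {
        transform_point(t, corners[i], &out[i]);
        out[i].color = color;
    }
}
```

Both versions produce the same vertices; the second one just does one lookup and one conversion per quad instead of four, which adds up at tens of thousands of sprites per frame.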
Custom optimisations were mainly:
I created all the needed functions with the postfix "_optimised" and used some other tricks, like global transform and bitmap_target objects (not using the apply_transform methods etc., since that also slows things down). I know it is ugly, but it just helps.
I reduced the number of method calls by taking the internals of the public Allegro draw method and the drawer method and using their code directly in my system draw routine, as well as merging some methods.
All my sprites are sub-bitmaps (parts of atlases), so I could remove those ifs that check for sub-bitmaps.
I removed all the branches we never use (non-accelerated drawing, drawing from the backbuffer and similar); smaller functions are better for cache hits.
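As a rough sketch of what dropping the sub-bitmap check looks like, here is a general helper next to a specialised one that assumes every sprite is a sub-bitmap of an atlas. This is not Allegro's code: the Bitmap and SourceRect structs and both resolve_region_* functions are hypothetical names made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bitmap description; not Allegro's ALLEGRO_BITMAP. */
typedef struct Bitmap {
    const struct Bitmap *parent;   /* NULL for a top-level bitmap */
    int xofs, yofs;                /* offset inside the parent, if any */
    int w, h;
} Bitmap;

typedef struct {
    int x, y, w, h;
    const Bitmap *texture;         /* bitmap that actually owns the pixels */
} SourceRect;

/* General version: has to branch on whether the bitmap is a sub-bitmap. */
static SourceRect resolve_region_general(const Bitmap *bmp,
                                         int sx, int sy, int sw, int sh)
{
    SourceRect r = { sx, sy, sw, sh, bmp };
    if (bmp->parent != NULL) {
        r.x += bmp->xofs;
        r.y += bmp->yofs;
        r.texture = bmp->parent;
    }
    return r;
}

/* Specialised version: the caller guarantees a sub-bitmap, so the branch
   is gone. The assert documents the assumption and vanishes with NDEBUG. */
static SourceRect resolve_region_atlas(const Bitmap *sub,
                                       int sx, int sy, int sw, int sh)
{
    assert(sub != NULL && sub->parent != NULL);
    SourceRect r = { sx + sub->xofs, sy + sub->yofs, sw, sh, sub->parent };
    return r;
}
```

The specialised function is only valid as long as the "everything is in an atlas" assumption holds, which is why this kind of change fits a custom build better than a general-purpose library.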