|
Working Thread Pool |
Edgar Reynaldo
Major Reynaldo
May 2007
|
Hello peeps! I created a ThreadPool in response to Alex Glez's thread here: https://www.allegro.cc/forums/thread/617498 You can set the number of threads to use, and add jobs to the queue. It can be paused and resumed. If you need to cancel the work, you can kill it. You can view and clone the repository here: https://github.com/EdgarReynaldo/AllegroThreadPool Example code is given here: https://www.allegro.cc/forums/thread/617498/1038506#target The example program main.cpp that comes with the repo lets you test saving bitmaps using multiple threads. I attached a static win32 binary you can try out for yourself. Run the program with -h to see a list of command options. It's meant to be run from the console. Increasing the number of jobs from 1 to 8 makes everything faster by a factor of 2 to 4 or more. The approximate time saved is output by the program. Attached is a zip of the latest repo plus a static win32 binary. Enjoy! Edgar My Website! | EAGLE GUI Library Demos | My Deviant Art Gallery | Spiraloid Preview | A4 FontMaker | Skyline! (Missile Defense) Eagle and Allegro 5 binaries | Older Allegro 4 and 5 binaries | Allegro 5 compile guide |
bamccaig
Member #7,536
July 2006
|
I have little experience with threads, let alone thread pools, but I did a quick study of the code to try to understand it and these are my thoughts:
I'm exhausted and a bit drunk so maybe I'm missing some things (and threads tend to be a brainfuck in general anyway). Just looking to try to understand what's going on, and if I catch a mistake even better. -- acc.js | al4anim - Allegro 4 Animation library | Allegro 5 VS/NuGet Guide | Allegro.cc Mockup | Allegro.cc <code> Tag | Allegro 4 Timer Example (w/ Semaphores) | Allegro 5 "Winpkg" (MSVC readme) | Bambot | Blog | C++ STL Container Flowchart | Castopulence Software | Check Return Values | Derail? | Is This A Discussion? Flow Chart | Filesystem Hierarchy Standard | Clean Code Talks - Global State and Singletons | How To Use Header Files | GNU/Linux (Debian, Fedora, Gentoo) | rot (rot13, rot47, rotN) | Streaming |
relpatseht
Member #5,034
September 2004
|
grumble grumblelock freegrumble grumbleperfgrumble grumble

Starting base case... 1536000000
Starting AL case..... 1536000000
Starting Rel case.... 1536000000
Base time 6648345614

             | Sequential  | AL Thread Pool | Rel ThreadPool
-------------+-------------+----------------+---------------
Total time   | 6648345614  | 1473114439     | 660840522
Per job      | 4.32835e+06 | 959059         | 430235
Overhead     | 0           | 8.68719e+08    | 5.64455e+07
Overhead/job | 0           | 6.2213e+06     | 404232
Overhead %   | 0           | 58.9716        | 8.54147
|
Peter Hull
Member #1,136
March 2001
|
I'm prepared to be wrong, but it looks like every job submitted via ThreadPool::AddJob gets its own Thread object, and the pool schedules their starts so that no more than a certain number are running at once. As I understand it, a thread pool contains a number of threads that outlive any one job; each thread processes a job, then goes back to the queue for the next one. The benefit is that they avoid the overhead of thread creation/destruction when you have many small jobs. https://docs.oracle.com/javase/tutorial/essential/concurrency/pools.html
|
Edgar Reynaldo
Major Reynaldo
May 2007
|
Thanks for the replies! bamccaig said: That said, EnqueueJob() and FinishJob() (despite the inconsistent naming) seems to accomplish what I would imagine a pool to be so it's just nitpicking (with almost zero experience working with thread pools, at that). Those are protected though so I'm lost again. They're protected because they're only supposed to be called by the MasterThreadProc. There's a separate master thread that runs all the jobs. bamccaig said: Having a single mutex to manage the entire mechanism seems somewhat wasteful, meaning that I think it may block parts of the program needlessly. Anything accessible through the public interface that the MasterThread needs to access has to be guarded. There are ways to avoid locking altogether, but I don't know how to do that. bamccaig said: Finish doesn't seem to actually wait for the thread to finish. It appears to add it to a new collection/container, but doesn't seem to do anything else with it. Are you referring to Thread::Finish, ThreadPool::Finish, or ThreadPool::FinishJob? FinishJob removes the worker from the work set, and adds it to the list of complete jobs. The complete_jobs list is to safely allow deletion of finished jobs. Relpatseht said: grumble grumblelock freegrumble grumbleperfgrumble grumble Use your words man! I don't know how to program it lock free. Never used atomic variables or other such things before. Relpatseht said:
Starting base case... 1536000000
Starting AL case..... 1536000000
Starting Rel case.... 1536000000
Base time 6648345614

             | Sequential  | AL Thread Pool | Rel ThreadPool
-------------+-------------+----------------+---------------
Total time   | 6648345614  | 1473114439     | 660840522
Per job      | 4.32835e+06 | 959059         | 430235
Overhead     | 0           | 8.68719e+08    | 5.64455e+07
Overhead/job | 0           | 6.2213e+06     | 404232
Overhead %   | 0           | 58.9716        | 8.54147
Could you explain what this chart is? I have no idea what it means. Peter Hull said: I'm prepared to be wrong, but it looks like every job submitted via ThreadPool::AddJob gets its own Thread object, and the pool schedules their start so that no more than a certain number are running at once. As I understood it, a Thread Pool contains a number of threads that outlive any one job; they process one job then go back to the queue for the next one. The benefit is they avoid the overhead of thread creation/destruction if you have many small jobs. A Thread object doesn't have an actual thread until Thread::Start is called. Allegro starts a thread as soon as you call al_create_thread, which means there's a limit on how many you can create. To alleviate this, I only create a thread when its job is actually scheduled to run. I could certainly do it that way. It would require a bit of a redesign. I understand what you're saying though. Have a set of working threads and recycle them by sending jobs to them. That's doable. I'd have to think about how to go about it properly though. Again, thanks for the feedback, folks! Keep it comin! My Website! | EAGLE GUI Library Demos | My Deviant Art Gallery | Spiraloid Preview | A4 FontMaker | Skyline! (Missile Defense) Eagle and Allegro 5 binaries | Older Allegro 4 and 5 binaries | Allegro 5 compile guide |
relpatseht
Member #5,034
September 2004
|
Sorry, I was slightly inebriated. I threw together a lock-free thread pool, ran perf comparisons, and pasted the results. The code is all attached to that post. The important number is overhead percent. 0 would be perfect scaling (running exactly num-threads times faster than base). Your thread pool has about 60% overhead; the lock-free implementation has about 9%. Sequential is the base time, running the jobs in one thread. Everything else is times. Smaller numbers are better.
|
Edgar Reynaldo
Major Reynaldo
May 2007
|
I'm going to rework the pool to recycle threads, and then I'll rerun the tests and see how we do. I expect it to improve, but not surpass your code. I'll remove as many locks as I can. As for your code, just looking at it makes me C-sick, get it? (Yes, I know it's C++, don't be pedantic.) It's way over my head. All those different types I've never heard of before. And I claim to know C++11. Right. The comparison may not be fair though, if it's Allegro and pthreads vs C++11 et al. If I made it entirely from other C++11 code, do you think it would compare? My Website! | EAGLE GUI Library Demos | My Deviant Art Gallery | Spiraloid Preview | A4 FontMaker | Skyline! (Missile Defense) Eagle and Allegro 5 binaries | Older Allegro 4 and 5 binaries | Allegro 5 compile guide |
relpatseht
Member #5,034
September 2004
|
How you start a thread is completely irrelevant. std::thread, al_create_thread, pthread_create, etc. On Windows, they'll all eventually call CreateThread and then they'll just be another thread to the OS.

The timing code does matter somewhat. std::chrono::high_resolution_clock goes through Windows's QueryPerformanceCounter, which is the highest precision clock available to Windows. At the end of the day, that's all it's doing: getting time stamps and comparing time stamps. The values aren't in any units.

The std::atomic_* types just map their ops to atomic intrinsics. At least, for the sizes I use. If you use an atomic of a size bigger than the hardware supports atomics for, then it will fall back to mutexes. See std::atomic_is_lock_free. Be careful with that.

I align the atomic types to 64 because a cache line is typically 64 bytes. Under the hood, a CPU can't synchronize at a lower level than a cache line. If two atomics share a cache line, they share contention. Slinging cache lines around between processors isn't free.

In the timing code, the work function is using AVX-512. Not many processors support that. It melts CPUs, which is why I chose it. You'll probably need to switch it to something else on your end.

Anyway, like I said in some other thread, threading is all about perf, which means it's all about removing contention and never locking. If I have time tomorrow, I'll add comments, etc. Then probably extend it with real C++11 so I can have nice syntax like:

```cpp
std::vector<Foo> bar;

parallel_for(0, bar.size(), [&](unsigned threadIndex, unsigned barIndex)
{
    // do parallel work...
});

parallel_for_each(bar.begin(), bar.end(), [&](unsigned threadIndex, Foo& f)
{
    // do parallel work...
});

parallel_for(0, bar.size(), [&](unsigned barIndex)
{
    // do parallel work...
});

parallel_for_each(bar.begin(), bar.end(), [&](Foo& f)
{
    // do parallel work...
});
```
[Edit] I started writing up how everything works, then I ran out of time/got bored. I got as far as mostly explaining the SPSC queue, which is the only important bit anyway. ...

Simple Lock Free Thread Pool

Background

The most important performance characteristic of a multithreaded algorithm is how well it scales as you add cores. The number one murderer of scalability is resource contention. All memory shared between threads is a potential source of contention. Mutexes, semaphores, and other threading primitives exist to solve the correctness problems of shared memory, but they do it by serializing access, which is exactly what kills scaling.

In come lock-free and wait-free algorithms. A lock-free algorithm is one where some thread is always guaranteed to make progress without any thread ever blocking on a lock.

But first, a small digression. Your code is a lie. The machine code the compiler generates, and the order the CPU actually executes it in, can be very different from what you wrote. For all intents and purposes, you only need to know about three kinds of memory ordering: relaxed, acquire/release, and sequentially consistent. Now, that's all not actually very useful information in practical terms, so let's get concrete. So, that's all great, but there is still one piece missing: atomic operations, which let a thread update a value without any other thread ever seeing a half-finished write.

The Job Queue

The basis of our lock-free thread pool is the job queue. That's where the producer hands work off to the worker threads. The queue will be implemented as a ring buffer, so the data for the queue is fixed in size:

```cpp
struct job
{
    void (*callback)(int threadIndex, void *userData);
    void *userData;
};

struct spsc_queue
{
    static const size_t capacity = 31;

    alignas(64) std::atomic_size_t head{0};
    alignas(64) job ringBuffer[capacity];
    alignas(64) std::atomic_size_t tail{0};
};
```
As an aside, everything in the spsc_queue is aligned to 64 because a cache line is typically 64 bytes; keeping head, tail, and the buffer on separate cache lines keeps the producer and consumer from contending over the same line.

As you can see, the ring buffer's data is pretty simple. There is a head index the consumer reads from, a tail index the producer writes to, and the buffer itself.

The code to add to the queue is quite simple (read, fast):

```cpp
bool try_push(spsc_queue *queue, const job &newJob)
{
    size_t curTail = queue->tail.load(std::memory_order_relaxed); // relaxed: no memory barrier
    size_t nextTail = (curTail + 1) & spsc_queue::capacity;

    if (nextTail != queue->head.load(std::memory_order_acquire))
    {
        queue->ringBuffer[curTail] = newJob;
        queue->tail.store(nextTail, std::memory_order_release); // release: the tail write and the ringBuffer write become visible to acquiring threads
        return true;
    }

    return false; // full
}
```
So to break this down a bit, the function is called "try_push" because it can fail: if the queue is full, it returns false instead of blocking.

The first step is to get the current tail index and determine the next tail index, wrapping around with the mask.

Next we check if the queue is full, which is true if our next tail index would collide with the head.

So, if our queue is full, we return false, but otherwise, we copy the job into the buffer and then publish it by storing the new tail with release semantics.

That's all there is to writing to the ring buffer. Reading isn't much different:

```cpp
bool try_pop(spsc_queue *queue, job *outJob)
{
    size_t curHead = queue->head.load(std::memory_order_relaxed); // relaxed: no barriers

    if (curHead == queue->tail.load(std::memory_order_acquire)) // acquire: see the tail and ringBuffer writes
        return false; // empty

    *outJob = queue->ringBuffer[curHead];
    queue->head.store((curHead + 1) % spsc_queue::capacity, std::memory_order_release); // release the head write to the other thread

    return true;
}
```
|
bamccaig
Member #7,536
July 2006
|
Side note: is it by design that the writer uses bitwise-AND, but the reader uses modulus for the wrap around? -- acc.js | al4anim - Allegro 4 Animation library | Allegro 5 VS/NuGet Guide | Allegro.cc Mockup | Allegro.cc <code> Tag | Allegro 4 Timer Example (w/ Semaphores) | Allegro 5 "Winpkg" (MSVC readme) | Bambot | Blog | C++ STL Container Flowchart | Castopulence Software | Check Return Values | Derail? | Is This A Discussion? Flow Chart | Filesystem Hierarchy Standard | Clean Code Talks - Global State and Singletons | How To Use Header Files | GNU/Linux (Debian, Fedora, Gentoo) | rot (rot13, rot47, rotN) | Streaming |
relpatseht
Member #5,034
September 2004
|
No, I was editing from a template size down to 31 (2^n - 1), so it made sense to switch to an AND mask, which is marginally more performant. I didn't finish, so the second place wasn't modified. Was anything I wrote useful? Knowledge of performant multithreading is sorely lacking in the industry, so if it happens that I've explained things in a way that makes sense and there's interest, maybe I'll write more. Though I've also been wanting to do a write-up on a performant rendering pipeline in software from scratch. Knowing what the GPU is doing makes graphics a lot easier and more intuitive, in my experience.
|
bamccaig
Member #7,536
July 2006
|
I definitely found it useful, but it also seems like something that is very subject to the inner workings of the hardware it runs on, which I'm unlikely to understand or keep up on. Certainly I wouldn't be able to implement this myself, because I don't understand the technology well enough to back up these assertions if a colleague were to question it. But it's interesting nonetheless. I think where it gets flaky is the alignment to prevent "contention". That sounds like something that will only work on machines that behave this way. Would it still work on 32-bit software/hardware? I assume some future 128-bit bus will require 128-bit alignment, but again I don't really understand that fully. If you have more to write, then I'll definitely read it. I can't promise that I'll be able to actually use it, but I might learn something anyway (I think I already have). -- acc.js | al4anim - Allegro 4 Animation library | Allegro 5 VS/NuGet Guide | Allegro.cc Mockup | Allegro.cc <code> Tag | Allegro 4 Timer Example (w/ Semaphores) | Allegro 5 "Winpkg" (MSVC readme) | Bambot | Blog | C++ STL Container Flowchart | Castopulence Software | Check Return Values | Derail? | Is This A Discussion? Flow Chart | Filesystem Hierarchy Standard | Clean Code Talks - Global State and Singletons | How To Use Header Files | GNU/Linux (Debian, Fedora, Gentoo) | rot (rot13, rot47, rotN) | Streaming |
relpatseht
Member #5,034
September 2004
|
The idea of atomic operations is not hardware specific. Any modern hardware will support at least an atomic compare-and-swap operation as well as memory fences. Just about all hardware now (all processors, mobile or otherwise) supports these primitives in a common enough manner for it to be standardized in the STL, as I was using. Just about all 32-bit hardware also supports/supported atomic operations and fences since the advent of multi-core processors. Of course, most 32-bit processors only supported atomics up to 32 bits wide. Some 64-bit processors support 128-bit atomics, but I wouldn't build an algorithm reliant on that.

As for the 64-byte alignment, that's cache line width, not bus width. This is much less consistent, but is typically 64 bytes. Just about all processors have multiple cache levels. The closer the cache is to the core, the faster and smaller it is. Most multi-core processors will give each core its own L1 cache, then maybe pairs of cores will share an L2, and the whole processor will share L3.

These synchronization primitives are concerned with, as their name suggests, how memory in cache and in RAM is synchronized across the cores. So, if every core has its own L1 cache, and they are all looking at the same memory address, then every core has a unique copy of that memory. If one core updates that memory with no fences or atomic instructions, the other cores won't see the update until it is (eventually, but who knows when) written from cache back to RAM, then read from RAM back into the other cores' caches.

Again, processors can only reason about memory in cache-line-sized chunks. Even if the operations on memory are atomic with the right fences, the processor still needs to copy that memory around between all the cores' caches at all the correct levels to ensure the update is seen by the reading cores. All that copying around of cache lines isn't free. So you see why keeping atomics on separate cache lines is good for keeping contention down?
It prevents false sharing by ensuring an atomic is only in the cache of a core actually using it. Typing on a phone is hard and I'm sure autocorrect has made a fool of me somewhere. TL;DR: this stuff is everywhere and hardware has mostly converged. If you want to make things that go fast, you need to learn it.
|
bamccaig
Member #7,536
July 2006
|
The atomics didn't concern me. Especially the C++ types/library stuff. I'd expect that to work. It's predominantly the alignment which I feel like I cannot rely on or defend. Mind you, I predominantly develop C# applications so professionally this is too low-level to be of use anyway. I doubt it translates directly to "managed" code. But I still love C and C++ and am interested in learning what I can. Side note: typing on phones sucks. I hate using my phone for anything other than texting and calling (and gaming while I poop). It's terrible at Web stuff. -- acc.js | al4anim - Allegro 4 Animation library | Allegro 5 VS/NuGet Guide | Allegro.cc Mockup | Allegro.cc <code> Tag | Allegro 4 Timer Example (w/ Semaphores) | Allegro 5 "Winpkg" (MSVC readme) | Bambot | Blog | C++ STL Container Flowchart | Castopulence Software | Check Return Values | Derail? | Is This A Discussion? Flow Chart | Filesystem Hierarchy Standard | Clean Code Talks - Global State and Singletons | How To Use Header Files | GNU/Linux (Debian, Fedora, Gentoo) | rot (rot13, rot47, rotN) | Streaming |
Edgar Reynaldo
Major Reynaldo
May 2007
|
bambams - learn to use the microphone on your phone; it can read speech just fine. Relpatseht: What I don't understand are fences and the .load function. I also don't quite understand how you manage to implement an 'spsc' queue using atomic operations. My Website! | EAGLE GUI Library Demos | My Deviant Art Gallery | Spiraloid Preview | A4 FontMaker | Skyline! (Missile Defense) Eagle and Allegro 5 binaries | Older Allegro 4 and 5 binaries | Allegro 5 compile guide |
bamccaig
Member #7,536
July 2006
|
Edgar Reynaldo said: bambams - learn to use the microphone on your phone, it can read speech just fine. Yes, it's quite good. I use it sometimes. But it's not private unless you're in private, and it's a bit slower to get to. Agreed, the conversation here is awesome right now. Stay on topic for once! -- acc.js | al4anim - Allegro 4 Animation library | Allegro 5 VS/NuGet Guide | Allegro.cc Mockup | Allegro.cc <code> Tag | Allegro 4 Timer Example (w/ Semaphores) | Allegro 5 "Winpkg" (MSVC readme) | Bambot | Blog | C++ STL Container Flowchart | Castopulence Software | Check Return Values | Derail? | Is This A Discussion? Flow Chart | Filesystem Hierarchy Standard | Clean Code Talks - Global State and Singletons | How To Use Header Files | GNU/Linux (Debian, Fedora, Gentoo) | rot (rot13, rot47, rotN) | Streaming |
relpatseht
Member #5,034
September 2004
|
I realize my earlier characterization of acquire/release being the same as sequentially consistent on x86/x64 was incorrect. On x86/x64 systems, there are very strong cache coherency rules. A memory write on any core must be visible to all other cores. As a result, the acquire and release fences are only compiler fences: they keep the compiler from reordering loads and stores around them, respectively. They are implemented via simple MOV instructions. Sequentially consistent, on the other hand, still requires the heavier XCHG instruction. Simplified, only one XCHG instruction is executed at a time across the whole processor (thus, sequentially consistent). ARM, for instance, doesn't have such strong cache coherency rules, so the barriers are actual instructions emitted by the compiler to tell the processor that a particular address needs to be made coherent across cores.

As for atomics, they are not a special type. They just instruct the compiler to handle the value specially, in that it is never cached in registers. Atomics are always accessed via memory. You don't need to use .load and .store with atomics; you can just use the = operator, but then all operations will be sequentially consistent, which is slower. For instance, consider the following:

```cpp
#include <cstdint>
#include <thread>

void ThreadProc(void *userData)
{
    uint32_t * const foo = reinterpret_cast<uint32_t*>(userData);

    while (*foo == 0)
        std::this_thread::yield();
}

int main()
{
    uint32_t bar = 0;
    std::thread waitThread(ThreadProc, &bar);

    bar = 1;
    waitThread.join();
}
```
It is very possible for this code to never terminate. What could easily happen is that foo is kept in a register and the register is never updated, making that while loop go forever. foo here is just a plain variable; it isn't special to the compiler in any way, so there is no reason for the compiler to emit code that re-reads the value from memory every time. To the compiler, without the atomic intrinsics, there is no such thing as other threads or processes magically updating the memory behind the scenes. "But isn't that what the 'volatile' keyword is for?" volatile does force the compiler to re-read from memory, but it provides no atomicity and no ordering guarantees between threads, so it isn't a substitute for atomics.

As for the SPSC queue, remember, it only works because it's SPSC: single producer, single consumer. There can only be one thread calling try_pop, and only one thread calling try_push. Every acquire operation is paired up with a release operation. In try_pop, when we acquire the tail, we're guaranteed the memory stores up to and including writing nextTail to tail in try_push are finished. Vice versa is true for head.
|
|