When it comes to instruction improvements, moving to a brand-new ground-up core enables a lot more flexibility in how instructions are processed compared to just a core update. As part of the microarchitecture deep-dive disclosures from AMD, we naturally get AMD's messaging on the improvements in this area: we were told of the highlights, such as the improved FMAC and the new AVX2/AVX256 expansions. Aside from adding new security functionality, being able to rearchitect the decoder/micro-op cache, the execution units, and the number of execution units allows for a variety of new features and, hopefully, faster throughput. There's also Control-Flow Enforcement Technology (CET), which enables a shadow stack to protect against ret/ROP attacks. However, after getting our hands on the chip, there's a trove of improvements to dive through.

The top-cover item is the improved fused multiply-accumulate (FMA), a frequently used operation in a number of high-performance compute workloads, as well as in machine learning, neural networks, scientific compute, and enterprise workloads. In Zen 2, a single FMA took 5 cycles with a throughput of 2/clock; in Zen 3, a single FMA takes 4 cycles with a throughput of 2/clock. This means that AMD's FMAs are now at parity with Intel's; however, this update is going to matter most in AMD's EPYC processors. Scaling this improvement up to the 64 cores of the current-generation EPYC Rome, any FMA-limited workload on Rome should see a healthy uplift on its Zen 3 successor, Milan. Combine that with the larger L3 cache and the improved load/store capability, and some workloads should expect good speed-ups.

The other main update is with cryptography and ciphers. In Zen 2, vector-based AES and PCLMULQDQ operations were limited to AVX / 128-bit execution, whereas in Zen 3 they are upgraded to AVX2 / 256-bit execution. This means that VAES has a latency of 4 cycles with a throughput of 2/clock, and VPCLMULQDQ has a latency of 4 cycles with a throughput of 0.5/clock.

AMD also mentioned, to a certain extent, that it has increased its ability to process repeated MOV instructions on short strings: what used to be not so good for short copies is now good for both small and large copies. We detected that the new core performs better REP MOV instruction elimination at the decode stage, leveraging the micro-op cache better.

Now here's the stuff that AMD didn't talk about. Sticking with instruction elimination, a lot of instructions and zeroing idioms that Zen 2 used to decode but then skip at execution are now detected and eliminated at the decode stage:

- (V)MOVAPS/MOVAPD/MOVUPS/MOVUPD vec1, vec1 : Move (Un)Aligned Packed FP32/FP64
- VANDNPS/VANDNPD vec1, vec1, vec1 : Vector bitwise logical AND NOT Packed FP32/FP64
- VXORPS/VXORPD vec1, vec1, vec1 : Vector bitwise logical XOR Packed FP32/FP64
- VPANDN/VPXOR vec1, vec1, vec1 : Vector bitwise logical (AND NOT)/XOR
- VPCMPGTB/W/D/Q vec1, vec1, vec1 : Vector compare packed integers greater than
- VPSUBB/W/D/Q vec1, vec1, vec1 : Vector subtract packed integers

It's worth highlighting those last two commands: comparing a register against itself or subtracting it from itself always yields zero, so both are common zeroing idioms.

*(Slide: Zen3 Updates (1))*

As for direct performance adjustments, we detected the following. Due to how AMD has arranged the branch predictors, software that helps the prefetchers can now have three prefetch commands processed per cycle. The other element is the introduction of a hardware accelerator for the parallel-bits instructions (PDEP/PEXT): latency is reduced by 99%, and throughput is up 250x. If anyone asks why we ever need extra transistors for modern CPUs, it's for things like this.

*(Slide: Zen3 Updates (3))*

For anyone using older mathematics software, it might be riddled with a lot of x87 code. x87 was originally meant to be an extension of x86 for floating-point operations, but given other improvements to the instruction set, x87 is somewhat deprecated, and we often see its performance regress generation on generation. But not on Zen 3. There are some regressions here too, yet among the regressions we're also seeing some improvements. The FADD and FMUL improvements mean the most here, but, as stated, using x87 is not recommended. So why is it even mentioned here? The answer lies in older software: software stacks built upon decades-old Fortran still use these instructions, more often than not in high-performance math codes. Increasing the throughput of FADD/FMUL should provide a good speed-up there.

**Vector Integers**

All of the vector integer improvements fall into two main categories.