Arm Mali-G77 GPU: the complete ins and outs

Alongside the new Cortex-A77 CPU core, Arm has unveiled a next-generation GPU aimed at the next generation of smartphones. The Mali-G77, not to be confused with the new Mali-D77 display processor, marks the departure from Arm's Bifrost architecture and the transition to Valhall.

We'll get to the fine details of the new architecture in a moment. First, let's jump right into what users should expect in terms of performance gains.

Mali-G77 performance review

Arm boasts a 40% graphics performance improvement for upcoming Mali-G77 devices compared to today's Mali-G76 models. This number takes into account process as well as architectural improvements. The Mali-G77 is configurable from 7 to 16 shader cores, and each core is almost exactly the same size as a G76 core. This means that high-end smartphones will likely ship with similar GPU core counts as they do today, somewhere around 10 to 12. Conveniently, that allows us to make some speculative performance estimates against existing chipsets.

Looking at the popular GFXBench Manhattan benchmark, a 40% performance boost offers a significant advantage over the current generation of hardware. Qualcomm's next-generation Adreno GPU will need its own significant performance upgrade to keep the playing field level. The tables seem to be turning in Arm's favor.

Architecture-wise, gaming performance increases by 20 to 40%, while machine learning earns a 60% boost

Based on this rather rough ballpark, a 10-core Mali-G77 (a configuration we regularly see from Huawei) looks to just about match this generation's top-of-the-line mobile graphics hardware. A 12-core configuration, typically seen in Samsung's Exynos, offers a sizable lead for Arm's latest GPU. Naturally, real-world benchmarks will depend on other factors, including the process node, GPU cache memory, LPDDR memory configuration, and the type of application you're testing. So take the above chart with a healthy dose of salt.

In terms of the new architecture alone, Arm states that the Mali-G77 delivers an average 30% improvement in both energy efficiency and performance density. There's also a massive 60 percent boost for machine learning applications, thanks to INT8 dot product support. Gaming performance is expected to improve by somewhere between 20 and 40 percent, depending on the title and type of graphics workload.

To understand exactly how Arm has achieved this performance boost, let's take a deeper dive into the architecture.

Meet Valhall, Bifrost's successor

Valhall is Arm's second-generation scalar GPU architecture. It is built around a 16-wide warp execution engine, which essentially means that the GPU executes 16 instructions in parallel per cycle, per processing unit, per core. That is up from 4 and 8 wide in Bifrost.
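
To make the warp widths concrete, here is a minimal sketch of how a batch of shader threads would be grouped into warps at each of the widths mentioned above. The function name and the 1,000-thread workload are illustrative assumptions, not figures from Arm, and real hardware scheduling involves far more than this division.

```python
import math

def warps_needed(threads: int, warp_width: int) -> int:
    """Number of warps required to cover `threads` at a given warp width."""
    return math.ceil(threads / warp_width)

# Hypothetical workload size, chosen only for illustration.
threads = 1000

# Warp widths as quoted in the article: 4 and 8 for Bifrost, 16 for Valhall.
for name, width in [("Bifrost, 4-wide", 4), ("Bifrost, 8-wide", 8), ("Valhall, 16-wide", 16)]:
    print(f"{name}: {warps_needed(threads, width)} warps")
```

A wider warp means fewer warps to track for the same amount of work, though the last warp may run partly empty when the thread count is not a multiple of the width.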

Other new architectural features include dynamic instruction scheduling managed in hardware and an all-new instruction set that maintains operational equivalence with Bifrost. Others include support for Arm's AFBC1.3 compression format, FP16 render targets, layered rendering, and vertex shader outputs.

The Mali-G77 performs 33% more math in parallel than the G76.

The key to understanding the most important architectural changes is found by examining the execution unit within the core. This part of the GPU is responsible for the number crunching.

Within the execution engine

In Bifrost, each GPU core contains three execution engines, or two in the case of lower-end Mali-G52 designs. Each engine contains an instruction cache, register file, and warp control unit. In the Mali-G72, each engine handles 4 instructions per cycle, which increased to 8 in last year's Mali-G76. Spread across the three engines, this allows for 12 and 24 32-bit floating point (FP32) fused multiply-accumulate (FMA) instructions per cycle, respectively.

With Valhall and the Mali-G77, there is only a single execution engine in each GPU core. As before, this engine houses the warp control unit, register file, and instruction cache, which are now shared across two processing units. Each processing unit handles 16 warp instructions per cycle, for a total throughput of 32 FP32 FMA instructions per cycle per core. That is a 33 percent boost over the Mali-G76.
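
The arithmetic behind these per-core figures can be checked with a short sketch. The unit counts and lane widths come from the text above; the function and variable names are just illustrative.

```python
def fma_per_core(units: int, lanes_per_unit: int) -> int:
    """FP32 FMA instructions per cycle for one GPU core."""
    return units * lanes_per_unit

g72 = fma_per_core(units=3, lanes_per_unit=4)   # three engines, 4-wide
g76 = fma_per_core(units=3, lanes_per_unit=8)   # three engines, 8-wide
g77 = fma_per_core(units=2, lanes_per_unit=16)  # two processing units, 16-wide

uplift = (g77 - g76) / g76
print(f"G72: {g72}, G76: {g76}, G77: {g77} FMA/cycle/core")
print(f"G77 over G76: {uplift:.0%}")  # 33%
```

Note that this is peak per-core throughput only. Delivered performance also depends on clock speed, core count, and memory behavior.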

Arm went from three to just one execution engine per GPU core, but now there are two processing units within a G77 core.

In addition, each of these processing units contains two new math function blocks. The new convert unit (CVT) handles basic integer, logic, branch, and conversion instructions. The special function unit (SFU) accelerates integer multiplication, division, square root, logarithms, and other complex integer functions.

The standard FMA unit has seen some tweaks too, and supports 16 FP32 instructions, 32 FP16 instructions, or 64 INT8 dot product instructions per cycle. These optimizations produce the 60 percent performance lift in machine learning applications.
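
As a rough illustration of why INT8 dot products help machine learning, the sketch below models a dot-product-and-accumulate operation in plain Python. The 4-element grouping and the function name are assumptions made for illustration, reflecting how such instructions commonly work; the article does not describe the exact Valhall instruction semantics.

```python
# Per-cycle rates of the FMA unit, as quoted in the article.
RATES_PER_CYCLE = {"FP32": 16, "FP16": 32, "INT8 dot product": 64}

def int8_dot4_accumulate(a, b, acc=0):
    """Dot product of two 4-element int8 vectors, added to a wider accumulator.

    Hypothetical model of a dot-product instruction, not Arm's definition.
    """
    if len(a) != 4 or len(b) != 4:
        raise ValueError("expected two 4-element vectors")
    for x in list(a) + list(b):
        if not -128 <= x <= 127:
            raise ValueError("values must fit in a signed 8-bit integer")
    return acc + sum(x * y for x, y in zip(a, b))

# Made-up quantized weights and activations.
weights = [12, -7, 100, -128]
activations = [3, 5, -2, 1]
print(int8_dot4_accumulate(weights, activations, acc=10))  # -317

for fmt, rate in RATES_PER_CYCLE.items():
    print(f"{fmt}: {rate / RATES_PER_CYCLE['FP32']:.0f}x the FP32 rate")
```

Neural network inference is dominated by this multiply-and-sum pattern over quantized values, so narrower data types let the same unit process more elements per cycle.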

The quad texture mapper

The other key change in the Mali-G77 is the introduction of a quad texture mapper, up from a dual texture mapper in the previous generation. The texture mapper is responsible for mapping 3D polygons onto the 2D scene that you see on screen. It handles sampling, interpolation, and filtering to smooth out angled and moving content and avoid jagged, low-quality edges.

Low-cost anti-aliasing remains in place to help with image quality, but the doubling of texture performance is of great benefit here. The texture unit now processes 4 bilinear texels per clock, up from 2 previously, or 2 trilinear texels per clock, and handles FP16 and FP32 filtering faster.
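
For readers unfamiliar with bilinear filtering, here is a minimal software sketch of the operation the texture unit performs in hardware. It is a generic textbook formulation with clamp-to-edge addressing chosen as an assumption, not a description of Arm's implementation, and the tiny 2x2 texture is made up for the example.

```python
import math

def bilinear_sample(texture, u, v):
    """Sample a 2D grid of floats at continuous texel coordinates (u, v)."""
    height, width = len(texture), len(texture[0])
    # Clamp coordinates to the texture edges (assumed addressing mode).
    u = min(max(u, 0.0), width - 1)
    v = min(max(v, 0.0), height - 1)
    # The four neighboring texels around the sample point.
    x0, y0 = int(math.floor(u)), int(math.floor(v))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    fx, fy = u - x0, v - y0
    # Blend horizontally along the top and bottom rows, then vertically.
    top = texture[y0][x0] * (1 - fx) + texture[y0][x1] * fx
    bottom = texture[y1][x0] * (1 - fx) + texture[y1][x1] * fx
    return top * (1 - fy) + bottom * fy

texture = [[0.0, 1.0],
           [2.0, 3.0]]
print(bilinear_sample(texture, 0.5, 0.5))  # 1.5
```

Trilinear filtering blends two such bilinear samples taken from adjacent mipmap levels, which is consistent with the trilinear rate being half the bilinear rate.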

The quad texture mapper is split into two paths, providing a shorter pipeline for hits, where the content is already in the cache. The miss path, which handles format conversion and texture decompression, has a wider interface to the L2 cache. This is also useful for machine learning workloads, which often need to fetch new data from memory.

Bringing everything together in the Mali-G77

Arm made a number of other tweaks to the Mali-G77 to go along with the major changes in the Valhall architecture. The control block is simplified thanks to the single execution engine design, while the internal dynamic scheduler enables more flexible instruction issue within each core. With higher throughput in each core, the data path is also shorter and lower in latency, down to just 4 cycles from 8 before.

The new design is also better aligned with the Vulkan API, simplifying driver descriptors to reduce driver overhead and improve performance.

In summary, the Mali-G77 and Valhall make significant changes to Bifrost, promising substantial performance improvements for gaming and machine learning applications. Importantly, the design fits within the same power and area budgets as Bifrost, ensuring mobile devices can deliver more peak performance without worrying about heat, power, and silicon costs. Based on the performance projections, the Mali-G77 should give Qualcomm's next-generation Adreno a good run for its money.