After high expectations and a lot of leaks, NVIDIA finally released its next-generation video cards, the GeForce RTX 30 series, this morning. The series is based on the gaming and graphics variant of the NVIDIA Ampere architecture and is built on an optimized version of Samsung’s 8nm process. NVIDIA says the new cards offer significant improvements in gaming performance, and the latest generation of GeForce also brings some new features to further differentiate these cards from NVIDIA’s Turing-based RTX 20 series.

The first three cards of NVIDIA’s newly released RTX 30 series are the RTX 3090, RTX 3080, and RTX 3070, all set to launch within the next month and a half. Leading the pack are the RTX 3090 and RTX 3080, which serve as the successors to the GeForce RTX 2080 Ti and RTX 2080/2080 Super respectively, and which set new highs in graphics performance. Of course, the price of the RTX 3090 also hits an all-time high.

The first graphics card to hit the market will be the GeForce RTX 3080. NVIDIA says the new card, which doubles the performance of the previous-generation RTX 2080, will be available on September 17 for $700. A week later, the more powerful GeForce RTX 3090 will be available on September 24 for $1,500. And the RTX 3070, positioned as a more traditional sweet-spot card, will be available next month for $499.

Ampere Architecture in Gaming: GA102

As NVIDIA has done in the past, this morning’s public presentation was not a deep dive into the architecture. Instead, NVIDIA stuck to its well-honed launch formula: lots of demos, testimonials, and promotional videos, plus an overview of several of the technical and engineering design decisions behind the latest generation of GPUs. The end result is that we have a decent first look at the RTX 30 series, but we’ll have to wait for NVIDIA’s in-depth technical briefings to get a fuller understanding.

The Ampere architecture and the GA102 GPU used in the top cards bring several major hardware improvements to NVIDIA’s product line. The biggest of these is the ever-shrinking transistor size, courtesy of a custom version of Samsung’s 8nm process. We have limited information on this process since it has seen little use so far, but at a high level it is Samsung’s densest traditional (non-EUV) process, derived from their earlier 10nm process.

All in all, NVIDIA has been a bit late in adopting smaller process nodes, but since the company has built its strategy around delivering large GPUs first, it needs higher wafer yields (fewer defects) to get those chips to market.

For NVIDIA’s products, Samsung’s 8nm process is a full upgrade over their previous process, TSMC’s 12nm “FFN”, which was itself an optimized version of TSMC’s 16nm process. As a result, the new process allows NVIDIA to increase transistor density significantly.

In the case of the GA102, that translates into 28 billion transistors, which in turn shows up as a much larger pool of CUDA cores and other hardware. Whereas same-node generations such as Turing and Maxwell had to find most of their gains at the architectural level, Ampere (like Pascal before it) also benefits from the improvements in the lithography process. The only catch is that Dennard scaling is dead and not coming back. So while NVIDIA can pack more transistors into a chip than ever before, power consumption is climbing along with it, and that is reflected in the graphics cards’ TDPs.

NVIDIA didn’t provide us with a specific die size for the GA102, but based on some photos, we’re confident enough to believe it will exceed 500mm². It’s considerably smaller than the 754mm² TU102, but it’s still a pretty large chip, and one of the largest Samsung has ever produced.

Moving on, let’s talk about the Ampere architecture itself. Launched this spring as part of the NVIDIA A100 accelerator, we’ve only seen Ampere from a compute-oriented perspective until now. The GA100 omits several graphics features so that NVIDIA can maximize the die space allocated to compute, so while graphics-focused Ampere GPUs like the GA102 are still members of the Ampere family, there are plenty of differences between the two. As a result, NVIDIA has been able to keep Ampere’s gaming capabilities something of a mystery until now.

Much as compute-focused Ampere looked like an evolution of Volta, graphics-focused Ampere is an evolution of what came before: the GA102 doesn’t introduce any new types of functional blocks alongside its RT cores and tensor cores, but those blocks’ capabilities and relative sizes have been adjusted. The most notable change here is that, like the GA100, gaming Ampere inherits NVIDIA’s updated and more powerful tensor cores, which NVIDIA refers to as its third-generation tensor core design. A single Ampere SM can provide twice the tensor throughput of a Turing SM, albeit with half the number of tensor cores, and NVIDIA appears to have kept this basic configuration for the GA102. The net result is that NVIDIA’s FP16 tensor core performance is more than double that of the previous generation.
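To put some rough numbers on that claim (the per-core rates below are assumptions for illustration, not published GA102 specifications), halving the tensor core count while doubling per-SM throughput implies that each new core handles four times the FMA work per clock:

```python
# Illustrative per-SM tensor math; per-core FMA rates are assumed, not official GA102 specs
turing_sm = 8 * 64     # 8 tensor cores x 64 FP16 FMAs/clock = 512 FMAs/clock per SM
ampere_sm = 4 * 256    # half the cores, with the 4x per-core rate implied by the "2x per SM" claim
print(ampere_sm / turing_sm)   # 2.0 -> twice the tensor throughput per SM
```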

Meanwhile, NVIDIA has confirmed that the tensor cores used in the GA102 and other graphics-focused Ampere GPUs also support sparsity for higher performance, meaning NVIDIA has not stepped back on tensor core capabilities. Overall, this focus on tensor core performance underscores NVIDIA’s commitment to deep learning and AI, as the company sees deep learning as a driver not only of its data center business, but of its gaming business as well. We only have to look at NVIDIA’s Deep Learning Super Sampling (DLSS) technology to see why: DLSS relies in part on tensor cores to deliver as much performance as possible, and NVIDIA is still finding more ways to put its tensor cores to work.

The ray tracing (RT) cores have also been enhanced, although we’re not yet sure to what extent. Beyond the GA102 simply having more SMs (and thus more RT cores overall), the individual RT cores are said to be up to 2x faster, with NVIDIA apparently citing ray/triangle intersection performance specifically. NVIDIA’s presentation slides also made brief mention of RT core concurrency, but the company didn’t go into real detail on the topic in the short presentation, so we’re waiting on the technical briefing for more.

Overall, faster RT cores are good news for the gaming industry’s ray tracing ambitions, as ray tracing carries a heavy performance cost on the RTX 20-series cards. That said, nothing NVIDIA does will completely eliminate that cost; ray tracing is simply hard work that takes time, but more (and rebalanced) hardware can help keep the cost in check.

Last but not least, let’s talk about the shader cores. This is the area that matters most for gaming performance, and the one NVIDIA said the least about today. We do know that the new RTX 30-series cards pack a surprising number of FP32 CUDA cores, thanks to what NVIDIA is labeling “2x FP32” in its SM configurations. As a result, even the RTX 3080 delivers 29.8 TFLOPS of FP32 shader performance, more than double that of the previous-generation RTX 2080 Ti. In short, there is a surprising amount of ALU hardware in these GPUs, and quite frankly a lot more than I would have expected given the transistor count.
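That 29.8 TFLOPS figure is easy to sanity-check: FP32 throughput is simply the CUDA core count, times two operations per fused multiply-add, times the clock speed. Plugging in the announced RTX 3080 figures of 8,704 FP32 CUDA cores and a boost clock of roughly 1.71GHz:

```python
# FP32 throughput = CUDA cores x 2 ops per FMA x clock speed
cuda_cores = 8704        # announced RTX 3080 FP32 CUDA core count
boost_clock_hz = 1.71e9  # ~1.71 GHz boost clock
tflops = cuda_cores * 2 * boost_clock_hz / 1e12
print(f"{tflops:.1f} TFLOPS")  # ~29.8 TFLOPS, matching NVIDIA's figure
```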

Of course, shading performance isn’t everything, which is why NVIDIA’s own performance claims for these cards don’t scale as steeply as the shader throughput alone. Still, given the embarrassingly parallel nature of computer graphics, shaders are bound to be the bottleneck much of the time, which is why throwing more hardware (in this case, more CUDA cores) at the problem is an effective strategy.

The main question at this point is how these additional CUDA cores are organized and what that means for the execution model within an SM. We’re admittedly deep into technical minutiae here, but how easily Ampere can keep all of these extra cores fed will be a key factor in how well it can translate all of those teraFLOPs into performance. Is the extra throughput coming from extracting more IPC within a warp of threads, or from running additional warps?

As a final note, while we’re waiting for more technical information on the new cards, it’s worth pointing out that neither NVIDIA’s spec sheets nor its other materials mention any other new graphics capabilities. To its credit, Turing was already a step ahead, offering the features that would become DirectX 12 Ultimate / feature level 12_2 two years early, sooner than any other vendor. So, as Microsoft and others play catch-up, NVIDIA doesn’t have a higher feature level to immediately pursue. Still, it would be unusual for NVIDIA not to pull a new graphics feature or two out of its well-worn hat.

I/O: PCI Express 4.0, SLI and RTX IO

NVIDIA’s move to Ampere for its GeForce cards also brings Ampere’s improved I/O capabilities to the consumer market. Nothing here is groundbreaking, but everything here helps keep NVIDIA’s latest generation of graphics cards current.

On the I/O front, the headline addition is support for PCI Express 4.0. This was already introduced on NVIDIA’s A100 accelerator, so its inclusion here is expected, but it is still NVIDIA’s first PCIe bandwidth increase since the GTX 680 introduced PCIe 3.0 eight years ago. With a full PCIe 4.0 x16 link, the RTX 30-series cards get 32GB/s of I/O bandwidth in each direction, twice that of the RTX 20-series cards.
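That 32GB/s figure falls straight out of the PCIe 4.0 signaling math: 16GT/s per lane with 128b/130b encoding, across 16 lanes, per direction. A quick back-of-the-envelope check:

```python
# PCIe 4.0 x16 bandwidth, per direction
transfer_rate_gt = 16.0            # GT/s per lane for PCIe 4.0
encoding_efficiency = 128 / 130    # 128b/130b line encoding
lanes = 16
gbps_per_lane = transfer_rate_gt * encoding_efficiency   # ~15.75 Gbit/s per lane
total_gbytes = gbps_per_lane * lanes / 8                  # convert to GB/s
print(f"{total_gbytes:.1f} GB/s per direction")           # ~31.5 GB/s, i.e. the quoted 32GB/s
```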

As for the performance impact of PCIe 4.0, we don’t currently expect much of a difference, as there is very little evidence that Turing cards are limited by PCIe 3.0 speeds; indeed, PCIe 3.0 x8 is sufficient in most cases. Ampere’s higher performance will undoubtedly increase the demand for bandwidth, but not by much. That’s probably why even NVIDIA isn’t promoting PCIe 4.0 support very heavily (although being second to AMD here may be a factor as well).

Meanwhile, it appears that SLI support will persist for at least one more generation. NVIDIA’s RTX 3090 cards include an NVLink connector for SLI and other multi-GPU uses, so multi-GPU rendering lives on, even if it’s rarely used these days. NVIDIA’s presentation today didn’t go into any further detail on the feature, but it’s worth noting that the Ampere architecture introduces NVLink 3; if NVIDIA is using it for the RTX 3090, the card’s NVLink bandwidth could be twice that of the previous-generation RTX 2080 Ti, at 100GB/s in each direction.

Overall, I suspect the inclusion of an NVLink connector on the RTX 3090 is more of a play for compute users, many of whom would be quite comfortable with a fast consumer-grade card carrying 24GB of VRAM, given that VRAM capacity is an important constraint for advanced deep learning models. NVIDIA will never pass up an opportunity to upsell, though.

Finally, alongside the RTX 30 series, NVIDIA also announced a new suite of I/O features called RTX IO. At a high level, this appears to be NVIDIA’s implementation of Microsoft’s upcoming DirectStorage API, which, as on the soon-to-launch Xbox Series X console, allows assets to be streamed asynchronously from storage to the GPU over a much more direct path. By bypassing the CPU for most of this work, DirectStorage (and by extension RTX IO) can improve both I/O latency and GPU throughput by letting the GPU fetch the resources it needs more directly.

Besides Microsoft providing a standardized API for the technology, the most important addition here is the ability of Ampere GPUs to decompress assets directly. Game assets are generally stored compressed – lest Flight Simulator 2020 take up even more SSD space – and today it’s the CPU’s job to decompress those assets into something the GPU can use. Offloading that work not only frees the CPU for other tasks, but by cutting out the middleman entirely it helps improve asset streaming performance and game load times.

Pragmatically speaking, we already know this technology is coming to the Xbox Series X and PlayStation 5, so this is largely a matter of Microsoft and NVIDIA keeping the PC on par with the next-generation consoles. It does, however, require some real hardware improvements on the GPU side to handle all of these I/O requests and to efficiently decompress the various asset formats.

Ampere Power Efficiency Improvements: 1.9x? Maybe Not

Aside from overall video card performance, NVIDIA’s second technology pillar is overall power efficiency. Power efficiency is a cornerstone of GPU design, as graphics workloads are embarrassingly parallel and GPU performance is ultimately limited by total power consumption. Accordingly, power efficiency gets attention at every GPU launch, and NVIDIA gave it some airtime for the RTX 30 series as well.

Overall, NVIDIA claims that Ampere is 1.9x more power efficient. That would actually be a surprising figure for a full node jump in the post-Dennard era. Mind you, it’s by no means impossible, but it’s far more than the improvement NVIDIA got going from Pascal to Turing.

However, the deeper you dig into NVIDIA’s claim, the more generous that 1.9x figure starts to look.

The immediate oddity here is that NVIDIA is measuring power efficiency at a fixed level of performance rather than at a fixed level of power consumption. Because transistor power dissipation increases roughly with the square of the voltage, a “wider” part like Ampere, with more functional blocks, can be clocked at a lower frequency and still match Turing’s overall performance. Essentially, this chart compares the worst Turing result against the best Ampere result; the question being answered is “how little power does Ampere need to match Turing?” rather than “how much faster is Ampere than Turing under the same power constraints?”.

In other words, at a specific power consumption, NVIDIA’s graphs don’t show us a direct performance comparison.
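To see why an iso-performance comparison flatters the wider design, consider the classic dynamic power relation P ≈ C·V²·f: a chip with more functional units can hit the same throughput at a lower clock and voltage, and the voltage term is squared. A rough sketch with purely hypothetical numbers:

```python
# Hypothetical numbers, purely to illustrate P ~ C * V^2 * f scaling
def relative_power(units, clock_ghz, voltage):
    # effective switched capacitance scales with the number of active units
    return units * voltage**2 * clock_ghz

narrow = relative_power(units=1.0, clock_ghz=1.8, voltage=1.00)  # fewer units, high clock
wide   = relative_power(units=1.5, clock_ghz=1.2, voltage=0.80)  # 50% more units, downclocked
# Same throughput either way (1.0 * 1.8 == 1.5 * 1.2), but the wide design
# draws ~36% less power thanks to the lower voltage and clock.
print(wide / narrow)   # ~0.64
```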

If you actually did a fixed-power comparison, Ampere wouldn’t look as good as it does in NVIDIA’s chart. In this example, Turing hits 60fps at 240W, while at that same 240W Ampere’s curve sits at around 90fps. That’s still a sizable improvement, to be sure, but it works out to only a 50% increase in performance per watt. Ultimately, the exact efficiency gain depends on where you sample the graph, but it’s clear that by more conventional metrics, NVIDIA’s power efficiency improvement with Ampere is nowhere near the 90% claimed on its slides.
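Reading NVIDIA’s curve at a fixed 240W rather than at fixed performance, the arithmetic is straightforward:

```python
# Iso-power comparison using the figures visible on NVIDIA's chart
power_w = 240
turing_fps = 60
ampere_fps = 90
perf_per_watt_gain = (ampere_fps / power_w) / (turing_fps / power_w) - 1
print(f"{perf_per_watt_gain:.0%}")  # 50% -- a far cry from the quoted 1.9x
```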

All of this is reflected in the TDPs of the new RTX 30-series cards. The RTX 3090 draws a whopping 350W, and even the RTX 3080 draws 320W. If we take NVIDIA’s performance claims at face value, the RTX 3080 offers 100% more performance than the RTX 2080 while consuming 49% more power, which works out to an effective performance-per-watt increase of only 34%. The RTX 3090 comparison is even harsher: NVIDIA claims 50% more performance for 25% more power, a net power efficiency gain of just 20%.
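Running the same math on NVIDIA’s own claims and TDP figures gives similarly modest results:

```python
# Performance-per-watt from NVIDIA's claimed speedups and power increases
rtx3080_gain = 2.00 / 1.49 - 1   # +100% performance for +49% power
rtx3090_gain = 1.50 / 1.25 - 1   # +50% performance for +25% power
print(f"RTX 3080 vs 2080:    {rtx3080_gain:.0%}")  # ~34%
print(f"RTX 3090 vs 2080 Ti: {rtx3090_gain:.0%}")  # ~20%
```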

Ultimately, it’s clear that most of the performance gains NVIDIA is getting out of the Ampere generation come from raising the power limits. With 28 billion transistors, these cards will certainly be faster, but it’s going to take more power than ever to light them all up.

GDDR6X with PAM4 Support

In addition to the core GPU architecture itself, the GA102 introduces support for another new memory type: GDDR6X. An evolution of GDDR6 developed by Micron and NVIDIA, GDDR6X is designed to achieve higher memory bus speeds (and therefore greater memory bandwidth) by using multi-level signaling on the memory bus. By adopting this strategy, NVIDIA and Micron can continue to push cost-effective discrete memory technology to keep up with the demands of NVIDIA’s latest GPUs. This marks NVIDIA’s third new memory technology in as many generations, going from GDDR5X to GDDR6 and now GDDR6X.

When Micron released some early technical documents on the technology last month, it explained that by employing four-level Pulse Amplitude Modulation (PAM4), GDDR6X is able to transmit one of four different symbols per clock, essentially moving two bits per clock instead of the usual one. For the sake of brevity I won’t fully recap that discussion here, but I’ll hit the highlights.

At a very high level, PAM4 differs from NRZ (binary coding) by doubling the number of electrical states a single transmission can represent. Rather than the traditional 0/1 high/low signaling, PAM4 uses four signal levels, so a single transmission can encode one of four two-bit patterns: 00/01/10/11. In this way, PAM4 carries twice as much data as NRZ without having to double the signaling rate, which would be a far greater challenge.
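Here’s a minimal sketch of the idea, mapping two-bit patterns onto four signal levels; the levels are normalized placeholders rather than Micron’s actual electrical specification:

```python
# Conceptual PAM4 mapping: four signal levels carry two bits per symbol.
# Levels are normalized placeholders, not the real GDDR6X electrical spec.
PAM4_LEVELS = {"00": 0.0, "01": 1/3, "10": 2/3, "11": 1.0}

def pam4_encode(bits):
    """Pack a bit string into one PAM4 symbol (level) per two bits."""
    assert len(bits) % 2 == 0
    return [PAM4_LEVELS[bits[i:i+2]] for i in range(0, len(bits), 2)]

# 8 bits -> 4 symbols with PAM4, versus 8 symbols with NRZ (one bit each)
print(pam4_encode("01101100"))   # four symbols: [1/3, 2/3, 1.0, 0.0]
```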

PAM4, in turn, requires more complex memory controllers and memory devices to handle the multiple signal states, but it also allows a lower memory bus frequency, which simplifies other aspects of the design. Probably the most important point for NVIDIA is that it’s more power efficient, at roughly 15% lower power per bit of bandwidth. To be sure, total DRAM power consumption is still going up, as the per-bit savings are more than offset by the overall increase in bandwidth, but every joule saved on DRAM is a joule that can be spent elsewhere on the GPU.

According to Micron’s documentation, the company designed first-generation GDDR6X to reach 21Gbps. NVIDIA is being a little conservative here, however, running the RTX 3090 at 19.5Gbps and the RTX 3080 at 19Gbps. Even at these speeds, memory bandwidth on a like-for-like bus width is still 36%-39% higher than on the previous-generation cards. Historically we haven’t typically seen successive product generations make such large jumps in memory bandwidth, and with many more SMs to feed, I can only imagine that NVIDIA’s product teams are happy to have it.
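For reference, the 36%-39% figure follows directly from the per-pin data rates, and total bandwidth is simply the data rate times the bus width divided by eight (the previous-generation cards ran 14Gbps GDDR6):

```python
def mem_bandwidth_gbs(data_rate_gbps, bus_width_bits):
    """Total memory bandwidth in GB/s: per-pin data rate x bus width / 8."""
    return data_rate_gbps * bus_width_bits / 8

print(mem_bandwidth_gbs(19.5, 384))   # RTX 3090: 936 GB/s
print(mem_bandwidth_gbs(19.0, 320))   # RTX 3080: 760 GB/s
# Per-pin uplift over the previous generation's 14Gbps GDDR6:
print(19.0 / 14.0 - 1, 19.5 / 14.0 - 1)   # ~0.36 and ~0.39 -> the 36%-39% figure
```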

However, GDDR6X does have one glaring disadvantage: capacity.

Although Micron plans to develop 16Gbit chips down the line, for now it is only producing 8Gbit chips. That’s the same density as the memory chips found on NVIDIA’s RTX 20-series and GTX 10-series cards, so at least for these cards there is no “free” capacity upgrade. The RTX 3080 gets 10GB of VRAM versus the RTX 2080’s 8GB only because it uses a larger 320-bit memory bus (i.e. 10 chips instead of 8). Meanwhile, the RTX 3090 gets 24GB of VRAM, but only by running 12 pairs of chips in clamshell mode on its 384-bit memory bus, giving it more than twice as many memory chips as the RTX 2080 Ti.
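The capacities follow directly from the chip count, since each 8Gbit chip holds 1GB:

```python
# VRAM capacity = number of chips x chip density (8 Gbit = 1 GB per chip)
GBIT_PER_CHIP = 8

def vram_gb(num_chips, gbit_per_chip=GBIT_PER_CHIP):
    return num_chips * gbit_per_chip / 8

print(vram_gb(10))   # RTX 3080: 10 chips on a 320-bit bus -> 10 GB
print(vram_gb(24))   # RTX 3090: 12 chip pairs (clamshell) on a 384-bit bus -> 24 GB
```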

HDMI 2.1 and AV1 In, VirtualLink Out

Finally, on the display I/O front, Ampere and the new GeForce RTX 30-series cards bring several notable changes. At the top of the list, they finally get HDMI 2.1 support. HDMI 2.1 has already shipped in TVs (and will ship in this year’s consoles), and it brings a few features to the table, most notably support for much greater cable bandwidth.

HDMI 2.1 cabling can carry up to 48Gbps of data, more than 2.6 times the bandwidth of HDMI 2.0, enabling higher display resolutions and refresh rates, such as 8K TVs or 4K monitors running above 165Hz. The leap in bandwidth even puts HDMI ahead of DisplayPort for the moment: DisplayPort 1.4 offers only about two-thirds as much bandwidth. DisplayPort 2.0 will eventually retake the lead, but Ampere has arrived too early to adopt that technology.
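As a rough illustration of why the extra bandwidth matters (ignoring blanking intervals and link-layer overhead, so real-world requirements run somewhat higher), the raw pixel data alone for a 4K monitor at 165Hz with 8-bit color already blows past HDMI 2.0’s 18Gbps:

```python
# Raw (uncompressed) pixel bandwidth, ignoring blanking and link overhead
def pixel_gbps(width, height, refresh_hz, bits_per_pixel):
    return width * height * refresh_hz * bits_per_pixel / 1e9

uhd_165 = pixel_gbps(3840, 2160, 165, 24)   # 8-bit RGB
print(f"{uhd_165:.1f} Gbps")                # ~32.8 Gbps: beyond HDMI 2.0's 18 Gbps,
                                            # but within HDMI 2.1's 48 Gbps link
```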

All that said, I’m still waiting for NVIDIA to confirm whether its new GeForce cards support the full 48Gbps signaling rate. Some HDMI 2.1 TVs have already shipped with support for only lower data rates, so it’s not unthinkable that NVIDIA is doing something similar here.

From a gaming standpoint, the other big addition with HDMI 2.1 is support for variable refresh rates over HDMI. This feature isn’t strictly tied to HDMI 2.1, however, and it was already backported to NVIDIA’s RTX 20-series cards; so while the extra cable bandwidth makes it more useful here, it technically isn’t a new feature for NVIDIA’s cards.

Meanwhile, the VirtualLink port introduced on the RTX 20-series cards is on its way out. The industry’s attempt to build a single port carrying video, data, and power for VR headsets has fizzled, and none of the three major VR headset manufacturers (Oculus, HTC, Valve) ever adopted the port. So you won’t find it on the RTX 30-series cards.

Finally, while we’re on the subject of video, NVIDIA also confirmed that the new Ampere GPUs include an updated version of its NVDEC video decode block. The chipmaker has bumped the block up to what it calls Gen 5, adding decode support for the new AV1 video codec.

The up-and-coming royalty-free codec is widely expected to become the de facto successor to H.264/AVC. Although HEVC has been on the market for many years (and is supported by all recent GPUs), the madcap royalty situation around that codec has not been conducive to its adoption. AV1, by contrast, should deliver quality similar to or slightly better than HEVC without the royalties, making it more attractive to content distributors. One downside of AV1 to date has been its heavy CPU decode load, even on high-end desktops, so hardware decoding support is important for avoiding hogging CPU resources and ensuring smooth, glitch-free playback.

NVIDIA doesn’t go into the details of its AV1 support here, but another blog post mentioned 10-bit color support and 8K decoding, so it sounds like NVIDIA has the basics covered.

Meanwhile, there is no mention of any further improvements to the company’s NVENC encoder block. That block was last updated for the Turing launch, expanding NVIDIA’s HEVC encoding capabilities and improving overall HEVC and H.264 image quality. Otherwise, it’s still too early for hardware AV1 encoding, as some unique properties of the codec make hardware encoding a harder nut to crack.

 
