Intel Haswell-EP Xeon E5-2600 v3 Server Family Processor Overview

By: William Harmon | Editorials in IT/Datacenter | Posted: Sep 8, 2014 12:30 pm

On-Die Interconnects Enhancements and Die Configurations




As core counts increase past 12 cores and more cache is added, a more efficient method of connecting all the cores becomes necessary. The E5-2600 v3 processors use new ring-style interconnects with bi-directional, buffered connections for both rings.




This die configuration chart shows how all cores can communicate with each other. These interconnects allow for faster core communication with a more direct link using dual rings with bi-directional interconnects. These interconnects are also buffered to improve performance.
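To make the benefit of a bi-directional ring concrete, the sketch below counts hops between two stops on a single ring. It is a simplified model, not Intel's routing logic: the stop count and the function name `ring_hops` are assumptions for illustration, and the buffered links between the two rings are not modeled.

```python
def ring_hops(src: int, dst: int, stops: int, bidirectional: bool = True) -> int:
    """Hops from stop `src` to stop `dst` on a ring with `stops` stops.

    Hypothetical model: every stop is one hop from each of its neighbors.
    """
    if stops <= 0:
        raise ValueError("stops must be positive")
    clockwise = (dst - src) % stops
    if not bidirectional:
        # A one-way ring can only travel clockwise.
        return clockwise
    # A bi-directional ring takes whichever direction is shorter.
    return min(clockwise, stops - clockwise)
```

On an assumed eight-stop ring, going from stop 0 to stop 7 takes seven hops one way but a single hop when traffic can travel in both directions.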



Integrated Voltage Regulators




Integrated voltage regulators (IVR) simplify platform power design: they reduce platform complexity by cutting the number of voltage rails and integrating control.


IVR also enables finer voltage and frequency granularity and faster transitions between power states, and it reduces board area to enable form factor optimizations.



Turbo and AVX Improvements


Intel Turbo Boost Technology 2.0 will automatically allow processor cores to run faster than the rated and AVX frequencies if they are operating below power, current, and temperature specification limits.


The frequency change with AVX workloads happens when the core detects an AVX instruction. These instructions draw more current, and a higher voltage is needed to sustain them.


The core signals the Power Control Unit (PCU) to provide more voltage. The core slows during execution of the AVX instructions, and because TDP limits must be maintained, the frequency may drop. The amount of the drop depends on the workload.


The PCU then signals that the voltage has been adjusted, and the cores return to full execution speed. The PCU returns to the regular (non-AVX) operating mode 1ms after the AVX instructions have completed. Turbo state limiting can decrease timing variability and power.
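The 1ms return to non-AVX operation can be pictured as a simple hold timer. The sketch below is only an illustration of that timing rule, not the PCU's actual algorithm; the function name `in_avx_mode` and the timestamp representation are assumptions.

```python
AVX_HOLD_MS = 1.0  # from the article: non-AVX mode resumes 1 ms after AVX work ends


def in_avx_mode(avx_times_ms, now_ms, hold_ms=AVX_HOLD_MS):
    """Return True if the core would still be in AVX mode at `now_ms`.

    `avx_times_ms` is a hypothetical list of timestamps (in ms) at which
    AVX instructions completed. The core is treated as being in AVX mode
    until `hold_ms` has elapsed since the most recent one.
    """
    past = [t for t in avx_times_ms if t <= now_ms]
    if not past:
        return False  # no AVX instruction seen yet
    return (now_ms - max(past)) < hold_ms
```

With AVX instructions at 0.0 ms and 0.4 ms, this model stays in AVX mode at 1.0 ms (0.6 ms since the last one) and has left it by 1.5 ms.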


Some HPC software requires limited thread-to-thread variability, and some cluster designers are concerned about turbo power surges. To avoid both, some sites disable turbo entirely. Turbo state limiting instead applies one uniform cap on the maximum number of turbo states for all cores. This provides a predictable range of thread variability and power risk, while still allowing some turbo performance benefit.
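A rough sketch of the idea follows. It assumes each turbo state is one 100 MHz step above the base frequency and uses made-up per-core requests; the names `capped_turbo_mhz` and `spread_mhz` are hypothetical and this is not how the PCU is programmed in practice.

```python
BIN_MHZ = 100  # assumed size of one turbo state (bin)


def capped_turbo_mhz(base_mhz, requested_bins, max_bins_cap):
    """Per-core frequencies after applying one uniform cap on turbo bins."""
    return [base_mhz + min(bins, max_bins_cap) * BIN_MHZ for bins in requested_bins]


def spread_mhz(freqs):
    """Difference between the fastest and slowest core."""
    return max(freqs) - min(freqs)


# Hypothetical example: 2300 MHz base, four cores asking for 0, 3, 7 and 5 bins.
requests = [0, 3, 7, 5]
uncapped = capped_turbo_mhz(2300, requests, max(requests))  # cap has no effect
capped = capped_turbo_mhz(2300, requests, 2)                # at most 2 bins per core
```

In this example the uncapped cores spread over 700 MHz, while a cap of two bins narrows the spread to 200 MHz and still leaves three of the four cores above base frequency.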



Cluster On Die (COD) mode




Cluster on Die (COD) is supported on one-socket and two-socket SKUs with two home agents (10+ cores).


COD reduces coherence traffic and cache-to-cache transfer latencies. It targets highly NUMA (non-uniform memory access) optimized workloads, where latency is more important than sharing across caching agents.


Each home agent has a ~14KB directory cache that is eight-way set associative, with 256 sets, and two sectors wide. Each entry stores an eight-bit presence vector that tracks which caching agents potentially own a copy of a cache line. Entries are allocated on cache-to-cache transfers, so the cache tracks hit-M, hit-E, and hit-S lines, which are the hotly contested cache lines.


The result is lower cache-to-cache transfer latencies, and reduced directory updates and reads of hotly contested lines. Snoop traffic is also reduced by sending directed snoops, rather than broadcasting them.
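The difference between directed and broadcast snoops can be sketched with an eight-bit presence vector. This is an illustrative model only: the agent numbering, function names, and the rule of skipping the requester are assumptions, not a description of the hardware protocol.

```python
NUM_AGENTS = 8  # matches the eight-bit presence vector described above


def snoop_targets(presence_vector, requester):
    """Caching agents to snoop: those whose bit is set, minus the requester."""
    if not 0 <= presence_vector < (1 << NUM_AGENTS):
        raise ValueError("presence vector must fit in eight bits")
    return [
        agent
        for agent in range(NUM_AGENTS)
        if (presence_vector >> agent) & 1 and agent != requester
    ]


def broadcast_targets(requester):
    """Without a presence vector, every other agent has to be snooped."""
    return [agent for agent in range(NUM_AGENTS) if agent != requester]
```

If only bits 2 and 5 are set, a request from agent 0 snoops two agents instead of all seven others.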



Virtualization (VT-x) Features


The new virtualization features include lower VM entry/exit latency, which reduces VMM overhead and increases overall virtualization performance.


VM control structure (VMCS) shadowing enables efficient nested VMM usages, such as manageability and VM protection. Extended Page Table (EPT) accessed/dirty bits enable efficient live migration and help software-managed fault-tolerant solutions. Cache occupancy is now monitored on a per-VM basis through Intel Cache Monitoring Technology (CMT). This utilization data allows VMM software to make better decisions on workload scheduling and migration.



Advanced Vector Extensions (AVX) 2.0




Advanced Vector Extensions (AVX) has also been updated to AVX2, which extends integer SIMD instructions to 256 bits (the original AVX already provided 256-bit floating-point SIMD). This allows integer code to process up to twice the amount of packed data with a single instruction.




AVX2 increases parallelism and throughput in SIMD calculations and reduces register load. This can be useful for compute-intensive work in multimedia, scientific, and financial applications, as well as image and signal processing and cryptology workloads.
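The "up to twice the packed data" figure follows from register width alone. The small sketch below computes how many elements fit in one SIMD register; the helper name `lanes` is hypothetical, and the code does not execute any AVX instructions.

```python
def lanes(register_bits, element_bits):
    """Number of packed elements that fit in one SIMD register."""
    if element_bits <= 0 or register_bits % element_bits != 0:
        raise ValueError("element size must evenly divide the register width")
    return register_bits // element_bits


# 32-bit elements: a 256-bit register holds twice as many as a 128-bit one.
wide = lanes(256, 32)    # 8 elements per instruction
narrow = lanes(128, 32)  # 4 elements per instruction
```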



Power Efficiency Improvements




Per-Core P-States (PCPS) allow cores to run at individual frequencies and voltages. Energy Efficient Turbo (EET) monitors stall behavior and increases throughput. Uncore frequency scaling (UFS) changes how the uncore is clocked: in Nehalem, cores could turbo up while the uncore remained at a fixed frequency, and in Sandy Bridge, the core and uncore turbo up and down together.


With Haswell-EP, the core and uncore are now treated independently. Core-bound applications can drive core frequency higher without needing to increase uncore frequency. LLC/memory-bound applications can drive uncore frequency higher without burning core power.



Intel Cache Monitoring Technology (CMT)




When many VMs are running on a system, the cache can be thrashed by what is now called a "noisy neighbor": a VM that takes on a heavy workload with high cache usage. Its demand on the cache degrades the performance of the normally behaving VMs that share the same cores and cache.


Today, the VMs that require heavy system use are often moved to areas that have the resources to support them, so other normal VMs can continue without being adversely affected by the noisy neighbor.


With Intel Cache Monitoring Technology, the processor reports the data needed to detect this, so system software can move the VM when needed. The cache can also be partitioned so that the noisy neighbor has a lesser impact on the VMs around it.


Intel Cache Monitoring Technology (CMT) enables monitoring of last-level cache occupancy on a per-thread/app/VM basis, enabling measurement of application cache sensitivity, profiling, fingerprinting, chargeback models, detection of cache-starved apps/VMs, detection of "noisy neighbors" (which hog the LLC) and advanced cache-aware scheduling policies. The CMT feature is supported on all Xeon E5 v3 SKUs and is enumerated via CPUID.
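As a rough illustration of how per-VM occupancy data could feed a "noisy neighbor" check, the sketch below flags any VM holding more than a chosen share of the LLC. Everything here is assumed for the example: the VM names, the occupancy numbers, the 30 MiB cache size, the 50% threshold, and the function name. Reading the real occupancy counters from hardware is outside the scope of this sketch.

```python
MIB = 1024 * 1024


def noisy_neighbors(occupancy_bytes, llc_bytes, share_threshold=0.5):
    """Names whose LLC occupancy exceeds `share_threshold` of the cache.

    `occupancy_bytes` maps a VM (or thread/app) name to its sampled
    last-level cache occupancy. Results are sorted largest first.
    """
    if llc_bytes <= 0:
        raise ValueError("llc_bytes must be positive")
    flagged = [
        (occupancy, name)
        for name, occupancy in occupancy_bytes.items()
        if occupancy / llc_bytes > share_threshold
    ]
    return [name for occupancy, name in sorted(flagged, reverse=True)]


# Hypothetical samples on an assumed 30 MiB last-level cache.
samples = {"vm-a": 4 * MIB, "vm-b": 20 * MIB, "vm-c": 3 * MIB}
hogs = noisy_neighbors(samples, llc_bytes=30 * MIB)
```

Here only `vm-b` is flagged, since it holds about two thirds of the assumed cache. A scheduler could use a list like this to decide which VM to migrate.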



The DDR4 Difference




The move to DDR4 has many benefits. First, the supply voltage dropped from 1.5V in DDR3 to 1.2V in DDR4. There is also a smaller page size (1,024 bytes down to 512 bytes) for x4 devices. Together, these can show a savings of ~2W per DIMM at the wall.
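For a feel of what the voltage drop alone is worth, a common first-order model treats dynamic power as proportional to the square of the supply voltage at fixed capacitance and frequency. The sketch below applies that model; it is an approximation under those stated assumptions and is not how the ~2W per DIMM figure was derived.

```python
def dynamic_power_ratio(v_new, v_old):
    """Ratio of dynamic power after a supply voltage change.

    Assumes the first-order relation P ~ V^2 with capacitance and
    frequency held constant.
    """
    if v_old <= 0 or v_new <= 0:
        raise ValueError("voltages must be positive")
    return (v_new / v_old) ** 2


ratio = dynamic_power_ratio(1.2, 1.5)  # DDR4 versus DDR3 supply voltage
```

Under this model, moving from 1.5V to 1.2V leaves roughly 64% of the voltage-dependent dynamic power, a reduction of about 36%.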


Improved RAS enables better command/address parity error recovery. With multiple DIMMs per channel installed, DDR4 also offers higher bandwidth and higher DIMM frequency.




With four DIMMs installed per CPU (one per channel), the higher DIMM speeds can be maintained. With eight DIMMs installed, the memory frequency drops to the next rated speed. Larger-capacity DDR4 DIMMs will be available, so populating just the four channels will support a larger amount of RAM at the faster speed.
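The cost of dropping to the next rated speed can be estimated from theoretical peak bandwidth. The sketch below assumes four channels per CPU with a 64-bit (8-byte) data bus each; the 2133 and 1866 MT/s values are illustrative speed grades chosen for the example, not figures taken from this article, and sustained bandwidth in practice will be lower than the peak.

```python
def peak_bandwidth_gbs(channels, mega_transfers_per_s, bus_bytes=8):
    """Theoretical peak memory bandwidth in GB/s (10^9 bytes per second)."""
    if channels <= 0 or mega_transfers_per_s <= 0 or bus_bytes <= 0:
        raise ValueError("all arguments must be positive")
    return channels * mega_transfers_per_s * 1e6 * bus_bytes / 1e9


# Assumed speed grades: one DIMM per channel versus two DIMMs per channel.
one_dpc = peak_bandwidth_gbs(4, 2133)
two_dpc = peak_bandwidth_gbs(4, 1866)
```

With these assumed speeds, the peak falls from about 68.3 GB/s to about 59.7 GB/s per CPU when the second DIMM per channel is added.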
