Sandy Bridge core architecture of the Intel Core i7 74QM

From a high-level perspective, the Sandy Bridge (SNB) architecture is only an evolution, but measured by how many transistors have changed since Nehalem/Westmere, it is definitely a revolution.

Core 2 introduced a logic block called the Loop Stream Detector (LSD). When it detects that the CPU is executing a software loop, it shuts down the branch predictor and the fetch/decode engines and feeds the execution units from its own cached micro-ops. Turning off the front end during loop execution saves power and improves performance.

SNB adds a micro-op cache that holds instructions as they are decoded. There is no strict algorithm here: instructions are placed in the cache as soon as they are decoded. When the fetch hardware obtains a new instruction, it first checks whether it is already in the micro-op cache. If so, the cache feeds the rest of the pipeline and the front end is powered down. The decode hardware is a very complicated part of the x86 pipeline, and turning it off saves a lot of power. If this technique were also brought to the Atom architecture, it would undoubtedly help a great deal.

This cache is direct mapped and can store about 1.5K micro-ops, roughly equivalent to a 6KB instruction cache. It is included in the L1 instruction cache, its hit rate for most programs is about 80%, and its bandwidth is higher and more stable than that of the L1 instruction cache. The actual L1 instruction and data caches are unchanged at 32KB each, 64KB in total.
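The check-the-cache-before-decoding behaviour described above can be sketched with a toy model. This is only an illustration under stated assumptions: the class `MicroOpCache`, the helper `toy_decode`, the slot count and the addresses are all hypothetical, and the sketch says nothing about how Intel's hardware is really organized.

```python
class MicroOpCache:
    """Toy direct-mapped micro-op cache (hypothetical model, not Intel's design)."""

    def __init__(self, num_slots=8):
        self.num_slots = num_slots
        self.tags = [None] * num_slots   # which address currently owns each slot
        self.uops = [None] * num_slots   # decoded micro-ops stored for that address
        self.hits = 0
        self.decodes = 0

    def fetch(self, address, decode):
        slot = address % self.num_slots  # direct mapped: one possible slot per address
        if self.tags[slot] == address:
            self.hits += 1               # hit: the decoders can stay idle
            return self.uops[slot]
        self.decodes += 1                # miss: decode, then fill the slot
        self.tags[slot] = address
        self.uops[slot] = decode(address)
        return self.uops[slot]


def toy_decode(address):
    # Stand-in for the x86 decoders; returns a made-up micro-op label.
    return f"uop@{address}"


cache = MicroOpCache(num_slots=8)
loop_body = [100, 101, 102, 103]         # made-up instruction addresses
for _ in range(10):                      # run the same 4-instruction loop 10 times
    for addr in loop_body:
        cache.fetch(addr, toy_decode)
```

In this toy run only the first pass through the loop needs the decoder. Every later pass is served from the cache, which is the situation in which the text says the front end can be switched off.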

This looks a bit like the Pentium 4's trace cache, but the biggest difference is that it does not cache traces. It is more like an instruction cache that stores micro-ops instead of x86 instructions (macro-ops).

At the same time, Intel completely rebuilt the branch prediction unit (BPU) for higher accuracy, with innovations in three areas.

First, a standard BPU is a 2-bit predictor in which each branch is marked with a confidence level (strong/weak). Intel found that almost all branches predicted by this bimodal predictor have strong confidence, so SNB shares one confidence bit among many branches instead of keeping one per branch. As a result, the same number of bits can cover more branches in the branch history table, which improves prediction accuracy.
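For readers unfamiliar with the classic 2-bit scheme mentioned above, here is a minimal sketch of a textbook saturating-counter predictor. It is not SNB's predictor. The branch pattern and the function names are made up for illustration, and the run also counts how often the counter sits in a "strong" state, which is the kind of observation the text attributes to Intel.

```python
def predict_and_update(counter, taken):
    """counter: 0=strong not-taken, 1=weak not-taken, 2=weak taken, 3=strong taken."""
    prediction = counter >= 2
    if taken:
        counter = min(counter + 1, 3)    # saturate at strong taken
    else:
        counter = max(counter - 1, 0)    # saturate at strong not-taken
    return prediction, counter


def run(outcomes, counter=2):
    """Return (correct predictions, predictions made from a strong state)."""
    correct = 0
    strong = 0
    for taken in outcomes:
        strong += counter in (0, 3)      # confidence at prediction time
        prediction, counter = predict_and_update(counter, taken)
        correct += prediction == taken
    return correct, strong


# Hypothetical loop branch: taken 9 times, then falls through once, 3 times over.
outcomes = ([True] * 9 + [False]) * 3
correct, strong = run(outcomes)
```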

Second, the branch targets have been reworked. In previous architectures the size of a branch target was fixed, even though most targets are relatively close. Instead of blindly widening every entry to hold a full target address, SNB supports several branch target sizes, which wastes less space, lets the CPU track more targets, and speeds up prediction.

Third, the traditional way to improve branch predictor accuracy is to use more history bits, but this only works for certain types of branches that depend on long instruction patterns, so SNB groups branches by how much history they need, which improves prediction accuracy.

Like AMD's Bulldozer and Bobcat, Intel's SNB also uses a physical register file (PRF). In the Core 2 and Nehalem architectures, each micro-op carries a copy of every operand it needs, which means the out-of-order execution hardware (scheduler, reorder buffer and associated queues) must be very large to hold the micro-ops and their data. Operands were 80 bits wide in the Core Duo era, grew to 128 bits with SSE, and double again to 256 bits with AVX. The PRF stores micro-op operands in the register file, and micro-ops carry only pointers to their operands through the out-of-order engine, not the data itself. This greatly reduces the power consumed by the out-of-order hardware (moving large amounts of data costs a lot of power), reduces the die area of the pipeline, and allows the data flow window to grow by one third.

The reduction in die area is the key to implementing the AVX instruction set (one of SNB's most important innovations) with good performance. Intel widened all SIMD units to 256 bits at minimal die-area cost.

AVX supports 256-bit operands, which cost considerable transistors and die area, while the PRF enlarges the out-of-order execution buffers enough to feed the higher-throughput floating-point engine.

The Nehalem architecture has three execution ports and three execution unit stacks.

SNB lets 256-bit AVX instructions borrow the 128-bit integer SIMD data path, which doubles floating-point throughput at minimal die-area cost: two 256-bit AVX operations can be performed per clock. In addition, the upper 128 bits of the execution hardware and data paths are power gated, so standard 128-bit SSE operations do not consume more power because of the widening to 256 bits.
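The "double throughput" claim is simple lane arithmetic, sketched below. The two-operations-per-clock figure comes from the text. Applying the same issue rate to the 128-bit case, and using 32-bit single-precision elements, are assumptions made only to keep the comparison like for like. The function names are hypothetical.

```python
def lanes(vector_bits, element_bits):
    """Number of elements that fit in one vector register."""
    return vector_bits // element_bits


def elements_per_clock(vector_bits, element_bits=32, ops_per_clock=2):
    # ops_per_clock=2 is the figure quoted in the text for 256-bit AVX;
    # reusing it for 128-bit SSE is an assumption for comparison only.
    return lanes(vector_bits, element_bits) * ops_per_clock


sse_rate = elements_per_clock(128)   # 4 lanes x 2 operations per clock
avx_rate = elements_per_clock(256)   # 8 lanes x 2 operations per clock
speedup = avx_rate / sse_rate
```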

AMD's Bulldozer architecture supports AVX differently, merging two 128-bit SSE paths to perform one 256-bit AVX operation. Even an eight-core (four-module) Bulldozer has only half the 256-bit AVX throughput of a four-core SNB, but the actual impact depends entirely on how applications use AVX.

The doubling of SNB's peak floating-point performance places higher demands on the load and store units. The Nehalem/Westmere architecture has three load/store ports: load, store address and store data.

In the SNB architecture, the load and store-address ports are symmetric: both can perform loads or store addresses, so load bandwidth is doubled. SNB's integer execution has also improved, though only modestly. ADC instruction throughput is doubled, and multiplication can be up to 25% faster.

Ring bus, L3 cache and System Agent

In Nehalem/Westmere, each core is wired to the L3 cache separately, which requires about 1,000 wires per core. The disadvantage of this approach is that it may not work well if the L3 cache is accessed frequently. SNB also integrates a GPU graphics core and a video transcoding engine, which share the L3 cache. Rather than following the previous practice and adding another 2,000 wires, Intel introduced a ring bus, as in the server-class Nehalem-EX and Westmere-EX. Each core, each slice of L3 cache (LLC), the integrated graphics core, the media engine and the System Agent all have their own access point on the ring, figuratively a "stop".

The ring bus consists of four independent rings: the data ring (DT), request ring (QT), response ring (RSP) and snoop ring (SNP). Each stop on each ring can accept 32 bytes of data per clock cycle, and ring accesses always automatically take the shortest path to minimize latency. As the number of cores and the cache capacity grow, cache bandwidth grows with them, so the design scales well to more cores and larger server configurations.
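The shortest-path rule mentioned above can be illustrated with a few lines of arithmetic on a bidirectional ring. This is a sketch only: the stop numbering and the six-stop example are hypothetical, and the real routing logic is not described in the text.

```python
def ring_hops(src, dst, num_stops):
    """Fewest hops between two stops on a ring that can be traversed both ways."""
    clockwise = (dst - src) % num_stops
    counter_clockwise = (src - dst) % num_stops
    return min(clockwise, counter_clockwise)


# Hypothetical layout with 6 stops, numbered 0..5 around the ring.
NUM_STOPS = 6
neighbour_across_wrap = ring_hops(0, 5, NUM_STOPS)   # going "backwards" is shorter
opposite_side = ring_hops(1, 4, NUM_STOPS)           # worst case: half the ring
```

The worst case is half the number of stops, which is one way to see why latency to different L3 slices varies from core to core.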

In this way, each SNB core has 96GB/s of L3 cache bandwidth, comparable to high-end Westmere, while a quad-core system can reach 384GB/s because each core has its own access point on the ring. L3 cache latency also drops from about 36 cycles to 26-31 cycles. We had already sensed this in our earlier preview, and now we finally have exact figures.

The L3 cache is now divided into multiple slices, one per CPU core, each with its own access point on the ring bus and a complete cache pipeline. Every core can access all L3 slices, but with different latency. Previously the L3 cache had a single cache pipeline that all core requests had to pass through. Now the work is largely divided and conquered.

Unlike before, the L3 cache now runs at the core clock, so it is faster. The drawback is that the L3 cache also clocks down with the cores, so if the GPU needs the L3 cache while the CPU is clocked down, access will be slower.
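The bandwidth figures quoted above follow from the 32 bytes per clock that each ring stop accepts. The short calculation below reproduces them. The 3.0GHz clock is an assumption, picked because it yields the quoted 96GB/s. The text itself does not state the clock used.

```python
BYTES_PER_CLOCK = 32        # from the text: each stop accepts 32 bytes per cycle
ASSUMED_CLOCK_HZ = 3.0e9    # assumption: 3.0GHz, not stated in the text


def l3_bandwidth_gb_s(cores, clock_hz=ASSUMED_CLOCK_HZ):
    """Aggregate L3 bandwidth in GB/s, one ring access point per core."""
    return cores * BYTES_PER_CLOCK * clock_hz / 1e9


per_core = l3_bandwidth_gb_s(1)
quad_core = l3_bandwidth_gb_s(4)
```

Because the L3 cache runs at the core clock, the same formula also shows the drawback noted above: halving `clock_hz` halves the available bandwidth.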

With the changes to the ring bus and L3 cache, the concept of the Uncore still exists, but Intel has renamed it the System Agent. It is basically equivalent to the former north bridge chip and contains:

A PCI-E controller that provides 16 PCI-E 2.0 lanes and supports a single PCI-E x16 slot or two PCI-E x8 slots;

A redesigned dual-channel DDR3 memory controller, whose memory latency has returned to normal levels (Westmere moved the memory controller off the CPU die and onto the GPU die);

In addition, a DMI bus interface, a display engine and the power control unit (PCU).

The System Agent runs at a lower frequency than the other parts and has its own independent power plane.

Intel's integrated graphics has always seemed like a joke, but this time it is really different. SNB's CPU performance is 10-30% better than today's, and the GPU, now in its sixth generation, easily doubles graphics performance. Westmere also carries its own graphics core, but it is a separate die packaged with the dual-core CPU, and its performance gains came from the 45nm process, more shader hardware and higher clocks. SNB puts the CPU and GPU on the same die, all built on the 32nm process, and in particular improves IPC (instructions per clock) significantly.

The SNB GPU has its own power island and clock domain, supports Turbo Boost so that it can raise or lower its frequency independently, and shares the L3 cache. The graphics driver controls access to the L3 cache and can even limit how much of it the GPU uses. Keeping graphics data in the cache avoids a trip out to distant and "slow" main memory, which is a great help for both performance and power consumption.

But it is not as simple as it sounds. NVIDIA's GF100 core took a great deal of effort, and SNB is actually similar: it has also been completely redesigned.

It is worth mentioning Larrabee, Intel's discrete graphics project. Larrabee relies on extensive use of fully programmable hardware (except for the texture hardware), while SNB uses fixed-function hardware throughout, with each feature mapped to a hardware unit. The advantage is that performance, power consumption and die area are greatly optimized. The cost is a lack of flexibility. Clearly, the CPU is still the center of Intel's world and the GPU cannot be too powerful, which is exactly the opposite of NVIDIA's philosophy.

The programmable shader hardware is made up of shaders/cores/execution units that Intel calls EUs, and each EU can dual-issue instructions picked from multiple threads. The internal ISA maps one-to-one to most DX10 API instructions, which makes the architecture very CISC-like. As a result, the EUs are effectively wider and IPC improves significantly.

Transcendental math is handled by hardware inside the EU, and its performance has improved accordingly. According to Intel, sine and cosine operations are several orders of magnitude faster than on the current HD Graphics.

In Intel's previous graphics architectures, the register file was reallocated on the fly. If one thread needed fewer registers, the remaining registers were given to other threads. Although this saves die area, it also limits performance, and threads may often find that no registers are available. In the chipset-integrated era each thread had 64 registers on average, Westmere HD Graphics raised this to 80, and in SNB it is fixed at 120.

All these improvements add up: the instruction throughput of each EU in SNB is twice that of HD Graphics.

The GPU integrated in SNB comes in two versions, with 6 and 12 EUs. The first mobile parts all have 12 EUs, while desktop parts come in two configurations depending on the model, probably 12 on high-end parts and 6 on low-end ones. Thanks to doubled per-EU throughput, higher clocks and the shared L3 cache, performance should be quite satisfactory even with only six.

In addition to the GPU graphics core, SNB also has a media processor responsible for video decoding and encoding.

In the new hardware-accelerated decode engine, the entire video pipeline is decoded by fixed-function units, just the opposite of the previous approach. Intel claims that SNB's power consumption during video playback is cut in half.

The video encode engine is brand new. Specific details were not announced, but Intel took a 3-minute 1080p 30Mbps HD video and converted it to the 640×360 iPhone format. The whole process took only 14 seconds, a conversion speed of up to 400FPS, and the engine takes up only about 3 square millimeters of die area. Intel works closely with the software industry, and this video transcoding technology is expected to be widely supported soon.

Lynnfield Core i7/i5 first introduced Intel's intelligent dynamic acceleration technology, Turbo Boost, which automatically runs all cores at an appropriate speed according to the workload, or turns off some idle cores to raise the speed of the remaining ones. For example, a quad-core processor with a thermal design power (TDP) of 95W may turn off three cores completely and let the last one speed up considerably until it reaches the 95W TDP limit. Existing processors assume that the TDP limit is reached as soon as Turbo is engaged, but in fact this is not the case: the processor does not become hot immediately, and its heat output stays well below TDP for a while.

Taking advantage of this, SNB allows the power control unit (PCU) to push the active cores above TDP for a short time and then bring them slowly back down. The PCU tracks the thermal headroom accumulated while the processor is idle and spends it when the system load increases. The longer the processor has been idle, the longer it can exceed TDP, up to a maximum of 25 seconds. For the sake of reliability, however, the PCU will not let the chip exceed any reliability limits.
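The idea of earning headroom while idle and spending it under load can be pictured with a deliberately crude budget model. Only the 25-second cap comes from the text. The linear earning rule, the `earn_rate` parameter and the function name are invented for illustration and do not describe how the PCU actually tracks thermal headroom.

```python
MAX_BOOST_SECONDS = 25   # from the text: at most 25 seconds above TDP


def boost_seconds(idle_seconds, load_seconds, earn_rate=0.5):
    """Seconds of a load burst that can run above TDP in this toy model.

    earn_rate is a made-up parameter: seconds of boost earned per idle second.
    """
    budget = min(idle_seconds * earn_rate, MAX_BOOST_SECONDS)  # longer idle, larger budget, capped
    return min(budget, load_seconds)                           # cannot boost longer than the load lasts


short_idle = boost_seconds(idle_seconds=10, load_seconds=60)
long_idle = boost_seconds(idle_seconds=120, load_seconds=60)
short_burst = boost_seconds(idle_seconds=120, load_seconds=10)
```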

As mentioned earlier, the SNB graphics core can also be accelerated independently and dynamically, up to an astonishing 1.35GHz. If software needs more CPU resources, the CPU speeds up while the GPU slows down, and vice versa.

The Sandy Bridge family keeps the Core i3/i5/i7 brand-plus-series naming and uses four-digit model numbers whose first digit, "2", indicates the second generation of the Core i series. A letter suffix often follows the number: K means an unlocked multiplier and appears only on high-end products; S means performance optimized, with a base frequency much lower than the model without a suffix but essentially the same maximum single-core Turbo frequency, and a TDP of 65W; T means power optimized, with a TDP of only 45W or 35W.
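The numbering scheme described above is regular enough to decode mechanically. The sketch below is an illustration only: the function name and the returned dictionary layout are hypothetical, the suffix table contains just the three letters the text explains, and any other suffix is reported as not covered.

```python
import re

# Suffix meanings as given in the text; other suffixes are not covered there.
SUFFIX_MEANING = {
    "K": "unlocked multiplier",
    "S": "performance optimized, 65W TDP",
    "T": "power optimized, 45W or 35W TDP",
}


def describe_model(number):
    """Split a model number such as '2600K' into generation and suffix."""
    match = re.fullmatch(r"(\d)(\d{3})([A-Z]*)", number)
    if match is None:
        raise ValueError(f"not a four-digit model number: {number!r}")
    generation, _, suffix = match.groups()
    if suffix == "":
        meaning = "no suffix"
    else:
        meaning = SUFFIX_MEANING.get(suffix, "not covered by the text")
    return {"generation": int(generation), "suffix": suffix, "meaning": meaning}


unlocked = describe_model("2600K")
low_power = describe_model("2500T")
plain = describe_model("2600")
```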