artikel: Desember 2008

Selasa, 09 Desember 2008

Processor superscalar

A superscalar CPU architecture implements a form of parallelism called instruction-level parallelism within a single processor. It thereby allows faster CPU throughput than would otherwise be possible at the same clock rate. A superscalar processor executes more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to redundant functional units on the processor. Each functional unit is not a separate CPU core but an execution resource within a single CPU such as an arithmetic logic unit, a bit shifter, or a multiplier.

While a superscalar CPU is typically also pipelined, they are two different performance enhancement techniques. It is theoretically possible to have a non-pipelined superscalar CPU or a pipelined non-superscalar CPU.

The superscalar technique is traditionally associated with several identifying characteristics. Note these are applied within a given CPU core.

Instructions are issued from a sequential instruction stream
CPU hardware dynamically checks for data dependencies between instructions at run time (versus software checking at compile time)
Accepts multiple instructions per clock cycle

History

Seymour Cray's CDC 6600 from 1965 is often mentioned as the first superscalar design. The Intel i960CA (1988) and the AMD 29000-series 29050 (1990) microprocessors were the first commercial single chip superscalar microprocessors. RISC CPUs like these brought the superscalar concept to micro computers because the RISC design results in a simple core, allowing straightforward instruction dispatch and the inclusion of multiple functional units (such as ALUs) on a single CPU in the constrained design rules of the time. This was the reason that RISC designs were faster than CISC designs through the 1980s and into the 1990s.

Except for CPUs used in some battery-powered devices, essentially all general-purpose CPUs developed since about 1998 are superscalar. Beginning with the "P6" (Pentium Pro and Pentium II) implementation, Intel's x86 architecture microprocessors have implemented a CISC instruction set on a superscalar RISC microarchitecture. Complex instructions are internally translated to a RISC-like "micro-ops" RISC instruction set, allowing the processor to take advantage of the higher-performance underlying processor while remaining compatible with earlier Intel processors.

From scalar to superscalar
The simplest processors are scalar processors. Each instruction executed by a scalar processor typically manipulates one or two data items at a time. By contrast, each instruction executed by a vector processor operates simultaneously on many data items. An analogy is the difference between scalar and vector arithmetic. A superscalar processor is sort of a mixture of the two. Each instruction processes one data item, but there are multiple redundant functional units within each CPU thus multiple instructions can be processing separate data items concurrently.

Superscalar CPU design emphasizes improving the instruction dispatcher accuracy, and allowing it to keep the multiple functional units in use at all times. This has become increasingly important when the number of units increased. While early superscalar CPUs would have two ALUs and a single FPU, a modern design such as the PowerPC 970 includes four ALUs, two FPUs, and two SIMD units. If the dispatcher is ineffective at keeping all of these units fed with instructions, the performance of the system will suffer.

A superscalar processor usually sustains an execution rate in excess of one instruction per machine cycle. But merely processing multiple instructions concurrently does not make an architecture superscalar, since pipelined, multiprocessor or multi-core architectures also achieve that, but with different methods.

In a superscalar CPU the dispatcher reads instructions from memory and decides which ones can be run in parallel, dispatching them to redundant functional units contained inside a single CPU. Therefore a superscalar processor can be envisioned having multiple parallel pipelines, each of which is processing instructions simultaneously from a single instruction thread.

Limitations
Available performance improvement from superscalar techniques is limited by two key areas:

The degree of intrinsic parallelism in the instruction stream, i.e. limited amount of instruction-level parallelism, and
The complexity and time cost of the dispatcher and associated dependency checking logic.
Existing binary executable programs have varying degrees of intrinsic parallelism. In some cases instructions are not dependent on each other and can be executed simultaneously. In other cases they are inter-dependent: one instruction impacts either resources or results of the other. The instructions a = b + c; d = e + f can be run in parallel because none of the results depend on other calculations. However, the instructions a = b + c; d = a + f might not be runnable in parallel, depending on the order in which the instructions complete while they move through the units.

When the number of simultaneously issued instructions increases, the cost of dependency checking increases extremely rapidly. This is exacerbated by the need to check dependencies at run time and at the CPU's clock rate. This cost includes additional logic gates required to implement the checks, and time delays through those gates. Research shows the gate cost in some cases may be nk gates, and the delay cost k2logn, where n is the number of instructions in the processor's instruction set, and k is the number of simultaneously dispatched instructions. In mathematics, this is called a combinatoric problem involving permutations.

Even though the instruction stream may contain no inter-instruction dependencies, a superscalar CPU must nonetheless check for that possibility, since there is no assurance otherwise and failure to detect a dependency would produce incorrect results.

No matter how advanced the semiconductor process or how fast the switching speed, this places a practical limit on how many instructions can be simultaneously dispatched. While process advances will allow ever greater numbers of functional units (e.g, ALUs), the burden of checking instruction dependencies grows so rapidly that the achievable superscalar dispatch limit is fairly small. -- likely on the order of five to six simultaneously dispatched instructions.

However even given infinitely fast dependency checking logic on an otherwise conventional superscalar CPU, if the instruction stream itself has many dependencies, this would also limit the possible speedup. Thus the degree of intrinsic parallelism in the code stream forms a second limitation.

Alternatives
Collectively, these two limits drive investigation into alternative architectural performance increases such as Very Long Instruction Word (VLIW), Explicitly Parallel Instruction Computing (EPIC), simultaneous multithreading (SMT), and multi-core processors.

With VLIW, the burdensome task of dependency checking by hardware logic at run time is removed and delegated to the compiler. Explicitly Parallel Instruction Computing (EPIC) is like VLIW, with extra cache prefetching instructions.

Simultaneous multithreading, often abbreviated as SMT, is a technique for improving the overall efficiency of superscalar CPUs. SMT permits multiple independent threads of execution to better utilize the resources provided by modern processor architectures.

Superscalar processors differ from multi-core processors in that the redundant functional units are not entire processors. A single processor is composed of finer-grained functional units such as the ALU, integer multiplier, integer shifter, floating point unit, etc. There may be multiple versions of each functional unit to enable execution of many instructions in parallel. This differs from a multicore CPU that concurrently processes instructions from multiple threads, one thread per core. It also differs from a pipelined CPU, where the multiple instructions can concurrently be in various stages of execution, assembly-line fashion.

The various alternative techniques are not mutually exclusive—they can be (and frequently are) combined in a single processor. Thus a multicore CPU is possible where each core is an independent processor containing multiple parallel pipelines, each pipeline being superscalar. Some processors also include vector capability.

See also
Super-threading
Simultaneous multithreading
Speculative execution / Eager execution
Software lockout, a multiprocessor issue similar to logic dependencies on superscalars

Pipelining

Pipeline (computing)

A technique used in advanced microprocessors where the microprocessor begins executing a second instruction before the first has been completed. That is, several instructions are in the pipeline simultaneously, each at a different processing stage.
The pipeline is divided into segments and each segment can execute its operation concurrently with the other segments. When a segment completes an operation, it passes the result to the next segment in the pipeline and fetches the next operation from the preceding segment. The final results of each instruction emerge at the end of the pipeline in rapid succession.

Although formerly a feature only of high-performance and RISC -based microprocessors, pipelining is now common in microprocessors used in personal computers. Intel's Pentium chip, for example, uses pipelining to execute as many as six instructions simultaneously.

Pipelining is also called pipeline processing.

(2) A similar technique used in DRAM, in which the memory loads the requested memory contents into a small cache composed of SRAM and then immediately begins fetching the next memory contents. This creates a two-stage pipeline, where data is read from or written to SRAM in one stage, and data is read from or written to memory in the other stage.

DRAM pipelining is usually combined with another performance technique called burst mode. The two techniques together are called a pipeline burst cache.

For the downloading technique, see HTTP pipelining.

Instruction scheduling on the Intel Pentium 4.In computing, a pipeline is a set of data processing elements connected in series, so that the output of one element is the input of the next one. The elements of a pipeline are often executed in parallel or in time-sliced fashion; in that case, some amount of buffer storage is often inserted between elements.

Computer-related pipelines include:

Instruction pipelines, such as the classic RISC pipeline, which are used in processors to allow overlapping execution of multiple instructions with the same circuitry. The circuitry is usually divided up into stages, including instruction decoding, arithmetic, and register fetching stages, wherein each stage processes one instruction at a time.
Graphics pipelines, found in most graphics cards, which consist of multiple arithmetic units, or complete CPUs, that implement the various stages of common rendering operations (perspective projection, window clipping, color and light calculation, rendering, etc.).
Software pipelines, consisting of multiple processes arranged so that the output stream of one process is automatically and promptly fed as the input stream of the next one. Unix pipelines are the classical implementation of this concept.

Concept and motivation

Pipelining is a natural concept in everyday life, e.g. on an assembly line. Consider the assembly of a car: assume that certain steps in the assembly line are to install the engine, install the hood, and install the wheels (in that order, with arbitrary interstitial steps). A car on the assembly line can have only one of the three steps done at once. After the car has its engine installed, it moves on to having its hood installed, leaving the engine installation facilities available for the next car. The first car then moves on to wheel installation, the second car to hood installation, and a third car begins to have its engine installed. If engine installation takes 20 minutes, hood installation takes 5 minutes, and wheel installation takes 10 minutes, then finishing all three cars when only one car can be operated at once would take 105 minutes. On the other hand, using the assembly line, the total time to complete all three is 75 minutes. At this point, additional cars will come off the assembly line at 20 minute increments.

Costs, drawbacks, and benefits

As the assembly line example shows, pipelining doesn't decrease the time for a single datum to be processed; it only increases the throughput of the system when processing a stream of data.

High pipelining leads to increase of latency - the time required for a signal to propagate through a full pipe.

A pipelined system typically requires more resources (circuit elements, processing units, computer memory, etc.) than one that executes one batch at a time, because its stages cannot reuse the resources of a previous stage. Moreover, pipelining may increase the time it takes for an instruction to finish.

Sabtu, 06 Desember 2008

RISC VS CISC, Procesor superskalar

Disingkat dengan RISC. Rangkaian instruksi built-in pada processor yang terdiri dari perintah-perintah yang lebih ringkas dibandingkan dengan CISC. RISC memiliki keunggulan dalam hal kecepatannya sehingga banyak digunakan untuk aplikasi-aplikasi yang memerlukan kalkulasi secara intensif. Konsep RISC pertama kali dikembangkan oleh IBM pada era 1970-an. Komputer pertama yang menggunakan RISC adalah komputer mini IBM 807 yang diperkenalkan pada tahun 1980. Dewasa ini, RISC digunakan pada keluarga processor buatan Motorola (PowerPC) dan SUN Microsystems (Sparc, UltraSparc).

RISC dikembangkan melalui seorang penelitinya yang bernama John Cocke, beliau menyampaikan bahwa sebenarnya kekhasan dari komputer tidaklah menggunakan banyak instruksi, namun yang dimilikinya adalah instruksi yang kompleks yang dilakukan melalui rangkaian sirkuit.

Pada desain chip mikroprosesor jenis ini, pemroses diharapkan dapat melaksanakan perintah-perintah yang dijalankannya secara cepat dan efisien melalui penyediaan himpunan instruksi yang jumlahnya relatif sedikit, dengan mengambil perintah-perintah yang sangat sederhana, akibatnya arsitektur RISC membatasi jumlah instruksinya yang dipasang ke dalam mikroprosesor tetapi mengoptimasi setiap instruksi sehingga dapat dilaksanakan dengan cepat.
Dengan demikian instruksi yang sederhana dapat dilaksanakan lebih cepat apabila dibandingkan dengan mikroprosesor yang dirancang untuk menangan susunan instruksi yang lebih luas.

Dengan demikian chip RISC hanya dapat memproses instruksi dalam jumlah terbatas, tetapi instruksi ini dioptimalkan sehingga cepat dieksekusi. Meski demikian, bila harus menangani tugas yang kompleks, instruksi harus dibagi menjadi banyak kode mesin, terutama sebelum chip RISC dapat menanganinya. Karena keterbatasan jumlah instruksi yang ada padanya, apabila terjadi kesalahan dalam pemrosesan akan memudahkan dalam melacak kesalahan tersebut.
Pada tahun 1980-an kapasitas modul memori meningkat dan harganya turun. Penekanan pada desain CPU bergeser ke kinerja, dan RISC menjadi trend baru. Contoh arsitektur RISC meliputi SPARC dari Sun Microsystems; seri MIPS Rxxxx dari MIPS Technologies; Alpha dari Digital Equipment; PowerPC yang dikembangkan bersama oleh IBM dan Motorola; dan RISC dari Hewlett-Packard.

Chip RISC menggunakan sejumlah kecil instruksi dengan panjang-sama yang relatif sederhana, yaitu panjangnya selalu 32 bit. Walaupun hal ini memboroskan memori karena harus dibuat program lebih besar, instruksi lebih mudah dan cepat dieksekusi.
Karena chip ini berurusan dengan jenis instruksi lebih sedikit, chip RISC membutuhkan lebih sedikit transistor ketimbang chip CISC dan umumnya berkinerja lebih tinggi pada kecepatan clock yang sama, walaupun chip ini harus mengeksekusi lebih banyak instruksi lebih pendek untuk menyelesaikan sebuah fungsi.

Kesederhanaan RISC juga mempermudah merancang prosesor superscalar - chip yang dapat mengeksekusi lebih dari satu instruksi pada satu saat. Hampir semua prosesor RISC dan CISC modern adalah superscalar; tetapi untuk mencapai kemampuan ini membuat desain lebih rumit.

Kebalikan dari arsitektur chip mikrprosesor dari RISC adalah CISC (baca ”sisk”, yang merupakan singkatan dari complex instruction set computing, dimana mikroprosesor memiliki lebih banyak instruksi yang terdapat di dalamnya.
Disingkat dengan RISC. Rangkaian instruksi built-in pada processor yang terdiri dari perintah-perintah yang lebih ringkas dibandingkan dengan CISC. RISC memiliki keunggulan dalam hal kecepatannya sehingga banyak digunakan untuk aplikasi-aplikasi yang memerlukan kalkulasi secara intensif. Konsep RISC pertama kali dikembangkan oleh IBM pada era 1970-an. Komputer pertama yang menggunakan RISC adalah komputer mini IBM 807 yang diperkenalkan pada tahun 1980. Dewasa ini, RISC digunakan pada keluarga processor buatan Motorola (PowerPC) dan SUN Microsystems (Sparc, UltraSparc).

RISC dikembangkan melalui seorang penelitinya yang bernama John Cocke, beliau menyampaikan bahwa sebenarnya kekhasan dari komputer tidaklah menggunakan banyak instruksi, namun yang dimilikinya adalah instruksi yang kompleks yang dilakukan melalui rangkaian sirkuit.

Pada desain chip mikroprosesor jenis ini, pemroses diharapkan dapat melaksanakan perintah-perintah yang dijalankannya secara cepat dan efisien melalui penyediaan himpunan instruksi yang jumlahnya relatif sedikit, dengan mengambil perintah-perintah yang sangat sederhana, akibatnya arsitektur RISC membatasi jumlah instruksinya yang dipasang ke dalam mikroprosesor tetapi mengoptimasi setiap instruksi sehingga dapat dilaksanakan dengan cepat.
Dengan demikian instruksi yang sederhana dapat dilaksanakan lebih cepat apabila dibandingkan dengan mikroprosesor yang dirancang untuk menangan susunan instruksi yang lebih luas.

Dengan demikian chip RISC hanya dapat memproses instruksi dalam jumlah terbatas, tetapi instruksi ini dioptimalkan sehingga cepat dieksekusi. Meski demikian, bila harus menangani tugas yang kompleks, instruksi harus dibagi menjadi banyak kode mesin, terutama sebelum chip RISC dapat menanganinya. Karena keterbatasan jumlah instruksi yang ada padanya, apabila terjadi kesalahan dalam pemrosesan akan memudahkan dalam melacak kesalahan tersebut.

Pada tahun 1980-an kapasitas modul memori meningkat dan harganya turun. Penekanan pada desain CPU bergeser ke kinerja, dan RISC menjadi trend baru. Contoh arsitektur RISC meliputi SPARC dari Sun Microsystems; seri MIPS Rxxxx dari MIPS Technologies; Alpha dari Digital Equipment; PowerPC yang dikembangkan bersama oleh IBM dan Motorola; dan RISC dari Hewlett-Packard.
Chip RISC menggunakan sejumlah kecil instruksi dengan panjang-sama yang relatif sederhana, yaitu panjangnya selalu 32 bit. Walaupun hal ini memboroskan memori karena harus dibuat program lebih besar, instruksi lebih mudah dan cepat dieksekusi.
Karena chip ini berurusan dengan jenis instruksi lebih sedikit, chip RISC membutuhkan lebih sedikit transistor ketimbang chip CISC dan umumnya berkinerja lebih tinggi pada kecepatan clock yang sama, walaupun chip ini harus mengeksekusi lebih banyak instruksi lebih pendek untuk menyelesaikan sebuah fungsi.

Kesederhanaan RISC juga mempermudah merancang prosesor superscalar - chip yang dapat mengeksekusi lebih dari satu instruksi pada satu saat. Hampir semua prosesor RISC dan CISC modern adalah superscalar; tetapi untuk mencapai kemampuan ini membuat desain lebih rumit.
Kebalikan dari arsitektur chip mikrprosesor dari RISC adalah CISC (baca ”sisk”, yang merupakan singkatan dari complex instruction set computing, dimana mikroprosesor memiliki lebih banyak instruksi yang terdapat di dalamnya.

CISC
Complex instruction-set computing atau Complex Instruction-Set Computer (CISC; "Kumpulan instruksi komputasi kompleks") adalah sebuah arsitektur dari set instruksi dimana setiap instruksi akan menjalankan beberapa operasi tingkat rendah, seperti pengambilan dari memory, operasi aritmetika, dan penyimpanan ke dalam memory, semuanya sekaligus hanya di dalam sebuah instruksi. Karakteristik CISC dapat dikatakan bertolak-belakang dengan RISC.

Sebelum proses RISC didesain untuk pertama kalinya, banyak arsitek komputer mencoba menjembatani celah semantik", yaitu bagaimana cara untuk membuat set-set instruksi untuk mempermudah pemrograman level tinggi dengan menyediakan instruksi "level tinggi" seperti pemanggilan procedure, proses pengulangan dan mode-mode pengalamatan kompleks sehingga struktur data dan akses array dapat dikombinasikan dengan sebuah instruksi. Karakteristik CISC yg "sarat informasi" ini memberikan keuntungan di mana ukuran program-program yang dihasilkan akan menjadi relatif lebih kecil, dan penggunaan memory akan semakin berkurang. Karena CISC inilah biaya pembuatan komputer pada saat itu (tahun 1960) menjadi jauh lebih hemat.

Memang setelah itu banyak desain yang memberikan hasil yang lebih baik dengan biaya yang lebih rendah, dan juga mengakibatkan pemrograman level tinggi menjadi lebih sederhana, tetapi pada kenyataannya tidaklah selalu demikian. Contohnya, arsitektur kompleks yang didesain dengan kurang baik (yang menggunakan kode-kode mikro untuk mengakses fungsi-fungsi hardware), akan berada pada situasi di mana akan lebih mudah untuk meningkatkan performansi dengan tidak menggunakan instruksi yang kompleks (seperti instruksi pemanggilan procedure), tetapi dengan menggunakan urutan instruksi yang sederhana.

Satu alasan mengenai hal ini adalah karena set-set instruksi level-tinggi, yang sering disandikan (untuk kode-kode yang kompleks), akan menjadi cukup sulit untuk diterjemahkan kembali dan dijalankan secara efektif dengan jumlah transistor yang terbatas. Oleh karena itu arsitektur -arsitektur ini memerlukan penanganan yang lebih terfokus pada desain prosesor. Pada saat itu di mana jumlah transistor cukup terbatas, mengakibatkan semakin sempitnya peluang ditemukannya cara-cara alternatif untuk optimisasi perkembangan prosesor. Oleh karena itulah, pemikiran untuk menggunakan desain RISC muncul pada pertengahan tahun 1970 (Pusat Penelitian Watson IBM 801 - IBMs)

Contoh-contoh prosesor CISC adalah System/360, VAX, PDP-11, varian Motorola 68000 , dan CPU AMD dan Intel x86.

Istilah RISC dan CISC saat ini kurang dikenal, setelah melihat perkembangan lebih lanjut dari desain dan implementasi baik CISC dan CISC. Implementasi CISC paralel untuk pertama kalinya, seperti 486 dari Intel, AMD, Cyrix, dan IBM telah mendukung setiap instruksi yang digunakan oleh prosesor-prosesor sebelumnya, meskipun efisiensi tertingginya hanya saat digunakan pada subset x86 yang sederhana (mirip dengan set instruksi RISC, tetapi tanpa batasan penyimpanan/pengambilan data dari RISC). Prosesor-prosesor modern x86 juga telah menyandikan dan membagi lebih banyak lagi instruksi-instruksi kompleks menjadi beberapa "operasi-mikro" internal yang lebih kecil sehingga dapat instruksi-instruksi tersebut dapat dilakukan secara paralel, sehingga mencapai performansi tinggi pada subset instruksi yang lebih besar.

Mikrokontroler yang beredar saat ini dibedakan menjadi dua macam, berdasarkan arsitekturnya:

* Tipe CISC atau Complex Instruction Set Computing yang lebih kaya instruksi tetapi fasilitas internal secukupnya saja (seri AT89 memiliki 255 instruksi);
* Tipe RISC atau Reduced Instruction Set Computing yang justru lebih kaya fasilitas internalnya tetapi jumlah instruksi secukupnya (seri PIC16F hanya ada sekitar 30-an instruksi).

Fasilitas internal yang saya maksudkan di sini antara lain: jumlah dan macam register internal, pewaktu dan/atau pencacah, ADC atau DAC, unit komparator, interupsi eksternal maupun internal dan lain sebagainya.

Yang mana sesuai dengan Anda? Sesuaikan dengan kebutuhan dan fitur-fitur yang ditawarkan oleh masing-masing mikrokontroler. Tidak ada ukuran secara pasti suatu jenis mikrokontroler lebih baik dibandingkan mikrokontroler lainnya.

Superscalar

Prosesor super scalar out-of-order x86 seperti PII da PII juga memiliki multiple fungdional unit yang dapat mengoperasikan banyak operasi secara paralel (seperti RISC). Tetapi karena pemberian instruksi terurut itu menggunakan perangkat keras, maka prosesor menjadi lebih kompleks. Dan lagi karena instruksi x86 yang kompleks, maka perangkat keras yang dibutuhkan untuk pengkode menjadi banyak dan membutukan disipasi panas yang lebih besar. Berikut ini ditampilkan gambar alur eksekusi dari instrusi pada prosesor x86. Alur pemberian instruksi pada x86 Superscalar

Eksekusi prosesor superskalar

Dalam disain superskalar konvensional, penulis perangkat lunak menulis program sekuensial pada bahasa pemrograman tingkat tinggi dan lalu dikompilasi ke bahasa mesin. Kode mesin ini dijalankan secara sekuensial juga, dan instruksi CPU diatur sehingga diumpankan pada perangkat keras dan dijalankan secara paralel. Scheduler secara teliti akan menguji keterkaitan kode, dan mengurutkannya sebelum dijalankan. Pada akhirnya, kode yang sekuensial itu dijalankan oleh CPU secara paralel. Melakukan hal ini dengan program membutuhkan beban kerja yang berat di CPU, dan sering membutuhkan biaya tambahan (baik siklus clock atau jumlah transistor)

artikel

Selasa, 09 Desember 2008

Processor superscalar

Pipelining

Sabtu, 06 Desember 2008

RISC VS CISC, Procesor superskalar

Pengikut

Arsip Blog

Mengenai Saya