Implementing Multipliers in FLEX 10K Devices

Application Note 53
March 1996, ver. 1

Introduction
The Altera FLEX 10K embedded programmable logic device (PLD) family
provides the first PLDs in the industry with an embedded array. The
embedded array consists of a series of embedded array blocks (EABs) that
can implement complex logic functions, such as multipliers. Each EAB
can be configured as an 8-input, 8-output look-up table (LUT). Therefore,
a single EAB can create a multiplier with up to 8 inputs—such as a 4 × 4,
5 × 3, or 6 × 2 multiplier. Figure 1 shows a graphical representation of the
flexible multiplier sizes that can be implemented in an EAB.
Figure 1. Multiplier Configuration for a Single EAB
[Figure: multiplier sizes that fit in one EAB: 4 × 4, 5 × 3, and 6 × 2.]
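The look-up table view of an EAB can be sketched in software. The following Python model is only an illustration, not part of the Altera design files: it treats an EAB as a 256-entry table with 8 address inputs and 8 data outputs, and the function name build_mult_lut and the way the two operands are packed into the address are assumptions made for this sketch.

```python
# Model an EAB as an 8-input, 8-output look-up table (256 entries of 8 bits).
# build_mult_lut and the address packing are illustrative assumptions,
# not Altera APIs or the actual EAB bit ordering.

def build_mult_lut(a_bits, b_bits):
    """Return a 256-entry table holding a * b for every operand pair."""
    if a_bits + b_bits > 8:
        raise ValueError("both operands must fit in the 8 address inputs")
    lut = [0] * 256
    for a in range(1 << a_bits):
        for b in range(1 << b_bits):
            address = (a << b_bits) | b   # operand a in the upper address bits
            lut[address] = a * b          # product needs at most 8 output bits
    return lut

lut_4x4 = build_mult_lut(4, 4)
lut_5x3 = build_mult_lut(5, 3)
lut_6x2 = build_mult_lut(6, 2)

# Largest product in each configuration; all fit in 8 output bits.
print(max(lut_4x4), max(lut_5x3), max(lut_6x2))
```

Because an a-bit by b-bit product never needs more than a + b bits, any split of the 8 inputs produces a result that fits in the 8 outputs.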
This application note describes how to implement large multipliers using
several EABs and compares parallel multiplier and
time-domain-multiplexed multiplier implementations.
The design files described in this application note are available
from the Altera BBS via modem at (408) 954-0104 and the Altera
FTP site at ftp.altera.com. The self-extracting files are an_53.exe
and an_53.tar.

Altera Corporation
A-AN-053-01

Single-EAB Multipliers
You can implement a multiplier with up to 8 inputs in a single EAB using
a function from the library of parameterized modules (LPM). The LPM is
a set of architecture-independent modules with scalable widths that
completely describes the logical operation of a circuit. Using the LPM
function, lpm_mult, you can define the width of the multiplicand for the
multiplier. Then, you can use MAX+PLUS II to place the multiplier in an
EAB by following these steps:
1. Select the lpm_mult function in any MAX+PLUS II application.
2. Choose the Logic Options command (Assign menu). In the Logic
   Options dialog box, the name of the function is displayed in the
   Node Name box.
3. Choose the Individual Logic Options button and turn on the
   Implement in EAB option. Choose OK.
4. Choose OK to implement the multiplier in the EAB.

Multiple-EAB Multipliers
A multiplier with more than 8 inputs must be implemented in two or
more EABs. Each EAB computes a single partial product, generated
from a 4 × 4 multiplier. To illustrate how to split the multiplier across
multiple EABs, consider how a 2-digit by 2-digit multiplication is
calculated using base 10 multiplication. See Figure 2.
Figure 2. Base 10 Multiplication

      12
    × 37

    Partial products:
      7 × 2    (weight 10^0)
      7 × 1    (weight 10^1)
      3 × 2    (weight 10^1)
      3 × 1    (weight 10^2)

    3 × 10^2 + (7 + 6) × 10^1 + 14 × 10^0
Rather than using base 10 (as shown in Figure 2), the EAB performs
the same operation in hexadecimal radix. Each partial product is
calculated within a single EAB. See Figure 3.
Figure 3. Hexadecimal Multiplication

      X[7..4]  X[3..0]
    × Y[7..4]  Y[3..0]

    Each partial product is generated by one EAB:
      X[3..0] × Y[3..0]
      X[7..4] × Y[3..0]
      X[3..0] × Y[7..4]
      X[7..4] × Y[7..4]

    Partial products are summed to produce the final product:
      (X[7..4] × Y[7..4]) × 16^2
        + ((X[7..4] × Y[3..0]) + (X[3..0] × Y[7..4])) × 16^1
        + (X[3..0] × Y[3..0]) × 16^0
To account for the relative significance in hexadecimal radix, each
partial product is multiplied by 16^n (where n = 0, 1, 2,...), and the
weighted partial products are then added together to determine the
final product. You can choose one of two design methods to generate
the final product: a parallel multiplier or a
time-domain-multiplexed multiplier.
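The decomposition can be checked with a short software model. The Python sketch below is an illustration only: mult4x4 stands in for one EAB look-up, and the function names are hypothetical rather than taken from the design files.

```python
def mult4x4(a, b):
    """Stand-in for one EAB configured as a 4 x 4 multiplier."""
    assert 0 <= a < 16 and 0 <= b < 16
    return a * b

def mult8x8(x, y):
    """Build an 8 x 8 product from four 4 x 4 partial products."""
    x_hi, x_lo = x >> 4, x & 0xF      # X[7..4], X[3..0]
    y_hi, y_lo = y >> 4, y & 0xF      # Y[7..4], Y[3..0]
    low = mult4x4(x_lo, y_lo)         # weight 16^0
    mid_a = mult4x4(x_hi, y_lo)       # weight 16^1
    mid_b = mult4x4(x_lo, y_hi)       # weight 16^1
    high = mult4x4(x_hi, y_hi)        # weight 16^2
    return high * 16**2 + (mid_a + mid_b) * 16**1 + low * 16**0

# Example operands chosen arbitrarily for illustration.
print(hex(mult8x8(0x12, 0x37)), hex(0x12 * 0x37))
```

Multiplying by 16^n is the same as shifting left by 4 × n bits, which is how both design methods described next apply the weights.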
Parallel Multiplier
The parallel multiplier design method uses multiple EABs to
generate all of the partial products in parallel. For example, an 8 × 8
parallel multiplier uses four EABs (one for each partial product) to
simultaneously generate four 4 × 4 partial products. Before the
partial products are added together, each one is shifted to account
for the 16^n term (i.e., each partial product is shifted over n
hexadecimal digits, or 4 × n bits). The adder assembles the final
product by shifting the data into different bits. Addition is normally
performed by a two-stage adder with 8 bits for the first stage and 12
bits for the second stage (see Figure 4).
Figure 4. 2-Stage Adder
[Figure: bit alignment of the partial products R[7..0], S[7..0],
T[7..0], and U[7..0] that are summed to form the product Q[15..0]. The
figure distinguishes addition performed in the first stage from
addition performed in the second stage.]

Where:
    R = X[3..0] × Y[3..0]
    S = X[3..0] × Y[7..4]
    T = X[7..4] × Y[3..0]
    U = X[7..4] × Y[7..4]
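One way to arrive at an 8-bit first stage and a 12-bit second stage is sketched below in Python. The split is an assumption made for illustration (the first stage adds S and T; the second stage adds that sum to U[7..0] concatenated with R[7..4]; R[3..0] passes directly to Q[3..0]) and may not match the exact wiring in the design files. The function names are hypothetical.

```python
def nibble_products(x, y):
    """Return R, S, T, U as defined for Figure 4."""
    x_hi, x_lo = x >> 4, x & 0xF
    y_hi, y_lo = y >> 4, y & 0xF
    return x_lo * y_lo, x_lo * y_hi, x_hi * y_lo, x_hi * y_hi

def two_stage_add(r, s, t, u):
    """Combine four 8-bit partial products into the 16-bit product Q."""
    stage1 = s + t                     # first stage: 8-bit add, 9-bit result
    upper = (u << 4) | (r >> 4)        # U[7..0] followed by R[7..4]: 12 bits
    stage2 = upper + stage1            # second stage: 12-bit add
    return (stage2 << 4) | (r & 0xF)   # assumed: R[3..0] goes straight to Q[3..0]

x, y = 0xAB, 0xCD                      # arbitrary example operands
print(two_stage_add(*nibble_products(x, y)) == x * y)
```

The second-stage sum never overflows 12 bits because the full product of two 8-bit operands always fits in 16 bits.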
You can pipeline the parallel multiplier to enhance design speeds by
using registers to process logic over multiple Clock cycles. The
registers within the EAB can be used for pipelining (see Figure 5).
Figure 5. Parallel Multiplier with Pipelining
[Figure: four EAB multipliers compute X[3..0] × Y[3..0],
X[3..0] × Y[7..4], X[7..4] × Y[3..0], and X[7..4] × Y[7..4]. Their
outputs pass through optional pipelining registers and the adder
stages to produce Z[3..0], Z[7..4], Z[11..8], and Z[15..12].]
An 8 × 8 parallel multiplier is implemented in 3 stages: a multiplier
stage using 4 EABs, and 2 adder stages with 8 bits for the first stage
and 12 bits for the second stage. To pipeline the multiplier, each bit
must be registered after each stage, which requires 21 registers for
the first stage and 16 registers for the second stage. For the multiplier
stage, each EAB has registers available at the inputs and outputs.
Therefore, additional logic elements (LEs) are not required for the
multiplier stage. The LEs containing the adder logic provide 21
registers; therefore only 20 additional LEs are required for the entire
circuit.
Time-Domain-Multiplexed Multiplier
The time-domain-multiplexed multiplier design method uses a
single EAB to generate all partial products on different Clock cycles
(see Figure 6). Therefore, the appropriate bits need to be loaded into
the EAB before each multiplication. After multiplication, the
accumulator shifts the data to account for the 16^n term and then sums
the different partial products to produce the final product.
Figure 6. Simulation Waveform for Time-Domain-Multiplexed Multiplier
[Waveform: on successive Clock cycles the EAB output is R, S, T, and
U, and the accumulator output takes the values (1), (2), (3), and (4).]

Where:
    R = X[3..0] × Y[3..0]
    S = X[3..0] × Y[7..4]
    T = X[7..4] × Y[3..0]
    U = X[7..4] × Y[7..4]

Notes:
(1) X[3..0] × Y[3..0] × 16^0
(2) (X[3..0] × Y[3..0] × 16^0) + (X[3..0] × Y[7..4] × 16^1)
(3) (X[3..0] × Y[3..0] × 16^0)
      + ((X[3..0] × Y[7..4]) + (X[7..4] × Y[3..0])) × 16^1
(4) (X[3..0] × Y[3..0] × 16^0)
      + [((X[3..0] × Y[7..4]) + (X[7..4] × Y[3..0])) × 16^1]
      + (X[7..4] × Y[7..4] × 16^2)
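The accumulator sequence in notes (1) through (4) can be reproduced with a small behavioral model. The Python sketch below is illustrative only: the generator name tdm_multiply and the schedule list are assumptions, and the expression a * b stands in for the single EAB look-up performed on each Clock cycle.

```python
def tdm_multiply(x, y):
    """Yield the accumulator value after each of the four Clock cycles."""
    x_lo, x_hi = x & 0xF, x >> 4
    y_lo, y_hi = y & 0xF, y >> 4
    # (X nibble, Y nibble, shift in bits), in the order R, S, T, U
    schedule = [
        (x_lo, y_lo, 0),   # R, weight 16^0
        (x_lo, y_hi, 4),   # S, weight 16^1
        (x_hi, y_lo, 4),   # T, weight 16^1
        (x_hi, y_hi, 8),   # U, weight 16^2
    ]
    acc = 0
    for a, b, shift in schedule:
        acc += (a * b) << shift   # shift, then accumulate
        yield acc

# Arbitrary example operands; the last value is the full product.
for cycle, value in enumerate(tdm_multiply(0xAB, 0xCD), start=1):
    print(cycle, hex(value))
```

Only one multiplier is in use at a time, which is why this method needs a single EAB but four Clock cycles for an 8 × 8 product.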
To pipeline the time-domain-multiplexed multiplier, insert registers
between the EAB performing the multiplication and the accumulator
performing the addition and shifting. Figure 7 shows a
time-domain-multiplexed multiplier.
Figure 7. Time-Domain-Multiplexed Multiplier
[Figure: optional input registers with enables (ENA) hold X[7..4],
X[3..0], Y[7..4], and Y[3..0]. Control logic selects the 4-bit
operands for the EAB multiplier. The 8-bit EAB output is shifted by 0,
4, or 8 bits and summed in a 16-bit loadable accumulator to produce
Z[15..0].]
You can also increase throughput in the time-domain-multiplexed
multiplier design method by implementing the multiplier in two or
more EABs. Then, the multiplier computes multiple partial products
simultaneously, which reduces the number of Clock cycles. The
time-domain-multiplexed multiplier implementation is well-suited
for very large multiplications, such as 16 × 16 or 32 × 32, because it
conserves EABs and logic cells. In contrast, large multiplications
would consume a prohibitive number of logic cells or EABs if
computed in parallel.
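The growth in partial products can be estimated with a few lines of Python. This sketch assumes that every partial product comes from a 4 × 4 multiplier, as in the 8 × 8 example, and it counts only multipliers, not the adder logic. The function name count_partial_products is hypothetical.

```python
def count_partial_products(width, digit_bits=4):
    """Number of 4 x 4 partial products in a width x width multiplication."""
    digits = -(-width // digit_bits)   # ceiling division: hex digits per operand
    return digits * digits

for width in (8, 16, 32):
    n = count_partial_products(width)
    print(f"{width} x {width}: {n} partial products "
          f"({n} EABs in parallel, or {n} Clock cycles with one EAB)")
```

The count grows with the square of the operand width, so the parallel method needs four times as many EABs each time the width doubles.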
Design Speed
The parallel multiplier generates all of the partial products and sums
them within a single Clock cycle. In addition, data is loaded on every
Clock cycle, giving the parallel multiplier high throughput and fast
calculation times. Designers can pipeline the parallel multiplier for
faster Clock speeds. A pipelined multiplier takes multiple Clock
cycles, and therefore has more latency, to produce a single product.
However, pipelining decreases the Clock period while still allowing
new data to be loaded on every Clock cycle. Because a pipelined
multiplier generates a new product on every Clock cycle at the faster
Clock speed, it gives the highest throughput for consecutive
operations. See Figure 8.
Figure 8. Simulation Waveforms for Non-Pipelined & Pipelined Parallel
Multipliers
[Waveforms: for both the non-pipelined and the pipelined parallel
multiplier, the Clock, Data (inputs 1 through 4), and Product (results
1 through 4) signals are shown, with the time for 1 computation and
for 4 computations marked.]
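The relationship between latency and throughput for the parallel multiplier can be written as a simple cycle count. The Python sketch below assumes that new data is loaded on every Clock cycle, so each additional product costs one more cycle after the first. The function name cycles_for is hypothetical, and the model does not apply to the time-domain-multiplexed multiplier, where one multiplication must finish before the next begins.

```python
def cycles_for(n_products, latency):
    """Clock cycles until the last of n_products back-to-back results appears.

    latency is the number of Clock cycles needed for a single product:
    1 for the non-pipelined parallel multiplier, 3 for the 3-stage pipeline.
    """
    return latency + (n_products - 1)

for name, latency in (("non-pipelined", 1), ("3-stage pipeline", 3)):
    print(name, [cycles_for(n, latency) for n in (1, 2, 4)])
```

For one and two multiplications, this gives 1 and 2 cycles without pipelining and 3 and 4 cycles with the 3-stage pipeline, matching the parallel multiplier rows of Table 1. The pipelined design needs more cycles, but each cycle is shorter.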
The typical time-domain-multiplexed multiplier uses a single EAB to
compute all partial products on different Clock cycles. Therefore, a
multiplication requires as many Clock cycles as it has partial
products. In the 8 × 8 multiplication example shown in Figure 7, the
multiplication requires 4 Clock cycles. When consecutive
multiplications are required, the first multiplication must be
completed before the second multiplication can begin. Designers can
pipeline the time-domain-multiplexed multiplier for faster Clock
speeds: pipelining reduces the Clock period, which increases
throughput. Table 1 summarizes the performance of parallel and
time-domain-multiplexed multipliers.
Table 1. Circuit Performance

                                          Clock Cycles for an 8 × 8 Multiplier
Design                                    One Multiplication  Two Multiplications
Parallel Multiplier                       1                   2
Parallel Multiplier with 3-Stage
  Pipeline                                3                   4
Time-Domain-Multiplexed Multiplier        4                   9
Time-Domain-Multiplexed Multiplier
  with 2-Stage Pipeline                   5                   10

Device Utilization
The 8 × 8 parallel multiplier design uses 4 EABs plus 21 additional
LEs required for the 12-bit and 8-bit adders. A 3-stage pipeline
requires 20 additional registers to store data. A parallel multiplier
with 3-stage pipelining will not require any additional LEs when the
registers are implemented in the EAB.
In contrast, the time-domain-multiplexed multiplier uses only one
EAB. The multiplier uses logic, rather than EABs, to select which bits
are used for multiplication. A time-domain-multiplexed multiplier
with 2-stage pipelining does not require any additional LEs.
Table 2 summarizes the number of EABs and LEs required for each
type of multiplier.
Table 2. Device Utilization for an 8 × 8 Multiplier

Design                                    EABs Required  LEs Required
Parallel Multiplier                       4              24
Parallel Multiplier with 3-Stage
  Pipeline                                4              45
Time-Domain-Multiplexed Multiplier        1              65
Time-Domain-Multiplexed Multiplier
  with 2-Stage Pipeline                   1              65

Conclusion
Large multipliers can be implemented in FLEX 10K devices with
either a parallel multiplier or time-domain-multiplexed multiplier
design method. The parallel multiplier offers the fastest Clock
speeds but requires more space and device resources. The
time-domain-multiplexed multiplier conserves space and device
resources but offers slower Clock speeds. Both design methods can
be pipelined for faster Clock speeds.
Copyright © 1995, 1996 Altera Corporation, 2610 Orchard Parkway,
San Jose, California 95134, USA, all rights reserved.