Thursday, 14 September 2017

microcontroller - Why does hardware division take much longer than multiplication?


Why does hardware division take so much longer than multiplication on a microcontroller? E.g., on a dsPIC, a division takes 19 cycles, while multiplication takes only one clock cycle.


I went through some tutorials, including Division algorithm and Multiplication algorithm on Wikipedia. Here is my reasoning.


A division algorithm, like a slow division method with restoring on Wikipedia, is a recursive algorithm. This means that (intermediate) results from step k are used as inputs to step k+1, which means that these algorithms cannot be parallelized. Therefore, it takes at least n cycles to complete the division, whereas n is a number of bits in a dividend. For 16-bit dividends, this is equal to at least 16 cycles.



A multiplication algorithm doesn't need to be recursive, which means that it is possible to parallelize it. However, there are many different multiplication algorithms, and I don't have a clue which one may be used by microcontrollers. How does multiplication work on a hardware/microcontroller?


I've found a Dadda multiplier algorithm, which is supposed to take only one clock cycle to finish. However, what I don't get here is that Dadda's algorithm proceeds in three steps, whereas results from step 1 are used in step 2, etc. According to this, this would take at least three clock cycles to finish.



Answer



A divider maps much less elegantly to typical hardware. Take Lattice ICE40 FPGAs as examples.


Let us compare two cases: this 8x8 bit to 16 bit multiplier:


module multiply (clk, a, b, result);
input clk;
input [7:0]a;
input [7:0]b;
output [15:0]result;

always @(posedge clk)
result = a * b;
endmodule // multiply

and this divider that reduces 8 and 8 bit operands to 8 bit result:


module divide(clk, a, b, result);
input clk;
input [7:0] a;
input [7:0] b;
output [7:0] result;

always @(posedge clk)
result = a / b;
endmodule // divide

(Yes, I know, the clock doesn't do anything)


An overview of the generated schematic when mapping the multiplier to an ICE40 FPGA can be found here and the divider here.


The synthesis statistics from Yosys are:


multiply



  • Number of wires: 155


  • Number of wire bits: 214

  • Number of public wires: 4

  • Number of public wire bits: 33

  • Number of memories: 0

  • Number of memory bits: 0

  • Number of processes: 0

  • Number of cells: 191

    • SB_CARRY 10

    • SB_DFF 16


    • SB_LUT4 165




divide



  • Number of wires: 145

  • Number of wire bits: 320

  • Number of public wires: 4

  • Number of public wire bits: 25


  • Number of memories: 0

  • Number of memory bits: 0

  • Number of processes: 0

  • Number of cells: 219

    • SB_CARRY 85

    • SB_DFF 8

    • SB_LUT4 126







It's worth noting that the size of the generated verilog for a full-width multiplier and a maximally-dividing divider aren't that extreme. However, if you'll look at the pictures below, you'll notice the multiplier has maybe a depth of 15, whereas the divider looks more like 50 or so; the critical path (i.e. the longest path that can occur during operation) is what defines the speed!




You won't be able to read this, anyway, to get a visual impression. I think the differences in complexity are possible to spot. These are single cycle multiplier/dividers!


Multiply


Multiply on an ICE40 (warning: ~100 Mpixel image)


Scaled image of the multiplier


Divide


(Divide on an ICE40) (warning: ~100 Mpixel image)



Scaled image of the divider


No comments:

Post a Comment

arduino - Can I use TI's cc2541 BLE as micro controller to perform operations/ processing instead of ATmega328P AU to save cost?

I am using arduino pro mini (which contains Atmega328p AU ) along with cc2541(HM-10) to process and transfer data over BLE to smartphone. I...