Implementation of distributed arithmetic-based symmetrical 2-D block finite impulse response filter architectures

Background: This paper presents an efficient two-dimensional (2-D) finite impulse response (FIR) filter that uses block processing, for two different coefficient symmetries. Architectures for a general filter (without symmetry) and two symmetrical filters (diagonal and quadrantal symmetry) are implemented. The proposed architectures need fewer multipliers because of the symmetry of the filter coefficients. Methods: A distributed arithmetic (DA)-based multiplication method is used in the proposed architecture. A dual-port memory-based lookup table (DP-MLUT) is used in the multiplication instead of a conventional lookup table (LUT) to reduce the area and power of the FIR filter. The filter's throughput is increased by block processing. Memory reuse and memory sharing are introduced, which reduce the number of registers required and hence the circuit complexity. The architectures are written in Verilog Hardware Description Language and synthesized using Genus Synthesis tool-19.1 in 45 nm technology with a generic library of Cadence vendor constraints. The synthesis tool generates the area, delay, and power reports. The power consumption of the architectures is calculated for an image size of 64 × 64 at a frequency of 20 MHz. Results: Compared to existing architectures, the synthesis results show improvements in power, area, area delay product (ADP), and power delay product (PDP). The proposed MLUT-based 2-D block Quadrantal Symmetry Filter (QSF) of length 8 with block size 4 consumes 58.94% less power, occupies 59.5% less area, and has 48.44% less ADP and 47.78% less PDP than the best existing methods. Conclusions: A novel DA-based 2-D block FIR filter architecture with various symmetries is realized. Symmetry is incorporated into the filter coefficients to minimize the number of multipliers. The LUT size is optimized by storing only odd multiples or only even multiples. The overall area of the architecture is further decreased by DP-LUT-based multipliers. The proposed filter architecture is area- and power-efficient. It is best suited for applications that have fixed coefficients.


Introduction
Many image and video processing applications, including image enhancement, template matching, image restoration, and video communication, use 2-D digital filters [1,2]. Finite impulse response (FIR) filters are preferred over infinite impulse response (IIR) filters when numerical stability, ease of design, and linear phase are the primary concerns [1]. Because 2-D FIR filters need numerous computations, designing an efficient structure is challenging. In [1], Parhi proposed a systolic structure for a 2-D FIR filter and suggested many techniques to optimize the implementation of 1-D and 2-D FIR and IIR filter architectures with more computational blocks. Block-based 2-D FIR filter banks consisting of separable and non-separable architectures with a significant reduction in memory are described in [3,4]. In [3], conventional multipliers, which are power-hungry, are used for the convolution of input samples and filter coefficients, and the internal architectures of symmetry filters are not considered.
Power-efficient and memory-efficient 2-D FIR filter architectures (FIRAs) are constructed with high-speed multipliers and a parallel-prefix modified carry look-ahead adder (MCLAA) [5]. A low-area, memory-based, non-symmetry 2-D FIRA is proposed with a new multiplication technique [6]. Neither of these works considers coefficient symmetry. Coefficient symmetry reduces the arithmetic computations in the systolic filter architecture [7,8]. Low-power multimode architectures for 2-D IIR filters are designed and implemented with four symmetries. Critical path analysis is addressed for symmetry filters, but the architectures are implemented only for single-input processing. Another single-input quadrantal symmetry filter is implemented using the 2-D L1-technique to minimize the filter coefficients and hardware blocks [9]. Recently, Chowdari et al. [24][25][26] proposed efficient implementations of DA-based adaptive filters.
Mohanty et al. [10] proposed a 1-D block filter for narrowband applications using a distributed arithmetic (DA)-based reconfigurable filter for the software-defined radio (SDR) channelizer. They introduced the memory-sharing concept to implement a 1-D FIR filter with low area, power, and delay. Several authors have implemented only DA-based 1-D filters. In recent years, DA techniques have gained importance in FIR filter implementation because they reduce architectural complexity while providing high throughput and regularity. Kumar et al. [11,12] recently proposed block-based 2-D FIR and IIR filter architectures using DA with a memory-sharing approach but did not discuss the symmetry of coefficients. DA-based FIRAs are described in [13,14], which also review DA methods for cost-effective and efficient FIRAs. Park et al. [15] suggested a reconfigurable FIR architecture using DA.
In all of these DA-based filter implementation schemes, the authors focused only on reducing the number of adders and the multiplier complexity. Memory complexity is also a key factor in filter design because it affects power consumption and area. Many researchers have addressed 1-D and 2-D filters using symmetry or block processing [16,17]. A few researchers have realized filter structures with lookup table (LUT)-based or DA multipliers, but without block processing or symmetry.
A new approach to memory-based DA multiplication is proposed by Meher et al. [18,19]. This memory-based LUT (MLUT) multiplication approach is used to realize 1-D FIR filters, and a comparison with conventional multiplier-based filter architectures is presented. Vinitha et al. [6,20] also developed LUT-based multiplication and incorporated it into filter architectures with fewer hardware blocks. Chiper et al. [21] suggested the dual-port concept in MLUT-based DA multiplication in place of single-port LUT (SPLUT) multipliers. Modified memory-based multipliers are used by Sharma et al. [22] to implement an efficient filter architecture. Alawad et al. [23] presented a stochastic-based 2-D FIRA with low hardware complexity and high throughput. Their non-separable systolic 2-D FIRA uses the probabilistic convolution theorem and operates within a predetermined accuracy range. The probability density function represents the 2-D input signal kernels, and the well-known probabilistic convolution theorem replaces the expensive multipliers with simple adders. The memory storage complexity is also reduced by memory sharing and memory reuse. This approach is more suitable for applications such as perception-based image processing, which can inherently tolerate some computing inaccuracy.
These points motivate the development and implementation of block-based 2-D FIRAs using various symmetries and multiplier-less DA-based approaches. In this research, two types of symmetry, diagonal symmetry and quadrantal symmetry, are considered to reduce the number of multipliers. Block processing in symmetry filters increases the number of adders, but this is an acceptable cost because multipliers are more complex than adders. A novel MLUT multiplication approach is introduced in the 2-D block FIRAs. Two symmetry filters and one non-symmetry filter are explored. Conventional multipliers are replaced with MLUT multipliers to decrease each filter's power consumption, delay, and area.
The paper is organized as follows. The background section discusses the two types of symmetry and an optimized memory-based multiplication approach for 2-D FIRAs. The next section describes the proposed 2-D FIRAs and explores the individual symmetry filter architectures, which use block processing and enhanced dual-port memory-based LUT (DP-MLUT) multipliers.

Background: Block-based design and symmetry of 2-D FIR filters
This section explains the coefficient symmetry concepts and the MLUT multiplication approach that replaces conventional multipliers.

Block processing and memory reuse
In digital filters, block processing increases the throughput of the architecture. If the input block size is N, the filter produces N outputs per iteration, which means that the throughput increases N times. The input matrix X(n1, n2) is needed at different systolic stages to generate the 2-D filter output Y(n1, n2), where L is the length of the filter [3] (Eq. 1). The (n1, n2)th output of the filter is expressed as in [3] (Eq. 3). Thus, the 2-D FIR filter block output at each systolic stage is expressed as in [11] (Eq. 4), together with the filter coefficient vector required at each stage [11].

Each iteration of the 2-D block FIRA needs the parallel calculation of a block of input samples and produces a block of outputs. At each systolic stage, a set of L − 1 delayed inputs is required to generate a block of input. The input pixels at each stage form the matrix G(n1, n2), which is given in matrix form in [11], with entries such as x(n1 − p, n2 − L).

To facilitate parallelism, the input pixel matrix G(n1, n2) and the coefficient vector are further decomposed by a factor of S. The input pixel matrix G(n1, n2) is decomposed into L/S submatrices X_p^q of dimension (N × S), and the coefficient vector is decomposed correspondingly [11] (Eq. 8). Equation (7) is then rewritten in terms of these submatrices (Eq. 10).
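The throughput claim for block processing can be illustrated with a small software model. The sketch below assumes the standard 2-D FIR convolution sum, y(n1, n2) = Σ_i Σ_j h(i, j) x(n1 − i, n2 − j), with zero padding outside the image; this is taken to be what the output equation denotes, but the exact indexing of the original equations is not recoverable here. The function names `fir2d_direct` and `fir2d_block` and the test data are hypothetical, and Python is used only as an executable illustration, not as the Verilog implementation.

```python
# Sketch: direct 2-D FIR filtering versus block processing with N outputs per iteration.
# Assumptions: zero padding outside the image, causal indexing, list-of-lists data.

def fir2d_direct(x, h):
    """y[n1][n2] = sum_i sum_j h[i][j] * x[n1-i][n2-j], with zeros outside the image."""
    rows, cols, L = len(x), len(x[0]), len(h)
    y = [[0] * cols for _ in range(rows)]
    for n1 in range(rows):
        for n2 in range(cols):
            acc = 0
            for i in range(L):
                for j in range(L):
                    if n1 - i >= 0 and n2 - j >= 0:
                        acc += h[i][j] * x[n1 - i][n2 - j]
            y[n1][n2] = acc
    return y


def fir2d_block(x, h, N):
    """Same filter, but each iteration emits a block of N adjacent outputs of one row."""
    rows, cols, L = len(x), len(x[0]), len(h)
    y = [[0] * cols for _ in range(rows)]
    iterations = 0
    for n1 in range(rows):
        for start in range(0, cols, N):
            iterations += 1  # one iteration yields up to N outputs
            for n2 in range(start, min(start + N, cols)):
                y[n1][n2] = sum(
                    h[i][j] * x[n1 - i][n2 - j]
                    for i in range(L)
                    for j in range(L)
                    if n1 - i >= 0 and n2 - j >= 0
                )
    return y, iterations


# Hypothetical 8 x 8 image and 4 x 4 coefficient matrix (L = 4).
image = [[(3 * r + 5 * c) % 7 for c in range(8)] for r in range(8)]
coeffs = [[1, 2, 3, 4], [2, 5, 6, 7], [3, 6, 8, 9], [4, 7, 9, 10]]

y_ref = fir2d_direct(image, coeffs)
y_blk, n_iter = fir2d_block(image, coeffs, N=2)
```

With N = 2, the block version produces the same 64 outputs in half as many iterations as a one-output-per-iteration filter, which is the N-times throughput gain described above. In hardware the N outputs of one iteration are computed in parallel; the inner loop here only models that grouping.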

Symmetry concepts of 2-D FIR filter structure
The symmetry concept is used to reduce the number of complex multipliers. In this paper, two types of symmetry for 2-D FIRAs, the Diagonal Symmetry Filter (DSF) and the Quadrantal Symmetry Filter (QSF) [7,8], are studied and explored. The following transfer functions are used to design the two types of symmetry in the 2-D FIRA.
where z1 = e^{jθ1} and z2 = e^{jθ2}. The filter coefficient symmetry is given by h_{ij} = h_{(L−i)j} for all i, j. Equation (12) expresses the transfer function of the filter [17] (Eq. 12). The general filter coefficients and the two types of symmetry coefficient matrices are shown in Figure 1.
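The effect of each symmetry on the number of distinct coefficients can be checked by enumeration. The sketch below is a hedged illustration: it assumes 0-based indices, so the row-mirror relation is written as h[i][j] = h[L−1−i][j], and it assumes diagonal symmetry means h[i][j] = h[j][i]. The function name `unique_coefficients` is hypothetical.

```python
# Sketch: enumerate the distinct coefficient positions under each assumed symmetry.

def unique_coefficients(L, symmetry):
    """Return the sorted list of representative (i, j) index pairs for an L x L filter."""
    seen = set()
    for i in range(L):
        for j in range(L):
            if symmetry == "diagonal":
                # Assumed relation: h[i][j] == h[j][i]
                key = (min(i, j), max(i, j))
            elif symmetry == "quadrantal":
                # Assumed relation (0-based): h[i][j] == h[L-1-i][j]
                key = (min(i, L - 1 - i), j)
            else:
                # General filter: every coefficient is distinct
                key = (i, j)
            seen.add(key)
    return sorted(seen)


general = unique_coefficients(4, "general")
dsf = unique_coefficients(4, "diagonal")
qsf = unique_coefficients(4, "quadrantal")
```

Under these assumptions, a 4 × 4 filter has 16 distinct coefficients without symmetry, 10 with diagonal symmetry, and 8 with the row-mirror form of quadrantal symmetry, so the number of coefficient multiplications falls accordingly.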
The proposed work implements two efficient symmetrical 2-D FIRAs and one generic filter architecture. Because of the symmetry of the filter coefficients, fewer multipliers are needed to design the filter.

The LUT-DA multiplication process
In memory-based multiplication, a LUT is treated as memory, and the precomputed products of the filter coefficient are saved in the LUT. DA multiplication is the process of shifting and accumulating LUT output values. The input sample is multiplied by the coefficient by reading the corresponding precomputed product from the LUT. For a binary input of word length w bits and a coefficient of c bits, there are 2^w possible products. Standard LUT-based multiplication therefore requires 2^w words to save the precomputed partial products in the LUT.
Even multiples can be obtained from odd multiples stored in memory using left-shift operations. This approach therefore uses only 2^w/2 words to save the odd multiples of the coefficient C. It is shown for a w = 4-bit input sample in Table 1. In this table, the eight address locations store the odd multiples of the coefficient C: C, 3C, 5C, 7C, 9C, 11C, 13C, and 15C. The even multiples 2C, 4C, and 8C are obtained by left-shifting C by one, two, and three positions, respectively. Next, 6C and 12C are produced by left-shifting 3C by one and two positions; 10C is derived from 5C, and 14C is derived from 7C, each by a single left shift. The product for the all-zero input sample x = 0000 is produced by resetting the LUT output.
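The odd-multiples storage scheme can be modeled in a few lines. This is a minimal sketch rather than the hardware design: the function names are hypothetical, and the address mapping (odd multiple (2k + 1)C stored at address k) is an assumption, since Table 1 is not reproduced here.

```python
# Sketch: 4-bit LUT multiplication that stores only the eight odd multiples of C.

def build_odd_lut(C):
    """Eight words holding C, 3C, 5C, ..., 15C (assumed address k holds (2k+1)*C)."""
    return [(2 * k + 1) * C for k in range(8)]


def decode(x):
    """Split a nonzero 4-bit x into (LUT address, left-shift count), x = odd << shift."""
    shift = 0
    while x % 2 == 0:
        x >>= 1
        shift += 1
    return (x - 1) // 2, shift


def lut_multiply_odd(x, lut):
    """Return x * C for 0 <= x <= 15 using one LUT read and at most three shifts."""
    if x == 0:
        return 0  # models the reset path for the all-zero input
    addr, shift = decode(x)
    return lut[addr] << shift


lut = build_odd_lut(C=13)              # 13 is an arbitrary example coefficient
product = lut_multiply_odd(12, lut)    # 12C is read as 3C shifted left twice
```

The `decode` step plays the role of the encoder and shift-control logic: every nonzero 4-bit input is an odd number shifted left by zero to three positions, which is why eight words and a small shifter are enough.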
The single-port MLUT-DA multiplier is realized with reference to Table 1, as shown in Figure 2A. The structure has one 4-to-3 encoder block, one 3-to-8 decoder block, and one control logic block that produces the reset (RST) signal and the control lines {S0, S1}, which select the shifts required to compute the even multiples of the coefficient: 2C, 4C, 6C, 8C, 10C, 12C, and 14C. A maximum of three shifts is required, so two control bits are used in the structure.
Using a control logic block, RST is formed from the applied input sample. The LUT holds the eight odd multiples of the coefficient, each (c + 4) bits wide; the extra 4 bits are needed to hold the highest odd multiple, 15C, which is precomputed and stored in the LUT. The location selected by the decoder output is read and fed to the NOR cell, which is made up of (c + 4) NOR gates with RST as one common input. The NOR cell outputs are shifted by a barrel shifter according to the control signals {S0, S1} from the control logic. The barrel shifter has 2 × (c + 4) AOI (AND-OR-INVERT) gates or 2 × 1 multiplexers (MUXs). Finally, the barrel shifter output is the product of the input sample and the coefficient.
where A2A1A0 are the address bits derived from the input bits x3x2x1x0. The control logic signals (RST, S0, and S1) are given by equations (16), (17), and (18). In very large scale integration (VLSI) design, conventional multipliers consume more power and occupy more area, whereas LUT-based multipliers save area and power. Hence, a further reduction in hardware is achieved by LUT-DA multipliers.
In the modified multiplier, only the even multiples of the coefficient are saved in the LUT. Hence, only 2^w/2 words are required instead of all 2^w words. An even multiple is converted into the next odd multiple by adding the filter coefficient once. The barrel shifter and encoder blocks are not required for this modified multiplier; instead, one 2 × 1 MUX is required to choose between the odd and even multiples. Table 2 depicts the even-multiples storage technique for w = 4. The even multiples of the constant coefficient, {0, 2C, 4C, …, 12C, 14C}, are precomputed and saved in the eight LUT locations, which are addressed by {x3 x2 x1} through the 3-to-8 decoder. One input of the 2 × 1 MUX is the even output of the LUT, and the other input is the odd output from the adder. The select line of the MUX is the least significant bit (LSB) of the input sample, x0, so whether an even or odd multiple is selected depends on the LSB of the input sample. Figure 2B represents the modified LUT-based multiplier.
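The even-multiples variant maps directly onto a table read, one addition, and a two-way selection. The sketch below models that data path in Python under the addressing described above (the LUT is addressed by the upper three bits of the input); the function names are hypothetical.

```python
# Sketch: 4-bit LUT multiplication that stores only the eight even multiples of C.

def build_even_lut(C):
    """Eight words holding 0, 2C, 4C, ..., 14C, addressed by the bits x3 x2 x1."""
    return [2 * k * C for k in range(8)]


def lut_multiply_even(x, C, lut):
    """Return x * C for 0 <= x <= 15 using one LUT read, one adder, and a 2:1 MUX."""
    even = lut[x >> 1]   # LUT output selected by x3 x2 x1
    odd = even + C       # adder output: the next odd multiple
    return odd if (x & 1) else even  # MUX controlled by the LSB x0


even_lut = build_even_lut(C=13)                    # 13 is an arbitrary example coefficient
product_even = lut_multiply_even(11, 13, even_lut)  # 11C = 10C + C
```

Compared with the odd-multiples model, no shift decoding is needed and the all-zero input needs no reset path, because 0 is simply the first stored word.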
Likewise, odd multiples of the coefficient can be saved in an improved LUT-DA multiplier, with a subtractor generating the required even multiples. In the proposed work, the SPLUT multiplier is converted into a DPLUT multiplier using the DA approach. When the input sample has more bits, the dual-port memory helps decrease the LUT size. A DP-MLUT-based multiplier multiplies a common filter coefficient simultaneously with two separate input samples. The following section explains how the proposed filters use an improved MLUT-DA multiplier with even-multiples storage.
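The idea of one coefficient table serving two input samples at once can be sketched as a small class. This is an illustrative software model only: the class name `DualPortLUT` and its methods are hypothetical, even-multiples storage for w = 4 is assumed, and the two "ports" are simply two independent reads of the same list.

```python
# Sketch: one shared even-multiples LUT read through two ports for two input samples.

class DualPortLUT:
    def __init__(self, C):
        self.C = C
        self.words = [2 * k * C for k in range(8)]  # stored once, shared by both ports

    def read(self, addr_a, addr_b):
        """Model two simultaneous reads of the same memory."""
        return self.words[addr_a], self.words[addr_b]

    def multiply_pair(self, xa, xb):
        """Return (xa * C, xb * C) for two 4-bit samples sharing the coefficient C."""
        even_a, even_b = self.read(xa >> 1, xb >> 1)
        out_a = even_a + self.C if (xa & 1) else even_a
        out_b = even_b + self.C if (xb & 1) else even_b
        return out_a, out_b


dplut = DualPortLUT(C=13)           # 13 is an arbitrary example coefficient
pair = dplut.multiply_pair(5, 14)   # two samples of one input block
```

Two single-port multipliers would each hold their own copy of the eight words; the dual-port model holds the table once, which is the source of the memory saving described above.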

Proposed architectures of block-based 2-D FIR filters
The block-based 2-D FIRA is shown in Figure 3.

General filter structure with MLUT multipliers
The arithmetic module of the general filter architecture is realized with L Processing Units (PUs) and an Adder Tree (AT) block, and it receives LN samples from each Delay Unit Block (DUB).
Each PU block is constructed from N Product Cells (PCs), which multiply the input samples by the corresponding filter coefficients. Generally, the product is computed by conventional multipliers; here, MLUT multipliers are used in place of these power-hungry conventional multipliers. Finally, the AT adds the outputs of the PU blocks and generates the N filter outputs corresponding to the block of N inputs.

This general filter architecture is modified with DP-MLUT multipliers, as presented in Figure 5. In this architecture, the inputs that are multiplied by a common filter coefficient are given to one DPLUT multiplier. Hence, a total of (L × L) DPLUT multipliers are needed to multiply the input samples by the filter coefficients for L = 4. In this work, the number of multipliers is further decreased by symmetry in the filter coefficients. Two different symmetries are described in this section, and these symmetry filters can be used to design circular-symmetry, fan-type, and diamond filters.

Structure of 2-D FIR Diagonal Symmetry Filter (DSF)
In the DSF coefficient matrix, the sixteen coefficients are reduced to ten, namely {h00, h01, h02, h03, h11, h12, h13, h22, h23, h33} for L = 4, as shown in Figure 1B. Figure 6 represents the arithmetic module of the DSF-based 2-D FIRA, which is designed using diagonal symmetry. Before the multiplication, the input samples that are to be multiplied by a common filter coefficient are added together.
For one input of the N = 2 block, seven adders are required to accumulate the symmetric input samples. The seven highlighted (colored) adders in the figure are the adders for the other input sample. An adder is a simpler block than a multiplier. Hence, half of the area is optimized. Finally, all the multiplier outputs are accumulated by N AT blocks to produce N outputs.
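The pre-addition step can be checked numerically. The sketch below computes one output sample from a 4 × 4 window of inputs in two ways: directly with 16 multiplications, and with the samples that share a coefficient added first. It assumes diagonal symmetry means h[i][j] = h[j][i]; the function names and the example data are hypothetical, and `win[i][j]` stands for the input sample that is multiplied by h[i][j].

```python
# Sketch: pre-adding symmetric input samples so each unique coefficient is used once.

def window_output_direct(h, win):
    """Reference: one output sample with L*L multiplications."""
    L = len(h)
    return sum(h[i][j] * win[i][j] for i in range(L) for j in range(L))


def window_output_dsf(h, win):
    """Diagonal symmetry assumed (h[i][j] == h[j][i]): add first, then multiply once."""
    L = len(h)
    acc, mults = 0, 0
    for i in range(L):
        for j in range(i, L):
            # Off-diagonal pairs share a coefficient, so their samples are pre-added.
            s = win[i][j] if i == j else win[i][j] + win[j][i]
            acc += h[i][j] * s
            mults += 1
    return acc, mults


h_sym = [[1, 2, 3, 4], [2, 5, 6, 7], [3, 6, 8, 9], [4, 7, 9, 10]]  # symmetric example
window = [[1, 0, 2, 3], [4, 5, 6, 7], [8, 9, 1, 2], [3, 4, 5, 6]]

y_direct = window_output_direct(h_sym, window)
y_dsf, mult_count = window_output_dsf(h_sym, window)
```

For L = 4 the pre-added form uses ten multiplications instead of sixteen and gives the same result, at the cost of the extra input adders.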
Because multipliers are responsible for most of the power consumption, DPLUT-based multipliers are used to optimize them. The DSF architecture for L = 4 with N = 2 would need 20 individual SPLUT multipliers; with dual-port LUTs, only ten DP-MLUT multipliers are needed to produce the N = 2 outputs of the diagonal symmetry filter.
DPLUT decreases the LUT size for input samples with greater bit lengths by adding an additional shifter. Because of parallel block processing, two inputs are multiplied by a common filter coefficient in a 2-D FIR filter. This allows two SPLUT multipliers to be replaced with a single DPLUT multiplier. The internal structures of the conventional and the modified DPLUT-based multipliers are shown in Figures 7A and 7B, respectively.
The odd multiples of the common filter coefficient are precomputed and placed in the LUT memory. The address of the LUT location is determined from the input bits by the address encoder and address decoder. The DPLUT fetches the locations addressed on its two ports and provides two parallel outputs. Each output then passes through its NOR cell and is shifted by a barrel shifter. The control lines for shifting are generated from the input sample bits by the control logic, as explained earlier.
This conventional DPLUT-DA multiplier is revised to use even-multiples storage in the LUT, as shown in Figure 7B for a w = 4-bit input sample. It can be observed that the modified even-multiples-storage LUT-DA multipliers need less memory and area. The RST control logic, barrel shifter, NOR cell, 4-to-3 encoder, and barrel shifter control signals {S0, S1} are not needed in the enhanced DPLUT multiplier, which reduces the area further.
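One plausible reading of the statement that a dual-port LUT reduces the LUT size for longer input words, at the cost of an additional shifter, is the usual shift-and-accumulate decomposition: an 8-bit sample is split into two 4-bit halves, both halves address the same small table, and the upper partial product is shifted before the final addition. The sketch below illustrates that reading only; it is an assumption about the scheme, not a description confirmed by the figures, and the function name is hypothetical.

```python
# Sketch (assumed scheme): multiply an 8-bit sample using one 8-word even-multiples LUT.

def multiply_8bit_via_nibbles(x, C):
    """Return x * C for 0 <= x <= 255 from two 4-bit lookups and one shift-add."""
    lut = [2 * k * C for k in range(8)]  # 8 words instead of 2**8 / 2 = 128

    def nibble_product(n):
        even = lut[n >> 1]
        return even + C if (n & 1) else even

    low = x & 0xF           # lower 4 bits, read on one port
    high = (x >> 4) & 0xF   # upper 4 bits, read on the other port
    # The upper partial product carries a weight of 2**4, hence the extra shifter.
    return (nibble_product(high) << 4) + nibble_product(low)


example_product = multiply_8bit_via_nibbles(0xB7, 13)  # 183 * 13
```

Under this assumption, a single table sized for w = 4 serves w = 8 inputs, which would explain why the memory does not have to grow exponentially with the word length.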
The conversion of an SPLUT into a DPLUT is a critical task. The common filter coefficient multiples stored in the LUT must be shared by two inputs simultaneously. For this purpose, control logic tied to the clock signal is introduced to select the address locations with a slight delay. Figure 8 represents the control logic using multiplexers for a DPLUT-DA multiplier.

Structure of a 2-D FIR Quadrantal Symmetry Filter (QSF)
The QSF has eight unique filter coefficients, {h00, h01, h02, h03, h10, h11, h12, h13}. Figure 9 represents the architecture of the QSF for L = 4 with N = 2. A total of 16 SPLUT multipliers would be needed for this structure; it is instead built with eight DPLUT multipliers to produce the N block outputs.
The summary of the number of single-port and dual-port multipliers needed for each symmetry is presented in Table 3.
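The multiplier counts quoted in this section follow a simple pattern, which the sketch below reproduces. It assumes one single-port multiplier per unique coefficient per block output, and one dual-port multiplier per unique coefficient for every pair of block outputs; the generalization beyond N = 2 is an assumption, and the function name is hypothetical.

```python
# Sketch: single-port versus dual-port multiplier counts for each filter type.

def multiplier_counts(unique_coeffs, N):
    """Return (SPLUT count, DPLUT count) for a block size of N."""
    splut = unique_coeffs * N
    # Assumption: each dual-port LUT serves two samples that share a coefficient.
    dplut = unique_coeffs * ((N + 1) // 2)
    return splut, dplut


# Unique coefficient counts for L = 4, as stated in the text.
counts = {
    "general": multiplier_counts(16, N=2),
    "DSF": multiplier_counts(10, N=2),
    "QSF": multiplier_counts(8, N=2),
}
```

For L = 4 with N = 2 this gives 32 and 16 multipliers for the general filter, 20 and 10 for the DSF, and 16 and 8 for the QSF, matching the figures given in the text.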

Experiment/validation
This section analyzes the implementation and results for the proposed 2-D FIRAs. The architecture of the proposed filters is constructed from multipliers, registers, and adders. The complexity of the hardware blocks depends on the number of bits in the filter input samples, the filter length L, the input block size N, and the filter coefficients.

Table 3. The multiplier count for constructing various symmetry filters for L = 4 with N = 2.

Results and discussion
The data associated with the results are available in Underlying data [27]. Table 4 presents the synthesis results of the two symmetry 2-D FIR filters and the general filter for L = 4 with N = 2.
The power consumption, delay, and area results are shown as graphs in Figures 10, 11, and 12, respectively. The proposed DPLUT-based 2-D FIR DSF architecture needs 20.54% and 5.84% less area than the normal-multiplier and SPLUT-multiplier-based filter architectures, respectively. It saves 34.2% and 3.2% of power compared to the normal-multiplier and SPLUT-based filter architectures, respectively, and has 27.9% and 10.4% less delay. Similarly, for the proposed QSF 2-D FIRA, power is decreased by 44% and 20%, area by 33% and 18.6%, and delay by 24.7% and 5.3% compared to the normal and SPLUT multipliers, respectively.
The 2-D FIR filter architectures with two symmetries and one general filter are implemented using block processing and dual-port memory-based multipliers. Memory reuse is applied to obtain the filter outputs, which saves memory. The VLSI performance metrics of the proposed filters, namely area, delay, and power, are compared for input word lengths w = 4 and 8 in Table 5.
For w = 4, the area of the proposed DSF and QSF architectures is 24.1% and 39.3% lower, respectively, than that of the general 2-D FIRA. The DSF and QSF consume 9.9% and 27.8% less power, respectively, than the general filter architecture.
The delay values of the DSF, QSF, and general filter are almost the same. Figure 13 represents the comparison of area, delay, and power consumption of the proposed DPLUT-based 2-D FIRAs for w = 4 and 8 (DSF, diagonal symmetry filter; QSF, quadrantal symmetry filter; LUT, lookup table). It can be observed that area, delay, and power all increase when the number of input sample bits increases. The proposed symmetry 2-D FIR filters with DPLUT-based multipliers are also compared to previous works. The performance metrics obtained from the synthesis tool are tabulated in Table 6.
The proposed filter architecture is extended to L = 8 with N = 4 and compared with state-of-the-art works in Table 6. It can be observed that the proposed architecture improves on the existing architectures in terms of power, area, ADP, and PDP.
A graphical comparison of the area, power consumption, delay, ADP, and PDP of the proposed structure with the existing filter architecture for L = 8 with N = 4 is shown in Figure 14.

Conclusions
This paper implements two novel symmetry 2-D block FIRAs, using QSF and DSF, and one general filter (without symmetry) with DP-MLUT-based multipliers. The conventional multipliers are replaced with MLUT multipliers, and DP-MLUT-based multipliers save power and area compared to single-port MLUT-based multipliers. The individual symmetry filters are implemented using fewer multipliers. Block processing is used to achieve memory reuse.

This project contains the following underlying data:
- Data set.xlsx (testbench that defines the operational relation between input and output, results for different values of block length, and the results table in which existing and proposed models are compared).
- larea_DSF.rep (area report of DSF_2D FIR filter; a tool-generated report downloaded while executing).
- larea_QSF.rep (area report of QSF_2D FIR filter; a tool-generated report downloaded while executing).
- lpower_DSF.rep (power report of DSF_2D FIR filter; a tool-generated report downloaded while executing).
- lpower_QSF.rep (power report of QSF_2D FIR filter; a tool-generated report downloaded while executing).
- ltiming_DSF.rep (timing report of DSF_2D FIR filter; a tool-generated report downloaded while executing).
- ltiming_QSF.rep (timing report of QSF_2D FIR filter; a tool-generated report downloaded while executing).
Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).
One ingenious way to lower the requirement for registers and, hence, circuit complexity is to implement memory reuse and memory sharing solutions. The suggested architectures' practicality is greatly enhanced by this method. The benefits of the suggested designs in terms of power, area, area delay product (ADP), and power delay product (PDP) are well demonstrated by the comparison data presented in the study. These are significant advancements that have potential for many different uses.
The synthesis in a 45 nm technology environment demonstrates how the suggested designs may be used in practical settings. Particularly with the Quadrantal Symmetry Filter (QSF), the power and area savings are noteworthy. The appropriateness of the suggested filter design for applications with fixed coefficients is emphasized in the paper's conclusion. This understanding of possible uses makes the research more useful.
The corresponding author is requested to provide justifications for these queries:
1) Please explain how the area (in square micrometers) was calculated.
2) The proposed filter was designed with L = 8 and N = 4; it can be further extended to 32-bit with 64 taps.
3) A comparison of the proposed results with existing results in a table format would be appreciated.

Department of ECE, Aditya Institute of Technology and Management, Srikakulam, Andhra Pradesh, India
1. This paper presents a well-structured and highly efficient approach to 2-D finite impulse response (FIR) filter design, focusing on symmetry-based optimizations. The incorporation of various symmetries to minimize multipliers is a commendable strategy.

2. The utilization of distributed arithmetic (DA) and the dual-port memory-based lookup table (DP-MLUT) in multiplication is a noteworthy innovation. It not only reduces area and power consumption but also enhances the overall efficiency of the FIR filter.
3. The introduction of memory reuse and memory sharing methods is a clever technique to reduce the need for registers and, consequently, circuit complexity. This approach contributes significantly to the practicality of the proposed architectures.
4. The comparative results provided in the paper clearly demonstrate the advantages of the proposed architectures in terms of power, area, area delay product (ADP), and power delay product (PDP). These improvements are substantial and hold promise for a wide range of applications.
5. The synthesis results in a 45 nm technology environment showcase the real-world applicability of the proposed architectures. The power and area savings, especially in the Quadrantal Symmetry Filter (QSF), are quite impressive.
6. The paper's conclusion emphasizes the suitability of the proposed filter architecture for applications with fixed coefficients. This insight into potential applications adds practical value to the research.
7. Overall, this paper offers a valuable contribution to the field of FIR filter design. The innovative approaches, clear presentation, and substantial improvements in efficiency make it a significant advancement in the area of signal processing.


Figure 2. (A) Structure of the conventional Memory-based Lookup Table (MLUT) multiplier for odd multiples. (B) Modified MLUT multiplier for even multiples. MUX, multiplexer; RST, reset.

The block-based 2-D FIRA in Figure 3 is for L = 4 with N = 2 without any symmetry and is considered a general filter. The input samples {x_0^k, x_1^k} are from the same row of the image input matrix and are given to the shift register unit (SRU) array in serial order, block by block and row by row. The SRU array contains (L − 1) SRUs, each with L shift registers of M words. Here, SRU1 is termed {SR1, SR2}. Likewise, SRU2 and SRU3 are placed in array form, making up the SRU array for order L = 4. Each of the L Delay Unit Blocks (DUBs) produces NL samples by applying each set of past and present N samples to the total L sets. The input block of L input samples from the (M × M) image matrix is applied as the present input. The (L − 1) SRU array receives these parallel inputs. Figure 4A represents the structure of an SRU using L registers. The present and past input sample blocks are applied to the N DUBs of the DUB array. Each DUB consists of (L − 1) flip-flops and produces the present and past samples required for block processing. As shown in Figure 4B, each DUB generates LN samples. The L DUBs give L × LN input samples to the filter's arithmetic module.

Structures of block-based symmetric 2-D FIR filter arithmetic modules
This section explores two symmetry-type 2-D FIR filters and one general filter of L = 4 with N = 2.
2(L × L) multipliers are needed if SPLUT-based multipliers are used. The DPLUT-based multipliers save 50% of the area compared to SPLUT multipliers. Each DPLUT-based multiplier produces L filter outputs. In total, the L memory multipliers generate (L × L)N outputs, which are added in parallel by N AT blocks to give N outputs, each (c + 4) bits wide.

Figure 3. Conventional 2-D finite impulse response filter architecture (FIRA) for L = 4 with N = 2. PU, Processing Unit; X_m^k, input; Y_m^k, output.

Figure 6. Structure of a diagonal symmetry 2-D Finite Impulse Response (FIR) filter with dual-port lookup table (DPLUT) multipliers. SRU, shift register unit; DUB, Delay Unit Block; LUT, lookup table.

Figure 10. Power consumption comparison of the proposed 2-D Finite Impulse Response filters with different multiplier techniques. DSF, diagonal symmetry filter; QSF, quadrantal symmetry filter; LUT, lookup table.

Figure 11. Delay comparison of the proposed 2-D Finite Impulse Response filters with different multiplier techniques. DSF, diagonal symmetry filter; QSF, quadrantal symmetry filter; LUT, lookup table.

Figure 12. Area comparison of the proposed 2-D Finite Impulse Response filters with different multiplier techniques. DSF, diagonal symmetry filter; QSF, quadrantal symmetry filter; LUT, lookup table.

Figure 14. Area, delay, power consumption, ADP, and PDP comparison of the proposed DSF and QSF architectures with existing architectures for L = 8 with N = 4.

Is the work clearly and accurately presented and does it cite the current literature? Yes
Is the study design appropriate and is the work technically sound? Yes
Are sufficient details of methods and analysis provided to allow replication by others? Yes
If applicable, is the statistical analysis and its interpretation appropriate? Yes
Are all the source data underlying the results available to ensure full reproducibility? Yes
Are the conclusions drawn adequately supported by the results? Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: FIR Filter Design, Distributed Arithmetic, Residue Number System, VLSI Signal Processing
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Table 2. The Memory-based Lookup Table (MLUT) multiplier using even multiples.
The hardware complexity depends on the filter length L, the input block size N, and the filter coefficients. Hence, DSF and QSF symmetry-based 2-D FIR filters are designed and explored to reduce the number of multipliers. Next, the multiplier architectures are optimized by dual-port, even-multiples-storage LUT-based multipliers. The architectures are synthesized using the Genus Synthesis tool-19.1 in 45 nm technology with a generic library of Cadence vendor constraints. Free synthesis tools, such as the Xilinx Integrated Synthesis Environment, can be used instead of the Genus Synthesis tool from Cadence to replicate our methods. The power consumption of the architectures is calculated for an image size of 64 × 64 at 20 MHz. The synthesized results (reports in Underlying data [27]) have been analyzed and compared with the results of the existing architectures. All Verilog code associated with the work is available in Software availability [28].

Figure 8. Dual-port lookup table (DPLUT) control logic. MUX, multiplexer; LUT, lookup table.

Table 6. Comparison of the proposed filters with the existing filter for L = 8 with N = 4. DSF, Diagonal Symmetry Filter; QSF, Quadrantal Symmetry Filter; ADP, area delay product; PDP, power delay product.
On the other hand, the proposed MLUT-based 2-D block QSF filter for L = 8 with N = 4 requires 59.5% less area, consumes 58.94% less power, and has 48.44% less ADP and 47.78% less PDP, but has 27% more delay, compared to the existing HLUT-based 2-D block FIR filter [11]. The 2-D block FIRA using QSF has fewer unique coefficients than the 2-D block FIRA using DSF. Hence, QSF performs well in terms of performance metrics.
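The relation between the reported savings can be checked with the usual definitions ADP = area × delay and PDP = power × delay. The sketch below works with ratios relative to the existing design, using only the percentages quoted above; the function name is hypothetical, and the absolute area, power, and delay values are not needed.

```python
# Sketch: derive ADP and PDP savings from relative area, power, and delay changes.

def product_savings(area_ratio, power_ratio, delay_ratio):
    """Return (ADP saving %, PDP saving %) given ratios of proposed to existing values."""
    adp_ratio = area_ratio * delay_ratio    # ADP = area * delay
    pdp_ratio = power_ratio * delay_ratio   # PDP = power * delay
    return (1 - adp_ratio) * 100, (1 - pdp_ratio) * 100


# 59.5% less area, 58.94% less power, 27% more delay (figures quoted in the text).
adp_saving, pdp_saving = product_savings(
    area_ratio=1 - 0.595,
    power_ratio=1 - 0.5894,
    delay_ratio=1.27,
)
```

The savings computed this way fall within a fraction of a percentage point of the reported 48.44% and 47.78%; the small difference is presumably due to the delay increase being quoted as a rounded 27%. The calculation also shows why ADP and PDP still improve even though the delay is higher: the area and power reductions outweigh the delay penalty.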