S. Roy, A. Asenov, A.R. Brown and J.R. Barker
Department of Electronics and Electrical Engineering,
University of Glasgow, Glasgow, G12 8QQ, Scotland, UK
Topologically rectangular grids of two or three dimensions are particularly attractive for the solution of partial differential equations using parallel computing systems. Their inherently matrix like nature makes them especially amenable to partitioning over pipelined (1D) or mesh connected (2D & 3D) arrays of processors by a physical domain decomposition. This gives a number of advantages in the production of efficient and robust code for the solution of a wide range of simulation problems :
Firstly, the choice of domain decomposition presents the simulation programmer with a simple model mapping interprocessor communications to the exchange of information at partition subdomain boundaries (thus allowing easy coding of a maintainable and portable communications harness) and ensures scalability of the code to large processor arrays.
Secondly, the use of topologically rectangular grids allows the programmer to work within the 'mesh connected processor' paradigm, making use of system communications primitives often dedicated to this task, and giving optimum performance on popular mesh connected parallel processing architectures.
We consider the partitioning of 2D topologically rectangular grids onto a 2D mesh connected array of processors with 4-way connectivity (although we have considered 3D grids and their partitioning [1] and the results presented here are directly extensible to the 3D case). We investigate the efficiency of the most simple rectilinear partitionings and introduce simulated annealing as a method of further optimising such partitionings.
PARTITIONING OF FINITE ELEMENT GRIDS
Although rectilinear finite difference grids are more usually considered, topologically rectangular finite element grids can also be designed for a wide range of simulation problems. As an example, Fig.1 shows such a grid generated for the simulation of a cellular Insulated Gate Bipolar Transistor and partitioned over a 3×3 processor mesh. The grid conforms to the shape of the transistor gate and metallurgical pn junctions.
Fig.1 Top view of finite element grid for an IGBT and its partition over a 3×3 processor mesh.
We wish to map an n×m finite element grid onto an N×M processor mesh where the processors (and thus the partition subdomains) have an enumeration scheme (I,J), and the boundaries of subdomain (I,J) must abut only subdomains (I,J±1) and (I±1,J) . When solving partial differential equations, greatest efficiency is obtained when the largest subdomain size p (i.e number of grid nodes in the largest subdomain), and the largest horizontal and vertical subdomain boundary lengths h,v are at a minimum. Where processing and communications times are proportional to p and h+v then E = ap + b(h+v) must be minimised for optimal speedup. This problem is closely related to that of minimising the ratio between largest and smallest partition sizes and boundary lengths.
The simplest mapping is a rectilinear partitioning of the problem domain [2]. The grid is split into N rows and M columns: the first m-[m/M].M columns of the processor mesh span [(m-1)/M]+1 grid points in the x direction, further columns span [(m-1)/M] grid points ([x] represents the floor function). Rows are partitioned similarly. Fig.2 Shows such a rectilinear partitioning for a 5×5 problem grid and 3×3 processor mesh. This definition of subdomain boundaries has the advantage of being simple to code, however when N,M are not integer divisors of n,m it will not be the most efficient partitioning possible. In this case p=4 is the maximum partition size, h=v=4 are the maximum horizontal and vertical partition wall lengths. The ratio of maximum to minimum partition sizes is 4, implying that the load balancing of such a partition will be very poor.
The effect of imperfections in load and communications balancing in practice are shown in Fig.3. Here the speedup of a parallel Block Newton SOR solver with rectilinear partitioning is shown as a function of grid size (using a grid of side n) when using an 8×8 processor mesh. Deep valleys in the speedup curve are seen when the problem size is not a multiple of 8.
![]() p=4, h=4, v=4 | ![]() |
An exhaustive search for the most efficient partitioning for problem grids of even moderate (20×20) size is computationally prohibitive. Therefore the simple rectilinear partition is transformed by simulated annealing - an efficient way to obtain approximate solutions to large combinatorial problems [3]. Every state (partitioning of the grid) has an associated score function (i.e, the function E above) and transformations which move the problem through state space. Transformations are chosen at random, and are followed only if they decrease the score function (i.e Ef-Ei<0), or if random (0,1) < exp[-(Ef-Ei)/T] (where T is the annealing temperature) [4]. If transformations merely perturb the score function and their probability depends solely on initial and final states, then an appropriate annealing schedule of transformations and T reductions will give a close to optimal final state.
ANNEALING SCHEDULE AND STATE TRANSFORMATIONS
The annealing schedule used is due to Otten and van Ginneken [5]. For the state transformations indicated above, a plot of average score as a function of reducing temperature is found to fall into three distinct areas. At high T > Tc (Tc a problem dependant critical temperature) there is a region of weak control where the score function average <E> decreases hyperbolically, but its deviation sigma remains constant. At lower T there is a region of strong control where <E> drops more slowly but where sigma falls rapidly. Finally, at very low T the reduction in <E> and sigma become problem specific.
To perform an annealing run, initial estimates are made of E
and sigma
, the average score function and its deviation, for a random walk through state space at high T. H
and h
, the global and local accessibilities of state space, are also calculated. For each subsequent temperature, enough state transformations are made to form a Markov chain in equilibrium. Estimates of <E>, sigma, H and h are made and along with the E
, sigma
, H
, h
values are used to update predictions of Tc. In turn, these predictions are used to govern the criteria for T reduction and completion of the annealing process. When the temperature falls into the problem specific regime, annealing is halted and a local minimum of the score function is sought. The formula ultimately controlling temperature is
where sigma' is a smoothed function of previous sigma values.
The state transformations used in our annealing algorithm are shown in Fig.4. Their iterative use allows transformation of the simple rectilinear partitioning into any state which preserves the four way connectivity of the partition subdomains. By design, the length of the external boundary of each subdomain (the number of grid elements in the 1st & n th rows and 1st & m th columns) is not preserved. The transformation set will allow subdomains to divide into multiple unconnected regions as long as the sum of those regions still preserves four way connectivity (although it is unusual for split subdomains to give lower average score functions). With minor housekeeping the transformations deal automatically with edge effects at the external grid boundary.
The transformations may produce a maximum ±1 change in subdomain partition size, or ±2 change in subdomain horizontal or vertical wall length. If the score function is based solely on minimising the maximum partition size and wall lengths then the structure of state space is not suited to simulated annealing. Especially for problem grids of moderate (20×20) size, state space consists of wide areas with no score change, and a few deep, narrow valleys.
To improve the efficiency of annealing, the score function is augmented to include further information on solution quality. An estimate of the variation of the partition sizes (pi ) and wall lengths (hi , vi ) from their optimum (non-integer) values are obtained, and scaled so that their variation due to random state transformations has standard deviation s. When added to the score function this adds further structure to state space and in our problems improves the efficiency of the annealing process by a factor of approximately 20 to 100. We have found s values of 0.5 to 1.0 to be most efficient.
RESULTS AND CONCLUSIONS
Fig.5 compares the annealed partitioning of a 5×5 problem grid onto a 3×3 processor mesh with the rectilinear partitioning of Fig.2. The annealing score function is E = p+h+v plus 'quality of fit' information. This is a pathological case, where the weighting of processing and communication times is equal and where state space is at its most discrete. The advantages of the annealed partitioning in the ratios of partition sizes and boundary lengths can clearly be seen. Fig.6 shows the same process applied to the partitioning of a 19×19 grid.
Finally, Fig.7 shows the theoretical speedup of various problem sizes when partitioned over a 3×3 mesh by both the simple partitioning algorithm and after simulated annealing (again in the pathological case of equal processing and communications time). When N,M are integer divisors of n,m and the ratio of minimum to maximum partition sizes is 1, both methods are equally effective (with the simple rectilinear partitioning easier to code in practice). For other problem sizes, the annealed partitions are clearly more efficient.
In conclusion, we have described the Diophantine problem of partitioning a 2D, topologically rectangular, finite element grid onto a 2D processor mesh while retaining 4-way connectivity. A simple rectilinear partitioning achieves optimal efficiency when the span of the processor mesh is an integer divisor of the problem grid size. When this is not the case, we have described details of a simulated annealing procedure which is effective in improving the efficiency of the partition.
REFERENCES
[1] Brown, A.R., et al. Parallel 3D Finite Element Power Semiconductor Device Simulator based on Topologically Rectangular Grid. in Simulation of Semiconductor Devices and Processes (SISDEP '95). 1995. Erlangen, Germany
[2] Asenov, A., et al. The implementation and speed-up of coloured SOR methods solving 3D problems on arrays of transputers. in Transputer Applications and Systems '93. 1993. IOE Press Amsterdam.
[3] Wong, D.F., H.W. Leong, and C.L. Liu, Simulated Annealing for VLSI Design. The Kluwer International Series in Engineering and Computer Science, ed. J. Allen. 1988, Boston/Lancaster/Dordrecht: Kluwer Academic Publishing.
[4] Metropolis, N., et al., Equation of State Calculations by Fast Computing Machines. Journal of Chemical Physics, 1953. 21: p. 1087-1092.
[5] Otten, R.H.J.M. and L.P.P.P.v. Ginneken, The Annealing Algorithm. The Kluwer international series in engineering and computer science, ed. J. Allen. 1989.