Re: [Help-glpk] Optimization and Multicore GLPK
From: Reginald Beardsley
Subject: Re: [Help-glpk] Optimization and Multicore GLPK
Date: Thu, 27 Dec 2012 07:56:10 -0800 (PST)
Thanks to all for the links to papers.
A parallel implementation of either the primal or dual simplex method is a
difficult research topic, and not something I have, or have ever had, any
motivation to undertake.
I'm interested in optimizing the existing serial revised simplex algorithm,
with particular attention to low-effort performance improvements.

This leads to focusing on the matrix multiply. Row-column operations are
trivially data parallel: one can spread the problem over multiple cores by
using modulo arithmetic to select the rows processed by each core. On a
shared-memory machine the overhead of doing this is very low, and it does not
require making GLPK thread-safe. I've not counted how many functions would be
affected, but the number is very small and the changes are relatively simple.
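
To make the idea concrete, here is a minimal sketch of the modulo
partitioning, assuming a plain dense product y = A*x. The names and structure
are illustrative only and do not come from the GLPK sources. Each core touches
a disjoint set of rows of y, so no locking is needed:

    /* Illustrative sketch only -- not GLPK code.  Core t handles
       exactly the rows i with i % NCORES == t, so the writes to y
       are disjoint and no locks are required. */
    #include <pthread.h>

    #define NCORES 4

    struct slice { const double *A, *x; double *y; int m, n, tid; };

    static void *row_worker(void *arg)
    {   struct slice *s = arg;
        for (int i = s->tid; i < s->m; i += NCORES)  /* i % NCORES == tid */
        {   double sum = 0.0;
            for (int k = 0; k < s->n; k++)
                sum += s->A[(size_t)i * s->n + k] * s->x[k];
            s->y[i] = sum;
        }
        return NULL;
    }

    void matvec_multicore(const double *A, const double *x, double *y,
                          int m, int n)
    {   pthread_t tid[NCORES];
        struct slice sl[NCORES];
        for (int t = 0; t < NCORES; t++)
        {   sl[t] = (struct slice){A, x, y, m, n, t};
            pthread_create(&tid[t], NULL, row_worker, &sl[t]);
        }
        for (int t = 0; t < NCORES; t++)
            pthread_join(tid[t], NULL);
    }

In a real solver one would keep the worker threads alive across iterations
rather than creating them per call, but the partitioning is the whole trick.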
For dense rows the operation should be vectorized so that the SSE extensions
can be used to full effect. The effort required is modest: as I recall,
padding the arrays will suffice, though compilers & hardware may take care of
that now.
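
As a sketch of what I mean by padding, assuming 16-byte-aligned arrays
zero-padded to an even length (again, none of this is from GLPK):

    /* SSE2 dot product over a dense row -- illustrative only.
       Both arrays must be 16-byte aligned and zero-padded so that
       n_padded is even; the padding zeros contribute nothing. */
    #include <emmintrin.h>

    double ddot_sse(const double *a, const double *b, int n_padded)
    {   __m128d acc = _mm_setzero_pd();
        for (int i = 0; i < n_padded; i += 2)
            acc = _mm_add_pd(acc, _mm_mul_pd(_mm_load_pd(a + i),
                                             _mm_load_pd(b + i)));
        double s[2];
        _mm_storeu_pd(s, acc);    /* spill the two partial sums */
        return s[0] + s[1];
    }

A current compiler at -O3 with -msse2 will often produce much the same code
from the plain scalar loop, which is what I meant about compilers and hardware
taking care of it now.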
For hyper-sparse rows a sparse multiply algorithm will outperform the
vectorized version, so some logic is required to keep track of each row's
density and to select which of the two methods is used.
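
Something like the following, where the row also carries its compressed form
and the 5% cutoff is a made-up placeholder that would have to be tuned:

    /* Illustrative dispatch between the two kernels -- not GLPK code.
       ddot_sse() is the padded SSE kernel sketched above. */
    double ddot_sse(const double *a, const double *b, int n_padded);

    double ddot_sparse(const int *ind, const double *val, int nnz,
                       const double *x)
    {   double s = 0.0;
        for (int k = 0; k < nnz; k++)     /* gather-style multiply */
            s += val[k] * x[ind[k]];
        return s;
    }

    double row_dot(const int *ind, const double *val, int nnz,
                   const double *a_dense, const double *x, int n_padded)
    {   double density = (double)nnz / n_padded;
        if (density < 0.05)               /* hyper-sparse row */
            return ddot_sparse(ind, val, nnz, x);
        return ddot_sse(a_dense, x, n_padded);  /* dense: vectorized */
    }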
The sort of multicore operation I'm suggesting will eventually be done
automatically by the compiler. At present it's done on a per-core basis to
schedule multiple FPUs. It may well already be available in some of the
commercial HPC compilers, though with some restrictions.
Attached processors such as GPUs have been popular at various times. While
they offer significant speedups, the speedups come at a price: whether they
are worthwhile depends upon the ratio of communication to computation time,
and they often demand significant code changes to use effectively.
Historically, the CPU has caught up and the labor of coding to the attached
processor has been rendered useless or even harmful.
If the code accounting for 40% of the execution time can be sped up by a
factor of 4x, the total improvement will be less than a 30% reduction in
execution time. I'd be delighted if the combination of vectorization, sparse
arithmetic and multicore operation did better than that, but it's very hard to
make large performance improvements in good codes. Most of the time,
improvements to one aspect of a code result in a bottleneck somewhere else.
For large LP problems I would expect cache & main memory behavior to limit the
gains from faster arithmetic.
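
For reference, the 30% figure is just Amdahl's law with fraction f = 0.4 of
the run time sped up by a factor s = 4:

    \[
      T_{\text{new}} = \left((1-f) + \frac{f}{s}\right) T_{\text{old}}
                     = (0.6 + 0.1)\, T_{\text{old}} = 0.7\, T_{\text{old}}
    \]

So even a perfect 4x on that 40% caps the win at 30%, and any threading or
dispatch overhead eats into it from there.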
Have Fun!
Reg