Re: fibers,questions about thread id and mutation of vectors

I ran some tests of OpenMP with Guile (Guile 3.0.8.99-f3ea8) on macOS M1 and Linux Intel, because I was not sure about the performance. I found a problem: on Linux the code is slower with OpenMP (by up to a factor of 5), while on macOS the speedup is either a factor of 2 (run time divided by 2) or about 15%, depending on the complexity of the computation.

I cannot explain why it works on macOS and not on Linux. The only difference in the build is that on macOS I had to pass this option for the compilation to succeed:

configure --enable-mini-gmp

Anyway, this is not good performance for OpenMP with Scheme: in C or Fortran, OpenMP with n CPUs gives me a speedup of almost n, for example in astronomical numerical simulations.

In the parallel (//) region I have only this code on macOS:

scm_init_guile();

#pragma omp parallel for
for (i=start; i<=stop; i++) { /* i is private by default */
    scm_call_1( func , scm_from_int(i) );
}

On Linux this causes a segmentation fault unless I move the line scm_init_guile(); inside the for loop,

like this:

#pragma omp parallel for
for (i=start; i<=stop; i++) { /* i is private by default */
    scm_init_guile();
    scm_call_1( func , scm_from_int(i) );
}

https://github.com/damien-mattei/library-FunctProg/blob/master/guile-openMP.c#L91
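
A possible explanation, and a way to avoid paying the initialisation cost on every iteration: according to the Guile manual, scm_init_guile() sets up Guile only for the thread that calls it, and the OpenMP worker threads are created outside Guile. Splitting "parallel for" into a "parallel" region with an inner "for" lets each thread run its setup once instead of once per index. The sketch below shows only that OpenMP structure and is untested with Guile; parallel_apply, count_init, store_cube, init_calls and cubes are hypothetical names. In the Guile case the per-thread hook would be scm_init_guile and the body a wrapper around scm_call_1(func, scm_from_int(i)).

```c
#include <assert.h>
#include <stddef.h>

typedef void (*init_fn)(void);   /* per-thread setup hook */
typedef void (*body_fn)(int);    /* work done for one index */

/* Hypothetical helper: call body(i) for every i in [start, stop].
   per_thread_init runs once per worker thread, not once per iteration. */
void parallel_apply(int start, int stop, init_fn per_thread_init, body_fn body)
{
    assert(body != NULL);
#pragma omp parallel
    {
        if (per_thread_init != NULL)
            per_thread_init();
#pragma omp for
        for (int i = start; i <= stop; i++)
            body(i);
    }
}

/* Stand-ins for the Guile calls, so that the sketch runs without libguile. */
int init_calls = 0;
long cubes[16];

void count_init(void)
{
#pragma omp atomic
    init_calls++;
}

void store_cube(int i)
{
    cubes[i] = (long)i * i * i;
}
```

Calling parallel_apply(0, 15, count_init, store_cube) fills cubes[] and increments init_calls once per thread. Whether this removes the slowdown on Linux is an open question, not a measured result.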

The Scheme+ code for the speed test looks like this (I use the Collatz function to make the computation unpredictable for any C compiler optimisation when I compare with pure C code):

;; only for speed tests
{vtstlen <+ 2642245}
{vtst <+ (make-vector vtstlen 0)}

{fct <+ (lambda (x) {x * x * x})}

(define (fctapply i) {vtst[i] <- fct(vtst[i])}) ;; neoteric expression of {vtst[i] <- (fct vtst[i])}

(define (fctpluscollatzapply i) {vtst[i] <- fctpluscollatz(vtst[i])})

(define (speed-test)

;; init data
(display-nl "speed-test : Initialising data.")
(for ({i <+ 0} {i < vtstlen} {i <- {i + 1}})
{vtst[i] <- i})

;; compute
(display-nl "speed-test : testing Scheme alone : start")
(for ({i <+ 0} {i < vtstlen} {i <- {i + 1}})
(fctpluscollatzapply i));;(fctapply i))
(display-nl "speed-test : testing Scheme alone : end")

(newline)

;; display a few results
(for ({i <+ 0} {i < 10} {i <- {i + 1}})
(display-nl {vtst[i]}))
(display-nl ".....")
(for ({i <+ {vtstlen - 10}} {i < vtstlen} {i <- {i + 1}})
(display-nl {vtst[i]}))

;; init data
(display-nl "speed-test : Initialising data.")
(for ({i <+ 0} {i < vtstlen} {i <- {i + 1}})
{vtst[i] <- i})

;; compute
(display-nl "speed-test : testing Scheme with OpenMP : start")
(openmp 0 {vtstlen - 1} (string->pointer "fctpluscollatzapply"));;"fctapply"))
(display-nl "speed-test : testing Scheme with OpenMP : end")

(newline)

;; display a few results
(for ({i <+ 0} {i < 10} {i <- {i + 1}})
(display-nl {vtst[i]}))
(display-nl ".....")
(for ({i <+ {vtstlen - 10}} {i < vtstlen} {i <- {i + 1}})
(display-nl {vtst[i]}))

)

(define (collatz n)
(cond ({n = 1} 1)
({(modulo n 2) = 0} {n / 2})
(else {{3 * n} + 1})))

(define (fctpluscollatz x)
(declare c)
(if {x = 0}
{c <- 0}
{c <- collatz(x)})
{{x * x * x} + c})

(define openmp (foreign-library-function "./libguile-openMP" "openmp" #:return-type int #:arg-types (list int int '*)))

(define libomp (dynamic-link "libomp")) ;; note: requires a link: ln -s /opt/homebrew/opt/libomp/lib/libomp.dylib libomp.dylib
;; on Linux: export LTDL_LIBRARY_PATH=. with a link as above
;; or, a better solution: export LTDL_LIBRARY_PATH=/usr/lib/llvm-14/lib

(define omp-get-max-threads
(pointer->procedure int
(dynamic-func "omp_get_max_threads" libomp)
'()))

https://github.com/damien-mattei/library-FunctProg/blob/master/guile/logiki%2B.scm#L3581

output:

scheme@(guile-user)> (speed-test )
speed-test : Initialising data.
speed-test : testing Scheme alone : start
speed-test : testing Scheme alone : end

0
2
9
37
66
141
219
365
516
757
.....
18446514741354254581
18446535685572961374
18446556629820732765
18446577574071146391
18446598518350624637
18446619462632745120
18446640406943930245
18446661351257757609
18446682295600649637
18446703239946183906
speed-test : Initialising data.
speed-test : testing Scheme with OpenMP : start
speed-test : testing Scheme with OpenMP : end

0
2
9
37
66
141
219
365
516
757
.....
18446514741354254581
18446535685572961374
18446556629820732765
18446577574071146391
18446598518350624637
18446619462632745120
18446640406943930245
18446661351257757609
18446682295600649637
18446703239946183906

The sequential region takes 4 seconds.

The parallel (//) region takes 2 seconds (twice as fast).

Of course, if I run the equivalent pure C code it is instantaneous:
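
These two timings can be turned into the usual speedup and parallel-efficiency figures. A minimal sketch; speedup and efficiency are illustrative names, and the thread count of 8 used in the note after the code is an assumption (8 appears only as a commented OMP_NUM_THREADS setting in the C test).

```c
#include <assert.h>

/* Speedup: how many times faster the parallel run is. */
double speedup(double t_seq, double t_par)
{
    assert(t_seq > 0.0 && t_par > 0.0);
    return t_seq / t_par;
}

/* Parallel efficiency: speedup per thread, 1.0 being ideal scaling. */
double efficiency(double t_seq, double t_par, int nthreads)
{
    assert(nthreads >= 1);
    return speedup(t_seq, t_par) / nthreads;
}
```

With the timings above, speedup(4.0, 2.0) is 2.0. If 8 threads were used, efficiency(4.0, 2.0, 8) is 0.25, whereas a speedup of almost n on n CPUs corresponds to an efficiency close to 1.0.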

// openMP cube - collatz test

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

// OpenMP on macOS with Xcode tools:
// https://mac.r-project.org/openmp/

// export OMP_NUM_THREADS=8

// this main() in a library was only for testing OpenMP with macOS Xcode and Linux; to use it, uncomment main() and comment out the openmp() functions

// mac os :
// clang -I/opt/homebrew/opt/libomp/include -L/opt/homebrew/opt/libomp/lib -Xclang -fopenmp -o collatz -lomp collatz.c

// gcc -L/usr/lib/llvm-14/lib/ -fopenmp -o collatz -lomp collatz.c

unsigned long long *vtst;

unsigned long long collatz(unsigned long long n) {

if (n == 1) return 1;

if ((n % 2) == 0)
return n / 2;
else
return 3*n + 1;

}

unsigned long long fct(unsigned long long x) {

unsigned long long c;
if (x == 0)
c = 0;
else
c = collatz(x);

return (x * x * x) + c;
}

unsigned long long fctapply(unsigned long long i) {
return vtst[i] = fct(vtst[i]);
}

int main() {
int vtstlen = 2642245; // cubic root of 18,446,744,073,709,551,615 https://en.wikipedia.org/wiki/C_data_types
vtst = calloc(vtstlen, sizeof(unsigned long long));

int ncpus = omp_get_max_threads();
printf("Found a maximum of %i cores.\n",ncpus);
printf("Program compute cube of numbers and add collatz result (1) with and without parallelisation with OpenMP library.\n\n");
printf("Initialising data.\n\n");
//int iam,nthr;

// init data sequential
for (int i=0; i<vtstlen; i++) { /* sequential loop; i is local to the for statement */
//iam = omp_get_thread_num();
//printf("iam=%i\n",iam);
//nthr = omp_get_num_threads() ;
//printf("total number of threads=%i\n",nthr);
vtst[i]=i;

}

printf("STARTING computation without //.\n");

for (int i=0; i<vtstlen; i++) {

fctapply(i);

}

printf("ENDING computation without //.\n\n");

// display a few results
for (int i=0;i < 10; i++) {
printf("%llu\n",vtst[i]);
}
printf( ".....\n");
for (int i=vtstlen - 10; i < vtstlen; i++) {
printf("%llu\n",vtst[i]);
}

printf("Initialising data in //.\n\n");
//int iam,nthr;

#pragma omp parallel for firstprivate(vtstlen) shared(vtst)

for (int i=0; i<vtstlen; i++) { /* i is private by default because it is the for index */

vtst[i]=i;

}

printf("STARTING computation in //.\n");

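// firstprivate gives each thread its own initialised copy of vtstlen (plain private would leave the loop bound uninitialised);
// keeping it non-shared avoids unnecessary parallel (//) overhead on that variable (mutex...)
#pragma omp parallel for firstprivate(vtstlen) shared(vtst)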

for (int i=0; i<vtstlen; i++) { /* i is private by default */

fctapply(i);

}

printf("ENDING computation in //.\n\n");

// display a few results
for (int i=0;i < 10; i++) {
printf("%llu\n",vtst[i]);
}
printf( ".....\n");
for (int i=vtstlen - 10; i < vtstlen; i++) {
printf("%llu\n",vtst[i]);
}

}

https://github.com/damien-mattei/library-FunctProg/blob/master/collatz.c
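
The vector length 2642245 is chosen so that x * x * x still fits in an unsigned 64-bit integer for every index x. A small sketch that recomputes this bound with overflow-safe integer arithmetic; cube_fits_u64 and max_cube_base_u64 are illustrative names, not part of collatz.c.

```c
#include <assert.h>
#include <stdint.h>

/* True if x*x*x fits in 64 bits; valid for x < 2^32 so x*x cannot overflow. */
int cube_fits_u64(uint64_t x)
{
    assert(x < (UINT64_C(1) << 32));
    if (x == 0)
        return 1;
    return x <= UINT64_MAX / (x * x);
}

/* Largest x whose cube fits in 64 bits, found by binary search. */
uint64_t max_cube_base_u64(void)
{
    uint64_t lo = 1;                    /* 1^3 fits */
    uint64_t hi = UINT64_C(1) << 22;    /* (2^22)^3 = 2^66 does not fit */
    while (hi - lo > 1) {
        uint64_t mid = lo + (hi - lo) / 2;
        if (cube_fits_u64(mid))
            lo = mid;
        else
            hi = mid;
    }
    return lo;
}
```

max_cube_base_u64() returns 2642245, the value used for vtstlen. The check covers only the cube; the added Collatz term is small by comparison but is not included in this bound.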

In conclusion, OpenMP with Guile gives a small improvement, a speedup between 1.15 (with the logic algorithm) and 2 (benchmarks with cube and Collatz), but only on macOS; on Linux it either fails with a segfault or is slower.

There must be a difference in the implementation of Guile between macOS and Linux, but I do not know the inner mechanisms and algorithms used to run Guile in a C environment. What is scm_init_guile() doing?

Why must it be placed inside the parallel (//) region on Linux (with a slower result), while on macOS it can be placed anywhere (and the code speeds up)?

Possibly this could be improved. It is already a good result to see that OpenMP works with Scheme.

Best wishes,

Damien

On Fri, Jan 6, 2023 at 6:06 PM Maxime Devos <maximedevos@telenet.be> wrote:

> no it returns something based on address:
> scheme@(guile-user)> (current-thread)
> $1 = #<thread 8814535936 (102a61d80)>
> The good thing is that it is different for each thread; the bad thing is that I do not know how to extract it from the result, and anyway I need a number (0, 1, 2, 3, ...), ordered and forming a partition, so that the scheduling lets each thread deal with one part of the array (vector), the way OpenMP does, as in the FOR example I posted a week ago.

You could define a (weak key) hash table from threads to numbers, and
whenever a thread is encountered that isn't yet in the table, assign it
an unused number and insert it in the table. Requires locking (or an
atomics equivalent) though, so not ideal.

(Maybe there's a method to get a number, directly, but I don't know any.)
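
The same idea can be sketched at the C level, where the OpenMP worker threads live. This is a variation, not the Guile hash-table approach described above: instead of a table keyed by thread, it keeps the assigned number in C11 thread-local storage and hands out numbers from an atomic counter, so no lock is needed. dense_thread_index, next_index and my_index are hypothetical names. Inside an OpenMP parallel region, omp_get_thread_num() already provides such a number.

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int next_index = 0;           /* next unused number */
static _Thread_local int my_index = -1;     /* -1: not assigned yet */

/* Return a small integer (0, 1, 2, ...) unique to the calling thread.
   The number is assigned on the first call and reused afterwards. */
int dense_thread_index(void)
{
    if (my_index < 0) {
        my_index = atomic_fetch_add(&next_index, 1);
        assert(my_index >= 0);  /* counter has not wrapped around */
    }
    return my_index;
}
```

Numbers are handed out in order of first call, so they are dense but say nothing about which slice of the vector a thread should take; that mapping still has to be decided by the caller.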

> just do a 'for' like in OpenMP (mentioned above)

In that case, when implementing slicing the array between different new
fibers, you can give each of the fibers you spawn (one fiber per slice,
if I understand the terminology correctly) an entry in the vector, and
after all the fibers complete do the usual 'sum/multiply/... all
entries' trick.

As each fiber has its own (independent) storage, not touched by the
other fibers, that should be safe.

I suppose this might take more memory storage than with openMP.
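
The slicing scheme described above can be sketched in C, with OpenMP standing in for fibers (an assumption made only to keep the example in one language; the structure is the same with one fiber per slice). Each slice writes only to its own entry of a partial-results array, and the entries are combined after all slices finish. sliced_sum, demo_sliced_sum, MAX_SLICES and DEMO_MAX are illustrative names.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SLICES 64
#define DEMO_MAX 1024

/* Sum v[0..len-1] by cutting it into nslices contiguous slices. */
unsigned long long sliced_sum(const unsigned long long *v, size_t len, int nslices)
{
    assert(nslices >= 1 && nslices <= MAX_SLICES);
    unsigned long long partial[MAX_SLICES] = {0};

    /* Each iteration owns slice k and writes only partial[k]. */
#pragma omp parallel for
    for (int k = 0; k < nslices; k++) {
        size_t lo = (size_t)k * len / (size_t)nslices;
        size_t hi = ((size_t)k + 1) * len / (size_t)nslices;
        unsigned long long acc = 0;
        for (size_t i = lo; i < hi; i++)
            acc += v[i];
        partial[k] = acc;
    }

    /* Combine the per-slice results sequentially. */
    unsigned long long total = 0;
    for (int k = 0; k < nslices; k++)
        total += partial[k];
    return total;
}

/* Usage example: sum of 1..n computed with the given number of slices. */
unsigned long long demo_sliced_sum(size_t n, int nslices)
{
    static unsigned long long v[DEMO_MAX];
    assert(n <= DEMO_MAX);
    for (size_t i = 0; i < n; i++)
        v[i] = i + 1;
    return sliced_sum(v, n, nslices);
}
```

demo_sliced_sum(100, 4) returns 5050 whatever the number of slices. The partial array is the extra storage: one entry per slice.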

> I understand that fibers are better for scheduling web server requests but not for parallelizing like OpenMP; these are two different worlds.

You can do parallelisation with fibers (see ‘In that case, when
implementing slicing ...’), but from what I'm reading, it will be
somewhat unlike openMP.

On 06-01-2023 16:06, Damien Mattei wrote:
>
> (define omp-get-max-threads
> (pointer->procedure int
> (dynamic-func "omp_get_max_threads" libomp)
> (list void)))
>
> but i get this error:
> ice-9/boot-9.scm:1685:16: In procedure raise-exception:
> In procedure pointer->procedure: Wrong type argument in position 3: 0
>
> i do not understand why.

‘int omp_get_max_threads(void);’ is C's way to declare that
omp_get_max_threads has no arguments -- there is no 'void'-typed argument.

Try (untested):

(define omp-get-max-threads
(pointer->procedure int
(dynamic-func "omp_get_max_threads" libomp)
(list)))

Greetings,
Maxime.

From:	Damien Mattei
Subject:	Re: fibers,questions about thread id and mutation of vectors
Date:	Fri, 13 Jan 2023 12:10:12 +0100