From: Tamas Nepusz
Subject: Re: [igraph] Performance issue regarding when calculating induced_subgraph
Date: Tue, 26 Apr 2016 11:23:22 +0200
Hi,
> Loop without rbinding:
>
> system.time(for (i in 1:100) {induced_subgraph(g, clustm[clustm[, ".id"] ==
> i, 2])$vector })
> user system elapsed
> 16.96 1.28 18.24
You have left out probably the most time-consuming part of the entire
operation: the rbinding part :) The differences will not be obvious at
such a small scale (i.e. 100 clusters only), though. The problem with
rbind() is that it adds a new row to a matrix by creating a whole new
matrix, adding the new row to the copy and then returning the copy.
The copying will take a significant amount of time once the size of
the matrix becomes large. So, by all means, don't use rbind(), use
lapply().
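As an illustration (untested, so take it with a grain of salt), the
pattern I have in mind, assuming clustm is the two-column membership
matrix from your earlier message and that the per-cluster quantity you
are after is the authority score:

cluster.ids <- unique(clustm[, ".id"])
results <- lapply(cluster.ids, function(i) {
  members <- clustm[clustm[, ".id"] == i, 2]
  ## compute the per-cluster result and return it; no rbind() inside the loop
  authority_score(induced_subgraph(g, members))$vector
})
## if you need a single flat vector at the end, combine the list in one step
all.scores <- unlist(results)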
> Based on the last one and on a rough calculation, preparing induced subgraph
> 100 times takes approx. 11.4 seconds and lapply is the rest.
Well, it all depends on the sizes of the induced subgraphs - preparing
a larger induced subgraph is going to be much slower than preparing a
small one. So, if the sizes of your clusters are not evenly
distributed, taking the first few of them might not be a reliable
sample from which you could reasonably extrapolate the whole runtime.
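Untested, but a quick way to get a feel for how skewed the cluster
sizes are, assuming clustm is the membership matrix from your earlier
message:

cluster.sizes <- table(clustm[, ".id"])   ## number of vertices per cluster
summary(as.vector(cluster.sizes))         ## min/median/max reveal the skew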
> Do you have any idea how to further speed up the calculation?
I'm not sure whether this will be a speedup in the end or a slowdown,
but here's an alternative approach. It seems like the bottleneck in
your code is the repeated calls to induced_subgraph(), and the only
reason we need them is that igraph cannot calculate authority scores
on just a part of a graph. However, the authority scores are simply
the principal eigenvector of A^T * A, where A is the adjacency matrix
of the graph and A^T is its transpose.
_Theoretically_, you could use as_adjacency_matrix(g, sparse=T) to get
a sparse representation of the adjacency matrix, then for each
cluster, take the appropriate submatrix B of A and call the arpack()
function to find the principal eigenvector of B^T * B by providing it
with a multiplication function that takes an arbitrary vector and
multiplies it by B^T * B. Something along the lines of:
library(igraph)
library(Matrix)

a <- as_adjacency_matrix(g, sparse=TRUE)

get.authority.score.for.cluster <- function(members) {
  ## submatrix of the adjacency matrix restricted to the cluster members
  b <- a[members, members]
  bt <- t(b)
  ## multiply an arbitrary vector by B^T * B without forming B^T * B explicitly
  func <- function(x, extra=NULL) { as.vector(bt %*% (b %*% x)) }
  ## principal eigenvector of B^T * B, i.e. the authority scores of the cluster
  scores <- arpack(func, options=list(n=length(members), nev=1, ncv=3),
                   sym=TRUE)$vectors
  scores / max(scores)
}

## note the argument order for lapply(): the list of clusters first, then the function
lapply(clust, get.authority.score.for.cluster)
I haven't tested the code above so it might not work or give erroneous
results, but it shows the general idea. I'm not sure whether it would
be faster, though - there could be another bottleneck here, which is
that the arpack() function (which runs in the C layer) has to call the
matrix multiplication function 'func' in the R layer, and the
translation between the two layers is costly. But at least it saves
you from having to call induced_subgraph().
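If you do try it, here is an untested sketch of a sanity check for a
single cluster: compare the result against authority_score() on the
induced subgraph (the sign of an eigenvector returned by ARPACK is
arbitrary, so compare absolute values after rescaling). The cluster
id 1 below is just an arbitrary example:

members <- clustm[clustm[, ".id"] == 1, 2]
ref <- abs(authority_score(induced_subgraph(g, members))$vector)
alt <- abs(get.authority.score.for.cluster(members))
all.equal(as.vector(ref / max(ref)), as.vector(alt / max(alt)))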
T.