[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [igraph] Performance issue regarding when calculating induced_subgra
From: |
Tamas Nepusz |
Subject: |
Re: [igraph] Performance issue regarding when calculating induced_subgra |
Date: |
Thu, 28 Apr 2016 00:05:46 +0200 |
Hi Roland,
Further investigations show that induced_subgraph() in R seems to
perform poorly if the original graph has a "name" vertex attribute,
and the performance also depends on the type of the vertex attribute.
For instance, here's a graph with 15000 vertices and ~90K edges on my
machine without a "name" attribute:
> n <- 15000
> radius <- 0.2 / ((n / 100.0) ** 0.5)
> g <- grg.game(n, radius)
> cl <- label.propagation.community(g)
> system.time(lapply(groups(cl), function(x) { induced.subgraph(g, x) }))
user system elapsed
0.194 0.023 0.224
Here's the same benchmark with a "name" vertex attribute using numeric values:
> n <- 15000
> radius <- 0.2 / ((n / 100.0) ** 0.5)
> g <- grg.game(n, radius)
> V(g)$name <- 10000000:(10000000+n-1)
> cl <- label.propagation.community(g)
> system.time(lapply(groups(cl), function(x) { induced.subgraph(g, x) }))
user system elapsed
2.543 0.090 2.779
And the same benchmark with strings as names:
> n <- 15000
> radius <- 0.2 / ((n / 100.0) ** 0.5)
> g <- grg.game(n, radius)
> V(g)$name <- sapply(10000000:(10000000+n-1), toString)
> cl <- label.propagation.community(g)
> system.time(lapply(groups(cl), function(x) { induced.subgraph(g, x) }))
user system elapsed
10.071 0.095 10.382
This is a significant slowdown, and I have no idea what causes it. The
Python interface does not seem to exhibit this behaviour, so it seems
to be R-specific.
Fun fact: adding the names *after* the community detection does not
seem to slow things down:
> n <- 15000
> radius <- 0.2 / ((n / 100.0) ** 0.5)
> g <- grg.game(n, radius)
>
> cl <- label.propagation.community(g)
> V(g)$name <- 10000000:(10000000+n-1)
> system.time(lapply(groups(cl), function(x) { induced.subgraph(g, x) }))
user system elapsed
0.221 0.024 0.253
So, if you are still struggling with this issue, one possibility would
be to remove the "name" vertex attribute from your graph before
calling induced.subgraph(). You can safely rely on the fact for the
time being that vertex i (1-based indexing) in the subgraph
corresponds to the i-th element of the vector that you pass to
induced.subgraph() as long as the vector has unique elements and is
sorted in ascending order.
All the best,
T.
On Tue, Apr 26, 2016 at 11:23 AM, Tamas Nepusz <address@hidden> wrote:
> Hi,
>
>> Loop without rbinding:
>>
>> system.time(for (i in 1:100) {induced_subgraph(g, clustm[clustm[, ".id"] ==
>> i, 2])$vector })
>> user system elapsed
>> 16.96 1.28 18.24
> You have left out probably the most time-consuming part of the entire
> operation: the rbinding part :) The differences will not be obvious at
> such a small scale (i.e. 100 clusters only), though. The problem with
> rbind() is that it adds a new row to a matrix by creating a whole new
> matrix, adding the new row to the copy and then returning the copy.
> The copying will take a significant amount of time once the size of
> the matrix becomes large. So, by all means, don't use rbind(), use
> lapply().
>
>> Based on the last one and on a rough calculation, preparing induced subgraph
>> 100 times takes approx. 11.4 seconds and lapply is the rest.
> Well, it all depends on the sizes of the induced subgraphs - preparing
> a larger induced subgraph is going to be much slower than preparing a
> small one. So, if the sizes of your clusters are not evenly
> distributed, taking the first few of them might not be a reliable
> sample from which you could reasonably extrapolate the whole runtime.
>
>> Do you have any idea how to further speed up the calculation?
> I'm not sure whether this will be a speedup in the end or a slowdown,
> but here's an alternative approach. It seems like the bottleneck in
> your code is the repeated calls to induced_subgraph(), and the only
> reason why we need that is because the authority scores cannot be
> calculated by igraph on a part of a graph only. However, the authority
> scores are simply the principal eigenvector of A^T * A, where A is the
> adjacency matrix of the graph and A^T is its transposed.
> _Theoretically_, you could use as_adjacency_matrix(g, sparse=T) to get
> a sparse representation of the adjacency matrix, then for each
> cluster, take the appropriate submatrix B of A and call the arpack()
> function to find the principal eigenvector of B^T * B by providing it
> with a multiplication function that takes an arbitrary vector and
> multiplies it by B^T * B. Something along the lines of:
>
> library(Matrix)
> a <- get.adjacency(g, sparse=T)
>
> get.authority.score.for.cluster <- function (members) {
> b <- a[members, members]
> bt <- t(b)
> func <- function(x, extra=NULL) { as.vector(bt %*% (b %*% x)) }
> scores <- arpack(func, options=list(n=length(members), nev=1,
> ncv=3), sym=T)$vectors
> scores <- scores / max(scores)
> }
>
> lapply(get.authority.score.for.cluster, clust)
>
> I haven't tested the code above so it might not work or give erroneous
> results, but it shows the general idea. I'm not sure whether it would
> be faster, though - there could be another bottleneck here, which is
> that the arpack() function (which runs in the C layer) has to call the
> matrix multiplication function 'func' in the R layer, and the
> translation between the two layers is costly. But at least it saves
> you from having to call induced_subgraph().
>
> T.