Re: [Maposmatic-dev] daily update stats
From: David Decotigny
Subject: Re: [Maposmatic-dev] daily update stats
Date: Thu, 07 Jan 2010 15:01:40 +0100
User-agent: Thunderbird 2.0.0.23 (X11/20090817)
Hello,
Jeroen van Rijn wrote:
> On Thu, Jan 7, 2010 at 09:58, David Decotigny <address@hidden> wrote:
>> All in all, that's roughly a 60% penalty on both sides.
>> We'll hit a major show-stopper when osm2pgsql takes 24h to
>> complete. Any bet on /when/ it /will/ happen? :) Any offer for
>> higher-end hosting? :)
> Hello David,
> A 60% penalty is substantial enough to be worth trying to reduce the
> impact of running these things concurrently.
> http://wiki.openstreetmap.org/wiki/Osm2pgsql#Slim_mode tells me that
> the daily update diffs mean osm2pgsql is being run in slim mode, which
> should mean that the daily diffs themselves could be split into chunks
> after downloading them, and osm2pgsql then run on the resulting
> smaller diffs. These chunks could be scheduled at times of lower load,
> with a deadline to start any remaining updates if not completed by
> then, to ensure all updates finish in time.
From what I understand, you are proposing to split the diff updates
into chunks and to schedule the renderings in between the processing of
these chunks, effectively serializing things "manually" in order to
control the impact of the renderings on the diff update.
The idea is nice; indeed, it could allow us to survive a little longer.
But I'm still afraid this solution could add significant overhead to
the diff updates (i.e. the cost of parsing the diff update, splitting
it, etc.). Furthermore, doing so would remove the benefit of having
2 CPUs available, not to mention the pain of implementing it (we would
need to synchronize Django with the diff update, with all the
fault-tolerance mess when a process crashes, etc.).
For the same strategy, another, lighter (imho) solution I was thinking
of was to keep the parallelism we have but to control it: regulate the
flow of renderings so that the penalty on the diff updates stays below
60%. That is, when the rendering queue is populated, we would not
render maps continuously while the diff update is running (which is
what we do now). Instead, we would control when renderings are allowed
(think of some "fluid" scheduling technique) while osm2pgsql runs to
completion. That way, we don't have to bother about osm2pgsql (it runs
continuously), but we do regulate the renderings so that the overhead
they incur on the diff update stays controlled and moderate.
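A minimal sketch of what such "fluid" regulation could look like, under the simplest possible policy: renderings are allowed only during a fixed fraction of each period, while osm2pgsql runs untouched. This is entirely illustrative; the duty-cycle idea, the class, and all names are made up here, not existing MapOSMatic code, and the real scheduler would presumably adapt the duty cycle to the measured penalty.

```python
# Hypothetical sketch of "fluid" rendering regulation: while the diff
# update runs, a rendering may only start during the first `duty`
# fraction of each `period`-second window, bounding its overhead.
import time

class RenderThrottle:
    """Allow renderings only during `duty` fraction of each period."""

    def __init__(self, duty=0.4, period=600.0):
        assert 0.0 <= duty <= 1.0
        self.duty = duty
        self.period = period
        self.start = time.monotonic()

    def may_render(self):
        # Position within the current period: render only in the
        # first `duty` fraction of it.
        elapsed = (time.monotonic() - self.start) % self.period
        return elapsed < self.duty * self.period

    def wait_for_slot(self):
        # Block the rendering worker until the window opens.
        while not self.may_render():
            time.sleep(1.0)
```

The rendering daemon would call `wait_for_slot()` before dequeuing each job whenever a diff update is in progress, and skip the throttle otherwise.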
But both solutions have their limits: assuming OSM keeps gaining in
popularity, at some point the diff update will require 24h to process
even when it is alone on the machine. So, at best, we will eventually
not be able to render anything, and at worst, we will not even be able
to update the DB... Of course, this will happen later with the strategy
above than if we keep the current scheme, but it will eventually
happen; these solutions just buy us a few extra weeks or months. That's
the main reason why I would recommend an "easy" technical
implementation if we decide to adopt this strategy in the meantime.
In the longer run, either we find the right way to tune the whole
system (pgsql, nice, etc.) so that we significantly reduce the cost of
running the diff updates, or we enjoy a higher-end machine, or we
optimize osm2pgsql and/or the DB indexes in PostGIS. Or all of the
above.
> While I don't have higher-end hosting to offer, I'd be more than happy
> to investigate tuning the update process on my local development
> server, and to submit patches and findings where applicable. I'll be
> installing a copy of the maposmatic codebase this weekend as it is;
> once I have it up and running, I'll start paying attention to what's
> what as far as these updates are concerned.
> That is: is contention for disk I/O slowing things down, or does
> osm2pgsql dominate the CPU? What happens when we change the priority
> of the update and/or rendering tasks, and so on. It may take me some
> time to get down and dirty with this codebase, as it's new to me, but
> I hope to be of some use to the project in due time. ;)
To answer your first question, I didn't personally investigate. But my
intuition is that it's either I/O-bound, or missing an index that would
speed things up, or inefficiently sending serially several queries that
could be grouped. Having more RAM should help anyhow (imho). The OSM
people would probably know a lot more about this, and I'd be interested
to hear from them.
As for the second point, you first have to follow the instructions in
the INSTALL file for ocitysmap. We recommend using postgres 8.3. These
instructions have been followed several times by several people running
Ubuntu jaunty, karmic, and Debian sid (both 32- and 64-bit). Then you
follow the INSTALL file in maposmatic.
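The priority experiment mentioned above could be driven like this. `nice` and `ionice` are standard Linux tools (coreutils and util-linux respectively), but the wrapper functions and default values below are hypothetical, just one way to run the renderer at lower CPU and I/O priority while osm2pgsql keeps the defaults:

```python
# Hypothetical sketch: run a command with lowered CPU priority (nice)
# and best-effort I/O priority (ionice class 2), to test whether disk
# I/O or CPU contention dominates the diff-update slowdown.
import subprocess

def niced_cmd(cmd, cpu_nice=10, io_class=2, io_level=7):
    """Wrap `cmd` in nice + ionice; class 2/level 7 is the lowest
    best-effort I/O priority."""
    return ["nice", "-n", str(cpu_nice),
            "ionice", "-c", str(io_class), "-n", str(io_level)] + cmd

def run_niced(cmd, **kw):
    # Returns the wrapped command's exit status.
    return subprocess.call(niced_cmd(cmd, **kw))

# e.g. deprioritize only the renderer while osm2pgsql runs normally:
# run_niced(["python", "render_map.py"], cpu_nice=15)
```

Comparing diff-update wall-clock time with and without such wrapping around the rendering tasks would tell us which resource the two workloads are actually fighting over.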
> The box in question is an AMD Athlon64 X2 6000 (@ stock 3 GHz) with
> 4 GB of DDR2 RAM; basically, my old workstation converted to a server.
> I take it you've already looked into the following (from the same page):
> "Optimization
> Large imports into PostGIS are very sensitive to maintenance and
> monitoring configuration: it is smart to increase the value of
> checkpoint_segments so that autovacuum tasks don't slow down imports."
> Regards,
> Jeroen.
We are very interested in any postgres/system parameter we could tune.
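For reference, the settings that wiki passage points at live in postgresql.conf. The values below are only illustrative starting points for a Postgres 8.3-era box with 4 GB of RAM, not tested recommendations:

```
# postgresql.conf -- illustrative values only, to be benchmarked
checkpoint_segments = 20          # default 3; fewer, larger checkpoints
                                  # during bulk diff imports
checkpoint_completion_target = 0.7
shared_buffers = 512MB            # raise if RAM allows
maintenance_work_mem = 256MB      # speeds up index builds and VACUUM
autovacuum = on                   # consider disabling only around imports
```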
Best regards,