[Maposmatic-dev] [PATCH maposmatic] Improve the file cleanup mechanism

From:	Maxime Petazzoni
Subject:	[Maposmatic-dev] [PATCH maposmatic] Improve the file cleanup mechanism
Date:	Wed, 13 Jan 2010 14:55:46 +0100

CAUTION: this change alters the database layout! The new column is:
    "has_files" bool NOT NULL

Proper action must be taken when deploying this change.
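
One possible way to add the column to an existing table is sketched below.
This is only an illustration under stated assumptions: the table name
maposmatic_maprenderingjob is a guess at Django's default naming, SQLite is
used only so the example runs stand-alone (the exact ALTER TABLE syntax may
differ on the production database), and add_has_files_column is a
hypothetical helper, not part of this patch.

```python
import sqlite3

# Assumed table name (Django's usual "<app>_<model>" convention); adjust it
# to match the real schema.
TABLE = "maposmatic_maprenderingjob"


def add_has_files_column(conn, table=TABLE):
    """Add the has_files column if it is missing.

    Returns True when the column was added, False when it already existed.
    """
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
    columns = [row[1] for row in conn.execute("PRAGMA table_info(%s)" % table)]
    if "has_files" in columns:
        return False
    # A NOT NULL column needs a default so that existing rows stay valid.
    conn.execute(
        'ALTER TABLE %s ADD COLUMN "has_files" bool NOT NULL DEFAULT 1' % table)
    conn.commit()
    return True


# Demonstration on a throwaway in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE %s (id INTEGER PRIMARY KEY, maptitle TEXT)" % TABLE)
conn.execute("INSERT INTO %s (maptitle) VALUES ('Paris')" % TABLE)

first_run = add_has_files_column(conn)
second_run = add_has_files_column(conn)
rows = conn.execute("SELECT id, has_files FROM %s" % TABLE).fetchall()
```

Defaulting existing rows to true looks safe here because has_output_files()
still checks that the files are actually present on disk.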

Previously, when the rendering directory was over the defined threshold,
files were removed progressively, oldest first, to free up some space.
No information was kept about jobs whose files were removed, making it
harder to keep track of valid jobs with files available.

This new mechanism brings the following features:

  * files are now sorted by content modification time and not creation
    time (a simple chmod could mess up the order);
  * when a file needs to be removed, all files from its parent job are
    removed, and the job's has_files flag is set to false. It does not
    make a lot of sense to keep partial renderings anyway (the map
    without the index? in different formats? Not good.)
  * thumbnails of jobs in the database are always kept, so we can
    display what the map looked like even if we don't have the files
    around anymore;
  * if no parent job can be found, it's an orphaned file and can be
    safely removed. Files starting with a '.' are preserved though
    (.htaccess for example).
  * better logging of the cleanup process.
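
The selection policy described above can be sketched as a pure function.
This is a simplified illustration, not the code from this patch: the
RenderedFile tuple and plan_cleanup are hypothetical names, job membership
is passed in directly instead of being looked up in the database, and
nothing is actually deleted.

```python
from collections import namedtuple

# Hypothetical record describing one file in the rendering directory.
# job_id is None for orphaned files.
RenderedFile = namedtuple("RenderedFile", "name mtime size job_id is_thumbnail")


def plan_cleanup(files, threshold):
    """Decide which files to remove to get back under the threshold.

    Returns (names_to_remove, job_ids_to_flag, remaining_size).
    """
    # Dotfiles (.htaccess for example) are ignored entirely.
    candidates = [f for f in files if not f.name.startswith(".")]
    total = sum(f.size for f in candidates)
    to_remove = []
    flagged_jobs = []
    removed = set()

    # Walk the files oldest first, by content modification time.
    for f in sorted(candidates, key=lambda f: f.mtime):
        if total <= threshold:
            break
        if f.name in removed:
            continue
        if f.job_id is None:
            # Orphaned file: remove it on its own.
            group = [f]
        else:
            # Remove every output file of the parent job, but keep the
            # thumbnail, and remember to clear the job's has_files flag.
            group = [g for g in candidates
                     if g.job_id == f.job_id
                     and not g.is_thumbnail
                     and g.name not in removed]
            if f.job_id not in flagged_jobs:
                flagged_jobs.append(f.job_id)
        for g in group:
            removed.add(g.name)
            to_remove.append(g.name)
            total -= g.size

    return to_remove, flagged_jobs, total


files = [
    RenderedFile(".htaccess", 1, 10, None, False),
    RenderedFile("1_map.png", 10, 40, 1, False),
    RenderedFile("1_index.pdf", 11, 20, 1, False),
    RenderedFile("1_small.png", 12, 5, 1, True),
    RenderedFile("stray.tmp", 20, 15, None, False),
    RenderedFile("2_map.png", 30, 40, 2, False),
]
plan_a = plan_cleanup(files, 70)
plan_b = plan_cleanup(files, 50)
```

With a threshold of 70, only the two output files of job 1 are selected and
its thumbnail stays. With a threshold of 50, the orphaned stray.tmp is
selected as well.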
---
 scripts/maposmaticd               |   74 ++++++++++++++++++++++++++++++-------
 www/maposmatic/models.py          |   71 ++++++++++++++++++++++++++++++++---
 www/media/style.css               |    4 ++
 www/templates/maposmatic/job.html |    2 +-
 4 files changed, 130 insertions(+), 21 deletions(-)

diff --git a/scripts/maposmaticd b/scripts/maposmaticd
index 4464913..a1f74e3 100755
--- a/scripts/maposmaticd
+++ b/scripts/maposmaticd
@@ -148,25 +148,71 @@ def render_job(job):
             job.end_rendering(resultmsg)
             return
 
-# This function checks that the total size of the files in
-# RENDERING_RESULT_PATH does not exceed 80% of
-# RENDERING_RESULT_MAX_SIZE_GB. If it does, the function removes as
-# many files as needed, oldest first
 def cleanup_files():
-    files = [ os.path.join(RENDERING_RESULT_PATH, f) for f in os.listdir(RENDERING_RESULT_PATH)]
-    files = [(f, os.stat(f).st_ctime, os.stat(f).st_size) for f in files]
+    """This cleanup function checks that the total size of the files in
+    RENDERING_RESULT_PATH does not exceed 80% of the defined threshold
+    RENDERING_RESULT_MAX_SIZE_GB. If it does, files are removed until the
+    constraint is met again, oldest first, and grouped by job."""
+
+    def get_formatted_value(v):
+        return '%.2f MiB' % (v/1024.0/1024.0)
+    def get_formatted_details(saved, size, threshold):
+        return 'saved %s, now %s/%s' % \
+                (get_formatted_value(saved),
+                 get_formatted_value(size),
+                 get_formatted_value(threshold))
+
+    files = [os.path.join(RENDERING_RESULT_PATH, f)
+                for f in os.listdir(RENDERING_RESULT_PATH)
+                if not f.startswith('.')]
+    files = [(f, os.stat(f).st_mtime, os.stat(f).st_size)
+                for f in files]
+
+    # Compute the total size occupied by the renderings, and the actual 80%
+    # threshold, in bytes
     size = reduce(lambda x, y: x + y[2], files, 0)
     threshold = 0.8 * RENDERING_RESULT_MAX_SIZE_GB * 1024 * 1024 * 1024
+
+    # Stop here if we are below the threshold
     if size < threshold:
         return
-    files.sort(lambda x, y: cmp(x[1], y[1]))
-    for f in files:
-        os.remove(os.path.join(RENDERING_RESULT_PATH, f[0]))
-        size -= f[2]
-        LOG.debug("remove '%s', %f GB consumed over a %f GB threshold" % \
-                            (f[0], (size / 1024 / 1024 / 1024), (threshold / 1024 / 1024 / 1024)))
-        if size < threshold:
-            break
+
+    LOG.info("%s consumed for a %s threshold. Cleaning..." %
+            (get_formatted_value(size), get_formatted_value(threshold)))
+
+    # Sort files by timestamp, oldest last, and start removing them by
+    # pop()-ing the list
+    files.sort(lambda x, y: cmp(y[1], x[1]))
+
+    while size > threshold:
+        if not len(files):
+            LOG.error("No files to remove and still above threshold! Something's wrong!")
+            return
+
+        # Get the next file to remove, and try to identify the job it comes
+        # from
+        f = files.pop()
+        try:
+            job = MapRenderingJob.get_job_from_filename(f[0])
+            removed, saved = job.remove_all_files()
+            size -= saved
+
+            # If files were removed, log it. If not, it means only the
+            # thumbnail remained, and that's good.
+            if removed:
+                LOG.info("Removed %d files from job #%d (%s)." %
+                         (removed, job.id, get_formatted_details(saved, size, threshold)))
+
+
+        except MapRenderingJob.DoesNotExist:
+            # If we didn't find a parent job, it means this is an orphaned
+            # file, and we can safely remove it to get back some disk space.
+            os.remove(f[0])
+            saved = f[2]
+            size -= saved
+            LOG.info("Removed orphan file %s (%s)." %
+                     (os.path.basename(f[0]), get_formatted_details(saved, size, threshold)))
+
 
 if not os.path.isdir(RENDERING_RESULT_PATH):
     LOG.error("ERROR: please set RENDERING_RESULT_PATH ('%s') to an existing directory" % \
diff --git a/www/maposmatic/models.py b/www/maposmatic/models.py
index dc3b3ce..d953df6 100644
--- a/www/maposmatic/models.py
+++ b/www/maposmatic/models.py
@@ -81,6 +81,7 @@ class MapRenderingJob(models.Model):
     submitterip = models.IPAddressField()
     index_queue_at_submission = models.IntegerField()
     map_language = models.CharField(max_length=16)
+    has_files = models.BooleanField(default=True)
 
     objects = MapRenderingJobManager()
 
@@ -98,6 +99,27 @@ class MapRenderingJob(models.Model):
                              self.startofrendering_time.strftime("%Y-%m-%d_%H-%M"),
                              self.maptitle_computized())
 
+
+    @staticmethod
+    def get_job_from_filename(name):
+        """Tries to find the parent job of the given file from its filename.
+        Both the job ID found in the first part of the prefix and the entire
+        files_prefix are used to match a job.
+
+        Throws MapRenderingJob.DoesNotExist when no parent job was found.
+        """
+
+        name = os.path.basename(name)
+        try:
+            jobid = int(name.split('_', 1)[0])
+            job = MapRenderingJob.objects.get(id=jobid)
+            if name.startswith(job.files_prefix()):
+                return job
+        except (ValueError, IndexError):
+            pass
+
+        raise MapRenderingJob.DoesNotExist
+
     def start_rendering(self):
         self.status = 1
         self.startofrendering_time = datetime.now()
@@ -107,6 +129,8 @@ class MapRenderingJob(models.Model):
         self.status = 2
         self.endofrendering_time = datetime.now()
         self.resultmsg = resultmsg
+        if self.is_done_ok():
+            self.has_files = True
         self.save()
 
     def is_waiting(self):
@@ -137,24 +161,59 @@ class MapRenderingJob(models.Model):
         return os.path.join(www.settings.RENDERING_RESULT_PATH, self.files_prefix() + "_index." + format)
 
     def output_files(self):
+        """Returns a structured dictionary of the output files for this job.
+        The result contains two lists, 'maps' and 'indeces', listing the output
+        files. Each file is reported by a tuple (format, path, title, size)."""
+
         allfiles = {'maps': [], 'indeces': []}
 
         for format in www.settings.RENDERING_RESULT_FORMATS:
             # Map files (all formats but CSV)
-            if format != 'csv' and os.path.exists(self.get_map_filepath(format)):
-                allfiles['maps'].append((format, self.get_map_fileurl(format),
-                    _("%(title)s %(format)s Map") % {'title': self.maptitle, 'format': format.upper()}))
+            map_path = self.get_map_filepath(format)
+            if format != 'csv' and os.path.exists(map_path):
+                allfiles['maps'].append((format, map_path,
+                    _("%(title)s %(format)s Map") % {'title': self.maptitle,
+                                                     'format': format.upper()},
+                    os.stat(map_path).st_size))
+
             # Index files
-            if os.path.exists(self.get_index_filepath(format)):
-                allfiles['indeces'].append((format, self.get_index_fileurl(format),
-                    _("%(title)s %(format)s Index") % {'title': self.maptitle, 'format': format.upper()}))
+            index_path = self.get_index_filepath(format)
+            if os.path.exists(index_path):
+                allfiles['indeces'].append((format, index_path,
+                    _("%(title)s %(format)s Index") % {'title': self.maptitle,
+                                                       'format': format.upper()},
+                    os.stat(index_path).st_size))
 
         return allfiles
 
     def has_output_files(self):
+        """Tells if this job still has its output files present in the
+        RENDERING_RESULT_PATH. Their actual presence is checked even if
+        has_files is True."""
+
+        if not self.has_files:
+            return False
+
         files = self.output_files()
         return len(files['maps']) + len(files['indeces'])
 
+    def remove_all_files(self):
+        """Removes all the output files from this job, and returns the space
+        saved in bytes (Note: the thumbnail is not removed)."""
+
+        files = self.output_files()
+        saved = 0
+        removed = 0
+
+        for f in (files['maps'] + files['indeces']):
+            saved += f[3]
+            removed += 1
+            os.remove(f[1])
+
+        self.has_files = False
+        self.save()
+        return removed, saved
+
     def get_thumbnail(self):
         thumbnail_file = os.path.join(www.settings.RENDERING_RESULT_PATH, self.files_prefix() + "_small.png")
         thumbnail_url = www.settings.RENDERING_RESULT_URL + "/" + self.files_prefix() + "_small.png"
diff --git a/www/media/style.css b/www/media/style.css
index da271e1..a542a55 100644
--- a/www/media/style.css
+++ b/www/media/style.css
@@ -269,6 +269,10 @@ table.jobinfo td.info {
   vertical-align: top;
 }
 
+p.nofiles {
+  font-style: italic;
+}
+
 div.mapsearch {
   float: right;
   margin-top: 1em;
diff --git a/www/templates/maposmatic/job.html b/www/templates/maposmatic/job.html
index cd181d6..1e270b3 100644
--- a/www/templates/maposmatic/job.html
+++ b/www/templates/maposmatic/job.html
@@ -64,7 +64,7 @@
       <li>{% trans "Index: " %} {% for file in job.output_files.indeces %}<a href="{{ file.1 }}" title="{{ file.2 }}">{{ file.0|upper }}</a>{% if not forloop.last %}, {% endif %}{% endfor %}.</li>
     </ul>
     {% else %}
-      {% trans "The generated files are no longer available." %}
+      <p class="nofiles">{% trans "The generated files are no longer available." %}</p>
     {% endif %}
     {% endif %}
   </td>
-- 
1.6.3.3.277.g88938c
