Re: [Gnumed-devel] Brasil cities and states (demographics)

From: Busser, Jim

Subject: Re: [Gnumed-devel] Brasil cities and states (demographics)

Date: Wed, 16 Nov 2011 06:56:02 +0000

On 2011-11-15, at 6:47 PM, Jim Busser wrote:

The only hiccup is that the original bootstrapped GNUmed data contained most (but not all) of Brazil's states

The above is incorrect:

1) GNUmed bootstrapped all of the states with correct (unaccented) names. It was only that a handful of the state abbreviations (2-character codes) were either incorrect or have since been revised.

2) The patch in the Brazilian bootstrap will correctly provide the pt_BR translations for those states whose names are accented:

select i18n.upd_tx('pt_BR', 'Ceara', 'Ceará');
select i18n.upd_tx('pt_BR', 'Espirito Santo', 'Espírito Santo');
select i18n.upd_tx('pt_BR', 'Goias', 'Goiás');
select i18n.upd_tx('pt_BR', 'Maranhao', 'Maranhão');
select i18n.upd_tx('pt_BR', 'Para', 'Pará');
select i18n.upd_tx('pt_BR', 'Paraiba', 'Paraíba');
select i18n.upd_tx('pt_BR', 'Parana', 'Paraná');
select i18n.upd_tx('pt_BR', 'Piaui', 'Piauí');
select i18n.upd_tx('pt_BR', 'Rondonia', 'Rondônia');
select i18n.upd_tx('pt_BR', 'Sao Paulo', 'São Paulo');

However, my questions about which approach to take for populating unaccented versus accented names still stand, and I would welcome answers.

Regarding:

Found these

http://postgresql.1045698.n5.nabble.com/GENERAL-Remove-diacritical-marks-in-SQL-td1874140.html

http://scottbarnham.com/blog/2010/12/20/make-a-slug-in-postgresql-translating-diacritics/

However

1) PostgreSQL does not support SQL99's convert('string', 'ENCODING')

http://oreilly.com/catalog/sqlnut/chapter/ch04.html

2) to_ascii() supports only the LATIN1, LATIN2, LATIN9, and WIN1250 encodings, not UTF8. Even when LATIN1 or WIN1250 is given as the encoding, the output does not seem to be what we want:

SELECT to_ascii('Ceará', 'LATIN1');

--> CearA

SELECT to_ascii('Rondônia', 'LATIN1');

--> RondA'nia
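
The garbled output above is consistent with the UTF-8 bytes of each accented letter being read as two LATIN1 characters before folding. A minimal Python 3 sketch of that reading follows; it is an assumption about what happens inside to_ascii(), not something checked against the PostgreSQL source, and the helper name misread_as_latin1 is made up for illustration:

```python
# Assumption: the text is stored as UTF-8 and its bytes are then
# interpreted as LATIN1, one character per byte.
def misread_as_latin1(text):
    """Encode text as UTF-8, then decode those bytes as if they were LATIN1."""
    return text.encode("utf-8").decode("latin-1")

for name in ("Ceará", "Rondônia"):
    # repr() makes the two characters that replace each accented letter visible
    print(name, "->", repr(misread_as_latin1(name)))
```

Each accented letter becomes a capital A with a tilde followed by a punctuation-like character, which would explain the stray "A" and "'" once to_ascii() folds them.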

3) PostgreSQL 9.0 appears to provide an unaccent() function (as a contrib module, per the first link below)

http://www.postgresql.org/docs/9.0/static/unaccent.html

https://gist.github.com/1013892

http://readthedocs.org/docs/django-postgresql/en/1.4/aggregates.html

http://postgresql.1045698.n5.nabble.com/unaccent-extension-missing-some-accents-td4969070.html

4) python

http://bytes.com/topic/python/answers/29477-replace-accented-chars-unaccented-ones

http://bytes.com/topic/python/answers/889610-accented-characters-unaccented

5) perl

http://blog.endpoint.com/2010/03/postgresql-utf-8-conversion.html

http://postgresql.1045698.n5.nabble.com/GENERAL-Remove-diacritical-marks-in-SQL-td1874140.html
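
As a rough illustration of the Python option (4), the sketch below strips accents using only the standard library's unicodedata module and then prints i18n.upd_tx() calls in the same form as those listed earlier. It assumes Python 3 and that every accented letter decomposes into a base letter plus combining marks, which holds for the Brazilian state names here. The helper names strip_accents, sql_quote and upd_tx_statement are made up for this example and are not part of GNUmed:

```python
import unicodedata

def strip_accents(text):
    """Return text with combining diacritical marks removed."""
    # NFKD splits e.g. 'á' into 'a' plus a combining acute accent.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def sql_quote(value):
    """Quote a string as an SQL literal by doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"

def upd_tx_statement(accented, lang="pt_BR"):
    """Build an i18n.upd_tx() call mapping the unaccented name to the accented one."""
    plain = strip_accents(accented)
    return "select i18n.upd_tx(%s, %s, %s);" % (
        sql_quote(lang), sql_quote(plain), sql_quote(accented))

for state in ("Ceará", "Rondônia", "São Paulo"):
    print(upd_tx_statement(state))
```

Generating the statements from the accented names would keep the accented spelling as the single source, with the unaccented form derived from it rather than typed by hand.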

-- Jim