SMS text messages ... how they work ... charactersets
From: Richard Frith-Macdonald
Subject: SMS text messages ... how they work ... charactersets
Date: Tue, 5 Apr 2011 21:45:05 +0100
I guess now is a good time to go through some text messaging fundamentals for
newcomers (and as a refresher for others). It's important to remember when
dealing with people in other companies (and I definitely include telecoms
companies here) that they usually don't understand characterset issues.
First a few terms ...
A 'character' is a symbol that you read/understand to have a particular
meaning in a language ... for instance a lowercase 'a' is a character.
A 'codepoint' is a number used to represent a character within a computer...
the computer's internal representation of the character.
A 'glyph' is the pictorial representation of a character ... what you actually
see ... the glyph you see depends on the typeface/font used to display the
character and on the hardware displaying it.
Mostly for SMS, we aren't interested in glyphs, but we *are* interested in
characters and codepoints.
A 'characterset' is a set of characters and the term is also often used to
refer to the 'encodings' of those characters ... an associated set of
codepoints.
The ASCII (or IA5) characterset is a 7bit characterset ... that means that each
codepoint in the characterset is a number which can be represented in 7 bits
(i.e. a number in the range 0 to 127).
The GSM alphabet is a 7-bit characterset, but with a few escape sequences ...
cases where a character is actually represented by two 7-bit values rather than
one.
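As a rough illustration of those escape sequences, here is a minimal Python sketch. The table is the GSM 03.38 default extension table as I understand it; the names GSM_ESCAPE, GSM_EXTENSION, septets_needed and escape_sequence are made up for this example and are not part of Dragon.

```python
# Second 7-bit value of each two-value escape sequence in the GSM 03.38
# extension table. The first value is always the escape code 0x1B.
GSM_ESCAPE = 0x1B
GSM_EXTENSION = {
    "\f": 0x0A, "^": 0x14, "{": 0x28, "}": 0x29, "\\": 0x2F,
    "[": 0x3C, "~": 0x3D, "]": 0x3E, "|": 0x40, "€": 0x65,
}


def escape_sequence(ch):
    """Return the two 7-bit values that represent an extension character."""
    return [GSM_ESCAPE, GSM_EXTENSION[ch]]


def septets_needed(text):
    """Count the 7-bit values needed for text.

    Assumes every character is in the GSM alphabet (basic or extension
    table); this sketch does not check that.
    """
    return sum(2 if ch in GSM_EXTENSION else 1 for ch in text)
```

So a euro sign costs two of the available 7-bit values, and a message full of brackets or euro signs holds fewer characters than you might expect.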
When a message is sent to/from a handset, it travels over the GSM network using
a protocol with a 140 byte payload. The protocol allows that payload to be
interpreted as text basically in two ways:
1. as up to 160 7-bit codepoints from the GSM default alphabet (characterset
gsm0338) where the 7-bit values are packed into 140 bytes
2. as up to 70 16-bit codepoints from the Unicode alphabet (characterset UCS-2)
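The arithmetic behind those two limits is 160 x 7 = 1120 bits = 140 bytes, and 70 x 2 = 140 bytes. A small sketch of the 7-bit packing (low-order bits first, as GSM does it) is below; pack_septets is a name invented for this example.

```python
def pack_septets(septets):
    """Pack a sequence of 7-bit values into bytes, low-order bits first."""
    out = bytearray()
    bits = 0    # bit accumulator
    nbits = 0   # number of valid bits currently in the accumulator
    for value in septets:
        bits |= (value & 0x7F) << nbits
        nbits += 7
        while nbits >= 8:
            out.append(bits & 0xFF)
            bits >>= 8
            nbits -= 8
    if nbits:
        # Leftover bits go into a final, zero-padded byte.
        out.append(bits & 0xFF)
    return bytes(out)


# 160 seven-bit values fill the 140 byte payload exactly.
full_7bit = pack_septets([0x41] * 160)

# 70 sixteen-bit values also fill it exactly.
full_16bit = ("A" * 70).encode("utf-16-be")
```

For the lowercase letters in "hello" the gsm0338 codepoints happen to equal the ASCII ones, so pack_septets(b"hello") gives the five bytes E8 32 9B FD 06.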
So, when we need to send a text message, we first need to see if the characters
in that message are all in the gsm0338 characterset ... if they are, then we
can send the text to be delivered as 7-bit data (and get up to 160 characters
in a single message), if they aren't then we have to send them to be delivered
as 16-bit data which means we might need to send more messages to deliver the
same text.
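That decision can be sketched in a few lines of Python. This is only an illustration, not Dragon's code: the tables are the GSM 03.38 default alphabet and extension table as I understand them, choose_encoding is a made-up name, and the UCS-2 count is approximated by counting UTF-16 code units (which differs for characters outside the basic multilingual plane).

```python
# GSM 03.38 basic table: the index of each character is its codepoint.
GSM_BASIC = (
    "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞ\x1bÆæßÉ !\"#¤%&'()*+,-./0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§¿abcdefghijklmnopqrstuvwxyzäöñüà"
)
# 0x1B is the escape code, not a character you can send on its own.
BASIC_CHARS = set(GSM_BASIC) - {"\x1b"}
# Characters sent as a two-value escape sequence.
EXT_CHARS = set("\f^{}\\[~]|€")


def choose_encoding(text):
    """Return (encoding, units_used, fits_in_one_message)."""
    if all(ch in BASIC_CHARS or ch in EXT_CHARS for ch in text):
        septets = sum(2 if ch in EXT_CHARS else 1 for ch in text)
        return ("gsm0338", septets, septets <= 160)
    # Assumption: one UTF-16 code unit per 16-bit codepoint.
    units = len(text.encode("utf-16-be")) // 2
    return ("ucs2", units, units <= 70)
```

Note that a single character outside gsm0338 pushes the whole message to 16-bit data, so one stray character can more than double the number of messages needed.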
So far, so simple ... we only have two choices really (there are actually some
extensions for a few other languages, but they are not widely supported by
the networks, and we don't use them).
However ... that's the situation within the GSM network. First we have to get
a message from Dragon to the network. This is usually done by using the SMPP
protocol to talk to an SMSC run by a network operator.
Dragon sends an SMPP submit PDU to the SMSC and the SMSC converts it to a GSM
deliver PDU which is sent to the handset.
If we want to send up to 70 16-bit characters, things are quite good ... we set
the data coding scheme in the SMPP PDU to 'unicode' (to say that the data we
are supplying is a series of codepoints from the UCS-2 characterset) and simply
send the unicode data and the SMSC sends it on to the handset. The problems
here are that we get fewer than half as many characters per message, and quite
often the SMSCs we connect to simply don't support UCS-2 :-(
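For the 16-bit case, the two fields that matter can be sketched as below. The value 0x08 is the SMPP data_coding value for UCS-2, and data_coding and short_message are the SMPP field names; the function name ucs2_submit_fields and the use of a plain dict are just assumptions for this example, and the big-endian byte order is the usual convention rather than something every SMSC is guaranteed to follow.

```python
DATA_CODING_UCS2 = 0x08  # SMPP data_coding value meaning UCS-2


def ucs2_submit_fields(text):
    """Return the data_coding and short_message values for a UCS-2 message.

    Assumption: text only contains characters that fit in one 16-bit
    codepoint, so UTF-16 big-endian output is also valid UCS-2.
    """
    payload = text.encode("utf-16-be")
    if len(payload) > 140:
        raise ValueError("more than 70 16-bit codepoints; needs several messages")
    return {"data_coding": DATA_CODING_UCS2, "short_message": payload}
```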
If we want to send up to 160 7-bit characters, things with SMPP are not good at
all. The SMPP protocol is badly designed and does not have a data coding
scheme which unambiguously means that the data in a PDU is from the gsm0338
characterset.
Instead it has data coding values meaning the data is in the 'ascii'
characterset, or the 'iso8859-1' characterset or a few others ... none of which
contains exactly the same characters as the gsm0338 characterset. This means
that if we were to use one of those charactersets, there would be some
characters we simply couldn't send, even though the GSM network supports them.
However, there is one data coding you can set in the SMPP protocol, called
'SMSC default alphabet' which is undefined by SMPP: it says the SMSC decides
what characterset is to be used.
Conventionally this is the data coding scheme used to send text messages.
Many SMSCs define this encoding to be gsm0338 ... these are well behaved SMSCs
since we can then simply send up to 160 bytes, each containing a codepoint from
gsm0338, and the SMSC will pack them into 140 bytes and send them to the
handset.
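For such a well behaved SMSC, what we put in the PDU is one byte per gsm0338 codepoint, unpacked, with the data coding set to 0x00 (the SMPP value for 'SMSC default alphabet'). A minimal sketch, with the tables being the GSM 03.38 default alphabet as I understand it and gsm_unpacked being a name invented for this example:

```python
DATA_CODING_SMSC_DEFAULT = 0x00  # SMPP data_coding: 'SMSC default alphabet'

# GSM 03.38 basic table: the index of each character is its codepoint.
GSM_BASIC = (
    "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞ\x1bÆæßÉ !\"#¤%&'()*+,-./0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§¿abcdefghijklmnopqrstuvwxyzäöñüà"
)
# Extension table: character -> value that follows the escape code 0x1B.
GSM_EXT = {
    "\f": 0x0A, "^": 0x14, "{": 0x28, "}": 0x29, "\\": 0x2F,
    "[": 0x3C, "~": 0x3D, "]": 0x3E, "|": 0x40, "€": 0x65,
}


def gsm_unpacked(text):
    """Encode text as one byte per gsm0338 codepoint (no 7-bit packing)."""
    out = bytearray()
    for ch in text:
        if ch in GSM_EXT:
            out += bytes([0x1B, GSM_EXT[ch]])
            continue
        code = GSM_BASIC.find(ch)
        if code < 0 or code == 0x1B:
            raise ValueError("not in the GSM alphabet: %r" % ch)
        out.append(code)
    if len(out) > 160:
        raise ValueError("more than 160 codepoints; needs several messages")
    return bytes(out)
```

The thing to notice is that '@' goes out as byte 0x00, not 0x40 ... gsm_unpacked("a@b") is the three bytes 61 00 62.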
Some other SMSCs define their default alphabet to be some other characterset,
and when they receive an SMPP PDU they map those characters to the gsm
characterset, pack them into 140 bytes, and send them. If the characterset
they define is a superset of gsm0338 then we can send all the characters in the
gsm alphabet, but if it isn't we obviously can't.
So, what happens if the SMSC default alphabet is not the same as the GSM
default alphabet?
1. we have a character to send which exists in the gsm alphabet but not the
smsc alphabet ... it can't be sent ... so we try to map it to a similar
character or to a placeholder such as a question mark.
2. we have a character which exists in the smsc alphabet but not the gsm
alphabet ... we know it can't get to the handset ... so again we try to map it
(if we didn't, the SMSC would probably fail the message)
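Both cases come down to the same rule: a character can only be sent as-is if it is in both alphabets. The sketch below assumes, purely for illustration, an SMSC whose default alphabet is iso8859-1, and uses 'strip the accent' as the attempt at a similar character. The name make_sendable and the accent-stripping approach are my own choices for the example, not a description of what Dragon does.

```python
import unicodedata

# GSM 03.38 basic table: the index of each character is its codepoint.
GSM_BASIC = (
    "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞ\x1bÆæßÉ !\"#¤%&'()*+,-./0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§¿abcdefghijklmnopqrstuvwxyzäöñüà"
)
GSM_CHARS = (set(GSM_BASIC) - {"\x1b"}) | set("\f^{}\\[~]|€")


def in_charset(ch, charset):
    """True if ch can be encoded in the named Python codec."""
    try:
        ch.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


def make_sendable(text, smsc_charset="iso8859-1"):
    """Map each character to something present in both alphabets."""
    def ok(c):
        return c in GSM_CHARS and in_charset(c, smsc_charset)

    out = []
    for ch in text:
        if ok(ch):
            out.append(ch)
            continue
        # Try a similar character: the base letter without its accent.
        base = unicodedata.normalize("NFKD", ch)[0]
        out.append(base if ok(base) else "?")
    return "".join(out)
```

With those assumptions 'Δ' and '€' (in gsm0338 but not iso8859-1) become '?', while 'á' (in iso8859-1 but not gsm0338) becomes 'a'.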
And of course a similar process applies in reverse for Mobile Originated
messages coming in to Dragon from a network operator's SMSC.
The fact that the normal/default characterset used by SMPP is actually
undefined is a major cause of problems when we set up new connections.
The vast majority of charactersets (including gsm0338) use the same codepoints
for the common western characters (a-z, A-Z, 0-9 and common punctuation), so
many messages will in practice work with any setting for the characterset used.
However, less usual characters ('@' and '€' are the most frequently used) will
come through incorrectly if the charactersets configured at the two ends of the
SMPP connection differ.
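To make the '@' problem concrete: in ASCII '@' is codepoint 0x40, but in gsm0338 '@' is 0x00 and 0x40 is '¡'. The sketch below shows what a handset would display if one end sent printable ASCII and the other end treated the bytes as gsm0338 without conversion. The table lists the printable ASCII codepoints that mean something different in gsm0338 (as far as I know it is complete for the range 0x20 to 0x7E); ascii_read_as_gsm is a name invented for the example.

```python
# Printable ASCII codepoints whose meaning differs in gsm0338.
ASCII_VALUE_AS_GSM = {
    0x24: "¤", 0x40: "¡", 0x5B: "Ä", 0x5C: "Ö", 0x5D: "Ñ", 0x5E: "Ü",
    0x5F: "§", 0x60: "¿", 0x7B: "ä", 0x7C: "ö", 0x7D: "ñ", 0x7E: "ü",
}


def ascii_read_as_gsm(text):
    """Show printable ASCII text as seen when read as gsm0338 codepoints."""
    data = text.encode("ascii")
    if any(b < 0x20 or b > 0x7E for b in data):
        raise ValueError("this sketch only handles printable ASCII")
    return "".join(ASCII_VALUE_AS_GSM.get(b, chr(b)) for b in data)
```

So an email address like user@example.com arrives as user¡example.com, while a message made up only of letters, digits and simple punctuation arrives unchanged ... which is exactly why this sort of misconfiguration can go unnoticed in testing.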
In theory this is easy to resolve ... we ask the SMSC people what their system
is using as the SMSC default alphabet, then we configure Dragon to use the same
characterset.
In practice, our contacts at the SMSC end may not know what characterset they
are using, or they may misread their configuration, or they may think they
know, but be wrong. They may lack the technical knowledge to find out (or even
to understand the question). One possibility which seems to be quite common is
that they read something in their system saying 'SMSC default alphabet', and
think that it means 'GSM default alphabet' ... and insist that their end is
therefore using the GSM alphabet.
My impression is that only a small minority of the companies we ask for this
characterset information supply us with the correct answer first time!