SMS text messages ... how they work ... charactersets

gnustep-dev

From:	Richard Frith-Macdonald
Subject:	SMS text messages ... how they work ... charactersets
Date:	Tue, 5 Apr 2011 21:45:05 +0100
I guess now is a good time to go through some text messaging fundamentals for 
newcomers (and as a refresher for others).  It's important to remember when 
dealing with people in other companies (and I definitely include telecoms 
companies here) that they usually don't understand characterset issues.

First a few terms ...
A 'character' is a symbol that you read/understand to have a particular 
meaning in a language ... for instance a lowercase 'a' is a character.
A 'codepoint' is a number used to represent a character within a computer... 
the computer's internal representation of the character.
A 'glyph' is the pictorial representation of a character ... what you actually 
see ... the glyph you see depends on the typeface/font used to display the 
character and on the hardware displaying it.

Mostly for SMS, we aren't interested in glyphs, but we *are* interested in 
characters and codepoints.
A 'characterset' is a set of characters and the term is also often used to 
refer to the 'encodings' of those characters ... an associated set of 
codepoints.
The ASCII (or IA5) characterset is a 7bit characterset ... that means that each 
codepoint in the characterset is a number which can be represented in 7 bits 
(i.e. a number in the range 0 to 127).
The GSM alphabet is a 7-bit characterset, but with a few escape sequences ... 
cases where a character is actually represented by two 7-bit values rather than 
one.

When a message is sent to/from a handset, it travels over the GSM network using 
a protocol with a 140 byte payload.  The protocol allows that payload to be 
interpreted as text basically in two ways:
1. as up to 160 7-bit codepoints from the GSM default alphabet (characterset 
gsm0338) where the 7-bit values are packed into 140 bytes
2. as up to 70 16-bit codepoints from the Unicode alphabet (characterset UCS-2)
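
To make the arithmetic in option 1 concrete: 160 x 7 = 1120 bits = 140 bytes.  
Here is a minimal Python sketch of the packing, assuming the usual GSM 
convention of filling each byte from the least significant bit upwards.  The 
function name pack_septets is made up for illustration, and the input is 
assumed to be a list of gsm0338 codepoints already.

```python
def pack_septets(septets):
    """Pack 7-bit values into bytes, least significant bits first."""
    out = bytearray()
    acc = 0    # bit accumulator
    bits = 0   # number of valid bits currently held in acc
    for s in septets:
        if not 0 <= s <= 0x7F:
            raise ValueError("not a 7-bit value: %r" % (s,))
        acc |= s << bits
        bits += 7
        while bits >= 8:
            out.append(acc & 0xFF)
            acc >>= 8
            bits -= 8
    if bits:
        # flush the final partial byte (high bits are zero padding)
        out.append(acc & 0xFF)
    return bytes(out)


# 'hello' uses the same codepoints in gsm0338 as in ASCII
packed = pack_septets([0x68, 0x65, 0x6C, 0x6C, 0x6F])
full = pack_septets([0] * 160)   # a full 160 character message
```

A full message of 160 septets comes out as exactly 140 bytes, which is where 
the 160 character limit comes from.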

So, when we need to send a text message, we first need to see if the characters 
in that message are all in the gsm0338 characterset ... if they are, then we 
can send the text to be delivered as 7-bit data (and get up to 160 characters 
in a single message), if they aren't then we have to send them to be delivered 
as 16-bit data which means we might need to send more messages to deliver the 
same text.
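
As a rough illustration of that decision, here is a Python sketch (not 
Dragon's actual code).  The tables are the gsm0338 default alphabet and its 
basic extension table as I understand them; the names GSM_BASIC, GSM_EXT, 
gsm_septets and choose_encoding are invented for this example.

```python
# gsm0338 default alphabet: the position in the string is the codepoint.
GSM_BASIC = (
    "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞ\x1bÆæßÉ !\"#¤%&'()*+,-./0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§¿abcdefghijklmnopqrstuvwxyzäöñüà"
)
ESC = 0x1B
# Characters sent as an escape sequence: ESC followed by this value.
GSM_EXT = {"\f": 0x0A, "^": 0x14, "{": 0x28, "}": 0x29, "\\": 0x2F,
           "[": 0x3C, "~": 0x3D, "]": 0x3E, "|": 0x40, "€": 0x65}

BASIC = {ch: i for i, ch in enumerate(GSM_BASIC) if i != ESC}


def gsm_septets(text):
    """Return the 7-bit values for text, or None if a character is missing."""
    out = []
    for ch in text:
        if ch in BASIC:
            out.append(BASIC[ch])
        elif ch in GSM_EXT:
            out.extend([ESC, GSM_EXT[ch]])   # costs two 7-bit values
        else:
            return None
    return out


def choose_encoding(text):
    """Return (encoding, units used, units available in one message)."""
    septets = gsm_septets(text)
    if septets is not None:
        return ("gsm0338", len(septets), 160)
    return ("ucs2", len(text.encode("utf-16-be")) // 2, 70)
```

Note that an escaped character such as '€' uses up two of the 160 available 
7-bit values, so choose_encoding("€5") reports 3 units rather than 2.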

So far, so simple ... we only have two choices really (there are actually some 
extensions for a few foreign languages, but they are not widely supported by 
the networks, and we don't use them).

However ... that's the situation within the GSM network.  First we have to get 
a message from Dragon to the network.  This is usually done by using the SMPP 
protocol to talk to an SMSC run by a network operator.
Dragon sends an SMPP submit PDU to the SMSC and the SMSC converts it to a GSM 
deliver PDU which is sent to the handset.

If we want to send up to 70 16-bit characters, things are quite good ... we set 
the data coding scheme in the SMPP PDU to 'unicode' (to say that the data we 
are supplying is a series of codepoints from the UCS-2 characterset) and simply 
send the unicode data and the SMSC sends it on to the handset.  The problems 
here are that we get fewer than half as many characters per message, and quite 
often the SMSCs we connect to simply don't support UCS-2 :-(
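
For illustration, a minimal Python sketch of preparing the 16-bit case.  The 
value 0x08 is the SMPP v3.4 data coding value for UCS2; big-endian byte order 
is assumed here (it is the usual convention, but worth confirming with each 
SMSC), and ucs2_payload is a made-up helper name, not part of any library.

```python
DATA_CODING_UCS2 = 0x08   # SMPP v3.4 data_coding value for UCS2


def ucs2_payload(text):
    """Return (data_coding, payload bytes) for a single 16-bit message."""
    if any(ord(ch) > 0xFFFF for ch in text):
        # UCS-2 has no codepoints outside the 16-bit range
        raise ValueError("character outside the UCS-2 range")
    data = text.encode("utf-16-be")   # assumption: big-endian on the wire
    if len(data) > 140:
        raise ValueError("more than 70 characters, needs several messages")
    return (DATA_CODING_UCS2, data)
```

Each character costs two bytes of the 140 byte payload, which is why the 
limit is 70.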

If we want to send up to 160 7-bit characters, things with SMPP are not good at 
all.  The SMPP protocol is badly designed and does not have a data coding 
scheme which unambiguously means that the data in a PDU is from the gsm0338 
characterset.
Instead it has data coding values meaning the data is in the 'ascii' 
characterset, or the 'iso8859-1' characterset or a few others ... none of which 
actually contain completely the same characters that are in the gsm0338 
characterset.  This means that if we were to use one of those charactersets, 
there would be some characters we simply couldn't send, even though the GSM 
network supports them.
However, there is one data coding you can set in the SMPP protocol, called 
'SMSC default alphabet' which is undefined by SMPP: it says the SMSC decides 
what characterset is to be used.
Conventionally this is the data coding scheme used to send text messages.
Many SMSCs define this encoding to be gsm0338 ... these are well behaved SMSCs 
since we can then simply send up to 160 bytes, each containing a codepoint from 
gsm0338, and the SMSC will pack them into 140 bytes and send them to the 
handset.
Some other SMSCs define their default alphabet to be some other characterset, 
and when they receive an SMPP PDU they map those characters to the gsm 
characterset, pack them into 140 bytes, and send them.  If the characterset 
they define is a superset of gsm0338 then we can send all the characters in the 
gsm alphabet, but if it isn't we obviously can't.

So, what happens if the SMSC default alphabet is not the same as the GSM 
default alphabet?

1. we have a character to send which exists in the gsm alphabet but not the 
smsc alphabet ... it can't be sent ... so we try to map it to a similar 
character or to a placeholder such as a question mark.
2. we have a character which exists in the smsc alphabet but not the gsm 
alphabet ... we know it can't get to the handset ... so again we try to map it 
(if we didn't, the SMSC would probably fail the message).
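
Both cases come down to the same operation: force the text into a given set 
of allowed characters.  Here is a hedged Python sketch of one way to do that 
(not necessarily how Dragon does it).  It tries stripping accents to find a 
similar character, then falls back to a placeholder.  The name 
map_to_alphabet is invented, and LATIN1 is just an example of an SMSC 
alphabet.

```python
import unicodedata


def map_to_alphabet(text, allowed, placeholder="?"):
    """Map each character of text into the set of allowed characters."""
    out = []
    for ch in text:
        if ch in allowed:
            out.append(ch)
            continue
        # Try a similar character: decompose and drop combining accents.
        base = "".join(c for c in unicodedata.normalize("NFKD", ch)
                       if not unicodedata.combining(c))
        if base and all(c in allowed for c in base):
            out.append(base)
        else:
            out.append(placeholder)
    return "".join(out)


# Example SMSC alphabet: every character of iso8859-1.
LATIN1 = set(bytes(range(256)).decode("latin-1"))

# Case 1: 'Δ' is in the gsm alphabet but not in iso8859-1.
case1 = map_to_alphabet("Δt", LATIN1)
```

For case 2 the same function is called with the gsm alphabet as the allowed 
set, so that for example 'â' (in iso8859-1 but not in gsm0338) becomes 'a'.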

And of course a similar process applies in reverse for Mobile Originated 
messages coming in to Dragon from a network operator's SMSC.

The fact that the normal/default characterset used by SMPP is actually 
undefined is a major cause of problems when we set up new connections.
The vast majority of charactersets (including gsm0338) use the same codepoints 
for the common western characters (a-zA-Z0-9 and common punctuation), so many 
messages will in practice work with any setting for the characterset used.  
However, less usual characters ('@' and '€' are the most frequently used) will 
come through incorrectly if the charactersets configured at the two ends of the 
SMPP connection differ.
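
To show what 'come through incorrectly' looks like, here is a small Python 
sketch of the case where one end sends printable ASCII and the other end 
interprets the bytes as gsm0338.  The table lists only the printable 
positions where the two charactersets differ; the names are made up for the 
example.

```python
# Printable codepoints where gsm0338 holds a different character from ASCII.
GSM_DIFFERS_FROM_ASCII = {
    0x24: "¤",  # ASCII '$'
    0x40: "¡",  # ASCII '@'
    0x5B: "Ä", 0x5C: "Ö", 0x5D: "Ñ", 0x5E: "Ü", 0x5F: "§",  # ASCII [ \ ] ^ _
    0x60: "¿",  # ASCII '`'
    0x7B: "ä", 0x7C: "ö", 0x7D: "ñ", 0x7E: "ü",              # ASCII { | } ~
}


def ascii_bytes_read_as_gsm(data):
    """Show how printable ASCII bytes appear if read as gsm0338."""
    for b in data:
        if not 0x20 <= b <= 0x7E:
            raise ValueError("this sketch only handles printable ASCII")
    return "".join(GSM_DIFFERS_FROM_ASCII.get(b, chr(b)) for b in data)


garbled = ascii_bytes_read_as_gsm(b"me@example.com")
```

In gsm0338 the '@' character is codepoint 0 and '€' is the escape sequence 
0x1B 0x65, so neither survives a mismatch in either direction, while the 
letters and digits around them arrive intact.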

In theory this is easy to resolve ... we ask the SMSC people what their system 
is using as the SMSC default alphabet, then we configure Dragon to use the same 
characterset.
In practice, our contacts at the SMSC end may not know what characterset they 
are using, or they may misread their configuration, or they may think they 
know, but be wrong.  They may lack the technical knowledge to find out (or even 
to understand the question).  One possibility which seems to be quite common is 
that they read something in their system saying 'SMSC default alphabet', and 
think that it means 'GSM default alphabet' ... and insist that their end is 
therefore using the GSM alphabet.
My impression is that only a small minority of the companies we ask for this 
characterset information supply us with the correct answer first time!