Mailing List archive


[linux-dvb] Re: DVB character coding...



Robert Schlabbach wrote:

From: "Jesper Sörensen" <jesper@datapartner.se>

The wording in annex A isn't that good and I had some problems figuring
out what they meant too. Anyway, I wouldn't look too carefully at those
tables. I think what they mean is that unless some other coding is
specified you should use Latin-1 (ISO 8859-1) which makes sense since it
is the most widely used coding in the West and on the net. The 0xE9 will
then indeed be mapped to "é", as expected.

I think the wording is pretty clear:

| Annex A (normative):
| Coding of text characters
[...]
| if the first byte of the text field has a value in the range "0x20"
| to "0xFF" then this and all subsequent bytes in the text item are
| coded using the default character coding table (table 00 - Latin
| alphabet) of figure A.1

Figure A.1 is a superset of ISO/IEC 6937, *not* any of the ISO/IEC 8859-x
tables. Using this table, the character "é" would have to be composed with
the sequence 0xC2 0x65.
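
To make the difference concrete, here is a minimal Python sketch of the composition rule described above, in which a non-spacing diacritic byte precedes its base letter. The function name decode_table00_sketch and the partial NON_SPACING table are assumptions made for illustration; this is not a complete implementation of figure A.1.

```python
import unicodedata

# Assumed partial table: ISO/IEC 6937 non-spacing diacritic bytes mapped
# to the corresponding Unicode combining marks. Not the full figure A.1.
NON_SPACING = {
    0xC1: "\u0300",  # grave accent
    0xC2: "\u0301",  # acute accent
    0xC3: "\u0302",  # circumflex
    0xC4: "\u0303",  # tilde
    0xC8: "\u0308",  # diaeresis
    0xCA: "\u030A",  # ring above
    0xCB: "\u0327",  # cedilla
    0xCF: "\u030C",  # caron
}


def decode_table00_sketch(data: bytes) -> str:
    """Decode ASCII plus the diacritic prefixes listed in NON_SPACING."""
    out = []
    pending = ""  # combining mark waiting for its base letter
    for b in data:
        if b in NON_SPACING:
            pending = NON_SPACING[b]
        elif b < 0x80:
            # Unicode puts the combining mark after the base letter.
            out.append(chr(b) + pending)
            pending = ""
        else:
            # Everything else in the upper half is outside this sketch.
            out.append("\ufffd")
            pending = ""
    # NFC turns "e" + U+0301 into the single code point U+00E9.
    return unicodedata.normalize("NFC", "".join(out))


composed = decode_table00_sketch(bytes([0xC2, 0x65]))
latin1 = bytes([0xE9]).decode("latin-1")
print(composed == latin1)
```

Both paths end at the same character, but the bytes on the wire differ: two bytes under the default table versus the single byte 0xE9 under Latin-1.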

Note that this is a _normative_ Annex, i.e. this is part of the standard,
not an option. It does appear, though, that not even professional tools
properly implement character encoding/decoding that fully complies with
this standard...

Yeah, I don't know if it's the standard or the implementations that are broken. I can only tell you that my DVB feed looks the same as yours. It uses Latin-1 and it doesn't have any "charset escape". Maybe I'm just stupid but it doesn't make much sense to me to include all the other 8859-x encodings and not have Latin-1. AFAIK 8859-15 is mostly the same as Latin-1 but not quite...
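
As a quick check of the "mostly the same but not quite" remark, the following Python sketch lists the byte values that decode differently under the two tables. It relies only on Python's built-in latin-1 and iso8859_15 codecs; the helper name differing_bytes is made up for this example.

```python
def differing_bytes() -> list[int]:
    """Return the byte values that Latin-1 and ISO 8859-15 decode differently."""
    return [
        b
        for b in range(0x100)
        if bytes([b]).decode("latin-1") != bytes([b]).decode("iso8859_15")
    ]


for b in differing_bytes():
    one = bytes([b])
    print(f"0x{b:02X}: {one.decode('latin-1')!r} vs {one.decode('iso8859_15')!r}")
```

Only eight positions differ (0xA4 becomes the euro sign, for instance), and 0xE9 is "é" in both tables.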

BTW, do you happen to know what they mean when they say the following (WRT 16-bit codings):

* if the first byte of the text field has a value "0x10" then the following two bytes carry a 16-bit value (uimsbf) N
to indicate that the remaining data of the text field is coded using the character code table specified by
ISO Standard 8859, parts 1 to 9;

What does that mean? High byte selects table and low byte has the character code?

* if the first byte of the text field has a value "0x11" then the remaining bytes in the text item are coded in pairs
in accordance with the Basic Multilingual Plane of ISO/IEC 10646-1 [8];

Is this UCS-2?
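
One possible reading of the two quoted bullets, sketched in Python. It assumes that N is a big-endian 16-bit number giving the ISO 8859 part directly, and that the 0x11 case is big-endian two-byte code units (decoded here with Python's utf-16-be codec, which covers the BMP). Both are assumptions drawn from the wording, not a confirmed interpretation, and decode_dvb_text_sketch is a made-up name.

```python
def decode_dvb_text_sketch(field: bytes) -> str:
    """Decode a text field that starts with a 0x10 or 0x11 selector byte."""
    if not field:
        return ""
    first = field[0]
    if first == 0x10:
        if len(field) < 3:
            raise ValueError("truncated 0x10 selector")
        # uimsbf: unsigned integer, most significant bit first (big-endian).
        n = int.from_bytes(field[1:3], "big")
        if not 1 <= n <= 9:
            raise ValueError(f"unsupported ISO 8859 part: {n}")
        # Assumption: N names the part, so N = 5 means ISO 8859-5.
        # Some parts leave bytes undefined, hence errors="replace".
        return field[3:].decode(f"iso8859_{n}", errors="replace")
    if first == 0x11:
        # Assumption: byte pairs are big-endian BMP code points.
        return field[1:].decode("utf-16-be", errors="replace")
    raise NotImplementedError("other first-byte values are not handled here")


print(decode_dvb_text_sketch(bytes([0x10, 0x00, 0x01, 0xE9])))
print(decode_dvb_text_sketch(bytes([0x11, 0x00, 0xE9])))
```

Under this reading the high byte of N is simply zero for parts 1 to 9; it does not select anything per character.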




