WEBVTT

00:00.000 --> 00:11.080
I'm Jonathan Clark, a developer from the Document Foundation, working on improving language

00:11.080 --> 00:15.120
support in LibreOffice.

00:15.120 --> 00:20.400
Language support is very important for our project and for the foundation.

00:20.400 --> 00:23.200
Many languages are endangered, many are going extinct.

00:23.200 --> 00:27.880
I think the saddest thing is when people are forced to use a different language because their

00:27.880 --> 00:32.440
software doesn't work in the language they want to use.

00:32.440 --> 00:37.760
As developers, I think we have a tremendous opportunity and responsibility to make sure

00:37.760 --> 00:42.720
that our project supports everybody's languages.

00:42.720 --> 00:48.880
So, what we've been doing is fixing bugs.

00:48.880 --> 00:55.160
In the last year, LibreOffice developers have fixed more than 100 bugs directly related

00:55.160 --> 00:57.160
to language support.

00:57.160 --> 01:02.840
So, 34 of them were very, very old.

01:02.840 --> 01:09.160
Speak up.

01:09.160 --> 01:13.920
If there's one takeaway from this talk, it's that if you tried using LibreOffice a while

01:13.920 --> 01:18.920
ago and it didn't work well for you because of your language, because of how we support

01:18.920 --> 01:22.000
it, please try a new version.

01:22.000 --> 01:25.440
The situation might be much better now.

01:25.440 --> 01:31.000
There are too many bug fixes to talk about, so I just wanted to focus on one bug that

01:31.000 --> 01:38.560
can help to kind of motivate and explain some of the challenges behind retrofitting good

01:38.560 --> 01:45.600
language support into a mature code base.

01:45.600 --> 01:53.840
This is a very old bug in LibreOffice that's been around since the very beginning.

01:53.840 --> 02:01.160
When you change the color of text inside an English word, the spacing of the letters gets

02:01.160 --> 02:02.880
a little bit weird.

02:02.880 --> 02:12.720
It's kind of hard to see, so I blew it up and overlaid the two over there.

02:12.720 --> 02:18.840
Now, this bug sat on the back burner for a really long time, because it's kind of hard

02:18.840 --> 02:21.960
to understand why it's important.

02:21.960 --> 02:26.960
It seems like a very minor difference in text, and you could always just tell a user not to do

02:26.960 --> 02:28.680
this.

02:28.680 --> 02:33.920
But it turns out that LibreOffice actually does this automatically sometimes, for instance,

02:33.920 --> 02:37.160
when you're using the track changes feature.

02:37.200 --> 02:43.800
So we can't just tell users not to do it, because we do it.

02:43.800 --> 02:45.800
Why does this happen?

02:45.800 --> 02:53.040
It's due to a very common assumption made by software developers, especially developers

02:53.040 --> 02:57.080
who are most used to the Latin script.

02:57.080 --> 03:02.440
We kind of tend to view computers as high-concept typewriters that just arrange letters from

03:02.480 --> 03:06.080
left to right across the screen.

03:06.080 --> 03:16.760
And so if that's what you believe about text, it leads you to an obvious abstraction.

03:16.760 --> 03:22.880
Your graphics library, your graphics code, is responsible for atomically rendering text, pieces

03:22.880 --> 03:26.840
of text, in a particular style using a particular font.

03:26.880 --> 03:33.400
The application code is responsible for chopping text up into regions of interest, into stretches

03:33.400 --> 03:39.000
of text that share the same style, whatever you have.

03:39.000 --> 03:42.120
And this is what this ends up looking like.

03:42.120 --> 03:47.240
So we start with a virtual pen at a particular position on screen.

03:47.240 --> 03:54.800
We tell our graphics library to draw a string, in this case T, the first letter.

03:54.840 --> 04:00.440
We advance our pen to the right by the width of the text that we drew.

04:00.440 --> 04:04.680
Set the color, draw the rest of the text.

04:04.680 --> 04:08.400
In this case another complete string, and advance the pen again.

04:08.400 --> 04:14.080
So now we're ready to draw whatever text comes after this.
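
NOTE
Editor's sketch of the pen-advance algorithm described in the cues above. This is a minimal illustration, not LibreOffice code. The toy font (ADVANCE, KERNING), the word "Toy", and every function name here are hypothetical.
```python
# Hypothetical toy font: per-character advance widths and pair kerning adjustments.
ADVANCE = {"T": 10, "o": 8, "y": 8}
KERNING = {("T", "o"): -3, ("o", "y"): -1}
def layout(text):
    """Return the x offset of each character and the total width, applying kerning."""
    xs, pen = [], 0
    for i, ch in enumerate(text):
        if i > 0:
            pen += KERNING.get((text[i - 1], ch), 0)
        xs.append(pen)
        pen += ADVANCE[ch]
    return xs, pen
def draw_runs_naive(runs):
    """Lay out each styled run on its own, then advance the pen by that run's width."""
    placed, pen = [], 0
    for text, color in runs:
        xs, width = layout(text)
        placed += [(ch, pen + x, color) for ch, x in zip(text, xs)]
        pen += width
    return placed
naive = draw_runs_naive([("T", "black"), ("oy", "red")])
whole_word_xs, _ = layout("Toy")
print(naive[1][1], whole_word_xs[1])  # prints "10 7"
```
In this toy model the "o" lands at x=10 when the word is split by color, but at x=7 when the word is laid out as one string, because the kerning between "T" and "o" is lost at the run boundary.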

04:14.080 --> 04:18.240
The problem is that that's not the way that text works.

04:18.240 --> 04:21.920
This turns out to be a very bad assumption.

04:21.920 --> 04:27.720
Even in English, we use lots of special font features to adjust the positions of characters

04:27.720 --> 04:33.600
on a screen to make the text look more attractive.

04:33.600 --> 04:38.160
So in software, this is something we call an architecture bug.

04:38.160 --> 04:44.160
It's an assumption or a design decision made very early in the project that turns out

04:44.160 --> 04:45.960
to be incorrect.

04:45.960 --> 04:52.960
When you have an architecture bug, it's very difficult, very expensive to fix.

04:52.960 --> 04:57.640
And in this case, fortunately for English text, it's only a minor irritation.

04:57.640 --> 04:59.200
You can still understand the text.

04:59.200 --> 05:03.000
It just looks a little bit funny.

05:03.000 --> 05:08.000
But in software, we have this category of languages called CTL languages.

05:08.000 --> 05:10.400
It stands for complex text layout.

05:10.960 --> 05:16.760
The dictionary definition is writing systems where the shapes of the characters, the appearance

05:16.760 --> 05:21.760
of the character, changes depending on the context in which the character appears.

05:21.760 --> 05:29.280
CTL languages are sadly very often neglected by software projects.

05:29.280 --> 05:36.480
Treated as an afterthought, but the group includes a number of heavy hitters.

05:36.480 --> 05:39.680
If your project doesn't consider the needs of CTL languages,

05:39.680 --> 05:44.880
you're giving up billions of potential users.

05:44.880 --> 05:49.520
So the canonical example of a CTL language is Arabic script.

05:49.520 --> 05:54.520
In Arabic, letters can change shape pretty dramatically depending on where they appear

05:54.520 --> 05:55.520
in a word.

05:55.520 --> 06:00.080
In this case, you can see here this is the isolated form.

06:00.080 --> 06:06.080
And that's when it appears at the start of a word, in the middle, or at the end.
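
NOTE
Editor's sketch of the positional forms mentioned above, using Python's standard unicodedata module. The letter BEH (U+0628) is picked for illustration only and is not necessarily the letter on the slide. Real text stores the base letter, and the shaping engine picks the form. These presentation-form code points exist in Unicode mainly for compatibility.
```python
import unicodedata
# Presentation forms of ARABIC LETTER BEH (U+0628), chosen purely as an example.
BEH_FORMS = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]
def describe_forms(chars):
    """Map each presentation form to (position tag, base letter) using Unicode data."""
    result = {}
    for ch in chars:
        # Decomposition strings look like "<isolated> 0628".
        tag, base_hex = unicodedata.decomposition(ch).split()
        result[ch] = (tag.strip("<>"), chr(int(base_hex, 16)))
    return result
forms = describe_forms(BEH_FORMS)
for ch in BEH_FORMS:
    print(f"U+{ord(ch):04X}", forms[ch][0], unicodedata.name(ch))
```
All four code points decompose to the same base letter; only the position tag differs.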

06:06.080 --> 06:09.840
But this isn't an unusual phenomenon.

06:09.840 --> 06:14.880
Even if you're only familiar with European languages, see this example of English cursive.

06:14.880 --> 06:20.560
The shape of the letter E changes significantly depending on whether it's connected high

06:20.560 --> 06:22.560
or low.

06:22.560 --> 06:31.840
If you try to render the other E in a different word, it would look pretty odd.

06:31.920 --> 06:37.760
I might even confuse it for a space in the middle of a word.

06:37.760 --> 06:47.840
So the algorithm that I described previously really doesn't handle CTL languages well.

06:47.840 --> 06:56.880
I can't read Tamil, but those don't look very similar to me.

06:56.880 --> 07:00.560
So the definition I gave before is kind of vague.

07:00.560 --> 07:06.880
It's languages that are heavily affected by context.

07:06.880 --> 07:13.360
I think an alternate definition might be a CTL language is one where when you try to use

07:13.360 --> 07:23.280
the algorithm that we're talking about, it fails to substantially preserve meaning.

07:23.360 --> 07:27.920
Just to make it clear, LibreOffice isn't alone in making this assumption.

07:27.920 --> 07:31.360
It's very, very common.

07:32.160 --> 07:34.000
May I have the time?

07:34.000 --> 07:35.360
You have two minutes.

07:35.360 --> 07:37.200
Two minutes, okay.

07:37.200 --> 07:44.880
Many other software projects make the same assumption, including some very popular operating systems

07:44.880 --> 07:46.080
made by very big companies.

07:47.440 --> 07:51.680
Anyway, it's too expensive to rewrite all of our application code to fix this bug.

07:52.640 --> 07:57.280
So instead, what we can do is just pass more information when we're figuring out what characters

07:57.280 --> 08:01.120
to render. In this case, once again, we start at the origin.

08:01.120 --> 08:08.160
This time, we're going to draw the entire text and just drop the characters that aren't included in this

08:08.160 --> 08:09.520
section of the text.

08:10.320 --> 08:11.600
Now we repeat the algorithm.

08:13.040 --> 08:16.640
At the end, we've successfully laid out the text correctly.
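
NOTE
Editor's sketch of the workaround described above: lay out the entire text once, then for each styled run keep only the characters that belong to that run. This is a toy illustration under stated assumptions, not the actual LibreOffice change. The toy font (ADVANCE, KERNING) and all function names are hypothetical.
```python
# Hypothetical toy font: per-character advance widths and pair kerning adjustments.
ADVANCE = {"T": 10, "o": 8, "y": 8}
KERNING = {("T", "o"): -3, ("o", "y"): -1}
def layout(text):
    """Return the x offset of each character and the total width, applying kerning."""
    xs, pen = [], 0
    for i, ch in enumerate(text):
        if i > 0:
            pen += KERNING.get((text[i - 1], ch), 0)
        xs.append(pen)
        pen += ADVANCE[ch]
    return xs, pen
def draw_runs_with_context(runs):
    """Position every character from a whole-text layout, then emit it run by run."""
    full_text = "".join(text for text, _ in runs)
    xs, _ = layout(full_text)
    placed, start = [], 0
    for text, color in runs:
        end = start + len(text)
        # Keep only the characters in this run; the others are dropped for this pass.
        placed += [(full_text[i], xs[i], color) for i in range(start, end)]
        start = end
    return placed
fixed = draw_runs_with_context([("T", "black"), ("oy", "red")])
print(fixed)  # prints [('T', 0, 'black'), ('o', 7, 'red'), ('y', 14, 'red')]
```
Because positions come from the whole-text layout, the kerning across the color boundary survives. The sketch assumes one output glyph per character in logical order, so it does not model scripts where characters are reordered.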

08:16.960 --> 08:20.720
Our approach isn't perfect.

08:22.560 --> 08:29.840
It can't handle cases where characters are actually rearranged in words, which is something that can

08:29.840 --> 08:35.760
happen in CTL languages, but our results are much closer to what someone would expect when they're

08:35.760 --> 08:36.640
using these languages.

08:38.880 --> 08:44.720
And all this is to say that if you're writing a new software project today, it's very helpful to keep

08:44.720 --> 08:51.680
these languages in mind because it's very difficult to reverse course once you've written a lot of application code.