<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="stylesheet" href="reset.css">
<link rel="stylesheet" href="style.css">
<link rel="preconnect" href="https://fonts.googleapis.com">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
<link href="https://fonts.googleapis.com/css2?family=Roboto:wght@100;300;400;500;700;900&display=swap"
rel="stylesheet">
<link
href="https://fonts.googleapis.com/css2?family=Roboto+Serif:wght@100;200;300;400;500;600;700;800;900&family=Roboto:wght@100;300;400;500;700;900&display=swap"
rel="stylesheet">
<title>Project4 for Lede Program: Sentiment Analysis</title>
</head>
<body>
<div class="column">
<h2 class="subtitle">A Sentiment-al Journey:<br>Translating Fiction Under the Watchful Eye of AI </h2>
<p class="byline">By Marie-France Han <span class="date">August 2023</span></p>
<!-- <div class="graphic wide"></div> -->
<div class="column">
<img src="./viz/cosette_cropped.png" alt="Engraving of Cosette" class="align-center large">
<p class="source">“Cosette Sweeping,” illustration from Victor Hugo</p>
</div>
<div class="column">
<p>Literary translation is a subjective effort to convey a writer's creative
choices
in another language. Word-for-word accuracy is neither required nor expected, but fidelity to the overall
meaning and emotional content is what makes a translation succeed. </p>
<p>How emotionally faithful is a piece of translated text? Is there a consistent, accurate way to measure that
dimension? Can sentiment analysis be a useful tool? Let's look at short but emotionally and structurally
significant passages from three novels of the Western canon. </p>
</div>
<!-- <p class="column"> -->
<!-- <h3 class="subtitle">Methodology</h3> -->
<p><strong>Methodology</strong>: Data project for the <a href="https://ledeprogram.com/" target="_blank"> Lede
Program at Columbia University</a>, summer 2023</p>
<div class="column">
<p><strong>Tools and Processes:</strong></p>
<p>-- Excerpts from <i>Les Misérables</i>, <i>In Search of Lost Time</i> and <i>Wuthering Heights</i> and their
translations obtained from Project Gutenberg <br>
-- Sentiment analysis performed with the Cardiff NLP <a
href="https://huggingface.co/cardiffnlp/twitter-xlm-roberta-base-sentiment">twitter-XLM-roBERTa-base</a> model
<br>
-- Data visualization performed with Flourish and D3/Svelte <br>
-- Code and data available on <a href="https://github.com/mfhan/translations" target="_blank">GitHub</a>
</p>
</div>
<div class="column">
<!-- <div class="container"> -->
<h3 class="subtitle">Les Misérables: the Death of Gavroche</h3>
<p class="column">
Anyone who's read the rousing final volume of Victor Hugo's protean novel -- and anybody who's
seen the long-running musical -- knows that this is an epic full of tears, death, suffering and
redemption.
Between the workhouses, the prison labor camps and the student riots, the body count is high and the wine
flows.
But even with this colorful backdrop, the death of street urchin Gavroche represents the culmination of all
the injustices and cruelties endured by the dispossessed. It is the absolute moral low point of the entire
novel. </p>
<p>
The insurgency has failed, and the small group of students entrenched behind their crumbling barricades know
that their fate is sealed.
Courageous little Gavroche ventures into the open, within shooting distance of the
soldiers, in a desperate attempt to collect ammunition. He dances with death until, at last, he is hit by a
marksman's bullet. He continues to sing for a brief moment until a second bullet kills him. There's horror but
also playfulness, childish fun -- Gavroche feels invulnerable until he's shot dead.
</p>
<p>The original French version and its English translation were carefully divided into 58 pieces of text, then
analyzed by <i>twitter-XLM-roBERTa-base for Sentiment Analysis</i>, a multilingual model trained on almost 200
million tweets. </p>
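<p>As a rough illustration of the workflow just described, here is a minimal Python sketch. It assumes each
passage is stored with a slash marking every cut (the notation used in the quotations on this page) and that the
model is loaded through the Hugging Face <code>transformers</code> pipeline. The helper names
<code>split_segments</code>, <code>load_classifier</code> and <code>classify_segments</code> are hypothetical and
are not taken from the project's code.</p>
<pre>
```python
def split_segments(passage, separator="/"):
    """Cut a passage into trimmed, non-empty pieces at each separator."""
    return [piece.strip() for piece in passage.split(separator) if piece.strip()]


def load_classifier(model_name="cardiffnlp/twitter-xlm-roberta-base-sentiment"):
    """Load the multilingual sentiment model (downloads weights on first use)."""
    # Imported lazily so the rest of the sketch runs without transformers installed.
    from transformers import pipeline
    return pipeline("sentiment-analysis", model=model_name, tokenizer=model_name)


def classify_segments(segments, classifier):
    """Return one row (text, label, score) per segment.

    `classifier` is any callable that maps a string to a list holding one
    dict with "label" and "score" keys, which is what a transformers
    text-classification pipeline returns for a single input.
    """
    rows = []
    for text in segments:
        result = classifier(text)[0]
        rows.append({
            "text": text,
            "label": result["label"].lower(),  # normalise label casing
            "score": round(float(result["score"]), 2),
        })
    return rows


passage = ("This time he fell face downward on the pavement / and moved no more. / "
           "This grand little soul had taken its flight.")
segments = split_segments(passage)
# rows = classify_segments(segments, load_classifier())  # needs transformers and a network connection
```
</pre>
<p>Passing the classifier in as an argument keeps the splitting and bookkeeping testable without downloading the
model; the actual division of the excerpts was done by hand, so the automatic split is only an approximation of
that step.</p>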
<p>(RoBERTa, short for “Robustly Optimized BERT Pretraining Approach”, was developed at Facebook AI as a variant of
BERT (Bidirectional Encoder Representations from Transformers), a model originally developed at Google.)
</p>
<p>The French original is on the left, and the English translation on the right. Deep blue denotes a "positive"
sentiment with a score of more than 0.5; light blue is "positive" with a score of less than 0.5; light gray is
"neutral"; light red is a "negative" sentiment with a score of less than 0.5; and finally the deep red that
dominates the French version means "negative" with a score of more than 0.5.
<br>In other words: blue is positive, red is negative, gray is neutral. <br>
<b>Click on any bar to read the actual text.</b>
</p>
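<p>The color rule can be written down as a small function. This is a sketch only: the color names are
placeholders for the palette used in the charts, <code>bar_color</code> is a hypothetical helper, and the
treatment of a score of exactly 0.5 is an assumption, since the description above leaves that case open.</p>
<pre>
```python
# Placeholder colour names; the actual palette is defined in the Flourish charts.
COLORS = {
    ("positive", True): "deep blue",
    ("positive", False): "light blue",
    ("negative", True): "deep red",
    ("negative", False): "light red",
}


def bar_color(label, score, threshold=0.5):
    """Map a sentiment label and its score to a bar colour."""
    label = label.lower()
    if label == "neutral":
        return "light gray"  # neutral bars ignore the score
    # Assumption: a score of exactly 0.5 falls in the lighter shade.
    return COLORS[(label, score > threshold)]


print(bar_color("negative", 0.69))  # deep red
print(bar_color("positive", 0.47))  # light blue
```
</pre>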
<!-- <div class="row-container"> -->
<!-- <div class="column graphic"> -->
<div class="columns is-multiline is-mobile"></div>
<!-- <h3>French version:</h3> -->
<div class="flourish-embed flourish-heatmap" data-src="visualisation/14657032"
style="width:45%; display: inline-block; vertical-align: top;">
<script src="https://public.flourish.studio/resources/embed.js"></script>
</div>
<!-- <h3>English Translation:</h3> -->
<div class="flourish-embed flourish-heatmap" data-src="visualisation/14656678"
style="width:45%; display: inline-block; vertical-align: top;">
<script src="https://public.flourish.studio/resources/embed.js"></script>
</div>
</div>
<div class="column">
<p class="column">
No need to be bilingual to be struck by the vast differences in sentiment between the two versions. Out of 58
text
elements, the English translation has 26 neutrals and just 28 negatives, while the French original has 41
negative
labels and just 12 neutrals -- an unsurprising ratio for a text that describes the death of a child. </p>
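<p>To put those counts on a common scale, the labels can be tallied and converted to shares. A minimal sketch,
using only the figures reported above; the number of positive labels is not stated in the text and is inferred
here as the remainder out of 58, which is an assumption.</p>
<pre>
```python
from collections import Counter

TOTAL = 58  # text elements per version, as reported above

# Counts reported in the article; "positive" is inferred as the remainder (assumption).
english = Counter(neutral=26, negative=28)
french = Counter(negative=41, neutral=12)
for counts in (english, french):
    counts["positive"] = TOTAL - sum(counts.values())


def label_shares(counts):
    """Return each label's share of all text elements, rounded to 3 decimals."""
    total = sum(counts.values())
    return {label: round(n / total, 3) for label, n in counts.items()}


print(label_shares(english))
print(label_shares(french))
```
</pre>
<p>On these figures, roughly 71% of the French segments are labeled negative, against about 48% of the English
ones.</p>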
<p>The most striking discrepancy lies at the very end of the text, when a second bullet fatally strikes the
wounded boy:</p>
<p><i>"This time he fell face downward on the pavement / and moved no more. / This grand little soul had taken
its
flight."</i></p>
<p>While the French version gets a "negative" label for all three pieces of text, with scores ranging from 0.48
to
0.69, the English text gets: "neutral" with a 0.52 score, "neutral" with a 0.54 score, and a puzzling
"positive"
with a 0.47 score.
</p>
<p>What could explain such a contrast? Could it be that the twitter-XLM-roBERTa-base model, trained on the
syntax
and content of 21st-century tweets, fails to correctly interpret a work of 19th-century fiction full of
metaphors and emotion? </p>
<p>Another hypothesis: could it be that translations, by their nature, tend to lean toward a safer, more
neutral
version of the same phrases and ideas? </p>
</div>
<div class="column">
<h3 class="subtitle">Proust: The "Little Phrase"</h3>
<p class="column">Let's take another work famous for its depiction of complex human emotions:
Marcel
Proust's titanic seven-volume novel "In Search of Lost Time".
In this passage from the first volume, "Swann's Way", Swann's long-buried memories of a painful lost
love
are searingly,
"maddeningly" awakened by a little musical phrase from a violin sonata. Even the harshest pang of suffering is
intimately connected with the remembered joy of lost love.
</p>
<!-- </div> -->
<p>Here again, the original French version and its English translation were divided into 40 pieces of text and
analyzed by <i>twitter-XLM-roBERTa-base</i>. </p>
<p>The French original is on the left, and the English translation on the right.
Again, deep blue denotes a "positive" sentiment with a score of more than 0.5; light blue is "positive" with a
score of less than 0.5; light gray is "neutral"; light red is a "negative" sentiment with a score of less than
0.5; and deep red means "negative" with a score of more than 0.5.
<br>
<b>Click on a bar to read the text.</b>
</p>
<!-- <div class="row-container"> -->
<!-- <div class="column graphic"> -->
<div class="columns is-multiline is-mobile">
<div class="flourish-embed flourish-heatmap" data-src="visualisation/14657068"
style="width:45%; display: inline-block; vertical-align: top;">
<script src="https://public.flourish.studio/resources/embed.js"></script>
</div>
<div class="flourish-embed flourish-heatmap" data-src="visualisation/14657063"
style="width:45%; display: inline-block; vertical-align: top;">
<script src="https://public.flourish.studio/resources/embed.js"></script>
</div>
<!-- <p class="source">original edition xxxx ; translation xxxx</p> -->
</div>
<div class="column">
<p class="column">
While the contrast is a bit less striking, it is undeniable that the original has more red bars of negative
sentiment than the translation, which is dominated by neutrals.
</p>
<p>The French version starts briskly: the red bars in the first 10 lines illustrate the physical reaction to
the
sudden recollection ("and this apparition tore him with such anguish / that his hand rose impulsively to his
heart"). After fluttering in a desperate attempt to stave off the flood of memories (blues and neutrals in
the
middle section), it finally gives way to their "maddening" song -- in a confusing mix of pain and joy. </p>
<p>The model clearly finds the English version more neutral: 23 of its 40 pieces are labeled neutral, compared
with only 9 negatives.
The French version gets 18 negatives and 14 neutrals. </p>
</div>
<div class="column">
<!-- <p class="column">
Puzzling classifications: why does Roberta consider the final sentence a positive in french and a negative in English? Why xxxx? </p> -->
<div>
<p>Still, puzzling labels abound: why does the phrase "When Odette was in love with him" score a 0.6 negative
in
French, but a 0.65 neutral in English? And why does the model rate the final clause, "the forgotten
strains
of
happiness",
a 0.48 negative in English and a 0.66 positive in French? </p>
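<p>Mismatches like these can be listed automatically once the two versions are aligned segment by segment. A
minimal sketch, using only the two pairs quoted above; the <code>compare</code> helper and its three categories
("same", "shifted", "reversed") are hypothetical and were not part of the original analysis.</p>
<pre>
```python
def compare(pair):
    """Classify how the French and English labels of one aligned segment relate."""
    labels = {pair["fr"][0], pair["en"][0]}
    if len(labels) == 1:
        return "same"
    if labels == {"positive", "negative"}:
        return "reversed"  # opposite polarity in the two languages
    return "shifted"  # one side neutral, the other polarised


# (label, score) per language, as quoted in the article; texts shown in English.
pairs = [
    {"text": "When Odette was in love with him",
     "fr": ("negative", 0.60), "en": ("neutral", 0.65)},
    {"text": "the forgotten strains of happiness",
     "fr": ("positive", 0.66), "en": ("negative", 0.48)},
]

for pair in pairs:
    print(compare(pair), "|", pair["text"], "| FR:", pair["fr"], "| EN:", pair["en"])
```
</pre>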
</div>
<div class="column">
<h3 class="subtitle">English to French: Wuthering Heights</h3>
<p class="column">For good measure, I decided to take a look at English-to-French translation, and use
<i>twitter-XLM-roBERTa-base</i> on another emotionally charged work of
fiction: <i>Wuthering Heights</i>. In this scene, Heathcliff reappears after a three-year absence. Everything has
changed -- his beloved Cathy is married to Edgar Linton, and Heathcliff himself is a grown, somber man
bent on
revenge. But when he sees her, their shared delight consumes everything else.
</p>
<p>
"They were too much absorbed in their mutual joy / to suffer embarrassment."
"You don't deserve this welcome / to be absent and silent for three years / and never to think of me!"
"I've fought through a bitter life / since I last heard your voice; / and you must forgive me, / for I
struggled
only for you!"</p>
<p>In a bit more than 40 slices of text, there is joy, longing, anger and self-destruction. Will the French
translation successfully
convey this explosion of feelings, or will it water things down as we saw in our first two examples?
</p>
<!-- <p>Here again, the French translation and the English original were divided into 42 pieces of text and analyzed by
<i>twitter-XLM-roBERTa-base</i>. -->
</div>
<p>The French version is on the left, and the English on the right.
<br>
<b>Click on a bar to read the text.</b>
</p>
</div>
<div class="columns is-multiline is-mobile">
<div class="flourish-embed flourish-heatmap" data-src="visualisation/14677272"
style="width:45%; display: inline-block; vertical-align: top;">
<script src="https://public.flourish.studio/resources/embed.js"></script>
</div>
<div class="flourish-embed flourish-heatmap" data-src="visualisation/14677286"
style="width:45%; display: inline-block; vertical-align: top;">
<script src="https://public.flourish.studio/resources/embed.js"></script>
</div>
<!-- <p class="source">original edition xxxx ; translation xxxx</p> -->
</div>
<div class="column">
<p class="column">
</p>
<p>And yet again, a full half of the French text is negative, with just 11 neutrals and 10 positives. In the
English text, neutral labels represent half of the rows, with 17 negatives and just 4 positives. </p>
<p>Based on this small corpus, we observe that when it comes to literary translations, the Cardiff NLP RoBERTa
model
finds French sentences more polarized than their English counterparts. </p>
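<p>One simple way to quantify "more polarized" is the share of segments that receive a non-neutral label. The
sketch below uses only the neutral counts reported in this article; the metric itself and the name
<code>polarized_share</code> are assumptions made for illustration, and the total of 42 segments for
<i>Wuthering Heights</i> is inferred from its label counts.</p>
<pre>
```python
# Total segments and neutral labels per version, from the counts reported in this article.
# The Wuthering Heights total of 42 is inferred from its label counts (assumption).
NEUTRALS = {
    "Les Misérables": {"total": 58, "fr": 12, "en": 26},
    "Swann's Way": {"total": 40, "fr": 14, "en": 23},
    "Wuthering Heights": {"total": 42, "fr": 11, "en": 21},
}


def polarized_share(total, neutral):
    """Share of segments labelled positive or negative, i.e. not neutral."""
    return round((total - neutral) / total, 2)


for title, row in NEUTRALS.items():
    fr = polarized_share(row["total"], row["fr"])
    en = polarized_share(row["total"], row["en"])
    print(f"{title}: French {fr:.2f}, English {en:.2f}")
```
</pre>
<p>On this measure the French share is higher than the English one in all three passages, whichever language
was the original.</p>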
<p>Hugo, with his intentional digressions and infinite subplots, and Proust, with his spellbinding sentences,
are both challenging to translate into another language.</p>
<p>This experiment used only three tiny samples, but it could serve as a jumping-off point for broader
comparisons.
Sentiment analysis may well prove to be one way to test how faithfully a translation carries the emotional
content of the original. New
translations are published all the time, to keep pace with the evolution of the
target language.
Much more could be done, with more languages, and on texts well outside the Western canon.
</p>
<p>Visualizing these diverging sentiment readings led us to experiment with several options.</p>
<p>One option was to visualize not only the label but also the score of each text element. Using the Svelte
framework, we gave the labels a shape (circles) and colors, and used a <a
href="https://svelte.dev/repl/161b41989f334a12a4c2d8383831df0d?version=4.1.2" target="_blank"
rel="noreferrer">force
simulation algorithm</a> to show the contrasting analyses.</p>
<p>While the force graph has the benefit of displaying the score of each text element, it doesn't allow a
snippet-by-snippet comparison between the two versions. That is something a horizontal bar chart can achieve,
and it is the
option we eventually chose. What is lost in score detail is offset by a clear display of the
sequential nature of the text being analyzed.</p>
</div>
<div class="column">
<p class="column">
Code and data available on <a href="https://github.com/mfhan/translations" target="_blank">GitHub</a>
</p>
</div>
<br>
<br>
</div>
</div>
<!-- <script type='text/javascript' src="script.js"></script> -->
</body>
</html>