A lot of discussion around Matt Jockers’ Syuzhet package (involving Annie Swafford, Ted Underwood, Andrew Piper, Scott Weingart and many others) has focused on issues of validity — whether sentiment analysis is accurate enough for the task, whether the Fourier transform is an appropriate method for dimensionality reduction, and whether the emotional trajectories themselves are valid measurements of anything at all (Scott has a good enumeration of the various issues here).

Andrew’s discussion of the validity of inherently subjective measurements inspired me to solicit at least one data point from readers that we can use for one question under discussion with Syuzhet: what does a human judgment of the “emotional trajectory” of a work look like, and how often do readers agree with each other on this task?

This method of soliciting human judgments for inherently subjective tasks is at the core of NLP and much of machine learning — syntactic parsing, part-of-speech tagging, named entity recognition, topic classification, sentiment analysis, and lots of other tasks all rely on humans making judgments that are often surprisingly difficult in practice; learning algorithms in these cases are not so much learning any notion of “truth” as learning to reproduce the human judgments they’re given.  Agreement rates between humans are often seen as a proxy for the complexity of the task; if humans can’t agree, it can be a sign that the task is ill-defined or underspecified. Word sense disambiguation is one good example of this, with low inter-annotator agreement rates [Snyder and Palmer 2004]. And while sentiment analysis was originally designed with product/movie reviews in mind (does person X like product Y?), i.e., attitude with respect to a particular target, I think the more general sentiment-as-tone problem (is this tweet happy or sad?) is much less well specified as a problem, with an answer that can be judged by no one but the original author.
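As a rough sketch of how such agreement is often quantified (a generic illustration, not a computation from this post), Fleiss’ kappa measures how far agreement among multiple annotators’ categorical labels exceeds what chance alone would produce:

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of per-item category counts.

    counts: one row per annotated item; each row holds the number of
    annotators who assigned the item to each category, and every row
    sums to the same number of raters r.
    """
    n_items = len(counts)
    r = sum(counts[0])          # raters per item
    n_cats = len(counts[0])

    # Observed agreement: for each item, the fraction of rater pairs that agree.
    p_obs = sum(
        (sum(c * c for c in row) - r) / (r * (r - 1)) for row in counts
    ) / n_items

    # Chance agreement from the overall category proportions.
    totals = [sum(row[j] for row in counts) for j in range(n_cats)]
    props = [t / (n_items * r) for t in totals]
    p_exp = sum(p * p for p in props)

    return (p_obs - p_exp) / (1 - p_exp)

# Perfect agreement among 5 raters on 3 items (2 categories) -> kappa = 1.0
print(fleiss_kappa([[5, 0], [0, 5], [5, 0]]))
```

Kappa of 1 means perfect agreement; values near (or below) 0 mean the annotators agree no more than chance would predict — one signal that the task itself may be underspecified.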

One aspect of these kinds of annotations that I think is much less explored (which Piper points to, and which I think would be an extremely interesting area to work on) is the case where multiple judgments are simultaneously valid — different interpretations of the same phenomenon, each backed by its own argument.  For many foundational NLP tasks like parsing and part-of-speech tagging, context is an almost perfect arbiter, so while a sentence like “I saw a man with a telescope” may have PP attachment ambiguity in isolation (do I have the telescope, or does the man?), it’s usually situated in some larger discourse that can resolve that ambiguity for us.  Many tasks, however, don’t have clear answers no matter how much context is given — in NLP, pretty much anything involving semantics, pragmatics or larger discourse (including frame semantic parsing, summarization, machine translation, and question answering), where what counts as “correct” or “good” can admit multiple interpretations, even if we have standard methods for assessing the performance of systems.

Rating the emotional trajectory of texts is clearly one of these tasks.  As Piper notes, “readers will not universally agree on the sentiment of a sentence, let alone more global estimates of sentimental trajectories in a novel.”  At the same time, however, the ratings we get from readers will not be random; some trajectories will be favored, and there may even be some structure to the relationship between trajectories and who the readers are (e.g., readers from one theoretical camp favor X, those from another favor Y).  One way of potentially measuring validity here is to compare the output of a system with the judgments of multiple readers — even without making the assumption that there is some “correct” trajectory to be found, we have better reason to trust a system that agrees with some person’s judgment.  The important thing here is to solicit those judgments a priori, to avoid post-hoc rationalization of whatever output we get.

To get a small sense of the potential for inter-annotator agreement, I asked 5 people on Amazon Mechanical Turk to rate each of the 26 scenes in Romeo and Juliet on an 11-point scale (with -5 denoting extremely negative, 0 neutral and 5 extremely positive); plays are nice for this task since they have well-defined breakpoints (acts and scenes) that make for natural fine-grained annotation boundaries.  One book is not a large sample, and neither are 5 annotators, but it points the way to a larger evaluation. AMT happens to be a convenient method of getting judgments from people, but of course those judgments come with their own biases; those who prefer could pose the same questions to Shakespeare scholars (it would be interesting to see how the two groups differ).

The full survey I used is here: turk_emotion.html.  I deliberately kept the instructions of what counts as “emotion” vague and avoided priming the annotators with examples that I personally might consider high (e.g., declarations of love) or low (death scenes).  To be able to analyze the thought process behind the responses, I also solicited rationales for each decision.

Figure 1 shows the mean emotional rating for each scene in Romeo and Juliet averaged over the 5 annotators (along with 95% confidence intervals).  Each annotator provided 26 annotations (one for each scene in the play); the x axis maps these scene-level annotations to tokens in the linear order of the book.


Fig. 1: Average annotator “emotional trajectory” for Romeo and Juliet (with 95% C.I. on the mean)
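The averaging behind a plot like this is simple enough to sketch (the ratings below are hypothetical; the t critical value 2.776 is the two-sided 95% value for n − 1 = 4 degrees of freedom, matching 5 annotators):

```python
import math
import statistics

def mean_with_ci(ratings, t_crit=2.776):
    """Mean of one scene's ratings with a 95% confidence interval.

    t_crit = 2.776 is the two-sided 95% t value for 4 degrees of
    freedom, i.e., 5 annotators.
    """
    m = statistics.mean(ratings)
    half = t_crit * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return m, m - half, m + half

# Hypothetical ratings of one scene by 5 annotators on the -5..5 scale:
mean, lo, hi = mean_with_ci([-4, -5, -3, -5, -4])
```

Repeating this for each of the 26 scenes gives one point and interval per scene in figure 1.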


Some points have extraordinarily high variability, but this is not a random plot — the whole is less chaotic than I might have expected, with points of agreement at important scenes.  There’s little variability in the balcony scene (Act 2, Scene 5) and with Mercutio and Tybalt’s deaths at the beginning of Act 3; much more at points elsewhere. It would take more samples (of more texts) to see whether the points of agreement among annotators fall only within a class of relatively straightforward events (love is positive; fighting and death are negative), with everything else much more subjective and variable. One vivid illustration of that kind of variability can be seen in the 5 annotator rationales for Romeo and Juliet’s death scene at the end of the play (Act 5, Scene 3):

This is a seemingly high rating for an ending in which both Romeo and Juliet die (and halfway through the scene it would be a -5) but the ending which reconciles the two warring households and the Prince’s final words speak of hope and not despair. (1)

Death and sadness all around (-5)

Young lovers committing suicide is a tragedy. (-5)

Complicated scene. There’s a major negative emotional moment with the Romeo and Juliet deaths, but then there’s a positive ending where it’s implied that the houses have, to some extent, reconciled. (0)

As Romeo goes to the grave, sees Paris, they get in a fight and Romeo kills Paris, then kills himself, then Juliet wakes up, sees Romeo dead and kills herself. (-5)

For comparison, here are all 5 annotators’ individual “emotional trajectories” for Romeo and Juliet. There are large-scale commonalities here beyond the level of individual scene decisions: 4 of the 5 annotators have rising sentiment through the first act; all have a sharp drop (and nadir) at the beginning of act 3, followed by a slight rise. The second half, however, shows far greater variability, with dramatically different judgments for most scenes throughout acts 4 and 5.


Fig. 2: Individual annotator “emotional trajectories” for Romeo and Juliet.


So what does this tell us? On the one hand, with a sample of one book and 5 annotators, not much.  If an algorithm for finding such emotional trajectories doesn’t recover the average in figure 1 above, that doesn’t tell us that it doesn’t work; an average, for example, is an inappropriate summary if the distribution is bimodal — if there are two equally valid but opposite opinions of what the emotional trajectory of Romeo and Juliet should look like.  However, if the output resembles none of the individual judgments, that’s more problematic, especially as more and more human judgments are given.  Results that confirm past judgments can (in part) borrow from their explanatory power; results that are completely unexpected a priori need to face a higher burden of proof.
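One way to operationalize that comparison — a sketch, assuming the system and each annotator both produce one rating per scene — is to correlate the system’s trajectory against each annotator individually and look at the best match, rather than only at agreement with the mean:

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length rating sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def best_annotator_match(system, annotations):
    """Return the annotator whose trajectory best correlates with the system's."""
    return max(annotations, key=lambda name: pearson(system, annotations[name]))

# Hypothetical scene-level trajectories (4 scenes, 2 annotators):
annotations = {"annotator_1": [1, 3, -4, 2], "annotator_2": [-2, -3, 4, 0]}
system = [2, 4, -3, 3]   # hypothetical algorithm output
```

A system that tracks at least one real reader closely is easier to trust than one that tracks no reader at all, even when the readers themselves disagree.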

But either way, having these kinds of credible examples of what the output of an algorithm should look like — before the algorithm is run and sensemaking can begin — is usually a good sanity check.  Even if a task itself is inherently subjective — with no clear right or wrong and only arguments to be marshalled for and against — we can still approach some definition of empirical validity by considering overlap with other subjective judgments that we have some reason to trust.

All of the individual Romeo and Juliet annotations (with rationales) can be downloaded here: rj_annotations.txt


  1. Doug Duhaime

    Fascinating, David! I love the idea of crowdsourcing some of this data collection for evaluative purposes. What do other studies on sentiment classification say about inter-annotator agreement? I’d love to read more on the subject if you pursue it further…

  2. Marti Hearst

    Wow, fast work, and really nice to have the MTurk details posted too! Terrific graphs.

    I struggled with exactly this problem of where to draw the borders when I first investigated subtopic boundary switching for TextTiling in 1993. I also ended up asking several people and showing how much they agreed and disagreed using somewhat older technology; here is an old image:
    In the 90’s we used the Kappa coefficient to determine if we had enough agreement among judges to assume a boundary was “correct”, but it was always clear from these graphs that some boundaries are stronger than others, and also that if someone chose a certain subset of boundaries, that choice precluded choosing a different subset.

    So in the 2000’s we developed more sophisticated evaluation algorithms for at least comparing these segmentation algorithms, including WindowDiff, which gives partial credit depending on how close the boundary is to the “correct” location.
