ASOIAF Motifs

Link to code

This page contains spoilers for all the books in A Song of Ice and Fire, by George R. R. Martin. Do not continue unless you have finished reading the series, or do not plan on ever reading it (which would be the wrong choice).


The purpose of this project is to find textual motifs in ASOIAF by analyzing the variations in word frequencies and bigram frequencies for each unique point of view. Simple word frequencies will be used to find the likely starts of motifs, and bigram frequencies to extend phrases from those starting points. We will rank both word and bigram frequencies in each POV within each book by:

(# occurrences in X POV chapters) / (# occurrences in all chapters)

Note that for testing purposes, we would expect "Where do whores go" to be returned as a motif in Tyrion's chapters.

This page is my slightly edited stream of consciousness while working towards this goal.

First, for the sake of making processing easier, we need to break the complete texts down into separate files by POV, instead of by book. We also want to remove the quotation marks, since those would confuse the bigram finder: "someword (with a leading quote) would be considered a different token from someword" (with a trailing quote) or from plain someword, which is not the desired behavior. We also need to make all the text the same case (lowercase, here), so the same phrase at the beginning and in the middle of a sentence will be counted under the same key. Finally, we need to clean up the whitespace and remove all the unnecessary newlines, double spaces, etc. Here's a sample of the cleaned text, from Arya's chapters in A Game of Thrones:

who are you? arya asked. i am your dancing master. he tossed her one of the wooden blades. she grabbed for it, missed, and heard it clatter to the floor. tomorrow you will catch it. now pick it up. it was not just a stick, but a true wooden sword complete with grip and guard and pommel. arya picked it up and clutched it nervously with both hands, holding it out in front of her.

The resulting files from this process can be accessed on github under the povs/ directory (see the link to code at the top of this page).
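For reference, here is a minimal sketch of what that cleanup step might look like (splitting the books into per-POV files is not shown, and the function name is my own):

import re

def clean_text(raw):
    # Lowercase, strip quotation marks, and collapse whitespace.
    text = raw.lower()               # same phrase at sentence start or mid-sentence counts together
    text = text.replace('"', '')     # strip double quotes so quoted and unquoted words match
    return re.sub(r'\s+', ' ', text).strip()   # collapse newlines, tabs, and double spaces

print(clean_text('"Who are you?" Arya asked.\n  "I am your dancing master."'))
# -> who are you? arya asked. i am your dancing master.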

Now, with the text all cleaned and sorted, we need to count the number of times each word occurs in each POV. For each POV, I first cleaned the text further to remove punctuation using the re module, then used the Counter object from the collections module to get word frequencies on the text split on spaces. I also kept a sum of the counts across the POVs in each book, i.e. the total number of occurrences of each word in that book, and stored these per-book counts using the pickle module. Next, I computed the POV-to-book count ratios for each POV, per the formula given above, and similarly stored them with pickle. The first few results for each POV tend to be, quite predictably, the names of the narrator's companions, who are either unknown or irrelevant to all the other POVs. For example, Bran's top 4 results in A Storm of Swords, all used exclusively by him (per the ratio being 1.0), are:

'osha': 1.0, 'frog': 1.0, "hodor's": 1.0, 'meera': 1.0

(As an aside, it's pretty entertaining looking at all the variations of HOOODOOOOOR throughout the text. It might be interesting to graph the correlation between frequency_of_hodors*average_hodor_length and some sentiment analysis for Bran's chapters over time.)
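Here is a rough sketch of the counting and ratio steps described above, assuming the cleaned per-POV text for one book is already loaded into a dictionary (the variable and file names here are hypothetical):

import pickle
import re
from collections import Counter

def word_counts(text):
    # Strip punctuation with re, then count words split on spaces.
    return Counter(re.sub(r'[^\w\s]', '', text).split())

# Hypothetical layout: cleaned text for each POV in one book.
pov_texts = {'arya': '...', 'bran': '...', 'catelyn': '...'}

pov_counts = {pov: word_counts(text) for pov, text in pov_texts.items()}
book_counts = sum(pov_counts.values(), Counter())   # total occurrences of each word in the book

# Ratio of a word's occurrences in one POV to its occurrences in the whole book.
pov_ratios = {
    pov: {word: count / book_counts[word] for word, count in counts.items()}
    for pov, counts in pov_counts.items()
}

with open('asos_ratios.pkl', 'wb') as f:             # stored with pickle, as described above
    pickle.dump(pov_ratios, f)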

Now, moving further down the list we see some more interesting results:

'summer': 0.6024096385542169, 'stories': 0.5357142857142857, 'causeway': 0.5333333333333333, 'lake': 0.5319148936170213, 'story': 0.5217391304347826, 'net': 0.5172413793103449, 'embers': 0.5, 'rat': 0.5, 'deer': 0.5, 'moonlight': 0.47368421052631576, 'stableboy': 0.46153846153846156, 'crypts': 0.45454545454545453, 'footsteps': 0.4444444444444444

It appears that between ratios of roughly 0.25 and 0.8, we see a lot of environmental descriptors and imagery. While this is not immediately helpful for our motif-finding goal, it's definitely worth taking another look at, perhaps in another project. There is also evidence of a problem with our method here. By lowercasing all the text, we have unintentionally combined the counts for 'summer' the season and 'Summer' the direwolf. This would be a difficult problem to solve, however, since at the start of a sentence there would be no distinction between the two at all. Given the small scale of this project, I'll leave this as a 'won't fix' known issue: something to keep in mind during analysis, but not a big deal in the grand scheme of things.

The next step is to calculate the bigram frequencies. I considered using nltk's BigramCollocationFinder, but that wouldn't give me the counts or the specific format I want the data in, so I decided to do this part by hand. I modeled the conditional bigram probabilities (essentially a simple first-order Markov model) using a dictionary where the top-level keys are the first words of the bigrams, the second-level keys are the words that follow, and the value each secondary key points to is the fraction of times the given second word followed the given first word when the first word appeared, or:

dict['azor']['ahai'] = (# times 'azor ahai' occurs) / (# of times 'azor' occurs)

The challenge here is determining whether punctuation should impact the "motifiness" of a phrase containing a specific series of bigrams. For the sake of this project, I made a judgment call: bigrams whose first word ends in a period, i.e. bigrams that cross the end of a sentence, do not count. Likewise, bigrams whose first word ends in an exclamation point, question mark, etc. are not counted. Commas, however, are allowed and simply ignored; that is, occurrences of "x, y" count toward the same total as "x y". I chose to allow commas because their placement is so variable: should a character take an extra pause (I am equating commas with pauses for the sake of this exercise) while saying something that could be considered a motif, that instance will still count toward said motif's total. That said, bear in mind that my decision here is based on almost no data and is heavily influenced by my personal biases, so it is definitely worth investigating bigram counts under a different interpretation of the punctuation.
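A minimal sketch of that bigram pass, under the punctuation rules just described (stripping trailing punctuation from the second word as well is an extra assumption of mine):

from collections import Counter, defaultdict

PUNCT = '.,!?;:'
SENTENCE_ENDERS = ('.', '!', '?')

def bigram_frequencies(text):
    # dict[first][second] = (# times 'first second' occurs) / (# times 'first' occurs),
    # skipping bigrams whose first word ends a sentence and ignoring commas.
    words = text.split()
    word_counts = Counter(w.rstrip(PUNCT) for w in words)
    pair_counts = defaultdict(Counter)
    for first, second in zip(words, words[1:]):
        if first.endswith(SENTENCE_ENDERS):          # this bigram crosses the end of a sentence
            continue
        # strip commas (and, by assumption, other trailing punctuation) so "x, y" counts as "x y"
        pair_counts[first.rstrip(PUNCT)][second.rstrip(PUNCT)] += 1
    return {
        first: {second: n / word_counts[first] for second, n in followers.items()}
        for first, followers in pair_counts.items()
    }

freqs = bigram_frequencies('the red sword of heroes, azor ahai. azor ahai reborn!')
print(freqs['azor']['ahai'])   # 1.0: every occurrence of 'azor' is followed by 'ahai'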

At this point, it's also worth noting one of the main problems with using bigram analysis to find motifs: it only captures 2-word associations, so it is impossible to reconstruct longer phrases from the bigram data alone with any certainty. Take, for example, the phrase:

"The sea was black and the moon was silver" (Victarion I, A Dance With Dragons)

This phrase gets broken up into:

{'and': {'the': 1}, 'moon': {'was': 1}, 'black': {'and': 1}, 'sea': {'was': 1}, 'the': {'sea': 1, 'moon': 1}, 'was': {'black': 1, 'silver': 1}}

From here, it is impossible to tell whether the original phrase was "The sea was black and the moon was silver" or "The sea was silver and the moon was black." Therefore, we also need to cross-reference each potential motif with the original text to make sure it actually appears there.
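One simple way to do that cross-reference is to search the cleaned text for the reassembled phrase; here is a sketch, with the optional comma mirroring the earlier decision to ignore commas (the function name is my own):

import re

def appears_in_text(candidate_words, cleaned_text):
    # Check that a phrase reassembled from bigrams actually occurs in the cleaned
    # text, allowing an optional comma after any word.
    pattern = r'\b' + r',?\s+'.join(map(re.escape, candidate_words)) + r'\b'
    return re.search(pattern, cleaned_text) is not None

text = 'the sea was black and the moon was silver.'
print(appears_in_text(['the', 'sea', 'was', 'black'], text))    # True
print(appears_in_text(['the', 'sea', 'was', 'silver'], text))   # False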

Now would be a good time to revisit the example motif, "where do whores go", for Tyrion in A Dance With Dragons. Broken into the bigrams "where do", "do whores", and "whores go", we can look at the number of occurrences of each listed bigram divided by the total number of occurrences of all bigrams beginning with "where", "do", and "whores" respectively, which we will call CTyrion when computed over that POV's chapters and CAll when computed over all chapters:

Bigram           CTyrion            CAll               CTyrion / CAll
where -> do      0.0409836065574    0.017825311943     2.29918032787
do -> whores     0.0245398773006    0.00470035252644   5.22085889571
whores -> go     0.179487179487     0.118644067797     1.51282051282

Judging from this, we should be able to find the bigrams with the highest CPOV / CAll ratios and follow along the path of bigrams, picking the most common next word and most common previous word, until we hit a word that doesn't have a bigram with a ratio over, say, 1.5.

I ended up choosing to take the top two bigrams, rather than just the single top bigram, at each iteration, mainly because we are looking for phrases longer than two words, so we cannot guarantee that every bigram in a motif will be the highest-ratio bigram for its first word. Two was an arbitrary cutoff, but it seemed to serve well for the purposes of this project. I further verified that each new possible motif was still contained in the actual text of the books.
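Here is a sketch of that search, growing candidates to the right only (the leftward extension is symmetric and omitted, the seed-selection step is assumed to have already produced a high-ratio starting bigram, the plain substring check is a simplification, and all names here are my own):

THRESHOLD = 1.5   # stop extending once no following bigram beats this ratio
TOP_N = 2         # follow the top two candidate bigrams at each step

def ratio(pov_bigrams, all_bigrams, first, second):
    # C_pov / C_all for one bigram; 0 if it is missing from either table.
    try:
        return pov_bigrams[first][second] / all_bigrams[first][second]
    except (KeyError, ZeroDivisionError):
        return 0.0

def extend_forward(seed, pov_bigrams, all_bigrams, cleaned_text, max_len=8):
    # Grow candidate motifs rightward from a seed bigram, branching on the top
    # two next words at each step and keeping only phrases found in the text.
    motifs, frontier = [], [list(seed)]
    while frontier:
        phrase = frontier.pop()
        if len(phrase) >= max_len:
            motifs.append(phrase)
            continue
        last = phrase[-1]
        candidates = sorted(
            pov_bigrams.get(last, {}),
            key=lambda nxt: ratio(pov_bigrams, all_bigrams, last, nxt),
            reverse=True,
        )[:TOP_N]
        extended = False
        for nxt in candidates:
            if ratio(pov_bigrams, all_bigrams, last, nxt) < THRESHOLD:
                continue
            longer = phrase + [nxt]
            if ' '.join(longer) in cleaned_text:     # cross-reference against the actual text
                frontier.append(longer)
                extended = True
        if not extended:
            motifs.append(phrase)                    # cannot extend further: record as a candidate motif
    return [' '.join(m) for m in motifs]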

When we apply this algorithm to Tyrion's chapters in A Dance With Dragons, we do, in fact, get our expected motif, "where do whores go"! Interestingly, we also see a number of motifs similar to it, including "know where whores go" and "that be where whores go."

Overall, this algorithm produced more false positives than desirable, but the results were generally satisfactory. Some number of false positives is to be expected, since not all POVs in all books contain significant textual motifs. The results are listed below, labeled by book and POV.

Results

A Game of Thrones

Arya

Bran

Catelyn

Daenerys

Eddard

Jon

Prologue

Sansa

Tyrion

A Clash of Kings

Arya

Bran

Catelyn

Daenerys

Davos

Jon

Prologue

Sansa

Theon

Tyrion

A Storm of Swords

Arya

Bran

Catelyn

Daenerys

Davos

Jaime

Jon

Prologue

Samwell

Sansa

Tyrion

A Feast for Crows

Alayne

Arya

Brienne

CatOfTheCanals

Cersei

Jaime

Prologue

Samwell

Sansa

TheCaptainOfGuards

TheDrownedMan

TheIronCaptain

ThePrincessInTheTower

TheProphet

TheQueenmaker

TheReaver

TheSoiledKnight

A Dance With Dragons

AGhostInWinterfell

Bran

Cersei

Daenerys

Davos

Epilogue

Jaime

Jon

Melisandre

Prologue

Reek

Samwell

TheBlindGirl

TheDiscardedKnight

TheDragontamer

TheGriffinReborn

TheIronSuitor

TheKingbreaker

TheLostLord

Theon

ThePrinceOfWinterfell

TheQueensguard

TheSacrifice

TheSpurnedSuitor

TheTurncloak

TheUglyLittleGirl

TheWatcher

TheWaywardBride

TheWindblown

Tyrion

Victarion

And thanks to lordchair for the raw texts of the books!