Game of Thrones: Who spoke the most, and what?

Photo by Pixabay on Pexels

I came across the Game of Thrones script for the first time in the final project of a Python course by Internshala. The idea for the project was to simply find unique words spoken by the characters. I, for one, wanted to explore some visualizations, and here is my attempt at one. We will find the character with the maximum number of lines in the script and create a word cloud. Fairly simple stuff.

Who is the character with the maximum number of lines in the script, and what were the words they spoke the most?

The Dataset

I found a GOT dataset in the public domain, graciously shared by Alben Tumanggor on Kaggle. We will be working with this dataset for our explorations. Here is the link if you want to explore it on your own.

Going over the headers for each column and what they correspond to will give us a good start. You can find a summary in the table below.

| Header | Description | Example |
| --- | --- | --- |
| Release Date | The original air date of the episode in YYYY-MM-DD format. | 2011-04-17 |
| Season | The season number. | Season 1 |
| Episode | The episode number. | Episode 1 |
| Episode Title | The title of the episode. | Winter is Coming |
| Name | Name of the GOT character. | waymar royce |
| Sentence | Sentence spoken by the character. | What do you expect? They’re savages. One lot steals a goat from another lot and before you know it, they’re ripping each other to pieces. |

Let us import the packages that we will use for the analysis.

%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re

Before we read the whole dataset, I feel more comfortable taking a glimpse first. So we will read just five rows from the dataset to get started.

input_file_path = '../input/game-of-thrones-script-all-seasons/Game_of_Thrones_Script.csv'
df = pd.read_csv(input_file_path,nrows=5)
df.head()
| Release Date | Season | Episode | Episode Title | Name | Sentence |
| --- | --- | --- | --- | --- | --- |
| 2011-04-17 | Season 1 | Episode 1 | Winter is Coming | waymar royce | What do you expect? They’re savages. One lot s… |
| 2011-04-17 | Season 1 | Episode 1 | Winter is Coming | will | I’ve never seen wildlings do a thing like this… |
| 2011-04-17 | Season 1 | Episode 1 | Winter is Coming | waymar royce | How close did you get? |
| 2011-04-17 | Season 1 | Episode 1 | Winter is Coming | will | Close as any man would. |
| 2011-04-17 | Season 1 | Episode 1 | Winter is Coming | gared | We should head back to the wall. |

We only need the Name and Sentence columns to address our problem statement. However, I am curious about efficient ways to read a dataset, and I hope to extend the analysis with the other variables later.

Pandas uses an enum-like structure for the category dtype, which saves on both storage and computation. Well, it is more complicated than an enum, but the comparison helps my understanding. Similarly, the string dtype is a good choice when we wish to do string manipulations from within a data frame.

dtype = {'Episode Title' : 'category',
         'Name': 'category',
         'Sentence': 'string'}
df = pd.read_csv(input_file_path, parse_dates=['Release Date'], dtype=dtype,
                 converters={'Season': lambda x: int(re.sub(r'.*\D', '', x)),
                             'Episode': lambda x: int(re.sub(r'.*\D', '', x))}
                )
Reading the full file this way leaves a few rows with a missing Name, though.

df[df['Name'].isna()].head()

| Release Date | Season | Episode | Episode Title | Name | Sentence |
| --- | --- | --- | --- | --- | --- |
| 2016-05-01 | 6 | 2 | Home | NaN | You leave the fighting to the little lords, Wy… |
| 2016-05-01 | 6 | 2 | Home | NaN | Well, he’s never going to learn to fight becau… |
| 2016-05-22 | 6 | 5 | The Door | NaN | Wylis! What’s the matter? |

A quick search for these sentences leads us to Old Nan.

Old Nan is an elderly woman living in Winterfell. She is a retired servant of House Stark known for her tale-telling abilities. She has entertained the children of Eddard and Catelyn with stories throughout their childhoods.

Old Nan’s name is falsely parsed as a null value by pandas because it matches one of the default NA strings that read_csv looks for. Now we can’t have that, can we? Let us fix it.

# Add 'Nan' as a valid category, then fill the missing names with it
df['Name'] = df['Name'].cat.add_categories('Nan').fillna('Nan')
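
As an aside, this could also be handled at read time: pandas treats strings like 'nan' as missing values by default, and read_csv can be told not to do that. Here is a minimal sketch of that alternative (df_alt is just an illustrative name, not what the rest of the post uses):

df_alt = pd.read_csv(input_file_path,
                     keep_default_na=False,  # do not treat 'nan', 'NA', etc. as missing
                     na_values=[''])         # still treat genuinely empty cells as missing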

We end up with a data frame with the following data types.

df.dtypes
    Release Date     datetime64[ns]
    Season                    int64
    Episode                   int64
    Episode Title          category
    Name                   category
    Sentence                 string
    dtype: object
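
If you are curious how much the category and string dtypes actually buy us, a quick and rough check is to read the file once more with the default dtypes and compare the memory usage. This is only an illustrative sketch; the exact numbers will vary with your pandas version.

# Rough check of the storage savings (illustrative)
naive = pd.read_csv(input_file_path)        # default object/int64 columns
print(naive.memory_usage(deep=True).sum())  # bytes with default dtypes
print(df.memory_usage(deep=True).sum())     # bytes with category/string dtypes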

Doing the Analysis

Who spoke the most?

We can quickly find the top 10 characters according to the number of lines they had in the complete series.

top_10 = df[['Name']].value_counts().head(10).reset_index()
top_10
| index | Name | 0 |
| --- | --- | --- |
| 0 | tyrion lannister | 1760 |
| 1 | jon snow | 1133 |
| 2 | daenerys targaryen | 1048 |
| 3 | cersei lannister | 1005 |
| 4 | jaime lannister | 945 |
| 5 | sansa stark | 784 |
| 6 | arya stark | 783 |
| 7 | davos | 528 |
| 8 | theon greyjoy | 455 |
| 9 | petyr baelish | 449 |

Tyrion Lannister tops the list, followed by Jon Snow. Let us make a quick plot for our satisfaction.

top_10.plot(x='Name',kind='barh', title='Top 10 Characters with lines',
            xlabel='Character', ylabel='No. of Lines'
           )

[Figure: horizontal bar chart of the top 10 characters by number of lines]
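
One small tweak, if you prefer the largest bar at the top: sort the counts in ascending order before plotting, since barh draws the first row at the bottom. Note that the count column produced by value_counts().reset_index() is named 0 here; on newer pandas versions it may be called 'count' instead.

top_10.sort_values(0).plot(x='Name', kind='barh',
                           title='Top 10 Characters with lines')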

We have our guy. Now comes the question,

What words did he speak the most?

We will extract Tyrion’s lines to a different data frame.

tyrion_lannister = df.loc[df.Name=='tyrion lannister','Sentence']

We will do some quick string manipulations to split the lines into words.

tyrion_lannister = tyrion_lannister. \
        str.replace('[,.?!-]','', regex=True). \
        str.lower(). \
        str.split()
tyrion_lannister.head()
    145    [mmh, it, is, true, what, they, say, about, th...
    147               [i, did, hear, something, about, that]
    149                           [and, the, other, brother]
    151    [there's, the, pretty, one, and, there's, the,...
    153                 [i, hear, he, hates, that, nickname]
    Name: Sentence, dtype: object

We have a list under Sentence after splitting. A list is not so lovely inside a data frame. So let us explode it into long-form data.

tyrion_lannister = tyrion_lannister.explode()  # one row per word
tyrion_lannister.value_counts().head(10)
    the    1094
    you     864
    i       772
    to      771
    a       605
    of      513
    and     435
    my      307
    it      290
    me      276
    Name: Sentence, dtype: int64

Oh my, the top spoken words are all stop words. To give a quick rundown, stop words are the common words that glue a sentence together but carry little meaning on their own. Since we are looking at individual words, it is safe to remove them.

I have tried the stop word lists from four libraries: wordcloud, scikit-learn, the Natural Language Toolkit (NLTK), and spaCy. I liked the output from wordcloud the best, so that is what we will use below.
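
In case you want to compare the lists yourself, here is roughly how the other three can be pulled in. This is only a sketch: NLTK additionally needs its stopwords corpus downloaded, and the variable names are my own.

# Stop word lists from the other libraries, for comparison
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS       # scikit-learn
from spacy.lang.en.stop_words import STOP_WORDS as SPACY_STOP_WORDS  # spaCy

import nltk
nltk.download('stopwords')  # one-time corpus download
from nltk.corpus import stopwords
nltk_stop_words = set(stopwords.words('english'))

print(len(ENGLISH_STOP_WORDS), len(SPACY_STOP_WORDS), len(nltk_stop_words))

Back to wordcloud's list, which is what the rest of the post uses.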

from wordcloud import STOPWORDS
tyrion_lannister[~tyrion_lannister.isin(STOPWORDS)].value_counts().head(10)
    will      139
    know      109
    one       100
    father     83
    well       73
    want       72
    good       66
    yes        58
    time       56
    man        56
    Name: Sentence, dtype: int64

Now that is a lot better. We already have Tyrion’s most frequent words, but let us get right into making a word cloud for our visualization. First, we will have to create a word cloud object.

from wordcloud import WordCloud
tyrion_wc = WordCloud(
    background_color='white',
    max_words=2000,
    stopwords=STOPWORDS
)

Next comes the job of adding our words into the word cloud instance.

tyrion_wc.generate(' '.join(tyrion_lannister.values))

We are left with the visualization, and we’re done.

plt.imshow(tyrion_wc, interpolation='bilinear')
plt.axis('off')
plt.show()

[Figure: word cloud of Tyrion Lannister’s most frequently spoken words]
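
If you would like to keep the image around, the word cloud object can also be written straight to disk; the filename here is just an example.

tyrion_wc.to_file('tyrion_wordcloud.png')  # save the rendered cloud as a PNG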

Well, at first look, I was tempted to make claims about the strength of Tyrion’s relationships with his family and others in the realm. Alas, there ought to be better numerical ways to make such interpretations. Jumping to conclusions is never a good idea when we have yet to dig the supporting evidence out of the data. We will see if I can figure that part out.

Bevan Stanely
Data Scientist

An aspiring Data Scientist. I enjoy narrating data through storytelling.
