Introduction

Abstract

American newspapers cover all kinds of topics, from politics to sports, business, the environment, and more. At the highest level, one could ask which topics people are most often interviewed about. One way to answer this question is to infer, in an unsupervised manner, the theme each Quotebank quotation deals with, and to draw a distribution over labels/classes representing these themes. Once all quotations are classified with a certain confidence probability, a lower-level question could be: for each class, is it possible to relate the frequency of the quotes to some real-world event? For instance, could an increase in quotations about economics forecast some variation in the American stock market? In addition, by combining these classified quotes with additional data on the authors' characteristics, we would like to compare the features of the authors within a particular class and try to find recurrent attributes or relevant features.

We will go through a few questions we asked ourselves:

Methods

In our data story, we chose to use only New York Times quotes: the paper represents American newspapers well, and its quotes form a reasonable sample of our data that is much easier to work with than the massive original dataset. We will consider three main aspects:

To analyse those aspects, we will use several techniques:

Who speaks the most?

Nationality

The distribution of the nationalities of the speakers in the dataset is the following. It is weighted by the number of occurrences of the quotation, that is to say we counted a speaker X times if the quotation they are involved in appears X times. Almost three quarters of the speakers are American, and the main speakers come from America or Europe. The most represented nationalities match the countries that have been awarded the prize for "soft power" influence. As can be seen on the Wikipedia page, the last three different winners were the USA, the United Kingdom and France.
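The weighting described above, where a speaker counts once per occurrence of the quotation, can be sketched as follows. This is only an illustration using the Python standard library: the records, the nationality labels and the helper name `weighted_shares` are assumptions made for the example, not the code or data used in the analysis.

```python
from collections import Counter

# Hypothetical records: (speaker nationality, number of occurrences of the quotation).
quotes = [
    ("United States of America", 5),
    ("United Kingdom", 2),
    ("United States of America", 1),
    ("France", 2),
]


def weighted_shares(records):
    """Share of each category, counting every quotation once per occurrence."""
    totals = Counter()
    for category, occurrences in records:
        totals[category] += occurrences
    grand_total = sum(totals.values())
    return {category: count / grand_total for category, count in totals.items()}


shares = weighted_shares(quotes)
for nationality, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{nationality}: {share:.0%}")
```

The same function applies unchanged to any other categorical attribute of the speaker, as long as each record pairs the category with the occurrence count of the quotation.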

Gender

Gender equality is sadly far from being present in our dataset. The distribution of the genders of the speakers is the following. For the rest of the analysis, genders other than male and female are gathered in an "other" category, as there are very few of them.
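Gathering the rare gender labels into a single "other" category could look like the sketch below. The label strings and the helper name `bucket_gender` are hypothetical; the actual labels depend on the speaker metadata used.

```python
from collections import Counter

# Labels kept as-is; every other label is mapped to "other".
MAIN_GENDERS = {"male", "female"}


def bucket_gender(label):
    """Return the label unchanged if it is a main category, else "other"."""
    return label if label in MAIN_GENDERS else "other"


# Hypothetical gender labels, one per speaker.
labels = ["male", "female", "non-binary", "male", "transgender female"]
counts = Counter(bucket_gender(label) for label in labels)
print(counts)
```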

Occupation

The distribution of the occupations of the speakers in the dataset is the following. It is weighted by the number of occurrences of the quotation, that is to say we counted a speaker X times if the quotation they are involved in appears X times. The split is less pronounced, but some types of work stand out: most of the quotes come from people in specific jobs, all of which involve publishing something to an audience. It is not surprising that people who already publish work publicly, and so already have a sort of relationship with the population, are the ones most quoted.

Does the specific combination of a speaker's features have an influence on the number of quotations?

A PCA model was trained on all the features of the speaker composed of:

This corresponds to a bit more than 2,000 columns (taking into account all the genders, all the nationalities and all the occupations, as well as the date of birth of the speaker). We didn't set any number of components, as we wanted to study the global influence of the columns on the number of occurrences.

As one can see below, about 50 components are a bit more important than the others. This suggests that only around 50 of them might be needed to represent the number of occurrences of all quotations. However, even the highest explained variance is only around 1%, so no single component dominates.
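To make the explained variance figures concrete, here is a minimal sketch of how the share of variance per principal component can be computed. It assumes NumPy and a small synthetic one-hot matrix; the matrix sizes, the variable names and the helper `explained_variance_ratio` are illustrative assumptions, not the model or data from the analysis. Note that PCA only looks at the variance of the feature columns themselves; it does not use the number of occurrences as a target.

```python
import numpy as np


def explained_variance_ratio(X):
    """Share of total variance carried by each principal component.

    PCA is done through an SVD of the centered data, so the shares
    come out sorted from the largest to the smallest.
    """
    X_centered = X - X.mean(axis=0)
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()


rng = np.random.default_rng(0)
n_speakers = 200

# Hypothetical one-hot encoded features (5 nationalities, 3 genders,
# 8 occupations) plus a numeric year of birth.
nationality = np.eye(5)[rng.integers(0, 5, size=n_speakers)]
gender = np.eye(3)[rng.integers(0, 3, size=n_speakers)]
occupation = np.eye(8)[rng.integers(0, 8, size=n_speakers)]
birth_year = rng.integers(1930, 2000, size=(n_speakers, 1)).astype(float)

# Rescale the year so that it does not dominate the 0/1 columns.
birth_year = (birth_year - birth_year.mean()) / birth_year.std()

X = np.hstack([nationality, gender, occupation, birth_year])
ratios = explained_variance_ratio(X)
cumulative = np.cumsum(ratios)

# Number of components needed to reach 90% of the total variance.
n_for_90 = int(np.searchsorted(cumulative, 0.90) + 1)
print(ratios[:5], n_for_90)
```

When many one-hot columns each carry a small, roughly equal share of variance, the largest ratio stays low, which is consistent with a top component explaining only about 1% across some 2,000 columns.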