Sentiment Analysis

Importing Libraries

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import datetime
import time

import nltk.data
from nltk import tokenize
import re

Importing Data

In [10]:
columns = ['Year', 'Week Start', 'Week End', 'Section', 'Number', 'Headline', 'Body Text']
articles_df = pd.read_csv('https://s3.amazonaws.com/cs109data/articles_db.csv', names=columns)
In [11]:
articles_df.head(100)
Out[11]:
Year Week Start Week End Section Number Headline Body Text
0 2000 2000-01-03 2000-01-09 business 0 There's no time to waste Over the past few months, President Clinton ha...
0 2000 2000-01-03 2000-01-09 business 1 Ford staff threaten strike Leaders of salaried staff at Ford are threaten...
0 2000 2000-01-03 2000-01-09 business 2 There's no time to waste Over the past few months, President Clinton ha...
0 2000 2000-01-03 2000-01-09 business 3 Cybersquatters with an eye for domain chance What's in a domain name? Loadsamoney, apparent...
0 2000 2000-01-03 2000-01-09 business 4 Clicks and mortar leave property crumbling away The property market looks in pretty good healt...
0 2000 2000-01-03 2000-01-09 business 5 Labour isn't working hard enough Few people I know would dissent from the propo...
0 2000 2000-01-03 2000-01-09 business 6 Dunces excel in the knowledge economy While all the fashionable blather is of a know...
0 2000 2000-01-03 2000-01-09 business 7 Russia Y2K bill 'shows West overreacted' Russia spent just $200 million on preparing fo...
0 2000 2000-01-03 2000-01-09 business 8 Briefcase BUY... Domino's Pizza company, which last week...
0 2000 2000-01-03 2000-01-09 business 9 TransTec duo kept silent on £11m claim Two former executive directors of TransTec, th...
0 2000 2000-01-03 2000-01-09 business 10 US capital firm hires QXL founder heads Net st... Tim Jackson, the 34-year-old journalist and mi...
0 2000 2000-01-03 2000-01-09 business 11 Figures lift M&S gloom First signs that Marks & Spencer may have arre...
0 2000 2000-01-03 2000-01-09 business 12 Experts predict ¼-point interest rate rise The city is expecting interest rates to rise b...
0 2000 2000-01-03 2000-01-09 business 13 Vodafone 'must bid billions in cash' The gloves came off in the biggest hostile tak...
0 2000 2000-01-03 2000-01-09 business 14 Getting smart on subsidies New Labour seems to hate 's' words. Socialism,...
0 2000 2000-01-03 2000-01-09 business 15 Shy dealmaker who kept quiet about costly details He was the ultimate City high flyer. Plucked f...
0 2000 2000-01-03 2000-01-09 business 16 Stockwatch Index beaters It was a mammoth task, calcula...
0 2000 2000-01-03 2000-01-09 business 17 Golden boy lost his Midas touch Geoffrey Robinson, the colourful and controver...
0 2000 2000-01-03 2000-01-09 business 18 @large It is as we had suspected all along: Netheads ...
0 2000 2000-01-03 2000-01-09 business 19 Health check Your personal happiness is greatly affected by...
0 2000 2000-01-03 2000-01-09 business 20 A wider Net for eBusiness shares The arrival of Y2K may have failed to wreak ha...
0 2000 2000-01-03 2000-01-09 business 21 Media Diary Domeward Bound Much of the festivity at the D...
0 2000 2000-01-03 2000-01-09 business 22 Infamous 5's star ratings Question: Which is the only mainstream broadca...
0 2000 2000-01-03 2000-01-09 business 23 How to 1. Accept that most people (including you) h...
0 2000 2000-01-03 2000-01-09 business 24 Rage against the dying of light If you've ever thought that buildings aren't a...
0 2000 2000-01-03 2000-01-09 business 25 Taxes, certainly - but on all our houses The recent controversy about the European Unio...
0 2000 2000-01-03 2000-01-09 business 26 Crash? What crash? In any normal market, it would be seen as a ro...
0 2000 2000-01-03 2000-01-09 business 27 Granada forces pace on ITV's fate Granada yesterday stepped up pressure on the g...
0 2000 2000-01-03 2000-01-09 business 28 Byers ultimatum for WTO The trade and industry secretary, Stephen Byer...
0 2000 2000-01-03 2000-01-09 business 29 Underside • Not everyone was taken by surprise at NatWes...
... ... ... ... ... ... ... ...
0 2000 2000-01-03 2000-01-09 uk-news 10 Welcome back to the craic A is for Assembly After decades of resistan...
0 2000 2000-01-03 2000-01-09 uk-news 11 Nelson bomb suspect arrested in the US The chief suspect in the murder of the Norther...
0 2000 2000-01-03 2000-01-09 uk-news 12 Irving ready for court battle over Holocaust The most emotive libel trial to be heard in Br...
0 2000 2000-01-03 2000-01-09 uk-news 13 Women who flee violence 'lack shelter' More than 50,000 women and children flee their...
0 2000 2000-01-03 2000-01-09 uk-news 14 Dealing with the end of the world time after time No one has seen the end of the world come roun...
0 2000 2000-01-03 2000-01-09 uk-news 15 Tory idea for schools to branch out Leading independent schools should be encourag...
0 2000 2000-01-03 2000-01-09 uk-news 16 Hindley may face brain surgery The moors murderer Myra Hindley may undergo em...
0 2000 2000-01-03 2000-01-09 uk-news 17 Gun law on streets of Manchester Armed police are to patrol parts of Manchester...
0 2000 2000-01-03 2000-01-09 uk-news 18 Bridging the gap: Walkway reveals gorgeous gorge One of the last inaccessible places in England...
0 2000 2000-01-03 2000-01-09 uk-news 19 Villagers break away from UK Residents of a village in East Sussex have dec...
0 2000 2000-01-03 2000-01-09 uk-news 20 Warmed by the flame of dance When Monica Mason's ballet shoe snagged on the...
0 2000 2000-01-03 2000-01-09 uk-news 21 In brief Price of coffee rises 10p The price of coffee...
0 2000 2000-01-03 2000-01-09 uk-news 22 How art treasures are stolen to order Christopher Brown is in sombre mood. "I have a...
0 2000 2000-01-03 2000-01-09 uk-news 23 Courts may get 'enforcers' to make debtors pay up A new breed of court "enforcers", with the pow...
0 2000 2000-01-03 2000-01-09 uk-news 24 Limpets threaten coast When part of Beachy Head fell into the Channel...
0 2000 2000-01-03 2000-01-09 uk-news 25 Lloyds sues over lost Shelley letter The poet Shelley, contemplating the ruins of a...
0 2000 2000-01-03 2000-01-09 uk-news 26 Briton's 24-hour ordeal in shark sea A British tourist whose family had all but giv...
0 2000 2000-01-03 2000-01-09 uk-news 27 Parents call for schools to bring back the cane A majority of parents want corporal punishment...
0 2000 2000-01-03 2000-01-09 uk-news 28 Morning-after pill trial hailed as success A project in Manchester which allows women to ...
0 2000 2000-01-03 2000-01-09 uk-news 29 Judges may double injury payouts The court of appeal will hold an unprecedented...
0 2000 2000-01-10 2000-01-16 business 0 Crunch time for euro in Lisbon Tony Blair wants the EU Summit in Lisbon in Ma...
0 2000 2000-01-10 2000-01-16 business 1 Confidence in a tarnished age Confidence is at the heart of economic policy....
0 2000 2000-01-10 2000-01-16 business 2 Media diary Victorian values It would be wrong to let Vi...
0 2000 2000-01-10 2000-01-16 business 3 Land of the free and home of the brave class a... There are 2 million guns in civilian hands in ...
0 2000 2000-01-10 2000-01-16 business 4 Old hand for new job It was teatime when Sir George Bull stepped ou...
0 2000 2000-01-10 2000-01-16 business 5 BA cabin crews face job losses British Airways plans to shed at least 2,500 c...
0 2000 2000-01-10 2000-01-16 business 6 BOC £700m sale means total break-up BOC is to spin off its world-beating vacuum pu...
0 2000 2000-01-10 2000-01-16 business 7 BNFL threatened by loss of ISO quality guarantee Nuclear reprocessor and generator British Nucl...
0 2000 2000-01-10 2000-01-16 business 8 Utility fat cats face pay curb Government pressure on the bosses of privatise...
0 2000 2000-01-10 2000-01-16 business 9 Stockwatch Merger medicine Merger activity notwithstandi...

100 rows × 7 columns

In [12]:
articles_df.shape
Out[12]:
(88745, 7)

Initial Data Exploration

Filtering articles based on relevance to exchange rate

In [40]:
# Keywords considered relevant to exchange rates, loaded from Keywords.txt
relevant_words = np.genfromtxt('Keywords.txt', dtype='str')
In [14]:
def find_relevant(text, n):
    # Return True if the text contains more than n of the relevant keywords.
    # Keywords are matched with surrounding spaces to avoid partial-word hits.
    text = str(text)
    matched_words = [word for word in relevant_words if (' '+word+' ') in text]
    return len(matched_words) > n
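
As a quick sanity check, find_relevant can be called on a short made-up sentence; whether it returns True depends entirely on which words appear in Keywords.txt, so the call below is illustrative only.

In [ ]:
# Illustrative call: the result depends on the keywords loaded above from Keywords.txt.
# Note the padding spaces, since keywords are matched with surrounding spaces.
sample = " The pound fell against the dollar as the exchange rate came under pressure. "
find_relevant(sample, 1)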

Number of relevant articles per week

In [9]:
plt.figure(figsize=(10,5))
# Compare the strictness of the relevance filter for different keyword thresholds n
for n in [3,5,7]:
    relevant_articles = [find_relevant(text, n) for text in articles_df['Body Text'].values]
    relevant_df = articles_df[relevant_articles]
    weekly_articles = relevant_df.groupby('Week Start').size().reset_index()
    plt.plot(weekly_articles[0], label='n='+str(n))
    
plt.legend(loc='best')
plt.ylabel('Number of relevant articles')
axes = plt.gca()
axes.set_ylim([0,50])
Out[9]:
(0, 50)

Number of relevant articles per section:

In [10]:
n = 3
relevant_articles = [find_relevant(text, n) for text in articles_df['Body Text'].values]
relevant_df = articles_df[relevant_articles]
articles_per_section = relevant_df.groupby(['Week Start', 'Section']).size().reset_index()

plt.figure(figsize=(15,5))
for section in articles_per_section['Section'].unique():
    articles_count = articles_per_section[articles_per_section['Section'] == section]
    plt.plot(range(0, len(articles_count[0])), articles_count[0], label=section)
plt.xlabel('Week', fontsize=20)
plt.ylabel('Number of relevant articles', fontsize=20)
plt.rc('xtick', labelsize=20) 
plt.rc('ytick', labelsize=20)
plt.grid(True)
plt.legend()
axes = plt.gca()
axes.set_ylim([0,30])
Out[10]:
(0, 30)

Sentiment Analysis using SentiWordNet

In [41]:
# Source code for sentiwordnet: http://www.nltk.org/_modules/nltk/corpus/reader/sentiwordnet.html
import nltk
from nltk.corpus import sentiwordnet as swn
from nltk.corpus import stopwords
In [16]:
# A simple function to obtain the overall sentiment of a text chunk.
# Method: tokenise the text chunk, obtain the sentiment score of each token, then take the mean.
# Note: you may need to separately install SentiWordNet: nltk.download('sentiwordnet')
#
# Possible refinements (not implemented here):
# - choose synsets based on context rather than taking the first synset
# - score phrases as well as single tokens
# - classify words as nouns/adjectives (POS tagging) before the lookup
# - an unsupervised split between adjectives

def simple_sentiment(text_chunk):
    cumulative_pos_sentiment = 0
    cumulative_neg_sentiment = 0
    index = 0
    
    # Tokenize the text chunk
    tokens = nltk.word_tokenize(text_chunk)
    # Remove words of length 2 or less
    tokens = [i for i in tokens if len(i) >= 3]
    # Remove stop words
    tokens = [word for word in tokens if word not in stopwords.words('english')]
    
    # a/n/v/r represent adjective/noun/verb/adverb respectively. They are used to
    # index the SentiWordNet dictionary; for each token the first part of speech
    # with a matching synset is used.
    for token in tokens:
        for pos_tag in ('a', 'n', 'v', 'r'):
            synsets = list(swn.senti_synsets(token, pos_tag))
            if len(synsets) > 0:
                cumulative_pos_sentiment += synsets[0].pos_score()
                cumulative_neg_sentiment += synsets[0].neg_score()
                index += 1
                break
        
    avg_pos_sentiment = cumulative_pos_sentiment / float((1 if (index == 0) else index))
    avg_neg_sentiment = cumulative_neg_sentiment / float((1 if (index == 0) else index))
    
#     print('Positive sentiment:',avg_pos_sentiment)
#     print('Negative sentiment:',avg_neg_sentiment)
    
    return (avg_pos_sentiment,avg_neg_sentiment)
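
To see what a single SentiWordNet lookup returns, one can inspect the first synset for a word directly; pos_score and neg_score below are exactly what simple_sentiment accumulates per token (the word 'good' is just an illustrative choice).

In [ ]:
# First adjective synset for 'good', with its positive and negative scores
first = list(swn.senti_synsets('good', 'a'))[0]
first.pos_score(), first.neg_score()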
In [17]:
sample_text = 'There\'s no time to waste,"Over the past few months, President Clinton has lost few opportunities to sing the praises of his favourite book. In November, he told a conference attended by Tony Blair that it was no longer necessary to choose between growth and environment. He took as evidence Natural Capitalism, The Next Industrial Revolution (Paul Hawken and Amory and Hunter Lovins, Earthscan, pounds 18.99), which \'proves beyond argument that there are presently available technologies, and those just on the horizon, which will permit us to get richer by cleaning, not by spoiling, the environment. This is a huge deal,\' Clinton said.   It\'s a suitably millennial claim. The authors argue that \'capitalism, as practised, is a financially profitable, nonsustainable aberration in human development... [which] does not fully conform to its own accounting principles. It liquidates its capital and calls it income. It neglects to assign any value to the largest stocks of capital it employs, the natural resources and living systems, as well as the social and cultural systems that are the basis of human capital.\'   Companies, as has been well said, are brilliant externalising machines, pocketing the profits and shunting the costs of their enterprise on to the collectivity. Thus, the NHS pays for the profits of big tobacco, and the Government subsidises cars by building roads. Put it another way, business is a free rider on the environment and the services it provides, services which have been tentatively valued by Nature magazine at $36 trillion annually, roughly the same as world GDP.   The reason business is so profligate with the the environment (the \'natural capital\' of the book) is that its goods are assumed by economists to be free and infinitely substitutable. So they are uncosted. But in reality they are not free. They are produced by the earth\'s 3.8-billion-year store of natural capital which, as the authors rehearse with hair-raising thoroughness, is being eroded so fast that by the end of this century there will be little left. And there is no conceivable substitute, for example, for the biosphere\'s ability to produce oxygen.   The authors manage to recast this rush to disaster as a story with a (potentially) happier ending. Their grounds for optimism are partly familiar American technological optimism, if natural resources were treated as scarce and expensive, then nanotechnology and biotechnology could multiply four or even tenfold the outputs from today\'s inputs. Hence Clinton\'s enthusiasm.   But more crucial to the project is a complete mental flip of what an \'output\' consists of (as Edwin Land once said, a great idea is often \'not having a new thought but stopping having an old one\').   At present, it is entirely conceivable that one-quarter or even half of the GDP of advanced countries makes not value but waste. Most industrial processes, and the economy as a whole, are inefficient , at best achieving 10 per cent of their potential likewise their products. A car uses just 1 per cent of the energy it burns to propel the driver, the rest to warm the atmosphere, deafen pedestrians and shift ponderous steel boxes between traffic jams.   Moreover, waste is cumulative, so an increasing income has to be spent on alleviating growth\'s byproducts, pollution, traffic accidents and congestion, crime. 
Hence the phenomenon of uneconomic growth, where increases in nominal wealth produce no net gain in quality of life or standard of living: in real terms 80 per cent of Americans are no better off than they were in 1979.   However, the grossness of the waste is, say the authors, also a measure of the huge potential for improvement if the spiral changed to virtuous. The secret is taking a systems view in which it is always more expensive to get rid of waste than to design it out in the first place. Given the wastefulness of most current systems, improvements of 10 to 100 times in overall efficiency are possible even with existing technology.   Much of what the Lovins and Hawken propose is not new. Frances Cairncross wrote about costing the earth 10 years ago, and Richard Schonberger coined the term \'frugal manufacturing\' in the 1980s. What is new is the way these ideas are brought together in a systems approach to business and the environment, and the coopting of markets as the mechanism which can be used to turn things around.   There is some irony here, of course. The greatest obstacle to \'natural capitalism\' in practice will be the vested interests and special pleading of those most vociferous champions of capitalist orthodoxy, US companies, which emerge from this book the masters of the perverse, not to mention grotesque, hidden subsidy, whether of agriculture, cars, or their wealthy executives.   Persuading them to confront their own bad faith will be no easy matter. But, as someone once said, the economy is a wholly-owned subsidiary of the environment, and time is running out for the parent to bring it to heel.'
simple_sentiment(sample_text)
Out[17]:
(0.0845771144278607, 0.04695273631840796)
In [18]:
n = 3
relevant_articles = [find_relevant(text, n) for text in articles_df['Body Text'].values]
relevant_df = articles_df[relevant_articles]
weeks = relevant_df['Week Start'].unique()

avg_weekly_pos_score = np.zeros((len(weeks), 1))
avg_weekly_neg_score = np.zeros((len(weeks), 1))
avg_weekly_pos_minus_neg_score = np.zeros((len(weeks), 1))
In [ ]:
# Calculate weekly sentiment scores across the entire time period
weeks = relevant_df['Week Start'].unique()

for i, week in enumerate(weeks):
    articles = relevant_df[relevant_df['Week Start'] == week]['Body Text']
    num_articles = articles.shape[0]
    pos_score = 0
    neg_score = 0
    for article in articles:
        pos, neg = simple_sentiment(article)
        pos_score += pos
        neg_score += neg
    avg_weekly_pos_score[i] = (pos_score/float(num_articles))
    avg_weekly_neg_score[i] = (neg_score/float(num_articles))
    avg_weekly_pos_minus_neg_score[i] = avg_weekly_pos_score[i] - avg_weekly_neg_score[i]
    if (i%10 == 0):
        print('Week: ', week, 'Positive: ', avg_weekly_pos_score[i][0], 'Negative: ', avg_weekly_neg_score[i][0])

Saving the scores to a file so they need not be recalculated every time

In [32]:
# Saving file to not recalculate every time
# scores_df = pd.DataFrame()
# scores_df['weeks']=weeks
# scores_df['avg_weekly_pos_score']=avg_weekly_pos_score
# scores_df['avg_weekly_neg_score']=avg_weekly_neg_score

# scores_df.to_csv('scores_df.csv',index=False)
In [20]:
scores_df = pd.read_csv('scores_df.csv')
avg_weekly_pos_score = scores_df['avg_weekly_pos_score']
avg_weekly_neg_score = scores_df['avg_weekly_neg_score']

Plot of average weekly positive sentiments

In [19]:
plt.figure(figsize=(15, 10))
plt.plot(scores_df['avg_weekly_pos_score'])
plt.xlabel('Week', fontsize=20)
plt.ylabel('Average weekly positive sentiment score', fontsize=20)
Out[19]:
<matplotlib.text.Text at 0x1282fa8d0>

Plot of average weekly negative sentiments

In [20]:
plt.figure(figsize=(15, 10))
plt.plot(scores_df['avg_weekly_neg_score'],'red')
plt.xlabel('Week', fontsize=20)
plt.ylabel('Average weekly negative sentiment score', fontsize=20)
Out[20]:
<matplotlib.text.Text at 0x12c608ef0>

Plot of average weekly net positive (positive minus negative) sentiments

Including the exchange rate plots

In [21]:
daily_data = pd.read_csv('daily_rates.csv', skiprows=3, header=0)
monthly_data = pd.read_csv('monthly_rates.csv', skiprows=11, header=0)
In [22]:
daily_data['datetime'] = pd.to_datetime(daily_data['DATE'])
monthly_data['datetime'] = pd.to_datetime(monthly_data['DATE'])

daily_data['dayofweek'] = daily_data['datetime'].apply(lambda row: row.dayofweek)
# Keep only Friday observations (dayofweek == 4) to get one exchange-rate fix per week
weekly_data = daily_data[daily_data['dayofweek'] == 4]
In [23]:
timestamp_weeks = [pd.to_datetime(week) for week in weeks]
In [24]:
fig, ax1 = plt.subplots( figsize=(20,15))

ax1.plot(weekly_data['datetime'], weekly_data['XUDLERS'], 'brown', linewidth=2, label=str('EUR/GBP'))
ax1.plot(weekly_data['datetime'], weekly_data['XUDLUSS'], 'blue', linewidth=2, label=str('USD/GBP'))
ax1.legend(loc='best', fontsize=20)
ax1.set_xlabel('Year', fontsize=20)
ax1.set_ylabel('Euro and US Dollar to Pound exchange rate', fontsize=20)
ax1.grid(True)
ax1.set_ylim([min(min(weekly_data['XUDLERS']),min(weekly_data['XUDLUSS'])),max(max(weekly_data['XUDLERS']),max(weekly_data['XUDLUSS']))])
ax1.axvline(x=datetime.datetime(2016,1,8), color='grey', linewidth=2)
ax1.axvline(x=datetime.datetime(2007,1,5), color='orange', linewidth=2)
ax1.axvline(x=datetime.datetime(2009,1,12), color='orange', linewidth=2)

ax2 = ax1.twinx()
ax2.plot(timestamp_weeks, scores_df['avg_weekly_pos_score'], 'green',linewidth=0.5, label = 'Average weekly positive score')
ax2.set_ylabel('Positive Sentiment Score', color='green',fontsize=20)
ax2.set_ylim([0.0,0.1])
for tl in ax2.get_yticklabels():
    tl.set_color('green')

plt.show()
In [28]:
fig, ax1 = plt.subplots( figsize=(20,15))

ax1.plot(weekly_data['datetime'], weekly_data['XUDLERS'], 'brown', linewidth=2, label=str('EUR/GBP'))
ax1.plot(weekly_data['datetime'], weekly_data['XUDLUSS'], 'blue', linewidth=2, label=str('USD/GBP'))
ax1.legend(loc='best', fontsize=20)
ax1.set_xlabel('Year', fontsize=20)
ax1.set_ylabel('Euro and US Dollar to Pound exchange rate', fontsize=20)
ax1.grid(True)
ax1.set_ylim([min(min(weekly_data['XUDLERS']),min(weekly_data['XUDLUSS'])),max(max(weekly_data['XUDLERS']),max(weekly_data['XUDLUSS']))])
ax1.axvline(x=datetime.datetime(2016,1,8), color='grey', linewidth=2)
ax1.axvline(x=datetime.datetime(2007,1,5), color='orange', linewidth=2)
ax1.axvline(x=datetime.datetime(2009,1,12), color='orange', linewidth=2)

ax2 = ax1.twinx()
ax2.plot(timestamp_weeks, scores_df['avg_weekly_neg_score'], 'red',linewidth=0.5, label = 'Average weekly negative score')
ax2.set_ylabel('Negative Sentiment Score', color='red',fontsize=20)
ax2.set_ylim([0.0,0.1])
for tl in ax2.get_yticklabels():
    tl.set_color('red')

plt.show()

Improving sentiment analysis using SentiWordNet: smoothing for variance reduction

To reduce variance, we smooth the sentiment scores by taking the running average of the preceding ten weeks as our sentiment time series. With this smoothing, shared trends between the exchange rates and the net sentiment become visible.

In [25]:
def running_average_smoother(values, parameter):
    # Trailing running average: each output value averages the current entry and the
    # (parameter - 1) preceding entries, so the output is shorter by parameter - 1.
    return [np.mean(values[k-(parameter-1):k+1]) for k in range(parameter-1, len(values))]
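
A toy series makes the window behaviour concrete: with parameter 3, each output value averages the current entry and the two preceding ones, so the smoothed series is shorter than the input by parameter - 1 entries (the numbers below are made up, not project data).

In [ ]:
# Toy illustration of the trailing running average
running_average_smoother([1, 2, 3, 4, 5], 3)   # -> [2.0, 3.0, 4.0]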
In [26]:
# Example plot: the running average of the ten preceding weekly sentiment scores is used to smooth the sentiment series
# Smoothing parameter
smooth = 10
fig, ax1 = plt.subplots( figsize=(20,15))

ax1.plot(weekly_data['datetime'], weekly_data['XUDLUSS'], 'blue', linewidth=2, label=str('USD/GBP'))
ax1.legend(loc='best', fontsize=20)
ax1.set_xlabel('Year', fontsize=20)
ax1.set_ylabel('Euro and US Dollar to Pound exchange rate', fontsize=20)
ax1.grid(True)
ax1.set_ylim([min(min(weekly_data['XUDLERS']),min(weekly_data['XUDLUSS'])),max(max(weekly_data['XUDLERS']),max(weekly_data['XUDLUSS']))])
ax1.axvline(x=datetime.datetime(2016,1,8), color='grey', linewidth=2)
ax1.axvline(x=datetime.datetime(2007,1,5), color='orange', linewidth=2)
ax1.axvline(x=datetime.datetime(2009,1,12), color='orange', linewidth=2)

avg_weekly_pos_score_smooth = running_average_smoother(avg_weekly_pos_score,smooth)
avg_weekly_neg_score_smooth = running_average_smoother(avg_weekly_neg_score,smooth)

ax2 = ax1.twinx()
ax2.plot(timestamp_weeks[smooth-1:], avg_weekly_pos_score_smooth, 'green',linewidth=1, label = 'Weekly positive score')
ax2.plot(timestamp_weeks[smooth-1:], [-1*avg_weekly_neg_score_smooth[k] for k in range(0,len(avg_weekly_neg_score_smooth))], 'red',linewidth=1, label = 'Weekly negative score')
ax2.legend(loc='best', fontsize=20)

ax2.set_ylabel('Smoothed Sentiment Scores', color='grey',fontsize=20)
ax2.set_ylim([-0.08,0.08])
for tl in ax2.get_yticklabels():
    tl.set_color('grey')

plt.show()

Comment: a correlation between the sentiment scores and the exchange rates is somewhat visible.

In [43]:
# Calculating the time series correlation: note we shift the Friday rate dates forward
# by 3 days so they align with the Monday 'Week Start' dates (we are working at a weekly level)

weekly_data_shifted = weekly_data['datetime'] + datetime.timedelta(days=3)
ts1 = []
ts2 = []

for k in weekly_data_shifted:
    if k in timestamp_weeks[9:]:
        ts1.append(weekly_data['XUDLERS'][weekly_data_shifted==k])
        ts2.append(avg_weekly_pos_score_smooth[timestamp_weeks[9:].index(k)])

ts1 = [ts1[i].values[0] for i in range(0,len(ts1))]

print(np.corrcoef(ts1,ts2)[0][1])
0.469417894603
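
The 3-day shift works because weekly_data keeps Friday observations (dayofweek == 4) while the articles' 'Week Start' dates fall on Mondays, so adding three days moves each Friday onto the following Monday. For example:

In [ ]:
# A Friday rate date lands on the following Monday ('Week Start') after the shift
pd.Timestamp('2000-01-07') + datetime.timedelta(days=3)   # -> Timestamp('2000-01-10')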
In [44]:
# A function for calculating the time series correlation for different running average smoothing parameters
def correlation_function(smoothing_parameter,avg_weekly_pos_score,avg_weekly_neg_score):
    # Shift the Friday rate dates forward by 3 days so they align with the Monday 'Week Start' dates
    weekly_data_shifted = weekly_data['datetime'] + datetime.timedelta(days=3)
    ts0 = []
    ts1 = []
    ts2 = []
    ts3 = []
    for k in weekly_data_shifted:
        if (k in timestamp_weeks[smoothing_parameter:]):
            if (timestamp_weeks[smoothing_parameter:].index(k) <= len(timestamp_weeks[smoothing_parameter:]) -1 ):
                ts0.append(weekly_data['XUDLUSS'][weekly_data_shifted==k])
                ts1.append(weekly_data['XUDLERS'][weekly_data_shifted==k])
                ts2.append(running_average_smoother(avg_weekly_pos_score,smoothing_parameter)[timestamp_weeks[smoothing_parameter:].index(k)])
                ts3.append(running_average_smoother(avg_weekly_neg_score,smoothing_parameter)[timestamp_weeks[smoothing_parameter:].index(k)])
    
    ts0 = [ts0[i].values[0] for i in range(0,len(ts0))]
    ts1 = [ts1[i].values[0] for i in range(0,len(ts1))]
    
    usd_pos_score_corr = np.corrcoef(ts0,ts2)[0][1]
    usd_neg_score_corr = np.corrcoef(ts0,ts3)[0][1]
    eur_pos_score_corr = np.corrcoef(ts1,ts2)[0][1]
    eur_neg_score_corr = np.corrcoef(ts1,ts3)[0][1]
    
    print('Correlation values are:','GBPUSD with Pos scores:',usd_pos_score_corr,
          'GBPUSD with Neg scores:', usd_neg_score_corr, 'GBPEUR with Pos scores:', eur_pos_score_corr,
          'GBPEUR with Neg scores:', eur_neg_score_corr)
    
    return(usd_pos_score_corr,usd_neg_score_corr,eur_pos_score_corr,eur_neg_score_corr)
In [500]:
correlation_scores = [correlation_function(k,avg_weekly_pos_score,avg_weekly_neg_score) for k in range(10,100,5)]
correlation_scores_USD_pos = np.asarray(correlation_scores)[:,0]
correlation_scores_USD_neg = np.asarray(correlation_scores)[:,1]
correlation_scores_EUR_pos = np.asarray(correlation_scores)[:,2]
correlation_scores_EUR_neg = np.asarray(correlation_scores)[:,3]
Correlation values are: GBPUSD with Pos scores: -0.0596762609186 GBPUSD with Neg scores: -0.118320076695 GBPEUR with Pos scores: 0.468743670721 GBPEUR with Neg scores: 0.0964551474907
Correlation values are: GBPUSD with Pos scores: -0.0459757823929 GBPUSD with Neg scores: -0.141905542882 GBPEUR with Pos scores: 0.500051269621 GBPEUR with Neg scores: 0.102741844176
Correlation values are: GBPUSD with Pos scores: -0.0394163385641 GBPUSD with Neg scores: -0.162371605098 GBPEUR with Pos scores: 0.518723465749 GBPEUR with Neg scores: 0.102810485615
Correlation values are: GBPUSD with Pos scores: -0.0280955240856 GBPUSD with Neg scores: -0.184740543425 GBPEUR with Pos scores: 0.533246140324 GBPEUR with Neg scores: 0.0984263883988
Correlation values are: GBPUSD with Pos scores: -0.0250106439674 GBPUSD with Neg scores: -0.202384567203 GBPEUR with Pos scores: 0.543715175141 GBPEUR with Neg scores: 0.0935015471854
Correlation values are: GBPUSD with Pos scores: -0.023151265409 GBPUSD with Neg scores: -0.218089228175 GBPEUR with Pos scores: 0.553299570239 GBPEUR with Neg scores: 0.0886645002041
Correlation values are: GBPUSD with Pos scores: -0.0224583815368 GBPUSD with Neg scores: -0.229878832634 GBPEUR with Pos scores: 0.560401011001 GBPEUR with Neg scores: 0.0811306448641
Correlation values are: GBPUSD with Pos scores: -0.0175429410483 GBPUSD with Neg scores: -0.236793257038 GBPEUR with Pos scores: 0.566621941305 GBPEUR with Neg scores: 0.0806137922158
Correlation values are: GBPUSD with Pos scores: -0.013597520392 GBPUSD with Neg scores: -0.24127379348 GBPEUR with Pos scores: 0.572372385792 GBPEUR with Neg scores: 0.0827793506844
Correlation values are: GBPUSD with Pos scores: -0.0158481749082 GBPUSD with Neg scores: -0.24468573532 GBPEUR with Pos scores: 0.57789315866 GBPEUR with Neg scores: 0.0822803138723
Correlation values are: GBPUSD with Pos scores: -0.0161071396635 GBPUSD with Neg scores: -0.249630453768 GBPEUR with Pos scores: 0.58192113256 GBPEUR with Neg scores: 0.0813479853572
Correlation values are: GBPUSD with Pos scores: -0.0142447046852 GBPUSD with Neg scores: -0.24935597211 GBPEUR with Pos scores: 0.586798983959 GBPEUR with Neg scores: 0.0786291146123
Correlation values are: GBPUSD with Pos scores: -0.00954674890996 GBPUSD with Neg scores: -0.244304249078 GBPEUR with Pos scores: 0.593045918975 GBPEUR with Neg scores: 0.076069456613
Correlation values are: GBPUSD with Pos scores: -0.000469455820377 GBPUSD with Neg scores: -0.235132753712 GBPEUR with Pos scores: 0.597839501842 GBPEUR with Neg scores: 0.0740300316717
Correlation values are: GBPUSD with Pos scores: 0.0154939132231 GBPUSD with Neg scores: -0.228081243827 GBPEUR with Pos scores: 0.602916316534 GBPEUR with Neg scores: 0.0718797058371
Correlation values are: GBPUSD with Pos scores: 0.0344065419703 GBPUSD with Neg scores: -0.218957035285 GBPEUR with Pos scores: 0.609986269173 GBPEUR with Neg scores: 0.068917918345
Correlation values are: GBPUSD with Pos scores: 0.0564767002637 GBPUSD with Neg scores: -0.20595241304 GBPEUR with Pos scores: 0.619289458566 GBPEUR with Neg scores: 0.0670197267899
Correlation values are: GBPUSD with Pos scores: 0.0800913940224 GBPUSD with Neg scores: -0.187415086995 GBPEUR with Pos scores: 0.631073314674 GBPEUR with Neg scores: 0.0647389520355
In [501]:
fig, ax1 = plt.subplots( figsize=(10,10) )
ax1.plot(correlation_scores_USD_pos, 'yellow',label = 'USD and Pos Sentiment Correlation')
ax1.plot(correlation_scores_USD_neg, 'red',label = 'USD and Neg Sentiment Correlation')
ax1.plot(correlation_scores_EUR_pos, 'green',label = 'EUR and Pos Sentiment Correlation')
ax1.plot(correlation_scores_EUR_neg, 'orange',label = 'EUR and Neg Sentiment Correlation')

ax1.set_ylabel('Correlation', fontsize=20)
ax1.legend(loc='best', fontsize=20)
ax1.set_ylim([-1,1])
Out[501]:
(-1, 1)
In [502]:
[k for k in range(10,100,5)][np.argmin(correlation_scores_USD_neg)]
Out[502]:
60

From the plot above, we note that negative sentiment is correlated with GBPUSD and positive sentiment is correlated with GBPEUR. We choose the smoothing parameter at which the correlation between GBPUSD and the negative sentiment score is strongest (most negative), which gives 60 as the optimal smoothing parameter.

In [27]:
# Example plot: the running average of the preceding weekly sentiment scores is used to smooth the sentiment series
# Smoothing parameter (chosen above)
smooth = 60
fig, ax1 = plt.subplots( figsize=(20,15))

ax1.plot(weekly_data['datetime'], weekly_data['XUDLUSS'], 'blue', linewidth=2, label=str('USD/GBP'))
ax1.legend(loc='best', fontsize=20)
ax1.set_xlabel('Year', fontsize=20)
ax1.set_ylabel('Euro and US Dollar to Pound exchange rate', fontsize=20)
ax1.grid(True)
ax1.set_ylim([min(min(weekly_data['XUDLERS']),min(weekly_data['XUDLUSS'])),max(max(weekly_data['XUDLERS']),max(weekly_data['XUDLUSS']))])
ax1.axvline(x=datetime.datetime(2016,1,8), color='grey', linewidth=2)
ax1.axvline(x=datetime.datetime(2007,1,5), color='orange', linewidth=2)
ax1.axvline(x=datetime.datetime(2009,1,12), color='orange', linewidth=2)

avg_weekly_pos_score_smooth = running_average_smoother(avg_weekly_pos_score,smooth)
avg_weekly_neg_score_smooth = running_average_smoother(avg_weekly_neg_score,smooth)

ax2 = ax1.twinx()
ax2.plot(timestamp_weeks[smooth-1:], avg_weekly_pos_score_smooth, 'green',linewidth=1, label = 'Weekly positive score')
ax2.plot(timestamp_weeks[smooth-1:], [-1*avg_weekly_neg_score_smooth[k] for k in range(0,len(avg_weekly_neg_score_smooth))], 'red',linewidth=1, label = 'Weekly negative score')
ax2.legend(loc='best', fontsize=20)

ax2.set_ylabel('Smoothed Sentiment Scores', color='grey',fontsize=20)
ax2.set_ylim([-0.08,0.08])
for tl in ax2.get_yticklabels():
    tl.set_color('grey')

plt.show()

First/last 20 word analysis: we investigate sentiment scores computed from only the first and last 20 words of each article. Our hypothesis is that these words are sufficient to indicate an article's sentiment, while saving computation time.

In [497]:
weeks = relevant_df['Week Start'].unique()

avg_weekly_pos_score_fl20 = np.zeros((len(weeks), 1))
avg_weekly_neg_score_fl20 = np.zeros((len(weeks), 1))

for i, week in enumerate(weeks):
    # First and last twenty words
    articles = relevant_df[relevant_df['Week Start'] == week]['Body Text']
    num_articles = articles.shape[0]
    pos_score_fl20 = 0
    neg_score_fl20 = 0
    for article in articles:
        # Only keeping the first and last 20 words
        article = ' '.join(article.split()[0:20] + article.split()[-20:])
        pos, neg = simple_sentiment(article)
        pos_score_fl20 += pos
        neg_score_fl20 += neg
    avg_weekly_pos_score_fl20[i] = (pos_score_fl20/float(num_articles))
    avg_weekly_neg_score_fl20[i] = (neg_score_fl20/float(num_articles))
    if (i%10 == 0):
        print('Week: ', week, 'Postive: ', avg_weekly_pos_score[i][0], 'Negative: ', avg_weekly_neg_score[i][0])
Week:  2000-01-03 Postive:  0.070247292798 Negative:  0.0528722433901
Week:  2000-03-13 Postive:  0.0658073567332 Negative:  0.0521062053688
Week:  2000-05-22 Postive:  0.0653498060063 Negative:  0.0470417799244
Week:  2000-07-31 Postive:  0.0623012359218 Negative:  0.053896961167
Week:  2000-10-09 Postive:  0.0616461154479 Negative:  0.0443163230379
Week:  2000-12-18 Postive:  0.0575619986913 Negative:  0.0559789036546
Week:  2001-02-26 Postive:  0.0650370416063 Negative:  0.0542495718371
Week:  2001-05-07 Postive:  0.0647627400464 Negative:  0.0497860663912
Week:  2001-07-16 Postive:  0.0616574453781 Negative:  0.0550657019294
Week:  2001-09-24 Postive:  0.0590427462749 Negative:  0.0602430461482
Week:  2001-12-03 Postive:  0.0645236698043 Negative:  0.0612511105347
Week:  2002-02-11 Postive:  0.0626446311006 Negative:  0.0577428083685
Week:  2002-04-22 Postive:  0.0589347996625 Negative:  0.0545471878205
Week:  2002-07-01 Postive:  0.0613698109972 Negative:  0.05814486571
Week:  2002-09-09 Postive:  0.0627252193592 Negative:  0.0626176364583
Week:  2002-11-18 Postive:  0.0633336372791 Negative:  0.056378552437
Week:  2003-01-27 Postive:  0.0609341498228 Negative:  0.0609444727475
Week:  2003-04-07 Postive:  0.0626810720958 Negative:  0.0591173676005
Week:  2003-06-16 Postive:  0.0676218160258 Negative:  0.056937275817
Week:  2003-08-25 Postive:  0.0660500459658 Negative:  0.0550704016185
Week:  2003-11-03 Postive:  0.0686246619861 Negative:  0.055255099286
Week:  2004-01-12 Postive:  0.0561652525227 Negative:  0.0576878794783
Week:  2004-03-22 Postive:  0.0645770992968 Negative:  0.0542847149407
Week:  2004-05-31 Postive:  0.0656146118722 Negative:  0.0611784677117
Week:  2004-08-09 Postive:  0.0612027890647 Negative:  0.0465309413544
Week:  2004-10-18 Postive:  0.0622041887523 Negative:  0.0538350519776
Week:  2004-12-27 Postive:  0.0570319809523 Negative:  0.0580905027247
Week:  2005-03-07 Postive:  0.0698148768707 Negative:  0.0584057348249
Week:  2005-05-16 Postive:  0.0561093925709 Negative:  0.05169866295
Week:  2005-07-25 Postive:  0.0631798212986 Negative:  0.050992308504
Week:  2005-10-03 Postive:  0.0663529680452 Negative:  0.0566626431499
Week:  2005-12-12 Postive:  0.0639792081296 Negative:  0.0495567880074
Week:  2006-02-20 Postive:  0.0621549112422 Negative:  0.0556185741554
Week:  2006-05-01 Postive:  0.0576230553326 Negative:  0.0522844687026
Week:  2006-07-10 Postive:  0.0639583800864 Negative:  0.0556285531944
Week:  2006-09-18 Postive:  0.0580891556258 Negative:  0.0564063680661
Week:  2006-11-27 Postive:  0.0577974196904 Negative:  0.05286202742
Week:  2007-02-05 Postive:  0.0585059853479 Negative:  0.0511422156741
Week:  2007-04-16 Postive:  0.0614453809981 Negative:  0.0536828697315
Week:  2007-06-25 Postive:  0.0603237269026 Negative:  0.0540731743576
Week:  2007-09-03 Postive:  0.0661022180416 Negative:  0.054826362352
Week:  2007-11-12 Postive:  0.0565422957201 Negative:  0.0566136107693
Week:  2008-01-21 Postive:  0.0625120742943 Negative:  0.0564085599664
Week:  2008-03-31 Postive:  0.062196596257 Negative:  0.0562425867818
Week:  2008-06-09 Postive:  0.0609730981448 Negative:  0.05679412495
Week:  2008-08-18 Postive:  0.0529763120502 Negative:  0.0531138689598
Week:  2008-10-27 Postive:  0.0589351729894 Negative:  0.0534437370023
Week:  2009-01-05 Postive:  0.0596886373794 Negative:  0.0506772625514
Week:  2009-03-16 Postive:  0.0626951685953 Negative:  0.0547125112088
Week:  2009-05-25 Postive:  0.0643500897157 Negative:  0.0552304730094
Week:  2009-08-03 Postive:  0.0567692463999 Negative:  0.0573681071284
Week:  2009-10-12 Postive:  0.0589871292156 Negative:  0.053112511686
Week:  2009-12-21 Postive:  0.0560603002174 Negative:  0.0476079895846
Week:  2010-03-01 Postive:  0.0600195251088 Negative:  0.0512435021172
Week:  2010-05-10 Postive:  0.0584850202821 Negative:  0.0530082672269
Week:  2010-07-19 Postive:  0.0652257430737 Negative:  0.0540492717606
Week:  2010-09-27 Postive:  0.0600147586977 Negative:  0.0498376810092
Week:  2010-12-06 Postive:  0.0661222990263 Negative:  0.0441327423386
Week:  2011-02-14 Postive:  0.0545892193367 Negative:  0.0472627406083
Week:  2011-04-25 Postive:  0.0596794090988 Negative:  0.0541653530357
Week:  2011-07-04 Postive:  0.0554506729749 Negative:  0.052003327899
Week:  2011-09-12 Postive:  0.0594790228372 Negative:  0.0579826637214
Week:  2011-11-21 Postive:  0.0578874901878 Negative:  0.0541887680281
Week:  2012-01-30 Postive:  0.058571272334 Negative:  0.053489140404
Week:  2012-04-09 Postive:  0.0624904745445 Negative:  0.0588127050151
Week:  2012-06-18 Postive:  0.064600234682 Negative:  0.0566528569825
Week:  2012-08-27 Postive:  0.0563098040101 Negative:  0.0502421163991
Week:  2012-11-05 Postive:  0.0570770968645 Negative:  0.0466208704091
Week:  2013-01-14 Postive:  0.0618242513182 Negative:  0.0524102249518
Week:  2013-03-25 Postive:  0.0553472428224 Negative:  0.0523411798025
Week:  2013-06-03 Postive:  0.0610375286252 Negative:  0.0489877789968
Week:  2013-08-12 Postive:  0.0566643127066 Negative:  0.0574768519155
Week:  2013-10-21 Postive:  0.0601101616437 Negative:  0.0475021530791
Week:  2013-12-30 Postive:  0.0597342057968 Negative:  0.0502957137784
Week:  2014-03-10 Postive:  0.0607634982244 Negative:  0.0498187741547
Week:  2014-05-19 Postive:  0.0616238208348 Negative:  0.0492452893465
Week:  2014-07-28 Postive:  0.0605461959962 Negative:  0.052903029744
Week:  2014-10-06 Postive:  0.0578441891044 Negative:  0.0513116090539
Week:  2014-12-15 Postive:  0.0543014260406 Negative:  0.0540579953841
Week:  2015-02-23 Postive:  0.0626575665451 Negative:  0.0459604766705
Week:  2015-05-04 Postive:  0.0594779341886 Negative:  0.0546943683571
Week:  2015-07-13 Postive:  0.0632909728393 Negative:  0.0544319960899
Week:  2015-09-21 Postive:  0.0591034943314 Negative:  0.0499770091971
Week:  2015-11-30 Postive:  0.0609028360376 Negative:  0.0493375113752
Week:  2016-02-08 Postive:  0.0600355685147 Negative:  0.0576782106041
Week:  2016-04-18 Postive:  0.0552852615782 Negative:  0.0491811576399
Week:  2016-06-27 Postive:  0.0634605779242 Negative:  0.0530507597463
Week:  2016-09-05 Postive:  0.0612259443832 Negative:  0.0509554182366
In [498]:
# Example plot: the running average of the preceding weekly sentiment scores is used to smooth the sentiment series
# Smoothing parameter set to 60
smooth = 60
fig, ax1 = plt.subplots( figsize=(20,15))

ax1.plot(weekly_data['datetime'], weekly_data['XUDLERS'], 'brown', linewidth=2, label=str('EUR/GBP'))
ax1.plot(weekly_data['datetime'], weekly_data['XUDLUSS'], 'blue', linewidth=2, label=str('USD/GBP'))
ax1.legend(loc='best', fontsize=20)
ax1.set_xlabel('Year', fontsize=20)
ax1.set_ylabel('Euro and US Dollar to Pound exchange rate', fontsize=20)
ax1.grid(True)
ax1.set_ylim([min(min(weekly_data['XUDLERS']),min(weekly_data['XUDLUSS'])),max(max(weekly_data['XUDLERS']),max(weekly_data['XUDLUSS']))])
ax1.axvline(x=datetime.datetime(2016,1,8), color='grey', linewidth=2)
ax1.axvline(x=datetime.datetime(2007,1,5), color='orange', linewidth=2)
ax1.axvline(x=datetime.datetime(2009,1,12), color='orange', linewidth=2)

avg_weekly_pos_score_smooth = running_average_smoother(avg_weekly_pos_score_fl20,smooth)
avg_weekly_neg_score_smooth = running_average_smoother(avg_weekly_neg_score_fl20,smooth)

ax2 = ax1.twinx()
ax2.plot(timestamp_weeks[smooth-1:], avg_weekly_pos_score_smooth, 'green',linewidth=1, label = 'Weekly positive score')
ax2.plot(timestamp_weeks[smooth-1:], [-1*avg_weekly_neg_score_smooth[k] for k in range(0,len(avg_weekly_neg_score_smooth))], 'red',linewidth=1, label = 'Weekly negative score')


ax2.set_ylabel('Smoothed Sentiment Scores', color='grey',fontsize=20)
ax2.set_ylim([-0.08,0.08])
for tl in ax2.get_yticklabels():
    tl.set_color('grey')

plt.show()

Let us compare the correlations of the two models (one using only the sentiment of the first/last 20 words, the other using the sentiment of the full article).

In [506]:
print('Full article sentiment correlation: \n', correlation_function(60,avg_weekly_pos_score,avg_weekly_neg_score),
     '\n First/Last 20 words sentiment correlation: \n', correlation_function(60,avg_weekly_pos_score_fl20,avg_weekly_neg_score_fl20),)
Correlation values are: GBPUSD with Pos scores: -0.0161071396635 GBPUSD with Neg scores: -0.249630453768 GBPEUR with Pos scores: 0.58192113256 GBPEUR with Neg scores: 0.0813479853572
Correlation values are: GBPUSD with Pos scores: 0.101768598987 GBPUSD with Neg scores: -0.291796052172 GBPEUR with Pos scores: 0.562555930038 GBPEUR with Neg scores: -0.281428244465
Full article sentiment correlation: 
 (-0.016107139663533665, -0.24963045376849607, 0.58192113255961253, 0.081347985357151847) First/Last 20 words sentiment correlation: 
 (0.1017685989865699, -0.29179605217215565, 0.56255593003843052, -0.28142824446493103)

The sentiment of the first/last 20 words of each article actually gives stronger correlations with the exchange rates, and it is also computationally faster.

First/last 20 words sentiment analysis function

In [517]:
# First/ last 20 words sentiment analysis function:
def simple_sentiment_fl20(text_chunk):
    # Only keeping the first/ last 20 words
    text_chunk = ' '.join(text_chunk.split()[0:20] + text_chunk.split()[-20:])
    return simple_sentiment(text_chunk)
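
As a rough check, the truncated scorer can be run on the sample_text defined earlier; it only sees the first and last 20 words, so it performs far fewer SentiWordNet lookups, and its scores will generally differ from the full-article call.

In [ ]:
# Scores based on at most 40 words of the article
simple_sentiment_fl20(sample_text)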

Sentiment analysis with neutral sentences removed

In [28]:
# A function to remove the neutral sentences from a given text
def remove_neu_sentences(text, neutral_threshold):
    sentences = tokenize.sent_tokenize(text)
    output_text = []
    
    for sentence in sentences:
        # Keep a sentence only if 1 - (pos + neg) falls below the neutrality threshold
        if (1-sum(simple_sentiment(sentence))) < neutral_threshold:
            output_text.append(sentence)
    return ' '.join(output_text)
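
With a threshold of 0.9, a sentence is kept only if 1 - (pos + neg), as returned by simple_sentiment, is below 0.9, i.e. only if its average positive and negative scores sum to more than 0.1. The call below is illustrative only (made-up sentences); which sentences survive depends on their SentiWordNet scores.

In [ ]:
# Illustrative only: survival depends on each sentence's SentiWordNet scores
toy_text = "The committee met on Tuesday. The catastrophic collapse wiped out fortunes."
remove_neu_sentences(toy_text, 0.9)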
In [29]:
# A function for calculating the sentiment scores with the neutral sentences removed
def sentiscores_neu_removed(neutral_threshold):
    weeks = relevant_df['Week Start'].unique()
    avg_weekly_pos_score = np.zeros((len(weeks), 1))
    avg_weekly_neg_score = np.zeros((len(weeks), 1))
    
    for i, week in enumerate(weeks):
        articles = relevant_df[relevant_df['Week Start'] == week]['Body Text']
        num_articles = 0
        pos_score = 0
        neg_score = 0
        for article in articles:
            article_neutral_removed = remove_neu_sentences(article,neutral_threshold)
            if (len(article_neutral_removed) > 2):
                num_articles += 1
            
            pos, neg = simple_sentiment(article_neutral_removed)
            pos_score += pos
            neg_score += neg
        
        avg_weekly_pos_score[i] = (pos_score/float(1 if num_articles==0 else num_articles))
        avg_weekly_neg_score[i] = (neg_score/float(1 if num_articles==0 else num_articles))
        avg_weekly_pos_minus_neg_score[i] = avg_weekly_pos_score[i] - avg_weekly_neg_score[i]
        if (i%10 == 0):
            print('Week: ', week, 'Positive: ', avg_weekly_pos_score[i][0], 'Negative: ', avg_weekly_neg_score[i][0])
           
    return weeks, avg_weekly_pos_score, avg_weekly_neg_score
In [ ]:
dataframe_input = sentiscores_neu_removed(0.9)
In [77]:
# # Saving file to not recalculate every time
# scores_df = pd.DataFrame()
# scores_df['weeks']=dataframe_input[0]
# scores_df['avg_weekly_pos_score'] = dataframe_input[1]
# scores_df['avg_weekly_neg_score'] = dataframe_input[2]

# scores_df.to_csv('scores_df_0.9.csv',index=False)
In [39]:
# Example plot: the running average of the preceding weekly sentiment scores is used to smooth the sentiment series
# Smoothing parameter
smooth = 20
fig, ax1 = plt.subplots( figsize=(20,15))

ax1.plot(weekly_data['datetime'], weekly_data['XUDLUSS'], '#4b97ab', linewidth=2, label=str('USD/GBP'))
ax1.legend(loc='best', fontsize=20)
ax1.set_xlabel('Year', fontsize=20)
ax1.set_ylabel('Euro and US Dollar to Pound exchange rate', fontsize=20)
ax1.grid(True)
ax1.set_ylim([min(min(weekly_data['XUDLERS']),min(weekly_data['XUDLUSS'])),max(max(weekly_data['XUDLERS']),max(weekly_data['XUDLUSS']))])

ax1.axvline(x=datetime.datetime(2016,1,8), color='grey', linewidth=2)
ax1.axvline(x=datetime.datetime(2007,1,5), color='orange', linewidth=2)
ax1.axvline(x=datetime.datetime(2009,1,12), color='orange', linewidth=2)

plt.xticks(fontsize=30)
plt.yticks(fontsize=30)

avg_weekly_pos_score_smooth = running_average_smoother(avg_weekly_pos_score,smooth)
avg_weekly_neg_score_smooth = running_average_smoother(avg_weekly_neg_score,smooth)

ax2 = ax1.twinx()
ax2.plot(timestamp_weeks[smooth-1:], avg_weekly_pos_score_smooth, 'green',linewidth=1, label = 'Weekly positive score')
ax2.plot(timestamp_weeks[smooth-1:], [-1*avg_weekly_neg_score_smooth[k] for k in range(0,len(avg_weekly_neg_score_smooth))], 'red',linewidth=1, label = 'Weekly negative score')

ax2.set_ylabel('Smoothed Sentiment Scores', color='grey',fontsize=20)
ax2.set_ylim([-0.08,0.08])
for tl in ax2.get_yticklabels():
    tl.set_color('grey')

plt.xticks(fontsize=30)
plt.yticks(fontsize=30)
plt.savefig('sentiment_correlation_pic.png', dpi=200)