Sema difference? English and Swahili news coverage of the 2015 general election in Tanzania
Men in Stone Town, Zanzibar read headlines on Oct 14, 2015, about 10 days before the election
If you find yourself in Tanzania and can’t read the headlines above, hakuna matata. Major publishers here offer both English and Swahili language dailies. Last summer, as media coverage ramped up around the now-concluded general election, I started to wonder how much it matters that I, an English-speaking foreigner, read the news in a different language than most Tanzanians. Were there differences in coverage in English vs. Swahili, or between publishing houses? Were foreigners getting the same news as locals, and could I figure out a way to measure the difference?
Analysis of almost 9000 articles showed that election coverage was the primary focus of Tanzanian print media this fall, especially in Swahili. Unsurprisingly, the ruling party and topics favorable to the Tanzanian government received the most coverage in government-owned publications. I didn’t find an obvious difference in which topics were covered in English vs. Swahili, but election-related terms appeared more often in Swahili. I think this consistency between languages reflects well on the Tanzanian media. Experiments with topic modeling highlighted some unexpected aspects of the election coverage, including a dominant focus on political promises and electoral policy and procedures.
What election?
On October 25 Tanzania held its 5th general election since 1992, electing John Magufuli as President by a 58% majority. Dr. Magufuli represented the ruling party CCM (Chama Cha Mapinduzi), and defeated Edward Lowassa, the candidate for opposition party CHADEMA (Chama cha Demokrasia na Maendeleo). It was a lively campaign but CCM’s victory was expected, and the transition of power from former president Jakaya Kikwete was smooth and generally peaceful.
CCM candidate and current President of Tanzania John Magufuli on the left, CHADEMA candidate Edward Lowassa on the right.
A wrinkle worth mentioning happened in Zanzibar, which runs its own presidential election in parallel with the mainland. Amid rumors of an opposition victory the CCM-dominated Zanzibar Electoral Commission (ZEC) annulled the election result on October 28, citing irregularities. The move was widely criticized and left Zanzibar in an uncertain political situation that is still unresolved.
Three publishers, six newspapers
Tanzania has an active media landscape including over 30 print publications that vary widely in scope and quality. Conveniently the three most prominent publishers each produce a daily newspaper in English and Swahili, so these six papers were an obvious choice for this project. The Kenya-based Nation Media Group publishes The Citizen (EN) and Mwananchi (SW), Tanzanian-owned IPP Media publishes The Guardian (EN) and Nipashe (SW), and government-owned TSN (Tanzanian Standard Newspapers) publishes the Daily News (EN) and Habarileo (SW). Here they are in a table:

| Publisher | English | Swahili |
|---|---|---|
| Nation Media | Citizen | Mwananchi |
| IPP Media | Guardian | Nipashe |
| TSN | Daily News | Habarileo |

The six newspapers used for this project. Nation Media is foreign-owned, IPP Media is owned by a wealthy Tanzanian, and TSN is owned by the government of Tanzania.
The setup
On September 14, 2015 I set up scrapers to download all articles linked from the front page of these six newspapers’ websites. The scrapers were built in Python with scrapy and ran twice daily, with deduplication done in post-processing. The election was held on October 25, and by mid-November the scrapers had accumulated almost 9000 articles: about 4900 in English and 3900 in Swahili. I used Python to clean and organize the scraped articles, and looked at the data in three ways:
Counted terms of interest by language and publication
Manually compared selected English and Swahili headlines
Experimented with topic modeling on the English language articles
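Because the scrapers ran twice daily, the same front-page article would be downloaded more than once. A minimal sketch of a post-processing deduplication step follows. The field names `url` and `text` are assumptions made for illustration, not the actual schema of the scraped data.

```python
import hashlib


def dedupe_articles(articles):
    """Keep the first copy of each article.

    Articles are keyed on their URL when one is present, otherwise on a
    hash of the article text. Both field names are hypothetical.
    """
    seen = set()
    unique = []
    for article in articles:
        key = article.get("url") or hashlib.sha1(
            article["text"].encode("utf-8")
        ).hexdigest()
        if key in seen:
            continue  # already collected on an earlier scraper run
        seen.add(key)
        unique.append(article)
    return unique


# Made-up records standing in for two scraper runs on the same day.
scraped = [
    {"url": "http://example.com/a", "text": "First story"},
    {"url": "http://example.com/a", "text": "First story"},
    {"url": "http://example.com/b", "text": "Second story"},
]
unique_articles = dedupe_articles(scraped)
print(len(unique_articles))
```

Keying on the URL is the simplest choice; hashing the text is a fallback for records without one, and would also be an option if the same story appeared under several URLs.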
I selected some key terms related to the election, plus a few more general terms for comparison, and counted the number of times they were mentioned over two months from September 15 - November 15.
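The counting described above can be sketched in a few lines of Python. This is an illustration rather than the scripts used for the project: the field names (`publication`, `date`, `text`), the sample articles, and the choice of case-insensitive whole-word matching are all assumptions.

```python
import re
from collections import Counter


def count_term(text, term):
    """Count case-insensitive whole-word occurrences of term in text."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def daily_counts(articles, term):
    """Sum mentions of term for each (publication, date) pair."""
    counts = Counter()
    for article in articles:
        n = count_term(article["text"], term)
        if n:
            counts[(article["publication"], article["date"])] += n
    return counts


# Made-up sample articles, for illustration only.
sample = [
    {"publication": "Citizen", "date": "2015-10-26",
     "text": "Magufuli leads. CCM says Magufuli will win."},
    {"publication": "Mwananchi", "date": "2015-10-26",
     "text": "Magufuli aongoza."},
    {"publication": "Citizen", "date": "2015-10-27",
     "text": "Zanzibar vote in doubt."},
]
magufuli = daily_counts(sample, "Magufuli")
for (publication, date), n in sorted(magufuli.items()):
    print(publication, date, n)
```

Whole-word matching avoids counting a term inside a longer word, but it would also miss inflected forms, so a substring match may be the better choice for some Swahili terms.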
[Interactive chart: daily counts of the selected term from September 15 - November 15, with English publications above the x-axis and Swahili publications below.]
Election-related terms were consistently mentioned more in Swahili than English, often by at least a factor of 2, and absolute counts were high across all publications. Ruling party CCM and candidate [John] Magufuli were mentioned more than opposition party CHADEMA/UKAWA and candidate [Edward] Lowassa. This was true for all publications, but the difference was largest for the TSN (government-owned) papers Daily News and Habarileo. For example, Magufuli was mentioned 3.6 times as often as Lowassa by TSN (1966 vs 541 mentions), vs about 1.5 times as often in IPP Media publications (1425 vs 950, summing the Guardian and Nipashe counts). TSN mentioned CCM 1.8 times as often as CHADEMA and UKAWA combined. Their coverage of Zanzibar stands out as the only case where an election-related term was mentioned more in English (Daily News) than Swahili (Habarileo).

| Term | Citizen | Mwananchi | Total | Citizen:Mwananchi |
|---|---|---|---|---|
| Magufuli | 644 | 1164 | 1808 | 0.55 |
| Lowassa | 457 | 928 | 1385 | 0.49 |
| CCM | 1101 | 2007 | 3108 | 0.55 |
| UKAWA | 292 | 596 | 888 | 0.49 |
| CHADEMA | 614 | 1087 | 1701 | 0.56 |
| Zanzibar | 425 | 578 | 1003 | 0.74 |
| election/uchaguzi* | 1322 | 2259 | 3581 | 0.59 |
| corruption/rushwa* | 237 | 171 | 408 | 1.39 |
| sports/michezo* | 126 | 138 | 264 | 0.91 |
| Kenya | 345 | 127 | 472 | 2.72 |


| Term | Guardian | Nipashe | Total | Guardian:Nipashe |
|---|---|---|---|---|
| Magufuli | 326 | 1099 | 1425 | 0.30 |
| Lowassa | 267 | 683 | 950 | 0.39 |
| CCM | 733 | 1828 | 2561 | 0.40 |
| UKAWA | 172 | 433 | 605 | 0.40 |
| CHADEMA | 467 | 1007 | 1474 | 0.46 |
| Zanzibar | 501 | 744 | 1245 | 0.67 |
| election/uchaguzi* | 889 | 2134 | 3023 | 0.42 |
| corruption/rushwa* | 128 | 169 | 297 | 0.76 |
| sports/michezo* | 143 | 114 | 257 | 1.25 |
| Kenya | 170 | 68 | 238 | 2.50 |


| Term | Daily News | Habarileo | Total | Daily News:Habarileo |
|---|---|---|---|---|
| Magufuli | 761 | 1205 | 1966 | 0.63 |
| Lowassa | 170 | 371 | 541 | 0.46 |
| CCM | 973 | 1013 | 1986 | 0.96 |
| UKAWA | 141 | 169 | 310 | 0.83 |
| CHADEMA | 327 | 484 | 811 | 0.68 |
| Zanzibar | 522 | 459 | 981 | 1.14 |
| election/uchaguzi* | 1042 | 1897 | 2939 | 0.55 |
| corruption/rushwa* | 153 | 217 | 370 | 0.71 |
| sports/michezo* | 340 | 313 | 653 | 1.09 |
| Kenya | 270 | 183 | 453 | 1.48 |


| Term | Total (EN) | Total (SW) | Total (EN):Total (SW) |
|---|---|---|---|
| Magufuli | 1731 | 3468 | 0.50 |
| Lowassa | 894 | 1982 | 0.45 |
| CCM | 2807 | 4848 | 0.58 |
| UKAWA | 605 | 1198 | 0.51 |
| CHADEMA | 1408 | 2578 | 0.55 |
| Zanzibar | 1448 | 1781 | 0.81 |
| election/uchaguzi* | 3253 | 6290 | 0.52 |
| corruption/rushwa* | 518 | 557 | 0.93 |
| sports/michezo* | 609 | 565 | 1.08 |
| Kenya | 785 | 378 | 2.08 |

Comparison of term counts in English and Swahili. For pairs marked * the English word was counted in English language publications, and the Swahili word in Swahili publications. The last three rows (corruption, sports and Kenya) are terms not directly related to the election, provided for comparison.
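The Total and ratio columns are simple arithmetic on each publisher's pair of counts. As a small worked check, the sketch below recomputes three rows for the Citizen and Mwananchi using the counts from the table; the helper name `en_sw_ratio` is made up for this example.

```python
def en_sw_ratio(en_count, sw_count):
    """English-to-Swahili ratio of term counts, rounded to 2 decimals."""
    return round(en_count / sw_count, 2)


# (Citizen, Mwananchi) counts copied from the table above.
nation_counts = {
    "Magufuli": (644, 1164),
    "Zanzibar": (425, 578),
    "Kenya": (345, 127),
}

for term, (en, sw) in nation_counts.items():
    print(term, en + sw, en_sw_ratio(en, sw))
```

A ratio below 1 means the term appeared more often in the Swahili paper, and a ratio above 1 means it appeared more often in the English one.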
Some academic studies of media bias normalize term counts, for example as counts per 10,000 words or as a fraction of words published. I briefly played with these techniques and didn’t find them useful for highlighting trends between publications, especially on days when few articles were published, so the tables and charts here use absolute counts.
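For reference, the per-10,000-words normalization mentioned above is a one-line formula. The sketch below uses made-up numbers, not figures from this dataset, to show why the rate gets jumpy on days with little published text.

```python
def per_10k_words(term_count, total_words):
    """Mentions of a term per 10,000 words published."""
    if total_words == 0:
        return 0.0  # nothing published, so no meaningful rate
    return 10_000 * term_count / total_words


# Hypothetical busy day: 12 mentions across 8,000 words.
busy_day = per_10k_words(12, 8_000)
# Hypothetical quiet day: 2 mentions in a single 400-word article.
quiet_day = per_10k_words(2, 400)
print(busy_day, quiet_day)
```

In this example the quiet day scores more than three times higher than the busy day despite having far fewer mentions, which is the kind of distortion that makes absolute counts easier to read here.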
Reading the headlines
Manually reading headlines around controversial events helps put these term counts in context. It’s not quantitative, but coverage of political scandals can be more revealing of a publication’s editorial bias than topic selection. Headlines are also how many people in Tanzania get their news: the photo at the top of this post is a common scene.
The Zanzibar election annulment on October 28 is a good example because it’s a discrete, high-profile and polarizing event. Ben Taylor (@mtega), a blogger and consultant with Tanzanian civil society organization Twaweza, graciously helped with translations from Swahili to English. The list below shows headlines from all articles scraped on October 30 (2 days after the annulment) that mention Zanzibar. There’s unavoidable subjectivity in interpreting them, but two trends stand out.
First, consistent with the word counts above, TSN is pro-government and pro-CCM, IPP Media is pro-opposition and Nation is more centrist. This isn’t surprising (government-owned media supports the government, foreign-owned publisher less opinionated, nobody shocked), but it does make for colorful comparisons. Two days after a major story the Guardian calls Zanzibar a “sure cause for worry” and Nipashe speculates on what will happen if Zanzibar becomes violent, while the Daily News hails “Peaceful Elections” and Habarileo runs a fluffy piece about Zanzibari architecture. You’ll see a similar pattern most days.
Citizen
Congrats, Dr Magufuli; we now must move on
CUF: No need to conduct fresh polls in Zbar
EAC releases poll result report
Forge a democratic Zanzibar, Moyo tells CCM
Peace resumes after tense 3 days in Zanzibar
Pressure mounts on ZEC to reverse decision on polls
Take leadership responsibility, Maalim Seif asks Kikwete, Shein
What it’ll cost to repeat Zbar polls
Mwananchi *
Congratulations Dr. Magufuli, take this into account
Lowassa contests Dr Magufuli presidency
Maalim Seif orders Jakaya Kikwete and Shein to stand for peace
Magufuli 2015
Observers put pressure on ZEC
Guardian
Final Whistle: Magufuli is President
Situation in Zanzibar sure cause for worry
Six Zanzibar presidential candidates fault ZEC on election results rulling
Status of tuna fisheries in Tanzania under spotlight
UK, observers call upon ZEC to resume tabulation process
Nipashe *
Maalim Seif: What ZEC has done is a revolution
Nine hard questions on the Zanzibar elections
Nine houses burned down on Zanzibar while Magufuli is announced as the winner
ZEC will take the blame if Zanzibar becomes violent
Daily News
Businesses open in Isles after a weeks lull
Let us all give Magufuli full support
Observers hail Tanzania over peaceful polls
PBZ posts 2.19bn/- profit for July-Sept
Peaceful elections? The people have shown the way!
Tanzania Postal Bank makes 1.88bn/- profit in Q3
TSA picks constitutional amendment committee
Habarileo *
Election challenges should be settled peacefully
Human rights organization wants to assist ZEC
Stone town's valued architectural art
Quiet returns to Zanzibar
All headlines from articles mentioning Zanzibar, scraped on October 30. Publications marked * are translated from Swahili.
Second, and more positive, is that for each publisher there doesn’t appear to be a major difference between their coverage of important current events in English vs Swahili. If anything Swahili headlines are more emotionally charged. In the example above Nipashe discusses houses burning, violence, revolution (though the word has less volatile connotations in Swahili) and asks hard questions. It’s impressive to see strong dissenting viewpoints in a major local language publication.
Topic modeling
Word counts turned out to be a simple if rough way to quantify topic coverage, but counts can’t incorporate word sense or context. Latent Dirichlet Allocation (LDA) is a computational technique for discovering groups of words that represent topics covered by a collection of documents. It is often applied to find topics in large, unstructured texts, for example Sarah Palin’s leaked emails in 2011 (this page also links to a good general discussion of LDA). In the end LDA wasn’t especially useful for this project, but it is worth including because it highlighted two aspects of the overall election coverage that I didn’t expect.
I ran LDA on the English language articles, n = 4935. I used NLTK and Gensim to clean the text (lowercase, remove punctuation/whitespace/stop words, and identify common bigrams), and then ran Gensim’s LDA implementation with k = 100. k is an LDA parameter which represents the number of topics and is often chosen heuristically. I then manually reviewed each topic and assigned it a label. For example a topic including these terms:
players, stars, team, taifa_stars, tournament, mkwasa, tanzania, dar_es, teams, match
was labeled “sports”. I then used the LDA model to assign these labels to articles based on the most strongly represented topics in each article.
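The labeling step can be sketched as follows. This is an illustration under stated assumptions, not the code from the project: it takes an article's topic distribution as a list of (topic id, weight) pairs, which is the shape Gensim's `LdaModel.get_document_topics` returns, and looks up the manually assigned labels. The topic ids, the `top_n` cutoff and the `min_weight` threshold are all hypothetical.

```python
def label_article(topic_weights, topic_labels, top_n=3, min_weight=0.1):
    """Return labels for an article's most strongly represented topics.

    topic_weights: list of (topic_id, weight) pairs for one article.
    topic_labels: dict mapping topic_id to a manually assigned label.
    Topics without a label, or below min_weight, are skipped.
    """
    ranked = sorted(topic_weights, key=lambda pair: pair[1], reverse=True)
    return [
        topic_labels[topic_id]
        for topic_id, weight in ranked[:top_n]
        if weight >= min_weight and topic_id in topic_labels
    ]


# Hypothetical topic ids; labels taken from the examples in this post.
topic_labels = {7: "sports", 12: "political promises", 40: "election process"}
# Hypothetical topic distribution for a single article.
weights = [(12, 0.55), (40, 0.25), (7, 0.05), (3, 0.15)]

labels = label_article(weights, topic_labels)
print(labels)
```

Skipping unlabeled topics reflects the fact that many of the 100 topics were too general or too specific to name usefully.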
Unfortunately most of the topics discovered by LDA, at least at my level of skill with the technique, were too general (e.g. “wildlife”) or too specific (e.g. “stampede during the Hajj”) to help identify editorial differences. Still, two stand out as noteworthy.
The most frequently occurring topic in Daily News and Habarileo, and second most frequent overall, was labeled “political promises”. It looks like this:
government, would, people, dr_magufuli, ensure, country, residents, water, promised, area
Many articles strongly represented by this topic have headlines like:
Lowassa: I’ll make Tanzania land of milk, honey
Dr Shein vows to uphold Union as CCM launches campaign in Zanzibar
Tanzanians like to see real development, says Magufuli
This is neat because LDA turned out to be good at unsupervised identification of political promises, and a little surprising because most Tanzanians don’t have faith in politicians’ promises. Apparently they still like to read about them.
Another interesting topic I labeled “election process”. Articles strongly represented by this topic have headlines like:
General Election campaigns should be smooth, peaceful
IGP warns parties’ security groups over ‘grabbing’ of powers of police
NEC working on problems in voter registration
This topic was among the top 10 for each publication, and 4th most common overall. There’s room for interpretation, but I think this shows a media focus on the procedures and mechanics of the election. It suggests a lively interest in the electoral process from a young democracy during its 5th multi-party general election.
Wrapping up
When I started this project I thought it would be an opportunity to learn more about algorithms from text analytics, but simple tools ended up being able to identify high-level trends. While the analysis surfaced some interesting features of the election coverage, overall topic coverage in English and Swahili publications seems similar. I don’t think it would be valuable to probe for more subtle differences with computational techniques alone.
Many academic studies of media bias use human labeling to supplement results from LDA and other machine learning approaches. If I were to take this project further, for example to examine what topic coverage is associated with the high counts of election-related words in Swahili, I would start by having human readers label articles. I’d only look to topic models or machine classification if trends were still unclear, or I wanted to try and generalize the results to new articles.
Data
Want to go deeper? Lonely on a Tuesday? The data’s online! You can download the original articles in JSON format or check out the scrapers on GitHub. The brave might peruse my idiosyncratic Python scripts for data cleaning, including a Jupyter notebook with the LDA experiments.
Thanks!
Ben Taylor for context, translation and thoughtful feedback. Josh Levens, Jessica Padron and Daniel Waistell for help with translation. Angela Ambroz, Mike Dewar, David Feldman, Kelly Hamblin, and Ashely Price for useful discussions. Kelly and Jennifer Hamblin contributed photos.