It has long been argued that women are under-represented and marginalised in relation to men in the world’s news media. In this research we analysed over two million articles to find out how gender is represented in online news. The study, which is the largest undertaken to date, found men’s views and voices are represented more in online news than women’s.
Modern AI, which is frequently in the news, is a powerful tool for supporting research, as it can automate tasks that would otherwise require an infeasible number of person-hours. It is now possible to recognise the gender of a face automatically with a remarkable level of accuracy, and also to detect references to people in online text, along with their gender.
We collected 2,353,652 news articles from over 950 news outlets that have a web presence. The articles span the six months from 19th October 2014, when collection of news images began, to 19th April 2015. For each news article in the corpus, we used automated methods to extract information from its text and image, annotating the article with topic categories, named person entities from the text and the gender of faces in the images.
To classify news articles into 12 topic categories, as defined by the editors of Reuters and the New York Times, we used Support Vector Machines (SVMs) trained for high precision on the well-known Reuters and New York Times news corpora, along with online linear perceptron models trained on news media within our modular architecture.
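To make the idea of an online linear perceptron concrete, here is a minimal sketch in Python. It is not the classifier used in the study: it is a binary model (one topic versus everything else) over bag-of-words features, and the class name `OnlinePerceptron`, the `tokenize` helper and the four toy headlines are all hypothetical, invented purely for illustration.

```python
def tokenize(text):
    """Lowercase whitespace tokenisation (a deliberately crude assumption)."""
    return text.lower().split()


class OnlinePerceptron:
    """Binary linear perceptron over bag-of-words features.

    Weights are updated one example at a time, and only when the
    current model misclassifies that example.
    """

    def __init__(self):
        self.weights = {}
        self.bias = 0.0

    def score(self, tokens):
        # Repeated tokens contribute once per occurrence.
        return self.bias + sum(self.weights.get(t, 0.0) for t in tokens)

    def predict(self, tokens):
        return 1 if self.score(tokens) > 0 else -1

    def update(self, tokens, label):
        """Perceptron rule: on a mistake, move the weights towards the label."""
        if label * self.score(tokens) <= 0:
            for t in tokens:
                self.weights[t] = self.weights.get(t, 0.0) + label
            self.bias += label


# Hypothetical toy data: +1 means "Sports", -1 means "not Sports".
examples = [
    ("the striker scored a late goal", 1),
    ("parliament passed the budget bill", -1),
    ("the team won the league match", 1),
    ("the minister announced a new policy", -1),
]

model = OnlinePerceptron()
for _ in range(5):  # a few passes over the toy stream
    for text, label in examples:
        model.update(tokenize(text), label)

print(model.predict(tokenize("the team scored a goal")))
print(model.predict(tokenize("the minister passed the bill")))
```

A 12-topic system could, for instance, run one such binary model per topic, although how the models were combined in the study's modular architecture is not described here.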
We extracted named person entities from the full text of the articles using the ANNIE plugin of GATE, followed by a series of steps aimed at improving the quality of co-reference resolution and gender classification. We performed face detection and gender classification in images using the Viola-Jones algorithm implemented in OpenCV.
Topic Gender Balance
In examining gender balance across the corpus, we focused on how the representation of men and women in the news varied with the topic category of the article, and on any differences between the main text of an article and the image associated with it.
Fig 1. Gender balance for each topic category. Fashion was excluded for readability, and would reside at (36.1,45.9). All topics above the diagonal have a higher probability of a face being female in an image than a person entity in the text being female.
We found that across all topic categories except Fashion, mentions of males dominated in written texts, with the probability of an entity being male ranging from 69.5% in Entertainment to 91.5% in Sports. The results were similar for images, where the probability of a face image being male ranged from 59.3% in Entertainment to 79.9% in Politics. Fashion was found to be the only topic category where mentions of females in the text, or images of female faces, were more likely than those of their male counterparts, with the probability of an entity being male in Fashion equalling 45.9%, while the probability of a face image being male was 36.1%.
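The "above the diagonal" reading of Fig 1 can be checked directly from the reported percentages. The sketch below uses only the two topic categories for which both the text and image figures are quoted above, and it assumes a binary male/female labelling, so that the female share is simply the complement of the male share. The function names are illustrative and do not come from the paper.

```python
# Reported probabilities (in %) that a face image or a text entity is male,
# as quoted above for the two topics where both values are given.
TOPICS = {
    "Fashion": {"face_male": 36.1, "entity_male": 45.9},
    "Entertainment": {"face_male": 59.3, "entity_male": 69.5},
}


def female_share(male_pct):
    """Complement of a male percentage (assumes binary male/female labels)."""
    return 100.0 - male_pct


def above_diagonal(face_male, entity_male):
    """True when a face is more likely to be female than a text entity is."""
    return female_share(face_male) > female_share(entity_male)


def summarise(topics):
    """Female shares in images and text, plus the diagonal test, per topic."""
    rows = {}
    for name, p in topics.items():
        rows[name] = {
            "female_in_images": round(female_share(p["face_male"]), 1),
            "female_in_text": round(female_share(p["entity_male"]), 1),
            "above_diagonal": above_diagonal(p["face_male"], p["entity_male"]),
        }
    return rows


summary = summarise(TOPICS)
for name, row in summary.items():
    print(name, row)
```

For both topics the female share is higher in images than in text, which is the pattern the title of the paper refers to.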
Outlet Gender Balance
We also wanted to investigate how the balance of males and females featured in the news varied with the outlet that published the article.
Fig 2. Gender balance for each of the 15 news outlets. All outlets above the diagonal have a higher probability of a face being female in an image than a person entity in the text being female. For one of the outlets displayed (“The Hindu”), the position might not reflect the actual probability of a given entity being male in textual data (vertical axis), because our named entity recogniser performed worse than average on that outlet. We have included it in the figure for completeness.
Among the top 15 news outlets for which we have data, Forbes was the least balanced in the entities mentioned in its text, with 81.0% being male, closely followed by the BBC at 80.9%. Fox News was the least balanced in its published images, with the probability of a face image being male equalling 76.5%. Of the 15 outlets, News.com.au was the closest to gender balance, with the probabilities of an entity and a face image being classified as male being 69.8% and 65.1% respectively. Further details, along with the other experiments performed, are reported in the paper.
Automated approaches that combine computer vision and natural language processing technologies with vast data samples make it possible, for the first time, to gather extensive empirical support, on a scale previously inconceivable, for long-standing claims around the marginalisation of women in the news. Facilitating analyses that would otherwise take years to complete, such technologies enable new forms of critical enquiry in this field of research.
Publication
- Women are seen more than heard in online newspapers by Sen Jia, Thomas Lansdall-Welfare, Saatviga Sudhahar, Cynthia Carter, Nello Cristianini in PLOS ONE.