Data Gathering & Exploratory Data Analysis
Data Gathering
To analyze gender representation in news coverage, articles were collected using NewsAPI, a web service that provides real-time access to news articles from a range of sources. NewsAPI lets users filter requests by publication source, date range, and keywords. It also has limitations: each request is capped at 500 articles and, in the free version, only articles published on or after January 2, 2025, were accessible. Given these constraints, Business Insider was chosen as the primary news source because of its considerable reach of 100 million users and its emphasis on "people-first" storytelling, which suggests a focus on human-centered narratives that may include gendered language. As a publication focused on business and the economy, Business Insider offers a relevant dataset for examining how gender is represented in discussions of finance, leadership, and workplace dynamics.
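The filters described above can be illustrated with a minimal sketch that only builds a request URL and does not call the service. It assumes NewsAPI's `/v2/everything` endpoint and its `sources`, `from`, `to`, `pageSize`, `page`, and `apiKey` query parameters; the helper name `build_query_url`, the source identifier `business-insider`, the date window, and the page size are illustrative assumptions rather than the settings used in this study.

```python
from urllib.parse import urlencode

# Assumed endpoint for full-archive search; not confirmed by the write-up.
NEWSAPI_ENDPOINT = "https://newsapi.org/v2/everything"


def build_query_url(api_key, source, start, end, page_size, page=1):
    """Hypothetical helper: assemble a filtered NewsAPI request URL."""
    params = {
        "sources": source,      # publication filter
        "from": start,          # earliest publication date (YYYY-MM-DD)
        "to": end,              # latest publication date (YYYY-MM-DD)
        "pageSize": page_size,  # placeholder; set to what the plan allows
        "page": page,
        "apiKey": api_key,
    }
    return f"{NEWSAPI_ENDPOINT}?{urlencode(params)}"


# Example: one illustrative week of Business Insider articles.
url = build_query_url(
    api_key="YOUR_API_KEY",
    source="business-insider",
    start="2025-01-02",
    end="2025-01-08",
    page_size=100,
)
print(url)
```

Keeping URL construction separate from the network call makes it easy to check the filters before spending any of the request quota.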
To build the dataset, Business Insider articles published in January 2025 were retrieved. Articles were requested one week at a time to maximize coverage while staying within the API limits. The responses were returned in JSON format and converted into data frames for analysis.
After extracting weekly batches, all four data frames were merged into a single consolidated dataset, combined_df. This dataset included the following columns: "author," "title," "description," "url," "urlToImage," "publishedAt," "content," "source.id," and "source.name."
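As a rough sketch of this step, assuming the work was done in Python with pandas (the write-up does not say which tool was used), each weekly JSON payload can be flattened with `pandas.json_normalize` and the results stacked with `pandas.concat`. The two payloads below are invented stand-ins for real API responses, and `articles_to_frame` is a hypothetical helper name.

```python
import pandas as pd


def make_payload(n):
    """Build an invented payload shaped like a NewsAPI response."""
    return {
        "status": "ok",
        "articles": [{
            "source": {"id": "business-insider", "name": "Business Insider"},
            "author": f"Author {n}",
            "title": f"Example title {n}",
            "description": f"Example description {n}",
            "url": f"https://example.com/{n}",
            "urlToImage": f"https://example.com/{n}.jpg",
            "publishedAt": f"2025-01-0{n}T12:00:00Z",
            "content": f"Example content {n}",
        }],
    }


def articles_to_frame(payload):
    """Flatten the 'articles' list; nested 'source' keys become
    'source.id' and 'source.name' columns."""
    return pd.json_normalize(payload["articles"])


# Two stand-in weekly batches (the study used four).
weekly_frames = [articles_to_frame(make_payload(n)) for n in (3, 9)]

# Stack the weekly frames and renumber the rows from zero.
combined_df = pd.concat(weekly_frames, ignore_index=True)
print(sorted(combined_df.columns))
```

Passing `ignore_index=True` avoids repeated row labels, which would otherwise restart at zero for every weekly batch.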
For this analysis, the data frame was filtered to retain only the most relevant columns: "author," "title," "description," and "content." No duplicate rows were found. Some entries contained missing values, and two rows with missing values were removed, leaving 1,939 articles in the cleaned dataset.
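The cleaning steps can be sketched on toy data as follows, again assuming pandas. The rows are invented, `clean_df` is a hypothetical name, and dropping a row when any retained column is missing is an assumption, since the text does not say which columns were checked.

```python
import pandas as pd

# Invented rows standing in for combined_df; one has missing content.
combined_df = pd.DataFrame({
    "author": ["A. Writer", "B. Reporter", "C. Editor"],
    "title": ["Title one", "Title two", "Title three"],
    "description": ["Desc one", "Desc two", "Desc three"],
    "url": ["https://example.com/1", "https://example.com/2",
            "https://example.com/3"],
    "content": ["Content one", None, "Content three"],
})

keep = ["author", "title", "description", "content"]
subset = combined_df[keep]

# Count exact duplicate rows before removing anything.
n_duplicates = int(subset.duplicated().sum())

# Assumption: a row is dropped if ANY retained column is missing.
clean_df = subset.drop_duplicates().dropna().reset_index(drop=True)

print(n_duplicates, len(clean_df))
```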
The next step involved tokenizing and vectorizing the "content" column to count word frequencies per article. Stop words were initially retained so that gendered terms, many of which appear in standard stop-word lists, would not be discarded. The following set of gendered words was identified for analysis: ['woman', 'women', 'she', 'her', 'he', 'female', 'him', 'his', 'male', 'men', 'man', 'they', 'theirs', 'them'].
All other stop words and numerical values were then removed to refine the dataset for gender-focused text analysis. Two separate data frames were created: words_df, which contains all remaining words for broader linguistic analysis, and gender_DF, which isolates only the gendered words for a focused examination of gender representation in news coverage.
The rest of the analysis focuses on articles that include gendered words, in order to understand how gender is represented in them.
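Selecting those articles could look like the following sketch, assuming pandas and a per-article count table shaped like gender_DF. The counts are invented, and `has_gendered` and `gendered_articles` are hypothetical names.

```python
import pandas as pd

# Invented per-article counts of gendered words (rows are articles).
gender_DF = pd.DataFrame({
    "she": [1, 0, 2],
    "he": [0, 0, 1],
    "women": [0, 0, 0],
})

# An article qualifies if it contains at least one gendered word.
has_gendered = gender_DF.sum(axis=1) > 0
gendered_articles = gender_DF[has_gendered]

# Share of articles that contain any gendered word.
share = has_gendered.mean()
print(len(gendered_articles), round(share, 3))
```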