EB-NeRD: Ekstra Bladet News Recommendation Dataset

A Large-Scale Dataset for News Recommendation

About

The Ekstra Bladet News Recommendation Dataset (EB-NeRD) was created to support advancements in news recommendation research. It was collected from user behavior logs at Ekstra Bladet and covers active users during the six weeks from April 27 to June 8, 2023. This timeframe was selected to avoid major events, e.g., holidays or elections, that could trigger atypical behavior at Ekstra Bladet. Active users were defined as users who had at least 5 and at most 1,000 news click records in the three-week period from May 18 to June 8, 2023. To protect user privacy, every user was delinked from the production system by securely hashing the user into an anonymized ID using a one-time salt mapping.

Alongside the behavior logs, we provide Danish news articles published by Ekstra Bladet. Each article is enriched with textual context features such as title, abstract, body, and categories, among others. Furthermore, we provide features generated by proprietary models, including topics, named entity recognition (NER), and article embeddings.
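The active-user filter and the anonymization step described above can be sketched as follows. This is only an illustration under stated assumptions, not Ekstra Bladet's actual pipeline: the function names, the choice of SHA-256, the salt length, and the toy click log are all hypothetical.

```python
import hashlib
import secrets
from collections import Counter

MIN_CLICKS, MAX_CLICKS = 5, 1_000  # thresholds stated in the text


def active_users(click_log):
    """Return the user IDs with between 5 and 1,000 clicks (inclusive).

    click_log is an iterable of (user_id, article_id) pairs, assumed to be
    already restricted to the three-week selection period.
    """
    counts = Counter(user_id for user_id, _ in click_log)
    return {u for u, n in counts.items() if MIN_CLICKS <= n <= MAX_CLICKS}


def anonymize(user_ids, salt=None):
    """Map each user ID to a salted SHA-256 digest (hypothetical scheme).

    If no salt is passed, a random one is generated and discarded after use,
    so the mapping cannot be recomputed from the original IDs later.
    """
    salt = salt if salt is not None else secrets.token_bytes(16)
    return {u: hashlib.sha256(salt + str(u).encode()).hexdigest() for u in user_ids}


# Toy click log: user "a" has 6 clicks, user "b" has only 2.
log = [("a", i) for i in range(6)] + [("b", i) for i in range(2)]
kept = active_users(log)
mapping = anonymize(kept)
print(kept)  # only user "a" passes the filter
```

Discarding the salt after a single pass is what makes the mapping one-way in this sketch; the released dataset exposes integer user IDs, so the real procedure evidently differs in its details.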

Dataset Format

Each dataset bundle (demo, small, and large) consists of a training set and a validation set, together with the articles (articles.parquet) present in the bundle. The official test set must be downloaded separately. Each data split has two files: 1) the behavior logs for the 7-day data split period (behaviors.parquet) and 2) the users' click histories (history.parquet), i.e., 21 days of clicked news articles prior to the data split's behavior logs. The click histories are fixed to the period prior to the behavior logs; i.e., they are not updated within the data split period.
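The relationship between the two windows of a data split can be illustrated with a short sketch. The split start date below is hypothetical (the actual split boundaries are not stated here), and treating the windows as half-open intervals is an assumption made for the example.

```python
from datetime import datetime, timedelta

HISTORY_DAYS = 21   # length of the click-history window
BEHAVIOR_DAYS = 7   # length of the behavior-log window


def split_windows(split_start):
    """Return (history_window, behavior_window) as half-open (start, end) pairs.

    The history window ends exactly where the behavior window begins, so the
    history is never updated with clicks from inside the split period.
    """
    history = (split_start - timedelta(days=HISTORY_DAYS), split_start)
    behaviors = (split_start, split_start + timedelta(days=BEHAVIOR_DAYS))
    return history, behaviors


# Hypothetical split start, chosen only for illustration.
history, behaviors = split_windows(datetime(2023, 5, 18))
print("history:  ", history[0].date(), "to", history[1].date())
print("behaviors:", behaviors[0].date(), "to", behaviors[1].date())
```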

| # | File Name | Description |
| --- | --- | --- |
| 1 | behaviors.parquet | Each file consists of seven days of impression logs. |
| 2 | history.parquet | Each file consists of users' click histories collected over a 21-day period. |
| 3 | articles.parquet | Information on the news articles. |
| 4 | artifacts.parquet | Article artifacts, such as article embeddings and image embeddings. |

Behaviors

The behaviors.parquet file contains the impression logs. The training and validation sets have exactly the same format, whereas some features are removed from the test set to avoid potential reverse engineering. These features are Article ID, Next Readtime, Next Scroll Percentage, and Clicked Article IDs. Furthermore, to support beyond-accuracy computations, the test set includes 200,000 samples for this purpose. Hence, the test set has an extra column called Is Beyond Accuracy.

| # | Column | Context | Example | dtype |
| --- | --- | --- | --- | --- |
| 1 | Impression ID | The ID of an impression. | 153 | u32 |
| 2 | User ID | The anonymized user ID. | 44038 | u32 |
| 3 | Article ID | The unique ID of a news article. An empty field means the impression is from the front page. | 9650148 | i32 |
| 4 | Session ID | A unique identifier for a user's browsing session. | 1153 | u32 |
| 5 | Inview Article IDs | List of inview news articles in the impression (news articles that were registered as seen by the user). The order of the IDs has been shuffled. | [9649538, 9649689, …, 9649569] | list[i32] |
| 6 | Clicked Article IDs | List of news articles clicked in the impression. | [9649689] | list[i32] |
| 7 | Time | The impression timestamp. The format is "YYYY-MM-DD HH:MM:SS". | 2023-02-25 06:41:40 | datetime[μs] |
| 8 | Readtime | The number of seconds a user spends on a given page. | 14.0 | f32 |
| 9 | Scroll Percentage | The percentage of an article that a user scrolls through, indicating how much of the content was potentially viewed. | 100.0 | f32 |
| 10 | Device Type | The type of device used to access the content: desktop (1), mobile (2), tablet (3), or unknown (0). | 1 | i8 |
| 11 | SSO Status | Indicates whether a user is logged in through Single Sign-On (SSO) authentication. | True | bool |
| 12 | Subscription Status | Indicates whether the user is a paid subscriber. Note that the subscription status is fixed throughout the period and was set when the dataset was created. | True | bool |
| 13 | Gender | The gender of the user, either Male (0) or Female (1), as specified in their profile. | null | i8 |
| 14 | Postcode | The user's postcode, aggregated at the district level as specified in their profile: metropolitan (0), rural district (1), municipality (2), provincial (3), or big city (4). | 2 | i8 |
| 15 | Age | The age of the user, as specified in their profile, categorized into bins of 10 years (e.g., 20-29, 30-39, etc.). | 50 | i8 |
| 16 | Next Readtime | The time a user spends on the next clicked article, i.e., the article in Clicked Article IDs. | 8.0 | f32 |
| 17 | Next Scroll Percentage | The scroll percentage for a user's next article interaction, i.e., the article in Clicked Article IDs. | 41.0 | f32 |
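A common way to use an impression for training is to treat every inview article as a candidate and mark the clicked ones as positives. The sketch below shows that idea on a toy impression; the function name and the dictionary keys are hypothetical and are not necessarily the column names used in the parquet files.

```python
def label_candidates(inview_ids, clicked_ids):
    """Pair every inview article with a binary click label (1 = clicked)."""
    clicked = set(clicked_ids)
    return [(article_id, int(article_id in clicked)) for article_id in inview_ids]


# Toy impression using IDs in the style of the examples in the table.
impression = {
    "inview_article_ids": [9649538, 9649689, 9649569],
    "clicked_article_ids": [9649689],
}
pairs = label_candidates(
    impression["inview_article_ids"], impression["clicked_article_ids"]
)
print(pairs)
```

Because the order of the inview IDs has been shuffled, the position of a candidate in the list carries no information about where it appeared on the page.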

History

The history.parquet file contains the click histories of users.

| # | Column | Context | Example | dtype |
| --- | --- | --- | --- | --- |
| 1 | User ID | The anonymized user ID. | 44038 | u32 |
| 2 | Article IDs | The articles clicked by the user. | [9618533, … 9646154] | list[i32] |
| 3 | Timestamps | The timestamps of when the articles were clicked. The format is "YYYY-MM-DD HH:MM:SS". | [2023-02-02 16:37:42, … 2023-02-22 18:28:38] | list[datetime[μs]] |
| 4 | Read times | The read times of the clicked articles. | [425.0, … 12.0] | list[f32] |
| 5 | Scroll Percentages | The scroll percentages of the clicked articles. | [null, … 100.0] | list[f32] |
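Each history row stores its clicks as parallel lists. Assuming the lists are aligned by position (the i-th timestamp, read time, and scroll percentage belong to the i-th article ID), they can be zipped into one record per click. The function and key names below are hypothetical, and the timestamps are kept as plain strings for simplicity.

```python
def click_records(article_ids, timestamps, read_times, scroll_percentages):
    """Zip the parallel history lists into one dict per click."""
    lengths = {
        len(article_ids), len(timestamps), len(read_times), len(scroll_percentages)
    }
    if len(lengths) != 1:
        raise ValueError("history lists must all have the same length")
    return [
        {"article_id": a, "time": t, "read_time": r, "scroll": s}
        for a, t, r, s in zip(article_ids, timestamps, read_times, scroll_percentages)
    ]


# Toy history row built from the first and last values in the table examples.
records = click_records(
    [9618533, 9646154],
    ["2023-02-02 16:37:42", "2023-02-22 18:28:38"],
    [425.0, 12.0],
    [None, 100.0],
)

# Scroll percentage can be null, so skip missing values when averaging.
scrolls = [r["scroll"] for r in records if r["scroll"] is not None]
mean_scroll = sum(scrolls) / len(scrolls) if scrolls else None
print(mean_scroll)
```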

Articles

The articles.parquet file contains the detailed information of news articles.

| # | Column | Context | Example | dtype |
| --- | --- | --- | --- | --- |
| 1 | Article ID | The unique ID of a news article. | 8987932 | i32 |
| 2 | Title | The article's Danish title. | Se billederne: Zlatans paradis til salg | str |
| 3 | Subtitle | The article's Danish subtitle/abstract. | Zlatan Ibrahimovic har sat sin skihytte i Åre til salg, men prisen skal nok afskrække en del. (...) | str |
| 4 | Body | The article's full Danish text body. | Drømmer du om en eksklusiv skihytte i Sverige? Så har Zlatan Ibrahimovic et eksklusivt tilbud til dig (...) | str |
| 5 | Category ID | The category ID. | 142 | i16 |
| 6 | Category String | The category as a string. | sport | str |
| 7 | Subcategory IDs | The subcategory IDs. | [196, 271] | list[i16] |
| 8 | Premium | Whether the content is behind a paywall. | False | bool |
| 9 | Time Published | The time the article was published. The format is "YYYY-MM-DD HH:MM:SS". | 2021-11-15 03:56:56 | datetime[μs] |
| 10 | Time Modified | The timestamp of the last modification of the article, e.g., updates as the story evolves or spelling corrections. The format is "YYYY-MM-DD HH:MM:SS". | 2023-06-29 06:38:41 | datetime[μs] |
| 11 | Image IDs | The image IDs used in the article. | [8988118] | list[i64] |
| 12 | Article Type | The type of article, such as a feature, gallery, video, or live blog. | article_default | str |
| 13 | URL | The article's URL. | https://ekstrabladet.dk/.../8987932 | str |
| 14 | NER | Tags from a proprietary named-entity-recognition model at Ekstra Bladet, based on the concatenated title, abstract, and body. | ['Aftonbladet', 'Åre', 'Bjurfors', 'Cecilia Edfeldt Jigstedt', 'Helena', 'Sverige', 'Zlatan Ibrahimovic'] | list[str] |
| 15 | Entities | Tags from a proprietary entity-recognition model at Ekstra Bladet, based on the concatenated title, abstract, and body. | ['ORG', 'LOC', 'ORG', 'PER', 'PER', 'LOC', 'PER'] | list[str] |
| 16 | Topics | Tags from a proprietary topic-recognition model at Ekstra Bladet, based on the concatenated title, abstract, and body. | [] | list[str] |
| 17 | Total Inviews | The total number of times an article has been inview (registered as seen) by users within the first 7 days after it was published. This feature only applies to articles that were published after February 16, 2023. | null | i32 |
| 18 | Total Pageviews | The total number of times an article has been clicked by users within the first 7 days after it was published. This feature only applies to articles that were published after February 16, 2023. | null | i32 |
| 19 | Total Read Time | The accumulated read time of an article within the first 7 days after it was published. This feature only applies to articles that were published after February 16, 2023. | null | f32 |
| 20 | Sentiment Score | The sentiment score from a proprietary sentiment model at Ekstra Bladet, based on the concatenated title and abstract. | 0.5299 | f32 |
| 21 | Sentiment Label | The sentiment label assigned by a proprietary sentiment model at Ekstra Bladet, based on the concatenated title and abstract. The labels are positive, neutral, and negative. | Neutral | str |
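The NER and Entities examples in the table have the same length, which suggests that the i-th entity type describes the i-th NER tag. Under that assumption (it is not stated explicitly), the two lists can be combined into a per-type lookup. The function name is hypothetical.

```python
from collections import defaultdict


def group_entities(ner, entities):
    """Group NER tags by entity type, assuming positional alignment."""
    grouped = defaultdict(list)
    for name, label in zip(ner, entities):
        grouped[label].append(name)
    return dict(grouped)


# Example values copied from the NER and Entities rows of the table.
ner = ["Aftonbladet", "Åre", "Bjurfors", "Cecilia Edfeldt Jigstedt",
       "Helena", "Sverige", "Zlatan Ibrahimovic"]
entities = ["ORG", "LOC", "ORG", "PER", "PER", "LOC", "PER"]

by_type = group_entities(ner, entities)
print(by_type["PER"])
```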

Artifacts

To help users get started quickly with EB-NeRD, the dataset includes embedding artifacts: textual representations of the articles and encoded thumbnail images. The textual representations are based on the title, subtitle, and body. We provide three representations, namely multilingual BERT, RoBERTa, and a proprietary contrastive-based model. We also provide scripts to generate your own document embeddings using Hugging Face models (link coming soon!).

The artifacts follow this format:

| # | Column | Context | Example | dtype |
| --- | --- | --- | --- | --- |
| 1 | Article ID | An article embedding, i.e., a continuous representation of the article. | [0.062, 0.040, …, 0.061] | list[f32] |
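One typical use of such embeddings is to compare articles by cosine similarity. The sketch below uses only the standard library and made-up three-dimensional vectors; the real embeddings are much longer, and the article IDs and values here are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up embeddings keyed by hypothetical article IDs.
embeddings = {
    1: [0.062, 0.040, 0.061],
    2: [0.060, 0.045, 0.058],
    3: [-0.050, 0.010, -0.070],
}

sim_close = cosine_similarity(embeddings[1], embeddings[2])
sim_far = cosine_similarity(embeddings[1], embeddings[3])
print(sim_close > sim_far)
```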