About
The Ekstra Bladet News Recommendation Dataset (EB-NeRD) was created to support advancements in news recommendation research. It was collected from user behavior logs at Ekstra Bladet, covering active users during the six weeks from April 27 to June 8, 2023. This timeframe was selected to avoid major events, e.g., holidays or elections, that could trigger atypical behavior at Ekstra Bladet. Active users were defined as users who had at least 5 and at most 1,000 news click records in the three-week period from May 18 to June 8, 2023. To protect user privacy, every user was delinked from the production system by securely hashing their identity into an anonymized ID using a one-time salt mapping. In addition, we provide Danish news articles published by Ekstra Bladet. Each article is enriched with textual context features such as title, abstract, body, and categories, among others. Furthermore, we provide features that have been generated by proprietary models, including topics, named entity recognition (NER), and article embeddings.
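The two preprocessing rules described above (the 5 to 1,000 click activity filter and the salted anonymization) can be sketched as follows. This is only an illustration, not the pipeline used at Ekstra Bladet: the function names, the choice of SHA-256, and the toy click log are all assumptions, and the released dataset exposes integer user IDs rather than hex digests.

```python
import hashlib
import secrets
from collections import Counter


def select_active_users(click_user_ids, min_clicks=5, max_clicks=1000):
    """Return the user IDs whose click count lies in [min_clicks, max_clicks]."""
    counts = Counter(click_user_ids)
    return {user for user, n in counts.items() if min_clicks <= n <= max_clicks}


def anonymize(user_ids, salt):
    """Map each raw user ID to a salted SHA-256 digest (hypothetical scheme)."""
    return {
        user: hashlib.sha256(salt + str(user).encode("utf-8")).hexdigest()
        for user in user_ids
    }


# Toy click log (assumed data): one entry per click, holding the raw user ID.
clicks = ["alice"] * 6 + ["bob"] * 2 + ["carol"] * 1200
active = select_active_users(clicks)

# One-time salt: generated once, used for the mapping, then discarded,
# so the digests cannot be recomputed from the raw IDs later.
salt = secrets.token_bytes(16)
pseudonyms = anonymize(active, salt)
print(active)
```

In this toy log only "alice" falls inside the activity bounds; "bob" has too few clicks and "carol" too many.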
Dataset Format
Each dataset bundle (demo, small, and large) consists of a training set and a validation set, together with the articles (articles.parquet) present in the bundle. The official test set must be downloaded separately. Each data split has two files: 1) the behavior logs for the 7-day data split period (behaviors.parquet) and 2) the users' click histories (history.parquet), i.e., 21 days of clicked news articles prior to the data split's behavior logs. The click histories are fixed to the period prior to the behavior logs; i.e., they are not updated within the data split period.
# | File Name | Description |
---|---|---|
1 | behaviors.parquet | Each file consists of seven days of impression logs. |
2 | history.parquet | Each file consists of users' click histories collected over a 21-day period. |
3 | articles.parquet | The information on news articles. |
4 | artifacts.parquet | Article artifacts, such as article embeddings and image embeddings. |
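
To make the relationship between the two per-split files concrete, the sketch below computes the 21-day history window and the 7-day behavior window for a split. The split start date is a hypothetical value chosen for illustration; the source does not state the start date of any particular split.

```python
from datetime import datetime, timedelta

HISTORY_DAYS = 21   # length of the click-history window (history.parquet)
BEHAVIOR_DAYS = 7   # length of the impression-log window (behaviors.parquet)


def split_windows(behavior_start):
    """Return (history_window, behavior_window) as half-open [start, end) pairs."""
    history_start = behavior_start - timedelta(days=HISTORY_DAYS)
    behavior_end = behavior_start + timedelta(days=BEHAVIOR_DAYS)
    return (history_start, behavior_start), (behavior_start, behavior_end)


# Hypothetical split start date (an assumption, not taken from the dataset).
start = datetime(2023, 5, 18)
history_window, behavior_window = split_windows(start)

print("history  :", history_window[0].date(), "to", history_window[1].date())
print("behaviors:", behavior_window[0].date(), "to", behavior_window[1].date())
```

Because the history window ends where the behavior window begins, clicks made during the behavior period never appear in that split's history.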
Behaviors
The behaviors.parquet file contains the impression logs. The training and validation sets have exactly the same format, whereas some features are removed from the test set to avoid potential reverse engineering. These features are Article ID, Next Readtime, Next Scroll Percentage, and Clicked Article IDs. Furthermore, to support beyond-accuracy computations, the test set includes 200,000 samples for that purpose. Hence, the test set has an extra column called Is Beyond Accuracy.
# | Column | Context | Example | dtype |
---|---|---|---|---|
1 | Impression ID | The ID of an impression. | 153 | u32 |
2 | User ID | The anonymized user ID. | 44038 | u32 |
3 | Article ID | The unique ID of a news article. An empty field means the impression is from the front page. | 9650148 | i32 |
4 | Session ID | A unique identifier for a user's browsing session. | 1153 | u32 |
5 | Inview Article IDs | List of inview news articles in the impression (news articles that were registered as seen by the user). The order of the IDs has been shuffled. | [9649538, 9649689, …, 9649569] | list[i32] |
6 | Clicked Article IDs | List of news articles clicked in the impression. | [9649689] | list[i32] |
7 | Time | The impression timestamp. The format is "YYYY-MM-DD HH:MM:SS". | 2023-02-25 06:41:40 | datetime[μs] |
8 | Readtime | The number of seconds a user spends on a given page. | 14.0 | f32 |
9 | Scroll Percentage | The percentage of an article that a user scrolls through, indicating how much of the content was potentially viewed. | 100.0 | f32 |
10 | Device Type | The type of device used to access the content: desktop (1), mobile (2), tablet (3), or unknown (0). | 1 | i8 |
11 | SSO Status | Indicates whether a user is logged in through Single Sign-On (SSO) authentication. | True | bool |
12 | Subscription Status | Indicates whether the user is a paid subscriber. Note that the subscription status is fixed throughout the period and was set when the dataset was created. | True | bool |
13 | Gender | The gender of the user, either Male (0) or Female (1), as specified in their profile. | null | i8 |
14 | Postcode | The user's postcode, aggregated at the district level as specified in their profile, with metropolitan (0), rural district (1), municipality (2), provincial (3), big city (4). | 2 | i8 |
15 | Age | The age of the user, as specified in their profile, categorized into bins of 10 years (e.g., 20-29, 30-39 etc.). | 50 | i8 |
16 | Next Readtime | The time a user spends on the next clicked article, i.e., the article in Clicked Article IDs. | 8.0 | f32 |
17 | Next Scroll Percentage | The scroll percentage for a user's next article interaction, i.e., the article in Clicked Article IDs. | 41.0 | f32 |
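
A common way to use an impression is to turn Inview Article IDs and Clicked Article IDs into one binary label per inview article. The sketch below shows this on a toy impression; the dictionary keys are illustrative names, not confirmed column names from the parquet files, and the helper function is hypothetical.

```python
def binary_labels(inview_ids, clicked_ids):
    """Return 1 for each inview article that was clicked, else 0 (same order as inview_ids)."""
    clicked = set(clicked_ids)
    return [1 if article_id in clicked else 0 for article_id in inview_ids]


# Toy impression (assumed data) shaped like one row of the behaviors table.
impression = {
    "impression_id": 153,
    "inview_article_ids": [9649538, 9649689, 9649569],
    "clicked_article_ids": [9649689],
}

labels = binary_labels(
    impression["inview_article_ids"], impression["clicked_article_ids"]
)
print(labels)  # [0, 1, 0]
```

Since the inview order is shuffled, the position of the clicked article carries no information; only the label does.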
History
The history.parquet file contains the click histories of users.
# | Column | Context | Example | dtype |
---|---|---|---|---|
1 | User ID | The anonymized user ID. | 44038 | u32 |
2 | Article IDs | The articles clicked by the user. | [9618533, … 9646154] | list[i32] |
3 | Timestamps | The timestamps of when the articles were clicked. The format is "YYYY-MM-DD HH:MM:SS". | [2023-02-02 16:37:42, … 2023-02-22 18:28:38] | list[datetime[μs]] |
4 | Read times | The read times of the clicked articles. | [425.0, … 12.0] | list[f32] |
5 | Scroll Percentages | The scroll percentages of the clicked articles. | [null, … 100.0] | list[f32] |
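
Each history row stores four parallel lists, so the i-th entry of every list appears to describe the same click. The sketch below expands one toy row into per-click records and handles null scroll percentages. The dictionary keys and the helper name are assumptions made for illustration.

```python
from datetime import datetime


def explode_history(row):
    """Expand the parallel lists of one history row into a list of per-click dicts."""
    columns = ("article_ids", "timestamps", "read_times", "scroll_percentages")
    lengths = {len(row[name]) for name in columns}
    if len(lengths) != 1:
        raise ValueError("history lists must all have the same length")
    return [
        {"article_id": a, "timestamp": t, "read_time": r, "scroll_percentage": s}
        for a, t, r, s in zip(*(row[name] for name in columns))
    ]


# Toy history row (assumed data) with two clicks; None stands for a null value.
history_row = {
    "user_id": 44038,
    "article_ids": [9618533, 9646154],
    "timestamps": [datetime(2023, 2, 2, 16, 37, 42), datetime(2023, 2, 22, 18, 28, 38)],
    "read_times": [425.0, 12.0],
    "scroll_percentages": [None, 100.0],
}

clicks = explode_history(history_row)

# Average scroll percentage over the clicks where it was recorded.
known = [c["scroll_percentage"] for c in clicks if c["scroll_percentage"] is not None]
mean_scroll = sum(known) / len(known) if known else None
print(len(clicks), mean_scroll)
```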
Articles
The articles.parquet file contains the detailed information of news articles.
# | Column | Context | Example | dtype |
---|---|---|---|---|
1 | Article ID | The unique ID of a news article. | 8987932 | i32 |
2 | Title | The article's Danish title. | Se billederne: Zlatans paradis til salg | str |
3 | Subtitle | The article's Danish subtitle/abstract. | Zlatan Ibrahimovic har sat sin skihytte i Åre til salg, men prisen skal nok afskrække en del. (...) | str |
4 | Body | The article's full Danish text body. | Drømmer du om en eksklusiv skihytte i Sverige? Så har Zlatan Ibrahimovic et eksklusivt tilbud til dig (...) | str |
5 | Category ID | The category ID. | 142 | i16 |
6 | Category String | The category as a string. | sport | str |
7 | Subcategory IDs | The subcategory IDs. | [196, 271] | list[i16] |
8 | Premium | Whether the content is behind a paywall. | False | bool |
9 | Time Published | The time the article was published. The format is "YYYY-MM-DD HH:MM:SS". | 2021-11-15 03:56:56 | datetime[μs] |
10 | Time Modified | The timestamp for the last modification of the article, e.g., updates as the story evolves or spelling corrections. The format is "YYYY-MM-DD HH:MM:SS". | 2023-06-29 06:38:41 | datetime[μs] |
11 | Image IDs | The image IDs used in the article. | [8988118] | list[i64] |
12 | Article Type | The type of article, such as a feature, gallery, video, or live blog. | article_default | str |
13 | URL | The article's URL. | https://ekstrabladet.dk/.../8987932 | str |
14 | NER | The tags retrieved from a proprietary named-entity-recognition model at Ekstra Bladet, based on the concatenated title, abstract, and body. | ['Aftonbladet', 'Åre', 'Bjurfors', 'Cecilia Edfeldt Jigstedt', 'Helena', 'Sverige', 'Zlatan Ibrahimovic'] | list[str] |
15 | Entities | The tags retrieved from a proprietary entity-recognition model at Ekstra Bladet, based on the concatenated title, abstract, and body. | ['ORG', 'LOC', 'ORG', 'PER', 'PER', 'LOC', 'PER'] | list[str] |
16 | Topics | The tags retrieved from a proprietary topic-recognition model at Ekstra Bladet, based on the concatenated title, abstract, and body. | [] | list[str] |
17 | Total Inviews | The total number of times an article has been inview (registered as seen) by users within the first 7 days after it was published. This feature only applies to articles that were published after February 16, 2023. | null | i32 |
18 | Total Pageviews | The total number of times an article has been clicked by users within the first 7 days after it was published. This feature only applies to articles that were published after February 16, 2023. | null | i32 |
19 | Total Read Time | The accumulated read time of an article within the first 7 days after it was published. This feature only applies to articles that were published after February 16, 2023. | null | f32 |
20 | Sentiment Score | The sentiment score from a proprietary sentiment model at Ekstra Bladet, based on the concatenated title and abstract. | 0.5299 | f32 |
21 | Sentiment Label | The sentiment label assigned by a proprietary sentiment model at Ekstra Bladet, based on the concatenated title and abstract. The labels are positive, neutral, and negative. | Neutral | str |
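
In the example row, the NER and Entities lists have the same length, which suggests that the i-th entity label describes the i-th NER tag. Under that assumption (it is not stated explicitly in the table), the two columns can be combined as sketched below; the helper name is hypothetical.

```python
from collections import defaultdict


def group_entities_by_type(ner, entities):
    """Group NER tags by their entity label, assuming the two lists are aligned."""
    if len(ner) != len(entities):
        raise ValueError("NER and Entities lists must have the same length")
    grouped = defaultdict(list)
    for name, label in zip(ner, entities):
        grouped[label].append(name)
    return dict(grouped)


# Values copied from the example row of the articles table.
ner = ["Aftonbladet", "Åre", "Bjurfors", "Cecilia Edfeldt Jigstedt",
       "Helena", "Sverige", "Zlatan Ibrahimovic"]
entities = ["ORG", "LOC", "ORG", "PER", "PER", "LOC", "PER"]

by_type = group_entities_by_type(ner, entities)
print(by_type["PER"])  # ['Cecilia Edfeldt Jigstedt', 'Helena', 'Zlatan Ibrahimovic']
```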
Artifacts
To help users get started quickly with EB-NeRD, the dataset features embedding artifacts. These include textual representations of the articles and encoded thumbnail images. The textual representations are based on the title, subtitle, and body. We provide three representations: multilingual BERT, RoBERTa, and a proprietary contrastive-based model. We also provide scripts to generate your own document embeddings using Hugging Face models (link coming soon!).
The artifacts follow this format:
# | Column | Context | Example | dtype |
---|---|---|---|---|
1 | Article ID | An article embedding, i.e., a continuous representation of the article. | [0.062, 0.040, …, 0.061] | list[f32] |
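
One simple way to use the article embeddings is to average the embeddings of a user's clicked articles and rank candidate articles by cosine similarity to that average. The sketch below illustrates the idea with two-dimensional toy vectors; the article IDs, the vectors, and the helper names are made up, and this is a baseline illustration rather than a method prescribed by the dataset.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(values) / n for values in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy embeddings keyed by hypothetical article IDs (real vectors are much longer).
embeddings = {101: [1.0, 0.0], 102: [0.0, 1.0], 103: [1.0, 1.0]}

# User profile: mean embedding of the articles in the click history.
user_vector = mean_vector([embeddings[101], embeddings[102]])

# Score every candidate article against the user profile and pick the best one.
scores = {aid: cosine_similarity(user_vector, vec) for aid, vec in embeddings.items()}
best = max(scores, key=scores.get)
print(best)
```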