About

The Ekstra Bladet News Recommendation Dataset (EB-NeRD) was created to support advancements in news recommendation research. It was collected from user behavior logs at Ekstra Bladet, covering active users during the six weeks from April 27 to June 8, 2023. This timeframe was selected to avoid major events, e.g., holidays or elections, that could trigger atypical behavior at Ekstra Bladet. Active users were defined as users who had at least 5 and at most 1,000 news click records in the three-week period from May 18 to June 8, 2023. To protect user privacy, every user was delinked from the production system by securely hashing their ID into an anonymized ID using a one-time salt mapping.

Alongside the behavior logs, we provide Danish news articles published by Ekstra Bladet. Each article is enriched with textual context features such as title, abstract, body, and categories, among others. Furthermore, we provide features that have been generated by proprietary models, including topics, named entity recognition (NER), and article embeddings.
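The active-user rule and the salted anonymization described above can be illustrated with a minimal sketch. It is not the pipeline used at Ekstra Bladet: the `(user_id, article_id)` click-log layout, the function names, and the choice of SHA-256 are assumptions made for illustration, and the released dataset stores anonymized IDs as integers rather than hex digests.

```python
import hashlib
import secrets
from collections import Counter

MIN_CLICKS, MAX_CLICKS = 5, 1_000  # bounds stated in the text above


def active_users(click_log):
    """Return the users with between MIN_CLICKS and MAX_CLICKS clicks.

    click_log is an iterable of (user_id, article_id) pairs, a hypothetical
    stand-in for the raw click records.
    """
    counts = Counter(user for user, _ in click_log)
    return {u for u, n in counts.items() if MIN_CLICKS <= n <= MAX_CLICKS}


def anonymize(user_ids):
    """Map each user ID to a salted SHA-256 digest (assumed scheme).

    The salt is generated once per call and never returned, so the mapping
    cannot be recomputed from the original IDs afterwards.
    """
    salt = secrets.token_bytes(16)
    return {
        u: hashlib.sha256(salt + str(u).encode("utf-8")).hexdigest()
        for u in user_ids
    }


# Toy click log: user "a" has 6 clicks, user "b" has only 2.
log = [("a", i) for i in range(6)] + [("b", i) for i in range(2)]
kept = active_users(log)
mapping = anonymize(kept)
print(kept)
```

Because the salt is discarded, calling `anonymize` twice yields different digests for the same user, which is the property that delinks the released IDs from the production system.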

Dataset Format

Each dataset bundle (demo, small, and large) consists of a training set and a validation set, together with the articles (articles.parquet) present in the bundle. The official test set must be downloaded separately. Each data split has two files: 1) the behavior logs for the 7-day data split period (behaviors.parquet) and 2) the users' click histories (history.parquet), i.e., 21 days of clicked news articles prior to the data split's behavior logs. The click histories are fixed to the period prior to the behavior logs; i.e., they are not updated within the data split period.

#	File Name	Description
1	behaviors.parquet	Each file consists of seven days of impression logs.
2	history.parquet	Each file consists of users' click histories collected over a 21-day period.
3	articles.parquet	The information on news articles.
4	artifacts.parquet	Article artifacts, such as article embeddings and image embeddings.
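The relationship between the 21-day history window and the 7-day behavior window of a data split can be sketched as follows. The function names and the start date are hypothetical and chosen only for illustration; the sketch also assumes half-open intervals, which the documentation does not specify.

```python
from datetime import datetime, timedelta

HISTORY_DAYS = 21   # length of the click-history window
BEHAVIOR_DAYS = 7   # length of the behavior-log window


def split_windows(behavior_start):
    """Return ((history_start, history_end), (behavior_start, behavior_end)).

    Both windows are treated as half-open intervals [start, end); the history
    window ends exactly where the behavior window begins.
    """
    history = (behavior_start - timedelta(days=HISTORY_DAYS), behavior_start)
    behavior = (behavior_start, behavior_start + timedelta(days=BEHAVIOR_DAYS))
    return history, behavior


def in_window(ts, window):
    """True if timestamp ts falls inside the half-open window."""
    start, end = window
    return start <= ts < end


# Hypothetical split start, not necessarily an actual split boundary.
history, behavior = split_windows(datetime(2023, 5, 18))
print(history)
print(behavior)
```

Since the history is fixed to the period before the behavior logs, a click made during the behavior window never appears in that split's history.parquet.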

Behaviors

The behaviors.parquet file contains the impression logs. The training and validation sets have exactly the same format, whereas some features are removed from the test set to avoid potential reverse engineering. These features are Article ID, Next Readtime, Next Scroll Percentage, and Clicked Article IDs. Furthermore, to support beyond-accuracy computations, we have included 200,000 samples. Hence, the test set has an extra column called Is Beyond Accuracy.

#	Column	Context	Example	dtype
1	Impression ID	The ID of an impression.	153	u32
2	User ID	The anonymized user ID.	44038	u32
3	Article ID	The unique ID of a news article. An empty field means the impression is from the front page.	9650148	i32
4	Session ID	A unique identifier for a user's browsing session.	1153	u32
5	Inview Article IDs	List of inview news articles in the impression (news articles that were registered as seen by the user). The order of the IDs has been shuffled.	[9649538, 9649689, …, 9649569]	list[i32]
6	Clicked Article IDs	List of news articles clicked in the impression.	[9649689]	list[i32]
7	Time	The impression timestamp. The format is "YYYY-MM-DD HH:MM:SS".	2023-02-25 06:41:40	datetime[μs]
8	Readtime	The number of seconds a user spends on a given page.	14.0	f32
9	Scroll Percentage	The percentage of an article that a user scrolls through, indicating how much of the content was potentially viewed.	100.0	f32
10	Device Type	The type of device used to access the content, such as desktop (1), mobile (2), tablet (3), or unknown (0).	1	i8
11	SSO Status	Indicates whether a user is logged in through Single Sign-On (SSO) authentication.	True	bool
12	Subscription Status	The user's subscription status indicates whether they are a paid subscriber. Note that the subscription is fixed throughout the period and was set when the dataset was created.	True	bool
13	Gender	The gender of the user, either Male (0) or Female (1), as specified in their profile.	null	i8
14	Postcode	The user's postcode, aggregated at the district level as specified in their profile, with metropolitan (0), rural district (1), municipality (2), provincial (3), big city (4).	2	i8
15	Age	The age of the user, as specified in their profile, categorized into bins of 10 years (e.g., 20-29, 30-39 etc.).	50	i8
16	Next Readtime	The time a user spends on the next clicked article, i.e., the article in Clicked Article IDs.	8.0	f32
17	Next Scroll Percentage	The scroll percentage for a user's next article interaction, i.e., the article in Clicked Article IDs.	41.0	f32
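A common way to use an impression is to turn its inview and clicked lists into binary click labels. The sketch below assumes plain Python dictionaries with hypothetical key names (`inview_article_ids`, `clicked_article_ids`, `device_type`); the actual parquet column names may differ, and the article IDs are toy values.

```python
# Device type codes as listed in the table above.
DEVICE_TYPES = {0: "unknown", 1: "desktop", 2: "mobile", 3: "tablet"}


def label_impression(inview_ids, clicked_ids):
    """Return (article_id, label) pairs, with label 1 if the article was clicked."""
    clicked = set(clicked_ids)
    return [(article_id, int(article_id in clicked)) for article_id in inview_ids]


# Toy impression with made-up IDs; the keys are hypothetical names.
impression = {
    "inview_article_ids": [101, 102, 103],
    "clicked_article_ids": [102],
    "device_type": 1,
}

samples = label_impression(
    impression["inview_article_ids"], impression["clicked_article_ids"]
)
device = DEVICE_TYPES[impression["device_type"]]
print(samples)
print(device)
```

Because the order of the inview IDs has been shuffled, the position of an article in the list carries no information about where it was shown on the page.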

History

The history.parquet file contains the click histories of users.

#	Column	Context	Example	dtype
1	User ID	The anonymized user ID.	44038	u32
2	Article IDs	The articles clicked by the user.	[9618533, … 9646154]	list[i32]
3	Timestamps	The timestamps of when the articles were clicked. The format is "YYYY-MM-DD HH:MM:SS".	[2023-02-02 16:37:42, … 2023-02-22 18:28:38]	list[datetime[μs]]
4	Read times	The read times of the clicked articles.	[425.0, … 12.0]	list[f32]
5	Scroll Percentages	The scroll percentages of the clicked articles.	[null, … 100.0]	list[f32]
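Each history row stores its clicks as parallel lists, so it is often convenient to unpack a row into one record per click. The following sketch assumes a row represented as a dictionary with hypothetical key names, and it assumes that the four lists are aligned by position. The toy row uses only the first and last entries of the example row above.

```python
from datetime import datetime


def unpack_history(row):
    """Turn one history row (parallel lists) into a list of per-click dicts."""
    columns = (
        row["article_ids"],
        row["timestamps"],
        row["read_times"],
        row["scroll_percentages"],
    )
    if len({len(c) for c in columns}) != 1:
        raise ValueError("history lists must have equal length")
    return [
        {"article_id": a, "timestamp": t, "read_time": r, "scroll_percentage": s}
        for a, t, r, s in zip(*columns)
    ]


def mean_scroll(clicks):
    """Average scroll percentage, ignoring null (None) entries."""
    values = [c["scroll_percentage"] for c in clicks
              if c["scroll_percentage"] is not None]
    return sum(values) / len(values) if values else None


# Two-click toy row; the keys are hypothetical names.
row = {
    "user_id": 44038,
    "article_ids": [9618533, 9646154],
    "timestamps": [datetime(2023, 2, 2, 16, 37, 42),
                   datetime(2023, 2, 22, 18, 28, 38)],
    "read_times": [425.0, 12.0],
    "scroll_percentages": [None, 100.0],
}
clicks = unpack_history(row)
print(mean_scroll(clicks))
```

Scroll percentages can be null, as in the example row, so any aggregate over them needs an explicit rule for missing values; this sketch simply skips them.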

Articles

The articles.parquet file contains the detailed information of news articles.

#	Column	Context	Example	dtype
1	Article ID	The unique ID of a news article.	8987932	i32
2	Title	The article's Danish title.	Se billederne: Zlatans paradis til salg	str
3	Subtitle	The article's Danish subtitle/abstract.	Zlatan Ibrahimovic har sat sin skihytte i Åre til salg, men prisen skal nok afskrække en del. (...)	str
4	Body	The article's full Danish text body.	Drømmer du om en eksklusiv skihytte i Sverige? Så har Zlatan Ibrahimovic et eksklusivt tilbud til dig (...)	str
5	Category ID	The category ID.	142	i16
6	Category String	The category as a string.	sport	str
7	Subcategory IDs	The subcategory IDs.	[196, 271]	list[i16]
8	Premium	Whether the content is behind a paywall.	False	bool
9	Time Published	The time the article was published. The format is "YYYY-MM-DD HH:MM:SS".	2021-11-15 03:56:56	datetime[μs]
10	Time Modified	The timestamp for the last modification of the article, e.g., updates as the story evolves or spelling corrections. The format is "YYYY-MM-DD HH:MM:SS".	2023-06-29 06:38:41	datetime[μs]
11	Image IDs	The image IDs used in the article.	[8988118]	list[i64]
12	Article Type	The type of article, such as a feature, gallery, video, or live blog.	article_default	str
13	URL	The article's URL.	https://ekstrabladet.dk/.../8987932	str
14	NER	The tags retrieved from a proprietary named-entity-recognition model at Ekstra Bladet are based on the concatenated title, abstract, and body.	['Aftonbladet', 'Åre', 'Bjurfors', 'Cecilia Edfeldt Jigstedt', 'Helena', 'Sverige', 'Zlatan Ibrahimovic']	list[str]
15	Entities	The tags retrieved from a proprietary entity-recognition model at Ekstra Bladet are based on the concatenated title, abstract, and body.	['ORG', 'LOC', 'ORG', 'PER', 'PER', 'LOC', 'PER']	list[str]
16	Topics	The tags retrieved from a proprietary topic-recognition model at Ekstra Bladet are based on the concatenated title, abstract, and body.	[]	list[str]
17	Total Inviews	The total number of times an article has been inview (registered as seen) by users within the first 7 days after it was published. This feature only applies to articles that were published after February 16, 2023.	null	i32
18	Total Pageviews	The total number of times an article has been clicked by users within the first 7 days after it was published. This feature only applies to articles that were published after February 16, 2023.	null	i32
19	Total Read Time	The accumulated read time of an article within the first 7 days after it was published. This feature only applies to articles that were published after February 16, 2023.	null	f32
20	Sentiment Score	The sentiment score from a proprietary sentiment model at Ekstra Bladet is based on the concatenated title and abstract.	0.5299	f32
21	Sentiment Label	The assigned sentiment label from a proprietary sentiment model at Ekstra Bladet is based on the concatenated title and abstract. The labels are positive, neutral, and negative.	Neutral	str
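In the example row above, the NER and Entities lists have the same length, which suggests that each entity label describes the tag at the same position. Under that assumption (the documentation does not state it explicitly), the two columns can be combined as in this sketch; the function name is hypothetical.

```python
from collections import defaultdict


def group_entities(names, labels):
    """Group NER tags by entity label, assuming the lists are index-aligned."""
    if len(names) != len(labels):
        raise ValueError("NER and Entities lists must have equal length")
    grouped = defaultdict(list)
    for name, label in zip(names, labels):
        grouped[label].append(name)
    return dict(grouped)


# Values copied from the example row in the table above.
ner = ['Aftonbladet', 'Åre', 'Bjurfors', 'Cecilia Edfeldt Jigstedt',
       'Helena', 'Sverige', 'Zlatan Ibrahimovic']
entities = ['ORG', 'LOC', 'ORG', 'PER', 'PER', 'LOC', 'PER']

by_type = group_entities(ner, entities)
print(by_type)
```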

Artifacts

To help users get started quickly with EB-NeRD, the dataset includes embedding artifacts: textual representations of the articles and encoded thumbnail images. The textual representations are based on the title, subtitle, and body. We provide three representations, based on multilingual BERT, RoBERTa, and a proprietary contrastive-based model. We also provide scripts to generate your own document embeddings using Hugging Face models (link coming soon!).

The artifacts follow this format:

#	Column	Context	Example	dtype
1	Article ID	An article embedding, i.e., a continuous representation of the article.	[0.062, 0.040, …, 0.061]	list[f32]
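One typical use of the article embeddings is to rank articles by similarity to a given article. The sketch below uses cosine similarity on a toy dictionary that maps article IDs to embedding vectors; the dictionary layout, the function names, and the two-dimensional vectors are assumptions for illustration, not the format of the artifact files.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query_id, embeddings):
    """Return the other article IDs, sorted by decreasing similarity to query_id."""
    query = embeddings[query_id]
    others = [aid for aid in embeddings if aid != query_id]
    return sorted(
        others,
        key=lambda aid: cosine_similarity(query, embeddings[aid]),
        reverse=True,
    )


# Toy embeddings keyed by made-up article IDs.
embeddings = {
    1: [1.0, 0.0],
    2: [0.9, 0.1],
    3: [0.0, 1.0],
}
ranking = most_similar(1, embeddings)
print(ranking)
```

For the full dataset, a vectorized implementation over a matrix of embeddings would be considerably faster than this pure-Python loop.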

EB-NeRD: Ekstra Bladet News Recommendation Dataset

A Large-Scale Dataset for News Recommendation
