Issuu Research Dataset

Background

This package makes a real-world-sized dataset of Issuu usage patterns available for research purposes. We have often been approached by researchers interested in our data, so we wanted to give something back to the scientific community.

To provide a little context for this data, consider the following simplified model of what Issuu is and how our users make use of our platform:

Publishers upload documents and can embed the Issuu “Reader” software on their own website in order to give end users a nice reading experience.
End users can read documents from Issuu either through embeds on the publisher’s website or on Issuu.com.
Unless the publisher is paying Issuu to not do so, Issuu will show “related” publications below the one you are currently reading or in a sidebar within the reader software itself. This is called a “stream” and consists of an endless stream of documents.
Publishers can pay to get their documents shown in streams – a document being paid for this way is referred to as an “ad”.

Licence

This dataset is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

This includes the following restrictions:

If you use this data for any published work, you must mention Issuu as the source.
You cannot use this data commercially in any way.
If you share it, you need to apply this same licence.

Additionally we kindly ask the following:

Send us a mail at [labs@issuu.com] if you use the dataset. We’d love to hear what you are using it for, and we welcome comments and suggestions.
If you scrape Issuu for metadata when using the dataset, don’t hit our servers hard.

What data is available?

We provide statistical information about users’ interactions with documents. Statistics are gathered in several situations, including:

Displaying a document (impression)
Opening a document (read)
Reading a document (page read)
Reading for a duration of time (page read time)

This data is made available via Issuu “pingback” entries that carry a fixed payload of information. Data is stored in text files containing JSON structures. The full dataset contains one month’s worth of data and takes about 200 GB compressed with xz, or 1.XXXX terabytes extracted.

For a quick start we also provide a smaller sample. No particular effort was made to make the sample representative, so it should only be used to get familiar with the data format.
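
As a minimal sketch of how such a file could be read without extracting it first: the code below assumes the file is xz-compressed and holds one JSON object per line. The text above only says "text files containing JSON structures", so the one-object-per-line layout is an assumption, and the function name iter_pingbacks is hypothetical.

```python
import json
import lzma
from typing import Iterator


def iter_pingbacks(path: str) -> Iterator[dict]:
    """Yield one pingback dict per non-empty line of an xz-compressed file.

    Assumes one JSON object per line; adjust if the files are laid out
    differently.
    """
    with lzma.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

Because the function is a generator, events are decoded one at a time and the full dataset never has to fit in memory.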

Pingback anatomy

A pingback consists of six parts, some of which may be omitted in some circumstances:

Timestamp
Visitor
Environment
Event
Subject
Cause

In the following sections we describe each part separately, along with the corresponding JSON fields in the event file. For easy parsing of the data file, each parameter name is prefixed with the name of the part it belongs to: “visitor_”, “env_”, “event_”, “subject_” or “cause_”. The timestamp is the exception and is stored in the unprefixed field “ts”.
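
The prefix convention makes it easy to regroup a flat pingback into its parts. The following is a sketch only: split_parts and the extra "other" bucket (for fields without a known prefix, such as "ts") are hypothetical names chosen for illustration.

```python
PART_PREFIXES = ("visitor", "env", "event", "subject", "cause")


def split_parts(pingback: dict) -> dict:
    """Group a flat pingback into {part: {field: value}} using the name prefixes."""
    parts = {name: {} for name in PART_PREFIXES}
    parts["other"] = {}  # fields without a known prefix, e.g. "ts"
    for key, value in pingback.items():
        prefix, sep, rest = key.partition("_")
        if sep and prefix in PART_PREFIXES:
            parts[prefix][rest] = value
        else:
            parts["other"][key] = value
    return parts
```

For example, a field named "env_doc_id" ends up as parts["env"]["doc_id"].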

Timestamp

When did the event happen?

This is an unsigned 64-bit integer denoting the Unix epoch time of the event.

{
    "ts": <int>,
    ...
}
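
A small sketch of converting the timestamp to a readable date. It assumes the value is expressed in seconds, which is what the magnitude of the values in the full examples in this document (for instance 1393631989) suggests; the helper name event_time is hypothetical.

```python
from datetime import datetime, timezone


def event_time(pingback: dict) -> datetime:
    """Convert the "ts" field to a timezone-aware UTC datetime (assumes seconds)."""
    return datetime.fromtimestamp(pingback["ts"], tz=timezone.utc)


print(event_time({"ts": 1393631989}).isoformat())  # 2014-02-28T23:59:49+00:00
```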

Visitor

Who triggered the event?

Identifying the visitor

Tracking of visitors (users who may or may not be signed in to Issuu) is done primarily with browser cookies. Every request is assigned a unique identifier (uuid). If the request already contains an identifier, that one is reused, which provides a means of grouping requests into sessions. If the visitor was logged in, a consistent hash of the username is included in the event; otherwise the value is null. The uuid is not constant over time: it changes when the user switches browsers or clears the cookie. For that reason different uuids might point to the same user, and to the same username as well.

Logged in: (username field is set)

{
    "visitor_username": <username_hash>,
    "visitor_uuid": <uuid_hash>,
    ...
}

Not logged in: (username field is null)

{
    "visitor_username": null,
    "visitor_uuid": <uuid_hash>,
    ...
}
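
Since several uuids can belong to the same logged-in user, it can be useful to collect the uuids seen for each username hash. This is a sketch under the assumption that pingbacks have already been parsed into dicts; uuids_by_username is a hypothetical helper, not part of the dataset tooling.

```python
from collections import defaultdict
from typing import Iterable


def uuids_by_username(pingbacks: Iterable[dict]) -> dict:
    """Map each hashed username to the set of hashed uuids seen with it.

    Anonymous events (visitor_username is null) are skipped because they
    cannot be tied to an account.
    """
    seen = defaultdict(set)
    for pingback in pingbacks:
        username = pingback.get("visitor_username")
        uuid = pingback.get("visitor_uuid")
        if username is not None and uuid is not None:
            seen[username].add(uuid)
    return dict(seen)
```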

Source

The source parameter indicates how the user arrived on the current page and is based on the referrer value. If the user followed an external link or typed the location directly into the browser, this parameter has the value “external”. If, on the other hand, it is evident from the referrer value that the user’s previous location was on the issuu.com site, the source value is “internal”.

{
    "visitor_source": <source_string>,
    ...
}

User-Agent

If the client sends a User-Agent in the request headers, it is included in the pingback.

{
    "visitor_useragent": <string>,
    ...
}

Referrer

If the client sends a referrer in the request headers, it is included in the pingback. To protect the privacy of the user, some browsers and proxies block or modify this value. For the same reason, the event contains a consistent hash of the referrer instead of the actual value.

{
    "visitor_referrer": <referrer_hash>,
    ...
}

Geo data

Country location is deduced from an IP lookup and is available as a two-letter uppercase ISO 3166 code.

{
    "visitor_country": <string>,
    ...
}

Device

The type of device used by the visitor to interact with the Issuu service. In most cases this will be “browser”, but it might also indicate which type of mobile device or other client was used to connect. Current values include:

“browser”
“android”

but more will be included in the future.

{
    "visitor_device": <device_string>,
    ...
}

IP

The actual IP address used to connect is not included in the event for privacy reasons. Instead we include a consistent hash of the IP.

{
    "visitor_ip": <ip_hash>,
    ...
}

Environment

Where in the Issuu network was the event generated?

Currently there are three different environments from where an event can originate:

reader
stream
website

The environment that dispatched the event is usually not that interesting to the end-user as it is mostly an implementation detail. For example some events are best captured and transmitted in the reader itself while others originate from the issuu.com website.

Reader

These events are transmitted by the Issuu “Reader” software. If the context of the event includes an ad (an Issuu document being promoted), the hashed ad id is included. The document displayed by the reader at the time the event was sent is found in “env_doc_id”.

{
    "env_type": "reader",
    "env_doc_id": <document_id>,
    "env_adid": <ad_id_hash>,
    ...
}

Stream

The discovery portion of the issuu website is based around streams which are (seemingly) unlimited lists of documents (thumbnails) that the user can scroll through. The ranking number is an indication of how far down the list the user had scrolled when the event was triggered. Ranking of 0 means top of the list.

{
    "env_type": "stream",
    "env_ranking": <int>,
    ...
}

Website

These events are transmitted from the website outside any stream. They include, among other things, events sent when users share or download documents.

{
    "env_type": "website",
    ...
}

Event type

What exactly happened?

impression
click
read
download
share
pageread
pagereadtime
continuation_load

Each of these events also carries information about the subject and cause. The Subject and Cause sections describe these properties in greater detail.

Impression

Subject is either “doc” or “infobox”. Cause is one of “impression”, “page”, “ad”, “related” or “archive”.

An impression is counted each time a document is displayed. This might happen directly from the issuu.com site by visitors browsing the content or it might be as a result of someone visiting an external page where an issuu document has been embedded. The difference between an impression and a read is that a read indicates a conscious decision by the user to engage with the content while the same cannot be said for an impression.

{
    "event_type": "impression",
    "cause_type": "impression|page|ad|related|archive",
    "subject_type": "doc|infobox",
    ...
}

Click

The subject for a click is either “doc” or “link”. Document clicks happen when a user clicks on a thumbnail of a document displayed in a stream. This can happen for different reasons (causes) such as “ad”, “archive”, “related” or “embed” (see the Cause section), or without any special reason. Link clicks are described in the Link section under Subject.

{
    "event_type": "click",
    "subject_type": "doc|link",
    "cause_type": "ad|archive|related|embed",
    ...
}

Read

The read event is sent a short period (2 seconds) after a document has been opened. It indicates an overall read of the document without indicating specifically which pages have been read. A document read event inherently leads to one or more pageread events being fired.

{
    "event_type": "read",
    "subject_type": "doc",
    ...
}

Download

A user has opted to download the original document file (PDF, PPT, etc.), most likely for printing, for offline viewing, or simply for opening in their standard viewer. The subject for this event is “doc”.

{
    "event_type": "download",
    "subject_type": "doc",
    ...
}

Share

Documents can be shared via a host of different channels. Possible values of “event_service” are:

“email”
“facebook”
“twitter”
“google”
“tumblr”
“linkedin”

After having successfully posted a document via one of these services the following event will be fired.

{
    "event_type": "share",
    "event_service": <service>,
    ...
}

Page read

Page read events offer a finer granularity than document read events when determining exactly which pages have been read. Similar to document read events these are triggered after the user has read the page for a minimum of 2 seconds. This means that quickly flipping through pages will not generate read events until a page has been shown for 2 seconds or more. Subject for this event will be “doc”.

{
    "event_type": "pageread",
    "subject_type": "doc",
    ...
}
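
To illustrate the finer granularity, the sketch below collects the distinct pages each visitor read in each document. It assumes pingbacks are already parsed into dicts and that “subject_page” carries the page number for these events, as in the full page read example in this document; pages_read is a hypothetical helper name.

```python
from collections import defaultdict
from typing import Iterable


def pages_read(pingbacks: Iterable[dict]) -> dict:
    """Return {(visitor_uuid, subject_doc_id): set of page numbers read}."""
    pages = defaultdict(set)
    for pingback in pingbacks:
        if pingback.get("event_type") != "pageread":
            continue  # only page read events carry the per-page signal used here
        key = (pingback.get("visitor_uuid"), pingback.get("subject_doc_id"))
        pages[key].add(pingback.get("subject_page"))
    return dict(pages)
```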

Page read time

To track the duration of the reading session, read time events are transmitted at increasing time intervals, starting after 2 seconds. The readtime field in the event indicates the time in milliseconds elapsed since the last event. The subject for this event is “doc”.

{
    "event_type": "pagereadtime",
    "event_readtime": <int>,
    "subject_type": "doc",
    ...
}
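
Because each readtime value is the time elapsed since the previous event, the total reading time can be estimated by summing the values. The sketch below does this per visitor and document; it assumes pingbacks are already parsed into dicts, and total_readtime_ms is a hypothetical helper name.

```python
from collections import Counter
from typing import Iterable


def total_readtime_ms(pingbacks: Iterable[dict]) -> dict:
    """Sum event_readtime (milliseconds) per (visitor_uuid, subject_doc_id)."""
    totals = Counter()
    for pingback in pingbacks:
        if pingback.get("event_type") != "pagereadtime":
            continue
        key = (pingback.get("visitor_uuid"), pingback.get("subject_doc_id"))
        # Treat a missing or null readtime as zero rather than failing.
        totals[key] += pingback.get("event_readtime") or 0
    return dict(totals)
```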

Continuation load

Documents on the issuu.com site can be browsed from endless “streams” where new documents keep appearing as the browser window scrolls down. The stream is continuously populated by incremental loads performed in the background, and this event type is fired every time a new continuation load is requested.

The index value is zero-based and indicates how far down the list the user has continued to scroll.

{
    "event_type": "continuation_load",
    "event_index": <int>,
    ...
}

Subject

To whom did it happen?

There might not always be a subject, in which case the “subject” attribute is omitted. The only event type that doesn’t have a subject is “continuation_load”, as this event applies to a stream of (possibly mixed) subjects. Possible subject types are:

doc
infobox
link

Doc

For the vast majority of events the subject will be a document. In the pingback event this is denoted by a “doc” value for the subject attribute. All but the “continuation_load” event type MAY have document as the subject. For events where the subject_page is not relevant the value will be 0.

{
    "subject_type": "doc",
    "subject_doc_id": <document_id>,
    "subject_page": <int>,
    ...
}

Infobox

Inside the explore stream, small infoboxes with “calls to action” (CTAs) offering tips on how to use Issuu are shown. All impression events are recorded with an internal identifier of the infobox. The “event_type” for these events is always “impression”.

{
    "subject_type": "infobox",
    "subject_infoboxid": <string>,
    ...
}

Link

It is possible for a document to contain links on one or more of its pages. These originate either from the originally uploaded file or may have been entered manually at a later point.

Each click on a link is recorded with the link URL as well as the X/Y position on the page and the page number. The coordinates indicate the relative position of the link center on the page with X-axis running left to right and Y-axis running top to bottom.

The “event_type” for these events is always “click”.

{
    "subject_type": "link",
    "subject_url": <url>,
    "subject_link_position": <link_position>,
    "subject_page": <int>,
    ...
}

Cause

Why was the event sent?

Events with type “impression” or “click” will in some cases have a cause associated. That means that the impression or click of a document was caused by some sort of promotion like an ad display or from a related or archive stream. For these events the “subject_type” attribute will always be set to “doc”.

When an embed is displayed on a device that doesn’t support Flash the embed will show a simple thumbnail instead. If the user clicks the thumbnail it will trigger a “click” event where the “cause_type” is set to “embed”.

For “impression” event types the cause can be set to either “impression” or “page” indicating whether the impression was for a full document or merely a page in the document. The difference between document impressions and page impressions is the same as for reads: A document impression is only generated the first time the document is viewed while page impressions will be generated for each page viewed in the document.

Also, there might not always be a cause, in which case the “cause” attribute is omitted. Possible values are:

ad
archive
related
embed
impression
page

Ad

Impression or click of an ad in a stream. The position attribute indicates where in the stream the ad was positioned at the time, and the exact ad can be identified from the ad id (hash).

Ads are simply Issuu documents that are being specifically promoted. The subject of the event contains info about the document.

{
    "cause_type": "ad",
    "cause_position": <int>,
    "cause_adid": <ad_id_hash>,
    ...
}

Embed

Clicks on embeds will have the cause value set to “embed”. For these events the subject is always “doc”.

{
    "cause_type": "embed",
    ...
}

Impression

An impression of the document was rendered. The “event_type” is likewise “impression”.

{
    "cause_type": "impression",
    ...
}

Page

An impression for a page in the document was rendered. The event type for this event is “impression”.

{
    "cause_type": "page",
    ...
}

Full examples

Impression

{
    "ts": 1393631989,
    "visitor_uuid": "745409913574d4c6",
    "visitor_username": null,
    "visitor_source": "external",
    "visitor_device": "browser",
    "visitor_useragent": "Mozilla/5.0 (iPhone; CPU iPhone OS 7_0_6 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Mobile/11B651 [FBAN/FBIOS;FBAV/7.0.0.17.1;FBBV/1325030;FBDV/iPhone4,1;FBMD/iPhone;FBSN/iPhone OS;FBSV/7.0.6;FBSS/2; FBCR/Telcel;FBID/phone;FBLC/es_ES;FBOP/5]",
    "visitor_ip": "0e1c9cd3d6c22c65",
    "visitor_country": "MX",
    "visitor_referrer": "ab11264107143c5f",
    "env_type": "reader",
    "env_doc_id": "140228202800-6ef39a241f35301a9a42cd0ed21e5fb0",
    "env_adid": null,
    "event_type": "impression",
    "subject_type": "doc",
    "subject_doc_id": "140228202800-6ef39a241f35301a9a42cd0ed21e5fb0",
    "subject_page": 23,
    "cause_type": "page"
}

Read

{
    "ts": 1393631990,
    "visitor_uuid": "9a83c97f415601a6",
    "visitor_username": null,
    "visitor_source": "external",
    "visitor_device": "browser",
    "visitor_useragent": "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36",
    "visitor_ip": "03a2602450304bd4",
    "visitor_country": "AR",
    "visitor_referrer": "0aefac0a2bd221ab",
    "env_type": "reader",
    "env_doc_id": "131203154832-9b8594b7ec211f7e1a0782fd9883a42c",
    "env_adid": null,
    "event_type": "read",
    "subject_type": "doc",
    "subject_doc_id": "131203154832-9b8594b7ec211f7e1a0782fd9883a42c",
    "subject_page": 0,
    "cause": null
}

Page read

{
    "ts": 1393631989,
    "visitor_uuid": "745409913574d4c6",
    "visitor_username": null,
    "visitor_source": "external",
    "visitor_device": "browser",
    "visitor_useragent": "Mozilla/5.0 (iPhone; CPU iPhone OS 7_0_6 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Mobile/11B651 [FBAN/FBIOS;FBAV/7.0.0.17.1;FBBV/1325030;FBDV/iPhone4,1;FBMD/iPhone;FBSN/iPhone OS;FBSV/7.0.6;FBSS/2; FBCR/Telcel;FBID/phone;FBLC/es_ES;FBOP/5]",
    "visitor_ip": "0e1c9cd3d6c22c65",
    "visitor_country": "MX",
    "visitor_referrer": "ab11264107143c5f",
    "env_type": "reader",
    "env_doc_id": "140228202800-6ef39a241f35301a9a42cd0ed21e5fb0",
    "env_adid": null,
    "event_type": "pageread",
    "subject_type": "doc",
    "subject_doc_id": "140228202800-6ef39a241f35301a9a42cd0ed21e5fb0",
    "subject_page": 23,
    "cause": null
}

Page read time

{
    "ts": 1393631989,
    "visitor_uuid": "64bf70296da2f9fd",
    "visitor_username": null,
    "visitor_source": "internal",
    "visitor_device": "browser",
    "visitor_useragent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko/20100101 Firefox/27.0",
    "visitor_ip": "06f49269e749a837",
    "visitor_country": "VE",
    "visitor_referrer": "64f729926497515c",
    "env_type": "reader",
    "env_doc_id": "130705172251-3a2a725b2bbd5aa3f2af810acf0aeabb",
    "env_adid": null,
    "event_type": "pagereadtime",
    "event_readtime": 797,
    "subject_type": "doc",
    "subject_doc_id": "130705172251-3a2a725b2bbd5aa3f2af810acf0aeabb",
    "subject_page": 10,
    "cause": null
}
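
Note that the examples above express a missing cause as a "cause" field set to null, while a present cause appears in "cause_type". A small accessor can hide that difference. This is a sketch based only on the field names visible in the examples; cause_of is a hypothetical helper name.

```python
from typing import Optional


def cause_of(pingback: dict) -> Optional[str]:
    """Return the cause as a string, or None when the event carries no cause."""
    cause = pingback.get("cause_type")
    if cause is None:
        # Fall back to the bare "cause" field, which the examples set to null.
        cause = pingback.get("cause")
    return cause
```

Both an absent field and an explicit null yield None, so downstream code only has to handle one "no cause" case.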

Anonymization process

The data made available has been filtered to remove information that could be used to identify the person triggering the event or the location of a document that the publisher wishes to keep private.

Field hashing

Instead of the raw identifiers, a consistent hash has been applied to all sensitive fields. The username, uuid, ip, referrer and adid are all represented by 16 lowercase hex characters instead of their actual values.

For private documents the identifier, normally of the form \digit{12}“-”\lowerhex{32}, has likewise been hashed and is formatted as “000000000000-”\lowerhex_hash{32}.
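
The identifier format described above can be checked with a regular expression, which also makes it possible to tell hashed (private) document ids from public ones. This is a sketch: the pattern is derived from the description and the example ids in this document, and is_private_doc_id is a hypothetical helper name.

```python
import re

# Twelve digits, a hyphen, then 32 lowercase hex characters.
DOC_ID_RE = re.compile(r"(\d{12})-([0-9a-f]{32})")
PRIVATE_PREFIX = "0" * 12


def is_private_doc_id(doc_id: str) -> bool:
    """Return True when the id has the all-zero prefix used for private documents."""
    match = DOC_ID_RE.fullmatch(doc_id)
    if match is None:
        raise ValueError(f"unexpected document id format: {doc_id!r}")
    return match.group(1) == PRIVATE_PREFIX
```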
