Output Data Format

The output data consists of zero or more matched page sets, each containing one or more matched pages, with each page carrying the N keys defined in the Scraping Definition.

JSON Output

The following is a minimal output format. It is compact, but it loses certain information.

{
  "<pageSetID>": {
    "<url>": {
      "data": "result",
      "is": "structured"
    }
  }
}
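As a rough illustration of how a consumer might walk this nested shape (page set ID, then URL, then scraped keys), here is a minimal Python sketch. The page set ID `articles`, the `example.com` URLs, and the helper name `iter_pages` are hypothetical and only serve the example.

```python
import json

# Hypothetical sample in the minimal shape: page set ID -> URL -> scraped keys.
raw = """
{
  "articles": {
    "https://example.com/a": {"data": "result", "is": "structured"},
    "https://example.com/b": {"data": "other", "is": "structured"}
  }
}
"""

def iter_pages(output):
    """Yield (page_set_id, url, record) for every page in the output."""
    for page_set_id, pages in output.items():
        for url, record in pages.items():
            yield page_set_id, url, record

output = json.loads(raw)
rows = list(iter_pages(output))
for page_set_id, url, record in rows:
    # Print the page set, the page URL, and the key names found on that page.
    print(page_set_id, url, sorted(record))
```

Flattening to tuples like this keeps the page set and URL attached to each record, which the bare record on its own does not carry.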

We highly recommend enriching the data with semantic information. All implementations SHOULD parse each page's scraping definition and its output data, and SHOULD provide warnings if the data is not a properly structured JSON-LD document.

Additionally, production environments SHOULD throw an error if the result is not properly structured.

This allows each page to be consumed by a system that can easily understand its context and place it directly into a knowledge graph.
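The warn-by-default, error-in-production behaviour could be sketched as follows. This is only an illustration: the text does not define what "properly structured" means, so the check here assumes a record counts as structured when it is an object with non-empty `@context` and `@type` keys. The names `looks_like_json_ld`, `check_output`, `UnstructuredResultError`, and the `strict` flag are all hypothetical.

```python
import warnings

class UnstructuredResultError(ValueError):
    """Raised in strict (production) mode when a page result fails the check."""

def looks_like_json_ld(record):
    """Assumed heuristic: a dict with non-empty "@context" and "@type" keys."""
    return (
        isinstance(record, dict)
        and bool(record.get("@context"))
        and bool(record.get("@type"))
    )

def check_output(output, strict=False):
    """Return URLs that fail the check; warn by default, raise when strict=True."""
    failed = []
    for page_set_id, pages in output.items():
        for url, record in pages.items():
            if looks_like_json_ld(record):
                continue
            message = f"{page_set_id} / {url}: result is not a structured JSON-LD document"
            if strict:
                raise UnstructuredResultError(message)
            warnings.warn(message)
            failed.append(url)
    return failed

# Hypothetical output: one structured page and one unstructured page.
sample = {
    "news": {
        "https://example.com/a": {"@context": "http://schema.org", "@type": "NewsArticle"},
        "https://example.com/b": {"data": "result"},
    }
}
```

Calling `check_output(sample)` emits one warning and returns the failing URL, while `check_output(sample, strict=True)` raises on the first failure. A full implementation would likely validate against a JSON-LD processor rather than this key check.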

Example

Let's say our scraper retrieved this data:

{
  "@context": "http://schema.org",
  "@type": "NewsArticle",
  "description": "A never-released report shows that the number of people killed by police activity in New York is more than twice what has been reported.",
  "headline": "Undercounting Those Killed by the N.Y.P.D.",
  "publisher": {
    "@id": "https://www.nytimes.com/#publisher"
  }
}

Data Size and Visualization

Root
  Page Set 1
    E
    F
    G
    H
  Page Set 2
    I
    J
    K
    L
    M
      attr 1
      attr 2
      ... attr N
  ... Page set PS

This results in a total data size of:

S = \sum_{i=1}^{PS} N_{i} \cdot P_{i}

Where PS is the number of Page Sets, N_{i} is the number of Relevant Data attributes for set i, and P_{i} is the number of pages for that set.

Generally, for a static Scraping Definition, only P_{i} is subject to change, either because newly posted pages match a page set or because the scraper looks at a different slice of an offline dataset.
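To make the size formula concrete, here is a small Python sketch that evaluates the sum of N_i times P_i over all page sets. The function name `total_data_size` and the attribute counts are hypothetical; the page counts (4 and 5) mirror the two page sets in the diagram.

```python
def total_data_size(page_sets):
    """Sum N_i * P_i over page sets; page_sets is a list of (N_i, P_i) pairs."""
    return sum(n_attrs * n_pages for n_attrs, n_pages in page_sets)

# Hypothetical counts: set 1 has 3 attributes over 4 pages,
# set 2 has 5 attributes over 5 pages.
page_sets = [(3, 4), (5, 5)]
size = total_data_size(page_sets)
print(size)  # 3*4 + 5*5 = 37

# With a static definition only the page counts change: two new pages in set 1.
grown = total_data_size([(3, 6), (5, 5)])
print(grown)  # 3*6 + 5*5 = 43
```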