Many open data sets are essentially tables, or sets of tables, which follow the same regular structure. This document describes a set of conventions for CSV files that enable them to be linked together and to be interpreted as RDF.
The requirements on which this format is based are:
The structure of a CSV file is a header followed by a number of records. The header is the first line of the file, while the remaining lines are the records. Both the header and the records contain fields separated by commas. These terms are used as defined in [[RFC4180]]. Within this document, a column is a set of fields which are at the same index within their respective rows and the column name is the value of the field in the header for that column. For example, the following is a valid CSV file which lists country codes and names:
country,name
AD,Andorra
AF,Afghanistan
AI,Anguilla
AL,Albania
All valid CSV files are valid linked CSV files, so the above example is also a valid linked CSV file. It has four records and two columns, whose names are country and name.
Valid CSV files MUST use CRLF to indicate the ends of lines (and thus the separation of rows). Linked CSV parsers SHOULD provide a warning if CR or LF is used for line endings, and SHOULD recover by parsing the CSV file with those line endings.
Spreadsheet programs such as Excel or OpenOffice Calc typically use the line ending used by the platform on which they are deployed (e.g. simply LF on Mac OS X). Allowing other line endings for linked CSV is intended to make it easier to create such documents within spreadsheet programs.
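The warn-and-recover behaviour described above could be sketched in Python as follows. This is only an illustration under stated assumptions: the function name read_linked_csv is hypothetical, and the check does not distinguish a bare line ending from a line break inside a quoted field.

```python
import csv
import io
import warnings


def read_linked_csv(text: str) -> list[list[str]]:
    """Parse CSV text, warning when a line ending other than CRLF is found."""
    # Remove every CRLF pair; any CR or LF left over is a bare line ending
    # (or a line break inside a quoted field, which this sketch ignores).
    leftover = text.replace("\r\n", "")
    if "\r" in leftover or "\n" in leftover:
        warnings.warn("line ending other than CRLF found; recovering")
    # newline="" passes line endings through untranslated, and the csv
    # module accepts CRLF, CR and LF as record terminators.
    return list(csv.reader(io.StringIO(text, newline="")))
```

A file saved with LF line endings therefore still parses, but the caller is told about it.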
The aim of processing a linked CSV file is to generate information about a set of entities. An entity may be represented internally by the application as an object or a resource. Each entity has a number of properties, which may have one or more values.
Records within a linked CSV file may be of two different types: prolog lines and data lines. Data lines can only come after the last prolog line, if there is one. A data line is a line that contains data about an entity. A single entity may be described across multiple data lines. For each data line describing an entity, each value within the line corresponds to a value of a property of that entity (the property being labelled through the corresponding header).
The JSON version of the example CSV file above is:
[
  { "country": "AD", "name": "Andorra" },
  { "country": "AF", "name": "Afghanistan" },
  { "country": "AI", "name": "Anguilla" },
  { "country": "AL", "name": "Albania" }
]
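For a plain CSV file with no prolog lines and no $id column, the mapping to this JSON can be sketched in a few lines of Python. The function name simple_json is an assumption made for the example, and all values are kept as strings.

```python
import csv
import io
import json


def simple_json(csv_text: str) -> str:
    """Turn a plain CSV file (no prolog lines, no $id column) into JSON."""
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    # Every record becomes one object keyed by the column names.
    return json.dumps([dict(record) for record in reader])
```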
Linked CSV files must be encoded as UTF-8.
It isn't usually easy to set the encoding of a CSV file when exporting from normal spreadsheet programs. It would be nice if there were a way of detecting the encoding. Perhaps it could be sniffed based on the initial characters #, in the file (with UTF-8 assumed if those aren't the initial characters)?
Linked CSV is built around the concept of using URIs to name things. Every record and column, and even slices of data, in a linked CSV file is addressable using URI Identifiers for the text/csv Media Type. For example, if the linked CSV file is accessed at http://example.org/countries, the first record in the CSV file above, which happens to be the first data line within the linked CSV file (and which describes Andorra), is addressable with the URI:
http://example.org/countries#row:0
However, this addressing merely identifies the records within the linked CSV file, not the entities that the record describes. This distinction is important for two reasons:
By default, each data line describes an entity, each entity is described by a single data line, and there is no way to address the entities. However, adding a $id column enables entities to be given identifiers. These identifiers are always URIs, and they are interpreted relative to the location of the linked CSV file. The $id column may be positioned anywhere but by convention it should be the first column (unless there is a # column, in which case it should be the second). For example:
$id,country,name
#AD,AD,     Andorra
#AD,AD,     Principality of Andorra
#AF,AF,     Afghanistan
#AF,AF,     Islamic Republic of Afghanistan
For the purpose of clarity within this document, whitespace has been added to this and the remainder of the examples so that headers and values line up correctly. Whitespace within linked CSV files is normally significant.
The prefix $ is used because the prefix @ is interpreted as indicating a formula when entered into spreadsheet programs such as Excel.
This linked CSV file contains two entities, which have the identifiers http://example.org/countries#AD and http://example.org/countries#AF. The first is described by the first two data lines and the second by the next two. The JSON generated for this file would be:
[{
  "@id": "http://example.org/countries#AD",
  "country": "AD",
  "name": [ "Andorra", "Principality of Andorra" ]
},{
  "@id": "http://example.org/countries#AF",
  "country": "AF",
  "name": [ "Afghanistan", "Islamic Republic of Afghanistan" ]
}]
and the RDF would be:
@prefix rel: <http://www.iana.org/assignments/relation/> .
@prefix : <http://example.org/countries#> .

<http://example.org/countries#AD>
  rel:describedby <http://example.org/countries#row:0> , <http://example.org/countries#row:1> ;
  :country "AD" ;
  :name "Andorra" , "Principality of Andorra" ;
  .

<http://example.org/countries#AF>
  rel:describedby <http://example.org/countries#row:2> , <http://example.org/countries#row:3> ;
  :country "AF" ;
  :name "Afghanistan" , "Islamic Republic of Afghanistan" ;
  .
As shown by this example, when multiple data lines describe a single entity, a given property takes only the distinct values within the column for that entity rather than being duplicated. However, the file can be made shorter if it doesn't contain duplicates in the first place; the following CSV is equivalent:
$id,country,name
#AD,AD,     Andorra
#AD,,       Principality of Andorra
#AF,AF,     Afghanistan
#AF,,       Islamic Republic of Afghanistan
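The merging of data lines that share a $id value can be sketched as below. This is a minimal Python illustration rather than a definitive algorithm: the function name group_entities and its arguments are assumptions, fields are taken exactly as given (no alignment whitespace is stripped), and data lines without a $id value are not handled.

```python
from urllib.parse import urljoin


def group_entities(header, data_lines, base_uri):
    """Group data lines by $id, keeping the distinct values per property."""
    id_col = header.index("$id")
    entities = {}
    for line in data_lines:
        # Identifiers are URIs resolved against the location of the file.
        uri = urljoin(base_uri, line[id_col])
        entity = entities.setdefault(uri, {})
        for name, value in zip(header, line):
            if name == "$id" or value == "":
                continue  # empty fields contribute no value
            values = entity.setdefault(name, [])
            if value not in values:
                values.append(value)  # repeated values are not duplicated
    return entities
```

With this sketch the two example files above, with and without the repeated country codes, produce the same entities.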
By default, properties within the linked CSV file are assumed to apply to the thing described by the resource located by the URI identifier. For example, if the file contained identifier URIs that were Wikipedia pages, as in
$id,                                     country,name
http://en.wikipedia.org/wiki/Andorra,    AD,     Andorra
http://en.wikipedia.org/wiki/Andorra,    AD,     Principality of Andorra
http://en.wikipedia.org/wiki/Afghanistan,AF,     Afghanistan
http://en.wikipedia.org/wiki/Afghanistan,AF,     Islamic Republic of Afghanistan
applications should interpret the properties labelled country and name to apply to the countries described by those Wikipedia pages, not the Wikipedia pages themselves. In general this distinction does not matter, but it may when using linked CSV to describe resources that are available on the web. Individual properties may be used differently, and apply to the content found at the referenced URI; how they are interpreted should be incorporated into the property documentation.
A linked CSV file can contain any number of prolog lines. Prolog lines describe additional processing of the linked CSV file, usually related to the file or some portion of the file, or related to some or all of the columns. Prolog lines can only be present if there is a column named #; any record that has a value in that column is a prolog line, and the value for that column indicates how the line should be interpreted:
type
lang
meta
url
see
An empty value in the # column indicates that the line is a data line rather than a prolog line.

Prolog lines must all be at the start of a linked CSV file. Any prolog lines that appear after the first data line must be ignored by processors. Prolog lines of different types can appear in any order.
Ignoring prolog lines that appear after the first data line aids streaming processing of linked CSV files, the hiding of prolog information within spreadsheet applications, and ease of reading for humans.
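The rule for telling prolog lines from data lines, including the requirement to ignore late prolog lines, can be sketched in Python. The function name split_prolog and the list-of-lists representation of records are assumptions made for this example.

```python
def split_prolog(header, records):
    """Separate prolog lines from data lines using the # column."""
    if "#" not in header:
        # Without a # column there can be no prolog lines at all.
        return [], list(records)
    hash_col = header.index("#")
    prolog, data = [], []
    for record in records:
        if record[hash_col] == "":
            data.append(record)    # empty # value: a data line
        elif not data:
            prolog.append(record)  # prolog line before any data line
        # A prolog line after the first data line is ignored.
    return prolog, data
```

Because the decision for each record depends only on what has already been seen, the same logic works when records are read as a stream.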
Could add other kinds of prolog lines. The thing to do is probably to have a separate registry of prolog line types that provide for configuration of the processing that should be done on the values in particular columns. For example, you could have prolog lines that enable you to specify a separator used within the values, to enable the creation of list values, or a date-syntax line that enables you to specify the date syntax used in the values in that particular column.
In the simple CSV example we have been looking at, all the values are strings, which works fine for country codes and names. We will now introduce a separate file, http://example.org/af-population, which initially looks like:
country,year,population
AF,     1960,9616353
AF,     1961,9799379
AF,     1962,9989846
AF,     1963,10188299
In this example, the property year holds years and the property population holds an integer. To indicate the types of these properties, we can add a type prolog line. The value of a type prolog line indicates the type of the values in the column that it is in. The type must be one of:
string
url
integer
decimal
double
boolean (true or false)
time: values of this type can be any of the date/time syntaxes supported by XML Schema, namely gYear, gMonth, gDay, gYearMonth, gMonthDay, date, time, dateTime
If there is no type indication in the header for the column, the default type for a particular value depends on the syntax of the value, as follows:
values in one of the date/time syntaxes listed above (other than xs:gYear) are assumed to be date/time values
values that match [0-9]+ are assumed to be integers
values that match [0-9]+\.[0-9]+ are assumed to be decimal numbers
values that match [0-9]+(\.[0-9]+)?[eE][-+][0-9]+(\.[0-9]+)? are assumed to be floating point numbers
the value true is assumed to be the boolean value true, and the value false the boolean value false
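The default typing rules can be sketched as a small Python function. The numeric patterns are copied from the rules above; everything else is an assumption made for illustration, in particular the name infer_type and the date pattern, which recognises only the plain xs:date form YYYY-MM-DD and stands in for the full set of XML Schema date/time syntaxes.

```python
import re

# Simplified stand-in for the XML Schema date/time syntaxes (xs:date only).
_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")
# Patterns taken from the default typing rules.
_INTEGER = re.compile(r"[0-9]+")
_DECIMAL = re.compile(r"[0-9]+\.[0-9]+")
_DOUBLE = re.compile(r"[0-9]+(\.[0-9]+)?[eE][-+][0-9]+(\.[0-9]+)?")


def infer_type(value: str) -> str:
    """Guess the type of a value when its column has no type annotation."""
    if _DATE.fullmatch(value):
        return "time"
    if _INTEGER.fullmatch(value):
        return "integer"  # includes year-like values such as 1960
    if _DECIMAL.fullmatch(value):
        return "decimal"
    if _DOUBLE.fullmatch(value):
        return "double"
    if value in ("true", "false"):
        return "boolean"
    return "string"
```

Note that a bare year such as 1960 is typed as an integer here, which is why the year column needs an explicit time annotation.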
Could enable quoting of values using """...""" delimited values within the CSV?
In the example above, we can add a type prolog line to indicate the types of the properties that are created. We can also change the country column to use the Wikipedia URIs that we previously used for the countries, and indicate that this is being done by giving its type as url. Since the population figures are all syntactically integers, there is no need to annotate that column with a type, but such an annotation can be added for clarity:
#,   country,                                 year,population
type,url,                                     time,integer
,    http://en.wikipedia.org/wiki/Afghanistan,1960,9616353
,    http://en.wikipedia.org/wiki/Afghanistan,1961,9799379
,    http://en.wikipedia.org/wiki/Afghanistan,1962,9989846
,    http://en.wikipedia.org/wiki/Afghanistan,1963,10188299
Conversion to JSON cannot preserve all this information as it does not support date/time datatypes. The resulting data would include the years as integers:
[
  { "country": "http://en.wikipedia.org/wiki/Afghanistan", "year": 1960, "population": 9616353 },
  { "country": "http://en.wikipedia.org/wiki/Afghanistan", "year": 1961, "population": 9799379 },
  { "country": "http://en.wikipedia.org/wiki/Afghanistan", "year": 1962, "population": 9989846 },
  { "country": "http://en.wikipedia.org/wiki/Afghanistan", "year": 1963, "population": 10188299 }
]
The mapping to RDF can preserve the datatype information:
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix rel: <http://www.iana.org/assignments/relation/> .
@prefix : <http://example.org/af-population#> .

[
  rel:describedby <http://example.org/af-population#row:1> ;
  :country <http://en.wikipedia.org/wiki/Afghanistan> ;
  :year "1960"^^xsd:gYear ;
  :population 9616353
] .

[
  rel:describedby <http://example.org/af-population#row:2> ;
  :country <http://en.wikipedia.org/wiki/Afghanistan> ;
  :year "1961"^^xsd:gYear ;
  :population 9799379
] .

[
  rel:describedby <http://example.org/af-population#row:3> ;
  :country <http://en.wikipedia.org/wiki/Afghanistan> ;
  :year "1962"^^xsd:gYear ;
  :population 9989846
] .

[
  rel:describedby <http://example.org/af-population#row:4> ;
  :country <http://en.wikipedia.org/wiki/Afghanistan> ;
  :year "1963"^^xsd:gYear ;
  :population 10188299
] .
In generating the Turtle, the syntax of the values in the year column is used to determine what kind of date/time value each value should be mapped on to. Without the time annotation, the values would be mapped to integers.
A lang prolog line indicates the language used within each column. For example, the file that contains the country details can also be expanded to include the names of the countries in other languages:
#,   $id,                                     country,english name,                   french name
lang,,                                        ,       en,                             fr
,    http://en.wikipedia.org/wiki/Andorra,    AD,     Andorra,                        Andorre
,    http://en.wikipedia.org/wiki/Andorra,    ,       Principality of Andorra,
,    http://en.wikipedia.org/wiki/Afghanistan,AF,     Afghanistan,                    Afghanistan
,    http://en.wikipedia.org/wiki/Afghanistan,,       Islamic Republic of Afghanistan,
In this case, the values of the english name column are labelled as being in English while those in the french name column are labelled as being in French. The JSON would look like:
[{
  "@id": "http://en.wikipedia.org/wiki/Andorra",
  "country": "AD",
  "english name": [
    { "value": "Andorra", "lang": "en" },
    { "value": "Principality of Andorra", "lang": "en" }
  ],
  "french name": { "value": "Andorre", "lang": "fr" }
},{
  "@id": "http://en.wikipedia.org/wiki/Afghanistan",
  "country": "AF",
  "english name": [
    { "value": "Afghanistan", "lang": "en" },
    { "value": "Islamic Republic of Afghanistan", "lang": "en" }
  ],
  "french name": { "value": "Afghanistan", "lang": "fr" }
}]
The Turtle would look like:
@prefix rel: <http://www.iana.org/assignments/relation/> .
@prefix : <http://example.org/countries#> .

<http://en.wikipedia.org/wiki/Andorra>
  rel:describedby <http://example.org/countries#row:1> , <http://example.org/countries#row:2> ;
  :country "AD" ;
  :english.name "Andorra"@en , "Principality of Andorra"@en ;
  :french.name "Andorre"@fr ;
  .

<http://en.wikipedia.org/wiki/Afghanistan>
  rel:describedby <http://example.org/countries#row:3> , <http://example.org/countries#row:4> ;
  :country "AF" ;
  :english.name "Afghanistan"@en , "Islamic Republic of Afghanistan"@en ;
  :french.name "Afghanistan"@fr ;
  .
When there are separate columns providing values in different languages for the same property, or when a large dataset is split across multiple files, as in the example here where the set of population figures is split across multiple country-specific files such as http://example.org/af-population, it is useful to be able to indicate when the separate labels in the CSV headers refer to the same property of a given entity.
To facilitate this, url prolog lines can indicate global identifiers for the properties. These lines contain URIs which are resolved relative to the location of the file itself. In the previous example, the two headers english name and french name both refer to the same name property. We can use a url line to indicate that these both refer to the same property:
#,   $id,                                     country,english name,                   french name
url, ,                                        ,       #name,                          #name
lang,,                                        ,       en,                             fr
,    http://en.wikipedia.org/wiki/Andorra,    AD,     Andorra,                        Andorre
,    http://en.wikipedia.org/wiki/Andorra,    ,       Principality of Andorra,
,    http://en.wikipedia.org/wiki/Afghanistan,AF,     Afghanistan,                    Afghanistan
,    http://en.wikipedia.org/wiki/Afghanistan,,       Islamic Republic of Afghanistan,
When this is converted to JSON, the URI for the property is processed to give just the property name:
[{
  "@id": "http://en.wikipedia.org/wiki/Andorra",
  "country": "AD",
  "name": [
    { "value": "Andorra", "lang": "en" },
    { "value": "Andorre", "lang": "fr" },
    { "value": "Principality of Andorra", "lang": "en" }
  ]
},{
  "@id": "http://en.wikipedia.org/wiki/Afghanistan",
  "country": "AF",
  "name": [
    { "value": "Afghanistan", "lang": "en" },
    { "value": "Afghanistan", "lang": "fr" },
    { "value": "Islamic Republic of Afghanistan", "lang": "en" }
  ]
}]
In the conversion to RDF, the RDF includes the labels for the properties:
@prefix rel: <http://www.iana.org/assignments/relation/> .
@prefix rdfs: <...> .
@prefix : <http://example.org/countries#> .

<http://en.wikipedia.org/wiki/Andorra>
  rel:describedby <http://example.org/countries#row:2> , <http://example.org/countries#row:3> ;
  :country "AD" ;
  :name "Andorra"@en , "Andorre"@fr , "Principality of Andorra"@en ;
  .

<http://en.wikipedia.org/wiki/Afghanistan>
  rel:describedby <http://example.org/countries#row:4> , <http://example.org/countries#row:5> ;
  :country "AF" ;
  :name "Afghanistan"@en , "Afghanistan"@fr , "Islamic Republic of Afghanistan"@en ;
  .

:name rdfs:label "english name" , "french name" ;
  .
When properties are shared across multiple files, the URIs in the url prolog line should resolve to the same URL. For example, if we wanted to indicate that the country property within the af-population file means the same as the country property within the ad-population file, we could associate them both with the same URI by adding the same url prolog line in both files:
#,   country,                                 year,                population
type,url,                                     time,                integer
url, /def/statistics#country,                 /def/statistics#year,/def/statistics#population
,    http://en.wikipedia.org/wiki/Afghanistan,1960,                9616353
,    http://en.wikipedia.org/wiki/Afghanistan,1961,                9799379
,    http://en.wikipedia.org/wiki/Afghanistan,1962,                9989846
,    http://en.wikipedia.org/wiki/Afghanistan,1963,                10188299
The resulting RDF would use these URLs for the country, year and population properties:
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix rel: <http://www.iana.org/assignments/relation/> .
@prefix : <http://example.org/def/statistics#> .

[
  rel:describedby <http://example.org/af-population#row:2> ;
  :country <http://en.wikipedia.org/wiki/Afghanistan> ;
  :year "1960"^^xsd:gYear ;
  :population 9616353
] .

[
  rel:describedby <http://example.org/af-population#row:3> ;
  :country <http://en.wikipedia.org/wiki/Afghanistan> ;
  :year "1961"^^xsd:gYear ;
  :population 9799379
] .

[
  rel:describedby <http://example.org/af-population#row:4> ;
  :country <http://en.wikipedia.org/wiki/Afghanistan> ;
  :year "1962"^^xsd:gYear ;
  :population 9989846
] .

[
  rel:describedby <http://example.org/af-population#row:5> ;
  :country <http://en.wikipedia.org/wiki/Afghanistan> ;
  :year "1963"^^xsd:gYear ;
  :population 10188299
] .
Similarly, the resulting XML will use the property URIs to determine the namespace URIs for the child elements of the <csv:item> elements representing each entity:
<csv:collection xml:base="http://example.org/af-population"
    xmlns:csv="http://example.org/linked-csv"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns="http://example.org/def/statistics#">
  <csv:item>
    <country href="http://en.wikipedia.org/wiki/Afghanistan" />
    <year xsi:type="xsd:gYear">1960</year>
    <population xsi:type="xsd:integer">9616353</population>
  </csv:item>
  <csv:item>
    <country href="http://en.wikipedia.org/wiki/Afghanistan" />
    <year xsi:type="xsd:gYear">1961</year>
    <population xsi:type="xsd:integer">9799379</population>
  </csv:item>
  <csv:item>
    <country href="http://en.wikipedia.org/wiki/Afghanistan" />
    <year xsi:type="xsd:gYear">1962</year>
    <population xsi:type="xsd:integer">9989846</population>
  </csv:item>
  <csv:item>
    <country href="http://en.wikipedia.org/wiki/Afghanistan" />
    <year xsi:type="xsd:gYear">1963</year>
    <population xsi:type="xsd:integer">10188299</population>
  </csv:item>
</csv:collection>
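The way the relative references in a url prolog line become the absolute property URIs used in the RDF and XML above can be sketched in Python. This is an illustration only; the function name resolve_property_uris and its arguments are assumptions made for the example.

```python
from urllib.parse import urljoin


def resolve_property_uris(header, url_line, base_uri):
    """Map column names to the absolute property URIs from a url prolog line."""
    uris = {}
    for name, reference in zip(header, url_line):
        if name == "#" or reference == "":
            continue  # skip the # column and columns without a URI
        # References are resolved relative to the location of the file.
        uris[name] = urljoin(base_uri, reference)
    return uris
```

Resolving against the location of each file means that the af-population and ad-population files, which carry the same url prolog line, end up with identical property URIs.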
Applications may attempt to resolve the URIs in the url prolog lines; if they do so, this should resolve into a linked CSV file that describes the properties. In this example, http://example.org/def/statistics should contain something like:
$id,        label,     description
#country,   country,   "The country for which the population is being provided."
#year,      year,      "The year for which the population is being provided."
#population,population,"The number of people populating the given country in the given year."
To make it easier to use common vocabularies, a field within the URL prolog line may contain a CURIE (in the form prefix:name) as a shorthand for a URL. If a field within the URL prolog line starts with a recognised prefix, that prefix is expanded to its namespace and prepended to the remainder of the CURIE (after the colon). The recognised prefixes are:
prefix | namespace | description |
---|---|---|
Generic Vocabularies | ||
rel | http://www.iana.org/assignments/relation/ | IANA Link Relations |
schema | http://schema.org/ | schema.org |
Metadata Vocabularies | ||
dc | http://purl.org/dc/terms/ | Dublin Core Metadata Terms |
dct | http://purl.org/dc/terms/ | Dublin Core Metadata Terms |
cc | http://creativecommons.org/ns# | Creative Commons Rights Expression Language |
void | http://rdfs.org/ns/void# | VoID |
wdrs | http://www.w3.org/2007/05/powder-s# | POWDER-S |
Schema Vocabularies | ||
rdf | http://www.w3.org/1999/02/22-rdf-syntax-ns# | RDF |
rdfs | http://www.w3.org/2000/01/rdf-schema# | RDF Schema |
owl | http://www.w3.org/2002/07/owl# | OWL |
skos | http://www.w3.org/2004/02/skos/core# | SKOS |
skos-xl | http://www.w3.org/2008/05/skos-xl# | SKOS Extensions for Labels |
This list is largely based on hunches about which vocabularies are going to be useful in linked CSV documents, coupled with some dogma in pushing schema.org as the vocabulary to rule them all. An alternative would be to define the same prefixes as listed in http://www.w3.org/2011/rdfa-context/rdfa-1.1.
There's no support for declaring your own prefixes or declaring a default prefix/vocabulary.
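CURIE expansion with the fixed prefix list can be sketched as a simple lookup. The prefix table below is copied from the list of recognised prefixes; the names PREFIXES and expand_curie are assumptions made for this Python example.

```python
# Recognised prefixes and their namespaces, as listed above.
PREFIXES = {
    "rel": "http://www.iana.org/assignments/relation/",
    "schema": "http://schema.org/",
    "dc": "http://purl.org/dc/terms/",
    "dct": "http://purl.org/dc/terms/",
    "cc": "http://creativecommons.org/ns#",
    "void": "http://rdfs.org/ns/void#",
    "wdrs": "http://www.w3.org/2007/05/powder-s#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "owl": "http://www.w3.org/2002/07/owl#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "skos-xl": "http://www.w3.org/2008/05/skos-xl#",
}


def expand_curie(field: str) -> str:
    """Expand prefix:name when the prefix is recognised; otherwise return as-is."""
    prefix, colon, name = field.partition(":")
    if colon and prefix in PREFIXES:
        return PREFIXES[prefix] + name
    return field  # ordinary URIs such as http://... are left untouched
```

Because http is not a recognised prefix, absolute URLs pass through unchanged, as do relative references such as #name.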
Linked CSV files that describe the properties used within other linked CSV files SHOULD use the RDFS vocabulary, which contains properties such as rdfs:label and rdfs:comment, to provide details about the properties. For example:
#,  $id,        label,     description
url,,           rdfs:label,rdfs:comment
,   #country,   country,   "The country for which the population is being provided."
,   #year,      year,      "The year for which the population is being provided."
,   #population,population,"The number of people populating the given country in the given year."
Linked CSV files should be self-describing. They should include important metadata about the source of the data they contain, their license conditions, and links to other files that contain non-essential supplementary information. Although the file might be described within other files, and metadata might be made available through the HTTP headers, it is safer to embed this metadata within the file as there is no guarantee that metadata stored outside the file will be available as the data is passed around.
To provide metadata about the linked CSV document, the file has to contain a meta prolog line, which provides metadata about the file or records within the file. If there is a $id column, the value within that column indicates what the metadata is about: an empty value (or a missing $id column) indicates the metadata is associated with the file as a whole.
The remainder of each metadata line should hold the following values, in order:
$id
columntype
or lang
prolog lineurl
prolog line
In our example, the http://example.org/af-population file may be part of a series of files available for different countries, and the metadata provide a pointer to an index document (http://example.org/populations) and to a license for the file:
#,   country,                                 year,population
type,url,                                     time,integer
meta,index,                                   url, /populations
meta,license,                                 url, http://creativecommons.org/publicdomain/mark/1.0/
,    http://en.wikipedia.org/wiki/Afghanistan,1960,9616353
,    http://en.wikipedia.org/wiki/Afghanistan,1961,9799379
,    http://en.wikipedia.org/wiki/Afghanistan,1962,9989846
,    http://en.wikipedia.org/wiki/Afghanistan,1963,10188299
In this example, none of the remaining data lines have identifiers themselves. The corresponding JSON would be:
[
  {
    "@id": "http://example.org/af-population",
    "index": "http://example.org/populations",
    "license": "http://creativecommons.org/publicdomain/mark/1.0/"
  },
  { "country": "http://en.wikipedia.org/wiki/Afghanistan", "year": 1960, "population": 9616353 },
  { "country": "http://en.wikipedia.org/wiki/Afghanistan", "year": 1961, "population": 9799379 },
  { "country": "http://en.wikipedia.org/wiki/Afghanistan", "year": 1962, "population": 9989846 },
  { "country": "http://en.wikipedia.org/wiki/Afghanistan", "year": 1963, "population": 10188299 }
]
The corresponding RDF would be:
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix rel: <http://www.iana.org/assignments/relation/> .
@prefix : <http://example.org/af-population#> .

<>
  rel:describedby <http://example.org/af-population#row:1> , <http://example.org/af-population#row:2> ;
  :index <populations> ;
  :license <http://creativecommons.org/publicdomain/mark/1.0/> ;
  .

[
  rel:describedby <http://example.org/af-population#row:3> ;
  :country <http://en.wikipedia.org/wiki/Afghanistan> ;
  :year "1960"^^xsd:gYear ;
  :population 9616353
] .

[
  rel:describedby <http://example.org/af-population#row:4> ;
  :country <http://en.wikipedia.org/wiki/Afghanistan> ;
  :year "1961"^^xsd:gYear ;
  :population 9799379
] .

[
  rel:describedby <http://example.org/af-population#row:5> ;
  :country <http://en.wikipedia.org/wiki/Afghanistan> ;
  :year "1962"^^xsd:gYear ;
  :population 9989846
] .

[
  rel:describedby <http://example.org/af-population#row:6> ;
  :country <http://en.wikipedia.org/wiki/Afghanistan> ;
  :year "1963"^^xsd:gYear ;
  :population 10188299
] .
Metadata prolog lines can also be used to provide metadata about other parts of the linked CSV file by using URI Identifiers for the text/csv Media Type. These can be used to refer to rows, columns, and sets of rows that have common value(s) for particular fields. For example:
#,   $id,                                     country,english name,                   french name
url, ,                                        ,       #name,                          #name
lang,,                                        ,       en,                             fr
meta,#col:english%20name,                     note,   "contains both official and popular names",
,    http://en.wikipedia.org/wiki/Andorra,    AD,     Andorra,                        Andorre
,    http://en.wikipedia.org/wiki/Andorra,    ,       Principality of Andorra,
,    http://en.wikipedia.org/wiki/Afghanistan,AF,     Afghanistan,                    Afghanistan
,    http://en.wikipedia.org/wiki/Afghanistan,,       Islamic Republic of Afghanistan,
A prolog line in which the value of the # column is see provides pointers to other linked CSV files that describe the resources in appropriate columns.
Within a see line, columns that hold URI values (having url in the corresponding value of the type prolog line) can reference additional linked CSV files that describe the entities identified by the URIs in that column. For example, the population data within http://example.org/af-population references a country described within http://example.org/countries. The population file would include:
#,   country,                                 year,population
type,url,                                     time,integer
see, /countries,                              ,
,    http://en.wikipedia.org/wiki/Afghanistan,1960,9616353
,    http://en.wikipedia.org/wiki/Afghanistan,1961,9799379
,    http://en.wikipedia.org/wiki/Afghanistan,1962,9989846
,    http://en.wikipedia.org/wiki/Afghanistan,1963,10188299
This indicates that an application can look within http://example.org/countries to find more information about some or all of the URIs within the country column. The URIs within the $id column in that file should match the URIs within the country column in this file.
If there is no type prolog line, a value in a see prolog line indicates that the column holds URIs (as if the type was set to url). If there is a type prolog line but the type of the column has a value other than url, values in the see prolog lines for that column are ignored.
This technique can also be used to point to additional data about the entities described within the linked CSV file itself. For example, if another publisher also published a linked CSV file containing information about countries at http://other.example.com/countries (perhaps providing their names in other languages or describing their capital cities), we could reference it from the http://example.org/countries file as follows:
#,   $id,                                     country,english name,                   french name
url, ,                                        ,       #name,                          #name
lang,,                                        ,       en,                             fr
see, http://other.example.com/countries,      ,       ,
,    http://en.wikipedia.org/wiki/Andorra,    AD,     Andorra,                        Andorre
,    http://en.wikipedia.org/wiki/Andorra,    ,       Principality of Andorra,
,    http://en.wikipedia.org/wiki/Afghanistan,AF,     Afghanistan,                    Afghanistan
,    http://en.wikipedia.org/wiki/Afghanistan,,       Islamic Republic of Afghanistan,
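The rules for when a value in a see prolog line is honoured can be sketched as a small Python function. The name see_reference and its parameters are assumptions made for illustration. Treating a column whose type value is empty, in a file that does have a type prolog line, as "other than url" is also an assumption, as is treating the $id column like any other column.

```python
from typing import Optional
from urllib.parse import urljoin


def see_reference(see_value: str, has_type_line: bool,
                  column_type: str, base_uri: str) -> Optional[str]:
    """Return the linked CSV file to consult for a column, or None."""
    if see_value == "":
        return None  # nothing referenced for this column
    if has_type_line and column_type != "url":
        return None  # see values for non-url columns are ignored
    # Either the column is typed url, or there is no type prolog line and
    # the see value itself implies that the column holds URIs.
    return urljoin(base_uri, see_value)
```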
It is useful to be able to package together sets of linked CSV files, which may include multiple interrelated tables of data. A linked CSV package is simply a set of such files within a zip. These files should use relative links when pointing to other files within the package.
The entry point for a linked CSV package is always named index.csv. The index file is interpreted in the same way as any other linked CSV file, but the entities it describes are the files within the package. As well as an $id column, the index file will usually have a description column or similar, for example:
$id,              description
countries.csv,    "list of countries"
populations.csv,  "index of files containing information about the populations in different countries"
ad-population.csv,"populations of Andorra"
af-population.csv,"populations of Afghanistan"
...
Adding metadata within the index file is useful if it can help recipients understand the structure of the package as a whole. Sufficient metadata should be listed within the index file to enable the recipient to tell whether each file should be opened, but the majority of the metadata about the file should be included within the file itself.
TODO: Reference schema.org dataset vocabulary here?
An alternative design would be to use the http://www.iana.org/assignments/relation/item property to indicate the relationship between the index file and the items in the package; that relationship could then be used recursively so that there's no need to list all the files in the package within the index file. In that case, the index file would look like:
#,   $id,item
type,,   url
url, ,   http://www.iana.org/assignments/relation/item
,    ,   countries.csv
,    ,   populations.csv
,    ,   ad-population.csv
,    ,   af-population.csv
...
The disadvantage with this is that it's more difficult to add metadata about the files themselves.
The manifest may list any number of the files: it does not need to list them all, merely to provide entry points such that the others can be located through the see prolog lines in the files or through the URLs in the $id column or other columns labelled with the type url.
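Reading the entry point of a package can be sketched with the Python standard library. This is a minimal illustration that assumes the simple index layout shown first (a $id column plus descriptive columns, no prolog lines); the function name read_package_index is hypothetical.

```python
import csv
import io
import zipfile


def read_package_index(package):
    """Return the records of index.csv from a linked CSV package (a zip)."""
    with zipfile.ZipFile(package) as archive:
        with archive.open("index.csv") as raw:
            # Linked CSV files are UTF-8; newline="" leaves line endings
            # for the csv module to interpret.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return [dict(record) for record in csv.DictReader(text)]
```

Each returned record names a file in the package through its $id value, which can then be opened from the same archive.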
Linked CSV does not have to be mapped to JSON, but it can be used to create a JSON document (or, in the case of a package of linked CSV files, a collection of JSON documents) for systems that store information as JSON. Two conversions are provided here. One generates a simple JSON format that loses much of the information that is encoded within the linked CSV file; the other generates a more complex JSON-LD file that preserves that information.
The result of this parsing is an array of objects, one per entity in the linked CSV. An entity is generated for each data line that does not have a $id value, and for each unique $id value. If the entity has an identifier value in the $id column, the JSON object is given a "@id" property whose value is that URI identifier resolved against the base URI of the linked CSV document. Thus each JSON object is associated with a sequence of one or more data lines from the linked CSV file.
Each column within the linked CSV file is mapped to a property within the JSON file, as follows:
url
prolog line and the url
prolog line contains a URI for the column then
#
)
As a result of this algorithm, multiple columns may be mapped to a single property. Where there are multiple columns mapping to a single property, that property is marked as expecting arrays. If any of the columns comprising the property has a value within the lang prolog line, the property is marked as a language property.
Each sequence of data lines associated with the JSON object is processed as follows. A property is created within the JSON object for each property for which the data lines provide values (properties with no values are left undefined). If the property expects arrays, it will be assigned an array of values even if only one value is provided within the data lines. Each value is then processed as follows:
value
and, if the column from which the value comes has a value in the lang
prolog line, a lang
property with that languagetype
prolog line or inferred from the syntax of the value, as described in ) is of the type integer
, decimal
or double
, then if it is numeric, it is mapped to a number, otherwise to null
boolean
, if it has the value true
or false
, it is mapped to a boolean, otherwise to null
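The value rules above can be sketched as a single function. Several points are assumptions made for this example: the function name and parameters, the reading that values of a language property become objects with a value property (plus lang where the column supplies one), the fallback of leaving other values as strings, and the use of Python's own int and float parsing as the test for "numeric", which is not identical to the XML Schema lexical rules.

```python
NUMERIC_TYPES = {"integer", "decimal", "double"}


def convert_value(raw, type_name=None, lang=None, language_property=False):
    """Convert one field value to its JSON representation (a sketch)."""
    if language_property:
        # Assumed reading: language properties hold objects, not bare strings.
        obj = {"value": raw}
        if lang:
            obj["lang"] = lang
        return obj
    if type_name in NUMERIC_TYPES:
        try:
            return int(raw) if type_name == "integer" else float(raw)
        except ValueError:
            return None  # not numeric: mapped to null
    if type_name == "boolean":
        # Only the exact strings "true" and "false" are accepted.
        return {"true": True, "false": False}.get(raw)
    return raw  # assumption: everything else stays a string
```

A caller would apply this to every value of a property and wrap the results in a list when the property expects arrays.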
TODO: handle recursive processing into referenced linked CSV files
TODO
Linked CSV does not have to be mapped to XML, but it can be used to create an XML document (or, in the case of a package of linked CSV files, a collection of XML documents) for systems that store information as XML.
The namespace for the standard elements is http://example.org/linked-csv
which is conventionally associated with the prefix csv
. The document element is named <csv:collection>
. It is given the following attributes:
xml:base
attribute whose value is the base URI of the linked CSV file
xmlns:csv
namespace declaration for the namespace http://example.org/linked-csv
xmlns:xsd
namespace declaration for the namespace http://www.w3.org/2001/XMLSchema
xmlns:xsi
namespace declaration for the namespace http://www.w3.org/2001/XMLSchema-instance
A <csv:item>
element is generated for each entity in the linked CSV. The entities are uniquely identified by the value of the $id
column; data lines with the same $id
are merged into a single <csv:item>
element, though a separate <csv:item>
element is generated for each data line with no $id
value. The value of the $id
column becomes the value of the @href
attribute on the <csv:item>
element.
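The generation of item elements can be sketched with Python's standard xml.etree.ElementTree module. This is an illustration under assumptions: the function name build_collection and the list-of-dicts input are hypothetical, the $id value is copied to href unchanged, and the sketch omits the xmlns:xsd and xmlns:xsi declarations and all child elements.

```python
import xml.etree.ElementTree as ET

CSV_NS = "http://example.org/linked-csv"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_collection(rows, base_uri):
    """Build a csv:collection element with one csv:item per entity."""
    ET.register_namespace("csv", CSV_NS)
    root = ET.Element(f"{{{CSV_NS}}}collection", {f"{{{XML_NS}}}base": base_uri})
    seen = set()  # $id values that already have an item
    for row in rows:
        ident = row.get("$id")
        if ident and ident in seen:
            continue  # this data line merges into the existing item
        item = ET.SubElement(root, f"{{{CSV_NS}}}item")
        if ident:
            item.set("href", ident)
            seen.add(ident)
    return root
```

Calling ET.tostring(root, encoding="unicode") on the result serialises the tree with the csv prefix bound to the namespace above.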
Within the <csv:item>
element, a child element is generated for each unique value of each property (values from different columns, which may have different vocabularies, datatypes or languages, create separate elements). Note that the $id
column, if it exists, is not processed in this way. The name of the child element is determined as follows:
url
prolog line and the url
prolog line contains a URI for the column then
#
)
#
if the URI contains a #
, or the final /
if it does not; the local name is based on the substring of the URI after the #
or /
TODO: normalisation of property names into XML names
The attributes and content of the child element are determined as follows:
url
, the element is given an href
attribute whose value is the URI in the relevant field
xsi:type
attribute whose value is xsd:datatype
lang
prolog line, add an xml:lang
attribute whose value is the language in that prolog line
TODO: handle recursive processing into referenced linked CSV files
Linked CSV does not have to be mapped to RDF, but it can be used to create a graph (or, in the case of a package of linked CSV files, a set of graphs) for systems that store information as RDF.
Each data line describes a resource, which has properties whose URIs are generated based on the names of the columns given in the header and the URIs given in the url
prolog line, and values based on the values given within the data lines.
If the data line has a $id
value, this gives the URI for the resource (resolved against the base URI of the linked CSV file). If it does not have a $id
value, it is a blank node. Either way, a triple must be generated of the form:
resource <http://www.iana.org/assignments/relation/describedby> CSV-line .
where the CSV-line is a reference to the row that describes the resource, using a fragment identifier of the form #row:N
. Note that there may be many such describedby
statements for a single resource if its description is split over several lines.
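The describedby statement can be sketched as a function that emits one N-Triples line. The function name, its parameters and the caller-supplied blank node label are assumptions made for this example; the row number is passed in as-is, since the text above does not say where numbering starts.

```python
from urllib.parse import urldefrag, urljoin

DESCRIBEDBY = "http://www.iana.org/assignments/relation/describedby"


def describedby_triple(row_id, row_number, base_uri, blank_label):
    """Return the describedby statement for one data line as N-Triples."""
    if row_id:
        # The $id value is resolved against the base URI of the file.
        subject = f"<{urljoin(base_uri, row_id)}>"
    else:
        # No $id value: the resource is a blank node.
        subject = f"_:{blank_label}"
    # Reference the row with a #row:N fragment on the file's own URI.
    line_ref = f"{urldefrag(base_uri)[0]}#row:{row_number}"
    return f"{subject} <{DESCRIBEDBY}> <{line_ref}> ."
```

Calling it once per data line yields several statements for a resource whose description spans several lines, as noted above.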
If there is a url
prolog line in the linked CSV file, and it contains a value in a given column, this is used as the URI for the property. Otherwise, the property URI is constructed by resolving the fragment identifier #escaped-header
against the base URI of the linked CSV file, where escaped-header is the URL-escaped (percent-encoded) version of the header for the column.
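The property URI rule above can be sketched as follows. The function name and parameters are assumptions made for this example, as is the choice to percent-encode every reserved character in the header and to return the value from the url prolog line unchanged.

```python
from urllib.parse import quote, urldefrag


def property_uri(header, base_uri, url_value=None):
    """Return the property URI for a column (a sketch)."""
    if url_value:
        # A value in the url prolog line takes precedence.
        return url_value
    # Otherwise build #escaped-header on the base URI of the file.
    escaped = quote(header, safe="")
    return f"{urldefrag(base_uri)[0]}#{escaped}"
```

For a column named population density in a file at http://example.org/countries.csv with no url prolog value, this gives http://example.org/countries.csv#population%20density.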
For each data line, an RDF statement is generated for each column aside from the #
and $id
columns. The URI of the property is determined as above. The value of the property is interpreted as one of:
http://www.w3.org/2001/XMLSchema#
to get the datatype URI
lang
prolog line, a literal value with the language indicated
http://www.w3.org/2001/XMLSchema#string
Multiple identical triples may be generated through this process if the resource is described by more than one row; because an RDF graph is a set of triples, these duplicates are merged naturally.
TODO: handle recursive processing into referenced linked CSV files
TODO: This wouldn't be too hard to do, though lossy.
This work is inspired by Google's Dataset Publishing Language and OKFN's Simple Data Format, along with some suggestions from Francis Irving and review by John Sheridan, Leigh Dodds and Tim Berners-Lee.