Many open data sets are essentially tables, or sets of tables, which follow the same regular structure. This document describes a set of conventions for CSV files that enable them to be linked together and to be interpreted as RDF.

Introduction

The requirements on which this format is based are:

Structure

The structure of a CSV file is a header followed by a number of records. The header is the first line of the file, while the remaining lines are the records. Both the header and the records contain fields separated by commas. These terms are used as defined in [[RFC4180]]. Within this document, a column is a set of fields which are at the same index within their respective rows and the column name is the value of the field in the header for that column. For example, the following is a valid CSV file which lists country codes and names:

country,name
AD,Andorra
AF,Afghanistan
AI,Anguilla
AL,Albania
      

All valid CSV files are valid linked CSV files, so the above example is also a valid linked CSV file. It has four records and two columns, whose names are country and name.

Valid CSV files MUST use CRLF to indicate the ends of lines (and thus the separation of rows). Linked CSV parsers SHOULD provide a warning if CR or LF is used for line endings, and SHOULD recover by parsing the CSV file with those line endings.

Spreadsheet programs such as Excel or OpenOffice Calc typically use the line ending of the platform on which they are deployed (e.g. simply LF on Mac OS X). Allowing other line endings for linked CSV is intended to make it easier to create such documents within spreadsheet programs.
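
A lenient parser along these lines might look like the following sketch in Python. The function name parse_lenient is hypothetical, and the check is deliberately simple: it also reports line breaks that occur inside quoted fields.

```python
import csv
import io
import re
import warnings


def parse_lenient(text):
    """Parse CSV text, warning when line endings other than CRLF are found."""
    # A bare LF (not preceded by CR) or a bare CR (not followed by LF)
    # is tolerated but reported.
    if re.search(r"(?<!\r)\n|\r(?!\n)", text):
        warnings.warn("line ending other than CRLF found; recovering")
    # newline="" leaves line endings untranslated so that the csv module
    # can recognise CR, LF and CRLF itself.
    return list(csv.reader(io.StringIO(text, newline="")))
```

The sketch recovers by parsing the file regardless of the line ending used, which is the behaviour suggested by the SHOULD above.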

The aim of processing a linked CSV file is to generate information about a set of entities. An entity may be represented internally by the application as an object or a resource. Each entity has a number of properties, which may have one or more values.

Records within a linked CSV file may be of two different types: prolog lines (see Prolog Lines) and data lines. Data lines can only come after the last prolog line, if there is one. A data line is a line that contains data about an entity. A single entity may be described across multiple data lines. For each data line describing an entity, each value within the line corresponds to a value of a property of that entity (the property being labelled through the corresponding header).

The JSON version of this file, as defined in Mapping to JSON, is:

[{
  "country": "AD",
  "name": "Andorra"
},{
  "country": "AF",
  "name": "Afghanistan"
},{
  "country": "AI",
  "name": "Anguilla"
},{
  "country": "AL",
  "name": "Albania"
}]
      

Linked CSV files must be encoded as UTF-8.

It isn't usually easy to set the encoding of a CSV file when exporting from normal spreadsheet programs. It would be nice if there were a way of detecting the encoding. Perhaps it could be sniffed based on the initial characters #, in the file (with UTF-8 assumed if those aren't the initial characters)?

Identifiers

Linked CSV is built around the concept of using URIs to name things. Every record, column, and even slices of data, in a linked CSV file is addressable using URI Identifiers for the text/csv Media Type. For example, if the linked CSV file is accessed at http://example.org/countries, the first record in the CSV file above, which happens to be the first data line within the linked CSV file (which describes Andorra) is addressable with the URI:

http://example.org/countries#row:0

However, this addressing merely identifies the records within the linked CSV file, not the entities that the record describes. This distinction is important for two reasons:

By default, each data line describes an entity, each entity is described by a single data line, and there is no way to address the entities. However, adding a $id column enables entities to be given identifiers. These identifiers are always URIs, and they are interpreted relative to the location of the linked CSV file. The $id column may be positioned anywhere but by convention it should be the first column (unless there is a # column, in which case it should be the second). For example:

$id,country,name
#AD,AD,     Andorra
#AD,AD,     Principality of Andorra
#AF,AF,     Afghanistan
#AF,AF,     Islamic Republic of Afghanistan
        

For the purpose of clarity within this document, whitespace has been added to this and the remainder of the examples so that headers and values line up correctly. Whitespace within linked CSV files is normally significant.

The prefix $ is used because the prefix @ is interpreted as indicating a formula when entered into spreadsheet programs such as Excel.

This linked CSV file contains two entities, which have the identifiers http://example.org/countries#AD and http://example.org/countries#AF. The first is described by the first two data lines and the second by the next two. The JSON generated for this file would be:

[{
  "@id": "http://example.org/countries#AD",
  "country": "AD",
  "name": [ "Andorra", "Principality of Andorra" ]
},{
  "@id": "http://example.org/countries#AF",
  "country": "AF",
  "name": [ "Afghanistan", "Islamic Republic of Afghanistan" ]
}]
        

and the RDF would be:

@prefix rel: <http://www.iana.org/assignments/relation/> .
@prefix : <http://example.org/countries#> .
<http://example.org/countries#AD>
	rel:describedby <http://example.org/countries#row:0> , <http://example.org/countries#row:1> ;
	:country "AD" ;
	:name "Andorra" , "Principality of Andorra" ;
	.

<http://example.org/countries#AF>
	rel:describedby <http://example.org/countries#row:2> , <http://example.org/countries#row:3> ;
	:country "AF" ;
	:name "Afghanistan" , "Islamic Republic of Afghanistan" ;
	.
      	

As shown by this example, when multiple data lines describe a single entity, a given property takes only the distinct values within the column for that entity; repeated values are not duplicated. The file can be made shorter by leaving out the repeated values in the first place; the following CSV is equivalent:

$id,country,name
#AD,AD,     Andorra
#AD,,       Principality of Andorra
#AF,AF,     Afghanistan
#AF,,       Islamic Republic of Afghanistan
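
The grouping of data lines into entities could be sketched in Python as follows. This is an illustration rather than a normative algorithm: the function name read_entities is hypothetical, the file is assumed to have a $id column and no prolog lines, and the input is assumed to contain no padding whitespace.

```python
import csv
import io
from urllib.parse import urljoin


def read_entities(text, file_uri):
    """Group data lines by $id, keeping only distinct, non-empty values."""
    rows = csv.reader(io.StringIO(text, newline=""))
    header = next(rows)
    entities = {}
    for record in rows:
        fields = dict(zip(header, record))
        # Identifiers are URIs resolved against the location of the file.
        entity_id = urljoin(file_uri, fields.pop("$id"))
        entity = entities.setdefault(entity_id, {})
        for name, value in fields.items():
            values = entity.setdefault(name, [])
            # Empty fields and repeated values add nothing to the entity.
            if value and value not in values:
                values.append(value)
    return entities
```

Under these assumptions, the file with repeated country codes and the shorter file without them produce the same entities.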
        

Interpreting Identifiers

By default, properties within the linked CSV file are assumed to apply to the thing described by the resource located by the URI identifier. For example, if the file contained identifier URIs that were Wikipedia pages, as in

$id,                                     country,name
http://en.wikipedia.org/wiki/Andorra,    AD,     Andorra
http://en.wikipedia.org/wiki/Andorra,    AD,     Principality of Andorra
http://en.wikipedia.org/wiki/Afghanistan,AF,     Afghanistan
http://en.wikipedia.org/wiki/Afghanistan,AF,     Islamic Republic of Afghanistan
          

applications should interpret the properties labelled country and name to apply to the countries described by those Wikipedia pages, not the Wikipedia pages themselves. In general this distinction does not matter, but it may do when using linked CSV to describe resources that are available on the web. Individual properties may be used differently, and apply to the content found at the referenced URI; how they are interpreted should be incorporated into the property documentation.

Prolog Lines

A linked CSV file can contain any number of prolog lines. Prolog lines describe additional processing of the linked CSV file, usually related to the file or some portion of the file, or related to some or all of the columns. Prolog lines can only be present if there is a column named #; any record that has a value in that column is a prolog line, and the value for that column indicates how the line should be interpreted:

type
This value indicates that the line provides information about the type of the values in each column
lang
This value indicates that the line provides information about the language of the values in each column
meta
This value indicates that the line provides metadata about the linked CSV file or rows within it
url
This value indicates that the line provides global URIs for the properties in each column
see
This value indicates that the line provides details of additional resources that may provide information about some or all of the entities whose identifiers are given within the column
empty
Having no value in the # column indicates that the line is a data line rather than a prolog line

Prolog lines must all be at the start of a linked CSV file. Any prolog lines that appear after the first data line must be ignored by processors. Prolog lines of different types can appear in any order.

Ignoring prolog lines that appear after the first data line aids streaming processing of linked CSV files, the hiding of prolog information within spreadsheet applications, and ease of reading for humans.
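
The separation of prolog lines from data lines could be sketched in Python as follows. The function name split_prolog is hypothetical, and records are assumed to have already been parsed into lists of fields.

```python
def split_prolog(header, records):
    """Separate prolog lines from data lines using the # column."""
    if "#" not in header:
        # Without a # column every record is a data line.
        return [], list(records)
    marker = header.index("#")
    prolog, data = [], []
    for record in records:
        kind = record[marker] if marker < len(record) else ""
        if kind and not data:
            prolog.append(record)
        elif not kind:
            data.append(record)
        # Otherwise this is a prolog line after the first data line,
        # which is ignored.
    return prolog, data
```

Because the decision for each record depends only on the records before it, the same logic can be applied while streaming through a file.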

Could add other kinds of prolog lines. The thing to do is probably to have a separate registry of prolog line types that provide for configuration of the processing that should be done on the values in particular columns. For example, you could have prolog lines that enable you to specify a separator used within the values, to enable the creation of list values, or a date-syntax line that enables you to specify the date syntax used in the values in that particular column.

Property Types

In the simple CSV example we have been looking at, all the values are strings, which works fine for country codes and names. We will now introduce a separate file, http://example.org/af-population, which initially looks like:

country,year,population
AF,     1960,9616353
AF,     1961,9799379
AF,     1962,9989846
AF,     1963,10188299
          

In this example, the property year holds years and the property population holds an integer. To indicate the types of these properties, we can add a type prolog line. The value of a type prolog line indicates the type of the values in the column that it is in. The type must be one of:

  • string
  • url
  • integer
  • decimal
  • double
  • boolean (true or false)
  • time — values of this type can be any of the date/time syntaxes supported by XML Schema, namely gYear, gMonth, gDay, gYearMonth, gMonthDay, date, time, dateTime

If there is no type given for the column in a type prolog line, the default type for a particular value depends on the syntax of the value, as follows:

  • values matching XML Schema date/time syntax (aside from xs:gYear) are assumed to be date/time values
  • values matching [0-9]+ are assumed to be integers
  • values matching [0-9]+\.[0-9]+ are assumed to be decimal numbers
  • values matching [0-9]+(\.[0-9]+)?[eE][-+][0-9]+(\.[0-9]+)? are assumed to be floating point numbers
  • the value true is assumed to be the boolean value true, and the value false the boolean value false
  • otherwise, the value is assumed to be a string
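
The rules above could be sketched in Python as follows. The function name default_type is hypothetical, and the date/time patterns are a simplification of the XML Schema syntaxes (time zones and fractional seconds are left out); the numeric patterns are the ones given in the list.

```python
import re

# Simplified patterns for the XML Schema date/time forms other than gYear.
_DATE = r"-?\d{4,}-\d{2}-\d{2}"
_TIME = r"\d{2}:\d{2}:\d{2}"
_TIME_PATTERNS = [re.compile(p) for p in (
    _DATE,
    _TIME,
    f"{_DATE}T{_TIME}",  # dateTime
    r"-?\d{4,}-\d{2}",   # gYearMonth
    r"--\d{2}-\d{2}",    # gMonthDay
    r"--\d{2}",          # gMonth
    r"---\d{2}",         # gDay
)]
_INTEGER = re.compile(r"[0-9]+")
_DECIMAL = re.compile(r"[0-9]+\.[0-9]+")
_DOUBLE = re.compile(r"[0-9]+(\.[0-9]+)?[eE][-+][0-9]+(\.[0-9]+)?")


def default_type(value):
    """Return the default type name for an untyped value."""
    if any(p.fullmatch(value) for p in _TIME_PATTERNS):
        return "time"
    if _INTEGER.fullmatch(value):
        return "integer"
    if _DECIMAL.fullmatch(value):
        return "decimal"
    if _DOUBLE.fullmatch(value):
        return "double"
    if value in ("true", "false"):
        return "boolean"
    return "string"
```

Note that a bare year such as 1960 is classified as an integer, which is why the year column in the example that follows needs an explicit time annotation.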

Could enable quoting of values using """...""" delimited values within the CSV?

In the example above, we can add a type prolog line to indicate the types of the properties that are created. We can also change the country column to use the Wikipedia URIs that we previously used for the countries, and indicate that this is being done by giving its type as url. Since the population figures are all syntactically integers, there is no need to annotate that column with a type, but such an annotation can be added for clarity:

#,   country,                                 year,population
type,url,                                     time,integer
,    http://en.wikipedia.org/wiki/Afghanistan,1960,9616353
,    http://en.wikipedia.org/wiki/Afghanistan,1961,9799379
,    http://en.wikipedia.org/wiki/Afghanistan,1962,9989846
,    http://en.wikipedia.org/wiki/Afghanistan,1963,10188299
          

Conversion to JSON cannot preserve all this information as it does not support date/time datatypes. The resulting data would include the years as integers:

[{
  "country": "http://en.wikipedia.org/wiki/Afghanistan",
  "year": 1960,
  "population": 9616353
}, {
  "country": "http://en.wikipedia.org/wiki/Afghanistan",
  "year": 1961,
  "population": 9799379
}, {
  "country": "http://en.wikipedia.org/wiki/Afghanistan",
  "year": 1962,
  "population": 9989846
}, {
  "country": "http://en.wikipedia.org/wiki/Afghanistan",
  "year": 1963,
  "population": 10188299
}]
          

The mapping to RDF can preserve the datatype information:

@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix rel: <http://www.iana.org/assignments/relation/> .
@prefix : <http://example.org/af-population#> .

[ rel:describedby <http://example.org/af-population#row:1> ;
  :country <http://en.wikipedia.org/wiki/Afghanistan> ;
  :year "1960"^^xsd:gYear ;
  :population 9616353 ] .

[ rel:describedby <http://example.org/af-population#row:2> ;
  :country <http://en.wikipedia.org/wiki/Afghanistan> ;
  :year "1961"^^xsd:gYear ;
  :population 9799379 ] .

[ rel:describedby <http://example.org/af-population#row:3> ;
  :country <http://en.wikipedia.org/wiki/Afghanistan> ;
  :year "1962"^^xsd:gYear ;
  :population 9989846 ] .

[ rel:describedby <http://example.org/af-population#row:4> ;
  :country <http://en.wikipedia.org/wiki/Afghanistan> ;
  :year "1963"^^xsd:gYear ;
  :population 10188299 ] .
          

In generating the Turtle, the syntax of the values in the year column is used to determine what kind of date/time value each value should be mapped on to. Without the time annotation, the values would be mapped to integers.
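
Choosing the XML Schema datatype from the syntax of a time value could be sketched in Python as follows. The function name xsd_time_datatype is hypothetical, and the patterns are simplified: time zones and fractional seconds are not handled.

```python
import re

# Simplified lexical forms of the XML Schema date/time datatypes.
_XSD_TIME_FORMS = [
    ("dateTime", r"-?\d{4,}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}"),
    ("date", r"-?\d{4,}-\d{2}-\d{2}"),
    ("time", r"\d{2}:\d{2}:\d{2}"),
    ("gYearMonth", r"-?\d{4,}-\d{2}"),
    ("gYear", r"-?\d{4,}"),
    ("gMonthDay", r"--\d{2}-\d{2}"),
    ("gMonth", r"--\d{2}"),
    ("gDay", r"---\d{2}"),
]


def xsd_time_datatype(value):
    """Return the datatype for a value in a time column, or None."""
    for name, pattern in _XSD_TIME_FORMS:
        if re.fullmatch(pattern, value):
            return "xsd:" + name
    return None
```

With this sketch, the values in the year column above map to xsd:gYear, matching the Turtle shown.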

Languages

A lang prolog line indicates the language used within each column. For example, the file that contains the country details can also be expanded to include the names of the countries in other languages:

#,   $id,                                     country,english name,                   french name
lang,,                                        ,       en,                             fr
,    http://en.wikipedia.org/wiki/Andorra,    AD,     Andorra,                        Andorre
,    http://en.wikipedia.org/wiki/Andorra,    ,       Principality of Andorra,
,    http://en.wikipedia.org/wiki/Afghanistan,AF,     Afghanistan,                    Afghanistan
,    http://en.wikipedia.org/wiki/Afghanistan,,       Islamic Republic of Afghanistan,
          

In this case, the values of the english name column are labelled as being in English while those in the french name column are labelled as being in French. The JSON would look like:

[{
  "@id": "http://en.wikipedia.org/wiki/Andorra",
  "country": "AD",
  "english name": [{
    "value": "Andorra",
    "lang": "en"
  }, {
    "value": "Principality of Andorra",
    "lang": "en"
  }],
  "french name": {
    "value": "Andorre",
    "lang": "fr"
  }
},{
  "@id": "http://en.wikipedia.org/wiki/Afghanistan",
  "country": "AF",
  "english name": [{
    "value": "Afghanistan",
    "lang": "en"
  }, {
    "value": "Islamic Republic of Afghanistan",
    "lang": "en"
  }],
  "french name": {
    "value": "Afghanistan",
    "lang": "fr"
  }
}]
          

The Turtle would look like:

@prefix rel: <http://www.iana.org/assignments/relation/> .
@prefix : <http://example.org/countries#> .

<http://en.wikipedia.org/wiki/Andorra>
  rel:describedby
    <http://example.org/countries#row:1>,
    <http://example.org/countries#row:2> ;
  :country "AD" ;
  :english.name "Andorra"@en, "Principality of Andorra"@en ;
  :french.name "Andorre"@fr ;
  .

<http://en.wikipedia.org/wiki/Afghanistan>
  rel:describedby
    <http://example.org/countries#row:3>,
    <http://example.org/countries#row:4> ;
  :country "AF" ;
  :english.name "Afghanistan"@en , "Islamic Republic of Afghanistan"@en ;
  :french.name "Afghanistan"@fr ;
  .
          

Global Property Identifiers

When there are separate columns providing values in different languages for the same property, or when a large dataset is split across multiple files (as in the example here, where the set of population figures is split across country-specific files such as http://example.org/af-population), it is useful to be able to indicate when the separate labels in the CSV headers refer to the same property of a given entity.

To facilitate this, url prolog lines can indicate global identifiers for the properties. These lines contain URIs which are resolved relative to the location of the file itself. In the previous example, the two headers english name and french name both refer to the same name property. We can use a url line to indicate that these both refer to the same property:

#,   $id,                                     country,english name,                   french name
url, ,                                        ,       #name,                          #name
lang,,                                        ,       en,                             fr
,    http://en.wikipedia.org/wiki/Andorra,    AD,     Andorra,                        Andorre
,    http://en.wikipedia.org/wiki/Andorra,    ,       Principality of Andorra,
,    http://en.wikipedia.org/wiki/Afghanistan,AF,     Afghanistan,                    Afghanistan
,    http://en.wikipedia.org/wiki/Afghanistan,,       Islamic Republic of Afghanistan,
          

When this is converted to JSON, the URI for the property is processed to give just the property name:

[{
  "@id": "http://en.wikipedia.org/wiki/Andorra",
  "country": "AD",
  "name": [{
    "value": "Andorra",
    "lang": "en"
  }, {
    "value": "Andorre",
    "lang": "fr"
  }, {
    "value": "Principality of Andorra",
    "lang": "en"
  }]
},{
  "@id": "http://en.wikipedia.org/wiki/Afghanistan",
  "country": "AF",
  "name": [{
    "value": "Afghanistan",
    "lang": "en"
  }, {
    "value": "Afghanistan",
    "lang": "fr"
  }, {
    "value": "Islamic Republic of Afghanistan",
    "lang": "en"
  }]
}]
          

In the conversion to RDF, the RDF includes the labels for the properties:

@prefix rel: <http://www.iana.org/assignments/relation/> .
@prefix rdfs: <...> .
@prefix : <http://example.org/countries#> .

<http://en.wikipedia.org/wiki/Andorra>
  rel:describedby
    <http://example.org/countries#row:2>,
    <http://example.org/countries#row:3> ;
  :country "AD" ;
  :name "Andorra"@en, "Andorre"@fr, "Principality of Andorra"@en ;
  .

<http://en.wikipedia.org/wiki/Afghanistan>
  rel:describedby
    <http://example.org/countries#row:4>,
    <http://example.org/countries#row:5> ;
  :country "AF" ;
  :name "Afghanistan"@en , "Afghanistan"@fr, "Islamic Republic of Afghanistan"@en ;
  .

:name
  rdfs:label "english name" , "french name" ;
  .
          

When properties are shared across multiple files, the URIs in the url prolog line should resolve to the same URL. For example, if we wanted to indicate that the country property within the af-population file means the same as the country property within the ad-population file, we could associate them both with the same URI by adding the same url prolog line in both files:

#,   country,                                  year,                population
type,url,                                      time,                integer
url, /def/statistics#country,                  /def/statistics#year,/def/statistics#population
,    http://en.wikipedia.org/wiki/Afghanistan, 1960,                9616353
,    http://en.wikipedia.org/wiki/Afghanistan, 1961,                9799379
,    http://en.wikipedia.org/wiki/Afghanistan, 1962,                9989846
,    http://en.wikipedia.org/wiki/Afghanistan, 1963,                10188299
          

The resulting RDF would use these URLs for the country, year and population properties:

@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix rel: <http://www.iana.org/assignments/relation/> .
@prefix : <http://example.org/def/statistics#> .

[ rel:describedby <http://example.org/af-population#row:2> ;
  :country <http://en.wikipedia.org/wiki/Afghanistan> ;
  :year "1960"^^xsd:gYear ;
  :population 9616353 ] .

[ rel:describedby <http://example.org/af-population#row:3> ;
  :country <http://en.wikipedia.org/wiki/Afghanistan> ;
  :year "1961"^^xsd:gYear ;
  :population 9799379 ] .

[ rel:describedby <http://example.org/af-population#row:4> ;
  :country <http://en.wikipedia.org/wiki/Afghanistan> ;
  :year "1962"^^xsd:gYear ;
  :population 9989846 ] .

[ rel:describedby <http://example.org/af-population#row:5> ;
  :country <http://en.wikipedia.org/wiki/Afghanistan> ;
  :year "1963"^^xsd:gYear ;
  :population 10188299 ] .
          

Similarly, the resulting XML will use the property URIs to determine the namespace URIs for the child elements of the <csv:item> elements representing each entity:

<csv:collection xml:base="http://example.org/af-population"
  xmlns:csv="http://example.org/linked-csv"
  xmlns:xsd="http://www.w3.org/2001/XMLSchema"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xmlns="http://example.org/def/statistics#">
  <csv:item>
    <country href="http://en.wikipedia.org/wiki/Afghanistan" />
    <year xsi:type="xsd:gYear">1960</year>
    <population xsi:type="xsd:integer">9616353</population>
  </csv:item>
  <csv:item>
    <country href="http://en.wikipedia.org/wiki/Afghanistan" />
    <year xsi:type="xsd:gYear">1961</year>
    <population xsi:type="xsd:integer">9799379</population>
  </csv:item>
  <csv:item>
    <country href="http://en.wikipedia.org/wiki/Afghanistan" />
    <year xsi:type="xsd:gYear">1962</year>
    <population xsi:type="xsd:integer">9989846</population>
  </csv:item>
  <csv:item>
    <country href="http://en.wikipedia.org/wiki/Afghanistan" />
    <year xsi:type="xsd:gYear">1963</year>
    <population xsi:type="xsd:integer">10188299</population>
  </csv:item>
</csv:collection>
          

Applications may attempt to resolve the URIs in the url prolog lines; if they do so, this should resolve into a linked CSV file that describes the properties. In this example, http://example.org/def/statistics should contain something like:

$id,        label,     description
#country,   country,   "The country for which the population is being provided."
#year,      year,      "The year for which the population is being provided."
#population,population,"The number of people populating the given country in the given year."
          

To make it easier to use common vocabularies, a field within the URL prolog line may contain a CURIE (in the form prefix:name) as a shorthand for a URL. If a field within the URL prolog line starts with a recognised prefix, that prefix is expanded to its namespace and prepended to the remainder of the CURIE (after the colon). The recognised prefixes are:

prefix   namespace                                    description

Generic Vocabularies
rel      http://www.iana.org/assignments/relation/    IANA Link Relations
schema   http://schema.org/                           schema.org

Metadata Vocabularies
dc       http://purl.org/dc/terms/                    Dublin Core Metadata Terms
dct      http://purl.org/dc/terms/                    Dublin Core Metadata Terms
cc       http://creativecommons.org/ns#               Creative Commons Rights Expression Language
void     http://rdfs.org/ns/void#                     VoID
wdrs     http://www.w3.org/2007/05/powder-s#          POWDER-S

Schema Vocabularies
rdf      http://www.w3.org/1999/02/22-rdf-syntax-ns#  RDF
rdfs     http://www.w3.org/2000/01/rdf-schema#        RDF Schema
owl      http://www.w3.org/2002/07/owl#               OWL
skos     http://www.w3.org/2004/02/skos/core#         SKOS
skos-xl  http://www.w3.org/2008/05/skos-xl#           SKOS Extensions for Labels
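
The expansion of a field in a url prolog line could be sketched in Python as follows. The function name property_uri is hypothetical; the prefix table is the one listed above, and fields that do not start with a recognised prefix are resolved relative to the location of the file, as described earlier in this section.

```python
from urllib.parse import urljoin

PREFIXES = {
    "rel": "http://www.iana.org/assignments/relation/",
    "schema": "http://schema.org/",
    "dc": "http://purl.org/dc/terms/",
    "dct": "http://purl.org/dc/terms/",
    "cc": "http://creativecommons.org/ns#",
    "void": "http://rdfs.org/ns/void#",
    "wdrs": "http://www.w3.org/2007/05/powder-s#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "owl": "http://www.w3.org/2002/07/owl#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "skos-xl": "http://www.w3.org/2008/05/skos-xl#",
}


def property_uri(field, file_uri):
    """Expand a url prolog field into an absolute property URI."""
    prefix, colon, name = field.partition(":")
    if colon and prefix in PREFIXES:
        # A CURIE: prepend the namespace to the part after the colon.
        return PREFIXES[prefix] + name
    # Anything else is a URI, possibly relative to the file itself.
    return urljoin(file_uri, field)
```

Since http is not a recognised prefix, absolute URLs pass through unchanged.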

This list is largely based on hunches about which vocabularies are going to be useful in linked CSV documents, coupled with some dogma in pushing schema.org as the vocabulary to rule them all. An alternative would be to define the same prefixes as listed in http://www.w3.org/2011/rdfa-context/rdfa-1.1.

There's no support for declaring your own prefixes or declaring a default prefix/vocabulary.

Linked CSV files that describe the properties used within other linked CSV files SHOULD use the RDFS vocabulary, which contains properties such as rdfs:label and rdfs:comment, to provide details about the properties. For example:

#,   $id,        label,     description
url, ,           rdfs:label,rdfs:comment
,    #country,   country,   "The country for which the population is being provided."
,    #year,      year,      "The year for which the population is being provided."
,    #population,population,"The number of people populating the given country in the given year."
          

Self Description

Linked CSV files should be self-describing. They should include important metadata about the source of the data they contain, their license conditions, and links to other files that contain non-essential supplementary information. Although the file might be described within other files, and metadata might be made available through the HTTP headers, it is safer to embed this metadata within the file as there is no guarantee that metadata stored outside the file will be available as the data is passed around.

To provide metadata about the linked CSV document, the file has to contain a meta prolog line, which provides metadata about the file or records within the file. If there is a $id column, the value within that column indicates what the metadata is about: an empty value (or a missing $id column) indicates the metadata is associated with the file as a whole.

The remainder of each metadata line should hold the following values, in order:

  1. a label for a property of the entity indicated in the $id column
  2. optionally, a type or language annotation for the property, which is interpreted in the same way as the values in a type or lang prolog line
  3. a value, the value of the property for that entity
  4. optionally, a URI that is the global identifier for the property, which is interpreted in the same way as the values in a url prolog line

In our example, the http://example.org/af-population file may be part of a series of files available for different countries, and the metadata provide a pointer to an index document (http://example.org/populations) and to a license for the file:

#,   country,                                 year,population
type,url,                                     time,integer
meta,index,                                   url, /populations
meta,license,                                 url, http://creativecommons.org/publicdomain/mark/1.0/
,    http://en.wikipedia.org/wiki/Afghanistan,1960,9616353
,    http://en.wikipedia.org/wiki/Afghanistan,1961,9799379
,    http://en.wikipedia.org/wiki/Afghanistan,1962,9989846
,    http://en.wikipedia.org/wiki/Afghanistan,1963,10188299
          

In this example, none of the remaining data lines have identifiers themselves. The corresponding JSON would be:

[{
  "@id": "http://example.org/af-population",
  "index": "http://example.org/populations",
  "license": "http://creativecommons.org/publicdomain/mark/1.0/"
}, {
  "country": "http://en.wikipedia.org/wiki/Afghanistan",
  "year": 1960,
  "population": 9616353
}, {
  "country": "http://en.wikipedia.org/wiki/Afghanistan",
  "year": 1961,
  "population": 9799379
}, {
  "country": "http://en.wikipedia.org/wiki/Afghanistan",
  "year": 1962,
  "population": 9989846
}, {
  "country": "http://en.wikipedia.org/wiki/Afghanistan",
  "year": 1963,
  "population": 10188299
}]
          

The corresponding RDF would be:

@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix rel: <http://www.iana.org/assignments/relation/> .
@prefix : <http://example.org/af-population#> .

<>
  rel:describedby
    <http://example.org/af-population#row:1>,
    <http://example.org/af-population#row:2> ;
  :index <populations> ;
  :license <http://creativecommons.org/publicdomain/mark/1.0/> ;
  .

[ rel:describedby <http://example.org/af-population#row:3> ;
  :country <http://en.wikipedia.org/wiki/Afghanistan> ;
  :year "1960"^^xsd:gYear ;
  :population 9616353 ] .

[ rel:describedby <http://example.org/af-population#row:4> ;
  :country <http://en.wikipedia.org/wiki/Afghanistan> ;
  :year "1961"^^xsd:gYear ;
  :population 9799379 ] .

[ rel:describedby <http://example.org/af-population#row:5> ;
  :country <http://en.wikipedia.org/wiki/Afghanistan> ;
  :year "1962"^^xsd:gYear ;
  :population 9989846 ] .

[ rel:describedby <http://example.org/af-population#row:6> ;
  :country <http://en.wikipedia.org/wiki/Afghanistan> ;
  :year "1963"^^xsd:gYear ;
  :population 10188299 ] .
          

Metadata prolog lines can also be used to provide metadata about other parts of the linked CSV file by using URI Identifiers for the text/csv Media Type. These can be used to refer to rows, columns, and sets of rows that have common value(s) for particular fields. For example:

#,   $id,                                     country,english name,                              french name
url, ,                                        ,       #name,                                     #name
lang,,                                        ,       en,                                        fr
meta,#col:english%20name,                     note,   "contains both official and popular names",
,    http://en.wikipedia.org/wiki/Andorra,    AD,     Andorra,                                   Andorre
,    http://en.wikipedia.org/wiki/Andorra,    ,       Principality of Andorra,
,    http://en.wikipedia.org/wiki/Afghanistan,AF,     Afghanistan,                               Afghanistan
,    http://en.wikipedia.org/wiki/Afghanistan,,       Islamic Republic of Afghanistan,
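
The fragment identifiers used in these examples could be built with helpers such as the following Python sketch. The function names row_uri and col_uri are hypothetical, and the #row: and #col: syntax is simply the one shown in this document's examples.

```python
from urllib.parse import quote


def row_uri(file_uri, index):
    """URI of the record at the given zero-based index (header not counted)."""
    return f"{file_uri}#row:{index}"


def col_uri(file_uri, name):
    """URI of the column with the given name, percent-encoding the name."""
    return f"{file_uri}#col:{quote(name, safe='')}"
```

Percent-encoding the column name is what turns the header english name into the fragment #col:english%20name used in the meta line above.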
          

Additional Data Sources

A prolog line in which the value of the # column is see provides pointers to other linked CSV files that describe the resources in appropriate columns.

Within a see line, columns that hold URI values (having url in the corresponding value of the type prolog line) can reference additional linked CSV files that describe the entities identified by the URIs in that column. For example, the population data within http://example.org/af-population references a country described within http://example.org/countries. The population file would include:

#,   country,                                 year,population
type,url,                                     time,integer
see, /countries,                              ,
,    http://en.wikipedia.org/wiki/Afghanistan,1960,9616353
,    http://en.wikipedia.org/wiki/Afghanistan,1961,9799379
,    http://en.wikipedia.org/wiki/Afghanistan,1962,9989846
,    http://en.wikipedia.org/wiki/Afghanistan,1963,10188299
          

This indicates that an application can look within http://example.org/countries to find more information about some or all of the URIs within the country column. The URIs within the $id column in that file should match the URIs within the country column in this file.

If there is no type prolog line, a value in a see prolog line indicates that the column holds URIs (as if the type was set to url). If there is a type prolog line but the type of the column has a value other than url, values in the see prolog lines for that column are ignored.
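
This rule could be sketched in Python as follows. The function name see_target is hypothetical, and the sketch assumes that an empty value in an existing type prolog line counts as a type other than url, which the text above does not state explicitly.

```python
from urllib.parse import urljoin


def see_target(see_value, file_uri, column_type=None):
    """Return the absolute URI of the referenced file, or None if ignored.

    column_type is the column's value in the type prolog line, or None
    when the file has no type prolog line at all.
    """
    if not see_value:
        return None
    if column_type is not None and column_type != "url":
        # The column does not hold URIs, so the see value is ignored.
        return None
    # With no type prolog line, the see value itself implies type url.
    return urljoin(file_uri, see_value)
```

The relative reference /countries in the example above resolves to http://example.org/countries.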

This technique can also be used to point to additional data about the entities described within the linked CSV file itself. For example if another publisher also published a linked CSV file containing information about countries at http://other.example.com/countries (perhaps providing their names in other languages or describing their capital cities), we could reference it from the http://example.org/countries file as follows:

#,   $id,                                     country,english name,                   french name
url, ,                                        ,       #name,                          #name
lang,,                                        ,       en,                             fr
see, http://other.example.com/countries,      ,       ,
,    http://en.wikipedia.org/wiki/Andorra,    AD,     Andorra,                        Andorre
,    http://en.wikipedia.org/wiki/Andorra,    ,       Principality of Andorra,
,    http://en.wikipedia.org/wiki/Afghanistan,AF,     Afghanistan,                    Afghanistan
,    http://en.wikipedia.org/wiki/Afghanistan,,       Islamic Republic of Afghanistan,
          

Packaging

It is useful to be able to package together sets of linked CSV files, which may include multiple interrelated tables of data. A linked CSV package is simply a set of such files within a zip. These files should use relative links when pointing to other files within the package.

The entry point for a linked CSV package is always named index.csv. The index file is interpreted in the same way as any other linked CSV file, but the entities it describes are the files within the package. As well as an $id column, the index file will usually have a description column or similar, for example:

$id,              description
countries.csv,    "list of countries"
populations.csv,  "index of files containing information about the populations in different countries"
ad-population.csv,"populations of Andorra"
af-population.csv,"populations of Afghanistan"
...
      

Adding metadata within the index file is useful if it can help recipients understand the structure of the package as a whole. Sufficient metadata should be listed within the index file to enable the recipient to tell whether each file should be opened, but the majority of the metadata about the file should be included within the file itself.
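As a rough illustration of how a recipient might inspect a package, the following sketch reads index.csv from a zip archive using only the Python standard library. The function name is hypothetical, UTF-8 encoding is assumed, and the index file is assumed to contain no prolog lines, so this is a simplification rather than a full linked CSV parser.

```python
import csv
import io
import zipfile


def read_package_index(package):
    """Return the rows of index.csv as a list of dicts.

    `package` is a path to, or a binary file object for, a zip archive.
    """
    with zipfile.ZipFile(package) as archive:
        with archive.open("index.csv") as raw:
            # newline="" lets the csv module handle CRLF line endings itself.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            # skipinitialspace copes with the padding after commas
            # used in the examples in this document.
            reader = csv.DictReader(text, skipinitialspace=True)
            return [dict(row) for row in reader]
```

Each returned dict has a "$id" key holding the relative path of a file in the package, which could then be opened with the same archive object.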

TODO: Reference schema.org dataset vocabulary here?

An alternative design would be to use the http://www.iana.org/assignments/relation/item property to indicate the relationship between the index file and the items in the package; that relationship could then be used recursively so that there's no need to list all the files in the package within the index file. In that case, the index file would look like:

#,   $id,item
type,,   url
url, ,   http://www.iana.org/assignments/relation/item
,    ,   countries.csv
,    ,   populations.csv
,    ,   ad-population.csv
,    ,   af-population.csv
...
        

The disadvantage with this is that it's more difficult to add metadata about the files themselves.

The index file may list any number of the files: it does not need to list them all, merely to provide entry points such that the others can be located through the see prolog lines in the files or through the URLs in the $id column or other columns labelled with the type url.

Mapping to JSON

Linked CSV does not have to be mapped to JSON, but it can be used to create a JSON document (or, in the case of a package of linked CSV files, a collection of JSON documents) for systems that store information as JSON. Two conversions are provided here. One generates a simple JSON format that loses much of the information that is encoded within the linked CSV file; the other generates a more complex JSON-LD file that preserves that information.

Parsing Linked CSV as Simple JSON

The result of this parsing is an array of objects, one per entity in the linked CSV. An entity is generated for each data line that does not have a $id value, and for each unique $id value. If the entity has an identifier value in the $id column, the JSON object is given a "@id" property whose value is that URI identifier resolved against the base URI of the linked CSV document. Thus each JSON object is associated with a sequence of one or more data lines from the linked CSV file.
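The grouping of data lines into entities might be sketched as follows. The function name and the dict-per-line input shape are assumptions made for illustration; resolving the $id value against the base URI is left out.

```python
def group_entities(data_lines):
    """Group data lines (dicts keyed by column name) into entities.

    Lines sharing a $id value form one entity; each line without a
    $id value forms an entity of its own. File order is preserved.
    """
    entities = []
    by_id = {}
    for line in data_lines:
        ident = (line.get("$id") or "").strip()
        if not ident:
            # No identifier: this line is an entity by itself.
            entities.append({"id": None, "lines": [line]})
        elif ident in by_id:
            # Seen before: merge into the existing entity.
            by_id[ident]["lines"].append(line)
        else:
            entity = {"id": ident, "lines": [line]}
            by_id[ident] = entity
            entities.append(entity)
    return entities
```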

Each column within the linked CSV file is mapped to a property within the JSON file, as follows:

  1. if there is a url prolog line and the url prolog line contains a URI for the column then
    1. if the URI is a fragment of the linked CSV file, then the unescaped fragment identifier of that URI (after the #)
    2. otherwise, the URI resolved against the base URI of the linked CSV file
  2. otherwise, the label of the column from the header
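The property-naming steps above can be sketched with Python's urllib.parse. The function name and parameters are hypothetical, and "a fragment of the linked CSV file" is interpreted here as a URI that, once resolved, differs from the base URI only by its fragment.

```python
from urllib.parse import unquote, urldefrag, urljoin


def json_property_name(header_label, url_value, base_uri):
    """Choose the JSON property name for a column."""
    if url_value:
        resolved = urljoin(base_uri, url_value)
        document, fragment = urldefrag(resolved)
        if fragment and document == urldefrag(base_uri)[0]:
            # Fragment of this file: use the unescaped fragment identifier.
            return unquote(fragment)
        # Any other URI: use it in resolved form.
        return resolved
    # No URI in the url prolog line: fall back to the header label.
    return header_label
```

With the base URI http://example.org/countries, both the english name and french name columns in the earlier example carry #name in the url prolog line and so map to the single property name.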

As a result of this algorithm, multiple columns may be mapped to a single property. Where there are multiple columns mapping to a single property, that property is marked as expecting arrays. If any of the columns comprising the property has a value within the lang prolog line, the property is marked as a language property.

Each sequence of data lines associated with the JSON object is processed as follows. A property is created within the JSON object for each property for which the data lines provide values (properties with no values are left undefined). If the property expects arrays, it will be assigned an array of values even if only one value is provided within the data lines. Each value is then processed as follows:

  1. if the property is a language property, it is mapped into an object with a property value and, if the column from which the value comes has a value in the lang prolog line, a lang property with that language
  2. if the value (as given by the type prolog line or inferred from the syntax of the value, as described in ) is of the type integer, decimal or double, then if it is numeric, it is mapped to a number, otherwise to null
  3. if the value is of the type boolean, if it has the value true or false, it is mapped to a boolean, otherwise to null
  4. if the value is a year, it is mapped to a number
  5. if the value is of another date/time datatype, it is mapped to a string
  6. if the value is typed as a URI, it is resolved as a URI against the base URI of the linked CSV file and the resulting URI is used as the (string) value of the property
  7. otherwise, it is mapped to a string
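A minimal sketch of the value conversion above follows. The function and parameter names are hypothetical, and the type name "year" is an assumption, since the exact spelling of the year datatype is not given in this section. Year values are assumed to have been validated already.

```python
from urllib.parse import urljoin

NUMERIC_TYPES = {"integer", "decimal", "double"}
YEAR_TYPE = "year"  # assumed name for the year datatype


def json_value(raw, datatype=None, base_uri="", lang=None,
               language_property=False):
    """Convert one field value into its simple JSON representation."""
    if language_property:
        # Step 1: wrap in an object, adding lang only when the column has one.
        obj = {"value": raw}
        if lang:
            obj["lang"] = lang
        return obj
    if datatype in NUMERIC_TYPES:
        # Step 2: non-numeric text becomes None (JSON null).
        # Note that float() also accepts "nan" and "inf", which a
        # stricter check would reject.
        try:
            return int(raw) if datatype == "integer" else float(raw)
        except ValueError:
            return None
    if datatype == "boolean":
        # Step 3: only the literal strings true and false are accepted.
        return {"true": True, "false": False}.get(raw)
    if datatype == YEAR_TYPE:
        # Step 4: years become numbers.
        return int(raw)
    if datatype == "url":
        # Step 6: resolve against the base URI and keep as a string.
        return urljoin(base_uri, raw)
    # Steps 5 and 7: other date/time values and everything else stay strings.
    return raw
```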

TODO: handle recursive processing into referenced linked CSV files

Parsing Linked CSV as JSON-LD

TODO

Mapping to XML

Linked CSV does not have to be mapped to XML, but it can be used to create an XML document (or, in the case of a package of linked CSV files, a collection of XML documents) for systems that store information as XML.

Parsing Linked CSV as XML

The namespace for the standard elements is http://example.org/linked-csv which is conventionally associated with the prefix csv. The document element is named <csv:collection>. It is given the following attributes:

A <csv:item> element is generated for each entity in the linked CSV. The entities are uniquely identified by the value of the $id column; data lines with the same $id are merged into a single <csv:item> element, though a separate <csv:item> element is generated for each data line with no $id value. The value of the $id column becomes the value of the @href attribute on the <csv:item> element.

Within the <csv:item> element, a child element is generated for each unique value of each property (values from different columns, which may have different vocabularies, datatypes or languages create separate elements). Note that the $id column, if it exists, is not processed in this way. The name of the child element is determined as follows:

  1. if there is a url prolog line and the url prolog line contains a URI for the column then
    1. if the URI is a fragment of the linked CSV file, then the child element is in no namespace and the local name is based on the unescaped fragment identifier of that URI (after the #)
    2. otherwise, the URI is resolved against the base URI of the linked CSV file; the child element's namespace is the part of the URI up to and including the final # if the URI contains a #, or the final / if it does not; the local name is based on the substring of the URI after the # or /
  2. otherwise, the child element is in no namespace and the local name is based on the label of the column from the header line
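The namespace and local-name split described above might look like the following. The function name is hypothetical, names are returned without the normalisation into XML names that is still to be defined, and None stands for "no namespace".

```python
from urllib.parse import unquote, urldefrag, urljoin


def xml_element_name(header_label, url_value, base_uri):
    """Return a (namespace, local_name) pair for a column's child element."""
    if url_value:
        resolved = urljoin(base_uri, url_value)
        document, fragment = urldefrag(resolved)
        if fragment and document == urldefrag(base_uri)[0]:
            # Fragment of this file: no namespace, unescaped fragment as name.
            return (None, unquote(fragment))
        # Split after the final '#' if there is one, else after the final '/'.
        separator = "#" if "#" in resolved else "/"
        cut = resolved.rfind(separator) + 1
        return (resolved[:cut], resolved[cut:])
    # No URI for the column: no namespace, header label as name.
    return (None, header_label)
```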

TODO: normalisation of property names into XML names

The attributes and content of the child element are determined as follows:

  1. if the column has the type url, the element is given an href attribute whose value is the URI in the relevant field
  2. otherwise, the element's content is set to the value of the field; additionally
    1. if the value has a datatype associated with it, add an xsi:type attribute whose value is xsd:datatype
    2. if the column is associated with a language through the lang prolog line, add an xml:lang attribute whose value is the language in that prolog line
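As an illustration, the rules for attributes and content could be applied to an xml.etree.ElementTree element as below. The function name and parameters are hypothetical; the xsd prefix in the xsi:type value is assumed to be declared elsewhere in the output document.

```python
import xml.etree.ElementTree as ET

XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def fill_child(element, value, column_type=None, datatype=None, lang=None):
    """Set the attributes and content of a property's child element."""
    if column_type == "url":
        # URI columns carry their value in an href attribute, with no content.
        element.set("href", value)
        return element
    element.text = value
    if datatype:
        # ElementTree uses {namespace}local notation for qualified names.
        element.set("{%s}type" % XSI_NS, "xsd:" + datatype)
    if lang:
        element.set("{%s}lang" % XML_NS, lang)
    return element
```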

TODO: handle recursive processing into referenced linked CSV files

Mapping to RDF

Linked CSV does not have to be mapped to RDF, but it can be used to create a graph (or, in the case of a package of linked CSV files, a set of graphs) for systems that store information as RDF.

Parsing Linked CSV as RDF

Each data line describes a resource, which has properties whose URIs are generated based on the names of the columns given in the header and the URIs given in the url prolog line, and values based on the values given within the data lines.

If the data line has a $id value, this gives the URI for the resource (resolved against the base URI of the linked CSV file). If it does not have a $id value, it is a blank node. Either way, a triple must be generated of the form:

resource <http://www.iana.org/assignments/relation/describedby> CSV-line .
        

where the CSV-line is a reference to the row that describes the resource, using a fragment identifier of the form #row:N. Note that there may be many such describedby statements for a single resource if its description is split over several lines.

If there is a url prolog line in the linked CSV file, and it contains a value in a given column, this is used as the URI for the property. Otherwise, the property URI is constructed from the fragment identifier #escaped-header with the base URI of the linked CSV file, where escaped-header is the URL-escaped version of the header for the column.
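The construction of the property URI can be sketched as follows. The function name is hypothetical, and resolving a relative value from the url prolog line (such as #name) against the base URI is an assumption that mirrors the JSON and XML mappings.

```python
from urllib.parse import quote, urljoin


def rdf_property_uri(header_label, url_value, base_uri):
    """Return the property URI for a column."""
    if url_value:
        # Assumption: relative values are resolved against the base URI.
        return urljoin(base_uri, url_value)
    # URL-escape the header (safe="" escapes every reserved character)
    # and use it as a fragment on the base URI.
    return urljoin(base_uri, "#" + quote(header_label, safe=""))
```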

For each data line, an RDF statement is generated for each column aside from the # and $id columns. The URI of the property is determined as above. The value of the property is interpreted as one of:

  1. if the column holds URIs, a URI reference to another resource
  2. otherwise, a literal value:
    1. if the value has a datatype, a literal value whose datatype URI is formed by appending the datatype to the URI http://www.w3.org/2001/XMLSchema#
    2. otherwise, if the column is associated with a language through the lang prolog line, a literal value with the language indicated
    3. otherwise a literal value with the datatype http://www.w3.org/2001/XMLSchema#string
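To make the choice of object term concrete, this sketch renders each case in N-Triples syntax. The function names and parameters are hypothetical, and resolving URI values against the base URI is an assumption; the text only states that such values are URI references.

```python
from urllib.parse import urljoin

XSD = "http://www.w3.org/2001/XMLSchema#"


def escape_literal(text):
    """Escape backslashes, quotes and line breaks for an N-Triples literal."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def rdf_object(value, base_uri, holds_uris=False, datatype=None, lang=None):
    """Return the N-Triples form of the object for one field value."""
    if holds_uris:
        # Assumption: URI values are resolved against the base URI.
        return "<%s>" % urljoin(base_uri, value)
    literal = '"%s"' % escape_literal(value)
    if datatype:
        # A datatype takes precedence over any language on the column.
        return "%s^^<%s%s>" % (literal, XSD, datatype)
    if lang:
        return "%s@%s" % (literal, lang)
    return "%s^^<%sstring>" % (literal, XSD)
```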

Multiple equivalent triples may be generated through this process if the resource is described by more than one row; these will be merged naturally as part of RDF semantics.

TODO: handle recursive processing into referenced linked CSV files

Publishing RDF as Linked CSV

TODO: This wouldn't be too hard to do, though lossy.

Acknowledgements

This work is inspired by Google's Dataset Publishing Language and OKFN's Simple Data Format, along with some suggestions from Francis Irving and review by John Sheridan, Leigh Dodds and Tim Berners-Lee.