<!DOCTYPE html>
<html>
<head>
<title>Linked CSV</title>
<meta http-equiv='Content-Type' content='text/html;charset=utf-8'/>
<script class='remove'>
var respecConfig = {
specStatus: "unofficial",
shortName: "linked-csv",
editors: [
{ name: "Jeni Tennison",
url: "http://www.jenitennison.com/blog/",
company: "Open Data Institute",
companyURL: "http://theodi.org/" }
],
// previousMaturity: "FPWD",
// previousPublishDate: "2010-03-15",
// wg: "Open Data Institute",
// wgURI: "http://theodi.org",
// wgPublicList: "public-animals",
// wgPatentURI: "http://www.w3.org/2004/01/pp-impl/424242/status",
};
</script>
<script src='http://www.w3.org/Tools/respec/respec-w3c-common' class='remove'></script>
</head>
<body>
<section id='abstract'>
<p>
Many open data sets are essentially tables, or sets of tables, which follow the same regular structure. This document describes a set of conventions for CSV files that enable them to be linked together and to be interpreted as RDF.
</p>
</section>
<section>
<h2>Introduction</h2>
<p>
The requirements on which this format is based are:
</p>
<ul>
<li>the format must be valid CSV according to [[!RFC4180]]</li>
<li>every valid CSV file must be a valid linked CSV file</li>
<li>URIs should be used as identifiers</li>
<li>the format must be at least as expressive as JSON</li>
<li>it must be possible to zip up packages of linked CSV files for bulk downloads</li>
<li>it must be possible to publish individual linked CSV files, not just packages</li>
<li>files must carry their own metadata so that they are self-contained</li>
<li>it must be simple to filter out metadata by ignoring or hiding particular columns and/or rows</li>
<li>aside from decompressing packages, applications should not need to parse more than one format (eg XML or JSON in addition to CSV)</li>
<li>it should be possible to associate provenance information with pieces of data represented in the format</li>
<li>it must be possible to break up large tables into smaller subsets</li>
<li>it must be possible for datasets to be built from linked CSV published by different websites</li>
<li>the format must be mappable to JSON</li>
<li>the format must be mappable to XML [[XML11]]</li>
<li>the format must be mappable to RDF [[RDF-PRIMER]]</li>
</ul>
</section>
<section>
<h2>Structure</h2>
<p>
The structure of a CSV file is a header followed by a number of records. The <dfn>header</dfn> is the first line of the file, while the remaining lines are the <dfn title="record">records</dfn>. Both the header and the records contain <dfn>fields</dfn> separated by commas. These terms are used as defined in [[RFC4180]]. Within this document, a <dfn>column</dfn> is a set of fields which are at the same index within their respective rows and the <dfn>column name</dfn> is the value of the field in the header for that column. For example, the following is a valid CSV file which lists country codes and names:
</p>
<pre class="example highlight">
country,name
AD,Andorra
AF,Afghanistan
AI,Anguilla
AL,Albania
</pre>
<p>
All valid CSV files are valid linked CSV files, so the above example is also a valid linked CSV file. It has four records and two columns, whose names are <code>country</code> and <code>name</code>.
</p>
<p>
Valid CSV files MUST use <code>CRLF</code> to indicate the ends of lines (and thus the separation of rows). Linked CSV parsers SHOULD provide a warning if <code>CR</code> or <code>LF</code> is used for line endings, and SHOULD recover by parsing the CSV file with those line endings.
</p>
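<p>
The recovery behaviour described above can be sketched as follows. This is a minimal illustration, not part of the format: the function name <code>read_records</code> is hypothetical, and the check for line endings is deliberately naive, so a bare <code>LF</code> inside a quoted field would also trigger the warning.
</p>

```python
import csv
import io
import re
import warnings

# A CR not followed by LF, or an LF not preceded by CR.
NON_CRLF = re.compile(r"\r(?!\n)|(?<!\r)\n")


def read_records(text):
    """Parse CSV text into a list of rows (the header is the first row).

    Warns when a line ending other than CRLF is found, then recovers by
    parsing the text anyway. Hypothetical helper for illustration only.
    """
    if NON_CRLF.search(text):
        warnings.warn("line endings other than CRLF found; recovering")
    # newline="" stops Python translating line endings; the csv module
    # accepts CR, LF and CRLF as record separators when reading.
    return list(csv.reader(io.StringIO(text, newline="")))
```

<p>
Using a warning rather than an error keeps files exported from spreadsheet programs usable while still flagging that they are not strictly valid CSV.
</p>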
<p class="note">
Spreadsheet programs such as Excel or OpenOffice Calc typically use the line ending used by the platform on which they are deployed (eg simply <code>LF</code> on Mac OS X). Allowing other line endings for linked CSV is intended to make it easier to create such documents within spreadsheet programs.
</p>
<p>
The aim of processing a linked CSV file is to generate information about a set of entities. An <dfn>entity</dfn> may be represented internally by the application as an object or a resource. Each entity has a number of <dfn title="property">properties</dfn>, which may have one or more <dfn title="value">values</dfn>.
</p>
<p>
Records within a linked CSV file may be of three different types: <a title="prolog line">prolog lines</a> (see <a href="#prolog-lines" class="sectionRef"></a>), <a title="data line">data lines</a> and <a title="epilog line">epilog lines</a> (see <a href="#epilog-lines" class="sectionRef"></a>). Data lines can only come after the last prolog line, if there is one, and before the first epilog line, if there is one. A <dfn>data line</dfn> is a line that contains data about an entity. A single entity may be described across multiple data lines. For each data line describing an entity, each value within the line corresponds to a value of a property of that entity (the property being labelled through the corresponding header).
</p>
<p>
The JSON version of the country code file shown above, as defined in <a href="#json-mapping" class="sectionRef"></a>, is:
</p>
<pre class="json example highlight">
[{
"country": "AD",
"name": "Andorra"
},{
"country": "AF",
"name": "Afghanistan"
},{
"country": "AI",
"name": "Anguilla"
},{
"country": "AL",
"name": "Albania"
}]
</pre>
<p>
Linked CSV files must be encoded as UTF-8.
</p>
<p class="issue">
It isn't usually easy to set the encoding of a CSV file when exporting from normal spreadsheet programs. It would be nice if there were a way of detecting the encoding. Perhaps it could be sniffed based on the initial characters <code>#,</code> in the file (with UTF-8 assumed if those aren't the initial characters)?
</p>
<section>
<h3>Identifiers</h3>
<p>
Linked CSV is built around the concept of using URIs to name things. Every record and column, and even slices of data, within a linked CSV file is addressable using <a href="http://tools.ietf.org/html/draft-hausenblas-csv-fragment-00">URI Identifiers for the text/csv Media Type</a>. For example, if the linked CSV file is accessed at <code>http://example.org/countries</code>, the first <a>record</a> in the CSV file above, which is also the first <a>data line</a> within the linked CSV file and describes Andorra, is addressable with the URI:
</p>
<pre>http://example.org/countries#row:0</pre>
<p>
However, this addressing merely identifies the records within the linked CSV file, not the <a title="entity">entities</a> that the record describes. This distinction is important for two reasons:
</p>
<ul>
<li>a single entity may be described by multiple records within the linked CSV file</li>
<li>addressing entities and records separately enables us to make statements about the source of the information within a particular record</li>
</ul>
<p>
By default, each data line describes an <a>entity</a>, each entity is described by a single data line, and there is no way to address the entities. However, adding a <code>$id</code> column enables entities to be given identifiers. These identifiers are always URIs, and they are interpreted relative to the location of the linked CSV file. The <code>$id</code> column may be positioned anywhere but by convention it should be the first column (unless there is a <code>#</code> column, in which case it should be the second). For example:
</p>
<pre class="example highlight">
$id,country,name
#AD,AD, Andorra
#AD,AD, Principality of Andorra
#AF,AF, Afghanistan
#AF,AF, Islamic Republic of Afghanistan
</pre>
<p class="note">
For the purpose of clarity within this document, whitespace has been added to this and the remainder of the examples so that headers and values line up correctly. Whitespace within linked CSV files is normally significant.
</p>
<p class="note">
The prefix <code>$</code> is used because the prefix <code>@</code> is interpreted as indicating a formula when entered into spreadsheet programs such as Excel.
</p>
<p>
This linked CSV file contains two entities, which have the identifiers <code>http://example.org/countries#AD</code> and <code>http://example.org/countries#AF</code>. The first is described by the first two data lines and the second by the next two. The JSON generated for this file would be:
</p>
<pre class="json example highlight">
[{
"@id": "http://example.org/countries#AD",
"country": "AD",
"name": [ "Andorra", "Principality of Andorra" ]
},{
"@id": "http://example.org/countries#AF",
"country": "AF",
"name": [ "Afghanistan", "Islamic Republic of Afghanistan" ]
}]
</pre>
<p>
and the RDF would be:
</p>
<pre class="turtle example highlight">
@prefix rel: <http://www.iana.org/assignments/relation/> .
@prefix : <http://example.org/countries#> .
<http://example.org/countries#AD>
rel:describedby
<http://example.org/countries#row:0> ,
<http://example.org/countries#row:1> ;
:country "AD" ;
:name "Andorra" , "Principality of Andorra" ;
.
<http://example.org/countries#AF>
rel:describedby
<http://example.org/countries#row:2> ,
<http://example.org/countries#row:3> ;
:country "AF" ;
:name "Afghanistan" , "Islamic Republic of Afghanistan" ;
.
</pre>
<p>
As shown by this example, when multiple data lines describe a single entity, a given property takes only the distinct values within the column for that entity rather than being duplicated. However, the file can be made shorter if it doesn't contain duplicates in the first place; the following CSV is equivalent:
</p>
<pre class="example highlight">
$id,country,name
#AD,AD, Andorra
#AD,, Principality of Andorra
#AF,AF, Afghanistan
#AF,, Islamic Republic of Afghanistan
</pre>
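<p>
The grouping of data lines into entities described above can be sketched as follows. This is an illustration under stated assumptions rather than a normative algorithm: the function name <code>entities_from_csv</code> is hypothetical, the input has no padding whitespace and no prolog lines, and an empty field is treated as providing no value.
</p>

```python
import csv
import io
from urllib.parse import urljoin


def entities_from_csv(text, base_url):
    """Group data lines by their $id and collect distinct property values.

    Hypothetical helper: identifiers are resolved relative to base_url,
    the location of the linked CSV file.
    """
    entities = {}
    for row in csv.DictReader(io.StringIO(text, newline="")):
        entity_id = urljoin(base_url, row["$id"])
        entity = entities.setdefault(entity_id, {"@id": entity_id})
        for name, value in row.items():
            if name == "$id" or not value:
                continue  # skip the identifier column and empty fields
            values = entity.setdefault(name, [])
            if value not in values:  # keep only distinct values
                values.append(value)
    # Unwrap single values so they appear as plain strings, not lists.
    return [
        {k: (v[0] if isinstance(v, list) and len(v) == 1 else v)
         for k, v in entity.items()}
        for entity in entities.values()
    ]
```

<p>
With <code>http://example.org/countries</code> as the base URL, both the duplicated and the shortened versions of the file above yield the same two entities, because repeated values within a column are only recorded once per entity.
</p>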
<section>
<h4>Interpreting Identifiers</h4>
<p>
By default, properties within the linked CSV file are assumed to apply to the thing described by the resource located by the URI identifier. For example, if the file contained identifier URIs that were Wikipedia pages, as in
</p>
<pre class="example highlight">
$id, country,name
http://en.wikipedia.org/wiki/Andorra, AD, Andorra
http://en.wikipedia.org/wiki/Andorra, AD, Principality of Andorra
http://en.wikipedia.org/wiki/Afghanistan,AF, Afghanistan
http://en.wikipedia.org/wiki/Afghanistan,AF, Islamic Republic of Afghanistan
</pre>
<p>
applications should interpret the properties labelled <code>country</code> and <code>name</code> to apply to the countries described by those Wikipedia pages, not the Wikipedia pages themselves. In general this distinction does not matter, but it may do when using linked CSV to describe resources that <em>are</em> available on the web. Individual properties may be used differently, and apply to the content found at the referenced URI; how they are interpreted should be incorporated into the property documentation.
</p>
</section>
</section>
<section id="prolog-lines">
<h3>Prolog Lines</h3>
<p>
A linked CSV file can contain any number of prolog lines. <dfn title="prolog line">Prolog lines</dfn> describe additional processing of the linked CSV file, usually related to the file or some portion of the file, or related to some or all of the columns. Prolog lines can only be present if there is a column named <code>#</code>; any record that has a value in that column is a prolog line, and the value for that column indicates how the line should be interpreted:
</p>
<dl>
<dt><code>type</code></dt>
<dd>This value indicates that the line provides information about the type of the values in each column</dd>
<dt><code>lang</code></dt>
<dd>This value indicates that the line provides information about the language of the values in each column</dd>
<dt><code>meta</code></dt>
<dd>This value indicates that the line provides metadata about the linked CSV file or rows within it</dd>
<dt><code>url</code></dt>
<dd>This value indicates that the line provides global URIs for the properties in each column</dd>
<dt><code>see</code></dt>
<dd>This value indicates that the line provides details of additional resources that may provide information about some or all of the entities whose identifiers are given within the column</dd>
<dt><em>empty</em></dt>
<dd>Having no value in the <code>#</code> column indicates that the line is a <a>data line</a> rather than a <a>prolog line</a></dd>
</dl>
<p>
Prolog lines must all be at the start of a linked CSV file. Any prolog lines that appear after the first <a>data line</a> must be ignored by processors. Prolog lines of different types can appear in any order.
</p>
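<p>
A processor might separate prolog lines from data lines along the following lines. This is a sketch only: the function name <code>split_prolog</code> is hypothetical, the input is assumed to have no padding whitespace, and epilog lines are not handled.
</p>

```python
import csv
import io


def split_prolog(text):
    """Return (header, prolog_lines, data_lines) for linked CSV text.

    Hypothetical helper: a record is a prolog line when it has a value in
    the "#" column; prolog lines after the first data line are ignored.
    """
    reader = csv.reader(io.StringIO(text, newline=""))
    header = next(reader)
    marker = header.index("#") if "#" in header else None
    prolog, data = [], []
    for record in reader:
        kind = ""
        if marker is not None and marker < len(record):
            kind = record[marker]
        if kind == "":
            data.append(record)            # no value in "#": a data line
        elif not data:
            prolog.append((kind, record))  # e.g. ("type", [...])
        # otherwise: a prolog line after the first data line, ignored
    return header, prolog, data
```

<p>
Because late prolog lines are simply dropped, the prolog can be consumed in a single pass before any data line is processed, which suits streaming.
</p>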
<p class="note">
Ignoring prolog lines that appear after the first data line aids streaming processing of linked CSV files, the hiding of prolog information within spreadsheet applications, and ease of reading for humans.
</p>
<p class="issue">
Could add other kinds of prolog lines. The thing to do is probably to have a separate registry of prolog line types that provide for configuration of the processing that should be done on the values in particular columns. For example, you could have prolog lines that enable you to specify a separator used within the values, to enable the creation of list values, or a date-syntax line that enables you to specify the date syntax used in the values in that particular column.
</p>
<section id="property-types">
<h4>Property Types</h4>
<p>
In the simple CSV example we have been looking at, all the values are strings, which works fine for country codes and names. We will now introduce a separate file, <code>http://example.org/af-population</code>, which initially looks like:
</p>
<pre class="example highlight">
country,year,population
AF, 1960,9616353
AF, 1961,9799379
AF, 1962,9989846
AF, 1963,10188299
</pre>
<p>
In this example, the property <code>year</code> holds years and the property <code>population</code> holds an integer. To indicate the types of these properties, we can add a <code>type</code> prolog line. The value of a <code>type</code> prolog line indicates the type of the values in the column that it is in. The type must be one of:
</p>
<ul>
<li><code>string</code></li>
<li><code>url</code></li>
<li><code>integer</code></li>
<li><code>decimal</code></li>
<li><code>double</code></li>
<li><code>boolean</code> (<code>true</code> or <code>false</code>)</li>
<li><code>time</code> — values of this type can be any of the date/time syntaxes supported by XML Schema, namely <code>gYear</code>, <code>gMonth</code>, <code>gDay</code>, <code>gYearMonth</code>, <code>gMonthDay</code>, <code>date</code>, <code>time</code>, <code>dateTime</code></li>
</ul>
<p>
If no type is given for the column in a <code>type</code> prolog line, the default type for a particular value depends on the syntax of the value, as follows:
</p>
<ul>
<li>values matching XML Schema date/time syntax (aside from <code>xs:gYear</code>) are assumed to be date/time values</li>
<li>values matching <code>[0-9]+</code> are assumed to be integers</li>
<li>values matching <code>[0-9]+\.[0-9]+</code> are assumed to be decimal numbers</li>
<li>values matching <code>[0-9]+(\.[0-9]+)?[eE][-+][0-9]+(\.[0-9]+)?</code> are assumed to be floating point numbers</li>
<li>the value <code>true</code> is assumed to be the boolean value true, and the value <code>false</code> the boolean value false</li>
<li>otherwise, the value is assumed to be a string</li>
</ul>
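<p>
The default typing rules listed above can be sketched as follows. The function name <code>infer_type</code> is hypothetical, and the date/time patterns are a simplified assumption that covers only a few XML Schema forms (<code>dateTime</code>, <code>date</code>, <code>gYearMonth</code> and <code>time</code>, without time zones); the numeric patterns are those given in the list.
</p>

```python
import re

# Simplified stand-ins for XML Schema date/time syntax (assumption:
# no time zones, and the gMonth, gDay and gMonthDay forms are omitted).
TIME_PATTERNS = [
    re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}(\.[0-9]+)?"),
    re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}"),        # date
    re.compile(r"[0-9]{4}-[0-9]{2}"),                 # gYearMonth
    re.compile(r"[0-9]{2}:[0-9]{2}:[0-9]{2}(\.[0-9]+)?"),  # time
]
INTEGER = re.compile(r"[0-9]+")
DECIMAL = re.compile(r"[0-9]+\.[0-9]+")
DOUBLE = re.compile(r"[0-9]+(\.[0-9]+)?[eE][-+][0-9]+(\.[0-9]+)?")


def infer_type(value):
    """Return the default type name for a value with no type annotation."""
    if any(p.fullmatch(value) for p in TIME_PATTERNS):
        return "time"
    if INTEGER.fullmatch(value):
        return "integer"  # a bare year such as 1960 lands here, not in "time"
    if DECIMAL.fullmatch(value):
        return "decimal"
    if DOUBLE.fullmatch(value):
        return "double"
    if value in ("true", "false"):
        return "boolean"
    return "string"
```

<p>
Since a bare four-digit year matches the integer pattern, a column of years needs an explicit <code>time</code> annotation to be treated as dates.
</p>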
<p class="issue">
Could enable quoting of values using <code>"""..."""</code> delimited values within the CSV?
</p>
<p>
In the example above, we can add a <code>type</code> prolog line to indicate the types of the properties that are created. We can also change the <code>country</code> column to use the Wikipedia URIs that we previously used for the countries, and indicate that this is being done by giving its type as <code>url</code>. Since the population figures are all syntactically integers, there is no need to annotate that column with a type, but such an annotation can be added for clarity:
</p>
<pre class="example highlight">
<strong>#, </strong>country, year,population
<strong>type,url, time,integer</strong>
, http://en.wikipedia.org/wiki/Afghanistan,1960,9616353
, http://en.wikipedia.org/wiki/Afghanistan,1961,9799379
, http://en.wikipedia.org/wiki/Afghanistan,1962,9989846
, http://en.wikipedia.org/wiki/Afghanistan,1963,10188299
</pre>
<p>
Conversion to JSON cannot preserve all this information as it does not support date/time datatypes. The resulting data would include the years as integers:
</p>
<pre class="json example highlight">
[{
"country": "http://en.wikipedia.org/wiki/Afghanistan",
"year": 1960,
"population": 9616353
}, {
"country": "http://en.wikipedia.org/wiki/Afghanistan",
"year": 1961,
"population": 9799379
}, {
"country": "http://en.wikipedia.org/wiki/Afghanistan",
"year": 1962,
"population": 9989846
}, {
"country": "http://en.wikipedia.org/wiki/Afghanistan",
"year": 1963,
"population": 10188299
}]
</pre>
<p>
The mapping to RDF can preserve the datatype information:
</p>
<pre class="example highlight">
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix rel: <http://www.iana.org/assignments/relation/> .
@prefix : <http://example.org/af-population#> .
[ rel:describedby <http://example.org/af-population#row:1> ;
:country <http://en.wikipedia.org/wiki/Afghanistan> ;
:year "1960"^^xsd:gYear ;
:population 9616353 ] .
[ rel:describedby <http://example.org/af-population#row:2> ;
:country <http://en.wikipedia.org/wiki/Afghanistan> ;
:year "1961"^^xsd:gYear ;
:population 9799379 ] .
[ rel:describedby <http://example.org/af-population#row:3> ;
:country <http://en.wikipedia.org/wiki/Afghanistan> ;
:year "1962"^^xsd:gYear ;
:population 9989846 ] .
[ rel:describedby <http://example.org/af-population#row:4> ;
:country <http://en.wikipedia.org/wiki/Afghanistan> ;
:year "1963"^^xsd:gYear ;
:population 10188299 ] .
</pre>
<p class="note">
In generating the Turtle, the syntax of the values in the <code>year</code> column is used to determine what kind of date/time value each value should be mapped on to. Without the <code>time</code> annotation, the values would be mapped to integers.
</p>
</section>
<section>
<h4>Languages</h4>
<p>
A <code>lang</code> prolog line indicates the language used within each column. For example, the file that contains the country details can also be expanded to include the names of the countries in other languages:
</p>
<pre class="example highlight">
#, $id, country,english name, french name
<strong>lang,, , en, fr</strong>
, http://en.wikipedia.org/wiki/Andorra, AD, Andorra, Andorre
, http://en.wikipedia.org/wiki/Andorra, , Principality of Andorra,
, http://en.wikipedia.org/wiki/Afghanistan,AF, Afghanistan, Afghanistan
, http://en.wikipedia.org/wiki/Afghanistan,, Islamic Republic of Afghanistan,
</pre>
<p>
In this case, the values of the <code>english name</code> column are labelled as being in English while those in the <code>french name</code> column are labelled as being in French. The JSON would look like:
</p>
<pre class="json example highlight">
[{
"@id": "http://en.wikipedia.org/wiki/Andorra",
"country": "AD",
"english name": {
"en": [ "Andorra", "Principality of Andorra" ]
},
"french name": {
"fr": "Andorre"
}
},{
"@id": "http://en.wikipedia.org/wiki/Afghanistan",
"country": "AF",
"english name": {
"en": [ "Afghanistan", "Islamic Republic of Afghanistan" ]
},
"french name": {
"fr": "Afghanistan"
}
}]
</pre>
<p>
The Turtle would look like:
</p>
<pre class="turtle example highlight">
@prefix rel: <http://www.iana.org/assignments/relation/> .
@prefix : <http://example.org/countries#> .
<http://en.wikipedia.org/wiki/Andorra>
rel:describedby
<http://example.org/countries#row:1>,
<http://example.org/countries#row:2> ;
:country "AD" ;
:english.name "Andorra"@en, "Principality of Andorra"@en ;
:french.name "Andorre"@fr ;
.
<http://en.wikipedia.org/wiki/Afghanistan>
rel:describedby
<http://example.org/countries#row:3>,
<http://example.org/countries#row:4> ;
:country "AF" ;
:english.name "Afghanistan"@en , "Islamic Republic of Afghanistan"@en ;
:french.name "Afghanistan"@fr ;
.
</pre>
</section>
<section>
<h4>Global Property Identifiers</h4>
<p>
When there are separate columns providing values in different languages for the same property, or when a large dataset is split across multiple files, as in the example here where the set of population figures is split across multiple country-specific files such as <code>http://example.org/af-population</code>, it is useful to be able to indicate when the separate labels in the CSV headers refer to the same property of a given entity.
</p>
<p>
To facilitate this, <code>url</code> prolog lines can indicate global identifiers for the properties. These lines contain URIs which are resolved relative to the location of the file itself. In the previous example, the two headers <code>english name</code> and <code>french name</code> both refer to the same <code>name</code> property. We can use a <code>url</code> line to indicate that these both refer to the same property:
</p>
<pre class="example highlight">
#, $id, country,english name, french name
<strong>url, , , #name, #name</strong>
lang,, , en, fr
, http://en.wikipedia.org/wiki/Andorra, AD, Andorra, Andorre
, http://en.wikipedia.org/wiki/Andorra, , Principality of Andorra,
, http://en.wikipedia.org/wiki/Afghanistan,AF, Afghanistan, Afghanistan
, http://en.wikipedia.org/wiki/Afghanistan,, Islamic Republic of Afghanistan,
</pre>
<p>
When this is converted to JSON, the URI for the property is processed to give just the property <code>name</code>:
</p>
<pre class="json example highlight">
[{
"@id": "http://en.wikipedia.org/wiki/Andorra",
"country": "AD",
"name": {
"en": [ "Andorra", "Principality of Andorra" ],
"fr": "Andorre"
}
},{
"@id": "http://en.wikipedia.org/wiki/Afghanistan",
"country": "AF",
"name": {
"en": [ "Afghanistan", "Islamic Republic of Afghanistan" ],
"fr": "Afghanistan"
}
}]
</pre>
<p>
In the conversion to RDF, the RDF includes the labels for the properties:
</p>
<pre class="turtle example highlight">
@prefix rel: <http://www.iana.org/assignments/relation/> .
@prefix rdfs: <...> .
@prefix : <http://example.org/countries#> .
<http://en.wikipedia.org/wiki/Andorra>
rel:describedby
<http://example.org/countries#row:2>,
<http://example.org/countries#row:3> ;
:country "AD" ;
:name "Andorra"@en, "Andorre"@fr, "Principality of Andorra"@en ;
.
<http://en.wikipedia.org/wiki/Afghanistan>
rel:describedby
<http://example.org/countries#row:4>,
<http://example.org/countries#row:5> ;
:country "AF" ;
:name "Afghanistan"@en , "Afghanistan"@fr, "Islamic Republic of Afghanistan"@en ;
.
:name
rdfs:label "english name" , "french name" ;
.
</pre>
<p>
When properties are shared across multiple files, the URIs in the <code>url</code> prolog line should resolve to the same URL. For example, if we wanted to indicate that the <code>country</code> property within the <code>af-population</code> file means the same as the <code>country</code> property within the <code>ad-population</code> file, we could associate them both with the same URI by adding the same <code>url</code> prolog line in both files:
</p>
<pre class="example highlight">
#, country, year, population
type,url, time, integer
url, /def/statistics#country, /def/statistics#year,/def/statistics#population
, http://en.wikipedia.org/wiki/Afghanistan, 1960, 9616353
, http://en.wikipedia.org/wiki/Afghanistan, 1961, 9799379
, http://en.wikipedia.org/wiki/Afghanistan, 1962, 9989846
, http://en.wikipedia.org/wiki/Afghanistan, 1963, 10188299
</pre>
<p>
The resulting RDF would use these URLs for the <code>country</code>, <code>year</code> and <code>population</code> properties:
</p>
<pre class="example highlight">
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix rel: <http://www.iana.org/assignments/relation/> .
<strong>@prefix : <http://example.org/def/statistics#> .</strong>
[ rel:describedby <http://example.org/af-population#row:2> ;
:country <http://en.wikipedia.org/wiki/Afghanistan> ;
:year "1960"^^xsd:gYear ;
:population 9616353 ] .
[ rel:describedby <http://example.org/af-population#row:3> ;
:country <http://en.wikipedia.org/wiki/Afghanistan> ;
:year "1961"^^xsd:gYear ;
:population 9799379 ] .
[ rel:describedby <http://example.org/af-population#row:4> ;
:country <http://en.wikipedia.org/wiki/Afghanistan> ;
:year "1962"^^xsd:gYear ;
:population 9989846 ] .
[ rel:describedby <http://example.org/af-population#row:5> ;
:country <http://en.wikipedia.org/wiki/Afghanistan> ;
:year "1963"^^xsd:gYear ;
:population 10188299 ] .
</pre>
<p>
Similarly, the resulting XML will use the property URIs to determine the namespace URIs for the child elements of the <code><csv:item></code> elements representing each entity:
</p>
<pre class="xml example highlight">
<csv:collection xml:base="http://example.org/af-population"
xmlns:csv="http://example.org/linked-csv"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://example.org/def/statistics#">
<csv:item>
<country href="http://en.wikipedia.org/wiki/Afghanistan" />
<year xsi:type="xsd:gYear">1960</year>
<population xsi:type="xsd:integer">9616353</population>
</csv:item>
<csv:item>
<country href="http://en.wikipedia.org/wiki/Afghanistan" />
<year xsi:type="xsd:gYear">1961</year>
<population xsi:type="xsd:integer">9799379</population>
</csv:item>
<csv:item>
<country href="http://en.wikipedia.org/wiki/Afghanistan" />
<year xsi:type="xsd:gYear">1962</year>
<population xsi:type="xsd:integer">9989846</population>
</csv:item>
<csv:item>
<country href="http://en.wikipedia.org/wiki/Afghanistan" />
<year xsi:type="xsd:gYear">1963</year>
<population xsi:type="xsd:integer">10188299</population>
</csv:item>
</csv:collection>
</pre>
<p>
Applications may attempt to resolve the URIs in the <code>url</code> prolog lines; if they do so, each URI should resolve to a linked CSV file that describes the properties. In this example, <code>http://example.org/def/statistics</code> should contain something like:
</p>
<pre class="example highlight">
$id, label, description
#country, country, "The country for which the population is being provided."
#year, year, "The year for which the population is being provided."
#population,population,"The number of people populating the given country in the given year."
</pre>
<p>
To make it easier to use common vocabularies, a field within the <code>url</code> prolog line may contain a CURIE (in the form <code><var>prefix</var>:<var>name</var></code>) as a shorthand for a URL. If a field within the <code>url</code> prolog line starts with a <dfn>recognised prefix</dfn> followed by a colon, that prefix is expanded to its <dfn>namespace</dfn>, which is then prepended to the remainder of the CURIE (after the colon). The recognised prefixes are:
</p>
<table>
<thead>
<tr><th>prefix</th><th>namespace</th><th>description</th></tr>
</thead>
<tbody>
<tr><th colspan="3">Generic Vocabularies</th></tr>
<tr><td><code>rel</code></td><td><code>http://www.iana.org/assignments/relation/</code></td><td><a href="http://www.iana.org/assignments/link-relations/link-relations.xml">IANA Link Relations</a></td></tr>
<tr><td><code>schema</code></td><td><code>http://schema.org/</code></td><td>schema.org</td></tr>
</tbody>
<tbody>
<tr><th colspan="3">Metadata Vocabularies</th></tr>
<tr><td><code>dc</code></td><td><code>http://purl.org/dc/terms/</code></td><td>Dublin Core Metadata Terms</td></tr>
<tr><td><code>dct</code></td><td><code>http://purl.org/dc/terms/</code></td><td>Dublin Core Metadata Terms</td></tr>
<tr><td><code>cc</code></td><td><code>http://creativecommons.org/ns#</code></td><td>Creative Commons Rights Expression Language</td></tr>
<tr><td><code>void</code></td><td><code>http://rdfs.org/ns/void#</code></td><td>VoID</td></tr>
<tr><td><code>wdrs</code></td><td><code>http://www.w3.org/2007/05/powder-s#</code></td><td>POWDER-S</td></tr>
</tbody>
<tbody>
<tr><th colspan="3">Schema Vocabularies</th></tr>
<tr><td><code>rdf</code></td><td><code>http://www.w3.org/1999/02/22-rdf-syntax-ns#</code></td><td>RDF</td></tr>
<tr><td><code>rdfs</code></td><td><code>http://www.w3.org/2000/01/rdf-schema#</code></td><td>RDF Schema</td></tr>
<tr><td><code>owl</code></td><td><code>http://www.w3.org/2002/07/owl#</code></td><td>OWL</td></tr>
<tr><td><code>skos</code></td><td><code>http://www.w3.org/2004/02/skos/core#</code></td><td>SKOS</td></tr>
<tr><td><code>skos-xl</code></td><td><code>http://www.w3.org/2008/05/skos-xl#</code></td><td>SKOS Extensions for Labels</td></tr>
</tbody>
</table>
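<p>
The expansion of CURIEs using the table above can be sketched as follows. The function name <code>expand_curie</code> is hypothetical; the prefix mapping is copied from the table, and any field that does not start with a recognised prefix is returned unchanged so that it can be resolved as an ordinary relative or absolute URI.
</p>

```python
# Recognised prefixes and their namespaces, as listed in the table above.
RECOGNISED_PREFIXES = {
    "rel": "http://www.iana.org/assignments/relation/",
    "schema": "http://schema.org/",
    "dc": "http://purl.org/dc/terms/",
    "dct": "http://purl.org/dc/terms/",
    "cc": "http://creativecommons.org/ns#",
    "void": "http://rdfs.org/ns/void#",
    "wdrs": "http://www.w3.org/2007/05/powder-s#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "owl": "http://www.w3.org/2002/07/owl#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "skos-xl": "http://www.w3.org/2008/05/skos-xl#",
}


def expand_curie(field):
    """Expand a field of a url prolog line if it starts with a recognised prefix."""
    prefix, colon, name = field.partition(":")
    if colon and prefix in RECOGNISED_PREFIXES:
        return RECOGNISED_PREFIXES[prefix] + name
    return field  # not a recognised CURIE: leave for normal URI resolution
```

<p>
Because <code>http</code> is not a recognised prefix, absolute URLs pass through untouched, as do fragment-only references such as <code>#name</code>.
</p>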
<p class="issue">
This list is largely based on hunches about which vocabularies are going to be useful in linked CSV documents, coupled with some dogma in pushing schema.org as the vocabulary to rule them all. An alternative would be to define the same prefixes as listed in <a href="http://www.w3.org/2011/rdfa-context/rdfa-1.1"><code>http://www.w3.org/2011/rdfa-context/rdfa-1.1</code></a>.
</p>
<p class="issue">
There's no support for declaring your own prefixes or declaring a default prefix/vocabulary.
</p>
<p>
Linked CSV files that describe the properties used within other linked CSV files SHOULD use the RDFS vocabulary, which contains properties such as <code>rdfs:label</code> and <code>rdfs:comment</code>, to provide details about the properties. For example:
</p>
<pre class="example highlight">
$id, label, description
<strong>url, rdfs:label,rdfs:comment</strong>
#country, country, "The country for which the population is being provided."
#year, year, "The year for which the population is being provided."
#population,population,"The number of people populating the given country in the given year."
</pre>
</section>
<section id="self-describing">
<h4>Self Description</h4>
<p>
Linked CSV files should be self-describing. They should include important metadata about the source of the data they contain, their license conditions, and links to other files that contain non-essential supplementary information. Although the file might be described within other files, and metadata might be made available through the HTTP headers, it is safer to embed this metadata within the file as there is no guarantee that metadata stored outside the file will be available as the data is passed around.
</p>
<p>
To provide metadata about the linked CSV document, the file has to contain a <code>meta</code> prolog line, which provides metadata about the file or records within the file. If there is a <code>$id</code> column, the value within that column indicates what the metadata is about: an empty value (or a missing <code>$id</code> column) indicates the metadata is associated with the file as a whole.
</p>
<p>
The remainder of each metadata line should hold the following values, in order:
</p>
<ol>
<li>a label for a property of the entity indicated in the <code>$id</code> column</li>
<li>a value, the value of the property for that entity</li>
<li>optionally, a type or language annotation for the property, which is interpreted in the same way as the values in a <code>type</code> or <code>lang</code> prolog line</li>
<li>optionally, a URI that is the global identifier for the property, which is interpreted in the same way as the values in a <code>url</code> prolog line</li>
</ol>
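<p>
As a rough illustration of the ordering above, the following Python sketch splits one metadata record into its parts. The function name <code>parse_meta_line</code> and the keys of the returned dictionary are hypothetical and are not defined by this specification; the sketch assumes the record has already been split into fields by a CSV parser.
</p>

```python
def parse_meta_line(header, record):
    """Split a `meta` record into the entity it is about and its ordered values."""
    names = [h.strip() for h in header]
    fields = [f.strip() for f in record]
    # Pad short records so that every header column has a field.
    fields += [""] * (len(names) - len(fields))
    about = fields[names.index("$id")] if "$id" in names else ""
    # Everything outside the `#` and `$id` columns is, in order:
    # label, value, type-or-language annotation, property URI.
    rest = [f for i, f in enumerate(fields)
            if i >= len(names) or names[i] not in ("#", "$id")]
    rest += [""] * (4 - len(rest))
    label, value, annotation, property_uri = rest[:4]
    return {
        "about": about or None,  # None: the metadata is about the file as a whole
        "label": label,
        "value": value,
        "annotation": annotation or None,
        "property": property_uri or None,
    }
```

<p>
An empty or missing <code>$id</code> value is reported as <code>None</code>, reflecting the rule that such metadata is associated with the file as a whole.
</p>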
<p>
In our example, the <code>http://example.org/af-population</code> file may be part of a series of files available for different countries, and the metadata provide a pointer to an index document (<code>http://example.org/populations</code>) and to a license for the file:
</p>
<pre class="example highlight">
#, country, year,population
type,url, time,integer
<strong>meta,index, /populations, url
meta,license, http://creativecommons.org/publicdomain/mark/1.0/, url</strong>
, http://en.wikipedia.org/wiki/Afghanistan,1960,9616353
, http://en.wikipedia.org/wiki/Afghanistan,1961,9799379
, http://en.wikipedia.org/wiki/Afghanistan,1962,9989846
, http://en.wikipedia.org/wiki/Afghanistan,1963,10188299
</pre>
<p>
In this example, none of the remaining data lines have identifiers themselves. The corresponding JSON would be:
</p>
<pre class="json example highlight">
[{
<strong> "@id": "http://example.org/af-population",
"index": "http://example.org/populations",
"license": "http://creativecommons.org/publicdomain/mark/1.0/"</strong>
}, {
"country": "http://en.wikipedia.org/wiki/Afghanistan",
"year": 1960,
"population": 9616353
}, {
"country": "http://en.wikipedia.org/wiki/Afghanistan",
"year": 1961,
"population": 9799379
}, {
"country": "http://en.wikipedia.org/wiki/Afghanistan",
"year": 1962,
"population": 9989846
}, {
"country": "http://en.wikipedia.org/wiki/Afghanistan",
"year": 1963,
"population": 10188299
}]
</pre>
<p>
The corresponding RDF would be:
</p>
<pre class="example highlight">
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix rel: <http://www.iana.org/assignments/relation/> .
@prefix : <http://example.org/af-population#> .
<strong><>
rel:describedby
<http://example.org/af-population#row:1>,
<http://example.org/af-population#row:2> ;
:index <populations> ;
:license <http://creativecommons.org/publicdomain/mark/1.0/> ;
.</strong>
[ rel:describedby <http://example.org/af-population#row:3> ;
:country <http://en.wikipedia.org/wiki/Afghanistan> ;
:year "1960"^^xsd:gYear ;
:population 9616353 ] .
[ rel:describedby <http://example.org/af-population#row:4> ;
:country <http://en.wikipedia.org/wiki/Afghanistan> ;
:year "1961"^^xsd:gYear ;
:population 9799379 ] .
[ rel:describedby <http://example.org/af-population#row:5> ;
:country <http://en.wikipedia.org/wiki/Afghanistan> ;
:year "1962"^^xsd:gYear ;
:population 9989846 ] .
[ rel:describedby <http://example.org/af-population#row:6> ;
:country <http://en.wikipedia.org/wiki/Afghanistan> ;
:year "1963"^^xsd:gYear ;
:population 10188299 ] .
</pre>
</section>
<section id="see-also">
<h4>Additional Data Sources</h4>
<p>
A <a>prolog line</a> in which the value of the <code>#</code> column is <code>see</code> provides pointers to other linked CSV files that describe the resources in appropriate columns.
</p>
<p>
Within a <code>see</code> line, columns that hold URI values (those with <code>url</code> as the corresponding value in the <code>type</code> prolog line) can reference additional linked CSV files that describe the entities identified by the URIs in that column. For example, the population data within <code>http://example.org/af-population</code> references a country described within <code>http://example.org/countries</code>. The population file would include:
</p>
<pre class="example highlight">
#, country, year,population
type,url, time,integer
<strong>see, /countries, ,</strong>
, http://en.wikipedia.org/wiki/Afghanistan,1960, 9616353
, http://en.wikipedia.org/wiki/Afghanistan,1961, 9799379
, http://en.wikipedia.org/wiki/Afghanistan,1962, 9989846
, http://en.wikipedia.org/wiki/Afghanistan,1963, 10188299
</pre>
<p>
This indicates that an application can look within <code>http://example.org/countries</code> to find more information about some or all of the URIs within the <code>country</code> column. The URIs within the <code>$id</code> column in that file should match the URIs within the <code>country</code> column in this file.
</p>
<p>
If there is no <code>type</code> prolog line, a value in a <code>see</code> prolog line indicates that the column holds URIs (as if the <code>type</code> was set to <code>url</code>). If there is a <code>type</code> prolog line but the type of the column has a value other than <code>url</code>, values in the <code>see</code> prolog lines for that column are ignored.
</p>
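<p>
The interaction between <code>see</code> and <code>type</code> prolog lines could be sketched in Python as follows. The function name <code>see_references</code> is hypothetical. The sketch also makes two assumptions that the text above does not settle: an empty <code>type</code> value is treated as "not declared" rather than as a type other than <code>url</code>, and the <code>$id</code> column is always treated as holding URIs.
</p>

```python
from urllib.parse import urljoin


def see_references(header, see_line, base_uri, type_line=None):
    """Map each column name to the absolute URI of the file its `see` value names."""
    refs = {}
    for i, name in enumerate(header):
        name = name.strip()
        target = see_line[i].strip() if i < len(see_line) else ""
        if name == "#" or not target:
            continue
        if type_line is not None and name != "$id":
            declared = type_line[i].strip() if i < len(type_line) else ""
            # A declared type other than `url` means the `see` value is ignored.
            if declared and declared != "url":
                continue
        # Relative references are resolved against the base URI of the file.
        refs[name] = urljoin(base_uri, target)
    return refs
```

<p>
When no <code>type</code> line is passed, every non-empty <code>see</code> value is kept, which corresponds to treating those columns as if their type were <code>url</code>.
</p>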
<p>
This technique can also be used to point to additional data about the entities described within the linked CSV file itself. For example, if another publisher also published a linked CSV file containing information about countries at <code>http://other.example.com/countries</code> (perhaps providing their names in other languages or describing their capital cities), we could reference it from the <code>http://example.org/countries</code> file as follows:
</p>
<pre class="example highlight">
#, $id, country,english name, french name
url, , , #name, #name
lang,, , en, fr
<strong>see, http://other.example.com/countries, , ,</strong>
, http://en.wikipedia.org/wiki/Andorra, AD, Andorra, Andorre
, http://en.wikipedia.org/wiki/Andorra, , Principality of Andorra,
, http://en.wikipedia.org/wiki/Afghanistan,AF, Afghanistan, Afghanistan
, http://en.wikipedia.org/wiki/Afghanistan,, Islamic Republic of Afghanistan,
</pre>
</section>
</section>
<section id="epilog-lines">
<h3>Epilog Lines</h3>
<p>
A linked CSV file can contain any number of epilog lines. <dfn title="epilog line">Epilog lines</dfn> provide additional metadata about the contents of the linked CSV file, typically annotations on its contents. Epilog lines can only be present if there is a column named <code>#</code>; any record after the first data line that has the value <code>meta</code> in that column is an epilog line. These lines are interpreted in the same way as metadata lines in the prolog, as described in <a href="#self-describing" class="sectionRef"></a>.
</p>
<p>
Epilog lines must appear at the end of the file. Any <a title="data line">data lines</a> that appear after the first epilog line must be ignored.
</p>
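<p>
The following Python sketch shows one way to separate the records of a file into prolog, data and epilog lines according to these rules. The function name <code>split_records</code> is hypothetical. The sketch assumes that data lines are the records with an empty value in the <code>#</code> column, and it silently drops records after the first data line whose <code>#</code> value is neither empty nor <code>meta</code>, a case this section does not cover.
</p>

```python
def split_records(header, records):
    """Return (prolog, data, epilog) lists for the records that follow the header."""
    names = [h.strip() for h in header]
    if "#" not in names:
        # Without a `#` column there can be no prolog or epilog lines.
        return [], list(records), []
    col = names.index("#")
    prolog, data, epilog = [], [], []
    for record in records:
        marker = record[col].strip() if col < len(record) else ""
        if marker == "":
            # Data lines after the first epilog line are ignored.
            if not epilog:
                data.append(record)
        elif not data:
            prolog.append(record)
        elif marker == "meta":
            epilog.append(record)
    return prolog, data, epilog
```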
<p>
Metadata epilog lines can be used to provide metadata about other parts of the linked CSV file by using <a href="http://tools.ietf.org/html/draft-hausenblas-csv-fragment-03">URI Identifiers for the text/csv Media Type</a>. These can be used to refer to rows, columns, and sets of rows that have common value(s) for particular fields. For example:
</p>
<pre class="example highlight">
#, $id, country,english name, french name
url, , , #name, #name
lang,, , en, fr
, http://en.wikipedia.org/wiki/Andorra, AD, Andorra, Andorre
, http://en.wikipedia.org/wiki/Andorra, , Principality of Andorra,
, http://en.wikipedia.org/wiki/Afghanistan,AF, Afghanistan, Afghanistan
, http://en.wikipedia.org/wiki/Afghanistan,, Islamic Republic of Afghanistan,
<strong>meta,#col=english%20name, note, "contains both official and popular names",</strong>
</pre>
<p class="note">
Enabling metadata to appear at the end of the file helps when adding metadata about the rows or cells within the linked CSV using fragment identifiers. If these are added in the prolog, their addition causes the row number of the referenced cells to increment. Adding annotations at the bottom of the file avoids this problem.
</p>
</section>
</section>
<section>
<h2>Packaging</h2>
<p>
It is useful to be able to package together sets of linked CSV files, which may include multiple interrelated tables of data. A linked CSV package is simply a set of such files within a zip archive. These files should use relative links when pointing to other files within the package.
</p>
<p>
The entry point for a linked CSV package is always named <code>index.csv</code>. The index file is interpreted in the same way as any other linked CSV file, but the entities it describes are the files within the package. As well as an <code>$id</code> column, the index file will usually have a description column or similar, for example:
</p>
<pre class="example highlight">
$id, description
countries.csv, "list of countries"
populations.csv, "index of files containing information about the populations in different countries"
ad-population.csv,"populations of Andorra"
af-population.csv,"populations of Afghanistan"
...
</pre>
<p>
Adding metadata within the index file is useful if it can help recipients understand the structure of the package as a whole. Sufficient metadata should be listed within the index file to enable the recipient to tell whether each file should be opened, but the majority of the metadata about the file should be included within the file itself.
</p>
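<p>
A consumer might locate the entry point of a package along the following lines. This is a minimal Python sketch using the standard <code>zipfile</code> and <code>csv</code> modules; the function name <code>read_package_index</code> is hypothetical, and the sketch assumes that <code>index.csv</code> is UTF-8 encoded and has no <code>#</code> column (and therefore no prolog lines).
</p>

```python
import csv
import io
import zipfile


def read_package_index(source):
    """Read index.csv from a package and return one dict per listed file."""
    with zipfile.ZipFile(source) as package:
        text = package.read("index.csv").decode("utf-8")
    # skipinitialspace lets quoted fields follow the padding after a comma.
    reader = csv.reader(io.StringIO(text, newline=""), skipinitialspace=True)
    rows = [row for row in reader if row]
    header = [name.strip() for name in rows[0]]
    return [dict(zip(header, (field.strip() for field in row))) for row in rows[1:]]
```

<p>
Each returned dictionary carries the relative link from the <code>$id</code> column, which can then be opened from the same archive.
</p>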
<p class="issue">
TODO: Reference schema.org dataset vocabulary here?
</p>
<div class="issue">
<p>
An alternative design would be to use the <code>http://www.iana.org/assignments/relation/item</code> property to indicate the relationship between the index file and the items in the package; that relationship could then be used recursively so that there's no need to list <em>all</em> the files in the package within the index file. In that case, the index file would look like:
</p>
<pre class="example highlight">
#, $id,item
type,, url
url, , http://www.iana.org/assignments/relation/item
, , countries.csv
, , populations.csv
, , ad-population.csv
, , af-population.csv
...
</pre>
<p>
The disadvantage with this is that it's more difficult to add metadata about the files themselves.
</p>
</div>
<p>
The index file may list any number of the files in the package: it does not need to list them all, merely to provide entry points such that the others can be located through the <code>see</code> prolog lines in the files or through the URLs in the <code>$id</code> column or other columns labelled with the type <code>url</code>.
</p>
</section>
<section id="json-mapping">
<h2>Mapping to JSON</h2>
<p>
Linked CSV does not have to be mapped to JSON, but it can be used to create a JSON document (or, in the case of a package of linked CSV files, a collection of JSON documents) for systems that store information as JSON. Two conversions are provided here. One generates a simple JSON format that loses much of the information that is encoded within the linked CSV file; the other generates a more complex JSON-LD file that preserves that information.
</p>
<section id="csv-to-simple-json">
<h3>Parsing Linked CSV as Simple JSON</h3>
<p>
The result of this parsing is an array of objects, one per entity in the linked CSV. An entity is generated for each data line that does not have a <code>$id</code> value, and for each unique <code>$id</code> value. If the entity has an identifier value in the <code>$id</code> column, the JSON object is given a <code>"@id"</code> property whose value is that URI identifier resolved against the base URI of the linked CSV document. Thus each JSON object is associated with a sequence of one or more data lines from the linked CSV file.
</p>
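<p>
The grouping of data lines into entities might be sketched in Python as follows. The function name <code>group_entities</code> is hypothetical, and the sketch compares <code>$id</code> values as raw strings, which assumes that equivalent identifiers are written identically within the file.
</p>

```python
def group_entities(header, data_lines):
    """Return a list of (identifier, lines) pairs in order of first appearance."""
    names = [h.strip() for h in header]
    id_col = names.index("$id") if "$id" in names else None
    entities = []
    by_id = {}
    for line in data_lines:
        identifier = ""
        if id_col is not None and id_col < len(line):
            identifier = line[id_col].strip()
        if identifier == "":
            # Every data line without an identifier is an entity of its own.
            entities.append((None, [line]))
        elif identifier in by_id:
            by_id[identifier].append(line)
        else:
            by_id[identifier] = [line]
            entities.append((identifier, by_id[identifier]))
    return entities
```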
<p>
Each column within the linked CSV file is mapped to a property within the JSON file, as follows:
</p>
<ol>
<li>if there is a <code>url</code> prolog line and the <code>url</code> prolog line contains a URI for the column then
<ol>
<li>if the URI is a fragment of the linked CSV file, then the unescaped fragment identifier of that URI (after the <code>#</code>)</li>
<li>otherwise, the URI resolved against the base URI of the linked CSV file</li>
</ol>
</li>
<li>otherwise, the label of the column from the header</li>
</ol>
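<p>
The steps above can be sketched in Python as follows. The function name <code>json_property_name</code> is hypothetical; the sketch assumes the <code>url</code> prolog value is either empty, a relative reference or an absolute URI, and does not expand prefixed names such as <code>rdfs:label</code>.
</p>

```python
from urllib.parse import unquote, urldefrag, urljoin


def json_property_name(label, url_value, base_uri):
    """Choose the JSON property name for a column."""
    url_value = (url_value or "").strip()
    if not url_value:
        # No URI for the column: fall back to the header label.
        return label.strip()
    absolute = urljoin(base_uri, url_value)
    document, fragment = urldefrag(absolute)
    if fragment and document == urldefrag(base_uri)[0]:
        # A fragment of the linked CSV file itself: use the unescaped fragment.
        return unquote(fragment)
    return absolute
```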
<p>
As the result of this algorithm, multiple columns may be mapped to a single property. Where there are multiple columns mapping to a single property, that property is marked as <dfn title="expects arrays">expecting arrays</dfn>. If any of the columns comprising the property has a value within the <code>lang</code> prolog line, the property is marked as a <dfn>language property</dfn>.
</p>
<p>
Each sequence of data lines associated with the JSON object is processed as follows. A property is created within the JSON object for each property for which the data lines provide values (properties with no values are left undefined).
</p>
<p>
If the property is a <a>language property</a>, it is mapped into an object with a property for each provided language. If the property <a>expects arrays</a>, the values of these properties will be arrays; otherwise the values will be plain strings.
</p>
<p>
If the property is not a <a>language property</a> and it <a>expects arrays</a>, it is assigned an array of values even if only one value is provided within the data lines. Each value is processed as follows:
</p>
<ol>
<li>if the value (as given by the <code>type</code> prolog line or inferred from the syntax of the value, as described in <a href="#property-types" class="sectionRef"></a>) is of the type <code>integer</code>, <code>decimal</code> or <code>double</code>, then if it is numeric, it is mapped to a number, otherwise to <code>null</code></li>
<li>if the value is of the type <code>boolean</code>, if it has the value <code>true</code> or <code>false</code>, it is mapped to a boolean, otherwise to <code>null</code></li>
<li>if the value is a year, it is mapped to a number</li>
<li>if the value is of another date/time datatype, it is mapped to a string</li>
<li>if the value is typed as a URI, it is resolved as a URI against the base URI of the linked CSV file and the resulting URI is used as the (string) value of the property</li>
<li>otherwise, it is mapped to a string</li>
</ol>
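<p>
The value conversion above could be sketched in Python as follows. The function name <code>to_json_value</code> is hypothetical. The type labels are taken from the examples in this document; in particular, the sketch assumes that date/time values carry the type <code>time</code> and that a year is a value of four or more digits, which may not match the definitions in <a href="#property-types" class="sectionRef"></a>.
</p>

```python
import re
from urllib.parse import urljoin

INTEGER = re.compile(r"[+-]?[0-9]+")
NUMBER = re.compile(r"[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)([eE][+-]?[0-9]+)?")
YEAR = re.compile(r"-?[0-9]{4,}")  # assumed syntax for a year


def to_json_value(value, value_type, base_uri):
    """Convert one field to a JSON-compatible Python value."""
    value = value.strip()
    if value_type == "integer":
        return int(value) if INTEGER.fullmatch(value) else None
    if value_type in ("decimal", "double"):
        return float(value) if NUMBER.fullmatch(value) else None
    if value_type == "boolean":
        # Anything other than the two literals maps to null.
        return {"true": True, "false": False}.get(value)
    if value_type == "time":
        # Years become numbers; other date/time values stay strings.
        return int(value) if YEAR.fullmatch(value) else value
    if value_type == "url":
        return urljoin(base_uri, value)
    return value
```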
<p class="issue">
TODO: handle recursive processing into referenced linked CSV files
</p>
</section>
<section id="csv-to-json-ld">
<h3>Parsing Linked CSV as JSON-LD</h3>
<p class="issue">
TODO
</p>
</section>
</section>
<section id="xml-mapping">
<h2>Mapping to XML</h2>
<p>
Linked CSV does not have to be mapped to XML, but it can be used to create an XML document (or, in the case of a package of linked CSV files, a collection of XML documents) for systems that store information as XML.
</p>
<section id="csv-to-xml">
<h3>Parsing Linked CSV as XML</h3>
<p>
The namespace for the standard elements is <code>http://example.org/linked-csv</code>, which is conventionally associated with the prefix <code>csv</code>. The document element is named <code><csv:collection></code>. It is given the following attributes:
</p>
<ul>
<li>an <code>xml:base</code> attribute whose value is the base URI of the linked CSV file</li>
<li>an <code>xmlns:csv</code> namespace declaration for the namespace <code>http://example.org/linked-csv</code></li>
<li>an <code>xmlns:xsd</code> namespace declaration for the namespace <code>http://www.w3.org/2001/XMLSchema</code></li>
<li>an <code>xmlns:xsi</code> namespace declaration for the namespace <code>http://www.w3.org/2001/XMLSchema-instance</code></li>
</ul>
<p>
An <code><csv:item></code> element is generated for each entity in the linked CSV. The entities are uniquely identified by the value of the <code>$id</code> column; data lines with the same <code>$id</code> are merged into a single <code><csv:item></code> element, though a separate <code><csv:item></code> element is generated for each data line with no <code>$id</code> value. The value of the <code>$id</code> column becomes the value of the <code>@href</code> attribute on the <code><csv:item></code> element.
</p>
<p>
Within the <code><csv:item></code> element, a child element is generated for each unique value of each property (values from different columns, which may have different vocabularies, datatypes or languages, create separate elements). Note that the <code>$id</code> column, if it exists, is not processed in this way. The name of the child element is determined as follows:
</p>
<ol>
<li>if there is a <code>url</code> prolog line and the <code>url</code> prolog line contains a URI for the column then
<ol>
<li>if the URI is a fragment of the linked CSV file, then the child element is in no namespace and the local name is based on the unescaped fragment identifier of that URI (after the <code>#</code>)</li>
<li>otherwise, the URI is resolved against the base URI of the linked CSV file; the child element's namespace is the part of the URI up to and including the final <code>#</code> if the URI contains a <code>#</code>, or the final <code>/</code> if it does not; the local name is based on the substring of the URI after the <code>#</code> or <code>/</code></li>
</ol>
</li>
<li>otherwise, the child element is in no namespace and the local name is based on the label of the column from the header line</li>
</ol>
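<p>
The naming rules above can be sketched in Python as follows. The function name <code>xml_element_name</code> is hypothetical. The sketch returns the raw local name and leaves out the normalisation into valid XML names, which this specification has not yet defined.
</p>

```python
from urllib.parse import unquote, urldefrag, urljoin


def xml_element_name(label, url_value, base_uri):
    """Return (namespace, local_name); namespace is None for 'no namespace'."""
    url_value = (url_value or "").strip()
    if not url_value:
        return None, label.strip()
    absolute = urljoin(base_uri, url_value)
    document, fragment = urldefrag(absolute)
    if fragment and document == urldefrag(base_uri)[0]:
        # A fragment of the linked CSV file itself: no namespace.
        return None, unquote(fragment)
    # Split after the final `#` if there is one, otherwise after the final `/`.
    separator = "#" if "#" in absolute else "/"
    cut = absolute.rfind(separator) + 1
    return absolute[:cut], absolute[cut:]
```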
<p class="issue">
TODO: normalisation of property names into XML names
</p>
<p>
The attributes and content of the child element are determined as follows:
</p>
<ol>
<li>if the column has the type <code>url</code>, the element is given an <code>href</code> attribute whose value is the URI in the relevant field</li>
<li>otherwise, the element's content is set to the value of the field; additionally
<ol>
<li>
if the value has a datatype associated with it, add an <code>xsi:type</code> attribute whose value is <code>xsd:<var>datatype</var></code>
</li>
<li>
if the column is associated with a language through the <code>lang</code> prolog line, add an <code>xml:lang</code> attribute whose value is the language in that prolog line
</li>
</ol>
</li>
</ol>
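<p>
Using Python's standard <code>xml.etree.ElementTree</code> module, the rules above might be sketched as follows. The function name <code>child_element</code> and its parameters are hypothetical; <code>datatype</code> is assumed to be an XML Schema datatype name that has already been derived from the <code>type</code> prolog line, and the <code>xsd</code> prefix is assumed to be declared on the document element.
</p>

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"
XML = "http://www.w3.org/XML/1998/namespace"


def child_element(name, value, is_url=False, datatype=None, lang=None):
    """Build the child element for one field of a data line."""
    element = ET.Element(name)
    if is_url:
        # URI-valued columns carry their value in an href attribute.
        element.set("href", value)
        return element
    element.text = value
    if datatype:
        element.set("{%s}type" % XSI, "xsd:" + datatype)
    if lang:
        element.set("{%s}lang" % XML, lang)
    return element
```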
<p class="issue">
TODO: handle recursive processing into referenced linked CSV files
</p>
</section>
</section>
<section id="rdf-mapping">
<h2>Mapping to RDF</h2>
<p>
Linked CSV does not have to be mapped to RDF, but it can be used to create a graph (or, in the case of a package of linked CSV files, a set of graphs) for systems that store information as RDF.
</p>
<section id="csv-to-rdf">
<h3>Parsing Linked CSV as RDF</h3>
<p>
Each <a>data line</a> describes a resource, which has properties whose URIs are generated based on the names of the columns given in the header and the URIs given in the <code>url</code> prolog line, and values based on the values given within the data lines.
</p>
<p>
If the data line has a <code>$id</code> value, this gives the URI for the resource (resolved against the base URI of the linked CSV file). If it does not have a <code>$id</code> value, it is a blank node. Either way, a triple must be generated of the form:
</p>
<pre>
<var>resource</var> <http://www.iana.org/assignments/relation/describedby> <var>CSV-line</var> .
</pre>
<p>
where the <var>CSV-line</var> is a reference to the row that describes the resource, using a fragment identifier of the form <code>#row:<var>N</var></code>. Note that there may be many such <code>describedby</code> statements for a single resource if its description is split over several lines.
</p>
<p>
If there is a <code>url</code> prolog line in the linked CSV file, and it contains a value in a given column, this is used as the URI for the property. Otherwise, the property URI is constructed by resolving the fragment identifier <code>#<var>escaped-header</var></code> against the base URI of the linked CSV file, where <var>escaped-header</var> is the URL-escaped version of the header for the column.
</p>
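<p>
A minimal Python sketch of this rule follows. The function name <code>rdf_property_uri</code> is hypothetical; the sketch assumes that a value from the <code>url</code> prolog line is resolved against the base URI of the file, as in the JSON mapping, and that URL-escaping means percent-encoding every reserved character in the header.
</p>

```python
from urllib.parse import quote, urljoin


def rdf_property_uri(header, url_value, base_uri):
    """Return the property URI for a column."""
    url_value = (url_value or "").strip()
    if url_value:
        return urljoin(base_uri, url_value)
    # No URI given: build one from the escaped header as a fragment identifier.
    return urljoin(base_uri, "#" + quote(header.strip(), safe=""))
```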
<p>
For each data line, an RDF statement is generated for each column aside from the <code>#</code> and <code>$id</code> columns. The URI of the property is determined as above. The value of the property is interpreted as one of:
</p>
<ol>
<li>if the column holds URIs, a URI reference to another resource</li>
<li>otherwise, a literal value:
<ol>
<li>if the value has a datatype, append the datatype to the URI <code>http://www.w3.org/2001/XMLSchema#</code> to get the datatype URI
</li>
<li>
otherwise, if the column is associated with a language through the <code>lang</code> prolog line, a literal value with the language indicated
</li>
<li>otherwise a literal value with the datatype <code>http://www.w3.org/2001/XMLSchema#string</code></li>
</ol>
</li>
</ol>
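<p>
The choice between a URI reference and the three kinds of literal could be sketched in Python as follows, producing the object term in N-Triples syntax. The function names <code>rdf_object</code> and <code>escape_literal</code> are hypothetical, and <code>datatype</code> is assumed to be the XML Schema datatype name already derived from the <code>type</code> prolog line.
</p>

```python
from urllib.parse import urljoin

XSD = "http://www.w3.org/2001/XMLSchema#"


def escape_literal(text):
    """Escape the characters that cannot appear raw in an N-Triples literal."""
    return (text.replace("\\", "\\\\").replace('"', '\\"')
                .replace("\n", "\\n").replace("\r", "\\r"))


def rdf_object(value, base_uri, is_uri=False, datatype=None, lang=None):
    """Return the object of a statement as an N-Triples term."""
    if is_uri:
        return "<%s>" % urljoin(base_uri, value)
    literal = '"%s"' % escape_literal(value)
    if datatype:
        return "%s^^<%s%s>" % (literal, XSD, datatype)
    if lang:
        return "%s@%s" % (literal, lang)
    return "%s^^<%sstring>" % (literal, XSD)
```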
<p>
Multiple equivalent triples may be generated through this process if the resource is described by more than one row; these will be merged naturally as part of RDF semantics.
</p>
<p class="issue">
TODO: handle recursive processing into referenced linked CSV files
</p>
</section>
<section id="rdf-to-csv">
<h3>Publishing RDF as Linked CSV</h3>
<p class="note">
TODO: This wouldn't be too hard to do, though lossy.
</p>
</section>
</section>
<section class="appendix">
<h2>Acknowledgements</h2>
<p>
This work is inspired by <a href="https://developers.google.com/public-data/">Google's Dataset Publishing Language</a> and <a href="http://www.dataprotocols.org/en/latest/simple-data-format.html">OKFN's Simple Data Format</a>, along with some suggestions from Francis Irving and review by John Sheridan, Leigh Dodds and Tim Berners-Lee.
</p>
</section>
</body>
</html>