Skip to main content

tableschema-r

CRAN_Status_Badge Build Status Coverage status Project Status: Active – The project has reached a stable, usable state and is being actively developed. minimal R version Stats Licence Codebase Support

Description#

R library for working with Table Schema.

Features#

  • Table class for working with data and schema
  • Schema class for working with schemas
  • Field class for working with schema fields
  • validate function for validating schema descriptors
  • infer function that creates a schema based on a data sample

Getting started#

Installation#

In order to install the latest distribution of R software to your computer you have to select one of the mirror sites of the Comprehensive R Archive Network, select the appropriate link for your operating system and follow the wizard instructions.

For windows users you can:

  1. Go to CRAN
  2. Click download R for Windows
  3. Click Base (This is what you want to install R for the first time)
  4. Download the latest R version
  5. Run installation file and follow the instrustions of the installer.

(Mac) OS X and Linux users may need to follow different steps depending on their system version to install R successfully and it is recommended to read the instructions on CRAN site carefully.

Even more detailed installation instructions can be found in R Installation and Administration manual.

To install RStudio, you can download RStudio Desktop with Open Source License and follow the wizard instructions:

  1. Go to RStudio
  2. Click download on RStudio Desktop
  3. Download on RStudio Desktop free download
  4. Select the appropriate file for your system
  5. Run installation file

To install the tableschema library it is necessary to install first devtools library to make installation of github libraries available.

# Install devtools package if not already
install.packages("devtools")

Install tableschema.r

# from CRAN version
install.packages("tableschema.r")
# or install the development version from github
devtools::install_github("frictionlessdata/tableschema-r")

Load library#

# Install devtools package if not already
# install.packages("jsonlite")
library(jsonlite)
# Install devtools package if not already
# install.packages("future")
library(future)
# load the library using
library(tableschema.r)

Documentation#

Jsonlite package is internally used to convert json data to list objects. The input parameters of functions could be json strings, files or lists and the outputs are in list format to easily further process your data in R environment and exported as desired. The examples below show how to use jsonlite package to convert the output back to json adding indentation whitespace. More details about handling json you can see jsonlite documentation or vignettes here.

Moreover future package is also used to load and create Table and Schema classes asynchronously. To retrieve the actual result of the loaded Table or Schema you have to use value(...) to the variable you stored the loaded Table/Schema. More details about future package and sequential and parallel processing you can find here.

Working with Table#

A table is a core concept in a tabular data world. It represents a data with a metadata (Table Schema). Let’s see how we could use it in practice.

Consider we have some local csv file. It could be inline data or remote link - all supported by Table class (except local files for in-brower usage of course). But say it’s data.csv for now:

data/cities.csv

city,location
london,"51.50,-0.11"
paris,"48.85,2.30"
rome,N/A

Let’s create and read a table. We use static Table.load method and table.read method with a keyed option to get list of keyed rows:

def = Table.load('inst/extdata/data.csv')
table = value(def)
# add indentation whitespace to JSON output with jsonlite package
toJSON(table$read(keyed = TRUE), pretty = TRUE) # function from jsonlite package
## [
## {
## "city": ["london"],
## "location": ["\"51.50 -0.11\""]
## },
## {
## "city": ["paris"],
## "location": ["\"48.85 2.30\""]
## },
## {
## "city": ["rome"],
## "location": ["N/A"]
## }
## ]
table.headers = table$headers
table.headers
## [[1]]
## [1] "city"
##
## [[2]]
## [1] "location"

As we could see our locations are just a strings. But it should be geopoints. Also Rome’s location is not available but it’s also just a N/A string instead of null. First we have to infer Table Schema:

# add indentation whitespace to JSON output with jsonlite package
toJSON(table$infer(), pretty = TRUE) # function from jsonlite package
## {
## "fields": [
## {
## "name": ["city"],
## "type": ["string"],
## "format": ["default"]
## },
## {
## "name": ["location"],
## "type": ["string"],
## "format": ["default"]
## }
## ],
## "missingValues": [
## [""]
## ]
## }
toJSON(table$schema$descriptor, pretty = TRUE) # function from jsonlite package
## {
## "fields": [
## {
## "name": ["city"],
## "type": ["string"],
## "format": ["default"]
## },
## {
## "name": ["location"],
## "type": ["string"],
## "format": ["default"]
## }
## ],
## "missingValues": [
## [""]
## ]
## }
table$read(keyed = TRUE) # Fails

Let’s fix not available location. There is a missingValues property in Table Schema specification. As a first try we set missingValues to N/A in table$schema$descriptor. Schema descriptor could be changed in-place but all changes should be commited by table$schema$commit():

table$schema$descriptor['missingValues'] = 'N/A'
table$schema$commit()
## [1] TRUE
table$schema$valid # false
## [1] FALSE
table$schema$errors
## [[1]]
## [1] "Descriptor validation error:\n data.missingValues - is the wrong type"

As a good citiziens we’ve decided to check out schema descriptor validity. And it’s not valid! We sould use an list for missingValues property. Also don’t forget to have an empty string as a missing value:

table$schema$descriptor[['missingValues']] = list("", 'N/A')
table$schema$commit()
## [1] TRUE
table$schema$valid # true
## [1] TRUE

All good. It looks like we’re ready to read our data again:

table$read() # or
toJSON(table$read(), pretty = TRUE) # function from jsonlite package

Now we see that:

  • locations are lists with numeric lattide and longitude

  • Rome’s location is null

And because there are no errors on data reading we could be sure that our data is valid againt our schema. Let’s save it:

table$schema$save('schema.json')
table$save('data.csv')

Our data.csv looks the same because it has been stringified back to csv format. But now we have schema.json:

{
"fields": [
{
"name": "city",
"type": "string",
"format": "default"
},
{
"name": "location",
"type": "geopoint",
"format": "default"
}
],
"missingValues": [
"",
"N/A"
]
}

If we decide to improve it even more we could update the schema file and then open it again. But now providing a schema path.

def = Table.load('inst/extdata/data.csv', schema = 'inst/extdata/schema.json')
table = value(def)
table
## <Table>
## Public:
## clone: function (deep = FALSE)
## headers: active binding
## infer: function (limit = 100)
## initialize: function (src, schema = NULL, strict = FALSE, headers = 1)
## iter: function (keyed, extended, cast = TRUE, relations = FALSE, stream = FALSE)
## read: function (keyed = FALSE, extended = FALSE, cast = TRUE, relations = FALSE,
## save: function (connection)
## schema: active binding
## Private:
## createRowStream_: function (src)
## createUniqueFieldsCache: function (schema)
## currentStream_: NULL
## headers_: NULL
## headersRow_: 1
## rowNumber_: 0
## schema_: Schema, R6
## src: inst/extdata/data.csv
## strict_: FALSE
## uniqueFieldsCache_: list

It was only basic introduction to the Table class. To learn more let’s take a look on Table class API reference.

Working with Schema#

A model of a schema with helpful methods for working with the schema and supported data. Schema instances can be initialized with a schema source as a url to a JSON file or a JSON object. The schema is initially validated (see validate below). By default validation errors will be stored in schema$errors but in a strict mode it will be instantly raised.

Let’s create a blank schema. It’s not valid because descriptor$fields property is required by the Table Schema specification:

def = Schema.load({})
schema = value(def)
schema$valid # false
## [1] FALSE
schema$errors
## [[1]]
## [1] "Descriptor validation error:\n data.fields - is required"

To do not create a schema descriptor by hands we will use a schema$infer method to infer the descriptor from given data:

toJSON(
schema$infer(helpers.from.json.to.list('[
["id", "age", "name"],
["1","39","Paul"],
["2","23","Jimmy"],
["3","36","Jane"],
["4","28","Judy"]
]')), pretty = TRUE) # function from jsonlite package
## {
## "fields": [
## {
## "name": ["id"],
## "type": ["integer"]
## },
## {
## "name": ["age"],
## "type": ["integer"]
## },
## {
## "name": ["name"],
## "type": ["string"]
## }
## ]
## }
schema$valid # true
## [1] TRUE
toJSON(
schema$descriptor,
pretty = TRUE) # function from jsonlite package
## {
## "fields": [
## {
## "name": ["id"],
## "type": ["integer"],
## "format": ["default"]
## },
## {
## "name": ["age"],
## "type": ["integer"],
## "format": ["default"]
## },
## {
## "name": ["name"],
## "type": ["string"],
## "format": ["default"]
## }
## ],
## "missingValues": [
## [""]
## ]
## }

Now we have an inferred schema and it’s valid. We could cast data row against our schema. We provide a string input by an output will be cast correspondingly:

toJSON(
schema$castRow(helpers.from.json.to.list('["5", "66", "Sam"]')),
pretty = TRUE, auto_unbox = TRUE) # function from jsonlite package
## [
## 5,
## 66,
## "Sam"
## ]

But if we try provide some missing value to age field cast will fail because for now only one possible missing value is an empty string. Let’s update our schema:

schema$castRow(helpers.from.json.to.list('["6", "N/A", "Walt"]'))
## Error in schema$castRow(helpers.from.json.to.list("[\"6\", \"N/A\", \"Walt\"]")): There are 1 cast errors (see following - Wrong type for header: age and value: N/A
# Cast error
schema$descriptor$missingValues = list('', 'NA')
schema$commit()
## [1] TRUE
schema$castRow(helpers.from.json.to.list('["6", "", "Walt"]'))
## [[1]]
## [1] 6
##
## [[2]]
## NULL
##
## [[3]]
## [1] "Walt"

We could save the schema to a local file. And we could continue the work in any time just loading it from the local file:

schema$save('schema.json')
schema = Schema.load('schema.json')

It was only basic introduction to the Schema class. To learn more let’s take a look on Schema class API reference.

Working with Field#

Class represents field in the schema.

Data values can be cast to native R types. Casting a value will check the value is of the expected type, is in the correct format, and complies with any constraints imposed by a schema.

{
"name": "birthday",
"type": "date",
"format": "default",
"constraints": {
"required": true,
"minimum": "2015-05-30"
}
}

Following code will not raise the exception, despite the fact our date is less than minimum constraints in the field, because we do not check constraints of the field descriptor

field = Field$new(helpers.from.json.to.list('{"name": "name", "type": "number"}'))
dateType = field$cast_value('12345') # cast
dateType # print the result
## [1] 12345

And following example will raise exception, because we set flag ‘skip constraints’ to false, and our date is less than allowed by minimum constraints of the field. Exception will be raised as well in situation of trying to cast non-date format values, or empty values

tryCatch(
dateType = field$cast_value(value = '2014-05-29', constraints = FALSE),
error = function(e){# uh oh, something went wrong
})
## Error in private$castValue(...): Field character(0) can't cast value 2014-05-29 for type number with format default

Values that can’t be cast will raise an Error exception. Casting a value that doesn’t meet the constraints will raise an Error exception.

Table below shows the available types, formats and resultant value of the cast:

TypeFormatsCasting result
anydefaultAny
listdefault
booleandefaultBoolean
datedefault, anyDate
datetimedefault, anyDate
durationdefaultDuration
geojsondefault, topojsonObject
geopointdefault, list, object[Number, Number]
integerdefaultNumber
numberdefaultNumber
objectdefaultObject
stringdefault, uri, email, binaryString
timedefault, anyDate
yeardefaultNumber
yearmonthdefault[Number, Number]

Working with Validate#

validate() validates whether a schema is a validate Table Schema accordingly to the specifications. It does not validate data against a schema.

Given a schema descriptor validate returns a validation object:

valid_errors = validate('inst/extdata/schema.json')
valid_errors
## $valid
## [1] TRUE
##
## $errors
## list()

Working with Infer#

Given data source and headers infer will return a Table Schema as a JSON object based on the data values.

Given the data file, example.csv:

id,age,name
1,39,Paul
2,23,Jimmy
3,36,Jane
4,28,Judy

Call infer with headers and values from the datafile:

descriptor = infer('inst/extdata/data_infer.csv')

The descriptor variable is now a list object that can easily converted to JSON:

toJSON(
descriptor,
pretty = TRUE
) # function from jsonlite package
## {
## "fields": [
## {
## "name": ["id"],
## "type": ["integer"],
## "format": ["default"]
## },
## {
## "name": ["age"],
## "type": ["integer"],
## "format": ["default"]
## },
## {
## "name": ["name"],
## "type": ["string"],
## "format": ["default"]
## }
## ],
## "missingValues": [
## [""]
## ]
## }

API Reference#

Table#

Table representation

table$headers ⇒ List.\<string\>#

Headers

Returns: List.\<string\> - data source headers

table$schema ⇒ Schema#

Schema

Returns: Schema - table schema instance

table$iter(keyed, extended, cast, forceCast, relations, stream) ⇒ AsyncIterator | Stream#

Iterate through the table data

And emits rows cast based on table schema (async for loop). With a stream flag instead of async iterator a Node stream will be returned. Data casting can be disabled.

Returns: AsyncIterator | Stream - async iterator/stream of rows: - [value1, value2] - base - {header1: value1, header2: value2} - keyed - [rowNumber, [header1, header2], [value1, value2]] - extended Throws:

  • TableSchemaError raises any error occurred in this process
ParamTypeDescription
keyedbooleaniter keyed rows
extendedbooleaniter extended rows
castbooleandisable data casting if false
forceCastbooleaninstead of raising on the first row with cast error return an error object to replace failed row. It will allow to iterate over the whole data file even if it’s not compliant to the schema. Example of output stream: [['val1', 'val2'], TableSchemaError, ['val3', 'val4'], ...]
relationsObjectobject of foreign key references in a form of {resource1: [{field1: value1, field2: value2}, ...], ...}. If provided foreign key fields will checked and resolved to its references
streambooleanreturn Node Readable Stream of table rows

table$read(limit) ⇒ List.\<List\> | List.\<Object\>#

Read the table data into memory

The API is the same as table.iter has except for:

Returns: List.\<List\> | List.\<Object\> - list of rows: - [value1, value2] - base - {header1: value1, header2: value2} - keyed - [rowNumber, [header1, header2], [value1, value2]] - extended

ParamTypeDescription
limitintegerlimit of rows to read

table$infer(limit) ⇒ Object#

Infer a schema for the table.

It will infer and set Table Schema to table.schema based on table data.

Returns: Object - Table Schema descriptor

ParamTypeDescription
limitnumberlimit rows sample size

table$save(target) ⇒ Boolean#

Save data source to file locally in CSV format with , (comma) delimiter

Returns: Boolean - true on success Throws:

  • TableSchemaError an error if there is saving problem
ParamTypeDescription
targetstringpath where to save a table data

Table.load(source, schema, strict, headers, parserOptions) ⇒ Table#

Factory method to instantiate Table class.

This method is async and it should be used with await keyword or as a Promise. If references argument is provided foreign keys will be checked on any reading operation.

Returns: Table - data table class instance Throws:

  • TableSchemaError raises any error occurred in table creation process
ParamTypeDescription
sourcestringList.\<List\>
schemastringObject
strictbooleanstrictness option to pass to Schema constructor
headersnumberList.\<string\>
parserOptionsObjectoptions to be used by CSV parser. All options listed at https://csv.js.org/parse/options/. By default ltrim is true according to the CSV Dialect spec.

Schema#

Schema representation

schema$valid ⇒ Boolean#

Validation status

It always true in strict mode.

Returns: Boolean - returns validation status

schema$errors ⇒ List.\<Error\>#

Validation errors

It always empty in strict mode.

Returns: List.\<Error\> - returns validation errors

schema$descriptor ⇒ Object#

Descriptor

Returns: Object - schema descriptor

schema$primaryKey ⇒ List.\<string\>#

Primary Key

Returns: List.\<string\> - schema primary key

schema$foreignKeys ⇒ List.\<Object\>#

Foreign Keys

Returns: List.\<Object\> - schema foreign keys

schema$fields ⇒ List.\<Field\>#

Fields

Returns: List.\<Field\> - schema fields

schema$fieldNames ⇒ List.\<string\>#

Field names

Returns: List.\<string\> - schema field names

schema$getField(fieldName) ⇒ Field | null#

Return a field

Returns: Field | null - field instance if exists

ParamType
fieldNamestring

schema$addField(descriptor) ⇒ Field#

Add a field

Returns: Field - added field instance

ParamType
descriptorObject

schema$removeField(name) ⇒ Field | null#

Remove a field

Returns: Field | null - removed field instance if exists

ParamType
namestring

schema$castRow(row, failFalst) ⇒ List.\<List\>#

Cast row based on field types and formats.

Returns: List.\<List\> - cast data row

ParamTypeDescription
rowList.\<List\>data row as an list of values
failFalstboolean

schema$infer(rows, headers) ⇒ Object#

Infer and set schema.descriptor based on data sample.

Returns: Object - Table Schema descriptor

ParamTypeDescription
rowsList.\<List\>list of lists representing rows
headersintegerList.\<string\>

schema$commit(strict) ⇒ Boolean#

Update schema instance if there are in-place changes in the descriptor.

Returns: Boolean - returns true on success and false if not modified Throws:

  • TableSchemaError raises any error occurred in the process
ParamTypeDescription
strictbooleanalter strict mode for further work

Example

descriptor <- '{"fields": [{"name": "field", "type": "string"}]}'
def <- Schema.load(descriptor)
schema <- value(def)
schema$getField('name')
## NULL
schema$descriptor$fields[[1]]$type
## [1] "string"
schema$descriptor$fields[[1]]$type <-'number'
schema$descriptor$fields[[1]]$type
## [1] "number"
schema$commit()
## [1] TRUE

schema$save(target) ⇒ boolean#

Save schema descriptor to target destination.

Returns: boolean - returns true on success Throws:

  • TableSchemaError raises any error occurred in the process
ParamTypeDescription
targetstringpath where to save a descriptor

Schema.load(descriptor, strict) ⇒ Schema#

Factory method to instantiate Schema class.

This method is async and it should be used with await keyword or as a Promise.

Returns: Schema - returns schema class instance Throws:

  • TableSchemaError raises any error occurred in the process
ParamTypeDescription
descriptorstringObject
strictbooleanflag to alter validation behaviour: - if false error will not be raised and all error will be collected in schema.errors - if strict is true any validation error will be raised immediately

Field#

Field representation

new Field(descriptor, missingValues)#

Constructor to instantiate Field class.

Returns: Field - returns field class instance Throws:

  • TableSchemaError raises any error occured in the process
ParamTypeDescription
descriptorObjectschema field descriptor
missingValuesList.\<string\>an list with string representing missing values

field$name ⇒ string#

Field name

field$type ⇒ string#

Field type

field$format ⇒ string#

Field format

field$required ⇒ boolean#

Return true if field is required

field$constraints ⇒ Object#

Field constraints

field$descriptor ⇒ Object#

Field descriptor

field$castValue(value, constraints) ⇒ any#

Cast value

Returns: any - cast value

ParamTypeDescription
valueanyvalue to cast
constraintsObjectfalse

field$testValue(value, constraints) ⇒ boolean#

Check if value can be cast

ParamTypeDescription
valueanyvalue to test
constraintsObjectfalse

validate(descriptor) ⇒ Object#

This function is async so it has to be used with await keyword or as a Promise.

Returns: Object - returns {valid, errors} object

ParamTypeDescription
descriptorstringObject

infer(source, headers, options) ⇒ Object#

This function is async so it has to be used with await keyword or as a Promise.

Returns: Object - returns schema descriptor Throws:

  • TableSchemaError raises any error occured in the process
ParamTypeDescription
sourcestringList.\<List\>
headersList.\<string\>list of headers
optionsObjectany Table.load options

Contributing#

The project follows the Open Knowledge International coding standards. There are common commands to work with the project.Recommended way to get started is to create, activate and load the library environment. To install package and development dependencies into active environment:

devtools::install_github("frictionlessdata/tableschema-r", dependencies = TRUE)

To make test:

test_that(description, {
expect_equal(test, expected result)
})

To run tests:

devtools::test()

More detailed information about how to create and run tests you can find in testthat package.

Changelog - News#

In NEWS.md described only breaking and the most important changes. The full changelog could be found in nicely formatted commit history.

Last updated on by roll