How to create a Table Schema (or why you should use Data Curator)


#1

In this post I’ll describe how to create a Table Schema by hand. Doing this a few times motivated me to make Data Curator. Hopefully after reading this you’ll understand what happens behind the scenes in Data Curator when you describe the Column Properties.

The basics

Comma Separated Value (CSV) files are designed to hold tabular data.

A Table Schema can be used to describe the structure of tabular data.

Table Schemas help people describe and validate data. When a Table Schema and the data are shared together, people who are seeing the data for the first time, can better understand what the data in each column represents.

Name, Title and Description

The schema can provide a name, title and description for each column. This is especially important if the CSV file doesn’t include the column names in the first row.

You can handcraft a schema. Let’s make one for this data

id name contact
1 Fred Bloggs FBloggs@example.com
2 Nancy Drew Nanna@email.com

A Table Schema is written in JavaScript Object Notation, or JSON for short. Sounds scary but it’s just a bunch of brackets wrapped around a set of properties and assigned values.

In JSON “name”: “id” would assign the property “name” with the value “id”.

So to give the first column a name, title and description, we’d write:

“name”: “id”,
“title”: “Identifier”,
“description”: “A unique customer identifier”

Types and Formats

All the values in a column should be the same type of data, e.g. a string of text, an integer, a date. You can use the following types to describe your data:

Some types of data can be in a special format.

For example a column containing:

  • email addresses is “type”: “string” with the "format": "email".
  • dates in the form 25/12/2017 are “type”: “date” with the “format”: “%d/%m/%Y”

Being more precise about data types and formats improves data validation, understandability and importantly, allows software to work efficiently with the data.

To give the first column in the table above a type we could add:

“type”: “integer”

Integers don’t have any special formats to more precisely describe the data (unlike the string and email example above). If you wanted to be complete you could write the following, but it’s unnecessary:

“type”: “integer”,
“format”: “default”

Each description of a column (a field) is contained in curly brackets { }. So the first column would be described:

{
“name”: “id”,
“title”: “Identifier”,
“description”: “A unique customer identifier”,
“type”: “integer”
}

To describe all the fields in a table, we give them a collective name, “fields”: and enclose all the field descriptions in square brackets [ ]. Then we wrap the whole thing in curly brackets.

So let’s describe the whole table:

{
“fields”: [
               {
                 “name”: “id”,
                 “title”: “Identifier”,
                 “description”: “A unique customer identifier”,
                 “type”: “integer”
               },
	           {
                 “name”: “name”,
                 “title”: “Name”,
                 “description”: “Customer name”,
                 “type”: “string”
               },
               {
                 “name”: “contact”,
                 “title”: “Contact email”,
                 “description”: “The electronic mail address for the customer”,
                 “type”: “string”,
                 “format”: “email”
               }
          ]
}

Constraints

Constraints allow you to be even more precise when describing your data and further improve data validation. With constraints, you can state that a value:

  • is required in every row of a column
  • must be unique in every row of a column
  • has a minimum length and/or maximum length
  • has a minimum and/or maximum value
  • follows a specific pattern (e.g. an Australian mobile phone number starts with “04” followed by 8 digits)
  • is one of a set of listed, or enumerated, values (e.g. “Club”, “Diamond”, “Heart” or “Spade”)

So let’s add some constraints. In the data above, the customer id is both unique, required and must be a minimum value of 1. You would describe this as:

{
“name”: “id”,
“title”: “Identifier”,
“description”: “A unique customer identifier”,
“type”: “integer”,
“constraint”: {
	        	“unique”: true,
	        	“required”: true,
		        “minimum”: 1
              }
}

Let’s use some different data to illustrate the use of enum and pattern.

game player card suite value
1 Amanda Club 2
1 Amanda Spade A
1 Jason Heart 2
1 Jason Diamond Q

To test that the column “card suite” only includes “Heart”, “Diamond”, “Spade” or “Club”, you’d write:

{
“name”: “card suite”,
“title”: “Card Suite”,
“type”: “string”,
“constraint”: {
		      “enum”: [“Heart”, “Diamond”, “Spade”, “Club”] 
              }
}

I’m not sure of any limit to how many values you can include inside an enum constraint but if you have lots, you can use another trick, Primary and Foreign Keys (but that’s another topic).

To test that the column “card value” only includes numbers from 2 to 10 or the letters “J”, “Q”, “K” or “A”, you’d write:

{
“name”: “card value”,
“title”: “Card Value”,
“Description”: “Card values from 2 to 10 and J, Q, K and A
“type”: “string”,
“constraint”: {
                “pattern”: “[JQKA]|[2-9]|10”,
                “required”: true
              }
}

What is “/[2-9,J,Q,K,A]/g”? It’s a regular expression - a powerful language to describe search patterns. The computer reads the expression and searches for characters in the data that match. Learning regular expressions (regex) can be tricky. Use online resources, such as Regex.com to learn and test your regular expressions.

View a more complicated regular expression for Australian Postcodes.

Missing Values

There’s no data in a table cell! What does that missing value mean? Is the value is unknown, missing or not applicable? If you know the answer, this should be documented in the column description or elsewhere in the data package.

Sometimes unknown, missing or not applicable values are represented by a special value e.g. -9999, N/A, or - . You can describe what values should be treated as null by using the missingValues property:

“missingValues”: [“-9999”, “N/A”, “-”] 

Missing values sits at same level as other column properties, e.g.

{
“fields”: [
              {
                “name”: “id”,
                “type”: “integer”
                “constraint”: {
                                “required”: true
                              },
                “missingValues”: [“-9999”]
              }
         ]
}

Next steps

Now that you know how to handcraft a Table Schema, you can understand why we created Data Curator. It provides a simple user interface so you can describe each column and a schema is generated automatically for you behind the scenes :hugs: