A vision for Comma Chameleon / Data Curator


#1

I’ve got a vision for Comma Chameleon and I’d love to hear your thoughts.

Comma Chameleon - a desktop CSV editor to describe, validate and share usable data

With Comma Chameleon open data producers can:

  • create new tabular data from scratch or from a template
  • open data from a CSV or Microsoft Excel file
  • open multiple related data tables from a Data Package
  • edit data and automatically correct common problems

Using data from any of these sources, you can:

  • automatically create a schema that describes the data
  • refine the schema to include extra data validation rules
  • describe the provenance of your data
  • save data as a valid CSV file in various CSV dialects

The schema enables you to:

  • validate the whole table at once
  • validate a column at a time
  • validate data as you type

Once the data is described and validated, you can share the data using Comma Chameleon to:

  • export a Data Package ready for publishing on your open data portal
  • export a Data Package to be used as a template for others to make similar data
  • publish the data to GitHub and generate a website via the ODI Octopub tool

Open data consumers can use published data packages to:

  • view the data structure (the schema) to help determine if the data is fit for their purpose
  • download the data together with its metadata in a single file
  • use a suite of tools to work with the data

Comma Chameleon is free and runs on Windows, macOS and Linux desktops. Updates are released on comma-chameleon.io and are also available in the Apple App Store and Microsoft Windows Store.

As an open data producer or consumer, would you find these changes to Comma Chameleon useful? Are there other features you would include?

Here’s the back story…


#2

I have a question about the vision for Comma Chameleon redux: is it a CSV creation tool with the added benefit of schema creation, or is the tool now envisioned as requiring the user to have a schema, or to create one, as part of their workflow?

The current Comma Chameleon is create and validate, with optional data description. This is consistent with the Toolbox approach of simple onboarding tools or apps - a least-work, most-benefit outlook. In this approach schemas are a value add rather than the thrust of the relevant ODI Toolbox tool.

I get the sense from the various documents that Comma Chameleon Redux is far more about create and describe, with the expectation that the user will always end up with a schema accompanying their data files, whether through inference or their manual effort. Am I reading that intent correctly?


#3

That’s correct. When you open a CSV or XLS, the schema is inferred. You can validate against the inferred types or improve the schema by refining types or adding constraints.
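To make that concrete, here’s a rough sketch of infer-then-validate using the Python frictionless library. This is purely an illustration (the library choice and the data.csv path are assumptions, not necessarily how Comma Chameleon works internally):

```python
# Illustrative sketch only: infer a schema from a CSV, validate the table,
# and show what a refined schema with constraints looks like.
# Assumes the "frictionless" Python library; "data.csv" is a placeholder path.
from frictionless import describe, validate

resource = describe("data.csv")   # infer "column properties" from the data
print(resource.schema)            # inferred field names and types

report = validate("data.csv")     # structural checks plus inferred-type checks
print(report.valid)               # True if every row passes

# Refining the schema means adding Table Schema constraints, for example:
refined = {
    "fields": [
        {"name": "id", "type": "integer",
         "constraints": {"required": True, "unique": True}},
        {"name": "name", "type": "string",
         "constraints": {"required": True}},
    ]
}
# A descriptor like this can be saved next to the data and used in place of
# the inferred schema for stricter validation.
```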

If you create data from scratch in Comma Chameleon, you can use the Guess Column Properties function to infer the schema.

We could drop the “infer schema on open” functionality to allow people to do a “schema-less” validation. Infer on open was done as a convenience. (Infer on open could be a setting in preferences.)

A workflow could be:

  • open csv or xls (no schema inferred)
  • validate table - checks for ragged rows, empty rows and other table structure issues (see the sketch just after this list)
  • save as valid csv
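
For anyone wondering what those structure-only checks amount to, here’s a rough, hypothetical sketch using nothing but Python’s standard csv module (this is not Comma Chameleon’s actual code, just an illustration of ragged-row and empty-row detection):

```python
# Hypothetical structure-only check: flag ragged rows (wrong column count)
# and completely empty rows, without inferring any column types.
import csv

def check_structure(path):
    problems = []
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    if not rows:
        return ["file is empty"]
    expected = len(rows[0])  # the header row sets the expected column count
    for i, row in enumerate(rows[1:], start=2):
        if all(cell.strip() == "" for cell in row):
            problems.append(f"row {i}: empty row")
        elif len(row) != expected:
            problems.append(f"row {i}: expected {expected} columns, got {len(row)}")
    return problems

print(check_structure("data.csv"))  # "data.csv" is a placeholder path
```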

The value of Comma Chameleon is being extended to describing the data. I did originally write the spec as a standalone schema builder but it seemed silly for people to go in and out of different apps for their publishing process.

My drive is for better quality open data, and providing schemas and provenance information is part of that (as recommended by the W3C and others).

Hopefully the user interface and the language used keep things simple, e.g. “Column properties” rather than “schema”.

What does everyone think - separate the infer schema from the open function or not?


#4

I’m still getting up to speed, but whatever fits within the Frictionless Data standard is OK with me, and adhering to the scope is my only concern here. Once the scope is finalised, anything like that can be separated out easily enough post-project if there isn’t already room for this use case. I also don’t yet understand the use case where an end user really cares whether there is an accompanying data package; adding one can be done at any time, whether now or later.


#5

I think the point being made above is that if the schema is inferred automatically, the user can’t just test the structure of the data - they’ll also get results for the data type checks applied to each column. If you don’t want the data type checks, you’ll need to remove the inferred types from each column property.

So, “Infer the schema on open” by default. Provide an option in preferences to turn this off.
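
As a sketch of how that preference might behave (the setting name and the code are hypothetical, not an actual Comma Chameleon preference; the frictionless library just stands in for “infer the schema”):

```python
# Hypothetical preference controlling whether types are inferred on open.
from frictionless import describe

preferences = {"infer_schema_on_open": True}

def open_table(path):
    """Open a table; infer a schema only when the preference is switched on."""
    if preferences["infer_schema_on_open"]:
        return path, describe(path).schema   # data type checks will apply
    return path, None                        # structure-only checks

table, schema = open_table("data.csv")       # "data.csv" is a placeholder path
```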


#6

Hi Steve, I agree with the option of adding auto-schema to the preferences tab. By doing this, we give users the freedom to make their own choices. I have noticed (primarily in the private sector; a personal observation) that developers may not adhere to schema standards, so if the schema is always created automatically, I can see it being of less use to the wider audience.


#7

Drag and drop CSV/Excel files onto the window for validation.


#8

Comma Chameleon will create a valid datapackage.json file which includes a table schema. This will solve the problem of hand-crafting a table schema, which can be hard (especially since some tools don’t currently implement v1 of the Frictionless Data spec).
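
For anyone who hasn’t hand-crafted one before, a minimal datapackage.json along the lines of the v1 Frictionless Data spec looks roughly like this (the names and paths below are placeholders, not actual Comma Chameleon output):

```python
# Build and save a minimal datapackage.json descriptor with a table schema.
# The structure follows the Frictionless Data spec; names/paths are placeholders.
import json

descriptor = {
    "name": "example-package",
    "resources": [
        {
            "name": "example-data",
            "path": "data.csv",
            "profile": "tabular-data-resource",
            "schema": {
                "fields": [
                    {"name": "id", "type": "integer"},
                    {"name": "name", "type": "string"},
                ]
            },
        }
    ],
}

with open("datapackage.json", "w") as f:
    json.dump(descriptor, f, indent=2)
```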

If you try to open an invalid data package using Comma Chameleon, an error will occur.

Opening via drag and drop is on the backlog. We’ll try to include it if we have time.


#11

From Comma Chameleon to Data Curator

ODI HQ explains how the ODI Australian Network has built on Comma Chameleon’s code to enhance the application while working in the spirit of open source software development.