CSV Files

CSV files are one of the most common ways to exchange data. Loading and writing CSV files is therefore highly customizable in GreyCat, to support the many possible format combinations.

CsvFormat

Reading and writing CSV files relies on the format of the file to read or write. The CsvFormat object describes the internal format of the file through various attributes.

Attribute Description
header_lines: int? specifies how many lines at the top of the file are header lines, and are therefore ignored when reading the content
separator: char? specifies the character used to separate the fields/columns within a line. Usually , (default) or ;
decimal_separator: char? specifies the character used to separate the integer and decimal parts of numbers (defaults to .)
thousands_separator: char? defines the character used to separate thousands in big numbers, if any
string_delimiter: char? defines the character used to delimit strings. This makes it possible to ignore separators that appear inside strings
columns_size: int? specifies the number of columns expected in the CSV, to improve performance if known in advance
columns: Array<CsvColumn>? defines the types of some or all columns of the CSV, if known in advance. This is particularly practical for dates and times, for instance, since you may have to define the format used to parse the value

For the definition of columns, you can use two approaches, but they should ideally be exclusive to avoid unexpected interactions.

  1. Provide the full array of CsvColumn definitions, in the order of the file. If you don’t want to specify a type, you have to set the position to null (as illustrated in the 1st example below).
  2. Provide an array of only the columns you want to specify and indicate the offset of the column in its definition, as presented in the 2nd example.

As an alternative to manually specifying the CsvFormat columns, you can use the CsvAnalysis type to explore a CSV file, and then call the infer() method of CsvFormat to create the CsvFormat. The documentation on CsvAnalysis describes this procedure.

var format_a = CsvFormat {
  header_lines: 2,              //2 first lines are headers to be ignored
  separator: ',',
  decimal_separator: '.',
  columns: [
    CsvColumnString{name: "id"},
    CsvColumnDate{name: "start", format:"%Y-%m-%d"},
    CsvColumnDate{name: "end", format:"%Y-%m-%d"},
    null,
    CsvColumnFloat{name: "value"},
    CsvColumnIgnored{},
  ],
};

var format_b = CsvFormat {
  separator: ';',
  string_delimiter: '"',
  decimal_separator: ',',
  thousands_separator: '_',
  columns: [
    CsvColumnString{offset: 2},
    CsvColumnDate{offset: 1, format: "%Y-%m-%d"},
    CsvColumnBoolean{offset: 5},
  ],
};
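
The format_b example above combines a decimal_separator of ',' with a thousands_separator of '_'. As a rough illustration of what these two attributes mean for a numeric cell, here is a minimal Python sketch. It is not GreyCat's parser: the function name parse_number and its normalization steps are assumptions made for this example only.

```python
def parse_number(cell, decimal_separator=".", thousands_separator=None):
    """Normalize a serialized number, then let Python parse it (illustrative only)."""
    text = cell.strip()
    # Drop the grouping character first, so it cannot be confused with the decimal mark.
    if thousands_separator:
        text = text.replace(thousands_separator, "")
    # Map the file's decimal mark to the '.' that float() expects.
    if decimal_separator != ".":
        text = text.replace(decimal_separator, ".")
    return float(text)


# A cell written with the separators of format_b.
value = parse_number("1_234,56", decimal_separator=",", thousands_separator="_")
# A cell written with the default decimal separator and no grouping.
plain = parse_number("42.5")
```

Removing the thousands separator before converting the decimal separator matters when a file uses '.' for thousands and ',' for decimals.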

CsvColumns

CsvColumns classes help to precisely define the type of data per column. All types share some common attributes.
name: String? defines a name for the column. It can be useful for debugging but is not used for parsing (it is totally independent of the column header in the file).
mandatory: bool? indicates to the parser whether a field is mandatory. False by default; when true, an exception is raised during the parsing of a line if the value of the field is null.
offset: int? defines the position of this column definition, counted from the leftmost column and starting at 0.

CsvColumn is an abstract type and cannot be instantiated as is. You have to use one of the concrete types of columns.
Some of the column types have no specific attributes in addition to the generic ones. This is the case for CsvColumnInteger, CsvColumnFloat and CsvColumnBoolean.

var columns = [
  CsvColumnInteger{name: "nb_stuff", offset: 2},
  CsvColumnFloat{},
  CsvColumnBoolean{name:"valid", mandatory: true},
  CsvColumnIgnored{name: "alias"},
];

CsvColumnTime and CsvColumnDuration

Although parsing to different types, CsvColumnTime and CsvColumnDuration share the same API.
CsvColumnTime is used to parse an integer value representing a date/time (from 1900-01-01). If your date is serialized as a string, use CsvColumnDate.
CsvColumnDuration is used to parse a duration (of an event, for instance).
In both cases, the only parameter to specify is the unit in which the time, or duration, is written in the file.

var columns = [
  CsvColumnTime{name: "timestamp", offset: 0, unit: DurationUnit::milliseconds},
  CsvColumnDuration{name: "stop_duration", unit: DurationUnit::seconds},
];

CsvColumnDate

CsvColumnDate specifies how to parse columns containing dates serialized in a string format (such as ISO 8601). If your date is a timestamp (serialized as an integer value), use CsvColumnTime.

format: String? specifies the expected format of the string, so the parser can identify the various fields (year, month, hour, etc.). The format follows the libc time formats. The description is available here: https://www.gnu.org/software/libc/manual/html_node/Formatting-Calendar-Time.html. The default format is the ISO 8601 format: “%Y-%m-%dT%H:%M:%SZ”, equivalent to “%FT%TZ”.
tz: TimeZone? defines the time zone of the date/time when it is not specified in the string itself. If neither is specified, the value is considered UTC.

type CsvColumnDate extends CsvColumn {
  format: String?;
  tz: TimeZone?;
  as_time: bool?;
}
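
To get a feel for the conversion specifiers used by the format attribute, the following Python sketch parses two strings with the same specifiers. It is only an illustration: Python's strptime understands these specifiers in the same way, but it is not the parser GreyCat uses, and the helper parse_utc is a name invented for this example. Treating the result as UTC mirrors the default described above when no time zone is given.

```python
from datetime import datetime, timezone

# The default format quoted above, and a date-only variant.
ISO_FORMAT = "%Y-%m-%dT%H:%M:%SZ"
DATE_ONLY = "%Y-%m-%d"


def parse_utc(text, fmt):
    """Parse text with strptime-style specifiers and mark the result as UTC."""
    return datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)


stamp = parse_utc("2024-03-15T08:30:00Z", ISO_FORMAT)
day = parse_utc("2024-03-15", DATE_ONLY)
```

Fields that the format does not mention, such as the hour in the date-only variant, are left at zero.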

CsvColumnString

CsvColumnString gives finer control over string columns. You can specify the following additional attributes:
trim: bool? when true, white space at the start and end of the string is removed automatically, keeping only the core data. False by default.
try_number: bool? when true, an attempt is made to parse the string content as a number. False by default.
try_json: bool? when true, an attempt is made to parse the string content as a (JSON) object. False by default.
values: Array? specifies a list of possible values for this string column. This improves performance by not allocating a new string each time, relying instead on the representation of these values in the compiled GreyCat program. Values from the dataset that are not present in the list are still parsed and returned.
encoder: TextEncoder? specifies the encoding of the string. Possible values are:

enum TextEncoder {
  plain;
  base64;
  base64url;
  hexadecimal;
}
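
As an illustration of how the four TextEncoder variants differ, the Python sketch below encodes the same short text in each form using Python's standard library. This is an assumption-laden example: this page does not state the exact alphabet or padding rules GreyCat applies, and the helper encode_all is hypothetical.

```python
import base64


def encode_all(text):
    """Return the same text under four encodings (illustrative only)."""
    raw = text.encode("utf-8")
    return {
        "plain": text,
        "base64": base64.b64encode(raw).decode("ascii"),
        # The URL-safe alphabet replaces '+' with '-' and '/' with '_'.
        "base64url": base64.urlsafe_b64encode(raw).decode("ascii"),
        "hexadecimal": raw.hex(),
    }


encoded = encode_all("~~~")
# Decoding each form gives back the original bytes.
decoded = base64.urlsafe_b64decode(encoded["base64url"]).decode("utf-8")
```

The text "~~~" is chosen because its base64 form contains a '+', which makes the difference between base64 and base64url visible.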

CsvReader

CSV files can be processed (read) using the CsvReader of GreyCat. The CsvReader constructor takes the path to the file, along with a CsvFormat specifying how the content is formatted.

Reading a CSV file with common values for the various delimiters is as simple as:

fn run() {
  var format = CsvFormat {}; // Default format
  var reader = CsvReader::new("./data/myFile.csv", format);
  if(reader != null) {
    while(reader.available() > 0) {
      // Do read
    }
  }
}

A file that uses other delimiters only requires a different CsvFormat:

fn run() {
  var format = CsvFormat {
    separator: ';',
    string_delimiter: '\'',
    decimal_separator: ',',
  };
  var reader = CsvReader::new("./data/myFile.csv", format);
  if(reader != null) {
    while(reader.available() > 0) {
      //Do read
    }
  }
}

CsvReader offers two ways to read a file:

  1. read(): Array? reads one line. The line is parsed according to the specified CsvFormat. Each cell is parsed according to its column definition if one is given; otherwise its type is inferred, or it is returned as a string if inference is deactivated.

while(reader.available() > 0) {
  // Read one line
  var line = reader.read();
  // Process
  var name: String = line[0] as String;
  var age: int = line[3] as int;
}

With this method, typing is weak, because the method returns an Array<any?>. It is also somewhat costly in memory, because an array is allocated for each line. Finally, every index has to be updated if the format evolves.

  2. read_to(target: any) provides an alternative to reading into an array, by reading the line and pushing the values directly into a template object.

type CsvLineTemplate {
  name: String?;
  age: int?;
}
[...]
var lineTemplate = CsvLineTemplate{};
while(reader.available() > 0) {
  // Read one line
  reader.read_to(lineTemplate);
  // Process
  var name: String = lineTemplate.name!!;
  var age: int = lineTemplate.age!!;
}

The fields (attributes) of the template object are filled in order, from top to bottom, with the cells of the line, from left to right. The object in the example is reused, which saves memory and increases speed, and the fields are directly typed.
If you have a type that represents the lines of your CSV as a record, you can also create a new template object for each read, fill it, and store it in the graph.
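
The positional mapping described above (fields top to bottom, cells left to right) can be pictured with the following Python sketch. It is a conceptual analogy, not GreyCat's implementation: the function fill_in_order is hypothetical, it performs no type conversion, and how GreyCat behaves when a line has more cells than the template has fields is not covered on this page.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class CsvLineTemplate:
    # Declaration order matters: first field gets the first cell, and so on.
    name: Optional[str] = None
    age: Optional[int] = None


def fill_in_order(target, cells):
    """Assign cells to the target's fields by position (illustrative only)."""
    for field, cell in zip(fields(target), cells):
        setattr(target, field.name, cell)
    return target


# One template object is reused for every line, as in the example above.
template = CsvLineTemplate()
seen = []
for cells in (["John", 56], ["Jane", 34]):
    fill_in_order(template, cells)
    seen.append((template.name, template.age))
```

Because the mapping is positional, reordering the fields of the template changes which cell each field receives.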

CsvFormat provides a generateType method, which will generate the GCL code that defines a type matching a format. See CsvAnalysis for more information.

In case of a parsing error, the exception indicates, when appropriate, the column index (the column offset) and the start of the offending line, in bytes from the beginning of the file. You can then use the Explorer to connect to your application or model (started with the greycat serve command) and navigate the file. This error position can also be used with operating system commands or your editor.
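
Outside of GreyCat, a byte position such as the one reported in the exception can be used to jump straight to the offending line. The Python sketch below shows one way to do this, assuming the position is a byte offset from the start of the file. The helper line_at and the sample data are made up for this example.

```python
import os
import tempfile


def line_at(path, byte_offset, max_bytes=200):
    """Return the line starting at byte_offset, truncated to max_bytes."""
    with open(path, "rb") as handle:
        handle.seek(byte_offset)
        raw = handle.readline()[:max_bytes]
    return raw.decode("utf-8", errors="replace").rstrip("\r\n")


# Build a small sample file where the third line holds a bad value.
sample = b"id;value\n1;10\n2;oops\n3;30\n"
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(sample)
    sample_path = tmp.name

# Stand-in for the byte position that a parsing error would report.
offset = sample.index(b"2;oops")
offending = line_at(sample_path, offset)
os.remove(sample_path)
```

Opening the file in binary mode keeps the offset meaningful even when the content contains multi-byte characters.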

CsvWriter

The CsvWriter works quite similarly to the CsvReader, expecting the path of the file you want to write (or append to), and a definition of the internal format of the CSV file you want to produce. You can then call the write(data: any?) function to push data to the file.

In the following example, we write a CSV file whose fields are separated with ‘;’ and whose strings are delimited with ‘"’. We also specify that the third column is of type time, and that the time must be serialized in milliseconds.

fn run() {
  var format = CsvFormat {
    separator: ';',
    string_delimiter: '"',  // optional, default value
    decimal_separator: ',',
    columns: [
      CsvColumnTime{
        offset: 2,
        unit: DurationUnit::milliseconds,
      }
    ]
  };
  var writer = CsvWriter::new("./data/myFile.csv", format);
  if(writer != null) {
    writer.write(["John", "Doe", time::now(), 56]);
    writer.write(["Jane", "Doe", time::now(), 34]);
  }
}

The string_delimiter attribute of CsvFormat works as follows: