
CSV Files

CSV files are one of the most common ways to exchange data. Loading and writing CSV files is therefore highly customizable in GreyCat, to support the many possible format combinations.


CsvFormat

Reading and writing CSV files both rely on a description of the file's format. The CsvFormat object describes the internal format of the file through the following attributes:

  • header_lines: int? specifies how many of the top lines of the file are header lines, which are ignored when reading the content
  • separator: char? specifies the character used to separate the fields/columns within a line, usually , (the default) or ;
  • decimal_separator: char? specifies the character used to separate the integer and decimal parts of numbers (defaults to .)
  • thousands_separator: char? defines the character used to separate thousands in large numbers, if any
  • string_delimiter: char? defines the character used to delimit strings, which lets the parser ignore separators that may appear inside strings
  • columns_size: int? specifies the number of columns expected in the CSV, which improves performance when known in advance
  • columns: Array<CsvColumn>? defines specific types for some or all columns of the CSV, if known in advance. This is particularly practical when dealing with dates or times, since you may have to define the format used to parse the value

For the definition of columns, you can use one of two approaches; they should not be mixed, to avoid unexpected interactions.

  1. Provide the full array of CsvColumn definitions, in the order of the file. If you do not want to specify a type for a column, set the corresponding entry to null (as illustrated in the first example below).
  2. Provide an array of only the columns you want to specify, and indicate the offset of each column in its definition, as presented in the second example.

As an alternative to manually specifying the CsvFormat columns, you can use the CsvAnalysis type to explore a CSV file and then call the infer() method of CsvFormat to create the format. The documentation on CsvAnalysis describes this procedure.

var format_a = CsvFormat {
  header_lines: 2,              // the first 2 lines are headers and are ignored
  separator: ',',
  decimal_separator: '.',
  columns: [
    CsvColumnString{name: "id"},
    CsvColumnDate{name: "start", format:"%Y-%m-%d"},
    CsvColumnDate{name: "end", format:"%Y-%m-%d"},
    null,
    CsvColumnFloat{name: "value"},
    CsvColumnIgnored{},
  ],
}

var format_b = CsvFormat {
  separator: ';',
  string_delimiter: '"',
  decimal_separator: ',',
  thousands_separator: '_',
  columns: [
    CsvColumnString{offset:2},
    CsvColumnDate{offset:1, format:"%Y-%m-%d"},
    CsvColumnBoolean{offset:5},
  ],
}
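
For illustration, here is a made-up data line that format_b could parse (no header line assumed): the second field matches the date format, the third is a delimited string containing the separator, and the sixth is the boolean.

42;2024-03-01;"Doe; Jane";1_250,75;note;true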

CsvColumns

CsvColumn classes help to precisely define the type of data per column. All column types share some common attributes:
  • name: String? defines a name for the column. This can be useful for debugging purposes but is not used for parsing (it is completely independent of the column header in the file).
  • mandatory: bool? indicates to the parser whether a field is mandatory. False by default; when true, an exception is thrown during the parsing of a line if the value of the mandatory field is null.
  • offset: int? defines the position of this column definition from the leftmost column, starting at 0.

CsvColumn is an abstract type and cannot be instantiated as is; you have to use one of the concrete column types.
Some of the column types have no specific attributes beyond the generic ones. This is the case for CsvColumnInteger, CsvColumnFloat, and CsvColumnBoolean.

var columns = [
  CsvColumnInteger{name: "nb_stuff", offset: 2},
  CsvColumnFloat{},
  CsvColumnBoolean{name:"valid", mandatory: true},
  CsvColumnIgnored{name: "alias"},
];

CsvColumnTime and CsvColumnDuration

Although they parse to different types, CsvColumnTime and CsvColumnDuration share the same API.
CsvColumnTime is used to parse an integer value representing a date/time (from 1900-01-01). If your date is serialized as a string, use CsvColumnDate instead.
CsvColumnDuration is used to parse a duration (of an event, for instance).
In both cases, the only parameter to specify is the unit in which the time, or duration, has been written in the file.

var columns = [
  CsvColumnTime{name: "timestamp", offset: 0, unit: DurationUnit::milliseconds},
  CsvColumnDuration{name: "stop_duration", unit: DurationUnit::seconds},
];

CsvColumnDate

CsvColumnDate specifies how to parse columns containing dates serialized in a string format (such as ISO 8601). If your date is a timestamp (serialized as an integer value), use CsvColumnTime.

  • format: String? specifies the expected format of the string, so the parser can identify the various fields (year, month, hour, etc.). The format follows the libc time formats, described at https://www.gnu.org/software/libc/manual/html_node/Formatting-Calendar-Time.html. The default format is ISO 8601: "%Y-%m-%dT%H:%M:%SZ", equivalent to "%FT%TZ".
  • tz: TimeZone? defines the time zone of the date/time when it is not specified in the string itself. If not set, UTC is assumed.

type CsvColumnDate extends CsvColumn {
  format: String?;
  tz: TimeZone?;
  as_time: bool?;
}
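
For instance, a date column for string values such as 03/12/2024 14:30 could be declared as follows (the column name and format string are illustrative assumptions, not taken from a real file):

var start_column = CsvColumnDate {
  name: "measured_at",        // illustrative column name
  format: "%d/%m/%Y %H:%M",   // matches values such as 03/12/2024 14:30
  // tz: ...,                 // set a TimeZone here if the strings carry none
};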

CsvColumnString

CsvColumnString helps better handle string columns. You can specify the following additional attributes:
  • trim: bool? defaults to false; when true, whitespace at the beginning and end of the string is automatically removed, keeping only the core data.
  • try_number: bool? when true, an attempt is made to parse the string content as a number. False by default.
  • try_json: bool? when true, an attempt is made to parse the string content as a (JSON) object. False by default.
  • values: Array? specifies a list of possible values this string column can have. This improves performance by not allocating new strings each time, relying instead on their representation in the compiled GreyCat program. Values from the dataset that are not present in the list are still parsed and returned.
  • encoder: TextEncoder? specifies the encoding of the string. Possible values are:

enum TextEncoder {
  plain;
  base64;
  base64url;
  hexadecimal;
}
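
As a short illustration (the column names and values below are made-up assumptions), string columns could be declared like this:

var columns = [
  CsvColumnString{
    name: "country",              // illustrative name
    trim: true,                   // strip surrounding whitespace
    values: ["FR", "DE", "LU"],   // assumed closed set of values
  },
  CsvColumnString{
    name: "payload",
    encoder: TextEncoder::base64, // the column content is assumed to be base64-encoded
  },
];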

CsvReader

CSV files can be processed (read) using the CsvReader of GreyCat. The CsvReader constructor takes the path to the file, along with a CsvFormat specifying how the content is formatted.

Reading a CSV file with common values for the various delimiters is as simple as:

fn run() {
  var format = CsvFormat {}; // Default format
  var reader = CsvReader::new("./data/myFile.csv", format);
  if(reader != null) {
    while(reader.available() > 0) {
      // Do read
    }
  }
}

If the file uses non-default delimiters, only the format needs to change:

fn run() {
  var format = CsvFormat {
    separator: ';',
    string_delimiter: '\'',
    decimal_separator: ',',
  };
  var reader = CsvReader::new("./data/myFile.csv", format);
  if(reader != null) {
    while(reader.available() > 0) {
      //Do read
    }
  }
}

The CsvReader offers several functions:

  • available(): int returns the number of bytes remaining to be read
  • lastLine(): String? returns the last line read, as a String
  • get_pos(): int returns the current position of the reader within the file, in bytes from the beginning
  • set_pos(pos: int) jumps directly to a specific position, which is useful to progress quickly in the file (for instance when reading an update). The position, in bytes from the beginning of the file, must be within the size of the file (available()). Note that lastLine() is reset by this call; a short resume sketch is shown after this list.
  • validate(path: String, format: CsvFormat, max_rows: int, max_invalid: int, invalid_path: String?) verifies that the CsvFormat matches the provided CSV file. max_rows sets the number of rows to verify. Rows in error can be copied to a separate file, invalid_path; since the input file can be large, the caller can limit the number of copied error rows with max_invalid. Reporting the rows in error is optional: setting max_invalid to 0 or passing a null file switches reporting off.
  • sample(path: String, format: CsvFormat?, offset: int?, max: int?): Table extracts a part of a CSV file, according to a format, into a Table. offset indicates the approximate start of the extract (in bytes from the beginning); null means the beginning of the file. max indicates the size (in bytes) of the extract; null means until the end of the file.
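
As a minimal sketch, get_pos() and set_pos() can be combined to resume reading a growing file; how the last position is persisted between runs is left out and assumed to be handled by the caller:

fn resume_read(last_pos: int) {
  var format = CsvFormat {};
  var reader = CsvReader::new("./data/myFile.csv", format);
  if(reader != null) {
    reader.set_pos(last_pos);           // skip what was already processed
    while(reader.available() > 0) {
      var line = reader.read();         // process only the new lines
    }
    var new_pos = reader.get_pos();     // persist this value for the next run
  }
}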

CsvReader offers two means to read a file:

  1. read(): Array? reads one line. The line is parsed according to the specified CsvFormat, and each cell is parsed according to the column format specified, if any; otherwise the cell value is inferred, or kept as a string if inference is deactivated.
while(reader.available() > 0) {
  // Read one line
  var line = reader.read();
  // Process
  var name: String = line[0] as String;
  var age: int = line[3] as int;
}

With this method, typing is weak, because read() returns an Array<any?>. It is also a bit costly in memory, because an array is allocated for each line. Finally, it requires updating all the indexes if the format evolves.

  2. read_to(target: any)

read_to(target: any) provides an alternative to reading into an array, by reading the line and pushing the values directly into a template object.

type CsvLineTemplate {
  name: String?;
  age: int?;
}
[...]
var lineTemplate = CsvLineTemplate{};
while(reader.available() > 0) {
  // Read one line
  reader.read_to(lineTemplate);
  // Process
  var name: String = lineTemplate.name!!;
  var age: int = lineTemplate.age!!;
}

The template object's fields (attributes) are filled in order, from top to bottom, with the cells of the line, from left to right. The object in the example is reused, which saves memory and increases speed, and the fields are directly typed.
If you have a type to represent the lines of your CSV as a record, you can also create a new object template for each read, fill it, and store it in the graph.
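
A sketch of that per-line variant (what you do with each record, such as storing it in the graph, is left out):

while(reader.available() > 0) {
  var record = CsvLineTemplate{};   // fresh record for every line
  reader.read_to(record);
  // store `record` in the graph as needed
}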

CsvFormat provides a generateType method, which will generate the GCL code that defines a type matching a format. See CsvAnalysis for more information.

In case of a parsing error, the exception will indicate, when appropriate, the column index (the column offset) and the start of the offending line, in bytes from the beginning. You can then use the Explorer to connect to your application or model (started with the greycat serve command) to navigate the file. This error position can also be used with your operating system commands or editor.

CsvWriter

The CsvWriter works quite similarly to the CsvReader, expecting the path of the file you want to write (or append to), and a definition of the internal format of the CSV file you want to produce. You can then call the write(data: any?) function to push data to the file.

In the following example, we write a CSV file whose fields are separated with ';' and strings delimited with '"'. We also specify that the third column is of type time, and that the time must be serialized in milliseconds.

fn run() {
  var format = CsvFormat {
    separator: ';',
    string_delimiter: '"',  // optional, default value
    decimal_separator: ',',
    columns: [
      CsvColumnTime{
        offset: 2,
        unit: DurationUnit::milliseconds,
      }
    ]
  };
  var writer = CsvWriter::new("./data/myFile.csv", format);
  if(writer != null) {
    writer.write(["John", "Doe", time::now(), 56]);
    writer.write(["Jane", "Doe", time::now(), 34]);
  }
}

The string_delimiter attribute of CsvFormat works as follows:

  • when set, all strings will be enclosed with the delimiter,
  • when not set, strings will not be enclosed, unless the field to be written requires it.
    This is needed to conform to parsing rules, for instance when the column separator (set with separator) appears inside the field.
    In that case, the default " character is used, as sketched below.
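
A short sketch of that fallback (the file path and values are illustrative assumptions): the first field of the first row contains the separator and is therefore enclosed with the default ", while the other strings are written as-is.

fn run() {
  var format = CsvFormat { separator: ',' };   // string_delimiter left unset
  var writer = CsvWriter::new("./data/out.csv", format);
  if(writer != null) {
    writer.write(["Doe, John", 42]);   // contains the separator, so it gets enclosed with "
    writer.write(["Jane", 34]);        // written without delimiters
  }
}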