7.0.1685-testing

CSV Files

CSV files are one of the most common ways to exchange data. Reading and writing CSV files is therefore highly customizable in GreyCat, to support the many format variations found in practice.


CsvReader

CSV files can be processed (read) using the CsvReader of GreyCat. The CsvReader constructor takes the path to the file, along with a CsvFormat specifying how the content is formatted.

Reading a CSV file with common values for the various delimiters is as simple as:

fn run() {
  var format = CsvFormat {}; // Default format, comma separated value

  // Custom csv format
  format = CsvFormat {
    separator: ';',
    string_delimiter: '\'',
    decimal_separator: ',',
  };

  var reader = CsvReader{ path: "./data.csv", format: format };
  while(reader.can_read()) {
    // Do read
  }
}

You can also store the reader position and resume from where you left off:

var prevPos: node<int?>;

fn run(){
  var reader = CsvReader { path: "./data.csv", pos: *prevPos };

  while(reader.can_read()){
    // Do stuff
  }

  prevPos.set(reader.pos);
}

The CsvReader offers several utilities:

  • available(): int returns the number of bytes remaining to be read
  • can_read(): bool returns true if there is content left to read (use it as the condition of a while loop)
  • lastLine(): String? returns the last line read, as a String
  • read(): any? | T returns the next line, as an object of type T if a type is specified, or as an Array otherwise
  • pos: int is the current position of the reader within the file, counted from its beginning
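
As a minimal sketch, the utilities listed above can be combined in a single read loop. It uses only the members named in the list; the file path is a placeholder, and what gets printed depends on the content of the file.

```gcl
fn run() {
  var reader = CsvReader { path: "./data.csv", format: CsvFormat {} };

  while (reader.can_read()) {
    var row = reader.read();      // no type specified: the line is returned as an Array
    println(row);
    println(reader.lastLine());   // the same line, as a raw String
    println(reader.pos);          // current position within the file
    println(reader.available());  // bytes remaining to be read
  }
}
```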

CsvReader offers two means to read a file:

Reading without a type

An untyped reader reads one line at a time. The line is parsed according to the specified CsvFormat. Each cell is parsed according to its column format if one is specified; otherwise its type is inferred, or it is kept as a string if inference is deactivated.

With this method, typing is weak, because the method returns an Array<any?>. It is also somewhat costly in memory, because an array is allocated for each line. Finally, if the file format evolves, every column index used in the code has to be updated.

Reading using types

The CsvReader can also be typed, so that lines are read directly as the declared type.

The fields (attributes) of the template object are filled in declaration order, from top to bottom, with the cells of the line, from left to right. The object in the example is reused, which saves memory and increases speed, and the fields are directly typed.

If the CSV data doesn't match the provided type, an error is thrown.

@volatile
type Entry {
  id: int;
  name: String;
  values: Array<int>;
}

fn main() {
  var reader = CsvReader<Entry> {
    path: "files/entries.csv",
    format: CsvFormat {
      header_lines: 1,
    },
  };

  while (reader.can_read()) {
    var entry = reader.read();
    println(entry); // Entry { id: 0, name: "aaa", values: [1, 2, 3] }
  }
}

This works with a CSV file like the following:

id,name,value_0,value_1,value_2
0,aaa,1,2,3
1,bbb,4,5,6
2,ccc,7,8,9

Note that if the custom type includes an Array attribute, the array consumes columns greedily. Hence, there should be only one Array, and it must map to the trailing columns of the CSV file.
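
The following sketch contrasts a layout that respects this rule with one that does not. Both type names are hypothetical and serve only as illustration.

```gcl
// Fits lines shaped like: id,name,value_0,value_1,...
@volatile
type Trailing {
  id: int;
  name: String;
  values: Array<int>; // last field: greedily takes every remaining column
}

// Does not respect the rule: `values` is not last, so it would greedily
// consume the columns intended for `name`. Avoid this layout.
@volatile
type Misplaced {
  id: int;
  values: Array<int>;
  name: String;
}
```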

Note the @volatile pragma on top of the type: it is used to facilitate upgrades when the attributes of the underlying type, or their types, change. For more information, see Volatile.

Special types

geo

type Record {
  position: geo;
}

A geo field:

  • consumes two columns
  • is order-sensitive: the latitude (float) comes first, then the longitude (float)

Note that the following would also work:

type Record {
  lat: float;
  lng: float;
}

As well as:

type Record {
  position: Tuple<float, float>;
}
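
To make the column order concrete, here is a small sketch that reads a two-column file into the geo variant. The file name and its content are hypothetical; the reader is built without a format, as in the position example earlier on this page.

```gcl
// Hypothetical file "files/positions.csv", latitude first, then longitude:
// 49.6116,6.1319
// 48.8566,2.3522

type Record {
  position: geo; // consumes both columns of each line
}

fn main() {
  var reader = CsvReader<Record> { path: "files/positions.csv" };

  while (reader.can_read()) {
    println(reader.read());
  }
}
```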

null

A field can also be declared with the null type, in which case the corresponding column is skipped. Use this to speed up reading.

@volatile
type Entry {
  id: int;
  name: null; // will be skipped
  values: Array<int>;
}

T “nesting”

type Record {
  column_0: int;
  child: RecordChild; // nested parsing from the flat columns
  column_3: float;
}

type RecordChild {
  name: String; // column 1
  value: int;   // column 2
}

Nested types work as expected, consuming the flat columns to produce the intermediate instances. For example, the line:

0,a,1000,0.1

yields this instance:

Record {
  column_0: 0,
  child: RecordChild {
    name: "a",
    value: 1000,
  },
  column_3: 0.1,
}

Enum

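Enums are matched by field key first, then by the associated value. For example, given a file with the following lines:

foo
by_value
baz

a reader typed as follows yields one enum field per line: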

type Record {
  value: MyEnum;
}

enum MyEnum {
  foo;
  bar("by_value");
  baz;
}

fn main() {
  var reader = CsvReader<Record> { /*...*/ };

  println(reader.read()); // Record { value: MyEnum::foo }
  println(reader.read()); // Record { value: MyEnum::bar }
  println(reader.read()); // Record { value: MyEnum::baz }
}

time

The @format annotation can be used on time fields to fine-tune the behavior of the parser:

type Record {
  @format("%d/%m/%y %H:%M") // only accept strings like: "01/11/25 15:42"
  date: time;
}

It accepts the following signatures, where the date format follows the GNU libc standard:

// interprets the time (String) respecting the given dateformat
@format("%d/%m/%y %H:%M")
// interprets the time (String) respecting the given dateformat in the given timezone
@format("%d/%m/%y %H:%M", TimeZone::"Europe/Luxembourg")
// interprets the time (String) in the given timezone
@format(TimeZone::"Europe/Luxembourg")
// interprets the time (int) as a UNIX epoch in milliseconds
@format(DurationUnit::milliseconds)
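
As an illustration, the sketch below applies the second signature (date format plus timezone) to a typed reader. The Event type, the label field, and the file are hypothetical; only the annotation itself comes from the list above.

```gcl
// Hypothetical file "files/events.csv":
// 01/11/25 15:42,start
// 02/11/25 08:05,stop

type Event {
  @format("%d/%m/%y %H:%M", TimeZone::"Europe/Luxembourg")
  date: time;     // parsed with the given date format, in the given timezone
  label: String;
}

fn main() {
  var reader = CsvReader<Event> { path: "files/events.csv" };

  while (reader.can_read()) {
    println(reader.read());
  }
}
```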

duration

The @format annotation can be used on duration fields to fine-tune the behavior of the parser:

type Record {
  @format(DurationUnit::hours) // interprets the parsed int as hours
  elapsed: duration;
}

It accepts the following signature:

// interprets the duration (int) as seconds
@format(DurationUnit::seconds)

Other

type Record {
  a: Tuple<int, String>; // consumes 2 columns, an `int` and a `String`
  b: bool;               // TRUEISH:  "true", "1", "yes", "y", "t" (ignores case)
                         // FALSEISH: "false", "0", "no", "n", "f" (ignores case)
  d: t2;                 // consumes 2 `int` columns, `t2f` consumes 2 `float`
  e: t3;                 // consumes 3 `int` columns, `t3f` consumes 3 `float`
  f: t4;                 // consumes 4 `int` columns, `t4f` consumes 4 `float`
}
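
Since several of these types consume more than one column, it helps to count columns against a concrete line. The sketch below repeats the Record type and pairs it with a hypothetical 12-column line (2 + 1 + 2 + 3 + 4); the file name is a placeholder.

```gcl
// Hypothetical line in "files/other.csv":
// 7,abc,yes,1,2,1,2,3,1,2,3,4
//
// a <- 7, abc       (Tuple<int, String>, 2 columns)
// b <- yes          (trueish)
// d <- 1, 2         (t2, 2 int columns)
// e <- 1, 2, 3      (t3, 3 int columns)
// f <- 1, 2, 3, 4   (t4, 4 int columns)

type Record {
  a: Tuple<int, String>;
  b: bool;
  d: t2;
  e: t3;
  f: t4;
}

fn main() {
  var reader = CsvReader<Record> { path: "files/other.csv" };

  while (reader.can_read()) {
    println(reader.read());
  }
}
```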

CsvFormat

Both reading and writing depend on the format of the CSV file. The CsvFormat object describes the internal format of the file through various attributes.

Attribute            Type       Description
header_lines         int?       Number of lines at the top of the file that are header lines, and are therefore ignored when reading the content
separator            char?      Character used to separate the fields/columns within a line. Usually , (default) or ;
decimal_separator    char?      Character used to separate the integer and decimal parts of numbers (defaults to .)
thousands_separator  char?      Character used to separate thousands in big numbers, if any
string_delimiter     char?      Character used to delimit strings, so that separators appearing inside strings are ignored
format               String?    Format used to parse dates; defaults to ISO8601/epoch timestamp in milliseconds
tz                   TimeZone?  Timezone used to interpret times; defaults to the host global timezone

var format_a = CsvFormat {
  header_lines: 2,              // the first 2 lines are headers and are ignored
  separator: ',',
  decimal_separator: '.',
};

var format_b = CsvFormat {
  separator: ';',
  string_delimiter: '"',
  decimal_separator: ',',
  thousands_separator: '_',
};
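
To show how these attributes interact, here is a hedged sketch that reads a small file with the same settings as format_b. The Amount type and the file are hypothetical, and the comments describe the intended reading according to the attribute descriptions, not verified output.

```gcl
// Hypothetical file "files/amounts.csv":
// "Doe; John";1_234,5
// "Roe; Jane";987,25

type Amount {
  name: String;   // the ; inside the quotes is not treated as a column separator
  value: float;   // 1_234,5 is meant to be read as 1234.5
}

fn main() {
  var reader = CsvReader<Amount> {
    path: "files/amounts.csv",
    format: CsvFormat {
      separator: ';',
      string_delimiter: '"',
      decimal_separator: ',',
      thousands_separator: '_',
    },
  };

  while (reader.can_read()) {
    println(reader.read());
  }
}
```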

CsvWriter

The CsvWriter works quite similarly to the CsvReader, expecting the path of the file you want to write (or append to), and a definition of the internal format of the CSV file you want to produce. You can then call the write(data: any?) function to push data to the file.

In the following example, we write a CSV file whose fields are separated with ‘;’, whose strings are delimited with ‘"’, and whose decimal numbers use ‘,’. The third value of each row is a time; the example does not set a time format, so the default format applies.

fn run() {
  var format = CsvFormat {
    separator: ';',
    string_delimiter: '"',  // optional, default value
    decimal_separator: ',',
  };
  var writer = CsvWriter {path: "./data/myFile.csv", format: format };
  if(writer != null) {
    writer.write(["John", "Doe", time::now(), 56]);
    writer.write(["Jane", "Doe", time::now(), 34]);
  }
}

The string_delimiter attribute of CsvFormat works as follows:

  • when set, all strings are enclosed with the delimiter,
  • when not set, strings are not enclosed unless the field being written requires it to conform to parsing rules, for example when the field contains the column separator (as set with separator). In that case, the default " character is used.
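
The second rule can be sketched as follows, with string_delimiter deliberately left unset. The file path is hypothetical, and the comments restate the rule above; they are not captured output.

```gcl
fn run() {
  var writer = CsvWriter {
    path: "./data/people.csv",
    format: CsvFormat { separator: ',' }, // no string_delimiter
  };

  writer.write(["John", 56]);      // no field contains the separator: nothing is enclosed
  writer.write(["Doe, John", 56]); // the first field contains ',': it is enclosed with "
}
```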