7.0.1685-testing

CSV Analysis feature of GreyCat

This GreyCat feature is a fast and powerful type deduction tool that lets the user explore millions of CSV lines in a matter of seconds.

For example, it can explore 8,000,000 lines of CSV (1.3 GB, 15 columns) in 4.8 s. That is an average speed of about 1.7M rows/second (277 MB/s), measured with unlimited date check limits.
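
The average figures follow directly from the totals quoted above. A quick check in Python (not GreyCat code), assuming 1 GB is counted as 1024 MB, which is the convention that reproduces the 277 MB/s figure:

```python
# Throughput check for the benchmark figures quoted above.
rows = 8_000_000
size_gb = 1.3
seconds = 4.8

rows_per_second = rows / seconds           # about 1.67 million rows/s
mb_per_second = size_gb * 1024 / seconds   # assumption: 1 GB = 1024 MB

print(f"{rows_per_second / 1e6:.1f}M rows/s, {mb_per_second:.0f} MB/s")
# prints: 1.7M rows/s, 277 MB/s
```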

Analysis

Dataset

timestamp, value, category, status
2023-01-01 12:00:00, 100.5, "A", true
2023-01-02 12:00:00, 105.2, "B", false
2023-01-03 12:00:00, 102.8, "A", true
2023-01-04 12:00:00, 110.0, "C", false
2023-01-05 12:00:00, 108.3, "B", true

Result

Analyzing the above CSV would yield:

CsvStatistics {
  header_lines: 1,
  separator: ',',
  string_delimiter: null,
  decimal_separator: null,
  thousands_separator: null,
  columns: Array<CsvColumnStatistics> { /*...*/ },
  line_count: 5,
  fail_count: 0,
  file_count: 1
}

Analysis: column 0

CsvColumnStatistics {
  name: "timestamp",
  example: "2023-01-01 12:00:00",
  null_count: 0,
  bool_count: 0,
  int_count: 0,
  float_count: 0,
  string_count: 0,
  date_count: 5,
  date_format_count: Map<String,int> {
    "%Y-%m-%d %H:%M:%S": 5
  },
  enumerable_count: Map<any,int> {},
  profile: Gaussian {
    sum: null,
    sumsq: null,
    count: null,
    min: null,
    max: null
  }
}

Analysis: column 1

CsvColumnStatistics {
  name: "value",
  example: 100.5,
  null_count: 0,
  bool_count: 0,
  int_count: 0,
  float_count: 5,
  string_count: 0,
  date_count: 0,
  date_format_count: Map<String,int> {},
  enumerable_count: Map<any,int> {},
  profile: Gaussian {
    sum: 526.8,
    sumsq: 55564.02,
    count: 5,
    min: 100.5,
    max: 110.0
  }
}
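
The three running totals in the profile (sum, sumsq, count) are enough to recover the mean and the standard deviation without revisiting the rows. A minimal sketch in Python (not GreyCat code) using the values shown above; the choice of population variance rather than sample variance is an assumption made for illustration:

```python
import math

# Values copied from the `profile` of column 1 above.
total, sumsq, count = 526.8, 55564.02, 5

mean = total / count                   # 105.36, the avg reported in the generated code
variance = sumsq / count - mean ** 2   # population variance: E[x^2] - E[x]^2
std = math.sqrt(max(variance, 0.0))    # clamp tiny negative rounding errors to zero

print(f"mean={mean:.2f} std={std:.2f}")
# prints: mean=105.36 std=3.47
```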

Analysis: column 2

CsvColumnStatistics {
  name: "category",
  example: "A",
  null_count: 0,
  bool_count: 0,
  int_count: 0,
  float_count: 0,
  string_count: 5,
  date_count: 0,
  date_format_count: Map<String,int> {},
  enumerable_count: Map<any,int> {
    "C": 1,
    "B": 2,
    "A": 2
  },
  profile: Gaussian {
    sum: null,
    sumsq: null,
    count: null,
    min: null,
    max: null
  }
}

Analysis: column 3

CsvColumnStatistics {
  name: "status",
  example: true,
  null_count: 0,
  bool_count: 5,
  int_count: 0,
  float_count: 0,
  string_count: 0,
  date_count: 0,
  date_format_count: Map<String,int> {},
  enumerable_count: Map<any,int> {},
  profile: Gaussian {
    sum: null,
    sumsq: null,
    count: null,
    min: null,
    max: null
  }
}
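
The per-column counters above can be reproduced with a simple "try the most specific parser first" rule. The following Python sketch only illustrates the idea of type deduction; it is not GreyCat's implementation, and the `classify` and `analyze` helpers, the parsing order, and the single candidate date format are all assumptions:

```python
import csv
import io
from collections import Counter
from datetime import datetime

DATA = """timestamp, value, category, status
2023-01-01 12:00:00, 100.5, "A", true
2023-01-02 12:00:00, 105.2, "B", false
2023-01-03 12:00:00, 102.8, "A", true
2023-01-04 12:00:00, 110.0, "C", false
2023-01-05 12:00:00, 108.3, "B", true
"""

# Assumption: only one candidate date format is tried.
DATE_FORMATS = ["%Y-%m-%d %H:%M:%S"]

def classify(cell: str) -> str:
    """Return the most specific kind the cell parses as."""
    if cell == "":
        return "null"
    if cell in ("true", "false"):
        return "bool"
    for kind, parser in (("int", int), ("float", float)):
        try:
            parser(cell)
            return kind
        except ValueError:
            pass
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(cell, fmt)
            return "date"
        except ValueError:
            pass
    return "string"

def analyze(text: str) -> dict:
    """Count, for every column, how many cells fall into each kind."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    header = next(reader)
    counts = {name: Counter() for name in header}
    for row in reader:
        for name, cell in zip(header, row):
            counts[name][classify(cell)] += 1
    return counts

stats = analyze(DATA)
for name, kinds in stats.items():
    print(name, dict(kinds))
```

On the sample dataset this counts five dates, five floats, five strings, and five booleans, in line with the `date_count`, `float_count`, `string_count`, and `bool_count` values shown above.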

Code generation

From the CsvStatistics we can generate the GCL types/enums for the reader:

var code = Csv::generate(stats);
println(code);

This prints:

@volatile
private type Record {
  /// column=0
  @format("%Y-%m-%d %H:%M:%S")
  timestamp: time;
  /// column=1, min=100.5, max=110.0, avg=105.36
  value: float;
  /// column=2
  category: Category;
  /// column=3
  status: bool;
}

@volatile
private enum Category {
  C,
  B,
  A,
}
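
To make the link between the statistics and the generated fields concrete, here is a hedged Python sketch of one possible mapping rule. It is not how `Csv::generate` is implemented: the `field_type` helper, the `max_enum_size` threshold, and the fallback to `String` are hypothetical choices that happen to reproduce the field types shown above.

```python
def field_type(name: str, counts: dict, distinct: int, max_enum_size: int = 16) -> str:
    """Pick a field type from per-kind counts (illustrative rule only)."""
    kinds = {k for k, v in counts.items() if v > 0 and k != "null"}
    if kinds == {"date"}:
        return "time"
    if kinds == {"bool"}:
        return "bool"
    if kinds == {"int"}:
        return "int"
    if kinds and kinds <= {"int", "float"}:
        return "float"
    # Assumption: a string column with few distinct values becomes an enum
    # named after the column.
    if kinds == {"string"} and 0 < distinct <= max_enum_size:
        return name.capitalize()
    return "String"

# (column name, per-kind counts, number of distinct enumerable values)
columns = [
    ("timestamp", {"date": 5}, 0),
    ("value", {"float": 5}, 0),
    ("category", {"string": 5}, 3),
    ("status", {"bool": 5}, 0),
]

fields = [f"  {name}: {field_type(name, counts, distinct)};"
          for name, counts, distinct in columns]
print("type Record {\n" + "\n".join(fields) + "\n}")
```

The printed fields are `timestamp: time`, `value: float`, `category: Category`, and `status: bool`, the same types as in the generated `Record` above.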

This feature is available in the explorer to ease the import of multiple complex datasets.

[Screenshots: Explore, Analyze, Stats, Code]