
Csv Analysis feature of GreyCat

GreyCat's CSV analysis is a fast and powerful type-deduction feature that lets you explore millions of lines of CSV files in a matter of seconds.

As a reference point, it explores:

8,000,000 lines of CSV (1.3 GB, 15 columns) in 4.8 s,

an average speed of 1.7M rows/second (277 MB/s), measured with unlimited date check limits.

Analysis

Dataset

timestamp, value, category, status
2023-01-01 12:00:00, 100.5, "A", true
2023-01-02 12:00:00, 105.2, "B", false
2023-01-03 12:00:00, 102.8, "A", true
2023-01-04 12:00:00, 110.0, "C", false
2023-01-05 12:00:00, 108.3, "B", true

Code

// Files to analyze; several files can be listed.
var files = Array<File> { File { path: "input.csv" } };
// Analysis configuration: the column separator is a comma.
var config = CsvAnalysisConfig { separator: ',' };
var csv_stats = Csv::analyze(files, config);

Result

Analyzing the above CSV would yield:

CsvStatistics {
  header_lines: 1,
  separator: ',',
  string_delimiter: null,
  decimal_separator: null,
  thousands_separator: null,
  columns: Array<CsvColumnStatistics> { /*...*/ },
  line_count: 5,
  fail_count: 0,
  file_count: 1
}

Analysis: column 0

CsvColumnStatistics {
  name: "timestamp",
  example: "2023-01-01 12:00:00",
  null_count: 0,
  bool_count: 0,
  int_count: 0,
  float_count: 0,
  string_count: 0,
  date_count: 5,
  date_format_count: Map<String,int> {
    "%Y-%m-%d %H:%M:%S": 5
  },
  enumerable_count: Map<any,int> {},
  profile: Gaussian {
    sum: null,
    sumsq: null,
    count: null,
    min: null,
    max: null
  }
}

Analysis: column 1

CsvColumnStatistics {
  name: "value",
  example: 100.5,
  null_count: 0,
  bool_count: 0,
  int_count: 0,
  float_count: 5,
  string_count: 0,
  date_count: 0,
  date_format_count: Map<String,int> {},
  enumerable_count: Map<any,int> {},
  profile: Gaussian {
    sum: 526.8,
    sumsq: 55564.02,
    count: 5,
    min: 100.5,
    max: 110.0
  }
}
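
The profile stores running aggregates rather than the raw values, which is enough to recover the mean and the spread of a column afterwards. The Python sketch below is an illustration only, not GreyCat's implementation: the helper names `gaussian_profile`, `mean` and `population_variance` are hypothetical, and whether GreyCat derives a population or a sample variance from these fields is an assumption not confirmed by this page.

```python
import math


def gaussian_profile(values):
    """Accumulate the same fields as the profile above: sum, sumsq, count, min, max."""
    profile = {"sum": 0.0, "sumsq": 0.0, "count": 0, "min": math.inf, "max": -math.inf}
    for v in values:
        profile["sum"] += v
        profile["sumsq"] += v * v
        profile["count"] += 1
        profile["min"] = min(profile["min"], v)
        profile["max"] = max(profile["max"], v)
    return profile


def mean(profile):
    """Mean recovered from the aggregates alone."""
    return profile["sum"] / profile["count"]


def population_variance(profile):
    """Population variance: E[x^2] - (E[x])^2 (assumed convention)."""
    m = mean(profile)
    return profile["sumsq"] / profile["count"] - m * m


# The five entries of the "value" column from the dataset above.
values = [100.5, 105.2, 102.8, 110.0, 108.3]
profile = gaussian_profile(values)
avg = mean(profile)
var = population_variance(profile)
std = math.sqrt(var)
```

For the value column this reproduces the sum (526.8), sumsq (55564.02) and count (5) shown above, and gives a mean of about 105.36.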

Analysis: column 2

CsvColumnStatistics {
  name: "category",
  example: "A",
  null_count: 0,
  bool_count: 0,
  int_count: 0,
  float_count: 0,
  string_count: 5,
  date_count: 0,
  date_format_count: Map<String,int> {},
  enumerable_count: Map<any,int> {
    "C": 1,
    "B": 2,
    "A": 2
  },
  profile: Gaussian {
    sum: null,
    sumsq: null,
    count: null,
    min: null,
    max: null
  }
}

Analysis: column 3

CsvColumnStatistics {
  name: "status",
  example: true,
  null_count: 0,
  bool_count: 5,
  int_count: 0,
  float_count: 0,
  string_count: 0,
  date_count: 0,
  date_format_count: Map<String,int> {},
  enumerable_count: Map<any,int> {},
  profile: Gaussian {
    sum: null,
    sumsq: null,
    count: null,
    min: null,
    max: null
  }
}
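
The per-column counters above can be read as the result of classifying every cell and tallying the outcome. A minimal Python sketch of that idea follows; it is not GreyCat's algorithm. The order of the checks, the candidate list `DATE_FORMATS`, and the naive `split` on the separator (which ignores quoted separators) are all assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime

# Hypothetical list of candidate date formats to try, in order.
DATE_FORMATS = ["%Y-%m-%d %H:%M:%S", "%Y-%m-%d"]

SAMPLE = """timestamp, value, category, status
2023-01-01 12:00:00, 100.5, "A", true
2023-01-02 12:00:00, 105.2, "B", false
2023-01-03 12:00:00, 102.8, "A", true
2023-01-04 12:00:00, 110.0, "C", false
2023-01-05 12:00:00, 108.3, "B", true
"""


def classify(cell):
    """Return (kind, date_format) for one cell; date_format is None unless kind is 'date'."""
    text = cell.strip()
    if text == "":
        return "null", None
    if text in ("true", "false"):
        return "bool", None
    try:
        int(text)
        return "int", None
    except ValueError:
        pass
    try:
        float(text)
        return "float", None
    except ValueError:
        pass
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(text, fmt)
            return "date", fmt
        except ValueError:
            pass
    return "string", None


def analyze(text, separator=","):
    """Tally cell kinds and matching date formats per column (first line is the header)."""
    rows = [line.split(separator) for line in text.strip().splitlines()]
    header, data = rows[0], rows[1:]
    columns = []
    for index, name in enumerate(header):
        kinds, formats = Counter(), Counter()
        for row in data:
            kind, fmt = classify(row[index])
            kinds[kind] += 1
            if fmt is not None:
                formats[fmt] += 1
        columns.append({"name": name.strip(), "kinds": kinds, "date_formats": formats})
    return columns


stats = analyze(SAMPLE)
```

On the sample dataset, each column ends up with a single dominant kind (date, float, string, bool), which mirrors the counts listed for columns 0 to 3.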

Code generation

From the CsvStatistics we can generate the GCL types/enums for the reader:

var code = Csv::generate(csv_stats);
println(code);

This prints:

@volatile
private type Record {
  /// column=0
  @format("%Y-%m-%d %H:%M:%S")
  timestamp: time;
  /// column=1, min=100.5, max=110.0, avg=105.36
  value: float;
  /// column=2
  category: Category;
  /// column=3
  status: bool;
}

@volatile
private enum Category {
  C,
  B,
  A,
}
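
To make the link between the statistics and the generated code concrete, here is a hedged Python sketch that renders an enum declaration from an `enumerable_count` map and a field comment from a profile. `render_enum` and `render_field_comment` are hypothetical helpers; how `Csv::generate` works internally is not described on this page.

```python
def render_enum(name, enumerable_count):
    """Render one enum entry per distinct value, in the map's iteration order."""
    lines = ["@volatile", "private enum " + name + " {"]
    for value in enumerable_count:
        lines.append("  " + str(value) + ",")
    lines.append("}")
    return "\n".join(lines)


def render_field_comment(column, profile):
    """Render the doc comment of a numeric field; avg is derived as sum / count."""
    avg = round(profile["sum"] / profile["count"], 2)
    return "/// column={}, min={}, max={}, avg={}".format(
        column, profile["min"], profile["max"], avg
    )


# Values copied from the statistics of columns 2 and 1 above.
category_counts = {"C": 1, "B": 2, "A": 2}
value_profile = {"sum": 526.8, "sumsq": 55564.02, "count": 5, "min": 100.5, "max": 110.0}

enum_code = render_enum("Category", category_counts)
comment = render_field_comment(1, value_profile)
```

The sketch only shows that the enum entries and the min/max/avg comment can be derived from the statistics alone, without re-reading the CSV file.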

This feature is available in the explorer to ease the import of multiple complex datasets.