Output infrastructure and Python analysis library. (!317) · Merge requests · Air Shower Physics / corsika

Remy Prechelt requested to merge rprechelt-output into master Feb 07, 2021

This MR contains the first draft of an output/writing infrastructure for C8 along with a Python analysis library that allows for reading the generated files. As per the extensive format testing that Ralf and I did last year, the default output format is Parquet but the current implementation allows users to replace the "writer" portion (via a template) so that any alternative output format can be used.

The current design is intended to be very explicit in how outputs are setup for maximum flexibility - see the vertical_EAS file for an actual example (but I also discuss some things below). As discussed last year, the output is structured around a top-level library directory with a subdirectory for each process that has outputs.

All outputs must satisfy the BaseOutput.hpp interface which requires a combination of methods - namely, startOfLibrary, startOfShower, endOfShower, endOfLibrary, getConfig, and getSummary. Almost all of those are optional but are provided so that users can hook into various states of C8 if needed. startOfLibrary and endOfLibrary are called at the beginning and end (before the first shower and after the last shower), startOfShower and endOfShower are called before and after each individual shower in the library.

Each output must also return a YAML::Node containing a complete representation of its configuration with getConfig() - this is to help reproducibility so we can reconstruct the exact configuration of a particular library from it's config.yaml. At the end of the entire library generation, getSummary() is called on each process and the results are written into a summary.yaml file in each subdirectory - this is the ideal place for summary statistics or info that is not part of the configuration of a process and is not raw high-volume data.

As long as these methods are available in the process, the output infrastructure will work. To decouple the particular implementation of the writer from the process itself, I am currently implementing the writing functionality as a defaulted template argument (but this is of course not required). For example,

ObservationPlane observationLevel(obsPlane, DirectionVector(rootCS, {1., 0., 0.}));

is actually equivalent to (because of a default template assignment)

ObservationPlane<ObservationPlaneParquetWriter> observationLevel(obsPlane, DirectionVector(rootCS, {1., 0., 0.}));

Therefore, swapping out that template parameter with another class, for example ObservationPlaneWriterROOT, allows processes to be individually changed to a different output format or implementation, i.e.

# if we ever wanted to support ROOT, we could provide "ROOT" writers.
ObservationPlane<ObservationPlaneROOTWriter> observationLevel(obsPlane, DirectionVector(rootCS, {1., 0., 0.}));

In the rest of the MR I walkthrough how the vertical_EAS example sets up the writing infrastructure and how the output files can be analyzed in Python.

To start, we create an OutputManager instance - each output manager writes into a single "library" directory.

OutputManager output("vertical_EAS_outputs"); // the argument is the name of the library/directory

Any processes that wish to write output into the library must be added/registered with this manager, i.e.

ObservationPlane observationLevel(obsPlane, DirectionVector(rootCS, {1., 0., 0.}));
output.add("obsplane", obsPlane); // the first argument is the name of the outputs subdirectory in the library.

The output is then given to the Cascade instance so that it can coordinate the library generation.

Cascade EAS(env, tracking, sequence, output, stack);

At the end of the program, output.endOfLibrary() must be called to notify all of the processes to close their streams, file handles, etc.

This results in a filesystem directory that looks like:

vertical_EAS_outputs/
    config.yaml
    summary.yaml
    obsplane/
        config.yaml
        summary.yaml
        particles.parquet

The config.yaml and summary.yaml files in the top-level are still quite bare and will certainly have more information added to them in the future. Currently, here is the config.yaml that is written:

name: vertical_EAS_outputs
creator: CORSIKA8
version: 8.0.0-prealpha

and here is the current information written to the summary.yaml:

showers: 2
start time: 06/02/2021 23:46:18 HST
end time: 06/02/2021 23:46:42 HST
runtime: 00:00:24.260

To work with this library from Python, this MR contains a pip-installable Python library called corsika. An example of how this can be used is (>>> is the Python prompt):

>>> import corsika
>>> lib = corsika.Library("vertical_EAS_outputs")
>>> lib.config  # this gets the library configuration as a Python dictionary
{'name': 'vertical_EAS_outputs',
 'creator': 'CORSIKA8',
 'version': '8.0.0-prealpha'}
>>> lib.names  # get a list of all registered processes in the library
['obsplane']
>>> lib.summary  # you can also load the summary information
{'showers': 1,
 'start time': '06/02/2021 23:46:18 HST',
 'end time': '06/02/2021 23:46:30 HST',
 'runtime': 11.13}
>>> lib.get("obsplane")  # you can then get the process by its registered name.
ObservationPlane('obsplane')
>>> lib.get("obsplane").config  # and you can also get its config as well
{'type': 'ObservationPlane',
 'plane': {'center': [0, 0, 6371000],
  'center.units': 'm',
  'normal': [0, 0, 1]},
 'x-axis': [1, 0, 0],
 'y-axis': [0, 1, 0],
 'delete_on_hit': True,
 'name': 'obsplane'}
>>> lib.get("obsplane").data  # this returns the data as a Pandas data frame
     shower  pdg        energy         x         y    radius
0         0  211  9.066702e+10  2.449931 -5.913341  7.093710
1         0   22  2.403024e+11 -1.561504 -1.276160  2.024900
2         0  211  1.306354e+11 -4.626045 -3.237780  6.009696
3         0  211  1.773324e+11 -1.566567  4.172961  4.461556
4         0  211  7.835374e+10  3.152863 -1.049201  3.330416
..      ...  ...           ...       ...       ...       ...
>>> lib.get("obsplane").astype("arrow")  # you can also request the data in a different format
pyarrow.Table
shower: int32 not null
pdg: int32 not null
energy: double not null
x: double not null
y: double not null
radius: double not null

This is working quite well for me but I need to get the tests working on the CI infrastructure since we have to store some state between running the vertical_EAS example and then testing the reading/writing from Python. There are also some minor architectural changes that I'll finish up this week as well but the API will not change. There's clearly still much to do but this should be useable right now and just requires that we move some more of the existing processes to this new format (hopefully I'll get some more done before this is ready to be merged).

Edited Apr 08, 2021 by Remy Prechelt

Output infrastructure and Python analysis library.

Merge request reports