Node LB Data
Manager object profiling data.
The node LB data manager component vt::
, accessed via vt::
manages instrumentation data from objects in a collection. It holds data per node on the timing of these objects and communication between them demarcated by phase and subphase.
When LB is invoked in vt, the LB Manager passes the node LB data to the various LB strategies to run the load balancer. The node LB data component can also dump the LB data it holds to files, which can be read externally. For example, the LBAF (Load Balancing Analysis Framework) can read these files to analyze the quality of the load distribution at any phase in the file.
Exporting LB Data Files (VOM)
The NodeLBData component, after collecting LB data from the running program, can dump it to VOM (Virtual Object Map) files. As indicated by the name, a VOM file specifies the mapping of objects to nodes for each phase, along with LB data for each object (computation time and communication load).
To output VOM files, pass --vt_lb_data to enable output, along with --vt_lb_data_dir=<my-directory> and --vt_lb_data_file=<my-base-name> to control the directory in which the files are generated and their base file name. With this enabled, vt will generate a file for each node that contains the LB data and mapping.
File Format
The VOM files are output in JSON format, either compressed with Brotli (the default) or as plain JSON if the argument --vt_lb_data_compress is set to false.
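Because a file may be either compressed or plain, a reader has to handle both cases. The following is a minimal sketch, not part of vt: the helper name `load_lb_data` is hypothetical, and the compressed branch assumes the third-party `brotli` Python package is installed.

```python
import json
from pathlib import Path


def load_lb_data(path):
    """Load one LB data file, whether plain JSON or Brotli-compressed."""
    raw = Path(path).read_bytes()
    try:
        # Plain JSON: the case when compression is turned off.
        return json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        # Otherwise assume Brotli. The import is deferred so that plain
        # files can be read without the third-party package installed.
        import brotli
        return json.loads(brotli.decompress(raw).decode("utf-8"))
```

Trying plain JSON first avoids depending on a file extension, since the naming of the generated files is not described here.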
The JSON files contain an array of phases that have been captured by vt and output to the file. Each phase has an id indicating which phase it was while the application was running. Each phase also has an array of tasks that represent work that was done during that phase. Each task has a time, resource, node, entity, and optionally a list of subphases. The entity contains information about the task that performed this work. If that entity is a virtual collection object, it will specify the unique id for the object, and optionally the index, home, and collection_id for that object.
    {
      "metadata": {
        "phases": {
          "identical_to_previous": { "list": [], "range": [] },
          "skipped": { "list": [], "range": [] }
        },
        "rank": 0,
        "shared_node": { "id": 0, "num_nodes": 1, "rank": 0, "size": 1 },
        "type": "LBDatafile"
      },
      "phases": [
        {
          "communications": [
            {
              "bytes": 160.0,
              "from": { "collection_id": 7, "home": 0, "id": 262147, "index": [ 0 ], "migratable": true, "type": "object" },
              "messages": 1,
              "to": { "collection_id": 7, "home": 0, "id": 262147, "index": [ 0 ], "migratable": true, "type": "object" },
              "type": "SendRecv"
            }
          ],
          "id": 0,
          "tasks": [
            {
              "entity": { "collection_id": 7, "home": 0, "id": 262147, "index": [ 0 ], "migratable": true, "type": "object" },
              "node": 0,
              "resource": "cpu",
              "subphases": [ { "id": 0, "time": 0.00031375000025946065 } ],
              "time": 0.00031375000025946065
            },
            {
              "entity": { "home": 0, "id": 1, "migratable": false, "type": "object" },
              "node": 0,
              "resource": "cpu",
              "subphases": [ { "id": 0, "time": 1.1669999366858974e-06 } ],
              "time": 1.1669999366858974e-06
            },
            {
              "entity": { "home": 0, "id": 3145740, "migratable": false, "objgroup_id": 786435, "type": "object" },
              "node": 0,
              "resource": "cpu",
              "time": 0.0
            },
            {
              "entity": { "home": 0, "id": 4194316, "migratable": false, "objgroup_id": 1048579, "type": "object" },
              "node": 0,
              "resource": "cpu",
              "time": 0.0
            },
            {
              "entity": { "home": 0, "id": 5242892, "migratable": false, "objgroup_id": 1310723, "type": "object" },
              "node": 0,
              "resource": "cpu",
              "time": 0.0
            },
            {
              "entity": { "home": 0, "id": 0, "migratable": false, "type": "object" },
              "node": 0,
              "resource": "cpu",
              "time": 0.0
            }
          ]
        }
      ]
    }
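To illustrate how the phases and tasks arrays fit together, the sketch below sums task time per node for each phase of one parsed file. It is an illustration only: the function name `load_per_node` is hypothetical, and the input is a trimmed-down dictionary with the same shape as the example above, using made-up times.

```python
from collections import defaultdict


def load_per_node(data):
    """Return {phase id: {node: summed task time}} for one parsed file."""
    totals = {}
    for phase in data.get("phases", []):
        per_node = defaultdict(float)
        for task in phase.get("tasks", []):
            per_node[task["node"]] += task["time"]
        totals[phase["id"]] = dict(per_node)
    return totals


# Hypothetical input: one phase with two tasks on node 0.
sample = {
    "phases": [
        {
            "id": 0,
            "tasks": [
                {"entity": {"home": 0, "id": 1, "migratable": False, "type": "object"},
                 "node": 0, "resource": "cpu", "time": 0.25},
                {"entity": {"home": 0, "id": 0, "migratable": False, "type": "object"},
                 "node": 0, "resource": "cpu", "time": 0.5},
            ],
        }
    ]
}

print(load_per_node(sample))  # {0: {0: 0.75}}
```

Since vt writes one file per node, comparing load across nodes would mean running this over every file and merging the results.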
Each phase in the file may also have a communications array that specifies any communication between tasks that occurred during the phase. Each communication has a type, which is described in the table below. Additionally, it specifies the bytes, the number of messages, and the two entities that were involved in the operation as to and from. The entities may be of different types, such as an object or a node, depending on the type of communication.
    {
      "phases": [
        {
          "communications": [
            {
              "bytes": 1456.0,
              "from": { "home": 0, "id": 1, "migratable": false, "type": "object" },
              "messages": 26,
              "to": { "home": 1, "id": 5, "migratable": false, "type": "object" },
              "type": "SendRecv"
            },
            {
              "bytes": 1456.0,
              "from": { "home": 0, "id": 1, "migratable": false, "type": "object" },
              "messages": 26,
              "to": { "home": 2, "id": 9, "migratable": false, "type": "object" },
              "type": "SendRecv"
            }
          ]
        }
      ]
    }
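One way to use the communications array is to accumulate the traffic between each pair of entities. The sketch below is an assumption-laden illustration: the function name `comm_volume` is hypothetical, and it keys entities on id, so entries that carry only seq_id would need a different key.

```python
from collections import defaultdict


def comm_volume(data):
    """Sum bytes and messages per (from id, to id) pair over all phases."""
    volume = defaultdict(lambda: {"bytes": 0.0, "messages": 0})
    for phase in data.get("phases", []):
        # The communications array is optional, so default to empty.
        for comm in phase.get("communications", []):
            key = (comm["from"].get("id"), comm["to"].get("id"))
            volume[key]["bytes"] += comm["bytes"]
            volume[key]["messages"] += comm["messages"]
    return dict(volume)


# The two edges from the example above, reduced to the fields used here.
sample_comms = {
    "phases": [
        {
            "communications": [
                {"bytes": 1456.0, "messages": 26, "type": "SendRecv",
                 "from": {"home": 0, "id": 1}, "to": {"home": 1, "id": 5}},
                {"bytes": 1456.0, "messages": 26, "type": "SendRecv",
                 "from": {"home": 0, "id": 1}, "to": {"home": 2, "id": 9}},
            ]
        }
    ]
}

for (src, dst), totals in comm_volume(sample_comms).items():
    print(src, "->", dst, totals)
```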
The type of communication lines up with the enum vt::vrt::collection::balance::CommCategory
in the code.
Value | Enum entry | Description |
---|---|---|
1 | CommCategory::SendRecv | A send-receive edge between two collection elements |
2 | CommCategory::CollectionToNode | A send from a collection element to a node |
3 | CommCategory::NodeToCollection | A send from a node to a collection element |
4 | CommCategory::Broadcast | A broadcast from a collection element to a whole collection (receive-side) |
5 | CommCategory::CollectionToNodeBcast | A broadcast from a collection element to all nodes (receive-side) |
6 | CommCategory::NodeToCollectionBcast | A broadcast from a node to a whole collection (receive-side) |
7 | CommCategory::CollectiveToCollectionBcast | Collective 'broadcast' from every node to the local collection elements (receive-side) |
For all the broadcast-like edges, the communication is logged on the receive side of the broadcast (one entry per broadcast recipient).
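For scripts that post-process these files, the table above can be mirrored as a small Python enum. This is a sketch and not part of vt: it assumes the type string in a file matches the enum entry name, which holds for the "SendRecv" entries in the examples but is not confirmed here for the other categories. The helper `count_by_category` is hypothetical.

```python
from collections import Counter
from enum import IntEnum


class CommCategory(IntEnum):
    """Python mirror of the table above; values follow its first column."""
    SendRecv = 1
    CollectionToNode = 2
    NodeToCollection = 3
    Broadcast = 4
    CollectionToNodeBcast = 5
    NodeToCollectionBcast = 6
    CollectiveToCollectionBcast = 7


def count_by_category(communications):
    """Count communication entries per category, looked up by name."""
    counts = Counter()
    for comm in communications:
        counts[CommCategory[comm["type"]]] += 1
    return counts
```

An unknown type string raises KeyError here, which makes a mismatch with the assumption above visible rather than silently miscounted.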
LB Data File Validator
All input JSON files will be validated using the JSON_data_files_validator.py script found in the scripts directory, which ensures that a given JSON file adheres to the following schema:
    from schema import And, Optional, Schema

    def validate_ids(field):
        """
        Ensure that 1) either seq_id or id is provided,
        and 2) if an object is migratable, collection_id has been set.
        """
        if 'seq_id' not in field and 'id' not in field:
            raise ValueError('Either id (bit-encoded) or seq_id must be provided.')

        if field.get("migratable") is True and 'seq_id' in field and 'collection_id' not in field:
            raise ValueError('If an entity is migratable, it must have a collection_id')

        return field

    task = {
        'entity': And({
            Optional('collection_id'): int,
            'home': int,
            Optional('id'): int,
            Optional('seq_id'): int,
            Optional('index'): [int],
            'type': str,
            'migratable': bool,
            Optional('objgroup_id'): int
        }, validate_ids),
        'node': int,
        'resource': str,
        Optional('subphases'): [
            {
                'id': int,
                'time': float,
            }
        ],
        'time': float,
        Optional('user_defined'): dict,
        Optional('attributes'): dict
    }

    communication = {
        'type': str,
        'to': And({
            'type': str,
            Optional('id'): int,
            Optional('seq_id'): int,
            Optional('home'): int,
            Optional('collection_id'): int,
            Optional('migratable'): bool,
            Optional('index'): [int],
            Optional('objgroup_id'): int,
        }, validate_ids),
        'messages': int,
        'from': And({
            'type': str,
            Optional('id'): int,
            Optional('seq_id'): int,
            Optional('home'): int,
            Optional('collection_id'): int,
            Optional('migratable'): bool,
            Optional('index'): [int],
            Optional('objgroup_id'): int,
        }, validate_ids),
        'bytes': float
    }

    LBDatafile_schema = Schema(
        {
            Optional('type'): And(str, "LBDatafile", error="'LBDatafile' must be chosen."),
            Optional('metadata'): {
                Optional('type'): And(str, "LBDatafile", error="'LBDatafile' must be chosen."),
                Optional('rank'): int,
                Optional('shared_node'): {
                    'id': int,
                    'size': int,
                    'rank': int,
                    'num_nodes': int,
                },
                Optional('phases'): {
                    Optional('count'): int,
                    'skipped': {
                        'list': [int],
                        'range': [[int]],
                    },
                    'identical_to_previous': {
                        'list': [int],
                        'range': [[int]],
                    },
                },
                Optional('attributes'): dict
            },
            'phases': [
                {
                    'id': int,
                    'tasks': [
                        task
                    ],
                    Optional('communications'): [
                        communication
                    ],
                    Optional('user_defined'): dict,
                    Optional('lb_iterations'): [
                        {
                            'id': int,
                            'tasks': [ task ],
                            Optional('communications'): [ communication ],
                            Optional('user_defined'): dict
                        }
                    ]
                },
            ]
        }
    )
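For a quick sanity check without the `schema` package, a few of the rules above can be approximated with the standard library. The sketch below is deliberately incomplete and is not a substitute for JSON_data_files_validator.py: the names `check_entity` and `precheck` are hypothetical, and only the id rule and the required task keys are covered.

```python
def check_entity(entity):
    """Mirror of the validate_ids rule; returns an error string or None."""
    if "seq_id" not in entity and "id" not in entity:
        return "entity needs either id or seq_id"
    if (entity.get("migratable") is True and "seq_id" in entity
            and "collection_id" not in entity):
        return "migratable entity with seq_id needs collection_id"
    return None


def precheck(data):
    """Collect human-readable problems; an empty list means none found."""
    problems = []
    if "phases" not in data:
        problems.append("missing top-level phases array")
    for p_idx, phase in enumerate(data.get("phases", [])):
        if not isinstance(phase.get("id"), int):
            problems.append(f"phase {p_idx}: missing integer id")
        for t_idx, task in enumerate(phase.get("tasks", [])):
            where = f"phase {p_idx}, task {t_idx}"
            for key in ("entity", "node", "resource", "time"):
                if key not in task:
                    problems.append(f"{where}: missing {key}")
            if "entity" in task:
                error = check_entity(task["entity"])
                if error:
                    problems.append(f"{where}: {error}")
    return problems
```

Returning a list of problems instead of raising on the first one makes it easier to see everything wrong with a hand-edited file in one pass.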
LB Specification File
To customize when LB output is enabled and disabled, an LB specification file can be passed to vt via command-line flags: --vt_lb_spec --vt_lb_spec_file=filename.spec.
For details about vt's specification file, see Spec File.