Node LB Data

Manages object profiling data.

The node LB data manager component vt::vrt::collection::balance::NodeLBData, accessed via vt::theNodeLBData(), manages instrumentation data from objects in a collection. On each node, it holds the timing of these objects and the communication between them, demarcated by phase and subphase.

When LB is invoked in vt, the LB Manager passes the node LB data to the selected LB strategy to run the load balancer. The node LB data component can also dump the LB data it holds to files, which can be read externally. The LBAF (Load Balancing Analysis Framework) can read these files to analyze the quality of the load distribution at any phase they contain.

Exporting LB Data Files (VOM)

After collecting LB data from the running program, the NodeLBData component can dump it to VOM (Virtual Object Map) files. As the name indicates, a VOM file specifies the mapping of objects to nodes for each phase, along with the LB data for each object (computation time and communication load).

To output VOM files, pass --vt_lb_data to enable output, along with --vt_lb_data_dir=<my-directory> and --vt_lb_data_file=<my-base-name> to control the directory in which the files are generated and their base file name. With this enabled, vt will generate one file per node containing that node's LB data and mapping.
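As a small illustration, the flags named above can be assembled in a launcher script. The helper name and the directory and base-name values here are hypothetical; only the flag names come from the text.

```python
def lb_data_args(directory, base_name):
    """Return the vt flags that enable LB data output.

    The flag names are taken from the documentation; the helper itself
    is a hypothetical convenience, not part of vt.
    """
    return [
        "--vt_lb_data",
        f"--vt_lb_data_dir={directory}",
        f"--vt_lb_data_file={base_name}",
    ]

# Hypothetical values, to be appended to the application's command line.
args = lb_data_args("lb_out", "run1")
print(" ".join(args))
```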

File Format

The VOM files are output in JSON format, either compressed with brotli (the default) or as plain JSON if the argument --vt_lb_data_compress is set to false.
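Because a file may be either compressed or plain, a reader has to handle both cases. The following is a minimal sketch, assuming the third-party `brotli` Python package is available for the compressed case; the function name and the demo file name are hypothetical.

```python
import json
import os
import tempfile

def load_lb_data(path):
    """Load an LB data file that is plain JSON or brotli-compressed JSON."""
    with open(path, "rb") as f:
        raw = f.read()
    try:
        # Uncompressed files parse directly.
        return json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        # Assumption: the third-party 'brotli' package is installed.
        import brotli
        return json.loads(brotli.decompress(raw).decode("utf-8"))

# Demo on a plain JSON file written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.0.json")  # hypothetical file name
    with open(demo_path, "w") as f:
        json.dump({"phases": []}, f)
    demo = load_lb_data(demo_path)
```

Trying plain JSON first means the `brotli` import is only needed when a compressed file is actually encountered.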

The JSON files contain an array of phases that have been captured by vt and output to the file. Each phase has an id indicating which phase it was while the application was running. Each phase also has an array of tasks that represent work that was done during that phase. Each task has a time, resource, node, entity, and optionally a list of subphases. The entity contains information about the task that performed this work. If that entity is a virtual collection object, it will specify the unique id for the object, and optionally the index, home, and collection_id for that object.

{
  "metadata": {
    "phases": {
      "identical_to_previous": {
        "list": [],
        "range": []
      },
      "skipped": {
        "list": [],
        "range": []
      }
    },
    "rank": 0,
    "shared_node": {
      "id": 0,
      "num_nodes": 1,
      "rank": 0,
      "size": 1
    },
    "type": "LBDatafile"
  },
  "phases": [
    {
      "communications": [
        {
          "bytes": 152.0,
          "from": {
            "collection_id": 7,
            "home": 0,
            "id": 262147,
            "index": [
              0
            ],
            "migratable": true,
            "type": "object"
          },
          "messages": 1,
          "to": {
            "collection_id": 7,
            "home": 0,
            "id": 262147,
            "index": [
              0
            ],
            "migratable": true,
            "type": "object"
          },
          "type": "SendRecv"
        }
      ],
      "id": 0,
      "tasks": [
        {
          "entity": {
            "collection_id": 7,
            "home": 0,
            "id": 262147,
            "index": [
              0
            ],
            "migratable": true,
            "type": "object"
          },
          "node": 0,
          "resource": "cpu",
          "subphases": [
            {
              "id": 0,
              "time": 0.0004248750001352164
            }
          ],
          "time": 0.0004248750001352164
        },
        {
          "entity": {
            "home": 0,
            "id": 3145740,
            "migratable": false,
            "objgroup_id": 786435,
            "type": "object"
          },
          "node": 0,
          "resource": "cpu",
          "time": 0.0
        },
        {
          "entity": {
            "home": 0,
            "id": 4194316,
            "migratable": false,
            "objgroup_id": 1048579,
            "type": "object"
          },
          "node": 0,
          "resource": "cpu",
          "time": 0.0
        },
        {
          "entity": {
            "home": 0,
            "id": 5242892,
            "migratable": false,
            "objgroup_id": 1310723,
            "type": "object"
          },
          "node": 0,
          "resource": "cpu",
          "time": 0.0
        },
        {
          "entity": {
            "home": 0,
            "id": 1,
            "migratable": false,
            "type": "object"
          },
          "node": 0,
          "resource": "cpu",
          "subphases": [
            {
              "id": 0,
              "time": 1.041999894368928e-06
            }
          ],
          "time": 1.041999894368928e-06
        },
        {
          "entity": {
            "home": 0,
            "id": 6291468,
            "migratable": false,
            "objgroup_id": 1572867,
            "type": "object"
          },
          "node": 0,
          "resource": "cpu",
          "time": 0.0
        },
        {
          "entity": {
            "home": 0,
            "id": 7340044,
            "migratable": false,
            "objgroup_id": 1835011,
            "type": "object"
          },
          "node": 0,
          "resource": "cpu",
          "time": 0.0
        },
        {
          "entity": {
            "home": 0,
            "id": 8388620,
            "migratable": false,
            "objgroup_id": 2097155,
            "type": "object"
          },
          "node": 0,
          "resource": "cpu",
          "time": 0.0
        },
        {
          "entity": {
            "home": 0,
            "id": 0,
            "migratable": false,
            "type": "object"
          },
          "node": 0,
          "resource": "cpu",
          "time": 0.0
        }
      ]
    }
  ]
}
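Once parsed, the structure above is straightforward to traverse. As a rough sketch, the total task time per phase and node could be computed as follows; the function name is hypothetical and the sample values are made up for illustration.

```python
from collections import defaultdict

def load_per_node(lb_data):
    """Map (phase id, node) to the total task time recorded in that phase."""
    totals = defaultdict(float)
    for phase in lb_data["phases"]:
        for task in phase["tasks"]:
            totals[(phase["id"], task["node"])] += task["time"]
    return dict(totals)

# Made-up sample with the same shape as the file shown above.
sample = {
    "phases": [
        {
            "id": 0,
            "tasks": [
                {"entity": {"home": 0, "id": 1, "migratable": False, "type": "object"},
                 "node": 0, "resource": "cpu", "time": 0.25},
                {"entity": {"home": 0, "id": 0, "migratable": False, "type": "object"},
                 "node": 0, "resource": "cpu", "time": 0.5},
            ],
        }
    ]
}

per_node = load_per_node(sample)
```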

Each phase in the file may also have a communications array that specifies any communication between tasks that occurred during the phase. Each communication has a type, which is described in the table below. Additionally, it specifies the bytes, the number of messages, and the two entities involved in the operation, as to and from. The entities may be of different types, such as an object or a node, depending on the type of communication.

{
    "phases": [
        {
            "communications": [
                {
                    "bytes": 1456.0,
                    "from": {
                        "home": 0,
                        "id": 1,
                        "migratable": false,
                        "type": "object"
                    },
                    "messages": 26,
                    "to": {
                        "home": 1,
                        "id": 5,
                        "migratable": false,
                        "type": "object"
                    },
                    "type": "SendRecv"
                },
                {
                    "bytes": 1456.0,
                    "from": {
                        "home": 0,
                        "id": 1,
                        "migratable": false,
                        "type": "object"
                    },
                    "messages": 26,
                    "to": {
                        "home": 2,
                        "id": 9,
                        "migratable": false,
                        "type": "object"
                    },
                    "type": "SendRecv"
                }
            ]
        }
    ]
}

The type of communication lines up with the enum vt::vrt::collection::balance::CommCategory in the code.

Value	Enum entry	Description
1	`CommCategory::SendRecv`	A send-receive edge between two collection elements
2	`CommCategory::CollectionToNode`	A send from a collection element to a node
3	`CommCategory::NodeToCollection`	A send from a node to a collection element
4	`CommCategory::Broadcast`	A broadcast from a collection element to a whole collection (receive-side)
5	`CommCategory::CollectionToNodeBcast`	A broadcast from a collection element to all nodes (receive-side)
6	`CommCategory::NodeToCollectionBcast`	A broadcast from a node to a whole collection (receive-side)
7	`CommCategory::CollectiveToCollectionBcast`	Collective 'broadcast' from every node to the local collection elements (receive-side)

For all the broadcast-like edges, the communication is logged on the receive side of the broadcast (one entry per broadcast recipient).
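The communication records described above can be aggregated into per-edge byte counts. This is a minimal sketch with a hypothetical function name; it assumes both endpoints carry an `id` field (an entity identified only by `seq_id` would need a different key). The sample mirrors the two entries in the example above.

```python
from collections import defaultdict

def bytes_per_edge(phase):
    """Sum bytes per (from id, to id, type) edge within one phase."""
    edges = defaultdict(float)
    # 'communications' is optional, so default to an empty list.
    for comm in phase.get("communications", []):
        key = (comm["from"]["id"], comm["to"]["id"], comm["type"])
        edges[key] += comm["bytes"]
    return dict(edges)

phase = {
    "communications": [
        {"bytes": 1456.0, "messages": 26, "type": "SendRecv",
         "from": {"home": 0, "id": 1, "migratable": False, "type": "object"},
         "to": {"home": 1, "id": 5, "migratable": False, "type": "object"}},
        {"bytes": 1456.0, "messages": 26, "type": "SendRecv",
         "from": {"home": 0, "id": 1, "migratable": False, "type": "object"},
         "to": {"home": 2, "id": 9, "migratable": False, "type": "object"}},
    ]
}

edges = bytes_per_edge(phase)
```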

LB Data File Validator

All input JSON files will be validated using the JSON_data_files_validator.py found in the scripts directory, which ensures that a given JSON adheres to the following schema:

from schema import And, Optional, Schema

def validate_ids(field):
    """
    Ensure that 1) either seq_id or id is provided,
    and 2) if an object is migratable, collection_id has been set.
    """
    if 'seq_id' not in field and 'id' not in field:
        raise ValueError('Either id (bit-encoded) or seq_id must be provided.')

    if field.get("migratable") is True and 'seq_id' in field and 'collection_id' not in field:
        raise ValueError('If an entity is migratable, it must have a collection_id')

    return field

task = {
    'entity': And({
        Optional('collection_id'): int,
        'home': int,
        Optional('id'): int,
        Optional('seq_id'): int,
        Optional('index'): [int],
        'type': str,
        'migratable': bool,
        Optional('objgroup_id'): int
    }, validate_ids),
    'node': int,
    'resource': str,
    Optional('subphases'): [
        {
            'id': int,
            'time': float,
        }
    ],
    'time': float,
    Optional('user_defined'): dict,
    Optional('attributes'): dict
}

communication = {
    'type': str,
    'to': And({
        'type': str,
        Optional('id'): int,
        Optional('seq_id'): int,
        Optional('home'): int,
        Optional('collection_id'): int,
        Optional('migratable'): bool,
        Optional('index'): [int],
        Optional('objgroup_id'): int,
    }, validate_ids),
    'messages': int,
    'from': And({
        'type': str,
        Optional('id'): int,
        Optional('seq_id'): int,
        Optional('home'): int,
        Optional('collection_id'): int,
        Optional('migratable'): bool,
        Optional('index'): [int],
        Optional('objgroup_id'): int,
    }, validate_ids),
    'bytes': float
}

LBDatafile_schema = Schema(
    {
        Optional('type'): And(str, "LBDatafile", error="'LBDatafile' must be chosen."),
        Optional('metadata'): {
            Optional('type'): And(str, "LBDatafile", error="'LBDatafile' must be chosen."),
            Optional('rank'): int,
            Optional('rank_alpha'): And(
                float, lambda x: x >= 0.0,
                error="Should be of type 'float' and >= 0"),
            Optional('shared_node'): {
                'id': int,
                'size': int,
                'rank': int,
                'num_nodes': int,
            },
            Optional('phases'): {
                Optional('count'): int,
                'skipped': {
                    'list': [int],
                    'range': [[int]],
                },
                'identical_to_previous': {
                    'list': [int],
                    'range': [[int]],
                },
            },
            Optional('attributes'): dict
        },
        'phases': [
            {
                'id': int,
                'tasks': [
                    task
                ],
                Optional('communications'): [
                    communication
                ],
                Optional('user_defined'): dict,
                Optional('lb_iterations'): [
                    {
                        'id': int,
                        'tasks': [
                            task
                        ],
                        Optional('communications'): [
                            communication
                        ],
                        Optional('user_defined'): dict
                    }
                ]
            },
        ]
    }
)
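When the `schema` package is not at hand, a much weaker standard-library check can still catch missing required keys. The sketch below is not the validator script and does not replace it; it only mirrors the non-optional keys of the phase and task entries in the schema above, and the function name is hypothetical.

```python
def missing_required_keys(lb_data):
    """Return a list of problems found; an empty list means none were found.

    Only checks the presence of keys the schema marks as required for
    phases and tasks; it does not check value types or entity rules.
    """
    problems = []
    if "phases" not in lb_data:
        problems.append("missing top-level 'phases'")
    for p_idx, phase in enumerate(lb_data.get("phases", [])):
        for key in ("id", "tasks"):
            if key not in phase:
                problems.append(f"phases[{p_idx}]: missing '{key}'")
        for t_idx, task in enumerate(phase.get("tasks", [])):
            for key in ("entity", "node", "resource", "time"):
                if key not in task:
                    problems.append(
                        f"phases[{p_idx}].tasks[{t_idx}]: missing '{key}'")
    return problems

# Made-up inputs: one well-formed, one with an incomplete task.
good = {"phases": [{"id": 0, "tasks": [
    {"entity": {}, "node": 0, "resource": "cpu", "time": 0.0}]}]}
bad = {"phases": [{"id": 0, "tasks": [{"node": 0}]}]}

good_problems = missing_required_keys(good)
bad_problems = missing_required_keys(bad)
```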

LB Specification File

To customize when LB output is enabled and disabled, an LB specification file can be passed to vt via the command-line flags --vt_lb_spec --vt_lb_spec_file=filename.spec.

For details about vt's specification file, see Spec File.