Running Your First Task

The central object in autojob is the task. This tutorial will walk you through running your first task.

Objectives

Upon completing this tutorial, you will be able to:

  • create tasks

  • harvest task results

  • archive task results

  • restart tasks with new input parameters

  • archive multiple task results into a single file

Prerequisites

This tutorial assumes that you have basic knowledge of Python and that you have a virtual environment with autojob and gpaw installed. While not required, it will be useful if you have read about tasks.

Creating the Task

To create a task, one must create its various inputs and then define its metadata. In this tutorial, we will be creating a Calculation task. Calculation tasks have task and calculation inputs, TaskInputs and CalculationInputs, respectively.

Side Quest

Afterwards, try repeating this tutorial with different concrete TaskBase classes (e.g., BondScan, Vibration, MolecularDynamics).

Creating Task Inputs

The first set of inputs that we will create will be task inputs. All tasks have task inputs. The TaskInputsBase abstract base class describes the data model for task inputs. The most important field to set for task inputs is usually TaskInputsBase.atoms, corresponding to the input atoms object(s). Thus, we start by creating the structure:

from ase.build import molecule

atoms = molecule("H2O")

Side Quest

The snippet above creates an H2O molecule. Try:

  • creating a Cu slab with ase.build.surface

  • creating an adsorbate complex with ccu

  • importing a structure from Materials Project

At this point, you can inspect and modify the structure using Atoms.edit(). For example, tag the oxygen atom with a 1 and center the Atoms object in a 10 Å vacuum cell.

>>> atoms.edit()
>>> atoms.set_tags([1 if a.symbol == "O" else 0 for a in atoms])
>>> atoms.get_tags()
array([1, 0, 0])
>>> atoms.center(vacuum=10)
>>> atoms.get_cell()
Cell([20.0, 21.526478, 20.596309])

Tip

This is a good point to add any metadata that will be used to identify the output of the calculation. For example, you may wish to store the name of the structure in the Atoms.info dictionary of the structure being used:

atoms.info["structure"] = "H2O"

We now create a TaskInputs object with our H2O molecule as the input atoms.

from autojob.tasks.task import TaskInputs

task_inputs = TaskInputs(
    atoms=atoms,
    atoms_filename="H2O.traj",
    task_script="run.sh",
    task_script_template="run.sh.j2",
)

This will create a TaskInputs object that specifies the task inputs for our calculation. The H2O structure will be the input atoms, and the filename used to save the input atoms will be "H2O.traj".

The task script that will be written to the task directory will be constructed from the run.sh.j2 template and will be named run.sh. We also could have specified the names of files_to_copy and files_to_delete. These fields can be used to specify which files should be copied from the directory of the completed task and which files should be deleted from the task directory when it is completed (as in restart()), respectively.

Creating Calculation Inputs

The second set of inputs that we will create is specific to Calculation tasks and subclasses. Calculation tasks differ from basic Task tasks in that they have additional inputs and outputs. In addition to task inputs, Calculation tasks have calculation inputs (CalculationInputs) and scheduler inputs (SchedulerInputs). Typical calculation inputs include the ase calculator, the ase Optimizer, and corresponding parameters. Similar to TaskInputs, one can specify the template used to construct the calculation script and the filename to which it is written. Further still, the analyses field can be used to specify post-calculation analyses.

from autojob.harvest.harvesters.gpaw import GPAW_LOG
from autojob.tasks.calculation import CalculationInputs

calc_inputs = CalculationInputs(
    calculator="gpaw",
    optimizer="ase.optimize.lbfgs.LBFGS",
    calc_params={
        "mode": "pw",
        "convergence": {
            "energy": 1e-3,
        },
        "txt": GPAW_LOG,
    },
    opt_params={
        "init": {
            "trajectory": "opt.traj",
            "logfile": "opt.log",
        },
        "run": {
            "fmax": 0.5,
            "steps": 5,
        },
    },
    calculation_script="my_calc.py",
    calculation_script_template="run.py.j2",
)

This will create a CalculationInputs object in which the calculator is specified to be a gpaw calculator and the optimizer is set to LBFGS. The calc_params parameter specifies a dictionary for instantiating the gpaw calculator. We have set very minimal parameters for the gpaw calculator. You should consult the GPAW documentation for how to set reasonable defaults for your application.

Also, note that we specify initialization and optimization parameters for the LBFGS Optimizer in a dictionary under the "init" and "run" keys of opt_params, respectively. Although not necessary, structuring opt_params this way simplifies Optimizer instantiation and optimization. The optimization parameters should correspond to parameters for the Optimizer.run() method. See the ASE documentation on structure optimization, for example.
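The benefit of the "init"/"run" split can be sketched without ASE. In the sketch below, StandInOptimizer is a hypothetical placeholder that only mimics the call shape of an ASE optimizer (constructor arguments plus a run() method); it is not part of ase or autojob, and it performs no optimization.

```python
class StandInOptimizer:
    """Hypothetical stand-in that mimics the call shape of an ASE optimizer."""

    def __init__(self, atoms, trajectory=None, logfile=None):
        self.atoms = atoms
        self.trajectory = trajectory
        self.logfile = logfile

    def run(self, fmax=0.05, steps=100):
        # A real optimizer would relax the structure here; this one only
        # echoes the settings it received.
        return {"fmax": fmax, "steps": steps}


opt_params = {
    "init": {"trajectory": "opt.traj", "logfile": "opt.log"},
    "run": {"fmax": 0.5, "steps": 5},
}

# Each sub-dictionary unpacks directly into the matching call.
opt = StandInOptimizer("H2O placeholder", **opt_params["init"])
run_settings = opt.run(**opt_params["run"])
print(opt.trajectory, run_settings)
```

Because the keys of each sub-dictionary match the keyword arguments of the call they feed, a calculation script template needs no optimizer-specific logic to pass them along.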

calculation_script controls the name of the calculation script that will be written when calculation inputs are dumped to a directory. calculation_script_template specifies the template that will be used to write the calculation script. Calculation script templates are read from the template directory, which is controlled by the SETTINGS.TEMPLATE_DIR setting or the AUTOJOB_TEMPLATE_DIR environment variable. By default, the autojob templates will be used (Default Templates). See How To Write a Template for more details on customizing templates.

What’s with the format of the strings passed as calculator and optimizer?

Although autojob permits any string to be passed as calculator and optimizer, this tutorial passes the short name of the ase calculator but the fully qualified class name of the ase optimizer. This is because ase calculator classes can be imported by name using the ase.calculators.calculator.get_calculator_class() function. Thus, by simply passing the name of an installed ase calculator, one can instantiate the desired calculator within a calculation script like so:

from ase.calculators.calculator import get_calculator_class

calc_class = get_calculator_class(calc_inputs.calculator)
calc = calc_class(**calc_inputs.calc_params)

No such utility function exists for ase Optimizers. Thus, we store the fully qualified class name of the ase optimizer, which can then be used to instantiate the optimizer like so:

from importlib import import_module

*mod_parts, opt_cls_name = calc_inputs.optimizer.split(".")
opt_cls = getattr(import_module(".".join(mod_parts)), opt_cls_name)
opt = opt_cls(task_inputs.atoms, **calc_inputs.opt_params["init"])

Again, note here that we’ve used the values corresponding to the "init" key of the opt_params field to initialize the optimizer.
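The dotted-path lookup is plain Python and can be tried on any importable class. Here is a minimal sketch, where load_class is a hypothetical helper name (not an autojob function), exercised on a standard-library class so that it runs without ase:

```python
from importlib import import_module


def load_class(qualified_name):
    """Return the object named by a dotted path such as 'pkg.module.Class'."""
    module_name, _, attr_name = qualified_name.rpartition(".")
    if not module_name:
        raise ValueError(
            f"expected a fully qualified name, got {qualified_name!r}"
        )
    return getattr(import_module(module_name), attr_name)


# Standard-library class used purely for demonstration.
ordered_dict_cls = load_class("collections.OrderedDict")
print(ordered_dict_cls.__name__)
```

With ase installed, load_class("ase.optimize.lbfgs.LBFGS") should resolve to the same class as the optimizer lookup shown earlier.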

Warning

Although input parameters are defined in inputs classes (e.g., TaskInputs and CalculationInputs), these inputs do not uniquely define task execution behaviour at runtime. Specifically, task execution further depends on the templates used to write scripts and on the environment configuration (e.g., installed libraries). Thus, it is important to ensure that the specified templates and the environment in which the task will be run are configured correctly before submitting large numbers of tasks.

Creating Task Metadata

Finally, we will create the object to host the task metadata.

from autojob.tasks.task import TaskMetadata

metadata = TaskMetadata(
    label="My custom H2O task",
    tags=[
        "H2O",
        "GPAW",
    ]
)

The label field can be used to store a convenient label for the task (e.g., "test calculation 1"), and the tags field can be used to associate strings with a given task for cross-referencing. When not set, metadata fields will be populated with default values. What is the task ID?

>>> metadata.task_id

Tip

To see the names of all fields in task metadata, check the attributes of TaskMetadataBase.

Instantiating the Task

Putting it all together, we can now create the task:

from autojob.tasks.calculation import Calculation

calc = Calculation(
    task_inputs=task_inputs,
    task_metadata=metadata,
    calculation_inputs=calc_inputs,
)

Writing the Task Inputs

With the new task created, we can now write the inputs required to execute our task to a directory. This can be accomplished using the create_task_tree() function.

from pathlib import Path
from autojob.next import create_task_tree

dest = Path()
task_dir = create_task_tree(calc, dest)

You should notice that a new directory has been created in the current working directory. What is the directory named?

Note

Under the hood, create_task_tree() uses the InputWriter interface to write task inputs. That means that alternatively, one could do:

dest = Path("my_calc")
dest.mkdir()
calc.write_inputs(dest)

In general, however, create_task_tree() is more fully featured, as it can be used to create a new task from an old task directory and to use dynamic naming schemes when creating new task directories.

See Serialization for more explanation.

If you would like to use a different naming scheme, you can use the name_template argument. The name_template argument controls the naming scheme used to create new task directories. Most characters in this string will be rendered as-is; however, names enclosed in curly brackets that correspond to fields in TaskMetadataBase will be substituted. Additionally, {structure} will be substituted with the value of the atoms_filename field of the task inputs of the task, and {i} can control the location of the index used to avoid naming conflicts. (The default is at the end of the filename stem.)

dest = Path()
task_dir = create_task_tree(calc, dest, name_template="calc_{i}")
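To build intuition for how such a template could expand, here is a minimal sketch using str.format. It is not autojob's implementation: render_name is a hypothetical helper, and the starting index, the rule for skipping existing names, and the value passed for structure are all assumptions made for illustration.

```python
from pathlib import Path
import tempfile


def render_name(template, parent, max_tries=1000, **fields):
    """Fill a name template, bumping {i} until the name is unused in parent."""
    for index in range(max_tries):
        # str.format ignores keyword arguments that the template does not use.
        name = template.format(i=index, **fields)
        if not (parent / name).exists():
            return name
    raise FileExistsError(f"no free name found for template {template!r}")


with tempfile.TemporaryDirectory() as tmp:
    parent = Path(tmp)
    first = render_name("calc_{i}", parent, structure="H2O")
    (parent / first).mkdir()
    # The first name is now taken, so the index advances.
    second = render_name("calc_{i}", parent, structure="H2O")
    third = render_name("{structure}_{i}", parent, structure="H2O")

print(first, second, third)
```

Compare the names that this sketch produces with the directory that create_task_tree() actually creates to see where the real naming scheme differs.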

You should also notice a number of files in the directory, task_dir. What are they called? You should notice two files corresponding to the values of the task_script and calculation_script attributes of the task inputs and calculation inputs, respectively. Inspect these two files to understand the logic for the task.

In general, it is recommended that task scripts specify resource requirements, configure the software environment, and dictate high-level control flow for the task. For a calculation, the control flow will typically involve running the calculation script, archiving any results, and executing any auto-restart or workflow progression logic. (This is how the default task and calculation script templates are structured.) For a brief explanation of the purpose of each of these files, see the How autojob structures directories 💭 page.

Running the Task

The task that we have defined can be executed locally, so we can execute our task simply by running the task script in the shell. Navigate to the newly created task directory and execute

bash run.sh

Alternatively, if the task class inherits from ScheduledMixin (as is the case for Calculation) and the task script specifies scheduler resources (as is the case for the run.sh.j2 template), then the task script can be used to submit the task using a scheduler. For example, with SLURM and sbatch, you can do:

sbatch run.sh

When the task is complete, we should notice several output files. Notably, we should notice an output atoms file, final.traj, and an output JSON, archive.json. You can inspect the output atoms object using the ASE GUI.

ase gui final.traj

Question

What other files are present?

Retrieving Task Outputs

Locate and inspect the task archive file, archive.json, in the task directory. This should be a JSON dictionary with a single entry. The key should be the task’s task ID, and the value should be a dictionary that can be loaded into a task. Since we know the type of task that we would like to instantiate, we can do so from the JSON like so:

import json
from pathlib import Path

from autojob.tasks.calculation import Calculation

with Path("archive.json").open(mode="r", encoding="utf-8") as file:
    data = json.load(file)

calc = Calculation(**next(iter(data.values())))
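The next(iter(...)) idiom simply pulls the only value out of a one-entry dictionary. The sketch below shows the same idiom on a stand-in dictionary and also recovers the key; the task ID and the field in the payload are made up for illustration and do not reflect the real archive schema.

```python
import json

# Stand-in for the text of archive.json: one entry keyed by a task ID.
# The ID and payload field are invented for this example.
archive_text = json.dumps({"example-task-id": {"label": "My custom H2O task"}})

data = json.loads(archive_text)
if len(data) != 1:
    raise ValueError(f"expected exactly one task, found {len(data)}")

# items() yields (key, value) pairs, so this recovers the ID and the payload.
task_id, payload = next(iter(data.items()))
print(task_id, payload["label"])
```

Checking the number of entries first guards against silently loading only the first task from an archive that holds several.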

If we did not know the type of task contained in archive.json, then we could retrieve the task outputs using the harvest() function.

from autojob.harvest.harvest import harvest

calc = next(iter(harvest(dir_name=Path(), use_cache=True)))

harvest() will load the outputs of all task directories that it finds. The dir_name argument to harvest() is the directory in which to search for tasks whose results should be harvested. All directories containing task metadata files will be assumed to be task directories. The use_cache argument instructs autojob to load each task directly from its archive.json. If, instead, you would prefer that task outputs be loaded fresh from the files in a directory, set use_cache=False.

Note

Under the hood, harvest() uses the PathLoadable interface when use_cache=False so that

from pathlib import Path

from autojob.harvest.harvest import harvest

calc = next(iter(harvest(dir_name=Path(), use_cache=False)))

is equivalent to

from pathlib import Path

from autojob.tasks.task import Task

completed = Task.from_directory(Path(), magic_mode=True)

Calculation tasks have both task outputs and calculation outputs. Task outputs are relatively simple; their interface is described by the TaskOutputsBase class, and they typically only include the output Atoms object. You can visualize the output atoms like so:

completed.task_outputs.atoms.edit()

How does this structure compare with the input atoms?

The calculation outputs are generally more complex as they can contain any of the output from a calculation. The data model of calculation outputs is described in the CalculationOutputs class. Most basically, the calculation outputs include whether the calculation converged, the system energy, atomic forces, and results for the calculator, optimizer, and any analyses that were run. The contents of task and calculation outputs are populated by adherents to the HarvesterBase protocol. Because we used the gpaw calculator, the calculator results were harvested with the harvest_gpaw_results() function.

Side Quest

Inspect completed.calculation_outputs. Did the calculation converge? What is the final energy? What are the contents of the completed.calculation_outputs.calculator_results dictionary?

Restarting the Task with Different Inputs

Now, suppose that we wanted to re-run that same calculation but with a tighter convergence tolerance and compare our results. We could manually edit the calculation script (my_calc.py) and then re-execute our task, but then we might lose some of the data in the current directory. We could clone our current directory and then re-run the task, but then we would also need to update the metadata file. This is the problem that restart() solves. restart() accepts the name of a task directory and returns a 2-tuple representing a newly created task. The first element is a TaskBase instance (whose type depends on that of the previous task), and the second element is the task directory as a Path.

Calculator parameters (corresponding to calc_params of CalculationInputs objects) can be modified with the calc_mods argument. We can restart our task like so:

from autojob.next.restart import restart

calc_mods = {
    "convergence": {
        "energy": 1e-4,
    },
}
task, new_task_dir = restart(task_dir, calc_mods=calc_mods)

This will create a restart task from the task directory task_dir and dump the new task to a sibling directory of task_dir. task is a newly created Calculation; new_task_dir is its path. What is the task ID of the new task? What is the name of the new task directory?
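How calc_mods is combined with the existing calc_params is not spelled out here. The sketch below contrasts the two usual possibilities in plain Python, a shallow update and a recursive merge, so that you know what to look for in the new task's inputs. deep_merge is a hypothetical helper, the "eigenstates" entry is added only to make the difference visible, and neither approach is claimed to be what restart() does.

```python
import copy


def deep_merge(base, mods):
    """Return a copy of base with mods applied recursively to nested dicts."""
    merged = copy.deepcopy(base)
    for key, value in mods.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Illustrative parameters; "eigenstates" is not set in this tutorial.
calc_params = {
    "mode": "pw",
    "convergence": {"energy": 1e-3, "eigenstates": 1e-6},
}
calc_mods = {"convergence": {"energy": 1e-4}}

# Shallow update: the whole "convergence" dictionary is replaced.
shallow = {**calc_params, **calc_mods}
# Recursive merge: only the "energy" entry is overridden.
deep = deep_merge(calc_params, calc_mods)

print(shallow["convergence"])
print(deep["convergence"])
```

In this tutorial the original "convergence" dictionary holds only "energy", so both approaches give the same result; the distinction matters once calc_params carries several nested settings.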

Side Quest

Check out the signature of the restart() function. Try restarting this task with different calc_mods. Try different naming schemes with name_template.

Submit all new tasks from the CLI.

bash run.sh

Note

autojob also defines a simplified interface to restart() via the autojob restart CLI command. From the CLI, autojob restart submits the new task by default. Submission of the new task to the job queue can be forgone with the --no-submit option. This may be useful if more complicated modifications to task inputs are required, which can be made by directly editing the inputs.json file. For more info, run autojob restart -h from the command line.

Storing Task Results

Once all tasks have completed running, we will create a single archive in which to store all results. This can be accomplished with harvest() together with archive(). From the parent directory of the completed tasks, run:

from pathlib import Path

from autojob.harvest.archive import archive
from autojob.harvest.harvest import harvest

dir_name = Path()
all_calcs = harvest(dir_name=dir_name, use_cache=True)
archive(tasks=all_calcs)

This will create a single JSON file in the current working directory named according to SETTINGS.ARCHIVE_FILE (defaults to archive.json). You can timestamp the archive using the time_stamp argument and write both JSON and CSV archives by setting archive_mode="both". The filename stem can be changed using the stem argument.

archives = archive(
    tasks=all_calcs,
    archive_mode="both",
    time_stamp=True,
    stem="all_calcs",
)

In principle, you can now save these archive files (maybe using git), distribute them, extract the data, and perform any further analyses. This may involve extracting final energies to construct free energy diagrams, combining/permuting structures, or using the data to construct additional tasks. In any case, you are now well-equipped to work with tasks within autojob!