Running Your First Task¶
The central object in autojob is the task. This
tutorial will walk you through running your first task.
Objectives¶
Upon completing this tutorial, you will be able to:
create tasks
harvest task results
archive task results
restart tasks with new input parameters
archive multiple task results into a single file
Prerequisites¶
This tutorial assumes that you have basic knowledge of Python
and that you have a virtual environment with autojob and
gpaw installed. While not required, it will be useful if you
have read about tasks.
Creating the Task¶
To create a task, one must create its various inputs and then define
its metadata. In this tutorial, we will be creating a Calculation task.
Calculation tasks have task and calculation inputs, TaskInputs
and CalculationInputs, respectively.
Side Quest
Afterwards, try repeating this tutorial with different concrete
TaskBase classes (e.g., BondScan,
Vibration, MolecularDynamics).
Creating Task Inputs¶
The first set of inputs that we will create will be task inputs. All tasks
have task inputs. The TaskInputsBase abstract base class describes
the data model for task inputs. The most important field to
set for task inputs is usually TaskInputsBase.atoms, corresponding to
the input atoms object(s). Thus, we start by creating the structure:
from ase.build import molecule
atoms = molecule("H2O")
Side Quest
The snippet above creates an H2O molecule. Try:
creating a Cu slab with ase.build.surface
creating an adsorbate complex with ccu
importing a structure from Materials Project
At this point, you can inspect and modify the structure using
Atoms.edit(). For example, tag the oxygen atom with a 1
and center the Atoms object in a cell with 10 Å of vacuum on each side.
>>> atoms.edit()
>>> atoms.set_tags([1 if a.symbol == "O" else 0 for a in atoms])
>>> atoms.get_tags()
array([1, 0, 0])
>>> atoms.center(vacuum=10)
>>> atoms.get_cell()
Cell([20.0, 21.526478, 20.596309])
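The cell lengths printed above follow from simple arithmetic: each length is the extent of the molecule along that axis plus the vacuum on both sides. The sketch below checks this by hand using only the standard library. The atomic positions are an assumption (they are taken to match the geometry returned by molecule("H2O")), so treat the numbers as illustrative.

```python
# Atomic positions in Å. Assumption: these match the H2O geometry
# returned by ase.build.molecule("H2O").
positions = [
    (0.0, 0.0, 0.119262),         # O
    (0.0, 0.763239, -0.477047),   # H
    (0.0, -0.763239, -0.477047),  # H
]
vacuum = 10.0  # Å of empty space added on each side of the molecule

cell = []
for axis in range(3):
    coords = [position[axis] for position in positions]
    extent = max(coords) - min(coords)
    # Vacuum is added on both sides, hence the factor of two.
    cell.append(round(extent + 2 * vacuum, 6))

print(cell)
```

The molecule is planar, so its extent along x is zero and the cell length along that axis is exactly twice the vacuum.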
Tip
This is a good point to add any metadata that will be used
to identify the output of the calculation. For example, you
may wish to store the name of the structure in the Atoms.info
dictionary of the structure being used:
atoms.info["structure"] = "H2O"
We now create a TaskInputs object with our H2O
molecule as the input atoms.
from autojob.tasks.task import TaskInputs
task_inputs = TaskInputs(
    atoms=atoms,
    atoms_filename="H2O.traj",
    task_script="run.sh",
    task_script_template="run.sh.j2",
)
This will create a TaskInputs object that specifies the task inputs
for our calculation. The H2O structure will be the input atoms, and
the filename used to save the input atoms will be "H2O.traj".
The task script that will be written to the task directory will be constructed from the
run.sh.j2 template and will be named run.sh. We also could have
specified the names of files_to_copy and files_to_delete. These fields
can be used to specify which files should be copied from the directory of
the completed task and which files should be deleted from the task directory
when it is completed (as in restart()), respectively.
Creating Calculation Inputs¶
The second set of inputs that we will create is specific to Calculation tasks
and their subclasses. Calculation tasks differ from basic Task
tasks in that they have additional inputs and outputs. In addition
to task inputs, Calculation tasks have calculation inputs (CalculationInputs)
and scheduler inputs (SchedulerInputs). Typical calculation inputs
include the ase calculator, the ase Optimizer, and corresponding
parameters. Similar to TaskInputs, one can specify the template
used to construct the calculation script and the filename to which it is
written. Further still, the analyses field can be used to specify post-calculation
analyses.
from autojob.harvest.harvesters.gpaw import GPAW_LOG
from autojob.tasks.calculation import CalculationInputs
calc_inputs = CalculationInputs(
    calculator="gpaw",
    optimizer="ase.optimize.lbfgs.LBFGS",
    calc_params={
        "mode": "pw",
        "convergence": {
            "energy": 1e-3,
        },
        "txt": GPAW_LOG,
    },
    opt_params={
        "init": {
            "trajectory": "opt.traj",
            "logfile": "opt.log",
        },
        "run": {
            "fmax": 0.5,
            "steps": 5,
        },
    },
    calculation_script="my_calc.py",
    calculation_script_template="run.py.j2",
)
This will create a CalculationInputs object in which the calculator is
specified to be a gpaw calculator and the optimizer is set to LBFGS.
The calc_params parameter specifies a dictionary for instantiating the
gpaw calculator. We have set very minimal parameters for the
gpaw calculator. You should consult the GPAW documentation for how to set
reasonable defaults for your application.
Also, note that we specify initialization and optimization parameters for the LBFGS
Optimizer in a dictionary under the "init" and "run" keys of
opt_params, respectively. Although not necessary, structuring opt_params this
way simplifies Optimizer instantiation and optimization. The optimization
parameters should correspond to parameters for the Optimizer.run() method. See the
ASE documentation on structure optimization, for example.
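To see why splitting opt_params into "init" and "run" is convenient, consider the minimal sketch below. ToyOptimizer is a hypothetical stand-in written for this example (it is not an ase class and performs no optimization); it only mimics the pattern of a constructor that takes file-related options and a run() method that takes convergence options.

```python
class ToyOptimizer:
    """Hypothetical stand-in that mimics the init/run split of an optimizer."""

    def __init__(self, atoms, trajectory=None, logfile=None):
        self.atoms = atoms
        self.trajectory = trajectory
        self.logfile = logfile

    def run(self, fmax=0.05, steps=100):
        # A real optimizer would relax the structure here; this stand-in
        # only reports the settings it was asked to use.
        return {"fmax": fmax, "steps": steps}


opt_params = {
    "init": {"trajectory": "opt.traj", "logfile": "opt.log"},
    "run": {"fmax": 0.5, "steps": 5},
}

# Each sub-dictionary unpacks directly into the matching call.
opt = ToyOptimizer("H2O", **opt_params["init"])
summary = opt.run(**opt_params["run"])
print(opt.trajectory, summary)
```

Because each sub-dictionary maps onto exactly one call, a script template does not need to know which keys belong to the constructor and which belong to run().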
calculation_script controls the name of the calculation script that will
be written when calculation inputs are dumped to a directory.
calculation_script_template specifies the template that will be used to
write the calculation script. Calculation script templates are read from the
template directory, which is controlled by the SETTINGS.TEMPLATE_DIR setting
or the AUTOJOB_TEMPLATE_DIR environment variable. By default, the autojob
templates will be used (Default Templates). See How To Write a Template
for more details on customizing templates.
What’s with the format of the strings passed as calculator and optimizer?
Although autojob permits any string to be passed as calculator and optimizer,
this tutorial passes only the short name of the ase calculator but the fully qualified
class name of the ase optimizer. This is because ase calculator classes
can be imported using the ase.calculators.calculator.get_calculator_class()
function. Thus, by simply passing the name of an installed ase calculator,
one can instantiate the desired calculator within a calculation script like so:
from ase.calculators.calculator import get_calculator_class
calc_class = get_calculator_class(calc_inputs.calculator)
calc = calc_class(**calc_inputs.calc_params)
No such utility function exists for ase Optimizers. Thus, we store the
fully qualified class name of the ase optimizer so that the optimizer can
be instantiated like so:
from importlib import import_module
*mod_parts, opt_cls_name = calc_inputs.optimizer.split(".")
opt_cls = getattr(import_module(".".join(mod_parts)), opt_cls_name)
opt = opt_cls(task_inputs.atoms, **calc_inputs.opt_params["init"])
Again, note here that we’ve used the values corresponding to the "init" key
of the opt_params field to initialize the optimizer.
Warning
Although input parameters are defined in inputs classes (e.g.,
TaskInputs and CalculationInputs), these inputs do
not uniquely define task execution behaviour at runtime. Specifically,
task execution further depends on the templates used to write scripts
and on the environment configuration (e.g., installed libraries). Thus, it is
important to ensure that the specified templates and the environment in which the task
will be run are configured correctly before submitting large numbers of tasks.
Creating Task Metadata¶
Finally, we will create the object to host the task metadata.
from autojob.tasks.task import TaskMetadata
metadata = TaskMetadata(
    label="My custom H2O task",
    tags=[
        "H2O",
        "GPAW",
    ],
)
The label field can be used to store a convenient label for the task
(e.g., "test calculation 1"), and the tags field can be used to
associate strings with a given task for cross-referencing. When not set,
metadata fields will be populated with default values. What is the task ID?
>>> metadata.task_id
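The idea of metadata fields receiving default values when unset can be illustrated with a plain dataclass. The sketch below is not autojob's TaskMetadata. ToyMetadata and its use of a random UUID for the task ID are assumptions made purely for illustration, and the actual ID scheme may differ.

```python
from dataclasses import dataclass, field
from uuid import uuid4


@dataclass
class ToyMetadata:
    """Hypothetical metadata model; not autojob's TaskMetadata."""

    label: str = ""
    tags: list[str] = field(default_factory=list)
    # Assumption: a random UUID stands in for whatever ID scheme autojob uses.
    task_id: str = field(default_factory=lambda: uuid4().hex)


first = ToyMetadata(label="My custom H2O task", tags=["H2O", "GPAW"])
second = ToyMetadata()  # every field falls back to its default

# Each instance receives its own freshly generated ID.
print(first.task_id != second.task_id)
```

Using default_factory rather than a fixed default means every instance gets its own ID and its own tags list.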
Tip
To see the names of all fields in task metadata, check the attributes of
TaskMetadataBase.
Instantiating the Task¶
Putting it all together, we can now create the task:
from autojob.tasks.calculation import Calculation
calc = Calculation(
    task_inputs=task_inputs,
    task_metadata=metadata,
    calculation_inputs=calc_inputs,
)
Writing the Task Inputs¶
With the new task created, we can now write the inputs required to execute our
task to a directory. This can be accomplished using the create_task_tree()
function.
from pathlib import Path
from autojob.next import create_task_tree
dest = Path()
task_dir = create_task_tree(calc, dest)
You should notice that a new directory has been created in the current working directory. What is the directory named?
Note
Under the hood, create_task_tree() uses the InputWriter
interface to write task inputs. That means that alternatively, one could
do:
dest = Path("my_calc")
dest.mkdir()
calc.write_inputs(dest)
In general, however, create_task_tree() is more fully featured, as it
can be used to create a new task from an old task directory and
can use dynamic naming schemes to create new task directories.
See Serialization for more explanation.
If you would like to use a different naming scheme, you can use the
name_template argument. The name_template argument controls
the naming scheme used to create new task directories. Most characters
in this string will be rendered as-is; however, names contained in curly
brackets that correspond to fields in TaskMetadataBase will be
substituted. Additionally, {structure} will be substituted with the value
of the atoms_filename field of the task inputs of the task, and {i}
can control the location of the index used to avoid naming conflicts.
(The default is at the end of the filename stem.)
dest = Path()
task_dir = create_task_tree(calc, dest, name_template="calc_{i}")
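To build intuition for how a template such as "calc_{i}" could resolve to a unique directory name, here is a minimal sketch using str.format. The helper next_task_dir is hypothetical and is not how autojob implements name_template. In particular, starting the index at 1 and appending "_{i}" when the placeholder is absent are assumptions.

```python
import tempfile
from pathlib import Path


def next_task_dir(parent, name_template, **fields):
    """Render ``name_template`` with the first index that is not yet taken."""
    if "{i}" not in name_template:
        # Assumption: without an explicit placeholder, the index goes last.
        name_template += "_{i}"
    i = 1
    while True:
        candidate = parent / name_template.format(i=i, **fields)
        if not candidate.exists():
            return candidate
        i += 1


with tempfile.TemporaryDirectory() as tmp:
    parent = Path(tmp)
    first = next_task_dir(parent, "calc_{i}")
    first.mkdir()
    # The first name is now taken, so the index advances.
    second = next_task_dir(parent, "calc_{i}")
    names = (first.name, second.name)

print(names)
```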
You should also notice a number of files in the directory, task_dir.
What are they called? You should notice two files corresponding to the
values of the task_script and calculation_script attributes of the task
inputs and calculation inputs, respectively. Inspect these two files
to understand the logic for the task. In general, it is
recommended that task scripts specify resource requirements, configure the
software environment, and dictate high-level control flow for the task.
For a calculation, the control flow will typically involve running the calculation
script, archiving any results, and executing any auto-restart or workflow
progression logic. (This is how the
default task and calculation script templates
are structured.) For a brief explanation of the purpose
of each of these files, see the How autojob structures directories 💭 page.
Running the Task¶
The task that we have defined can be executed locally, so we can execute our task simply by running the task script in the shell. Navigate to the newly created task directory and execute
bash run.sh
Alternatively, if the task class inherits from ScheduledMixin
(as is the case for Calculation) and the task script specifies
scheduler resources (as is the case for the run.sh.j2 template), then
the task script can be used to submit the task using a scheduler. For
example, with SLURM and sbatch, you can do:
sbatch run.sh
When the task is complete, we should notice several output files. Notably,
we should notice an output atoms file, final.traj, and an output
JSON, archive.json. You can inspect the output atoms object using the
ASE GUI.
ase gui final.traj
Question
What other files are present?
Retrieving Task Outputs¶
Locate and inspect the task archive file, archive.json, in the task directory.
This should be a JSON dictionary with a single entry. The key should
be the task’s task ID, and the value should be a dictionary that can be loaded into a task.
Since we know the type of task that we would like to instantiate, we
can do so from the JSON like so:
import json
from pathlib import Path
from autojob.tasks.calculation import Calculation
with Path("archive.json").open(mode="r", encoding="utf-8") as file:
    data = json.load(file)
# The archive maps a single task ID to the serialized task
calc = Calculation(**next(iter(data.values())))
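The single-entry layout described above can be explored without autojob installed. The sketch below writes a toy archive to a temporary directory and unpacks it. The task ID "task-0001" and the nested keys are placeholders invented for illustration, not the real schema.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical archive contents: one task ID mapped to a task dictionary.
toy_archive = {"task-0001": {"task_metadata": {"label": "My custom H2O task"}}}

with tempfile.TemporaryDirectory() as tmp:
    archive_file = Path(tmp) / "archive.json"
    archive_file.write_text(json.dumps(toy_archive), encoding="utf-8")
    data = json.loads(archive_file.read_text(encoding="utf-8"))

# Guard against archives that unexpectedly hold several tasks.
if len(data) != 1:
    raise ValueError(f"expected one task, found {len(data)}")

task_id, task_dict = next(iter(data.items()))
print(task_id, task_dict["task_metadata"]["label"])
```

Checking the number of entries before unpacking makes the failure explicit if the file turns out to be a combined archive of several tasks.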
If we did not know the type of task contained in archive.json, then
we could retrieve the task outputs using the harvest() function.
from autojob.harvest.harvest import harvest
calc = next(iter(harvest(dir_name=Path(), use_cache=True)))
harvest() will load the outputs of all task directories that it finds.
The dir_name argument to harvest() is the directory in which to search
for tasks whose results should be harvested. All directories containing
task metadata files will be assumed to be task directories. The use_cache
argument instructs autojob to load each task directly from the archive.json.
If, instead, you would prefer that task outputs be loaded fresh from the files in a
directory, set use_cache=False.
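The discovery rule described above (a directory counts as a task directory if it contains a task metadata file) can be sketched with pathlib alone. The helper find_task_dirs and the filename "metadata.json" are hypothetical. The name that autojob actually uses for metadata files is not given here, so it is passed in as a parameter.

```python
import tempfile
from pathlib import Path


def find_task_dirs(root, metadata_filename):
    """Return every directory under ``root`` that holds a metadata file."""
    return sorted(
        path.parent for path in root.rglob(metadata_filename) if path.is_file()
    )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("calc_1", "calc_2"):
        (root / name).mkdir()
        (root / name / "metadata.json").write_text("{}", encoding="utf-8")
    (root / "notes").mkdir()  # no metadata file, so not a task directory
    found = [path.name for path in find_task_dirs(root, "metadata.json")]

print(found)
```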
Note
Under the hood, harvest() uses the PathLoadable interface
when use_cache=False so that
from pathlib import Path
from autojob.harvest.harvest import harvest
calc = next(iter(harvest(dir_name=Path(), use_cache=False)))
is equivalent to
from pathlib import Path
from autojob.tasks.task import Task
completed = Task.from_directory(Path(), magic_mode=True)
Calculation tasks have both task outputs and calculation outputs.
Task outputs are relatively simple; their interface is described by the
TaskOutputsBase class, and they typically only include the output
Atoms object. You can visualize the output atoms like so:
completed.task_outputs.atoms.edit()
How does this structure compare with the input atoms?
The calculation outputs are generally more complex as they can contain
any of the output from a calculation. The data model of calculation outputs
is described in the CalculationOutputs class. Most basically, the
calculation outputs include whether the calculation converged, the system
energy, atomic forces, and results for the calculator, optimizer,
and any analyses that were run. The contents of task and calculation outputs
are populated by adherents to the HarvesterBase protocol. Because
we used the gpaw calculator, the calculator results were harvested with
the harvest_gpaw_results() function.
Side Quest
Inspect completed.calculation_outputs. Did the calculation converge?
What is the final energy? What are the contents of the
completed.calculation_outputs.calculator_results dictionary?
Restarting the Task with Different Inputs¶
Now, suppose that we wanted to re-run that same calculation but with a
tighter convergence tolerance and compare our results. We could manually
edit the my_calc.py script and then re-execute our task, but then we
might lose some of the data in the current directory. We could clone
our current directory and then re-run the task, but then we would also
have to update the metadata file. Avoiding this bookkeeping is the purpose of restart().
restart() accepts the name of a task directory and returns a
2-tuple representing a newly created task. The first element is a
TaskBase instance (whose type depends on that of the previous task),
and the second element is the task directory as a Path.
Calculator parameters (corresponding to calc_params of
CalculationInputs objects)
can be modified with the calc_mods argument. We can restart our
task like so:
from autojob.next.restart import restart
calc_mods = {
    "convergence": {
        "energy": 1e-4,
    },
}
task, new_task_dir = restart(task_dir, calc_mods=calc_mods)
This will create a restart task from the task directory task_dir and
write the new task to a sibling directory of task_dir.
task is a newly created Calculation; new_task_dir is its path.
What is the task ID of the new task? What is the name of the new task directory?
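One way to picture the effect of calc_mods is as a nested update of the original calc_params. The sketch below implements a recursive merge. Whether autojob merges nested dictionaries this way or replaces them wholesale is an assumption here, so inspect the new task's inputs to confirm. The "txt" value is a placeholder for GPAW_LOG.

```python
from copy import deepcopy


def merge_params(base, mods):
    """Return a copy of ``base`` with ``mods`` applied recursively."""
    merged = deepcopy(base)
    for key, value in mods.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Descend so that sibling keys in the nested dict survive.
            merged[key] = merge_params(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


calc_params = {
    "mode": "pw",
    "convergence": {"energy": 1e-3},
    "txt": "gpaw.txt",  # placeholder standing in for GPAW_LOG
}
calc_mods = {"convergence": {"energy": 1e-4}}

new_params = merge_params(calc_params, calc_mods)
print(new_params)
```

The original dictionary is left untouched, which mirrors the goal of keeping the first task's inputs intact while the restarted task receives the modified ones.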
Side Quest
Check out the signature of the restart() function. Try restarting
this task with different calc_mods. Try different naming schemes
with name_template.
Submit all new tasks from the CLI.
bash run.sh
Note
autojob also defines a simplified interface to restart() via the
autojob restart CLI command. From the CLI, autojob restart submits
the new task by default. Submission of the new task to the job queue can
be skipped with the --no-submit option. This may be useful if more
complicated modifications to task inputs are required, such as directly editing
the inputs.json file. For more info, run autojob restart -h from the
command line.
Storing Task Results¶
Once all tasks have completed running, we will create a single archive
in which to store all results. This can be accomplished with harvest()
together with archive(). From the parent directory of the completed
tasks, run:
from autojob.harvest.archive import archive
dir_name = Path()
all_calcs = harvest(dir_name=dir_name, use_cache=True)
archive(tasks=all_calcs)
This will create a single JSON file in the current working directory named
according to SETTINGS.ARCHIVE_FILE (defaults to archive.json).
You can time stamp the archive using the time_stamp argument and write
both JSON and CSV archives by setting archive_mode="both". The filename
stem can be changed using the stem argument.
archives = archive(
    tasks=all_calcs,
    archive_mode="both",
    time_stamp=True,
    stem="all_calcs",
)
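For readers curious about what writing a combined JSON and CSV archive involves, the sketch below does so with the standard library only. It is not autojob's archive(). The helper write_archives, the timestamp format, the placement of the time stamp in the stem, and the record fields are all assumptions chosen for illustration.

```python
import csv
import json
import tempfile
from datetime import datetime
from pathlib import Path


def write_archives(records, dest, stem="archive", time_stamp=False):
    """Write ``records`` (task ID -> flat dict) as JSON and CSV files."""
    if time_stamp:
        # Assumption: the time stamp is appended to the filename stem.
        stem = f"{stem}_{datetime.now():%Y%m%dT%H%M%S}"

    json_file = dest / f"{stem}.json"
    json_file.write_text(json.dumps(records, indent=2), encoding="utf-8")

    # One CSV row per task; columns are the union of all record keys.
    columns = sorted({key for record in records.values() for key in record})
    csv_file = dest / f"{stem}.csv"
    with csv_file.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["task_id", *columns])
        writer.writeheader()
        for task_id, record in records.items():
            writer.writerow({"task_id": task_id, **record})
    return [json_file, csv_file]


# Placeholder records; the IDs and field names are invented for illustration.
records = {
    "task-0001": {"label": "first run", "converged": True},
    "task-0002": {"label": "restart", "converged": False},
}

with tempfile.TemporaryDirectory() as tmp:
    written = write_archives(records, Path(tmp), stem="all_calcs", time_stamp=True)
    suffixes = [path.suffix for path in written]
    with written[1].open(newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))

print(suffixes, len(rows))
```

The JSON file preserves nested structure, whereas the CSV file is convenient for spreadsheets but only suits flat records, which is one reason to keep both.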
In principle, you can now save these archive files (maybe using git),
distribute them, extract the data, and perform any further analyses. This may
involve extracting final energies to
construct free energy diagrams, combining/permuting structures, or using
the data to construct additional tasks. In any case, you are now well-equipped
to work with tasks within autojob!