Spalloc: SpiNNaker machine allocation client¶
Spalloc is a Python library and set of command-line programs for requesting SpiNNaker machines from a spalloc server.
Quick-start¶
Step 1: Install spalloc:
$ pip install spalloc
Step 2: Write a configuration file indicating your email address and the spalloc
server’s address (run python -m spalloc.config
to discover what to call
your config file on your machine):
[spalloc]
hostname = my_server
owner = jdh@cs.man.ac.uk
Step 3: Request a system using the command-line interface, e.g. a three-board machine:
$ spalloc 3
…or request one from Python
…
>>> from spalloc_client import Job
>>> with Job(3) as j:
... my_boot(j.hostname, j.width, j.height)
... my_application(j.hostname)
Note
When a machine is allocated it is powered on but not booted: that is up to
you. If Rig is installed on your system the spalloc
commandline tool
provides a --boot
option which will boot the allocated machine for you.
Note
The dimensions of a machine may not be what you’re used to and may change from allocation to allocation, even for the same number of boards.
Note
When you’re finished with the boards you were allocated, pressing enter (or
exiting the with
block in the Python version) will automatically shut
them down and allow them to be used by others.
Configuration file format and defaults¶
The spalloc command-line tool and Python library determine their default configuration options from a spalloc configuration file if present.
Note
Use of spalloc’s configuration files is entirely optional as all configuration options may be presented as arguments to commands/methods at runtime.
By default, configuration files are read (in ascending order of priority) from
a system-wide configuration directory (e.g. /etc/xdg/spalloc
), user
configuration file (e.g. $HOME/.config/spalloc
) and finally the current
working directory (in a file named .spalloc
). The default search paths on
your system can be discovered by running:
$ python -m spalloc.config
Config files use the Python configparser
INI-format with a single
section, spalloc
, like so:
[spalloc]
hostname = localhost
owner = jdh@cs.man.ac.uk
Though most users will only wish to specify the hostname
and owner
options (as in the example above), the following enumerates the complete set of
options available (and the default value).
hostname
The hostname or IP address of the spalloc-server to connect to.
owner
The name of the owner of created jobs. By convention the user’s email address.
port
The port used by the spalloc-server. (Default: 22244)
keepalive
The keepalive interval, in seconds, to use when creating jobs. If the spalloc-server does not receive a keepalive command for this interval the job is automatically destroyed. May be set to None to disable this feature. (Default: 60.0)
reconnect_delay
The time, in seconds, to wait between reconnection attempts to the server if disconnected. (Default 5.0)
timeout
The time, in seconds, to wait before giving up waiting for a response from the server or None to wait forever. (Default 5.0)
machine
The name of a specific machine on which to run all jobs or None to use any available machine. (Default: None)
tags
The set of tags, comma seperated, to require a machine to have when allocating jobs. (Default: default)
min_ratio
Require that when allocating a number of boards the allocation is at least as square as this aspect ratio. (Default: 0.333)
max_dead_boards
The maximum number of dead boards which may be present in an allocated set of boards or None to allow any number of dead boards. (Default: 0)
max_dead_links
The maximum number of dead links which may be present in an allocated set of boards or None to allow any number of dead links. (Default: None)
require_torus
If True, require that an allocation have wrap-around links. This typically requires the allocation of a whole machine. If False, wrap-around links may or may-not be present in allocated machines. (Default: False)
spalloc
: Allocate SpiNNaker machines¶
A command-line utility for creating jobs.
Note
In the examples below, it is assumed that the spalloc server hostname and a
suitable owner name have been specified in a config
file.
Basic usage¶
By default, the spalloc
command allocates a job according to the
command-line specification and then waits for boards to be allocated and
powered on.
Note
By default, allocated machines are powered on but not booted. If
Rig
is installed, spalloc
provides a --boot
option which also boots
the allocated machine once it has been powered on. This dependency can be
installed using the [boot]
option at install time for spalloc.
The spalloc
command can be called in one of the following styles though
most users will probably only require the first two.
Invocation |
Allocation |
---|---|
|
A single SpiNN-5 board |
|
A machine with at least 5 boards |
|
A 4x2 triad machine. |
|
A single SpiNN-5 board at logical position (3, 4, 0) |
A range of additional command-line arguments are available to control various
aspects of Job allocation, run spalloc --help
for a complete listing.
Wrapping other commands¶
The spalloc
command can alternatively wrap an existing command, calling it
once a SpiNNaker machine is allocated and cleaning up the job when the command
exits:
$ spalloc 24 -c "rig-boot {} {w} {h} && python my_app.py {}"
The example above attempts to allocate a 24-board machine and, once allocated and powered on, calls the command above, with the arguments in curly braces substituted for details of the allocated machine.
The following substitutions are available:
Token |
Substitution |
---|---|
|
Chip 0, 0’s hostname |
|
Chip 0, 0’s hostname |
|
Width of the system (in chips) |
|
Width of the system (in chips) |
|
Height of the system (in chips) |
|
Height of the system (in chips) |
|
Filename of a CSV of Ethernet IPs |
|
The job ID |
Ethernet-connected chip hostname CSV Format¶
Hostnames for all Ethernet-connected SpiNNaker chips in a machine are provided
in a CSV file with three columns:: x, y and hostname. The CSV file is newline
(\n
) delimited and the first row is a header.
Disconnecting and resuming jobs¶
Warning
This functionality is intended for advanced users only.
By default, when the spalloc
command exits, the job will be destroyed and
any allocated boards freed. This behaviour can be disabled with the
--no-destroy
argument, leaving the job allocated after the command exits.
Such a job may be ‘resumed’ by calling spalloc
with the --resume
[JOB_ID]
option.
Note that by default, jobs require a ‘keepalive’ message to be sent to the
server at a regular interval. While the spalloc
command is running, these
messages are sent automatically but after exiting the commands are no longer
sent. Adding the --keepalive -1
option when creating a job disables this.
spalloc-job
: Manage and reset existing jobs and their boards¶
Command-line administrative job management interface.
spalloc-job
may be called with a job ID, or if no arguments supplied your
currently running job is shown by default. Various actions may be taken and
each is described below.
Displaying job information¶
By default, the command displays all known information about a job.
The --watch
option may be added which will cause the output to be updated
in real-time as a job’s state changes. For example:
$ spalloc-job --watch
Controlling board power¶
The boards allocated to a job may be reset or powered on/off on demand (by
anybody, at any time) by adding the --power-on
, --power-off
or
--reset
options. For example:
$ spalloc-job --reset
Note
This command blocks until the action is completed.
Listing board IP addresses¶
The hostnames of Ethernet-attached chips can be listed in CSV format by adding the –ethernet-ips argument:
$ spalloc-job --ethernet-ips
x,y,hostname
0,0,192.168.1.97
0,12,192.168.1.105
4,8,192.168.1.129
4,20,192.168.1.137
8,4,192.168.1.161
8,16,192.168.1.169
Destroying/Cancelling Jobs¶
Jobs can be destroyed (by anybody, at any time) using the --destroy
option
which optionally accepts a human-readable explanation:
$ spalloc-job --destroy "Your job is taking too long..."
Warning
That this “super power” should be used carefully since the user may not be notified that their job was destroyed and the first sign of this will be their boards being powered down and re-partitioned ready for another user.
spalloc-ps
: List all running jobs¶
An administrative command-line process listing utility.
By default, the spalloc-ps
command lists all running and queued jobs. For
a real-time monitor of queued and running jobs, the --watch
option may be
added.
This list may be filtered by owner or machine with the --owner
and
--machine
arguments.
spalloc-machine
: List available machines and their running jobs¶
Command-line administrative machine management interface.
When called with no arguments the spalloc-machine
command lists all
available machines and a summary of their current load.
If a specific machine is given as an argument, the current allocation of jobs to machines is displayed:
Adding the --detailed
option displays additional information about jobs
running on a machine.
If the --watch
option is given, the information displayed is updated in
real-time.
spalloc-where-is
: Query the server for the physical/logical locations of boards/chips¶
Command-line tool to find out where a particular chip or board resides.
The spalloc-where-is
command allows you to query boards by coordinate, by
physical location, by chip or by job. In response to a query, a standard set of
information is displayed as shown in the example below:
$ spalloc-where-is --job-chip 24 14, 3
Machine: my-machine
Physical Location: Cabinet 2, Frame 4, Board 7
Board Coordinate: (3, 4, 0)
Machine Chip Coordinates: (38, 51)
Coordinates within board: (2, 3)
Job using board: 24
Coordinates within job: (14, 3)
In this example we ask, ‘where is chip (14, 3) in job 24’? We discover that:
The chip is the machine named ‘my-machine’ on the board in cabinet 2, frame 4, board 7.
This board’s logical board coordinates are (3, 4, 0). These logical coordinates may be used to specifically request this board from Spalloc in the future.
If ‘my-machine’ were booted as a single large machine, the chip we queried would be chip (38, 51). This may be useful for cross-referencing with diagrams produced by SpiNNer.
The chip in question is chip (2, 3) its board. This may be useful when reporting faulty chips/replacing boards..
The job currently running on the board has ID 24. Obviously in this example we already knew this but this may be useful when querying by board.
Finally, we’re told that the queried chip has the coordinates (14, 3) in the machine allocated to job 24. Again, this information may be more useful when querying by board.
To query by logical board coordinate:
spalloc-where-is --board MACHINE X Y Z
To query by physical board location:
spalloc-where-is --physical MACHINE CABINET FRAME BOARD
To query by chip coordinate (as if the machine were booted as one large machine):
spalloc-where-is --chip MACHINE X Y
To query by chip coordinate of chips allocated to a job:
spalloc-where-is --job-chip JOB_ID X Y
Python library¶
Spalloc provides a pair of Python libraries which enable basic high- and
low-level interaction with a spalloc server. The high-level
Job
interface makes the task of creating jobs
(and keeping them alive) straight-forward but only facilitates basic job
management functions such as resetting boards and getting their IP addresses.
The low-level ProtocolClient
provides an RPC-like
interface to the spalloc server enabling any spalloc server command to be sent.
Note
These libraries are intentionally simplistic and may be unsuitable for very advanced applications. In such instances, users are encouraged to implement the spalloc server protocol in a manner better suited to their specific use-case.
High level interface (spalloc_client.Job
)¶
- class spalloc_client.Job(*args, **kwargs)[source]¶
A high-level interface for requesting and managing allocations of SpiNNaker boards.
Constructing a
Job
object connects to a spalloc-server and requests a number of SpiNNaker boards. See theconstructor
for details of the types of requests which may be made. The job object may then be used to monitor the state of the request, control the boards allocated and determine their IP addresses.In its simplest form, a
Job
can be used as a context manager like so:>>> from spalloc_client import Job >>> with Job(6) as j: ... my_boot(j.hostname, j.width, j.height) ... my_application(j.hostname)
In this example a six-board machine is requested and the
with
context is entered once the allocation has been made and the allocated boards are fully powered on. When control leaves the block, the job is destroyed and the boards shut down by the server ready for another job.For more fine-grained control, the same functionality is available via various methods:
>>> from spalloc_client import Job >>> j = Job(6) >>> j.wait_until_ready() >>> my_boot(j.hostname, j.width, j.height) >>> my_application(j.hostname) >>> j.destroy()
Note
More complex applications may wish to log the following attributes of their job to support later debugging efforts:
job.id
– May be used to query the state of the job and find out its fate if cancelled or destroyed. Thespalloc-job
command can be used to discover the state/fate of the job andspalloc-where-is
may be used to find out what boards problem chips reside on.job.machine_name
andjob.boards
together give a complete record of the hardware used by the job. Thespalloc-where-is
command may be used to find out the physical locations of the boards used.
Job
objects have the following attributes which describe the job and its allocated machines:- Attributes:
- job.idint or None
The job ID allocated by the server to the job.
- job.state
JobState
The current state of the job.
- job.powerbool or None
If boards have been allocated to the job, are they on (True) or off (False). None if no boards are allocated to the job.
- job.reasonstr or None
If the job has been destroyed, gives the reason (which may be None), or None if the job has not been destroyed.
- job.hostnamestr or None
The hostname of the SpiNNaker chip at (0, 0), or None if no boards have been allocated to the job.
- job.connections{(x, y): hostname, …} or None
The hostnames of all Ethernet-connected SpiNNaker chips, or None if no boards have been allocated to the job.
- job.widthint or None
The width of the SpiNNaker network in chips, or None if no boards have been allocated to the job.
- job.heightint or None
The height of the SpiNNaker network in chips, or None if no boards have been allocated to the job.
- job.machine_namestr or None
The name of the machine the boards are allocated in, or None if not yet allocated.
- job.boards[[x, y, z], …] or None
The logical coordinates allocated to the job, or None if not yet allocated.
- __init__(*args, **kwargs)[source]¶
Request a SpiNNaker machine.
A
Job
is constructed in one of the following styles:>>> # Any single (SpiNN-5) board >>> Job() >>> Job(1) >>> # Any machine with at least 4 boards >>> Job(4) >>> # Any 7-or-more board machine with an aspect ratio at least as >>> # square as 1:2 >>> Job(7, min_ratio=0.5) >>> # Any 4x5 triad segment of a machine (may or may-not be a >>> # torus/full machine) >>> Job(4, 5) >>> # Any torus-connected (full machine) 4x2 machine >>> Job(4, 2, require_torus=True) >>> # Board x=3, y=2, z=1 on the machine named "m" >>> Job(3, 2, 1, machine="m") >>> # Keep using (and keeping-alive) an existing allocation >>> Job(resume_job_id=123)
Once finished with a Job, the
destroy()
(or in unusual applicationsJob.close()
) method must be called to destroy the job, close the connection to the server and terminate the background keep-alive thread. Alternatively, a Job may be used as a context manager which automatically callsdestroy()
on exiting the block:>>> with Job() as j: ... # ...for example... ... my_boot(j.hostname, j.width, j.height) ... my_application(j.hostname)
The following keyword-only parameters below are used both to specify the server details as well as the job requirements. Most parameters default to the values supplied in the local
config
file allowing usage as in the examples above.- Parameters:
- hostnamestr
Required. The name of the spalloc server to connect to. (Read from config file if not specified.)
- portint
The port number of the spalloc server to connect to. (Read from config file if not specified.)
- reconnect_delayfloat
Number of seconds between attempts to reconnect to the server. (Read from config file if not specified.)
- timeoutfloat or None
Timeout for waiting for replies from the server. If None, will keep trying forever. (Read from config file if not specified.)
- config_filenames[str, …]
If given must be a list of filenames to read configuration options from. If not supplied, the default config file locations are searched. Set to an empty list to prevent using values from config files.
- Other Parameters:
- resume_job_idint or None
If supplied, rather than creating a new job, take on an existing one, keeping it alive as required by the original job. If this argument is used, all other requirements are ignored.
- ownerstr
Required. The name of the owner of the job. By convention this should be your email address. (Read from config file if not specified.)
- keepalivefloat or None
The number of seconds after which the server may consider the job dead if this client cannot communicate with it. If None, no timeout will be used and the job will run until explicitly destroyed. Use with extreme caution. (Read from config file if not specified.)
- machinestr or None
Specify the name of a machine which this job must be executed on. If None, the first suitable machine available will be used, according to the tags selected below. Must be None when tags are given. (Read from config file if not specified.)
- tags[str, …] or None
The set of tags which any machine running this job must have. If None is supplied, only machines with the “default” tag will be used. If machine is given, this argument must be None. (Read from config file if not specified.)
- min_ratiofloat
The aspect ratio (h/w) which the allocated region must be ‘at least as square as’. Set to 0.0 for any allowable shape, 1.0 to be exactly square etc. Ignored when allocating single boards or specific rectangles of triads.
- max_dead_boardsint or None
The maximum number of broken or unreachable boards to allow in the allocated region. If None, any number of dead boards is permitted, as long as the board on the bottom-left corner is alive. (Read from config file if not specified.)
- max_dead_linksint or None
The maximum number of broken links allow in the allocated region. When require_torus is True this includes wrap-around links, otherwise peripheral links are not counted. If None, any number of broken links is allowed. (Read from config file if not specified.).
- require_torusbool
If True, only allocate blocks with torus connectivity. In general this will only succeed for requests to allocate an entire machine. Must be False when allocating boards. (Read from config file if not specified.)
- __enter__()[source]¶
Convenience context manager for common case where a new job is to be created and then destroyed once some code has executed.
Waits for machine to be ready before the context enters and frees the allocation when the context exits.
Example:
>>> from spalloc_client import Job >>> with Job(6) as j: ... my_boot(j.hostname, j.width, j.height) ... my_application(j.hostname)
- destroy(reason=None)[source]¶
Destroy the job and disconnect from the server.
- Parameters:
- reasonstr or None
Optional. Gives a human-readable explanation for the destruction of the job.
- close()[source]¶
Disconnect from the server and stop keeping the job alive.
Warning
This method does not free the resources allocated by the job but rather simply disconnects from the server and ceases sending keep-alive messages. Most applications should use
destroy()
instead.
- set_power(power)[source]¶
Turn the boards allocated to the job on or off.
Does nothing if the job has not yet been allocated any boards.
The
wait_until_ready()
method may be used to wait for the boards to fully turn on or off.- Parameters:
- powerbool
True to power on the boards, False to power off. If the boards are already turned on, setting power to True will reset them.
- reset()[source]¶
Reset (power-cycle) the boards allocated to the job.
Does nothing if the job has not been allocated.
The
wait_until_ready()
method may be used to wait for the boards to fully turn on or off.
- property state¶
The current state of the job.
- property power¶
Are the boards powered/powering on or off?
- property reason¶
For what reason was the job destroyed (if any and if destroyed).
- property connections¶
The list of Ethernet connected chips and their IPs.
- Returns:
- {(x, y): hostname, …} or None
- property hostname¶
The hostname of chip 0, 0 (or None if not allocated yet).
- property width¶
The width of the allocated machine in chips (or None).
- property height¶
The height of the allocated machine in chips (or None).
- property machine_name¶
The name of the machine the job is allocated on (or None).
- property boards¶
The coordinates of the boards allocated for the job (or None).
- wait_for_state_change(old_state, timeout=None)[source]¶
Block until the job’s state changes from the supplied state.
- wait_until_ready(timeout=None)[source]¶
Block until the job is allocated and ready.
- Parameters:
- timeoutfloat or None
The number of seconds to wait before timing out. If None, wait forever.
- Raises:
- StateChangeTimeoutError
If the timeout expired before the ready state was entered.
- JobDestroyedError
If the job was destroyed before becoming ready.
- where_is_machine(chip_x, chip_y)[source]¶
Locates and returns cabinet, frame, board for a given chip in a machine allocated to this job.
- Parameters:
chip_x – chip x location
chip_y – chip y location
- Returns:
tuple of (cabinet, frame, board)
- __weakref__¶
list of weak references to the object (if defined)
- class spalloc_client.JobState(value)[source]¶
All the possible states that a job may be in.
- unknown = 0¶
The job ID requested was not recognised.
- queued = 1¶
The job is waiting in a queue for a suitable machine.
- power = 2¶
The boards allocated to the job are currently being powered on or powered off.
- ready = 3¶
The job has been allocated boards and the boards are not currently powering on or powering off.
- destroyed = 4¶
The job has been destroyed.
Lower level interface (spalloc_client.ProtocolClient
)¶
- class spalloc_client.ProtocolClient(hostname, port=22244, timeout=None)[source]¶
A simple (blocking) client implementation of the spalloc-server protocol.
This minimal implementation is intended to serve both simple applications and as an example implementation of the protocol for other applications. This implementation simply implements the protocol, presenting an RPC-like interface to the server. For a higher-level interface built on top of this client, see
spalloc.Job
.Usage examples:
# Connect to a spalloc_server with ProtocolClient("hostname") as c: # Call commands by name print(c.call("version")) # '0.1.0' # Call commands as if they were methods print(c.version()) # '0.1.0' # Wait an event to be received print(c.wait_for_notification()) # {"jobs_changed": [1, 3]} # Done!
- __init__(hostname, port=22244, timeout=None)[source]¶
Define a new connection.
Note
Does not connect to the server until
connect()
is called.- Parameters:
- hostnamestr
The hostname of the server.
- portstr or int
The port to use (default: 22244).
- connect(timeout=None)[source]¶
(Re)connect to the server.
- Raises:
- OSError, IOError
If a connection failure occurs.
- call(name, *args, **kwargs)[source]¶
Send a command to the server and return the reply.
- Parameters:
- namestr
The name of the command to send.
- timeoutfloat or None
The number of seconds to wait before timing out or None if this function should wait forever. (Default: None)
- Returns:
- object
The object returned by the server.
- Raises:
- ProtocolTimeoutError
If a timeout occurs.
- ProtocolError
If the connection is unavailable or is closed.
- wait_for_notification(timeout=None)[source]¶
Return the next notification to arrive.
- Parameters:
- namestr
The name of the command to send.
- timeoutfloat or None
The number of seconds to wait before timing out or None if this function should try again forever.
If negative only responses already-received will be returned. If no responses are available, in this case the function does not raise a ProtocolTimeoutError but returns None instead.
- Returns:
- object
The notification sent by the server.
- Raises:
- ProtocolTimeoutError
If a timeout occurs.
- ProtocolError
If the socket is unusable or becomes disconnected.
- __weakref__¶
list of weak references to the object (if defined)