This tutorial introduces seismological data handling with the squirrel command line tool.
The squirrel command line tool is a front-end to the Squirrel data
access infrastructure. In this tutorial,
we will download seismic waveforms, organize them into a local file structure,
and investigate various properties of the assembled dataset.
For an introduction on how to use the Squirrel framework in your own code, head on over to Examples: Squirrel powered data access.
The Squirrel framework contains functionality to download seismic waveforms and
station metadata from FDSN web services. With an appropriate dataset
configuration this can happen in a just-in-time fashion during processing.
However, sometimes we may prefer to completely download a dataset in advance.
This is one of the tasks for which the
squirrel command line tool has been created.
In this part of the tutorial we will download a few days of long period seismic waveforms from the BGR’s FDSN web service. We have selected a time window including the ground motions of the 2021 Mw 8.2 Alaska Earthquake and some of its aftershocks as they were recorded on German broad-band seismometers. Of course, you may choose a different time window, set of stations or FDSN web service, but please be responsible and do not download huge amounts of data just for testing.
Our first step is to create a local Squirrel environment with
squirrel init, so that all the downloaded files
as well as the database are stored in the current directory (our project
directory) under .squirrel/. This will make it easier to clean up when
we are done (
rm -rf .squirrel/). If we omit this step, the user’s globally
shared Squirrel environment (
~/.pyrocko/cache/squirrel/) is used.
Create local environment (optional):
$ squirrel init
To use a remote data source we can create a dataset description file and pass
this to the
--dataset option of the various
squirrel subcommands. Examples of such dataset
description files are provided by the
squirrel template command. Running this command
without any further arguments will output a brief list of the available
examples. By chance there is one for accessing all LH channels from BGR’s FDSN
web service: bgr-gr-lh.dataset. We can save the dataset description file with:
$ squirrel template bgr-gr-lh.dataset -w
squirrel:psq.cli.template - INFO - File written: bgr-gr-lh.dataset.yaml
The dataset description is a nicely commented YAML file and we could modify it to our liking:
--- !squirrel.Dataset

# All file paths given below are treated relative to the location of this
# configuration file. Here we may give a common prefix. For example, if the
# configuration file is in the sub-directory 'PROJECT/config/', set it to '..'
# so that all paths are relative to 'PROJECT/'.
path_prefix: '.'

# Data sources to be added (LocalData, FDSNSource, CatalogSource, ...)
sources:
- !squirrel.FDSNSource

  # URL or alias of FDSN site.
  site: bgr

  # Uncomment to let metadata expire in 10 days:
  #expires: 10d

  # Waveforms can be optionally shared with other FDSN client configurations,
  # so that data is not downloaded multiple times. The downside may be that in
  # some cases more data than expected is available (if data was previously
  # downloaded for a different application).
  #shared_waveforms: true

  # FDSN query arguments to make metadata queries.
  # See http://www.fdsn.org/webservices/fdsnws-station-1.1.pdf
  # Time span arguments should not be added here, because they are handled
  # automatically by Squirrel.
  query_args:
    network: 'GR'
    channel: 'LH?'
Expert users can get a non-commented version of the file by adding
brief to the
squirrel template command.
Next, we must update the station meta-information for the time interval of
interest. This is done with the
squirrel update command. Channel information
intersecting with the given time interval will be downloaded (Fig. 1):
$ squirrel update --dataset bgr-gr-lh.dataset.yaml --tmin 2021-07-28 --tmax 2021-08-01
[...]
squirrel update:psq.client.fdsn - INFO - FDSN "bgr" metadata: querying...
squirrel update:psq.client.fdsn - INFO - FDSN "bgr" metadata: new (expires: never)
[...]
squirrel update:psq.cli.update - INFO - Squirrel stats:
  Number of files: 2
  Total size of known files: 87 kB
  Number of index nuts: 160
  Available content kinds:
    channel: 120 1991-09-01 00:00:00.000 - <none>
    station: 40 <none> - <none>
  Available codes:
    GR.AHRW..LHE GR.AHRW..LHN GR.AHRW..LHZ GR.AHRW.*
    GR.ASSE..LHE GR.ASSE..LHN GR.ASSE..LHZ GR.ASSE.*
    GR.BFO..LHE GR.BFO..LHN
    [140 more]
    GR.UBR..LHZ GR.UBR.*
    GR.WET..LHE GR.WET..LHN GR.WET..LHZ GR.WET.*
    GR.ZARR..LHE GR.ZARR..LHN GR.ZARR..LHZ GR.ZARR.*
  Sources:
    client:fdsn:b3ad21f2a866c178889cfdf4f493eba588a59543
  Operators: <none>
After fetching the channel information from the FDSN web service, the command prints a brief overview of the contents currently available in our data collection.
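To illustrate what "intersecting with the given time interval" means for channel epochs, here is a minimal Python sketch. It is not Squirrel's implementation; the function name and the use of None for an open-ended epoch (shown as <none> in the output above) are assumptions made for this example.

```python
from datetime import date


def intersects(epoch_tmin, epoch_tmax, tmin, tmax):
    """Check whether a channel epoch overlaps the queried time span.

    A value of None stands for an open end (hypothetical convention).
    """
    starts_before_end = epoch_tmin is None or tmax is None or epoch_tmin < tmax
    ends_after_start = epoch_tmax is None or tmin is None or epoch_tmax > tmin
    return starts_before_end and ends_after_start


# Time span used in this tutorial.
query = (date(2021, 7, 28), date(2021, 8, 1))

# A channel epoch that is still open, and a made-up one that ended in 2010.
open_epoch = (date(1991, 9, 1), None)
closed_epoch = (date(1991, 9, 1), date(2010, 1, 1))

print(intersects(*open_epoch, *query))    # True
print(intersects(*closed_epoch, *query))  # False
```

Only epochs for which this kind of test succeeds need to be considered for the requested time span.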
If we run the update command a second time, Squirrel informs us that cached metadata has been used:
$ squirrel update --dataset bgr-gr-lh.dataset.yaml --tmin 2021-07-28 --tmax 2021-08-01
[...]
squirrel update:psq.client.fdsn - INFO - FDSN "bgr" metadata: using cached (expires: never)
[...]
New queries are made only if we call the update command with a time span that
is not yet known. It is also possible to set an expiration date for metadata from
this data source in the dataset configuration (expires).
By default, only channel information is made available with
update. If we later need the instrument response information of the seismic
stations of the data selection, we can add the
--responses option to
squirrel update (Fig. 2):
$ squirrel update --responses --dataset bgr-gr-lh.dataset.yaml --tmin 2021-07-28 --tmax 2021-08-01
[...]
  Available content kinds:
    channel: 120 1991-09-01 00:00:00.000 - <none>
    response: 150 1991-01-01 00:00:00.000 - <none>
    station: 40 <none> - <none>
[...]
So now we also have response information containing details about how the seismometers convert physical ground motion into measurement records.
Next we must give permission to Squirrel to download data given certain
constraints. Squirrel will only download waveform data when it has a so-called
promise for a given time span and channel. These promises must be explicitly
created with the
--promises option of
squirrel update. We are only interested in
vertical component seismograms at this point, so we restrict promise creation
to channels ending in ‘Z’ (Fig. 3):
$ squirrel update --promises --dataset bgr-gr-lh.dataset.yaml --tmin 2021-07-28 --tmax 2021-08-01 --codes '*.*.*.??Z'
[...]
  Available content kinds:
    channel: 120 1991-09-01 00:00:00.000 - <none>
    station: 40 <none> - <none>
    waveform_promise: 40 2021-07-28 00:00:00.000 - 2021-08-01 00:00:00.000
[...]
Why do we need a concept involving “promises”, you may ask? Well, besides giving us a tight leash on what Squirrel will eventually download, it solves a bookkeeping problem. Normally, when a promise is resolved and the download succeeds, the promise is simply removed. When the download fails because of a temporary problem (e.g. connectivity), the promise is kept so that the download can be tried again later. If, however, it fails permanently, maybe because the waveform is not available on the server, the promise is deleted, so that we do not repeatedly query the server for non-existent data. Finally, if we want to freeze the dataset, we can just remove all remaining promises and no further download attempts will be made. We think that these benefits outweigh the conceptual complexity added with the promises.
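The bookkeeping rules described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Squirrel's actual code: the Outcome enum, the resolve_promises function and the tuple representation of a promise are all hypothetical names invented for this example.

```python
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    TEMPORARY_FAILURE = "temporary failure"  # e.g. connectivity problem
    PERMANENT_FAILURE = "permanent failure"  # e.g. data not on the server


def resolve_promises(promises, try_download):
    """Return the promises that remain after one download pass.

    promises: list of (codes, tmin, tmax) tuples (hypothetical format).
    try_download: callable returning an Outcome for a given promise.
    """
    remaining = []
    for promise in promises:
        outcome = try_download(promise)
        if outcome is Outcome.TEMPORARY_FAILURE:
            # Keep the promise so that the download can be retried later.
            remaining.append(promise)
        # On success or permanent failure, the promise is dropped.
    return remaining


promises = [
    ("GR.BFO..LHZ", "2021-07-28", "2021-08-01"),
    ("GR.WET..LHZ", "2021-07-28", "2021-08-01"),
    ("GR.UBR..LHZ", "2021-07-28", "2021-08-01"),
]

# Made-up download results, one for each of the three cases.
fake_results = {
    "GR.BFO..LHZ": Outcome.SUCCESS,
    "GR.WET..LHZ": Outcome.TEMPORARY_FAILURE,
    "GR.UBR..LHZ": Outcome.PERMANENT_FAILURE,
}

remaining = resolve_promises(promises, lambda promise: fake_results[promise[0]])
print(remaining)
```

In this toy run, only the promise whose download failed temporarily survives. Freezing the dataset would correspond to discarding the remaining list.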
After setting up the promises, to actually download the waveforms, we can now
use the squirrel summon command (Fig. 4):
$ squirrel summon --dataset bgr-gr-lh.dataset.yaml --tmin 2021-07-28 --tmax 2021-08-01
Finally, let’s have a look at the waveforms. We can use an experimental Squirrel-powered version of the Snuffler application to interactively explore the dataset (Fig. 5):
$ squirrel snuffler --dataset bgr-gr-lh.dataset.yaml
The downloaded waveforms include the signals from an Mw 8.2 earthquake which occurred on 2021-07-29 at 06:15 UTC (Fig. 6).
Waveforms are always downloaded in blocks of reasonable size, therefore the
downloaded time frame may be slightly larger than the requested time span. The
downloaded dataset can be incrementally extended by running
squirrel update and
squirrel summon multiple times. Only missing
data blocks are downloaded when running
squirrel summon. Other waveforms available
through the current Squirrel data collection are also considered to avoid
redundant downloads.
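The following Python sketch shows why block-wise downloading can widen the requested time span and how only missing blocks need to be fetched. The block length of 3600 s and both function names are assumptions made for illustration; Squirrel chooses its own block sizes.

```python
def snap_to_blocks(tmin, tmax, block_length):
    """Widen [tmin, tmax) so that both ends fall on block boundaries.

    All arguments are integer seconds.
    """
    tmin_block = (tmin // block_length) * block_length
    tmax_block = -(-tmax // block_length) * block_length  # ceiling division
    return tmin_block, tmax_block


def missing_blocks(tmin, tmax, block_length, have):
    """List start times of blocks in the snapped span which are not in `have`."""
    tmin_block, tmax_block = snap_to_blocks(tmin, tmax, block_length)
    starts = range(tmin_block, tmax_block, block_length)
    return [t for t in starts if t not in have]


BLOCK_LENGTH = 3600  # hypothetical block length in seconds

# The requested span is not aligned to block boundaries, so it gets widened.
span = snap_to_blocks(1000, 8000, BLOCK_LENGTH)
print(span)

# The block starting at 3600 s is already available, so it is skipped.
todo = missing_blocks(1000, 8000, BLOCK_LENGTH, have={3600})
print(todo)
```

Running the second call again after the two missing blocks have been added to `have` would return an empty list, which mirrors the incremental behaviour described above.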
So far the waveforms have been downloaded into a special cache directory maintained by Squirrel. Using the data from there is useful if we later want to extend the dataset. However, sometimes we want full control and prefer to create our own waveform archive in a portable form.
To copy the data downloaded in the previous section into a handy directory
structure, we can use the
squirrel jackseis command. With its
--out-sds-path option, a standard SDS data directory with
day-files in MSEED format is created:
$ squirrel jackseis --dataset bgr-gr-lh.dataset.yaml --out-sds-path data/sds
$ tree data/  # Use `ls`, if `tree` is not installed.
data/
└── sds
    └── 2021
        └── GR
            ├── BFO
            │   └── LHZ.D
            │       ├── GR.BFO..LHZ.D.2021.208
            │       ├── GR.BFO..LHZ.D.2021.209
            │       ├── GR.BFO..LHZ.D.2021.210
            │       ├── GR.BFO..LHZ.D.2021.211
            │       ├── GR.BFO..LHZ.D.2021.212
            │       └── GR.BFO..LHZ.D.2021.213
            ├── ...
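The naming scheme visible in the listing above (year, network, station, channel plus data type, then a file name ending in year and day of year) can be reproduced with a short Python sketch. The function sds_path is a hypothetical helper written for this illustration and is not part of Squirrel.

```python
from datetime import date


def sds_path(net, sta, loc, cha, day, data_type="D"):
    """Build the relative path of an SDS day-file for the given channel."""
    year = day.year
    day_of_year = day.timetuple().tm_yday
    filename = f"{net}.{sta}.{loc}.{cha}.{data_type}.{year}.{day_of_year:03d}"
    return f"{year}/{net}/{sta}/{cha}.{data_type}/{filename}"


# First day of the time window requested in this tutorial.
path = sds_path("GR", "BFO", "", "LHZ", date(2021, 7, 28))
print(path)  # 2021/GR/BFO/LHZ.D/GR.BFO..LHZ.D.2021.209
```

The empty location code explains the two consecutive dots in the file names.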
Station metadata is exported when adding the
--out-meta-path option to
squirrel jackseis. By default, this exports the
metadata in StationXML format to the given file path:
$ squirrel jackseis --dataset bgr-gr-lh.dataset.yaml --out-meta-path meta/stations.xml
We will use the dataset consisting of the waveforms in
data/sds and the
station meta-data in
meta/stations.xml as a “local dataset” in the
following sections.
To inspect some local data holdings, we can use the Snuffler application by calling
squirrel snuffler. Files and directories given
with the --add option are made available. File formats are usually
autodetected and directories are recursively scanned for any readable files.
To look at the dataset that we have created in the previous section of the tutorial, use:
$ squirrel snuffler --add data/sds meta/stations.xml
The --add option is part of a group of standardized options to configure the run-time data collection of
Squirrel based programs. If we find ourselves specifying the same
file paths over and over again, it may be a good idea to tie them together in a
dataset description file. An example of such a file for local datasets can be
obtained with squirrel template local.dataset. For a nicely organized
project directory, we may want to place the dataset description file into a
config subdirectory:
$ mkdir config
$ squirrel template local.dataset > config/alaska.dataset.yaml
$ nano config/alaska.dataset.yaml  # or use your favourite text editor
Let’s modify the file so that our precious waveforms and metadata are found:
--- !squirrel.Dataset

# All file paths given below are treated relative to the location of this
# configuration file. Here we may give a common prefix. For example, if the
# configuration file is in the sub-directory 'PROJECT/config/', set it to '..'
# so that all paths are relative to 'PROJECT/'.
path_prefix: '..'

# Data sources to be added (LocalData, FDSNSource, CatalogSource, ...)
sources:
- !squirrel.LocalData  # This data source is for local files.

  # These paths are scanned for waveforms, stations, events.
  paths:
  - 'data/sds'
  - 'meta/stations.xml'

  # Select file format or 'detect' for autodetection.
  format: 'detect'
The paths in the dataset description file are relative to the location of this
file itself. The value of
path_prefix is prepended to all paths. Because
alaska.dataset.yaml is in the project’s subdirectory
config, we have set
path_prefix to '..'. With this, the rest of the paths can be given
relative to the project directory root.
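The path resolution rule just described can be checked with a small Python sketch. The resolve function is a hypothetical helper that mimics the rule as stated in the text (directory of the configuration file, then path_prefix, then the given path); Squirrel's internal handling may differ in detail.

```python
import posixpath


def resolve(config_file, path_prefix, path):
    """Resolve a dataset path relative to the configuration file location."""
    config_dir = posixpath.dirname(config_file)
    return posixpath.normpath(posixpath.join(config_dir, path_prefix, path))


# Configuration file in PROJECT/config/, with path_prefix set to '..'.
waveforms = resolve("PROJECT/config/alaska.dataset.yaml", "..", "data/sds")
metadata = resolve("PROJECT/config/alaska.dataset.yaml", "..", "meta/stations.xml")
print(waveforms)  # PROJECT/data/sds
print(metadata)   # PROJECT/meta/stations.xml
```

With path_prefix set to '.', the same paths would instead be looked up inside PROJECT/config/.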
Now we can look at our waveforms by just passing the dataset description file
to squirrel snuffler:
$ squirrel snuffler --dataset config/alaska.dataset.yaml
With an appropriate configuration of the dataset, local and remote data sources
can be combined. It is also possible to add multiple datasets to a Squirrel
program or to combine
--dataset and --add. In this way, the runtime data
collection can be flexibly composed at program startup. Squirrel maintains
indexes of known files, so that repeated program startups are extremely
efficient. This approach works well with datasets of up to about 100k - 1M
files. For larger data archives, it is possible to create persistent
selections, which we will cover later.
In this part of the tutorial, we will explore some more
squirrel subcommands that are useful for checking data
availability or hunting down data problems.
$ squirrel scan --dataset config/alaska.dataset.yaml
To obtain a visual representation of the data availability over time, run:
$ squirrel coverage --dataset config/alaska.dataset.yaml
Use --tmin and --tmax to narrow down the displayed time span.
To get all data codes identifying the various stations/channels available in a data collection, run:
$ squirrel codes --dataset config/alaska.dataset.yaml
The returned codes (aka channel IDs / stream IDs / NSLC codes) have the form
NET.STA.LOC.CHA.EXTRA, where the first four follow the FDSN conventions and
the EXTRA code is for derived data streams within the Squirrel
framework. Several squirrel subcommands allow querying
for channels using patterns given to the
--codes option. For example,
squirrel nuts lists index entries. Nuts are the
smallest units of information in the Squirrel framework. To obtain an inventory
listing of everything related to the vertical component of station BFO, we may
run:
$ squirrel nuts --dataset config/alaska.dataset.yaml --codes '*.BFO.*.??Z'
Or, to find out what files in our collection contain information about station BFO, run:
$ squirrel files --dataset config/alaska.dataset.yaml --codes '*.BFO.*.*'
Similarly, it is possible to query by time span (
--tmin, --tmax) and
content kind (
--kind), i.e. waveform, channel, response, etc.
Conceptually, we should remember that the collection options build up a data collection and the query options are used to query information from that collection. The query options never change the collection itself.
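To get a feeling for how the --codes patterns used above select channels, here is a minimal Python sketch based on shell-style wildcards from the standard library. It is an approximation for illustration only: query_codes is a hypothetical function, the code list is a made-up excerpt, and Squirrel's own matching rules (for example regarding the EXTRA field) may differ.

```python
from fnmatch import fnmatchcase

# A small, made-up excerpt of the codes available in a data collection.
collection = [
    "GR.BFO..LHE",
    "GR.BFO..LHN",
    "GR.BFO..LHZ",
    "GR.WET..LHZ",
    "GR.UBR..LHZ",
]


def query_codes(codes, pattern):
    """Return the codes matching a shell-style pattern.

    The input list is left untouched: querying does not change the collection.
    """
    return [code for code in codes if fnmatchcase(code, pattern)]


vertical_bfo = query_codes(collection, "*.BFO.*.??Z")
all_bfo = query_codes(collection, "*.BFO.*.*")
print(vertical_bfo)
print(all_bfo)
```

Here `?` stands for exactly one character and `*` for any number of characters, including none, which is why the empty location code still matches.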
Squirrel can also be used to retrieve and incrementally update earthquake catalog information from a few selected online catalogs.
--- !squirrel.Dataset
path_prefix: '..'
sources:
- !squirrel.LocalData  # This data source is for local files.
  paths:
  - 'data/sds'
  - 'meta/stations.xml'
- !squirrel.CatalogSource
  catalog: geofon
  query_args:
    magmin: 7.0
To make sure that the local excerpt of the catalog is up to date for a given
time span, we must call
squirrel update with
the dataset description and the desired time span:
$ squirrel update --dataset config/alaska.dataset.yaml --tmin 2021-07-28 --tmax 2021-08-01
Again, as we have seen with waveforms and station metadata, Squirrel is lazy
and tries to avoid duplicate downloads of event information. It uses the
locally cached information when possible. To make our dataset aware of updates
in the upstream catalog, we can set an expiration time for the cached
information (expires) or a time
period for which new data is considered unreliable.
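The effect of an expiration time on cached information can be sketched as follows. This is a simplified illustration and not Squirrel's code: parse_duration and is_expired are hypothetical helpers, and the supported unit letters are an assumption modelled on the '10d' example in the dataset description file shown earlier.

```python
# Hypothetical unit suffixes for duration strings such as '10d'.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_duration(text):
    """Convert a string like '10d' into a number of seconds."""
    return float(text[:-1]) * UNIT_SECONDS[text[-1]]


def is_expired(fetched_at, now, expires=None):
    """Decide whether cached information must be fetched again.

    fetched_at, now: timestamps in seconds.
    expires: duration string, or None if the cache never expires.
    """
    if expires is None:
        return False  # corresponds to "expires: never"
    return now - fetched_at > parse_duration(expires)


DAY = 86400
print(is_expired(0, 3 * DAY, expires="10d"))   # False, cache is still fresh
print(is_expired(0, 11 * DAY, expires="10d"))  # True, a new query is due
print(is_expired(0, 1000 * DAY))               # False, never expires
```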
So far, the runtime data selection used in each squirrel command has been
composed at each startup. For example when running
squirrel snuffler --add
data/sds meta/stations.xml, a temporary database is created with all the
content given to
--add. This temporary database is deleted again when
squirrel snuffler exits. The advantage of this approach is that we can very
flexibly combine what data should be available in each processing step. The
disadvantage is that the creation of the temporary database takes some time and
leads to slow program startup for large datasets. To use a persistent instead
of a temporary database, use the
--persistent option. This option takes the
name of the persistent selection which will be created or used as an argument.
For example, to create a persistent selection named
alaska, and add all files under data/sds to it:
$ squirrel snuffler --persistent alaska --add data/sds
To look at the newly created selection:
$ squirrel snuffler --persistent alaska
We can also add further data to the selection:
$ squirrel snuffler --persistent alaska --add meta/stations.xml
It is possible to create multiple persistent selections but each one adds some internal bookkeeping overhead which can impact the overall performance of the database.
Existing persistent selections can be listed:
$ squirrel persistent list
To remove the persistent selection again:
$ squirrel persistent delete alaska
Persistent selections trade flexibility against program startup time.
The Squirrel framework provides a unified
interface to query and access seismic waveforms, station meta-data and event
information from local file collections and remote data sources. For prompt
responses, a database setup is used under the hood. To speed up assemblage of
ad-hoc data selections, files are indexed on first use and the extracted
meta-data is remembered for subsequent accesses.
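The idea of indexing files on first use and remembering the extracted meta-data can be sketched with a tiny in-memory cache. This is an illustration under stated assumptions, not how Squirrel is implemented: the FileIndex class is hypothetical, it uses a dictionary instead of a database, and it assumes that a change in modification time or file size is a good enough signal that a file needs re-indexing.

```python
import os
import tempfile


class FileIndex:
    """Remember per-file metadata until the file changes (toy example)."""

    def __init__(self, extract):
        self._extract = extract  # function doing the expensive file scan
        self._cache = {}         # path -> ((mtime_ns, size), metadata)
        self.n_extractions = 0

    def get(self, path):
        st = os.stat(path)
        fingerprint = (st.st_mtime_ns, st.st_size)
        cached = self._cache.get(path)
        if cached is not None and cached[0] == fingerprint:
            return cached[1]  # file unchanged: reuse remembered metadata
        metadata = self._extract(path)
        self.n_extractions += 1
        self._cache[path] = (fingerprint, metadata)
        return metadata


def count_lines(path):
    """Stand-in for a real metadata extraction."""
    with open(path) as f:
        return sum(1 for _ in f)


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "example.txt")
    with open(path, "w") as f:
        f.write("a\nb\n")

    index = FileIndex(count_lines)
    first = index.get(path)    # file is scanned
    second = index.get(path)   # answered from the cache
    n_after_two_lookups = index.n_extractions

    with open(path, "a") as f:
        f.write("c\n")
    third = index.get(path)    # file size changed, so it is scanned again

print(first, second, n_after_two_lookups, third)
```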
The squirrel tool provides some of the features of the Squirrel
framework on the command line. In this tutorial, we have seen how we can use it
to perform some everyday seismological tasks such as downloading data from
FDSN web services, dataset conversion and inspection.
For an introduction on how to use the Squirrel framework in your own code, see Tutorial: writing a Squirrel based tool to calculate hourly RMS values.