Data Structures
Pandas DataFrames
Within this software package, we store timeseries data in pandas DataFrames, with timestamps as the index and one or more time-aligned datasets as columns. A typical dataframe for GTSM gauge data would contain four columns, representing the different strain gauges (CH0, CH1, CH2, CH3). By applying a calibration matrix to these data, we can convert the four gauges into a three column dataframe of areal (Eee+Enn), shear (2Ene), and differential (Eee-Enn) strains.
Example DataFrame containing 300s gauge microstrain data
CH0 CH1 CH2 CH3
time
2023-01-01 00:00:00 -145.455340 -22.864945 86.758526 -5.665800
2023-01-01 00:05:00 -145.455340 -22.865036 86.758526 -5.665800
2023-01-01 00:10:00 -145.455421 -22.864764 86.758052 -5.666071
2023-01-01 00:15:00 -145.455421 -22.864221 86.757293 -5.666791
2023-01-01 00:20:00 -145.455421 -22.863768 86.756819 -5.667332
... ... ... ... ...
2023-01-31 23:40:00 -145.635846 -22.826812 86.902057 -5.653822
2023-01-31 23:45:00 -145.635119 -22.825997 86.901582 -5.653912
2023-01-31 23:50:00 -145.634393 -22.825001 86.901108 -5.654002
2023-01-31 23:55:00 -145.634150 -22.824457 86.900349 -5.654542
2023-02-01 00:00:00 -145.633666 -22.823823 86.899685 -5.654722
8929 rows × 4 columns
New DataFrames are created during each processing step. Calculated corrections are stored in their own DataFrame(s) as well, which can then be applied to raw data via simple combination.
On Disk
CSV
DataFrames can be saved to CSV (with optional compression), and that functionality is supported by the software.
However, it is often desireable to store more than just the data itself, for example discription of the specific data, or quality flags allow keeping track of any data that is known to be bad or missing or interpolated. We also may want to be able to version the data. Therefore, we have been developing a three dimensional array structure using TileDB to store processed strain data, as well as a python Class earthscopestraintools.timeseries.Timeseries to handle
TileDB arrays
Processed data will be stored with a Tiledb array per station, and indexed along the following three dimensions. Implementation of this is still under development, but the schema has been defined as follows.
Dimensions |
|
|---|---|
data_type |
variable length string. defines channel (i.e. ‘CH0’) or strain (i.e. ‘Eee-Enn’). may also describe the calibration matrix used (i.e. ‘Eee+Enn.ER2010’) if choosing a calibration other than the default ‘lab’ |
timeseries |
variable length string. used to define whether the data is a measurement or a correction. Options include [‘counts’, ‘microstrain’, ‘offset_c’, ‘tide_c’, ‘trend_c’, ‘atmp_c’] |
time |
int64 unix milliseconds since 1970. |
Each cell in the multi-dimensional array will also have four attributes.
Attributes |
|
|---|---|
data |
(float64) the actual data value |
quality |
(char) single character quality flag (i.e. ‘g’=good, ‘b’=bad, ‘m’=missing, ‘i’=interpolated) |
level |
(str) one/two character level flag (i.e. ‘0’,’1’,’2a’,’2b’) |
version |
versioning is intended to be used to identify processing metadata which may change with time. not yet well implemented. |
Timeseries Objects
We have created a class earthscopestraintools.timeseries.Timeseries, which is designed to capture all this various extra information and support writing to/reading from TileDB arrays. Using these Timeseries objects is recommended, as it simplifies the processing workflow and provides built-in stats around missing/bad data.
Each Timeseries object contains the following attributes:
Attributes |
|
|---|---|
data |
(pd.DataFrame) as described above, with datetime index and one or more columns of timeseries data |
quality_df |
(pd.DataFrame) autogenerated with same shape as data, but with a character mapped to each data point. flags include “g”=good, “m”=missing, “i”=interpolated, “b”=bad |
series |
(str) timeseries dimension for TileDB schema, ie ‘raw’, ‘microstrain’, ‘atmp_c’, ‘tide_c’, ‘offset_c’, ‘trend_c’ |
units |
(str) units of data |
level |
(str) level of data. ie. ‘0’,’1’,’2a’,’2b’ |
period |
(float) sample period of data |
name |
(str) optional name of timeseries, used for showing stats and plotting. defaults to network.station |
network |
(str) FDSN two character network code |
station |
(str) FDSN four character station code |
DataFrame data can be initially loaded into a Timeseries object either directly i.e.
from earthscopestraintools.timeseries import Timeseries
strain_raw = Timeseries(data=your_data_df,
series="raw",
units="counts",
level="0",
period=1,
name="PB.B004.raw",
network="PB",
station="B004")
or (Recommended) by using the function mseed_to_ts(), which will call FDSN-DataSelect web service and load the requested data from the miniseed archive i.e.
from earthscopestraintools.mseed_tools import ts_from_mseed
start="2023-01-01T00:00:00"
end = "2023-02-01T00:00:00"
strain_raw = ts_from_mseed(network="PB",
station="B004",
location='T0',
channel='LS*',
start=start,
end=end)
Timeseries objects contain a number of processing methods, which build and return new timeseries objects. For example, decimation of 1s data to the typical 300s data is performed by the following method, which returns a new Timeseries object
decimated_counts = strain_raw.decimate_1s_to_300s()
They also contain a built-in method stats() which displays a summary of the Timeseries object, including information on missing/interpolated data. An Epoch is defined as a single row in the data, while a Sample is an individual value.
strain_raw.stats()
PB.B004.T0.LS*
| Channels: ['CH0', 'CH1', 'CH2', 'CH3']
| TimeRange: 2023-01-01 00:00:00 - 2023-02-01 00:00:00 | Period: 1s
| Series: raw| Units: counts| Level: 0| Gaps: 0.06%
| Epochs: 2678401| Good: 2676756.25| Missing: 1644.75| Interpolated: 0.0
| Samples: 10713604| Good: 10707025| Missing: 6579| Interpolated: 0
Another built-in method plot() is useful for visualization of Timeseries data.
strain_raw.plot()
See the api docs for more details on available methods and options, and the example notebooks for introductory usage.