bw_processing package#
Subpackages#
Submodules#
bw_processing.array_creation module#
- bw_processing.array_creation.chunked(iterable, chunk_size)#
- bw_processing.array_creation.create_array(iterable, nrows=None, dtype=<class 'numpy.float32'>)#
Create a numpy array from the data in iterable. Returns a filepath of a created file (if filepath is provided), or the array.
iterable can be data already in memory, or a generator.
nrows can be supplied, if known. If iterable has a length, it will be determined automatically. If nrows is not known, this function generates chunked arrays until iterable is exhausted, and concatenates them.
Either nrows or ncols must be specified.
- bw_processing.array_creation.create_chunked_array(iterable, ncols, dtype=<class 'numpy.float32'>, bucket_size=500)#
Create a numpy array from an iterable of indeterminate length.
Needed when we can’t determine the length of the iterable ahead of time (e.g. for a generator or a database cursor), so we can’t create the complete array in memory in one step.
Creates a list of arrays with bucket_size rows until iterable is exhausted, then concatenates them.
- Parameters:
iterable – Iterable of data used to populate the array.
ncols – Number of columns in the created array.
dtype – Numpy dtype of the created array
bucket_size – Number of rows in each intermediate array.
- Returns:
The created array. Will return a zero-length array if iterable has no data.
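The bucket-and-concatenate approach described above can be sketched in pure Python. This is an illustration of the technique only, not the library's implementation: lists stand in for numpy arrays, and the names chunked_sketch and create_chunked_rows are hypothetical.

```python
from itertools import islice


def chunked_sketch(iterable, chunk_size):
    """Yield lists of up to chunk_size items until iterable is exhausted."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def create_chunked_rows(iterable, bucket_size=500):
    """Collect fixed-size buckets, then concatenate them into one list."""
    buckets = list(chunked_sketch(iterable, bucket_size))
    # Concatenation step; an empty iterable gives an empty result
    return [row for bucket in buckets for row in bucket]


# A generator has no length, so the total row count is unknown up front
rows = create_chunked_rows(((i, i * 2) for i in range(1200)), bucket_size=500)
bucket_sizes = [len(c) for c in chunked_sketch(range(1200), 500)]
```

Only one bucket is being filled at any moment, which is why the total length does not need to be known in advance.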
- bw_processing.array_creation.create_chunked_structured_array(iterable, dtype, bucket_size=20000)#
Create a numpy structured array from an iterable of indeterminate length.
Needed when we can’t determine the length of the iterable ahead of time (e.g. for a generator or a database cursor), so we can’t create the complete array in memory in one step.
Creates a list of arrays with bucket_size rows until iterable is exhausted, then concatenates them.
- Parameters:
iterable – Iterable of data used to populate the array.
dtype – Numpy dtype of the created array
format_function – If provided, this function will be called on each row of iterable before insertion in the array.
bucket_size – Number of rows in each intermediate array.
- Returns:
The created array. Will return a zero-length array if iterable has no data.
- bw_processing.array_creation.create_structured_array(iterable, dtype, nrows=None, sort=False, sort_fields=None)#
Create a numpy structured array for the data in iterable. Returns a filepath of a created file (if filepath is provided), or the array.
iterable can be data already in memory, or a generator.
nrows can be supplied, if known. If iterable has a length, it will be determined automatically. If nrows is not known, this function generates chunked arrays until iterable is exhausted, and concatenates them.
- bw_processing.array_creation.get_ncols(iterator)#
- bw_processing.array_creation.peek(iterator)#
bw_processing.constants module#
bw_processing.datapackage module#
- class bw_processing.datapackage.Datapackage#
Bases:
DatapackageBase
Interface for creating, loading, and using numerical datapackages for Brightway.
Note that there are two entry points to using this class, both separate functions: create_datapackage() and load_datapackage(). Do not create an instance of the class with Datapackage(), unless you like playing with danger :)
Data packages can be stored in memory, in a directory, or in a zip file. When creating data packages for use later, don’t forget to call .finalize_serialization(), or the metadata won’t be written and the data package won’t be usable.
Potential gotchas:
There is currently no way to modify a zipped data package once it is finalized.
Resources that are interfaces to external data sources (either in Python or other) can’t be saved, but must be recreated each time a data package is used.
- add_csv_metadata(*, dataframe, valid_for, name=None, **kwargs)#
Add an iterable metadata object to be stored as a CSV file.
The purpose of storing metadata is to enable data exchange; therefore, this method assumes that data is written to disk.
The normal use case of this method is to link integer indices from either structured or presample arrays to a set of fields that uniquely identifies each object. This allows for matching based on object attributes from computer to computer, where database ids or other computer-generated codes might not be consistent.
Uses pandas to store and load data; therefore, metadata must already be a pandas dataframe.
In contrast with presamples arrays, iterable_data_source cannot be an infinite generator. We need a finite set of data to build a matrix.
In contrast to self.create_structured_array, this always stores the dataframe in self.data; no proxies are used.
- Parameters:
dataframe – Dataframe to be persisted to disk.
valid_for – List of resource names that this metadata is valid for; must be either structured or presample indices arrays. Each item in valid_for has the form ("resource_name", "rows" or "cols"). resource_name should be either a structured or a presamples indices array.
name – The name of this resource. Names must be unique in a given data package.
extra – Dict of extra metadata.
- Returns:
Nothing, but appends objects to self.metadata['resources'] and self.data.
- Raises:
* AssertionError – If inputs are not in correct form
* AssertionError – If valid_for refers to unavailable resources
- Return type:
None
- add_dynamic_array(*, matrix, interface, indices_array, name=None, flip_array=None, keep_proxy=False, matrix_serialize_format_type=None, **kwargs)#
interface must support the presamples API.
- Parameters:
matrix (str) –
interface (Any) –
indices_array (ndarray) –
name (str | None) –
flip_array (ndarray | None) –
keep_proxy (bool) –
matrix_serialize_format_type (MatrixSerializeFormat | None) –
- Return type:
None
- add_dynamic_vector(*, matrix, interface, indices_array, name=None, flip_array=None, keep_proxy=False, matrix_serialize_format_type=None, **kwargs)#
- Parameters:
matrix (str) –
interface (Any) –
indices_array (ndarray) –
name (str | None) –
flip_array (ndarray | None) –
keep_proxy (bool) –
matrix_serialize_format_type (MatrixSerializeFormat | None) –
- Return type:
None
- add_json_metadata(*, data, valid_for, name=None, **kwargs)#
Add an iterable metadata object to be stored as a JSON file.
The purpose of storing metadata is to enable data exchange; therefore, this method assumes that data is written to disk.
The normal use case of this method is to provide names and other metadata for parameters whose values are stored as presamples arrays. The length of data should match the number of rows in the corresponding presamples array, and data is just a list of string labels for the parameters. However, this method can also be used to store other metadata, e.g. for external data resources.
In contrast to self.create_structured_array, this always stores the data in self.data; no proxies are used.
- Parameters:
data – Data to be persisted to disk.
valid_for – Name of structured data or presample array that this metadata is valid for.
name – The name of this resource. Names must be unique in a given data package.
extra – Dict of extra metadata.
- Returns:
Nothing, but appends objects to self.metadata['resources'] and self.data.
- Raises:
* AssertionError – If inputs are not in correct form
* AssertionError – If valid_for refers to unavailable resources
- Return type:
None
- add_persistent_array(*, matrix, data_array, indices_array, name=None, flip_array=None, keep_proxy=False, matrix_serialize_format_type=None, **kwargs)#
- Parameters:
matrix (str) –
data_array (ndarray) –
indices_array (ndarray) –
name (str | None) –
flip_array (ndarray | None) –
keep_proxy (bool) –
matrix_serialize_format_type (MatrixSerializeFormat | None) –
- Return type:
None
- add_persistent_vector(*, matrix, indices_array, name=None, data_array=None, flip_array=None, distributions_array=None, keep_proxy=False, matrix_serialize_format_type=None, **kwargs)#
- Parameters:
matrix (str) –
indices_array (ndarray) –
name (str | None) –
data_array (ndarray | None) –
flip_array (ndarray | None) –
distributions_array (ndarray | None) –
keep_proxy (bool) –
matrix_serialize_format_type (MatrixSerializeFormat | None) –
- Return type:
None
- add_persistent_vector_from_iterator(*, matrix=None, name=None, dict_iterator=None, nrows=None, matrix_serialize_format_type=None, **kwargs)#
Create a persistent vector from an iterator. Uses the utility function resolve_dict_iterator.
This is the only array creation method which produces sorted arrays.
- Parameters:
matrix (str | None) –
name (str | None) –
dict_iterator (Any | None) –
nrows (int | None) –
matrix_serialize_format_type (MatrixSerializeFormat | None) –
- Return type:
None
- finalize_serialization()#
- Return type:
None
- write_modified()#
Write the data in modified files to the filesystem (if allowed).
- class bw_processing.datapackage.DatapackageBase#
Bases:
ABC
Base class for datapackages. Not for normal use; you should use either Datapackage or FilteredDatapackage.
- dehydrated_interfaces()#
Return a list of the resource groups which have dehydrated interfaces
- Return type:
List[str]
- del_resource(name_or_index)#
Remove a resource, and delete its data file, if any.
- Parameters:
name_or_index (str | int) –
- Return type:
None
- del_resource_group(name)#
Remove a resource group, and delete its data files, if any.
Use exclude_resource_group if you want to keep the underlying resource in the filesystem.
- Parameters:
name (str) –
- Return type:
None
- exclude(filters)#
Filter a datapackage to exclude resources matching a filter.
Use cases:
Filter out a given resource:
exclude_generic({"matrix": "some_label"})
Filter out a resource group with a given kind:
exclude_generic({"group": "some_group", "kind": "some_kind"})
- Parameters:
filters (Dict[str, str]) –
- Return type:
- filter_by_attribute(key, value)#
Create a new FilteredDatapackage which satisfies the filter resource[key] == value.
All included objects are the same as in the original data package, i.e. no copies are made. No checks are made to ensure consistency with modifications to the original datapackage after the creation of this filtered datapackage.
This method was introduced to allow for the efficient construction of matrices; each datapackage can have data for multiple matrices, and we can then create filtered datapackages which exclusively have data for the matrix of interest. As such, they should be considered read-only, though this is not enforced.
- Parameters:
key (str) –
value (Any) –
- Return type:
- get_resource(name_or_index)#
Return data and metadata for name_or_index.
- Parameters:
name_or_index – Name (str) or index (int) of a resource in the existing metadata.
- Raises:
* IndexError – Integer index out of range of given metadata
* ValueError – String name not present in metadata
* NonUnique – String name present in two resource metadata sections
- Returns:
(data object, metadata dict)
- Return type:
(Any, dict)
- property groups: dict#
Return a dictionary of {group label: filtered datapackage} in the same order as the group labels are first encountered in the datapackage metadata.
Ignores resources which don’t have group labels.
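The filtering and grouping behaviour described for filter_by_attribute and groups can be sketched on plain resource metadata dictionaries. This is a simplified illustration, not the library's code: lists of dictionaries stand in for FilteredDatapackage objects, and the function names and sample resources are hypothetical.

```python
def filter_resources(resources, key, value):
    """Keep resources where resource[key] == value; objects are not copied."""
    return [resource for resource in resources if resource.get(key) == value]


def group_resources(resources):
    """Map group label -> resources, in order of first appearance."""
    groups = {}
    for resource in resources:
        label = resource.get("group")
        if label is None:
            # Resources without a group label are ignored
            continue
        groups.setdefault(label, []).append(resource)
    return groups


resources = [
    {"name": "a.indices", "group": "a", "matrix": "technosphere_matrix"},
    {"name": "notes"},
    {"name": "b.data", "group": "b", "matrix": "biosphere_matrix"},
    {"name": "a.data", "group": "a", "matrix": "technosphere_matrix"},
]
filtered = filter_resources(resources, "matrix", "technosphere_matrix")
groups = group_resources(resources)
```

Because the filtered list holds the same objects as the original, a later change to the original metadata is visible through the filtered view, which is why such views are best treated as read-only.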
- rehydrate_interface(name_or_index, resource, initialize_with_config=False)#
Substitute the undefined interface in this datapackage with the actual interface resource resource. Loading a datapackage with an interface loads an instance of UndefinedInterface, which should be substituted (rehydrated) with an actual interface instance.
If initialize_with_config is true, the resource is initialized (i.e. resource(**config_data)) with the resource data under the key config. If config is missing, a KeyError is raised.
name_or_index should be the data source name. If this value is a string and doesn’t end with .data, .data is automatically added.
- Parameters:
name_or_index (str | int) –
resource (Any) –
initialize_with_config (bool) –
- Return type:
None
- property resources: list#
- class bw_processing.datapackage.FilteredDatapackage#
Bases:
DatapackageBase
A subset of a datapackage. Used in matrix construction or other data manipulation operations.
Should be treated as read-only.
- bw_processing.datapackage.create_datapackage(fs=None, name=None, id_=None, metadata=None, combinatorial=False, sequential=False, seed=None, sum_intra_duplicates=True, sum_inter_duplicates=False, matrix_serialize_format_type=MatrixSerializeFormat.NUMPY)#
Create a new data package.
All arguments are optional; if a PyFilesystem2 filesystem is not provided, a MemoryFS will be used.
All metadata elements should follow the datapackage specification.
Licenses are specified as a list in metadata. The default license is the Open Data Commons Public Domain Dedication and License v1.0.
- Parameters:
fs – A Filesystem, optional. A new MemoryFS is used if not provided.
name – str, optional. A new uuid is used if not provided.
id_ (str | None) – Optional. A new uuid is used if not provided.
metadata (dict | None) – Optional. Metadata dictionary following datapackage specification; see above.
combinatorial (bool) – Default False. Policy on how to sample columns across multiple data arrays; see readme.
sequential (bool) – Default False. Policy on how to sample columns in data arrays; see readme.
seed (int | None) – Optional. Seed to use in random number generator.
sum_intra_duplicates (bool), sum_inter_duplicates (bool) – Should duplicate elements (within a data resource, or across data resources) be summed together, or should the last value replace previous values. Order of data resources is given by the order they are added to the data package. The defaults are True and False respectively, as shown in the signature.
matrix_serialize_format_type (MatrixSerializeFormat) – Default MatrixSerializeFormat.NUMPY. Matrix serialization format type.
- Returns:
A Datapackage instance.
- Return type:
- bw_processing.datapackage.load_datapackage(fs_or_obj, mmap_mode=None, proxy=False)#
Load an existing datapackage.
Can load proxies to data instead of the data itself, which can be useful when interacting with large arrays or large packages where only a subset of the data will be accessed.
Proxies use something similar to functools.partial to create a callable class instead of returning the raw data (see https://github.com/brightway-lca/bw_processing/issues/9 for why we can’t just use partial). datapackage access methods (i.e. .get_resource) will automatically resolve proxies when needed.
- Parameters:
fs_or_obj (DatapackageBase | FS) – A Filesystem or an instance of DatapackageBase.
mmap_mode (str | None) – Optional. Define memory mapping mode to use when loading Numpy arrays.
proxy (bool) – Default False. Load proxies instead of complete Numpy arrays; see above.
- Returns:
A Datapackage instance.
- Return type:
- bw_processing.datapackage.simple_graph(data, fs=None, **metadata)#
Easy creation of simple datapackages with only persistent vectors.
data is a dictionary with the form:
    matrix_name (str): [
        (row id (int), col id (int), value (float), flip (bool, default False))
    ]
fs is a filesystem. metadata are passed as kwargs to create_datapackage().
Returns the datapackage.
- Parameters:
data (dict) –
fs (FS | None) –
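A concrete instance of the data form described for simple_graph may help. The sketch below builds such a dictionary and pads rows with the default flip value; the matrix names and the helper normalize_rows are hypothetical and are not part of bw_processing.

```python
def normalize_rows(data):
    """Pad each (row, col, value[, flip]) tuple with the default flip=False."""
    normalized = {}
    for matrix_name, rows in data.items():
        normalized[matrix_name] = [
            (row[0], row[1], row[2], row[3] if len(row) > 3 else False)
            for row in rows
        ]
    return normalized


# Keys are matrix names; values are lists of (row id, col id, value[, flip])
data = {
    "technosphere_matrix": [(0, 0, 1.0), (0, 1, 0.5, True)],
    "biosphere_matrix": [(2, 1, 7.0)],
}
normalized = normalize_rows(data)
```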
bw_processing.errors module#
- exception bw_processing.errors.BrightwayProcessingError#
Bases:
Exception
- exception bw_processing.errors.Closed#
Bases:
BrightwayProcessingError
Datapackage closed, can’t be written to anymore.
- exception bw_processing.errors.FileIntegrityError#
Bases:
BrightwayProcessingError
MD5 hash does not agree with file contents
- exception bw_processing.errors.InconsistentFields#
Bases:
BrightwayProcessingError
Given fields not the same for each element
- exception bw_processing.errors.InvalidMimetype#
Bases:
BrightwayProcessingError
Provided mimetype missing or not understood
- exception bw_processing.errors.InvalidName#
Bases:
BrightwayProcessingError
Name fails datapackage requirements:
A short url-usable (and preferably human-readable) name of the package. This MUST be lower-case and contain only alphanumeric characters along with “.”, “_” or “-” characters.
- exception bw_processing.errors.LengthMismatch#
Bases:
BrightwayProcessingError
Number of resources doesn’t match the number of data objects
- exception bw_processing.errors.NonUnique#
Bases:
BrightwayProcessingError
Nonunique elements when uniqueness is required
- exception bw_processing.errors.PotentialInconsistency#
Bases:
BrightwayProcessingError
Given operation could cause inconsistent data
- exception bw_processing.errors.ShapeMismatch#
Bases:
BrightwayProcessingError
Array shapes in a resource group are not consistent
- exception bw_processing.errors.WrongDatatype#
Bases:
BrightwayProcessingError
Wrong type of data written to a resource
bw_processing.filesystem module#
- bw_processing.filesystem.clean_datapackage_name(name)#
Clean string name of characters not allowed in data package names.
Replaces them with underscores, and collapses repeated underscores.
- Parameters:
name (str) –
- Return type:
str
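A minimal sketch of this kind of cleaning, assuming the allowed character set is the one quoted under InvalidName (lower-case alphanumerics plus ".", "_" and "-"). The lower-casing step and the function name clean_name_sketch are assumptions; the library's actual rules may differ.

```python
import re


def clean_name_sketch(name):
    """Replace disallowed characters with underscores and collapse repeats."""
    # Assumption: lower-case first, since names must be lower-case
    cleaned = re.sub(r"[^a-z0-9._-]+", "_", name.lower())
    # Collapse any run of underscores into a single one
    return re.sub(r"_{2,}", "_", cleaned)


def is_valid_name(name):
    """Check a name against the rule quoted under InvalidName."""
    return re.fullmatch(r"[a-z0-9._-]+", name) is not None


cleaned = clean_name_sketch("My Data  Package!")
```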
- bw_processing.filesystem.md5(filepath, blocksize=65536)#
Generate MD5 hash for file at filepath
- Parameters:
filepath (str | Path) –
blocksize (int) –
- Return type:
str
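Hashing a file in fixed-size blocks keeps memory use constant regardless of file size. The sketch below shows the general technique with the standard library's hashlib; md5_of_file is a hypothetical name and this is not necessarily how the library implements md5.

```python
import hashlib
import os
import tempfile


def md5_of_file(filepath, blocksize=65536):
    """Return the hex MD5 digest of a file, reading blocksize bytes at a time."""
    digest = hashlib.md5()
    with open(filepath, "rb") as f:
        # iter() with a sentinel stops when read() returns empty bytes
        for block in iter(lambda: f.read(blocksize), b""):
            digest.update(block)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.bin")
    with open(path, "wb") as f:
        f.write(b"abc" * 100000)
    checksum = md5_of_file(path, blocksize=4096)
```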
- bw_processing.filesystem.safe_filename(string, add_hash=True, full=False)#
Convert arbitrary strings to make them safe for filenames. Substitutes strange characters, and uses unicode normalization.
If add_hash is true, appends a hash of string to avoid name collisions.
From http://stackoverflow.com/questions/295135/turn-a-string-into-a-valid-filename-in-python
- Parameters:
string (str | bytes) –
add_hash (bool) –
full (bool) –
- Return type:
str
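The combination of Unicode normalization, character substitution and an optional hash can be sketched as follows. The details here (NFKD normalization, the character class, an 8-character MD5 suffix, ignoring the full flag) are assumptions chosen for illustration, and safe_name_sketch is a hypothetical name.

```python
import hashlib
import re
import unicodedata


def safe_name_sketch(string, add_hash=True):
    """Return a filename-safe version of string, optionally with a hash suffix."""
    if isinstance(string, bytes):
        string = string.decode("utf-8")
    # Decompose accented characters, then drop anything outside ASCII
    normalized = unicodedata.normalize("NFKD", string)
    ascii_only = normalized.encode("ascii", "ignore").decode("ascii")
    # Replace runs of characters that are awkward in filenames
    safe = re.sub(r"[^\w.-]+", "_", ascii_only).strip("_")
    if add_hash:
        # The hash is taken from the original string, so inputs that
        # collapse to the same safe name still get distinct filenames
        suffix = hashlib.md5(string.encode("utf-8")).hexdigest()[:8]
        return f"{safe}.{suffix}"
    return safe


plain = safe_name_sketch("Caf\u00e9 au lait/2024", add_hash=False)
hashed = safe_name_sketch("Caf\u00e9 au lait/2024")
```

Without the hash, "a/b" and "a b" would both become "a_b"; the suffix is what keeps them apart.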
bw_processing.indexing module#
- bw_processing.indexing.reindex(datapackage, metadata_name, data_iterable, fields=None, id_field_datapackage='id', id_field_destination='id')#
Use the metadata to set the integer indices in datapackage to those used in data_iterable.
Used in data exchange. Often, the integer ids provided in the data package are arbitrary, and need to be mapped to the values present in your database.
Updates the datapackage in place.
- Parameters:
datapackage – datapackage or Filesystem. Input to load_datapackage function.
metadata_name – Name identifying a CSV metadata resource in datapackage.
data_iterable – Iterable which returns objects that support .get().
fields – Optional list of fields to use while matching.
id_field_datapackage – String identifying the column providing an integer id in the datapackage.
id_field_destination – String identifying the column providing an integer id in data_iterable.
- Raises:
* KeyError – data_iterable is missing id_field_destination field
* KeyError – metadata_name is missing id_field_datapackage field
* NonUnique – Multiple objects found in data_iterable which match fields in datapackage
* KeyError – metadata_name is not in datapackage
* KeyError – No object found in data_iterable which matches fields in datapackage
* ValueError – metadata_name is not CSV metadata.
* ValueError – The resources given for metadata_name are not present in this datapackage
* AttributeError – data_iterable doesn’t support field retrieval using .get().
- Returns:
Datapackage instance with modified data
- Return type:
None
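The matching step behind reindexing can be sketched without the library: build a lookup from attribute values to destination ids, then translate each datapackage id through it. Everything below is illustrative. The helper name, the sample rows, and the use of ValueError in place of NonUnique are assumptions, and the real function works on CSV metadata resources and numpy index arrays rather than lists.

```python
def build_id_mapping(metadata_rows, data_iterable, fields,
                     id_field_datapackage="id", id_field_destination="id"):
    """Map datapackage ids to destination ids by matching on fields."""
    lookup = {}
    for obj in data_iterable:
        key = tuple(obj.get(field) for field in fields)
        if key in lookup:
            # Stand-in for the library's NonUnique error
            raise ValueError(f"Multiple objects match {key}")
        lookup[key] = obj[id_field_destination]
    mapping = {}
    for row in metadata_rows:
        key = tuple(row.get(field) for field in fields)
        if key not in lookup:
            raise KeyError(f"No object matches {key}")
        mapping[row[id_field_datapackage]] = lookup[key]
    return mapping


# Ids shipped with the data package (arbitrary) and ids in the local database
metadata_rows = [
    {"id": 0, "name": "steel", "location": "DE"},
    {"id": 1, "name": "coal", "location": "PL"},
]
database = [
    {"id": 4711, "name": "coal", "location": "PL"},
    {"id": 42, "name": "steel", "location": "DE"},
]
mapping = build_id_mapping(metadata_rows, database, fields=["name", "location"])
indices = [0, 1, 1, 0]
reindexed = [mapping[i] for i in indices]
```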
- bw_processing.indexing.reset_index(datapackage, metadata_name)#
Reset the numerical indices in datapackage to sequential integers starting from zero.
Updates the datapackage in place.
- Parameters:
datapackage – datapackage or Filesystem. Input to load_datapackage function.
metadata_name – Name identifying a CSV metadata resource in datapackage.
- Returns:
Datapackage instance with modified data
- Return type:
bw_processing.io_helpers module#
- bw_processing.io_helpers.file_reader(*, fs, resource, mimetype, proxy=False, mmap_mode=None, **kwargs)#
- Parameters:
fs (FS) –
resource (str) –
mimetype (str) –
proxy (bool) –
mmap_mode (str | None) –
- Return type:
Any
- bw_processing.io_helpers.file_writer(*, data, fs, resource, mimetype, matrix_serialize_format_type=MatrixSerializeFormat.NUMPY, meta_object=None, meta_type=None, **kwargs)#
- Parameters:
data (Any) –
fs (FS) –
resource (str) –
mimetype (str) –
matrix_serialize_format_type (MatrixSerializeFormat) –
meta_object (str | None) –
meta_type (str | None) –
- Return type:
None
- bw_processing.io_helpers.generic_directory_filesystem(*, dirpath)#
- Parameters:
dirpath (Path) –
- Return type:
OSFS
- bw_processing.io_helpers.generic_zipfile_filesystem(*, dirpath, filename, write=True)#
- Parameters:
dirpath (Path) –
filename (str) –
write (bool) –
- Return type:
ZipFS
bw_processing.io_parquet_helpers module#
bw_processing.io_pyarrow_helpers module#
bw_processing.merging module#
- bw_processing.merging.add_resource_suffix(metadata, suffix)#
Update the name, path, and group values to include suffix. The suffix comes after the basename but before the data type suffix (e.g. indices, data).
Given the suffix "_foo" and the metadata:
{
    "name": "sa-data-vector-from-dict.indices",
    "path": "sa-data-vector-from-dict.indices.npy",
    "group": "sa-data-vector-from-dict",
}
Returns:
{
    "name": "sa-data-vector-from-dict_foo.indices",
    "path": "sa-data-vector-from-dict_foo.indices.npy",
    "group": "sa-data-vector-from-dict_foo",
}
- Parameters:
metadata (dict) –
suffix (str) –
- Return type:
dict
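The transformation in the example above can be reproduced with a few lines of string handling. This sketch assumes that name and path both start with the group label, which holds for the example but is not stated as a general rule; add_suffix_sketch is a hypothetical name, not the library function.

```python
def add_suffix_sketch(metadata, suffix):
    """Insert suffix after the group basename in name, path and group."""
    group = metadata["group"]
    updated = dict(metadata)  # leave the input dictionary untouched
    updated["group"] = group + suffix
    for key in ("name", "path"):
        value = metadata[key]
        if not value.startswith(group):
            # Assumption of this sketch: the basename is the group label
            raise ValueError(f"{key} does not start with the group label")
        updated[key] = group + suffix + value[len(group):]
    return updated


original = {
    "name": "sa-data-vector-from-dict.indices",
    "path": "sa-data-vector-from-dict.indices.npy",
    "group": "sa-data-vector-from-dict",
}
result = add_suffix_sketch(original, "_foo")
```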
- bw_processing.merging.mask_resource(obj, mask)#
- Parameters:
obj (Any) –
mask (ndarray) –
- Return type:
Any
- bw_processing.merging.merge_datapackages_with_mask(first_dp, first_resource_group_label, second_dp, second_resource_group_label, mask_array, output_fs=None, metadata=None)#
Merge two resources using a Numpy boolean mask. Returns elements from first_dp where the mask is True, otherwise second_dp.
Both resource arrays, and the filter mask, must have the same length.
Both datapackages must be static, i.e. not interfaces. This is because we don’t yet have the functionality to select only some of the values in a resource group in matrix_utils.
This function currently will not mask or filter JSON or CSV metadata.
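The element-wise selection rule can be illustrated with plain lists. This is a sketch of the masking logic only; merge_with_mask is a hypothetical helper, and with numpy arrays the same selection is what numpy.where(mask, first, second) computes.

```python
def merge_with_mask(first, second, mask):
    """Take first[i] where mask[i] is True, otherwise second[i]."""
    if not (len(first) == len(second) == len(mask)):
        raise ValueError("Both arrays and the mask must have the same length")
    return [a if keep else b for a, b, keep in zip(first, second, mask)]


merged = merge_with_mask(
    [1.0, 2.0, 3.0],
    [10.0, 20.0, 30.0],
    [True, False, True],
)
```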
- Parameters:
first_dp – The datapackage from which values will be taken when mask_array is True.
first_resource_group_label – Label of the resource group in first_dp to select values from.
second_dp – The datapackage from which values will be taken when mask_array is False.
second_resource_group_label – Label of the resource group in second_dp to select values from.
mask_array – Boolean numpy array.
output_fs – Filesystem to write new datapackage to, if any.
metadata – Metadata for new datapackage, if any.
- Returns:
A Datapackage instance. Will write the resulting datapackage to output_fs if provided.
- Return type:
- bw_processing.merging.update_nrows(resource, data)#
- Parameters:
resource (dict) –
data (Any) –
- Return type:
dict
- bw_processing.merging.write_data_to_fs(resource, data, fs)#
- Parameters:
resource (dict) –
data (Any) –
fs (FS) –
- Return type:
None
bw_processing.proxies module#
- class bw_processing.proxies.Proxy(func, label, kwargs)#
Bases:
object
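The load_datapackage description characterizes proxies as callable classes that behave much like functools.partial. A minimal sketch of that idea follows, reusing the constructor arguments shown in the signature (func, label, kwargs). The method bodies, the class name LazyProxy and the loader function are assumptions, not the library's implementation.

```python
class LazyProxy:
    """Hypothetical stand-in for a proxy: defers a call until data is needed."""

    def __init__(self, func, label, kwargs):
        self.func = func
        self.label = label
        self.kwargs = kwargs

    def __call__(self):
        # Resolve the proxy by running the stored function
        return self.func(**self.kwargs)

    def __repr__(self):
        return f"LazyProxy({self.label!r})"


calls = []


def load_values(n):
    """Pretend loader that records each time it actually runs."""
    calls.append(n)
    return list(range(n))


proxy = LazyProxy(load_values, "example.data", {"n": 3})
calls_before = list(calls)  # nothing has been loaded yet
data = proxy()
```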
- class bw_processing.proxies.UndefinedInterface#
Bases:
objectAn interface to external data that isn’t saved to disk.
bw_processing.unique_fields module#
- bw_processing.unique_fields.as_unique_attributes(data, exclude=None, include=None, raise_error=False)#
Format data as a unique set of attributes and values for use in create_processed_datapackage.
Each element in data must have the attribute id, and it must be unique. However, the field “id” is not used in selecting the unique set of attributes.
If no set of attributes is found that uniquely identifies all features, all fields are used. To have this case raise an error, pass raise_error=True.
data = [
    {},
]
- Parameters:
data (iterable) – List of dictionaries with the same fields.
exclude (iterable) – Fields to exclude during search for uniqueness. id is always excluded.
include (iterable) – Fields to include when returning, even if not unique.
- Returns:
(list of field names as strings, dictionary of data ids to values for given field names)
- Raises:
InconsistentFields – Not all features provide all fields.
- bw_processing.unique_fields.as_unique_attributes_dataframe(df, exclude=None, include=None, raise_error=False)#
- bw_processing.unique_fields.greedy_set_cover(data, exclude=None, raise_error=True)#
Find a set of attributes that uniquely identifies each element in data.
Feature selection is a well known problem, and is analogous to the set cover problem, for which there is a well known heuristic.
Example:
data = [
    {'a': 1, 'b': 2, 'c': 3},
    {'a': 2, 'b': 2, 'c': 3},
    {'a': 1, 'b': 2, 'c': 4},
]
greedy_set_cover(data)
>>> {'a', 'c'}
- Parameters:
data (iterable) – List of dictionaries with the same fields.
exclude (iterable) – Fields to exclude during search for uniqueness. id is always excluded.
- Returns:
Set of attributes (strings)
- Raises:
NonUnique – The given fields are not enough to ensure uniqueness.
Note that NonUnique is not raised if raise_error is false.
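One way to realize the greedy heuristic mentioned above is to treat every pair of elements as something to be "covered" by a field whose values differ for that pair, and to repeatedly pick the field that separates the most remaining pairs. The sketch below follows that formulation. It is not the library's code: greedy_cover_sketch is a hypothetical name, ValueError stands in for NonUnique, and the real function may break ties differently.

```python
from itertools import combinations


def greedy_cover_sketch(data, exclude=None):
    """Greedily choose fields until every pair of elements is distinguishable."""
    excluded = set(exclude or []) | {"id"}
    fields = sorted(set(data[0]) - excluded)
    # Every pair of elements must be told apart by at least one chosen field
    uncovered = set(combinations(range(len(data)), 2))
    # For each field, the pairs whose values differ on that field
    separates = {
        field: {(i, j) for i, j in uncovered if data[i][field] != data[j][field]}
        for field in fields
    }
    chosen = set()
    while uncovered:
        best = max(fields, key=lambda f: len(separates[f] & uncovered))
        gained = separates[best] & uncovered
        if not gained:
            # Stand-in for the library's NonUnique error
            raise ValueError("Fields are not enough to ensure uniqueness")
        chosen.add(best)
        uncovered -= gained
    return chosen


data = [
    {"a": 1, "b": 2, "c": 3},
    {"a": 2, "b": 2, "c": 3},
    {"a": 1, "b": 2, "c": 4},
]
selected = greedy_cover_sketch(data)
```

Field b never helps here because all three elements share the same value, so only a and c are selected.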
bw_processing.utils module#
- bw_processing.utils.as_uncertainty_type(row)#
- Parameters:
row (dict) –
- Return type:
int
- bw_processing.utils.check_name(name)#
- Parameters:
name (str) –
- Return type:
None
- bw_processing.utils.check_suffix(path, suffix=<class 'str'>)#
Add suffix, if not already in path.
- Parameters:
path (str | Path) –
- Return type:
str
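A small sketch of the suffix check, assuming that "already in path" means the path ends with the suffix and that a leading dot is optional in the argument. Both points are assumptions, and ensure_suffix is a hypothetical name.

```python
from pathlib import Path


def ensure_suffix(path, suffix):
    """Return path as a string, appending suffix if it is not already there."""
    path = str(path)
    if not suffix.startswith("."):
        # Assumption: accept "zip" as well as ".zip"
        suffix = "." + suffix
    return path if path.endswith(suffix) else path + suffix


with_added = ensure_suffix("data", "zip")
unchanged = ensure_suffix("data.zip", ".zip")
from_path = ensure_suffix(Path("archive"), "npy")
```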
- bw_processing.utils.dictionary_formatter(row)#
Format processed array row from dictionary input
- Parameters:
row (dict) –
- Return type:
tuple
- bw_processing.utils.load_bytes(obj)#
- Parameters:
obj (Any) –
- Return type:
Any
- bw_processing.utils.resolve_dict_iterator(iterator, nrows=None)#
Note that this function produces sorted arrays.
- Parameters:
iterator (Any) –
nrows (int | None) –
- Return type:
tuple
Module contents#
- class bw_processing.Datapackage#
Bases:
DatapackageBase
Interface for creating, loading, and using numerical datapackages for Brightway.
Note that there are two entry points to using this class, both separate functions: create_datapackage() and load_datapackage(). Do not create an instance of the class with Datapackage(), unless you like playing with danger :)
Data packages can be stored in memory, in a directory, or in a zip file. When creating data packages for use later, don’t forget to call .finalize_serialization(), or the metadata won’t be written and the data package won’t be usable.
Potential gotchas:
There is currently no way to modify a zipped data package once it is finalized.
Resources that are interfaces to external data sources (either in Python or other) can’t be saved, but must be recreated each time a data package is used.
- add_csv_metadata(*, dataframe, valid_for, name=None, **kwargs)#
Add an iterable metadata object to be stored as a CSV file.
The purpose of storing metadata is to enable data exchange; therefore, this method assumes that data is written to disk.
The normal use case of this method is to link integer indices from either structured or presample arrays to a set of fields that uniquely identifies each object. This allows for matching based on object attributes from computer to computer, where database ids or other computer-generated codes might not be consistent.
Uses pandas to store and load data; therefore, metadata must already be a pandas dataframe.
In contrast with presamples arrays, iterable_data_source cannot be an infinite generator. We need a finite set of data to build a matrix.
In contrast to self.create_structured_array, this always stores the dataframe in self.data; no proxies are used.
- Parameters:
dataframe – Dataframe to be persisted to disk.
valid_for – List of resource names that this metadata is valid for; must be either structured or presample indices arrays. Each item in valid_for has the form ("resource_name", "rows" or "cols"). resource_name should be either a structured or a presamples indices array.
name – The name of this resource. Names must be unique in a given data package.
extra – Dict of extra metadata.
- Returns:
Nothing, but appends objects to self.metadata['resources'] and self.data.
- Raises:
* AssertionError – If inputs are not in correct form
* AssertionError – If valid_for refers to unavailable resources
- Return type:
None
- add_dynamic_array(*, matrix, interface, indices_array, name=None, flip_array=None, keep_proxy=False, matrix_serialize_format_type=None, **kwargs)#
interface must support the presamples API.
- Parameters:
matrix (str) –
interface (Any) –
indices_array (ndarray) –
name (str | None) –
flip_array (ndarray | None) –
keep_proxy (bool) –
matrix_serialize_format_type (MatrixSerializeFormat | None) –
- Return type:
None
- add_dynamic_vector(*, matrix, interface, indices_array, name=None, flip_array=None, keep_proxy=False, matrix_serialize_format_type=None, **kwargs)#
- Parameters:
matrix (str) –
interface (Any) –
indices_array (ndarray) –
name (str | None) –
flip_array (ndarray | None) –
keep_proxy (bool) –
matrix_serialize_format_type (MatrixSerializeFormat | None) –
- Return type:
None
- add_json_metadata(*, data, valid_for, name=None, **kwargs)#
Add an iterable metadata object to be stored as a JSON file.
The purpose of storing metadata is to enable data exchange; therefore, this method assumes that data is written to disk.
The normal use case of this method is to provide names and other metadata for parameters whose values are stored as presamples arrays. The length of data should match the number of rows in the corresponding presamples array, and data is just a list of string labels for the parameters. However, this method can also be used to store other metadata, e.g. for external data resources.
In contrast to self.create_structured_array, this always stores the data in self.data; no proxies are used.
- Parameters:
data – Data to be persisted to disk.
valid_for – Name of structured data or presample array that this metadata is valid for.
name – The name of this resource. Names must be unique in a given data package.
extra – Dict of extra metadata.
- Returns:
Nothing, but appends objects to self.metadata['resources'] and self.data.
- Raises:
* AssertionError – If inputs are not in correct form
* AssertionError – If valid_for refers to unavailable resources
- Return type:
None
- add_persistent_array(*, matrix, data_array, indices_array, name=None, flip_array=None, keep_proxy=False, matrix_serialize_format_type=None, **kwargs)#
- Parameters:
matrix (str) –
data_array (ndarray) –
indices_array (ndarray) –
name (str | None) –
flip_array (ndarray | None) –
keep_proxy (bool) –
matrix_serialize_format_type (MatrixSerializeFormat | None) –
- Return type:
None
- add_persistent_vector(*, matrix, indices_array, name=None, data_array=None, flip_array=None, distributions_array=None, keep_proxy=False, matrix_serialize_format_type=None, **kwargs)#
- Parameters:
matrix (str) –
indices_array (ndarray) –
name (str | None) –
data_array (ndarray | None) –
flip_array (ndarray | None) –
distributions_array (ndarray | None) –
keep_proxy (bool) –
matrix_serialize_format_type (MatrixSerializeFormat | None) –
- Return type:
None
- add_persistent_vector_from_iterator(*, matrix=None, name=None, dict_iterator=None, nrows=None, matrix_serialize_format_type=None, **kwargs)#
Create a persistent vector from an iterator. Uses the utility function resolve_dict_iterator.
This is the only array creation method which produces sorted arrays.
- Parameters:
matrix (str | None) –
name (str | None) –
dict_iterator (Any | None) –
nrows (int | None) –
matrix_serialize_format_type (MatrixSerializeFormat | None) –
- Return type:
None
- finalize_serialization()#
- Return type:
None
- write_modified()#
Write the data in modified files to the filesystem (if allowed).
- class bw_processing.DatapackageBase#
Bases: ABC
Base class for datapackages. Not for normal use; you should use either Datapackage or FilteredDatapackage.
- dehydrated_interfaces()#
Return a list of the resource groups which have dehydrated interfaces
- Return type:
List[str]
- del_resource(name_or_index)#
Remove a resource, and delete its data file, if any.
- Parameters:
name_or_index (str | int) –
- Return type:
None
- del_resource_group(name)#
Remove a resource group, and delete its data files, if any.
Use exclude_resource_group if you want to keep the underlying resource in the filesystem.
- Parameters:
name (str) –
- Return type:
None
- exclude(filters)#
Filter a datapackage to exclude resources matching a filter.
Usage cases:
Filter out a given resource:
exclude_generic({"matrix": "some_label"})
Filter out a resource group with a given kind:
exclude_generic({"group": "some_group", "kind": "some_kind"})
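The matching rule in the usage cases above can be sketched in plain Python. This is only an illustration under the assumption that a resource is dropped when every key in the filter matches its metadata; `exclude_resources` and the sample metadata are hypothetical and not part of bw_processing.

```python
def exclude_resources(resources, filters):
    """Keep a resource unless every key/value pair in `filters` matches it.

    Assumption: a resource is excluded only when ALL filter keys match.
    """
    return [
        resource
        for resource in resources
        if any(resource.get(key) != value for key, value in filters.items())
    ]


# Hypothetical resource metadata, shaped like datapackage resource entries.
resources = [
    {"name": "a.indices", "matrix": "some_label", "group": "a", "kind": "indices"},
    {"name": "a.data", "matrix": "some_label", "group": "a", "kind": "data"},
    {"name": "b.data", "matrix": "other_label", "group": "b", "kind": "data"},
]

kept_by_matrix = exclude_resources(resources, {"matrix": "some_label"})
kept_by_group_kind = exclude_resources(resources, {"group": "a", "kind": "data"})
```

With two keys in the filter, only the resource matching both is removed; the other member of group "a" is kept.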
- Parameters:
filters (Dict[str, str]) –
- Return type:
- filter_by_attribute(key, value)#
Create a new FilteredDatapackage which satisfies the filter resource[key] == value.
All included objects are the same as in the original data package, i.e. no copies are made. No checks are made to ensure consistency with modifications to the original datapackage after the creation of this filtered datapackage.
This method was introduced to allow for the efficient construction of matrices; each datapackage can have data for multiple matrices, and we can then create filtered datapackages which exclusively have data for the matrix of interest. As such, they should be considered read-only, though this is not enforced.
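The "no copies are made" behaviour can be illustrated with a small stand-alone sketch. The function `filter_resources` and the sample lists are hypothetical; plain Python lists stand in for the arrays a real datapackage would hold.

```python
def filter_resources(resources, data, key, value):
    """Select metadata and data objects where resource[key] == value.

    The returned lists reference the original objects; nothing is copied.
    """
    selected = [
        (resource, obj)
        for resource, obj in zip(resources, data)
        if resource.get(key) == value
    ]
    return [r for r, _ in selected], [o for _, o in selected]


# Hypothetical metadata and data, aligned by position.
resources = [
    {"name": "tech.data", "matrix": "technosphere"},
    {"name": "bio.data", "matrix": "biosphere"},
]
data = [[1.0, 2.0], [3.0]]

meta, arrays = filter_resources(resources, data, "matrix", "biosphere")
same_object = arrays[0] is data[1]  # True: the filtered view shares the object
```

Because the objects are shared, a later change to the original data is visible through the filtered view, which is why such views are best treated as read-only.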
- Parameters:
key (str) –
value (Any) –
- Return type:
- get_resource(name_or_index)#
Return data and metadata for name_or_index.
- Parameters:
name_or_index (*) – Name (str) or index (int) of a resource in the existing metadata.
- Raises:
* IndexError – Integer index out of range of given metadata
* ValueError – String name not present in metadata
* NonUnique – String name present in two resource metadata sections
- Returns:
(data object, metadata dict)
- Return type:
(Any, dict)
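The lookup rules and error cases listed above could look roughly like the following. This is a sketch, not the library's code: `find_resource` is a hypothetical helper, `NonUnique` is redefined locally as a stand-in for the library exception of the same name, and this version rejects negative indices, which the library may or may not do.

```python
class NonUnique(Exception):
    """Local stand-in for the library's exception of the same name."""


def find_resource(resources, data, name_or_index):
    """Return (data object, metadata dict) for a name or integer index."""
    if isinstance(name_or_index, int):
        if not 0 <= name_or_index < len(resources):
            raise IndexError("Index out of range of given metadata")
        index = name_or_index
    else:
        matches = [i for i, r in enumerate(resources) if r["name"] == name_or_index]
        if not matches:
            raise ValueError(f"Name {name_or_index!r} not present in metadata")
        if len(matches) > 1:
            raise NonUnique(f"Name {name_or_index!r} present more than once")
        index = matches[0]
    return data[index], resources[index]


# Hypothetical metadata and data, aligned by position.
resources = [{"name": "a.data"}, {"name": "b.data"}]
data = [[1.0], [2.0]]

obj, meta = find_resource(resources, data, "b.data")
```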
- property groups: dict#
Return a dictionary of {group label: filtered datapackage} in the same order as the group labels are first encountered in the datapackage metadata.
Ignores resources which don’t have group labels.
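The ordering rule can be shown with a minimal sketch. It assumes each resource's metadata stores its label under the key "group"; `group_resources` is hypothetical, and it maps labels to lists of metadata dicts rather than to filtered datapackages.

```python
def group_resources(resources):
    """Map group label -> resources, in order of first appearance.

    Resources without a group label are skipped.
    """
    groups = {}
    for resource in resources:
        label = resource.get("group")
        if label is None:
            continue
        # dicts preserve insertion order, so the first sighting fixes the order
        groups.setdefault(label, []).append(resource)
    return groups


# Hypothetical resource metadata.
resources = [
    {"name": "b.indices", "group": "b"},
    {"name": "a.indices", "group": "a"},
    {"name": "notes"},  # no group label: ignored
    {"name": "b.data", "group": "b"},
]

groups = group_resources(resources)
```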
- rehydrate_interface(name_or_index, resource, initialize_with_config=False)#
Substitute the undefined interface in this datapackage with the actual interface resource resource. Loading a datapackage with an interface loads an instance of UndefinedInterface, which should be substituted (rehydrated) with an actual interface instance.
If initialize_with_config is true, the resource is initialized (i.e. resource(**config_data)) with the resource data under the key config. If config is missing, a KeyError is raised.
name_or_index should be the data source name. If this value is a string and doesn’t end with .data, .data is automatically added.
- Parameters:
name_or_index (str | int) –
resource (Any) –
initialize_with_config (bool) –
- Return type:
None
- property resources: list#
- class bw_processing.FilteredDatapackage#
Bases: DatapackageBase
A subset of a datapackage. Used in matrix construction or other data manipulation operations.
Should be treated as read-only.
- class bw_processing.MatrixSerializeFormat(value)#
Bases: str, Enum
Enum with the serialization formats for vectors and matrices.
- NUMPY = 'numpy'#
- PARQUET = 'parquet'#
- class bw_processing.UndefinedInterface#
Bases: object
An interface to external data that isn’t saved to disk.
- bw_processing.as_unique_attributes(data, exclude=None, include=None, raise_error=False)#
Format data as a unique set of attributes and values for use in create_processed_datapackage.
Each element in data must have the attribute id, and it must be unique. However, the field “id” is not used in selecting the unique set of attributes.
If no set of attributes is found that uniquely identifies all features, all fields are used. To have this case raise an error, pass raise_error=True.
- data = [
{},
]
- Parameters:
data (iterable) – List of dictionaries with the same fields.
exclude (iterable) – Fields to exclude during search for uniqueness. id is always excluded.
include (iterable) – Fields to include when returning, even if not unique
- Returns:
(list of field names as strings, dictionary of data ids to values for given field names)
- Raises:
InconsistentFields – Not all features provide all fields.
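To make the idea of a "unique set of attributes" concrete, here is a brute-force sketch that tries field combinations from smallest to largest. This is a swapped-in technique chosen for readability and is not necessarily how the library searches; `unique_fields` and the sample rows are hypothetical.

```python
from itertools import combinations


def unique_fields(data, exclude=()):
    """Return the smallest tuple of fields whose values distinguish every row.

    The "id" field is never used, mirroring the rule described above.
    Falls back to all remaining fields if no combination is unique.
    """
    excluded = set(exclude) | {"id"}
    fields = sorted(set(data[0]) - excluded)
    for size in range(1, len(fields) + 1):
        for combo in combinations(fields, size):
            seen = {tuple(row[f] for f in combo) for row in data}
            if len(seen) == len(data):
                return combo
    return tuple(fields)


# Hypothetical rows; every row has the same fields and a unique id.
data = [
    {"id": 1, "name": "steel", "location": "DE", "unit": "kg"},
    {"id": 2, "name": "steel", "location": "FR", "unit": "kg"},
    {"id": 3, "name": "coal", "location": "DE", "unit": "kg"},
]

fields = unique_fields(data)
# Dictionary of data ids to values for the selected field names.
lookup = {row["id"]: {f: row[f] for f in fields} for row in data}
```

No single field separates the three rows here, so the sketch settles on the pair of location and name.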
- bw_processing.as_unique_attributes_dataframe(df, exclude=None, include=None, raise_error=False)#
- bw_processing.clean_datapackage_name(name)#
Clean string name of characters not allowed in data package names.
Replaces with underscores, and drops multiple underscores.
- Parameters:
name (str) –
- Return type:
str
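A rough sketch of this kind of cleaning with regular expressions follows. The allowed character set (ASCII letters, digits, underscore, hyphen and period) is an assumption, and `clean_name` is a hypothetical helper; the library's own rules may differ, for example in how leading or trailing underscores are handled.

```python
import re


def clean_name(name):
    """Replace disallowed characters with "_" and collapse repeated "_".

    Assumption: only ASCII letters, digits, "_", "-" and "." are allowed.
    """
    replaced = re.sub(r"[^A-Za-z0-9_.\-]", "_", name)
    return re.sub(r"_{2,}", "_", replaced)


cleaned = clean_name("my data: package (v2)")
```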
- bw_processing.create_array(iterable, nrows=None, dtype=<class 'numpy.float32'>)#
Create a numpy array for data iterable. Returns a filepath of a created file (if filepath is provided), or the array.
iterable can be data already in memory, or a generator.
nrows can be supplied, if known. If iterable has a length, it will be determined automatically. If nrows is not known, this function generates chunked arrays until iterable is exhausted, and concatenates them.
Either nrows or ncols must be specified.
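The chunk-then-concatenate strategy for iterables of unknown length can be sketched with the standard library alone. Plain lists stand in for NumPy arrays to keep the example dependency-free, and `chunked_rows` and `collect` are hypothetical names, not the library's functions.

```python
from itertools import islice


def chunked_rows(iterable, chunk_size):
    """Yield lists of at most `chunk_size` rows until `iterable` is exhausted."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def collect(iterable, chunk_size=500):
    """Build intermediate chunks, then join them into one sequence."""
    rows = []
    for chunk in chunked_rows(iterable, chunk_size):
        rows.extend(chunk)
    return rows


# A generator has no len(), so its size is only known once it is consumed.
generator = ((i, i * 2.0) for i in range(1200))
chunks = list(chunked_rows(generator, 500))
```

Only one chunk needs to be materialised at a time while reading, which is the point of the approach for generators and database cursors.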
- bw_processing.create_datapackage(fs=None, name=None, id_=None, metadata=None, combinatorial=False, sequential=False, seed=None, sum_intra_duplicates=True, sum_inter_duplicates=False, matrix_serialize_format_type=MatrixSerializeFormat.NUMPY)#
Create a new data package.
All arguments are optional; if a PyFilesystem2 filesystem is not provided, a MemoryFS will be used.
All metadata elements should follow the datapackage specification.
Licenses are specified as a list in metadata. The default license is the Open Data Commons Public Domain Dedication and License v1.0.
- Parameters:
fs (*) – A Filesystem, optional. A new MemoryFS is used if not provided.
name (*) – str, optional. A new uuid is used if not provided.
id_ (str | None) – Optional. A new uuid is used if not provided.
metadata (dict | None) – Optional. Metadata dictionary following datapackage specification; see above.
combinatorial (bool) – Default False. Policy on how to sample columns across multiple data arrays; see readme.
sequential (bool) – Default False. Policy on how to sample columns in data arrays; see readme.
seed (int | None) – Optional. Seed to use in random number generator.
sum_intra_duplicates (bool) – Default True. Should duplicate elements within a single data resource be summed together, or should the last value replace previous values.
sum_inter_duplicates (bool) – Default False. Should duplicate elements across data resources be summed together, or should the last value replace previous values. Order of data resources is given by the order they are added to the data package.
matrix_serialize_format_type (MatrixSerializeFormat) – Default MatrixSerializeFormat.NUMPY. Matrix serialization format type.
- Returns:
A Datapackage instance.
- Return type:
- bw_processing.create_structured_array(iterable, dtype, nrows=None, sort=False, sort_fields=None)#
Create a numpy structured array for data iterable. Returns a filepath of a created file (if filepath is provided), or the array.
iterable can be data already in memory, or a generator.
nrows can be supplied, if known. If iterable has a length, it will be determined automatically. If nrows is not known, this function generates chunked arrays until iterable is exhausted, and concatenates them.
- bw_processing.generic_directory_filesystem(*, dirpath)#
- Parameters:
dirpath (Path) –
- Return type:
OSFS
- bw_processing.generic_zipfile_filesystem(*, dirpath, filename, write=True)#
- Parameters:
dirpath (Path) –
filename (str) –
write (bool) –
- Return type:
ZipFS
- bw_processing.load_datapackage(fs_or_obj, mmap_mode=None, proxy=False)#
Load an existing datapackage.
Can load proxies to data instead of the data itself, which can be useful when interacting with large arrays or large packages where only a subset of the data will be accessed.
Proxies use something similar to functools.partial to create a callable class instead of returning the raw data (see https://github.com/brightway-lca/bw_processing/issues/9 for why we can’t just use partial). Datapackage access methods (e.g. .get_resource) will automatically resolve proxies when needed.
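The proxy idea, a callable object that defers loading until it is called, can be sketched as follows. `Proxy`, `load_values` and the file name are hypothetical stand-ins; the library's actual proxy class is not shown in this reference.

```python
class Proxy:
    """Callable that stores a loader and its arguments, and loads on demand."""

    def __init__(self, func, *args, **kwargs):
        self.func = func
        self.args = args
        self.kwargs = kwargs

    def __call__(self):
        return self.func(*self.args, **self.kwargs)


calls = []


def load_values(path):
    """Pretend loader that records each call instead of touching the disk."""
    calls.append(path)
    return [1.0, 2.0, 3.0]


proxy = Proxy(load_values, "values.npy")
not_loaded_yet = len(calls) == 0  # creating the proxy loads nothing
values = proxy()  # the data is only loaded here
```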
- Parameters:
fs_or_obj (DatapackageBase | FS) – A Filesystem or an instance of DatapackageBase.
mmap_mode (str | None) – Optional. Define memory mapping mode to use when loading Numpy arrays.
proxy (bool) – Default False. Load proxies instead of complete Numpy arrays; see above.
- Returns:
A Datapackage instance.
- Return type:
- bw_processing.md5(filepath, blocksize=65536)#
Generate MD5 hash for file at filepath
- Parameters:
filepath (str | Path) –
blocksize (int) –
- Return type:
str
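Block-wise hashing keeps memory use bounded for large files. The sketch below uses only the standard library; `file_md5` is a hypothetical name and the code is an illustration of the technique, not the library's source.

```python
import hashlib
import os
import tempfile


def file_md5(filepath, blocksize=65536):
    """Return the hex MD5 digest of a file, reading `blocksize` bytes at a time."""
    hasher = hashlib.md5()
    with open(filepath, "rb") as f:
        # iter() with a sentinel stops when read() returns an empty bytes object
        for block in iter(lambda: f.read(blocksize), b""):
            hasher.update(block)
    return hasher.hexdigest()


# Write a throwaway file and hash it with a small block size.
payload = b"abc" * 100_000
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.bin")
    with open(path, "wb") as f:
        f.write(payload)
    digest = file_md5(path, blocksize=4096)
```

The digest does not depend on the block size; it matches hashing the whole payload in one call.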
- bw_processing.merge_datapackages_with_mask(first_dp, first_resource_group_label, second_dp, second_resource_group_label, mask_array, output_fs=None, metadata=None)#
Merge two resources using a Numpy boolean mask. Returns elements from first_dp where the mask is True, otherwise second_dp.
Both resource arrays, and the filter mask, must have the same length.
Both datapackages must be static, i.e. not interfaces. This is because we don’t yet have the functionality to select only some of the values in a resource group in matrix_utils.
This function currently will not mask or filter JSON or CSV metadata.
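The element-wise selection rule is the same one that numpy.where(mask, first, second) applies to arrays. A dependency-free sketch with plain lists, where `merge_with_mask` is a hypothetical helper:

```python
def merge_with_mask(first, second, mask):
    """Take first[i] where mask[i] is true, otherwise second[i]."""
    if not (len(first) == len(second) == len(mask)):
        raise ValueError("Both arrays and the mask must have the same length")
    return [a if keep else b for a, b, keep in zip(first, second, mask)]


merged = merge_with_mask(
    [1.0, 2.0, 3.0],
    [10.0, 20.0, 30.0],
    [True, False, True],
)
```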
- Parameters:
first_dp (*) – The datapackage from which values will be taken when mask_array is True.
first_resource_group_label (*) – Label of the resource group in first_dp to select values from.
second_dp (*) – The datapackage from which values will be taken when mask_array is False.
second_resource_group_label (*) – Label of the resource group in second_dp to select values from.
mask_array (*) – Boolean numpy array
output_fs (*) – Filesystem to write new datapackage to, if any.
metadata (*) – Metadata for new datapackage, if any.
- Returns:
A Datapackage instance. Will write the resulting datapackage to output_fs if provided.
- Return type:
- bw_processing.reindex(datapackage, metadata_name, data_iterable, fields=None, id_field_datapackage='id', id_field_destination='id')#
Use the metadata to set the integer indices in datapackage to those used in data_iterable.
Used in data exchange. Often, the integer ids provided in the data package are arbitrary, and need to be mapped to the values present in your database.
Updates the datapackage in place.
- Parameters:
datapackage (*) – datapackage or Filesystem. Input to load_datapackage function.
metadata_name (*) – Name identifying a CSV metadata resource in datapackage
data_iterable (*) – Iterable which returns objects that support .get().
fields (*) – Optional list of fields to use while matching
id_field_datapackage (*) – String identifying the column providing an integer id in the datapackage
id_field_destination (*) – String identifying the column providing an integer id in data_iterable
- Raises:
* KeyError – data_iterable is missing id_field_destination field
* KeyError – metadata_name is missing id_field_datapackage field
* NonUnique – Multiple objects found in data_iterable which match fields in datapackage
* KeyError – metadata_name is not in datapackage
* KeyError – No object found in data_iterable which matches fields in datapackage
* ValueError – metadata_name is not CSV metadata.
* ValueError – The resources given for metadata_name are not present in this datapackage
* AttributeError – data_iterable doesn’t support field retrieval using .get().
- Returns:
Datapackage instance with modified data
- Return type:
None
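The mapping step behind reindexing can be sketched without the library: match each metadata row to exactly one object in the destination data on the chosen fields, then translate the old integer ids. `build_id_mapping` and the sample rows are hypothetical, and a ValueError stands in for the library's NonUnique exception.

```python
def build_id_mapping(metadata_rows, data_iterable, fields,
                     id_field_datapackage="id", id_field_destination="id"):
    """Map each datapackage id to the id of the single matching object."""
    candidates = list(data_iterable)  # allow several passes over the data
    mapping = {}
    for row in metadata_rows:
        matches = [
            obj for obj in candidates
            if all(obj.get(field) == row.get(field) for field in fields)
        ]
        if not matches:
            raise KeyError(f"No object matches {row}")
        if len(matches) > 1:
            # The library documents NonUnique here; ValueError is a stand-in.
            raise ValueError(f"Multiple objects match {row}")
        mapping[row[id_field_datapackage]] = matches[0][id_field_destination]
    return mapping


# Hypothetical CSV metadata rows and destination database records.
metadata_rows = [
    {"id": 0, "name": "steel", "location": "DE"},
    {"id": 1, "name": "coal", "location": "PL"},
]
database = [
    {"id": 4021, "name": "coal", "location": "PL"},
    {"id": 77, "name": "steel", "location": "DE"},
]

mapping = build_id_mapping(metadata_rows, database, fields=["name", "location"])
old_indices = [0, 1, 1, 0]
new_indices = [mapping[i] for i in old_indices]
```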
- bw_processing.reset_index(datapackage, metadata_name)#
Reset the numerical indices in datapackage to sequential integers starting from zero.
Updates the datapackage in place.
- Parameters:
datapackage (*) – datapackage or Filesystem. Input to load_datapackage function.
metadata_name (*) – Name identifying a CSV metadata resource in datapackage
- Returns:
Datapackage instance with modified data
- Return type:
- bw_processing.safe_filename(string, add_hash=True, full=False)#
Convert arbitrary strings to make them safe for filenames. Substitutes strange characters, and uses unicode normalization.
If add_hash is true, appends a hash of string to avoid name collisions.
From http://stackoverflow.com/questions/295135/turn-a-string-into-a-valid-filename-in-python
- Parameters:
string (str | bytes) –
add_hash (bool) –
full (bool) –
- Return type:
str
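One way to build such a filename is to normalize the text, drop what cannot be represented safely, and append a short hash of the original string. The sketch below is an assumption-laden illustration: `make_safe` is hypothetical, and the exact character rules, the use of MD5 and the 8-character short hash are choices made for this example rather than confirmed library behaviour.

```python
import hashlib
import re
import unicodedata


def make_safe(string, add_hash=True, full=False):
    """Return a filesystem-friendly version of `string`."""
    text = string.decode("utf-8") if isinstance(string, bytes) else string
    # NFKD splits accented characters into a base letter plus combining marks
    normalized = unicodedata.normalize("NFKD", text)
    ascii_only = normalized.encode("ascii", "ignore").decode("ascii")
    # Assumption: keep word characters, whitespace, "." and "-"; drop the rest
    safe = re.sub(r"[^\w\s.-]", "", ascii_only).strip()
    safe = re.sub(r"[-\s]+", "-", safe)
    if add_hash:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        safe += "." + (digest if full else digest[:8])
    return safe


plain = make_safe("Électricité: high voltage", add_hash=False)
hashed = make_safe("Électricité: high voltage")
```

Two different inputs that normalize to the same text still get different names once the hash of the original string is appended.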
- bw_processing.simple_graph(data, fs=None, **metadata)#
Easy creation of simple datapackages with only persistent vectors.
data is a dictionary with the form:
matrix_name (str): [
    (row id (int), col id (int), value (float), flip (bool, default False))
]
fs is a filesystem. metadata are passed as kwargs to create_datapackage().
Returns the datapackage.
- Parameters:
data (dict) –
fs (FS | None) –