Global CO2 from Cement Production Dataset
This repository downloads the Andrews dataset on global CO2 emissions from cement production from Zenodo.
The dataset is converted to the PRIMAP2 format and provided in both the csv-based interchange format and the netCDF-based native primap2 format. Several versions of the dataset are available within this repository.
Description
This repository downloads data on global CO2 emissions from cement production from Zenodo.
The downloaded dataset can then be converted into CSV (.csv file extension) or NetCDF (.nc file extension) format. Converted data are available for the following versions:
The data management tool DataLad is used to version control the data sets.
Commands to manage the data are executed via the pydoit package.
DataLad datasets and how to use them
This repository is a DataLad dataset. It provides
fine-grained data access down to the level of individual files, and allows for
tracking future updates. In order to use this repository for data retrieval,
DataLad is required. It is a free and open source
command line tool, available for all major operating systems, and builds upon
Git and git-annex to allow sharing,
synchronizing, and version controlling collections of large files.
Installation
Note that for simply downloading the dataset, Python and pydoit are not required.
- Install datalad according to the DataLad handbook. We recommend installing datalad globally, because this repository does not manage it from within the venv.
- Install Python
- Install pydoit. As with DataLad, we recommend installing pydoit globally, because this repository does not manage it from within the venv.
Getting Started
Clone the repository
A DataLad dataset can be cloned
by running
datalad clone
Do not use git clone to download the repository! If you use plain git clone, DataLad will not have the necessary
information to manage the dataset. Once the repository is cloned, it is like using a standard light-weight repository on your local machine.
At this point, the repository contains only small metadata and information on the identity
of the files in the dataset, but not the actual content of the (sometimes large)
files.
Easy access
Users who simply want to retrieve the dataset have the option to access both the
original and extracted files with
datalad get <filename>
This command will trigger a download of the files, directories, or subdatasets
you have specified.
For example, the CSV file for the 13-Sep-2023 release can be downloaded with
datalad get extracted_data/v230913/Robbie_Andrew_Cement_Production_CO2_230913.csv
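The path in the example above appears to follow a fixed pattern. The following is a minimal Python sketch for building such a path from a version key, assuming every release uses the same folder and file naming as the 13-Sep-2023 example. The helper name extracted_csv_path is hypothetical and not part of this repository.

```python
from pathlib import PurePosixPath


def extracted_csv_path(version: str) -> str:
    """Build the assumed relative path of the extracted CSV for a version key.

    Assumption: all releases follow the layout of the v230913 example.
    """
    if not version.startswith("v"):
        raise ValueError(f"expected a key like 'v230913', got {version!r}")
    short = version[1:]  # "v230913" -> "230913"
    filename = f"Robbie_Andrew_Cement_Production_CO2_{short}.csv"
    return str(PurePosixPath("extracted_data") / version / filename)


print(extracted_csv_path("v230913"))
```

The resulting string can be passed to datalad get as the filename argument.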
Stay up-to-date
DataLad datasets can be updated. The command datalad update will fetch updates and store them on a different branch from the one you're currently working on (by default remotes/origin/master). Running datalad update --merge will fetch available updates and integrate them in one go.
Find out what has been done
DataLad datasets contain their history in the git log. By running git log (or a tool that displays Git history) in the dataset or on specific files, you can find out what has been done to the dataset or to individual files, by whom, and when.
Contributing
For those who wish to contribute to the repository, below we go through the key commands you will need to use.
Set up the virtual environment with doit
doit setup_env
Download the version from the command line.
This will download all files from Zenodo, unmodified, for a specific version. Note that this version must already be listed in versions.py; if you want to add a new version, see the section on adding a new version below.
doit download_version version=<vYYMMDD>
For example, the following command will download all files from Zenodo for the 16-Sep-2022 release
doit download_version version=v220916
Read the version from the command line.
Reading data refers to the conversion of the downloaded files into CSV and NetCDF format. Similarly to the download
command, the data is read for a specific version with
doit read_version version=<vYYMMDD>
For example, the following command will read the 16-Sep-2022 release
doit read_version version=v220916
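The version keys used by both commands encode the release date as vYYMMDD. As a small illustration (not code from this repository), the following Python sketch checks that a key has this form and converts it to the date style used in this README. The function name version_key_to_date is hypothetical, and month abbreviations assume the default (English) locale.

```python
import re
from datetime import datetime


def version_key_to_date(version: str) -> str:
    """Convert a key such as 'v220916' to a date string such as '16-Sep-2022'."""
    match = re.fullmatch(r"v(\d{6})", version)
    if match is None:
        raise ValueError(f"expected a key of the form vYYMMDD, got {version!r}")
    # strptime raises ValueError for impossible dates such as month 13.
    parsed = datetime.strptime(match.group(1), "%y%m%d")
    return parsed.strftime("%d-%b-%Y")


print(version_key_to_date("v220916"))
```

A check like this can catch mistyped keys before a download or read is started.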
How to add a new version
To add a new version, open src/versions.py and create a new entry in the versions dictionary. Fill in all the required information, following the pattern of the previous entries.
For example, the value for key "v230913" in the versions dictionary describes the 13-Sep-2023 release.
versions = {
    "v230913": {
        'date': '13-Sep-2023',
        'ver_str_long': 'version 230913',
        'ver_str_short': '230913',
        "folder": "v230913",
        "transpose": False,
        "filename": "0. GCP-CEM.csv",
        'ref': '10.5281/zenodo.8339353',
        'ref2': '10.5194/essd-11-1675-2019',
        'title': 'Global CO2 emissions from cement production',
        'institution': "CICERO - Center for International Climate Research",
        'filter_keep': {},
        'filter_remove': {},
        'contact': "johannes.guetschow@climate-resource.com",
        'comment': ("Published by Robbie Andrew, converted to PRIMAP2 format by "
                    "Johannes Gütschow"),
        'unit': 'kt * CO2 / year',
        'country_code': True,
    },
}
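Before running any commands for a new entry, it can help to check that no field was forgotten. The sketch below is a hypothetical helper, not part of this repository. It assumes that every key appearing in the v230913 example is required, which may be stricter than what read_version.py actually needs.

```python
# Assumption: the set of required keys is inferred from the v230913 example entry.
REQUIRED_KEYS = {
    "date", "ver_str_long", "ver_str_short", "folder", "transpose",
    "filename", "ref", "ref2", "title", "institution", "filter_keep",
    "filter_remove", "contact", "comment", "unit", "country_code",
}


def missing_keys(entry: dict) -> list[str]:
    """Return the assumed-required keys that are absent from a version entry."""
    return sorted(REQUIRED_KEYS - set(entry))


# Hypothetical, deliberately incomplete entry used only for illustration.
draft_entry = {"date": "13-Sep-2023", "folder": "v230913"}
print(missing_keys(draft_entry))
```

An empty list means the entry has the same fields as the example above.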
Then run the two commands, download_version and read_version, in that order for your newly added version, as described in Contributing.
Help
Show all doit commands
doit help
See a list with possible doit commands specific to this repository
doit list
Get help on a specific command
doit help <command>
Repository structure
.datalad/
contains config file for datalad
downloaded_data/
contains original data from Zenodo.
extracted_data/
contains data in .csv and .nc format
literature/
contains link to publication by Robbie M. Andrew. Can be downloaded with datalad get command
src/
download_version.py
downloads files from Zenodo for a given version. The version to download will be taken from the command line using argparse.
download_version_datalad.py
calls datalad to run the data download function.
helper_functions.py
contains a function to map country codes.
read_version.py
reads the data for a given version and saves to PRIMAP2 native and
interchange format.
read_version_datalad.py
calls datalad to run the data reading function.
versions.py
contains a dictionary with metadata for each release. This file should be updated when adding a new version.
dodo.py
defines pydoit commands.
pyproject.toml
configuration file
requirements.txt
requirements
requirements_dev.txt
development requirements
setup.cfg
requirements
setup.py
installs python packages
Make sure to correctly set up the DataLad siblings
Git repositories can configure clones of a dataset as remotes in order to fetch, pull, or push from and to them. A datalad sibling is the equivalent of a git clone that is configured as a remote.
Query information about all known siblings with
datalad siblings
Add a sibling to allow pushing to github
datalad siblings add --dataset . --name <name> --url git@github.com:JGuetschow/Global_CO2_from_cement_production.git
SSH access is needed to run this command. Note that name can be freely chosen (we tend to just use "github" for GitHub siblings).
Push to the GitHub repository
datalad push --to <name>
where name should match the name you used above.
Issues
There are always open issues regarding the code, some of them easy to resolve, some harder.
Your ideas
Contributing is of course not limited to the categories above. If you have ideas for improvements, just open an issue or a discussion page to discuss your idea with the community.
Technical HowTo for contributors
As we have a datalad repository using GitHub and gin, the process of contributing code and data is a bit different from
pure git repositories. As the data is only stored on gin, the gin repository is the source to start
from. Because gin currently has a problem with forks (the annexed data is not
forked), we have to use branches for development. To contribute, you
therefore first need to contact the maintainers to get write access to the gin repository.
You have to clone the repository using ssh to be able to push to it.
For that you first need to store your public ssh key on the gin server
(settings -> SSH Keys).
Instructions for merge requests
Once you have everything set up, you can create a new branch and work there.
When you're done, create a pull request to integrate your work into the main
branch. This should be done first on github to allow for discussions and review (gin servers don't have the same review features). Afterwards the changes
can be actually merged on gin (so that the annex is merged properly too).