|
@@ -1,6 +1,6 @@
|
|
# Global CO2 from Cement Production Dataset
|
|
# Global CO2 from Cement Production Dataset
|
|
|
|
|
|
-This repository downloads the Andrew dataset on global CO2 emissions from cement production from Zenodo.
|
|
|
|
|
|
+This repository downloads the Andrew dataset on global CO2 emissions from cement production from [Zenodo](https://zenodo.org/records/10008931).
|
|
|
|
|
|
## Description
|
|
## Description
|
|
|
|
|
|
@@ -9,7 +9,17 @@ The downloaded dataset can then be converted into CSV (.csv file extension) or N
|
|
The data management tool [DataLad](http://docs.datalad.org/en/stable/) is used to version control the data sets.
|
|
The data management tool [DataLad](http://docs.datalad.org/en/stable/) is used to version control the data sets.
|
|
Commands to run the scripts are executed via the pydoit package.
|
|
Commands to run the scripts are executed via the pydoit package.
|
|
|
|
|
|
-### Installation
|
|
|
|
|
|
+## DataLad datasets and how to use them
|
|
|
|
+
|
|
|
|
+This repository is a [DataLad](https://www.datalad.org/) dataset. It provides
|
|
|
|
+fine-grained data access down to the level of individual files, and allows for
|
|
|
|
+tracking future updates. In order to use this repository for data retrieval,
|
|
|
|
+[DataLad](https://www.datalad.org/) is required. It is a free and open source
|
|
|
|
+command line tool, available for all major operating systems, and builds on
|
|
|
|
+Git and [git-annex](https://git-annex.branchable.com/) to allow sharing,
|
|
|
|
+synchronizing, and version controlling collections of large files.
|
|
|
|
+
|
|
|
|
+## Installation
|
|
|
|
|
|
- Install datalad according to the [DataLad handbook](https://handbook.datalad.org/en/latest/intro/installation.html). It is recommended to install globally.
|
|
- Install datalad according to the [DataLad handbook](https://handbook.datalad.org/en/latest/intro/installation.html). It is recommended to install globally.
|
|
- DataLad is based on Git. [Git](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git) needs to be installed to run DataLad.
|
|
- DataLad is based on Git. [Git](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git) needs to be installed to run DataLad.
|
|
@@ -18,39 +28,63 @@ Commands to run the scripts are executed via the pydoit package.
|
|
|
|
|
|
## Getting Started
|
|
## Getting Started
|
|
|
|
|
|
-### 1. Clone the repository
|
|
|
|
|
|
+### Clone the repository
|
|
|
|
|
|
-Download the repository using the following command.
|
|
|
|
|
|
+A DataLad dataset can be `cloned` by running
|
|
```
|
|
```
|
|
datalad clone <url>
|
|
datalad clone <url>
|
|
```
|
|
```
|
|
Do not use **git clone** to download the repository! This way DataLad will not have the necessary
|
|
Do not use **git clone** to download the repository! This way DataLad will not have the necessary
|
|
-information to run the program.
|
|
|
|
|
|
+information to run the program. Once a dataset is cloned, it is a light-weight directory on your local machine.
|
|
|
|
+At this point, it contains only small metadata and information on the identity
|
|
|
|
+of the files in the dataset, but not actual *content* of the (sometimes large)
|
|
|
|
+data files.
|
|
|
|
+
|
|
|
|
|
|
-### 2. Easy Access
|
|
|
|
-Users who simply want to download the dataset have the option to access both the
|
|
|
|
-original and extracted files with the following command.
|
|
|
|
|
|
+### Easy Access
|
|
|
|
+Users who simply want to retrieve the dataset have the option to access both the
|
|
|
|
+original and extracted files with
|
|
```
|
|
```
|
|
datalad get <filename>
|
|
datalad get <filename>
|
|
```
|
|
```
|
|
-For example, the CSV file for the 2023/09/13 release can be downloaded with:
|
|
|
|
|
|
+This command will trigger a download of the files, directories, or subdatasets
|
|
|
|
+you have specified.
|
|
|
|
+
|
|
|
|
+For example, the CSV file for the 2023/09/13 release can be downloaded with
|
|
```
|
|
```
|
|
datalad get extracted_data/v230913/Robbie_Andrew_Cement_Production_CO2_230913.csv
|
|
datalad get extracted_data/v230913/Robbie_Andrew_Cement_Production_CO2_230913.csv
|
|
```
|
|
```
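The path in the example above appears to follow a pattern built from the release identifier. The following minimal sketch shows how such a path could be assembled for use with `datalad get`. The helper name `extracted_csv_path` is hypothetical, and the sketch assumes that other releases use the same directory and file naming as the 230913 release, which this README does not confirm.

```python
from pathlib import PurePosixPath


def extracted_csv_path(version: str) -> str:
    """Return the assumed repository-relative CSV path for a release.

    `version` is the release identifier in YYMMDD form, e.g. "230913".
    The naming pattern is inferred from the single example in this README.
    """
    if len(version) != 6 or not (version.isascii() and version.isdigit()):
        raise ValueError(f"expected a six-digit YYMMDD string, got {version!r}")
    path = (
        PurePosixPath("extracted_data")
        / f"v{version}"
        / f"Robbie_Andrew_Cement_Production_CO2_{version}.csv"
    )
    return str(path)


if __name__ == "__main__":
    # Print the path so it can be passed to `datalad get` on the command line.
    print(extracted_csv_path("230913"))
```

The printed path can then be passed to `datalad get` as in the example above.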
|
|
|
|
+### Stay up-to-date
|
|
|
|
+
|
|
|
|
+DataLad datasets can be updated. The command `datalad update` will *fetch*
|
|
|
|
+updates and store them on a different branch (by default
|
|
|
|
+`remotes/origin/master`). Running
|
|
|
|
+
|
|
|
|
+```
|
|
|
|
+datalad update --merge
|
|
|
|
+```
|
|
|
|
+
|
|
|
|
+will *pull* available updates and integrate them in one go.
|
|
|
|
+
|
|
|
|
+### Find out what has been done
|
|
|
|
+
|
|
|
|
+DataLad datasets contain their history in the ``git log``. By running ``git
|
|
|
|
+log`` (or a tool that displays Git history) in the dataset or on specific
|
|
|
|
+files, you can find out what has been done to the dataset or to individual
|
|
|
|
+files by whom, and when.
|
|
|
|
|
|
|
|
+## Executing the program
|
|
|
|
|
|
-### 3. Executing the program
|
|
|
|
-
|
|
|
|
-#### 3.1 Set up the virtual environment with doit
|
|
|
|
|
|
+#### Set up the virtual environment with doit
|
|
```
|
|
```
|
|
doit setup_env
|
|
doit setup_env
|
|
```
|
|
```
|
|
-#### <a name="download"></a> 3.2 Download the version from the command line.
|
|
|
|
|
|
+#### <a name="download"></a>Download the version from the command line.
|
|
This will download all files from Zenodo as they are.
|
|
This will download all files from Zenodo as they are.
|
|
```
|
|
```
|
|
doit download_version --version <YYMMDD>
|
|
doit download_version --version <YYMMDD>
|
|
```
|
|
```
|
|
-#### <a name="convert"></a> 3.3 Convert the data sets into CSV and NetCDF files.
|
|
|
|
|
|
+#### <a name="convert"></a>Convert the data sets into CSV and NetCDF files.
|
|
```
|
|
```
|
|
doit read_version --version <YYMMDD>
|
|
doit read_version --version <YYMMDD>
|
|
```
|
|
```
|
|
@@ -58,7 +92,7 @@ doit read_version --version <YYMMDD>
|
|
## <a name="newversion"></a> How to add a new version
|
|
## <a name="newversion"></a> How to add a new version
|
|
|
|
|
|
|
|
|
|
-1. To add a new version go to **versions.py** in the **src** directory and create a new value in the
|
|
|
|
|
|
+To add a new version, go to **versions.py** in the **src** directory and create a new entry in the
|
|
dictionary. Fill all the required information similar to the previous entries.
|
|
dictionary. Fill all the required information similar to the previous entries.
|
|
For example, the key _v230913_ in the _versions_ dictionary describes the 13-Sep-2023 release.
|
|
For example, the key _v230913_ in the _versions_ dictionary describes the 13-Sep-2023 release.
|
|
````python
|
|
````python
|
|
@@ -84,8 +118,7 @@ versions = {
|
|
},
|
|
},
|
|
}
|
|
}
|
|
````
|
|
````
|
|
-
|
|
|
|
-2. Then run the two commands as described in [3.2] and [3.3].
|
|
|
|
|
|
+Then run the two commands `download_version` and `read_version` as described in **Executing the program**.
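The naming convention in the example above (_v230913_ for the 13-Sep-2023 release) can be checked before a new entry is added. The sketch below is an illustration only: the helper `release_date` is hypothetical and not part of **versions.py**, and it assumes every key is the letter `v` followed by the release date as YYMMDD.

```python
from datetime import date, datetime


def release_date(version_key: str) -> date:
    """Parse a key such as "v230913" into the release date it encodes."""
    if not version_key.startswith("v"):
        raise ValueError(f"version key must start with 'v': {version_key!r}")
    # strptime raises ValueError if the remainder is not a valid YYMMDD date.
    return datetime.strptime(version_key[1:], "%y%m%d").date()


if __name__ == "__main__":
    print(release_date("v230913").isoformat())
```

A check like this catches typos such as a swapped month and day before the download step is run.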
|
|
|
|
|
|
## Help
|
|
## Help
|
|
Show all doit commands
|
|
Show all doit commands
|
|
@@ -103,7 +136,7 @@ Get help on a specific command
|
|
doit help <command>
|
|
doit help <command>
|
|
```
|
|
```
|
|
|
|
|
|
-## For developers
|
|
|
|
|
|
+## Contributing
|
|
### Repository structure
|
|
### Repository structure
|
|
- **.datalad/** contains config file for datalad
|
|
- **.datalad/** contains config file for datalad
|
|
- **downloaded_data/** contains original data from Zenodo.
|
|
- **downloaded_data/** contains original data from Zenodo.
|
|
@@ -124,15 +157,15 @@ doit help <command>
|
|
- **setup.cfg** requirements
|
|
- **setup.cfg** requirements
|
|
- **setup.py** installs python packages
|
|
- **setup.py** installs python packages
|
|
|
|
|
|
-### Make sure to connect with your siblings
|
|
|
|
|
|
+### Make sure to correctly set up the DataLad siblings
|
|
Git repositories can configure clones of a dataset as _remotes_ in order to fetch, pull, or push from and to them. A `datalad sibling` is the equivalent of a git clone that is configured as a remote.
|
|
Git repositories can configure clones of a dataset as _remotes_ in order to fetch, pull, or push from and to them. A `datalad sibling` is the equivalent of a git clone that is configured as a remote.
|
|
|
|
|
|
-**Query information** about about all known siblings with:
|
|
|
|
|
|
+**Query information** about all known siblings with
|
|
```
|
|
```
|
|
datalad siblings
|
|
datalad siblings
|
|
```
|
|
```
|
|
|
|
|
|
-**Add a sibling** to allow pushing to github:
|
|
|
|
|
|
+**Add a sibling** to allow pushing to GitHub
|
|
```
|
|
```
|
|
datalad siblings add --dataset . --name <name> --url git@github.com:JGuetschow/Global_CO2_from_cement_production.git
|
|
datalad siblings add --dataset . --name <name> --url git@github.com:JGuetschow/Global_CO2_from_cement_production.git
|
|
```
|
|
```
|
|
@@ -144,79 +177,24 @@ datalad push --to <name>
|
|
|
|
|
|
```
|
|
```
|
|
|
|
|
|
-### instructions for merge requests
|
|
|
|
-# About this dataset
|
|
|
|
-
|
|
|
|
-## General information
|
|
|
|
-
|
|
|
|
-This is a DataLad dataset (id: 24f90b12-e4a9-4e2c-995d-a54ed4cd49c7).
|
|
|
|
-
|
|
|
|
-## DataLad datasets and how to use them
|
|
|
|
-
|
|
|
|
-This repository is a [DataLad](https://www.datalad.org/) dataset. It provides
|
|
|
|
-fine-grained data access down to the level of individual files, and allows for
|
|
|
|
-tracking future updates. In order to use this repository for data retrieval,
|
|
|
|
-[DataLad](https://www.datalad.org/) is required. It is a free and open source
|
|
|
|
-command line tool, available for all major operating systems, and builds up on
|
|
|
|
-Git and [git-annex](https://git-annex.branchable.com/) to allow sharing,
|
|
|
|
-synchronizing, and version controlling collections of large files.
|
|
|
|
-
|
|
|
|
-More information on how to install DataLad and [how to install](http://handbook.datalad.org/en/latest/intro/installation.html)
|
|
|
|
-it can be found in the [DataLad Handbook](https://handbook.datalad.org/en/latest/index.html).
|
|
|
|
-
|
|
|
|
-### Get the dataset
|
|
|
|
-
|
|
|
|
-A DataLad dataset can be `cloned` by running
|
|
|
|
-
|
|
|
|
-```
|
|
|
|
-datalad clone <url>
|
|
|
|
-```
|
|
|
|
-
|
|
|
|
-Once a dataset is cloned, it is a light-weight directory on your local machine.
|
|
|
|
-At this point, it contains only small metadata and information on the identity
|
|
|
|
-of the files in the dataset, but not actual *content* of the (sometimes large)
|
|
|
|
-data files.
|
|
|
|
-
|
|
|
|
-### Retrieve dataset content
|
|
|
|
-
|
|
|
|
-After cloning a dataset, you can retrieve file contents by running
|
|
|
|
-
|
|
|
|
-```
|
|
|
|
-datalad get <path/to/directory/or/file>
|
|
|
|
-```
|
|
|
|
-
|
|
|
|
-This command will trigger a download of the files, directories, or subdatasets
|
|
|
|
-you have specified.
|
|
|
|
-
|
|
|
|
-DataLad datasets can contain other datasets, so called *subdatasets*. If you
|
|
|
|
-clone the top-level dataset, subdatasets do not yet contain metadata and
|
|
|
|
-information on the identity of files, but appear to be empty directories. In
|
|
|
|
-order to retrieve file availability metadata in subdatasets, run
|
|
|
|
-
|
|
|
|
-```
|
|
|
|
-datalad get -n <path/to/subdataset>
|
|
|
|
-```
|
|
|
|
-
|
|
|
|
-Afterwards, you can browse the retrieved metadata to find out about subdataset
|
|
|
|
-contents, and retrieve individual files with `datalad get`. If you use
|
|
|
|
-`datalad get <path/to/subdataset>`, all contents of the subdataset will be
|
|
|
|
-downloaded at once.
|
|
|
|
|
|
+### Issues
|
|
|
|
+There are always open issues regarding the code, some easy to resolve and some harder.
|
|
|
|
|
|
-### Stay up-to-date
|
|
|
|
|
|
+### Your ideas
|
|
|
|
+Contributing is of course not limited to the categories above. If you have ideas for improvements, just open an issue or a discussion page to discuss your idea with the community.
|
|
|
|
|
|
-DataLad datasets can be updated. The command `datalad update` will *fetch*
|
|
|
|
-updates and store them on a different branch (by default
|
|
|
|
-`remotes/origin/master`). Running
|
|
|
|
|
|
+### Technical HowTo for contributors
|
|
|
|
+Because this is a DataLad repository that uses both GitHub and gin, the process of contributing code and data is a bit different from
|
|
|
|
+pure git repositories. As the data is only stored on gin, the gin repository is the source to start
|
|
|
|
+from. As gin currently has a problem with forks (the annexed data is not
|
|
|
|
+forked) we have to use branches for development and, thus, to contribute you
|
|
|
|
+first need to contact the maintainers to get write access to the repository.
|
|
|
|
+You have to clone the repository using ssh to be able to push to it.
|
|
|
|
+For that you first need to store your public ssh key on the gin server
|
|
|
|
+(settings -> SSH Keys).
|
|
|
|
|
|
-```
|
|
|
|
-datalad update --merge
|
|
|
|
-```
|
|
|
|
-
|
|
|
|
-will *pull* available updates and integrate them in one go.
|
|
|
|
-
|
|
|
|
-### Find out what has been done
|
|
|
|
-
|
|
|
|
-DataLad datasets contain their history in the ``git log``. By running ``git
|
|
|
|
-log`` (or a tool that displays Git history) in the dataset or on specific
|
|
|
|
-files, you can find out what has been done to the dataset or to individual
|
|
|
|
-files by whom, and when.
|
|
|
|
|
|
+### Instructions for merge requests
|
|
|
|
+Once you have everything set up, you can create a new branch and work there.
|
|
|
|
+When you're done, create a pull request to integrate your work into the main
|
|
|
|
+branch. This should be done on GitHub first to allow for discussion and review. Afterwards the changes
|
|
|
|
+can be implemented on gin.
|