
add readme

Daniel Busch · 1 year ago
commit c361d43bc8
2 files changed: 254 insertions and 1 deletion

  1. .gitignore (+2, -1)
  2. README.md (+252, -0)

+ 2 - 1
.gitignore

@@ -1,6 +1,7 @@
 __pycache__
 venv
-.doit.db*
+.doit.db.*
+.DS_Store
 .idea
 *.ipynb
 .ipynb_checkpoints

+ 252 - 0
README.md

@@ -0,0 +1,252 @@
+# Global CO2 from Cement Production Dataset
+
+This repository downloads the Andrews dataset on global CO2 emissions from cement production from [Zenodo](https://zenodo.org/records/831454).
+The dataset is converted to the PRIMAP2 format and provided in both the CSV-based interchange format and the netCDF-based native PRIMAP2 format. Several versions of the dataset are available within this repository.
+
+## Description
+
+This repository downloads data on global CO2 emissions from cement production from [Zenodo](https://zenodo.org/records/831454).
+The downloaded dataset can then be converted into CSV (.csv file extension) or NetCDF (.nc file extension) format. Converted data are available for the following versions:
+
+| Version | Source |
+|---------|--------|
+| v231016 | [Zenodo](https://zenodo.org/records/10008931) |
+| v230913 | [Zenodo](https://zenodo.org/records/8339353) |
+| v230428 | [Zenodo](https://zenodo.org/records/7875557) |
+| v220915 | [Zenodo](https://zenodo.org/records/7081360) |
+| v220516 | [Zenodo](https://zenodo.org/records/6553090) |
+
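+The converted files can be inspected with the [primap2](https://primap2.readthedocs.io/en/stable/) Python package. Below is a minimal sketch, assuming primap2 is installed and that the netCDF file name mirrors the CSV name used later in this README; check the actual file names in `extracted_data/` first.
+
+```python
+import primap2
+
+# The file name is an assumption based on the CSV name; adjust to the actual file.
+ds = primap2.open_dataset(
+    "extracted_data/v230913/Robbie_Andrew_Cement_Production_CO2_230913.nc"
+)
+print(ds)
+```
+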
+The data management tool [DataLad](http://docs.datalad.org/en/stable/) is used to version control the data sets.
+Commands to manage the data are executed via the [pydoit](https://pydoit.org/) package.
+
+## DataLad datasets and how to use them  
+  
+This repository is a [DataLad](https://www.datalad.org/) dataset. It provides  
+fine-grained data access down to the level of individual files, and allows for  
+tracking future updates. In order to use this repository for data retrieval,  
+[DataLad](https://www.datalad.org/) is required. It is a free and open source  
+command line tool, available for all major operating systems, and builds upon  
+Git and [git-annex](https://git-annex.branchable.com/) to allow sharing,  
+synchronizing, and version controlling collections of large files.  
+
+## Installation
+
+Note that Python and pydoit are not required if you [simply want to download the dataset](#1-easy-access).
+
+- Install DataLad according to the [DataLad handbook](https://handbook.datalad.org/en/latest/intro/installation.html). We recommend installing DataLad globally, since it is not managed from within the repository's virtual environment.
+- Install [Python](https://www.python.org)
+- Install [pydoit](https://pydoit.org/install.html). As with DataLad, we recommend installing pydoit globally, since it is not managed from within the repository's virtual environment.
+
+## Getting Started
+
+### Clone the repository
+
+A DataLad dataset can be *cloned* by running
+
+```sh
+datalad clone <repository url>
+```
+
+Do not use `git clone` to download the repository! With a plain `git clone`, DataLad will not have the necessary
+information to manage the dataset. Once the repository is cloned, working with it is like working with a standard lightweight Git repository on your local machine.
+At this point, the repository contains only small metadata and information on the identity
+of the files in the dataset, but not the actual *content* of the (sometimes large)
+files.
+
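+For example, cloning this dataset from its GitHub sibling might look like the following; the HTTPS URL below is inferred from the SSH URL used later in this README and may differ from the canonical source (GIN):
+
+```sh
+datalad clone https://github.com/JGuetschow/Global_CO2_from_cement_production.git
+```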
+
+<h3 id="1-easy-access"> Easy access </h3>
+
+Users who simply want to retrieve the dataset can access both the
+original and the extracted files with
+
+```sh
+datalad get <filename>
+```
+
+This command will trigger a download of the files, directories, or subdatasets  
+you have specified.
+
+For example, the CSV file for the 13-Sep-2023 release can be downloaded with
+
+```sh
+datalad get extracted_data/v230913/Robbie_Andrew_Cement_Production_CO2_230913.csv 
+```
+
+### Stay up-to-date  
+  
+DataLad datasets can be updated. The command `datalad update` will *fetch*
+updates and store them on a branch other than the one you are currently working on (by default
+`remotes/origin/master`). Running
+  
+```sh
+datalad update --merge
+```
+  
+will *fetch* available updates and integrate them in one go.  
+  
+### Find out what has been done  
+  
+DataLad datasets contain their history in the `git log`. By running `git log`
+(or a tool that displays Git history) in the dataset or on specific files,
+you can find out what has been done to the dataset or to individual files,
+by whom, and when.
+
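+For instance, to see what has happened to the example CSV file from above:
+
+```sh
+git log --oneline -- extracted_data/v230913/Robbie_Andrew_Cement_Production_CO2_230913.csv
+```
+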
+<h2 id="2-contributing"> Contributing </h2>
+
+For those who wish to contribute to the repository, below we go through the key commands you will need to use. 
+
+#### Set up the virtual environment with doit
+
+```sh
+doit setup_env
+```
+
+#### <a name="download"></a>Download the version from the command line.
+
+This downloads all files from Zenodo as they are for a specific version. Note that the version must already be defined in `versions.py`; if you want to add a new version, see [How to add a new version](#newversion) below.
+
+```sh
+doit download_version version=<vYYMMDD>
+```
+
+For example, the following command will download all files from Zenodo for the 15-Sep-2022 release:
+
+```sh
+doit download_version version=v220915
+```
+
+#### <a name="convert"></a> Read the version from the command line.
+
+Reading data refers to the conversion of the downloaded files into CSV and NetCDF format. As with the download
+command, the data is read for a specific version with
+
+```sh
+doit read_version version=<vYYMMDD>
+```
+
+For example, the following command will read the 15-Sep-2022 release:
+
+```sh
+doit read_version version=v220915
+```
+
+## <a name="newversion"></a> How to add a new version
+
+To add a new version, open `src/versions.py` and add a new entry to the `versions` dictionary.
+Fill in all the required information, following the format of the previous entries.
+For example, the value for the key `"v230913"` in the `versions` dictionary describes the 13-Sep-2023 release.
+
+````python
+versions = {
+    "v230913": {
+        'date': '13-Sep-2023',
+        'ver_str_long': 'version 230913',
+        'ver_str_short': '230913',
+        "folder": "v230913",
+        "transpose": False,
+        "filename": "0. GCP-CEM.csv",
+        'ref': '10.5281/zenodo.8339353',
+        'ref2': '10.5194/essd-11-1675-2019',
+        'title': 'Global CO2 emissions from cement production',
+        'institution': "CICERO - Center for International Climate Research",
+        'filter_keep': {},
+        'filter_remove': {},
+        'contact': "johannes.guetschow@climate-resource.com",
+        'comment': ("Published by Robbie Andrew, converted to PRIMAP2 format by "
+                    "Johannes Gütschow"),
+        'unit': 'kt * CO2 / year',
+        'country_code': True,
+    },
+}
+````
+
+Then run the two commands `download_version` and `read_version`, as described in [Contributing](#2-contributing), for your newly added version.
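+
+For a hypothetical new release tagged `v240101` (an illustrative tag, not an actual release), this would be:
+
+```sh
+doit download_version version=v240101
+doit read_version version=v240101
+```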
+
+## Help
+
+Show all doit commands
+
+```sh
+doit help
+```
+
+List the doit commands specific to this repository
+
+```sh
+doit list
+```
+
+Get help on a specific command
+
+```sh
+doit help <command>
+```
+
+### Repository structure
+
+- `.datalad/` contains config file for datalad
+- `downloaded_data/` contains original data from Zenodo.
+- `extracted_data/` contains data in .csv and .nc format
+- `literature/` contains a link to the publication by Robbie M. Andrew; it can be downloaded with the `datalad get` command
+- `src/` 
+  - `download_version.py` downloads files from Zenodo for a given version. The version to download is taken from the command line using `argparse` (see the sketch below this list).
+  - `download_version_datalad.py` calls datalad to run the download function.
+  - `helper_functions.py` contains a function to map country codes.
+  - `read_version.py` reads the data for a given version and saves to [PRIMAP2](https://primap2.readthedocs.io/en/stable/) native and
+    interchange format.
+  - `read_version_datalad.py` calls datalad to run the data reading function.
+  - `versions.py` contains the dictionary with metadata for each release. This file should be updated when [adding a new version](#newversion).
+- `dodo.py` defines pydoit commands.
+- `pyproject.toml` project configuration file
+- `requirements.txt` Python package requirements
+- `requirements_dev.txt` development requirements
+- `setup.cfg` package setup configuration
+- `setup.py` installs the Python package
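+
+The following is a minimal sketch of how the version argument might be handled on the command line; the argument layout, the import path, and the printed message are assumptions, not the actual implementation:
+
+```python
+# Hypothetical sketch of the command-line handling in download_version.py.
+import argparse
+
+from versions import versions  # assumes versions.py defines the `versions` dictionary
+
+parser = argparse.ArgumentParser(description="Download a dataset version from Zenodo")
+parser.add_argument("version", help="version tag, e.g. v230913")
+args = parser.parse_args()
+
+if args.version not in versions:
+    raise ValueError(f"Unknown version {args.version!r}; add it to versions.py first")
+
+config = versions[args.version]
+print(f"Downloading '{config['title']}' ({config['date']}) from DOI {config['ref']}")
+```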
+
+### Make sure to correctly set up the DataLad siblings
+
+Git repositories can configure clones of a dataset as remotes in order to fetch, pull, or push from and to them. A `datalad sibling` is the equivalent of a git clone that is configured as a remote. 
+
+**Query information** about all known siblings with
+
+```sh
+datalad siblings
+```
+
+**Add a sibling** to allow pushing to github
+
+```sh
+datalad siblings add --dataset . --name <name> --url git@github.com:JGuetschow/Global_CO2_from_cement_production.git
+```
+
+SSH access is needed to run this command. Note that `<name>` can be freely chosen (we tend to use "github" for GitHub siblings).
+
+**Push to the github repository**
+
+```
+datalad push --to <name>
+```
+
+where `<name>` should match the name you chose above.
+
+### Issues
+
+There are always open issues regarding the code, some of them easy to resolve, some harder.
+
+### Your ideas
+
+Contributing is of course not limited to the categories above. If you have ideas for improvements, just open an issue or a discussion page to discuss your idea with the community.
+
+### Technical HowTo for contributors
+
+Since this is a DataLad repository hosted on both GitHub and GIN, the process of contributing code and data differs slightly from
+that of a pure Git repository. Because the data is stored only on GIN, the GIN repository is the place to start from.
+GIN currently has a problem with forks (the annexed data is not forked), so we use branches for development. To contribute, you
+therefore first need to contact the maintainers to get write access to the GIN repository.
+You have to clone the repository using SSH to be able to push to it.
+For that, you first need to store your public SSH key on the GIN server
+(Settings -> SSH Keys).
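+
+A hypothetical example of cloning via SSH, assuming the dataset lives on the public GIN instance at gin.g-node.org (organisation and repository names are placeholders; use the actual GIN URL of this dataset):
+
+```sh
+datalad clone git@gin.g-node.org:<organisation>/<repository>.git
+```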
+
+### Instructions for merge requests
+
+Once you have everything set up, you can create a new branch and work there.
+When you are done, create a pull request to integrate your work into the main
+branch. This should first be done on GitHub to allow for discussion and review (GIN servers do not offer the same review features). Afterwards, the changes
+can be merged on GIN (so that the annex is merged properly, too).
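+
+A rough sketch of that workflow (branch and sibling names are illustrative; adapt them to your setup):
+
+```sh
+# create a development branch and record your changes with DataLad
+git checkout -b my-feature
+datalad save -m "describe your change"
+
+# push the branch to the GIN sibling and to GitHub for review
+datalad push --to <gin-sibling-name>
+datalad push --to <github-sibling-name>
+```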