1 year ago · 9960b71805
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
 
															 # Global CO2 from Cement Production Dataset
														
 
															-This repository downloads the Andrew dataset on global CO2 emissions from cement production from Zenodo.
														
 
															+This repository downloads the Andrew dataset on global CO2 emissions from cement production from [Zenodo](https://zenodo.org/records/10008931).
														
 
															 ## Description
														
@@ -9,7 +9,17 @@ The downloaded dataset can then be converted into CSV (.csv file extension) or N
 
															 The data management tool [DataLad](http://docs.datalad.org/en/stable/) is used to version control the data sets.
														
 
															 Commands to run the scripts are executed via the pydoit package.
														
 
															-### Installation
														
 
															+## DataLad datasets and how to use them  
														
 
															+  
														
 
															+This repository is a [DataLad](https://www.datalad.org/) dataset. It provides  
														
 
															+fine-grained data access down to the level of individual files, and allows for  
														
 
															+tracking future updates. In order to use this repository for data retrieval,  
														
 
															+[DataLad](https://www.datalad.org/) is required. It is a free and open source  
														
 
															+command line tool, available for all major operating systems, and builds up on  
														
 
															+Git and [git-annex](https://git-annex.branchable.com/) to allow sharing,  
														
 
															+synchronizing, and version controlling collections of large files.  
														
 
															+
														
 
															+## Installation
														
 
															 - Install datalad according to the [DataLad handbook](https://handbook.datalad.org/en/latest/intro/installation.html). It is recommended to install globally. 
														
 
															 - DataLad is based on Git. [Git](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git) needs to be installed to run DataLad. 
														
@@ -18,39 +28,63 @@ Commands to run the scripts are executed via the pydoit package.
 
															 ## Getting Started
														
 
															-### 1. Clone the repository
														
 
															+### Clone the repository
														
 
															-Download the repository using the following command.
														
 
															+A DataLad dataset can be `cloned` by running  
														
 
															 ```
														
 
															 datalad clone <url>
														
 
															 ```
														
 
															 Do not use **git clone** to download the repository! This way DataLad will not have the necessary
														
 
															-information to run the program.
														
 
															+information to run the program. Once a dataset is cloned, it is a light-weight directory on your local machine.  
														
 
															+At this point, it contains only small metadata and information on the identity  
														
 
															+of the files in the dataset, but not actual *content* of the (sometimes large)  
														
 
															+data files.  
														
 
															+
														
 
															-### 2. Easy Access
														
 
															-Users who simply want to download the dataset have the option to access both the
														
 
															-original and extracted files with the following command.
														
 
															+### Easy Access
														
 
															+Users who simply want to retrieve the dataset have the option to access both the
														
 
															+original and extracted files with
														
 
															 ```
														
 
															 datalad get <filename>
														
 
															 ```
														
 
															-For example, the CSV file for the 2023/09/13 release can be downloaded with:
														
 
															+This command will trigger a download of the files, directories, or subdatasets  
														
 
															+you have specified.
														
 
															+
														
 
															+For example, the CSV file for the 2023/09/13 release can be downloaded with
														
 
															 ```
														
 
															 datalad get extracted_data/v230913/Robbie_Andrew_Cement_Production_CO2_230913.csv 
														
 
															 ```
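
The path in the example above can also be assembled from the release tag. The following is a minimal sketch, assuming that other releases follow the same directory and file naming pattern as `v230913`, which this README does not confirm. The helper names `extracted_csv_path` and `datalad_get_command` are hypothetical and not part of the repository.

```python
from pathlib import PurePosixPath


def extracted_csv_path(version: str) -> str:
    """Return the repository-relative CSV path for a release tag such as '230913'.

    Assumption: every release uses the naming pattern seen for v230913.
    """
    if len(version) != 6 or not (version.isascii() and version.isdigit()):
        raise ValueError(f"expected a six-digit YYMMDD tag, got {version!r}")
    return str(
        PurePosixPath("extracted_data")
        / f"v{version}"
        / f"Robbie_Andrew_Cement_Production_CO2_{version}.csv"
    )


def datalad_get_command(version: str) -> list[str]:
    """Build (but do not run) the `datalad get` call for one release."""
    return ["datalad", "get", extracted_csv_path(version)]


print(" ".join(datalad_get_command("230913")))
```

The command is returned as an argument list so that it could be passed to `subprocess.run` from inside the cloned dataset without shell quoting issues.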
														
 
															+### Stay up-to-date  
														
 
															+  
														
 
															+DataLad datasets can be updated. The command `datalad update` will *fetch*  
														
 
															+updates and store them on a different branch (by default  
														
 
															+`remotes/origin/master`). Running  
														
 
															+  
														
 
															+```  
														
 
															+datalad update --merge  
														
 
															+```  
														
 
															+  
														
 
															+will *pull* available updates and integrate them in one go.  
														
 
															+  
														
 
															+### Find out what has been done  
														
 
															+  
														
 
															+DataLad datasets contain their history in the ``git log``.  By running ``git  
														
 
															+log`` (or a tool that displays Git history) in the dataset or on specific  
														
 
															+files, you can find out what has been done to the dataset or to individual  
														
 
															+files by whom, and when.
														
 
															+## Executing the program
														
 
															-### 3. Executing the program
														
 
															-
														
 
															-#### 3.1 Set up the virtual environment with doit
														
 
															+#### Set up the virtual environment with doit
														
 
															 ```
														
 
															 doit setup_env
														
 
															 ```
														
 
															-#### <a name="download"></a> 3.2 Download the version from the command line.
														
 
															+#### <a name="download"></a>Download the version from the command line.
														
 
															 This will download all files from Zenodo as they are.
														
 
															 ```
														
 
															 doit download_version --version <YYMMDD>
														
 
															 ```
														
 
															-#### <a name="convert"></a> 3.3 Convert the data sets into CSV and NetCDF files.
														
 
															+#### <a name="convert"></a>Convert the data sets into CSV and NetCDF files.
														
 
															 ```
														
 
															 doit read_version --version <YYMMDD>
														
 
															 ```
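
The `<YYMMDD>` placeholder is a two-digit year, month, and day, so the 13-Sep-2023 release becomes `230913`. As a small sketch of that convention (the function names `version_tag` and `parse_version_tag` are hypothetical and not part of the repository), the tag can be derived from a date and parsed back with the standard library:

```python
from datetime import date, datetime


def version_tag(release: date) -> str:
    """Format a release date as a YYMMDD tag, e.g. 2023-09-13 -> '230913'."""
    return release.strftime("%y%m%d")


def parse_version_tag(tag: str) -> date:
    """Parse a YYMMDD tag back into a date; raises ValueError if malformed."""
    return datetime.strptime(tag, "%y%m%d").date()


tag = version_tag(date(2023, 9, 13))
print(f"doit download_version --version {tag}")
print(f"doit read_version --version {tag}")
```

Parsing the tag before calling `doit` is one way to catch a mistyped version early, for example a tag with the day and month swapped into an impossible date.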
														
@@ -58,7 +92,7 @@ doit read_version --version <YYMMDD>
 
															 ## <a name="newversion"></a> How to add a new version
														
 
															-1. To add a new version go to **versions.py** in the **src** directory and create a new value in the
														
 
															+To add a new version go to **versions.py** in the **src** directory and create a new value in the
														
 
															 dictionary. Fill all the required information similar to the previous entries.
														
 
															 For example, the value _v230913_ in the _versions_ dictionary describes the 13-Sep-2023 release.
														
 
															 ````python
														
@@ -84,8 +118,7 @@ versions = {
 
															     },
														
 
															 }
														
 
															 ````
														
 
															-
														
 
															-2. Then run the two commands as described in [3.2] and [3.3].
														
 
															+Then run the two commands `download_version` and `read_version`, in that order, as described in **Executing the program**.
														
 
															 ## Help
														
 
															 Show all doit commands
														
@@ -103,7 +136,7 @@ Get help on a specific command
 
															 doit help <command>
														
 
															 ```
														
 
															-## For developers
														
 
															+## Contributing
														
 
															 ### Repository structure
														
 
															 - **.datalad/** contains config file for datalad
														
 
															 - **downloaded_data/** contains original data from Zenodo.
														
@@ -124,15 +157,15 @@ doit help <command>
 
															 - **setup.cfg** requirements
														
 
															 - **setup.py** installs python packages
														
 
															-### Make sure to connect with your siblings
														
 
															+### Make sure to correctly set up the DataLad siblings
														
 
															 Git repositories can configure clones of a dataset as _remotes_ in order to fetch, pull, or push from and to them. A `datalad sibling` is the equivalent of a git clone that is configured as a remote. 
														
 
															-**Query information** about about all known siblings with: 
														
 
															+**Query information** about all known siblings with
														
 
															 ```
														
 
															 datalad siblings
														
 
															 ```
														
 
															-**Add a sibling** to allow pushing to github:
														
 
															+**Add a sibling** to allow pushing to GitHub
														
 
															 ```
														
 
															 datalad siblings add --dataset . --name <name> --url git@github.com:JGuetschow/Global_CO2_from_cement_production.git
														
 
															 ```
														
@@ -144,79 +177,24 @@ datalad push --to <name>
 
															 ```
														
 
															-### instructions for merge requests
														
 
															-# About this dataset
														
 
															-
														
 
															-## General information
														
 
															-
														
 
															-This is a DataLad dataset (id: 24f90b12-e4a9-4e2c-995d-a54ed4cd49c7).
														
 
															-
														
 
															-## DataLad datasets and how to use them
														
 
															-
														
 
															-This repository is a [DataLad](https://www.datalad.org/) dataset. It provides
														
 
															-fine-grained data access down to the level of individual files, and allows for
														
 
															-tracking future updates. In order to use this repository for data retrieval,
														
 
															-[DataLad](https://www.datalad.org/) is required. It is a free and open source
														
 
															-command line tool, available for all major operating systems, and builds up on
														
 
															-Git and [git-annex](https://git-annex.branchable.com/) to allow sharing,
														
 
															-synchronizing, and version controlling collections of large files.
														
 
															-
														
 
															-More information on how to install DataLad and [how to install](http://handbook.datalad.org/en/latest/intro/installation.html)
														
 
															-it can be found in the [DataLad Handbook](https://handbook.datalad.org/en/latest/index.html).
														
 
															-
														
 
															-### Get the dataset
														
 
															-
														
 
															-A DataLad dataset can be `cloned` by running
														
 
															-
														
 
															-```
														
 
															-datalad clone <url>
														
 
															-```
														
 
															-
														
 
															-Once a dataset is cloned, it is a light-weight directory on your local machine.
														
 
															-At this point, it contains only small metadata and information on the identity
														
 
															-of the files in the dataset, but not actual *content* of the (sometimes large)
														
 
															-data files.
														
 
															-
														
 
															-### Retrieve dataset content
														
 
															-
														
 
															-After cloning a dataset, you can retrieve file contents by running
														
 
															-
														
 
															-```
														
 
															-datalad get <path/to/directory/or/file>
														
 
															-```
														
 
															-
														
 
															-This command will trigger a download of the files, directories, or subdatasets
														
 
															-you have specified.
														
 
															-
														
 
															-DataLad datasets can contain other datasets, so called *subdatasets*.  If you
														
 
															-clone the top-level dataset, subdatasets do not yet contain metadata and
														
 
															-information on the identity of files, but appear to be empty directories. In
														
 
															-order to retrieve file availability metadata in subdatasets, run
														
 
															-
														
 
															-```
														
 
															-datalad get -n <path/to/subdataset>
														
 
															-```
														
 
															-
														
 
															-Afterwards, you can browse the retrieved metadata to find out about subdataset
														
 
															-contents, and retrieve individual files with `datalad get`.  If you use
														
 
															-`datalad get <path/to/subdataset>`, all contents of the subdataset will be
														
 
															-downloaded at once.
														
 
															+### Issues
														
 
															+There are always open issues regarding the code, some of them easy to resolve, some harder.
														
 
															-### Stay up-to-date
														
 
															+### Your ideas
														
 
															+Contributing is of course not limited to the categories above. If you have ideas for improvements, just open an issue or a discussion page to discuss your idea with the community.
														
 
															-DataLad datasets can be updated. The command `datalad update` will *fetch*
														
 
															-updates and store them on a different branch (by default
														
 
															-`remotes/origin/master`). Running
														
 
															+### Technical HowTo for contributors
														
 
															+Because this is a DataLad repository that uses both GitHub and GIN, the process of contributing code and data differs a bit from 
															+that of pure Git repositories. The data is stored only on GIN, so the GIN repository is the source to start 
															+from. GIN currently has a problem with forks (the annexed data is not 
															+forked), so we have to use branches for development. To contribute, you therefore
															+first need to contact the maintainers to get write access to the repository.
															+You have to clone the repository using SSH to be able to push to it. 
															+For that, you first need to store your public SSH key on the GIN server 
															+(Settings -> SSH Keys). 
														
 
															-```
														
 
															-datalad update --merge
														
 
															-```
														
 
															-
														
 
															-will *pull* available updates and integrate them in one go.
														
 
															-
														
 
															-### Find out what has been done
														
 
															-
														
 
															-DataLad datasets contain their history in the ``git log``.  By running ``git
														
 
															-log`` (or a tool that displays Git history) in the dataset or on specific
														
 
															-files, you can find out what has been done to the dataset or to individual
														
 
															-files by whom, and when.
														
 
															+### Instructions for merge requests
														
 
															+Once you have everything set up, you can create a new branch and work there. 
															+When you're done, create a pull request to integrate your work into the main 
															+branch. This should be done on GitHub first to allow for discussion and review. Afterwards, the changes 
															+can be implemented on GIN.