Repository structure
The repository is structured by folders. Here we list the folders in order of processing.
- datasets This folder contains generated composite datasets in PRIMAP2 interchange and native format. Each dataset has its own subfolder, which should contain a dataset name, a version, and a publication date (e.g. year). TODO: add details
- downloaded_data This folder contains data downloaded from the UNFCCC website and other sources. For Biennial Update Reports (BUR), National Communications (NC), Biennial Transparency Reports (BTR), National Inventory Submissions (CRF, SEF, etc.), and Nationally Determined Contributions (NDC) an automatic downloader exists (folder UNFCCC). Within the UNFCCC folder the data are organized in a <country>/<submission> structure. NDC submissions are often revised. To keep track of the targets and emissions inventories, we store each NDC revision in a time-stamped folder. The NDC downloader is currently not working due to a format change, so the NDC submissions are not up to date. The non-UNFCCC folder contains official country inventories not (yet) submitted to the UNFCCC. Its internal structure is the same as that of the UNFCCC folder.
- extracted_data This folder holds all extracted datasets in PRIMAP2 interchange and native format. The datasets are organized in country subfolders. The naming convention for the datasets is the following:
- For non-AnnexI submissions: <iso>_<sub>_<year>_<term> where <iso> is the country's three-letter ISO code, <sub> is the submission, e.g. BUR1, NC5, or inventory2020 (for a non-UNFCCC inventory), <year> is the year of publication, and <term> is the main category terminology, e.g. IPCC2006 or IPCC1996. Often a processed and an unprocessed version of a submission exist. In most cases these have different category terminologies. The processed versions always have IPCC2006 or IPCC2006_PRIMAP terminology. Where the original submission is already in one of these terminologies we use the suffix _raw to mark the unprocessed version. Processing includes sector aggregation, mapping to IPCC2006(_PRIMAP) sectors, downscaling, and addition of gas baskets.
- Non-UNFCCC data use the same naming scheme as non-AnnexI submissions.
- For AnnexI CRF data: <iso>_CRF<year>_<crf_date> where <year> is the submission year and <crf_date> is the submission time code, which enables us to distinguish between original submissions and resubmissions for the same year. CRF data are always unprocessed.
- UNFCCC DI data: <iso>_DI_<download_date>_<processing> where <download_date> is the date of accessing the DI data and <processing> is either raw for unprocessed data or the date of the processing. These DI data files are only symlinks to the actual data files, which contain a hash of the contained data in the filename. DI data change for only a few countries each year, and we only know which ones after downloading, so we physically save a new version of the data only if the data have been updated for the country or the processing has been updated. For unchanged data we set a symlink.
- For data from Biennial Transparency Reports (BTRs) the data format will be similar to that of the CRF data. The exact format will be determined once enough countries have submitted to give an overview of the format and the submission process (e.g. updates of submissions).
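The naming conventions for non-AnnexI and CRF datasets can be illustrated with a small parser. This is only a sketch under stated assumptions, not code from the repository: the function name `parse_dataset_name` and both regular expressions are hypothetical, and the sketch assumes that <iso> is written as three uppercase letters, that <sub> contains no underscore, and that the CRF time code can be treated as an opaque string.

```python
import re

# <iso>_<sub>_<year>_<term>, optionally followed by the _raw suffix.
# Assumptions: <iso> is three uppercase letters, <sub> has no underscore,
# <term> may contain one (e.g. IPCC2006_PRIMAP).
SUBMISSION_NAME = re.compile(
    r"^(?P<iso>[A-Z]{3})_(?P<sub>[^_]+)_(?P<year>\d{4})_(?P<term>.+?)(?P<raw>_raw)?$"
)

# <iso>_CRF<year>_<crf_date>; the time code is kept as an opaque string.
CRF_NAME = re.compile(r"^(?P<iso>[A-Z]{3})_CRF(?P<year>\d{4})_(?P<crf_date>.+)$")


def parse_dataset_name(stem: str) -> dict:
    """Split a dataset file name (without extension) into its parts."""
    # CRF names are checked first because they would also fit the
    # more general submission pattern.
    match = CRF_NAME.match(stem)
    if match:
        return {"kind": "CRF", **match.groupdict()}
    match = SUBMISSION_NAME.match(stem)
    if match:
        parts = match.groupdict()
        # Turn the optional suffix into a boolean flag.
        parts["raw"] = parts["raw"] is not None
        return {"kind": "submission", **parts}
    raise ValueError(f"unrecognised dataset name: {stem}")
```

Calling `parse_dataset_name("ARG_BUR4_2021_IPCC2006_raw")` (a made-up file name) would yield the terminology IPCC2006 with the raw flag set, while a name ending in IPCC2006_PRIMAP keeps the full terminology and leaves the flag unset.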
- legacy_data This folder contains data from older non-AnnexI submissions for which we don't have code in this repository. For some submissions the code will be added in the future (it just has to be cleaned up and harmonized with the code in the repository); for other submissions we don't have code, as they have been read by hand or from machine-readable versions of the submissions (xls(x), csv) which are not publicly available. The datasets are organized in country subfolders. The naming convention for the datasets is the following: <iso>_<sub>_<year>_<term>_<extra> where <iso> is the country's three-letter ISO code, <sub> is the submission, e.g. BUR1, NC5, or inventory2020 (for a non-UNFCCC inventory), <year> is the year of publication, <term> is the main sector terminology, e.g. IPCC2006 or IPCC1996, and <extra> is a free identifier to distinguish several files for the same submission (in some cases data for e.g. fluorinated gases are in a separate file). Our aim is to reduce the data in this folder to zero and to create fully open source processes for all datasets such that they can be included in the main folder.
- log Log files from reading data will be stored here. Files in this folder are only available locally and are not part of the repository.
- src/unfccc_ghg_data This folder holds all code to download and read the data. It is organized in subfolders. For more information please consult the usage docs and the API docs. TODO: links
- helper Functions and definitions used by several of the downloaders and readers.
- unfccc_crf_reader Functions, scripts, and configuration to read CRF data from xlsx files. The code will also be used to read CRT tables in the future.
- unfccc_di_reader Functions, scripts, and configuration to read UNFCCC DI portal data. As the portal's API is no longer accessible to the general public, we use the data available on Zenodo as a basis (so there is no use in running this reader to try to get more recent data than the data on Zenodo).
- unfccc_downloader Functions and scripts to download submissions data from the UNFCCC website.
- unfccc_reader Scripts, configurations, and a few common functions to read individual submissions. Code is organized in country subfolders.
- tests No tests yet. They will be added in the future.
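The storage scheme described for UNFCCC DI data under extracted_data (data files named by a hash of their content, with symlinks carrying the dataset names) can be sketched as follows. This is an illustration only, not the repository's implementation: the function name `save_with_symlink`, the choice of SHA-256, the truncated digest, and the name given to the hashed file are all assumptions.

```python
import hashlib
from pathlib import Path


def save_with_symlink(
    data: bytes, folder: Path, link_name: str, suffix: str = ".csv"
) -> Path:
    """Store `data` once under a hash-based name and link `link_name` to it."""
    folder.mkdir(parents=True, exist_ok=True)

    # Assumption: a shortened SHA-256 digest identifies the content.
    digest = hashlib.sha256(data).hexdigest()[:16]
    target = folder / f"DI_{digest}{suffix}"  # hypothetical file name

    # Only write a new physical file if this content has not been seen yet.
    if not target.exists():
        target.write_bytes(data)

    # (Re)create the named symlink, relative to its own folder.
    link = folder / link_name
    if link.is_symlink() or link.exists():
        link.unlink()
    link.symlink_to(target.name)
    return target
```

With this layout, saving identical data under two different download dates produces two symlinks but only one physical file, which matches the behaviour described for unchanged DI data. Creating symlinks may require extra privileges on some operating systems.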