
Data reading FAQ

How to choose which tables to read from a PDF?

What's behind gwp_to_use?

How do I know what's the correct GWP (gwp_to_use) for a report?

The report should state which GWP conversion was used. Search the report for "GWP".

How to choose the tolerance when merging datasets?

The tolerance should be set to 0.1. The idea is to separate rounding errors from actual inconsistencies in the data.

How to deal with inconsistencies in a data source?

A data source may contain conflicting pieces of information. Sometimes it is obvious that one of the values was entered incorrectly. In other cases, it is impossible to tell which value is more trustworthy. In that case it is better to leave the value out completely.

Sometimes the reports contain incorrect values, which become obvious when trying to merge tables. For example, the main table's value for 1.A.2 and CO2 is 0.003, while the sector table's value for 1.A.2 and CO2 is 0.006. Rule of thumb from Johannes: if the difference is a rounding error, use pr.merge with a tolerance. If the data is actually wrong, either correct it manually if we know the correct value, or set it to nan manually if we don't.
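The rule above can be sketched in plain Python. This is only an illustration of the decision logic, not the pr.merge implementation: the helper name `reconcile`, the `known_correct` parameter, and the way the relative difference is measured (against the larger of the two values) are assumptions made for this example.

```python
import math


def reconcile(value_a, value_b, tolerance=0.1, known_correct=None):
    """Return one value for two conflicting readings of the same data point.

    Hypothetical helper for illustration only.
    """
    # Relative difference, measured against the larger magnitude (assumption).
    scale = max(abs(value_a), abs(value_b))
    if scale == 0 or abs(value_a - value_b) / scale <= tolerance:
        # Rounding-level difference: keep the first value.
        return value_a
    # Real inconsistency: use the manually verified value if there is one,
    # otherwise mark the data point as missing.
    return known_correct if known_correct is not None else math.nan


main_table = 0.003    # 1.A.2, CO2, main table
sector_table = 0.006  # 1.A.2, CO2, sector table

conflict = reconcile(main_table, sector_table)         # nan: values disagree
corrected = reconcile(main_table, sector_table, known_correct=0.006)
rounding = reconcile(10.0, 10.4)                       # within tolerance
```

With the values from the example, the two readings differ by far more than the tolerance, so the result is nan unless a verified value is supplied.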

How to aggregate categories?

The aggregation of categories means combining existing categories from the source, for example the PDF document, into a new category and adding up their values. Some categories help to interpret the data, but are rarely included in the reports. For example, it can be helpful to show all emissions without the mostly negative emissions from the LULUCF sector in one category. For this we use M.0.EL - National total emissions excluding LULUCF. There is a set of categories that must be present in primap-hist (where can I look it up?)

It is not immediately obvious which categories need to be aggregated. In principle, there should be a parent category for each category. For example, if category 3.A.1 has been imported, category 3.A should also be present in the final data set. The parent categories are often already contained in the tables in the PDFs. In addition, the following extra categories should also be created, if not already in the report:

  • 0 - National total emissions
  • M.0.EL - National total emissions excluding LULUCF
  • M.LULUCF - LULUCF
  • M.3.D.LU - Other (LULUCF)
  • M.AG - Agriculture
  • M.AG.ELV - Agriculture excluding livestock
  • M.3.C.AG - Aggregate sources and non-CO2 emissions sources on land
  • M.3.D.AG - Other (Agriculture)
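The parent-category rule described above can be checked with a few lines of Python. This is a minimal sketch that assumes category codes are dot-separated strings and that the extra categories starting with M are handled separately; the function name `missing_parents` is hypothetical and not part of the project code.

```python
def missing_parents(categories):
    """Return parent categories implied by `categories` but absent from it."""
    present = set(categories)
    missing = set()
    for cat in present:
        # Extra categories (M.*) and the national total have no simple
        # dot-derived parent, so this sketch skips them.
        if cat.startswith("M.") or cat == "0":
            continue
        parts = cat.split(".")
        # Walk up the hierarchy: 3.A.1 -> 3.A -> 3
        while len(parts) > 1:
            parts = parts[:-1]
            parent = ".".join(parts)
            if parent not in present:
                missing.add(parent)
    return sorted(missing)


imported = ["1", "2", "3.A.1", "3.A.2", "3.B", "4"]
print(missing_parents(imported))  # ['3', '3.A']
```

The sketch does not check that category 0 exists for the top-level categories; that case is covered by the list of extra categories.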

To aggregate the categories, we use the process_data_for_country function. Its parameter processing_info_country defines the categories to be aggregated and their subcategories. The following example shows all additional categories and their subcategories in the format expected by the function. Another place to look up the additional categories (the ones that start with M) and their child categories is ???.

country_processing_step1 = {
    "aggregate_cats" : {
        # Agriculture part of 3.C
        "M.3.C.AG" : {
            "sources" : [
                "3.C.1",
                "3.C.2",
                "3.C.3",
                "3.C.4",
                "3.C.5",
                "3.C.6",
                "3.C.7",
                "3.C.8",
            ],
        },
        # Agriculture part of 3.D
        "M.3.D.AG" : {
            "sources" : [
                "3.D.2"
            ],
        },
        # Agriculture excluding livestock
        "M.AG.ELV" : {
            "sources" : [
                "M.3.C.AG",
                "M.3.D.AG"
            ],
        },
        # Agriculture
        "M.AG" : {
            "sources" : [
                "3.A",
                "M.AG.ELV"
            ],
        },
        # LULUCF part of 3.D
        "M.3.D.LU" : {
            "sources" : [
                "3.D.1"
            ],
        },
        # LULUCF
        "M.LULUCF" : {
            "sources" : [
                "3.B",
                "M.3.D.LU"
            ],
        },
        # National total emissions excluding LULUCF
        "M.0.EL" : {
            "sources" : [
                "1",
                "2",
                "M.AG",
                "4"
            ],
        },
    },
}
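To make the effect of such a configuration concrete, here is a plain-Python sketch of what aggregation does for a single year and gas. It is not the process_data_for_country implementation: the function `aggregate`, the flat dictionary of values, and the assumption that categories are processed in the order they are listed are all choices made for this illustration, and the numbers are invented.

```python
def aggregate(values, aggregate_cats):
    """Add each aggregated category as the sum of its sources.

    Hypothetical helper: categories are processed in listing order, so a
    category may use sources that were aggregated earlier in the same call.
    """
    result = dict(values)
    for target, spec in aggregate_cats.items():
        result[target] = sum(result[src] for src in spec["sources"])
    return result


# Invented example values for one year and one gas.
values = {"3.A": 5.0, "3.C.1": 1.0, "3.C.4": 2.0, "3.D.2": 0.5}

config = {
    "M.3.C.AG": {"sources": ["3.C.1", "3.C.4"]},
    "M.3.D.AG": {"sources": ["3.D.2"]},
    "M.AG.ELV": {"sources": ["M.3.C.AG", "M.3.D.AG"]},
    "M.AG": {"sources": ["3.A", "M.AG.ELV"]},
}

totals = aggregate(values, config)
print(totals["M.AG.ELV"], totals["M.AG"])  # 3.5 8.5
```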

We can use the concept of aggregation not only to create new categories, but also to perform consistency checks. We know that the sum of categories 1, 2, 3 and 4 must equal the value of category 0. If the category is already contained in the data and we nevertheless aggregate a new category 0 from the subcategories, the original category 0 and the aggregated category 0 are merged. However, this only works if the difference of the values does not exceed a certain tolerance. We take 1% as the default value for the rounding error. This step can be incorporated into the function as follows:

country_processing_step1 = {
    "aggregate_cats" : {
        "0" : {
            "sources" : [
                "1",
                "2",
                "3",
                "4"
            ]
        },
        "3" : {
            "sources" : [
                "M.AG",
                "M.LULUCF"
            ],
        },
    },
}
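The consistency check behind this configuration can be illustrated without the project code. The sketch below compares a reported category 0 with the sum of categories 1 to 4 using a 1% relative tolerance. The function `check_total`, the invented values, and the definition of the relative difference (measured against the larger value) are assumptions for this example.

```python
def check_total(values, target, sources, tolerance=0.01):
    """Compare a reported category with the sum of its sources.

    Returns the aggregated value, the relative difference, and whether
    the two values agree within the tolerance. Hypothetical helper.
    """
    aggregated = sum(values[src] for src in sources)
    reported = values[target]
    scale = max(abs(reported), abs(aggregated))
    rel_diff = 0.0 if scale == 0 else abs(reported - aggregated) / scale
    return aggregated, rel_diff, rel_diff <= tolerance


# Invented example values: reported total 100.2, sectors add up to 100.0.
values = {"0": 100.2, "1": 60.0, "2": 10.0, "3": 25.0, "4": 5.0}
aggregated, rel_diff, consistent = check_total(values, "0", ["1", "2", "3", "4"])
print(aggregated, consistent)  # 100.0 True
```

A reported total of 110.0 with the same sector values would differ by about 9% and fail the check, which is the situation where the data needs a manual look.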

Note that all the specified sources must be either already present in the dataset or aggregated in the same function call.
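This requirement can be checked before running the processing. The following sketch lists every source that is neither present in the dataset nor defined as an aggregated category in the same configuration. The function name `unresolved_sources` and the list of present categories are hypothetical.

```python
def unresolved_sources(aggregate_cats, present_categories):
    """Return {target: [missing sources]} for sources that cannot be resolved."""
    # A source is fine if it is in the data or aggregated in the same call.
    available = set(present_categories) | set(aggregate_cats)
    problems = {}
    for target, spec in aggregate_cats.items():
        missing = [src for src in spec["sources"] if src not in available]
        if missing:
            problems[target] = missing
    return problems


config = {
    "0": {"sources": ["1", "2", "3", "4"]},
    "3": {"sources": ["M.AG", "M.LULUCF"]},
}

# Hypothetical dataset that lacks M.LULUCF.
print(unresolved_sources(config, ["1", "2", "4", "M.AG"]))
# {'3': ['M.LULUCF']}
```

Category 3 is not reported as missing for category 0 because it is aggregated in the same configuration.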