Tutorial: Introduction to SDMX
Wrapping your head around the increasingly popular SDMX standard can be difficult. The latest user guide is 145 pages. But SDMX in its essence is not so complex. SDMX documentation will typically throw you into the deep-end, full of mysterious terminology, perplexing procedures and a baffling array of options. But the key concepts underlying SDMX are actually quite simple. This short tutorial will help you understand what SDMX is and why it is useful, and take you through a simple SDMX example. It will also explain the process of creating "queries" to retrieve specific data through an SDMX "API". A query is a request for data - like calling an online database, and asking for specific data from that database. A database API is something that receives those requests (or calls for data). It then sends back the data requested.
Overview
- SDMX: What is it and why is it useful
- Understanding SDMX - A simple example
- The key elements of an SDMX response
- Interpreting and creating SDMX database queries
- Further reading
SDMX: What it is and why it is useful
Finding, retrieving and consolidating data from different organizations can be very complex. And as a result, it's also time consuming and costly. The reason for this is that each organization typically has their own organization-specific formats, structures, nomenclature, codes, etc. Making sense of several sets of these can be like disentangling a Fettuccine al pomodoro and a Spaghetti alla puttanesca to create a Penne Bolognese. This is where SDMX - Statistical Data and Metadata eXchange - provides a solution. SDMX is sponsored by organizations such as the World Bank, OECD and IMF. It standardizes the formats, structures and coding of data and meta data, as well as data exchange processes. As a result, it greatly facilitates data exchange and consolidation of data from multiple sources. The more organizations using it, the more useful SDMX becomes. Since it has become the preferred standard for data and meta data exchange by the global statistical community, its potential is enormous.
SDMX is not only useful for data exchange. It can be used to improve the organization, coding, documentation and subsequent use of your own data and meta data. It also makes it more easy for external users and organizations to interpret and use your data. Last but not least, it is useful to benefit from the growing number of technologies which support or are built around SDMX.
Understanding SDMX - A simple example
One of the best ways to become acquainted with SDMX is to have a look at a simple SDMX structure. SDMX comes in several 'flavors' or sub-formats, including: SDMX-ML Generic, SDMX-ML Compact (also referred to as SDMX-ML Structure Specific), SDMX-JSON and SDMX-EDI. Of these, SDMX-JSON is the most compact and efficient. However, packing all that data as tightly as possible has led to a complex structure which is difficult to interpret (that is, for humans!).
This tutorial will look instead at one of the simplest and easiest to understand of the SDMX sub-formats, namely SDMX-ML Compact. As the name implies, it is a compact format (though not at compact as SDMX-JSON). It is suitable for the retrieval of fairly large amounts of data through a single database query.
A simple SDMX-ML example from OECD has the following structure (showing quarterly GDP - expenditure approach). The most important parts for understanding the structure are highlighted in bold:
<OECD:Series LOCATION="AUS" SUBJECT="B1_GE" MEASURE="VOBARSA" FREQUENCY="Q" TIME_FORMAT="P3M" UNIT="AUD" POWERCODE="6" REFERENCEPERIOD="2010">
<OECD:Obs TIME="2009-Q2" OBS_VALUE="1326162.856664"/>
<OECD:Obs TIME="2009-Q3" OBS_VALUE="1329814.001238"/>
<OECD:Obs TIME="2009-Q4" OBS_VALUE="1339180.312553"/>
<OECD:Obs TIME="2010-Q1" OBS_VALUE="1345799.681242"/>
</OECD:Series>
<OECD:Series LOCATION="AUT" SUBJECT="B1_GE" MEASURE="VOBARSA" FREQUENCY="Q" TIME_FORMAT="P3M" UNIT="EUR" POWERCODE="6" REFERENCEPERIOD="2010">
<OECD:Obs TIME="2009-Q2" OBS_VALUE="288669.552976"/>
<OECD:Obs TIME="2009-Q3" OBS_VALUE="289372.504978"/>
<OECD:Obs TIME="2009-Q4" OBS_VALUE="292514.368989"/>
<OECD:Obs TIME="2010-Q1" OBS_VALUE="290540.292982"/>
</OECD:Series>
As you can see in this example, it is a two-level structure:
- Level 1 is the Series, which in this example is geographic LOCATIONS. The example above displays the data for two LOCATIONS, Australia (AUT) and Austria (AUT).
- Level 2 is the set of numeric Observations or Obs within each Series, which usually refers to specific points in time. In the above example, the TIME points are by quarter (2009-Q2, 2009-Q3, etc).
The key elements of an SDMX response
SDMX can include a lot of 'meta data' such as the frequency of the data, the time format, indicator code, source database, maximum decimal places, etc. However, for your specific requirements (e.g., to find specific information or build a data visualization), most of these may be irrelevant. A typical use of SDMX requires mainly two things: (i) the data values or 'observations', and (ii) what those data values represent.
StatPlanet, for example, requires only the following pieces of information to create interactive dashboards with multiple visualization options:
- The Indicator name(s) and unit(s) (e.g., Population growth - annual %)
- The Series reference (in the above example, it is LOCATION)
- The Observation reference (in the above example, it is TIME)
- The Observation value reference (in the above example, it is OBS_VALUE)
This information is sufficient to allow many ways of visualizing data sets, such as:
- Bar charts: values and locations, for a specific time unit
- Statistical maps (choropleth maps): values and locations, for a specific time unit
- Scatter plots and bubble charts: values and locations, for a specific time unit, for two or three indicators (with multiple time units for animation / variation over time)
- Line or time series charts: values and locations, for multiple time units
- Dot plots: values and locations (or another variable), for multiple sub-groups (e.g. rural/urban, male/female, wealth quintile, etc.)
- Combination charts: any combination of the above with data values represented through points, bars or lines
Interpreting and creating SDMX database queries
Besides the structure of the SDMX sub-format, another key aspect to become acquainted with is the database query. A query is a web-link to an online database, for example an OECD database, which specifies what information you would like to retrieve. The online database will then return the results of your query in a specific format - for example, SDMX-ML format. It is like asking a question to Google, but in a more technical, structured format, and the response will be something like the OECD example above.
A database query typically includes at least the following:
- The link to the organization's database API (application programming interface), which can interpret queries and send back the results. For example, the UNESCO Institute for Statistics database API link is: https://api.uis.unesco.org/sdmx//data , and for OECD the API link is http://stats.oecd.org/restsdmx/sdmx.ashx/GetData/
- The specific database from which you wish to retrieve data, referring to the database code (e.g., "QNA" referring to OECD's Quarterly National Accounts database).
- The specific indicator(s) you wish to retrieve data for, such as GDP per capita, population growth, primary education net enrolment rate, CO2 emissions, unemployment rate, etc. This is specified in the query using the indicator code, rather than the indicator name.
There are also several optional parameters:
- The SDMX sub-format, for example, to specify you wish to retrieve data in SDMX-ML Compact format, it would include a parameter such as "format=compact". Without specifying this, the data will be returned in the API's default SDMX sub-format. This can be different for different database APIs.
- A list of location codes. By default, a query will return data for all locations, but if you wish to obtain only data for the USA and Mexico, for example, this can be specified in a parameter such as "USA+MEX". Locations could also refer to regions or districts with a country, or an average/aggregate of a set of countries such as "EU", or even a completely different kind of variable for non-geographical data sets.
- A time period range. By default, a query will return data for all available time periods. To specify a range, the query should include a start and end period. An example format is "startTime=2009&endTime=2011".
- A subscription key, for more restrictive APIs that require a subscription key to access.
The exact way of specifying the various parameters will vary from one SDMX API to another, and can be found in the organizations' API documentation.
Making the query more specific is useful to limit the amount of data which is returned. This is important to ensure you are looking only at data most relevant to your needs. It also speeds up the process, because more specific queries will return less data.
Using your knowledge of the types of parameters above, you will be able to "read" the SDMX queries used by different organizations:
Example 1 - UIS (UNESCO Institute for Statistics)
- Query: https://api.uis.unesco.org/sdmx//data/UNESCO,EDU_NON_FINANCE,1.0/NERA.PT... .KZ+KG+TJ+TM+UZ+BG+MD+RO+UA+AL+BA+MK+ME+RS+AM+AZ+GE?Z+KG+TJ+TM+UZ?format=sdmx-compact-2.1&subscription-key=12345&startPeriod=2006&endPeriod=2016
- Link to the API: https://api.uis.unesco.org/sdmx//data
- Database code: UNESCO,EDU_NON_FINANCE,1.0
- Indicator code: NERA.PT.L1..F.SCH_AGE_GROUP (this one is a bit more complex, as it specifies some additional indicator parameters, such as F for Female)
- Locations: KZ+KG+TJ+TM+UZ
- SDMX sub-format: format=sdmx-compact-2.1
- Subscription-key: 12345
- Time period: startPeriod=2006&endPeriod=2016
- To see how the data for such a query can be visualized, have a look at this Europe and Central Asia dashboard.
Example 2 - OECD (example used above)
- Query: http://stats.oecd.org/restsdmx/sdmx.ashx/GetData/QNA/AUS+AUT.GDP+B1_GE.C... /all?startTime=2009-Q2&endTime=2011-Q4&format=compact_v2
- Link to the API: http://stats.oecd.org/restsdmx/sdmx.ashx/GetData
- Database code: QNA
- Locations: AUS+AUT
- Indicator code: GDP+B1_GE.CUR+VOBARSA.Q (the last part Q specifies 'Quarterly')
- Time period: startTime=2009-Q2&endTime=2011-Q4
- SDMX sub-format: format=compact_v2
Armed with the above knowledge, you can start to interpret SDMX data, write your own SDMX queries, or even create your own dashboards or web applications using SDMX data.
Further reading
For further reading, please see: