Segregate by substance and region

For this example, we want to check metrics of the number of reports per region and differentiate between COVID-19 vaccination substances and other substances.

We need to answer a couple of questions to proceed: Can we filter Economic Area and others? For this we use the country column and group and count all the substance and product table entries.

Checking the country

SELECT 
  country,
  COUNT(DISTINCT local_number)
from import.substance GROUP BY country;

country	count
European Economic Area	5107493
Non European Economic Area	5078995
Not Specified	1271

SELECT
  country,
  COUNT(DISTINCT local_number)
from import.product GROUP BY country;

country	count
European Economic Area	1240840
Non European Economic Area	1865899
Not Specified	34

We see that most reports are received from Non European Economic Area for both substances and products. Given that the EMA is a European database, this is remarkable.

Checking for substances with covid in the name

Now, can we differentiate records that are reported for a substance that has covid in it's name?

The query

SELECT
  COUNT(DISTINCT local_number)
FROM import.substance
WHERE 
  substance_id = ANY(ARRAY(
    SELECT substance_id FROM import.substances
    WHERE name ILIKE '%covid%'
  ));

produces 1862731

and

SELECT
  COUNT(DISTINCT local_number)
FROM import.substance
WHERE 
  substance_id NOT IN (
    SELECT substance_id FROM import.substances 
    WHERE name ILIKE '%covid%'
  );

produces 8338644

When we add these two results together, we get 10201375, but we need to keep in mind that reports can have multiple versions in the database: there are duplicates in the column for local_number as reports can have various versions. This needs to be taken into account. So will our sum equal the total of all unique reports in the table?

SELECT
  COUNT(DISTINCT local_number)
from import.substance;

This query produces 10187759 records. There are multiple substance_id's for a report and reports can end up in the results for having a substance with covid in the name involved combined with other substances. Differentiation between substances with covid in the name and others is possible although the results overlap given that the substance_id can fall in both of the categories we are investigating. Remember, they are examples and results should be used carefully before drawing conclusions. These queries require further investigation.

Creating the required subsets

First, we create a table that groups by the local_number, representing the unique ICSR and stack all values for substance_id, report_date and country for later use.

CREATE TABLE
 analyses.unique_substance_icsr as 
  select
   local_number,
   array_agg(distinct substance_id) substance_ids,
   array_agg(distinct report_date) as report_dates,
   array_agg(distinct country) as regions
from import.substance GROUP BY local_number;

produces 10187759 results.

from this table, we then create a new table containing all reports that are related to a substance with covid in the name:

CREATE TABLE analyses.covid_icsr as
  select
    local_number,
    report_dates[1],
    regions[1],
    array_length(substance_ids,1) as substance_count 
  from analyses.unique_substance_icsr
  WHERE
    substance_ids && ARRAY(
      SELECT substance_id::numeric::integer FROM import.substances
      WHERE name ILIKE '%covid%'
    )
  order by array_length(substance_ids, 1);

produces 1862731results

A similar query can be used to invert the selection by selecting all substance_id's where the substance_id does not contain the term covid. This way, we get the results we define as others:

CREATE TABLE analyses.non_covid_icsr as
  select
    local_number,
    report_dates[1],
    regions[1],
    array_length(substance_ids,1) as substance_count
  from analyses.unique_substance_icsr
  WHERE NOT substance_ids && ARRAY(
    SELECT substance_id::numeric::integer FROM import.substances
    WHERE name ILIKE '%covid%'
  )
  order by array_length(substance_ids, 1);

produces 8325028 results

This gives us the base for determining the positions numbered 1 to 4 in the matrix:

	COVID-19	Other
European Economic Area	1	2
Other	3	4

Segment 1

European Economic Area, substances with covid in the name

CREATE TABLE
  analyses.eea_covid as
    select
      report_dates,
      count(*)
    from
      analyses.covid_icsr
    where regions = 'European Economic Area'
    GROUP BY report_dates order by report_dates;

produces 911 (days on which reports are filed or altered)

Segment 2

European Economic Area, other substances

CREATE TABLE 
  analyses.eea_other as 
    select
      report_dates,
      count(*)
    from analyses.non_covid_icsr 
    where regions = 'European Economic Area'
    GROUP BY report_dates order by report_dates;

produces 6918 (days on which reports are filed or altered)

Segment 3

Other Area, substances with covid in the name

CREATE TABLE
  analyses.non_eea_covid as
    select
      report_dates,
      count(*)
    from analyses.covid_icsr
    where regions != 'European Economic Area'
    GROUP BY report_dates order by report_dates;

produces 865 (days on which reports are filed or altered)

Segment 4

Other Area, other substances

CREATE TABLE
  analyses.non_eea_other as
    select
      report_dates,
      count(*)
    from analyses.non_covid_icsr
    where regions != 'European Economic Area'
    GROUP BY report_dates order by report_dates;

produces 7029 (days on which reports are filed or altered)

Now we can perform a query to ask the database; on what date were most reports recieved on substances with covid in their name?

SELECT * from analyses.eea_covid order by count desc limit 1;

Which results in:

report_dates	count
2022-06-16	15608

Showing that most reports on substances with covid in their name where received on the 16th of July in 2022.

Can you think of some more queries?

OpenEMA documentation