Segregate by substance and region
For this example, we want to check metrics of the number of reports per region and differentiate between COVID-19 vaccination substances and other substances.
We need to answer a couple of questions to proceed: Can we filter Economic Area and others? For this we use the country column and group and count all the substance and product table entries.
Checking the country
SELECT
country,
COUNT(DISTINCT local_number)
from import.substance GROUP BY country;
country | count |
---|---|
European Economic Area | 5107493 |
Non European Economic Area | 5078995 |
Not Specified | 1271 |
SELECT
country,
COUNT(DISTINCT local_number)
from import.product GROUP BY country;
country | count |
---|---|
European Economic Area | 1240840 |
Non European Economic Area | 1865899 |
Not Specified | 34 |
We see that most reports are received from Non European Economic Area for both substances and products. Given that the EMA is a European database, this is remarkable.
Checking for substances with covid in the name
Now, can we differentiate records that are reported for a substance that has covid in it's name?
The query
SELECT
COUNT(DISTINCT local_number)
FROM import.substance
WHERE
substance_id = ANY(ARRAY(
SELECT substance_id FROM import.substances
WHERE name ILIKE '%covid%'
));
produces 1862731
and
SELECT
COUNT(DISTINCT local_number)
FROM import.substance
WHERE
substance_id NOT IN (
SELECT substance_id FROM import.substances
WHERE name ILIKE '%covid%'
);
produces 8338644
When we add these two results together, we get 10201375
, but we need to keep in mind that reports can have multiple versions in the database: there are duplicates in the column for local_number as reports can have various versions. This needs to be taken into account. So will our sum equal the total of all unique reports in the table?
SELECT
COUNT(DISTINCT local_number)
from import.substance;
This query produces 10187759
records. There are multiple substance_id's for a report and reports can end up in the results for having a substance with covid in the name involved combined with other substances. Differentiation between substances with covid in the name and others is possible although the results overlap given that the substance_id
can fall in both of the categories we are investigating. Remember, they are examples and results should be used carefully before drawing conclusions. These queries require further investigation.
Creating the required subsets
First, we create a table that groups by the local_number, representing the unique ICSR and stack all values for substance_id, report_date and country for later use.
CREATE TABLE
analyses.unique_substance_icsr as
select
local_number,
array_agg(distinct substance_id) substance_ids,
array_agg(distinct report_date) as report_dates,
array_agg(distinct country) as regions
from import.substance GROUP BY local_number;
produces 10187759
results.
from this table, we then create a new table containing all reports that are related to a substance with covid in the name:
CREATE TABLE analyses.covid_icsr as
select
local_number,
report_dates[1],
regions[1],
array_length(substance_ids,1) as substance_count
from analyses.unique_substance_icsr
WHERE
substance_ids && ARRAY(
SELECT substance_id::numeric::integer FROM import.substances
WHERE name ILIKE '%covid%'
)
order by array_length(substance_ids, 1);
produces 1862731
results
A similar query can be used to invert the selection by selecting all substance_id's where the substance_id does not contain the term covid
. This way, we get the results we define as others:
CREATE TABLE analyses.non_covid_icsr as
select
local_number,
report_dates[1],
regions[1],
array_length(substance_ids,1) as substance_count
from analyses.unique_substance_icsr
WHERE NOT substance_ids && ARRAY(
SELECT substance_id::numeric::integer FROM import.substances
WHERE name ILIKE '%covid%'
)
order by array_length(substance_ids, 1);
produces 8325028
results
This gives us the base for determining the positions numbered 1 to 4 in the matrix:
COVID-19 | Other | |
---|---|---|
European Economic Area | 1 | 2 |
Other | 3 | 4 |
Segment 1
European Economic Area, substances with covid in the name
CREATE TABLE
analyses.eea_covid as
select
report_dates,
count(*)
from
analyses.covid_icsr
where regions = 'European Economic Area'
GROUP BY report_dates order by report_dates;
produces 911
(days on which reports are filed or altered)
Segment 2
European Economic Area, other substances
CREATE TABLE
analyses.eea_other as
select
report_dates,
count(*)
from analyses.non_covid_icsr
where regions = 'European Economic Area'
GROUP BY report_dates order by report_dates;
produces 6918
(days on which reports are filed or altered)
Segment 3
Other Area, substances with covid in the name
CREATE TABLE
analyses.non_eea_covid as
select
report_dates,
count(*)
from analyses.covid_icsr
where regions != 'European Economic Area'
GROUP BY report_dates order by report_dates;
produces 865
(days on which reports are filed or altered)
Segment 4
Other Area, other substances
CREATE TABLE
analyses.non_eea_other as
select
report_dates,
count(*)
from analyses.non_covid_icsr
where regions != 'European Economic Area'
GROUP BY report_dates order by report_dates;
produces 7029
(days on which reports are filed or altered)
Now we can perform a query to ask the database; on what date were most reports recieved on substances with covid in their name?
SELECT * from analyses.eea_covid order by count desc limit 1;
Which results in:
report_dates | count |
---|---|
2022-06-16 | 15608 |
Showing that most reports on substances with covid in their name where received on the 16th of July in 2022.
Can you think of some more queries?