Report versions
With a simple query, it is easy to see that reports when only looking at their local_number
have multiple versions. Let's select the top 10 reports with the most versions.
SELECT
local_number,
count(*)
FROM import.substance
GROUP BY local_number
ORDER BY count(*) DESC LIMIT 10;
local_number | count |
---|---|
EU-EC-2950025 | 133 |
EU-EC-10015259316 | 123 |
EU-EC-10015258987 | 106 |
EU-EC-12370933 | 100 |
EU-EC-10014572231 | 99 |
EU-EC-10010813941 | 98 |
EU-EC-10015268466 | 94 |
EU-EC-10015163949 | 94 |
EU-EC-10015261376 | 94 |
EU-EC-10014947659 | 93 |
(10 rows)
Can a report span multiple days?
Let's check:
SELECT
local_number,
COUNT(DISTINCT report_date)
FROM import.substance
GROUP BY local_number
ORDER BY COUNT(DISTINCT report_date) DESC LIMIT 10;
local_number | count |
---|---|
EU-EC-1000000 | 1 |
EU-EC-10000000 | 1 |
EU-EC-10000000011 | 1 |
EU-EC-10000000016 | 1 |
EU-EC-10000000019 | 1 |
EU-EC-10000000021 | 1 |
EU-EC-10000000022 | 1 |
EU-EC-10000000024 | 1 |
EU-EC-10000000025 | 1 |
EU-EC-10000 | 1 |
(10 rows)
it seems it can't.
So what is causing multiple versions of a report?
We find that the report with local_number "EU-EC-2950025" has 133 versions. We will look at this report in more detail. First of all, we check if our previous conclusion that the date does not change holds.
SELECT
local_number,
report_date,
COUNT(*)
FROM import.substance
WHERE
local_number = 'EU-EC-2950025'
GROUP BY
local_number,
report_date;
local_number | report_date | count |
---|---|---|
EU-EC-2950025 | 2009-07-13 | 133 |
It does, all the versions are from exactly the same date (day). Let's look at the is_child_report
boolean to see if at least one of the versions is marked as a parent.
SELECT
local_number,
report_date,
is_child_report,
COUNT(*)
FROM import.substance
WHERE
local_number = 'EU-EC-2950025'
GROUP BY
local_number,
report_date,
is_child_report;
It turns out they are all child reports.
local_number | report_date | is_child_report | count |
---|---|---|---|
EU-EC-2950025 | 2009-07-13 | t | 133 |
Could it have something to do with the substance_id
?
SELECT
local_number,
report_date,
substance_id,
COUNt(*)
FROM
import.substance
WHERE local_number = 'EU-EC-2950025'
GROUP BY
local_number,
report_date,
is_child_report,
substance_id;
local_number | report_date | substance_id | count |
---|---|---|---|
EU-EC-2950025 | 2009-07-13 | 14621 | 1 |
EU-EC-2950025 | 2009-07-13 | 14623 | 1 |
EU-EC-2950025 | 2009-07-13 | 14640 | 1 |
..... | ..... | ..... | ...... |
EU-EC-2950025 | 2009-07-13 | 76884 | 1 |
EU-EC-2950025 | 2009-07-13 | 201161 | 1 |
EU-EC-2950025 | 2009-07-13 | 228030 | 1 |
(133 rows) (truncated for readability)
As the table shows: It does. This report spans 133 substances. We can conclude for this particular report that it is a report that is related to multiple substances. Our scripts to get line listings from EMA loops all substance of product identifiers and thus will return a list of reports per substance and product. This means that where the local_number
should be the so called primary key across our database, it is impossible to use it that way in the downloading process but the import.product
and import.substance
tables should have a UNIQUE(substance_id, local_number)
combination.
Do we have unique combinations of id and local_number?
Let's see if the combination product_id and local_number is unique:
ALTER TABLE import.product ADD CONSTRAINT
product_unique_local_number_product_id
UNIQUE (product_id, local_number);
If this is the case, we should not get an error. It is indeed true. The output is
ALTER TABLE
We repeat the process for the substance table
ALTER TABLE import.substance ADD CONSTRAINT
substance_unique_local_number_substance_id
UNIQUE (substance_id, local_number);
Again, we should not get an error, so the output should be
ALTER TABLE
We can now say that the product and substance tables have the integrity we want, that all the records are unique for the combination between the identifier of the substance(substance_id
) or product(product_id
) and the report identifier (local_number
)