Product, Substance, Drug
Throughout the database, we find products and substances but drugs are also mentioned. We are curious about the overlap and the differences and would like to see if we can make head or tails from the different types.
We have tables for products and substances. Is it possible to make a table that contains both and add all entries found for drugs as well? Let's first look at the unique entries in the drug list tables.
SELECT
'product_suspect_drug_list' as source,
count(distinct drugs)
FROM
import.product_subject_drug_list
UNION
SELECT
'product_concomitant_drug_list' as source,
count(distinct drugs)
FROM
import.product_concomitant_drug_list
UNION
SELECT
'substance_suspect_drug_list' as source,
count(distinct drugs)
FROM
import.substance_subject_drug_list
UNION
SELECT
'substance_concomitant_drug_list' as source,
count(distinct drugs)
FROM
import.substance_concomitant_drug_list
UNION
SELECT
'products' as source,
count(*)
FROM
import.products
UNION
SELECT
'substances' as source,
count(*)
FROM
import.substances;
source | count |
---|---|
substances | 4333 |
substance_suspect_drug_list | 39350 |
product_suspect_drug_list | 18960 |
product_concomitant_drug_list | 34939 |
substance_concomitant_drug_list | 65836 |
products | 1469 |
As we can see there is a big difference between the amount of products, substances and drugs.
The drug lists contains textual information which seems to be a combination of medicine (not always present as product and sometimes omitted) and substances between square brackets.
We are going to try to get everything into a new table to see if we can find any relations.
CREATE TABLE analyses.temp_names as
SELECT distinct drugs as "words" from import.product_subject_drug_list;
insert into analyses.temp_names
SELECT distinct drugs from import.substance_subject_drug_list;
insert into analyses.temp_names
SELECT distinct drugs from import.product_concomitant_drug_list;
insert into analyses.temp_names
SELECT distinct drugs from import.substance_concomitant_drug_list;
insert into analyses.temp_names
SELECT distinct name from import.products;
insert into analyses.temp_names
SELECT distinct name from import.substances;
create table analyses.name as select distinct words from analyses.temp_names;
drop table analyses.temp_names;
SELECT count(*) from analyses.name;
count |
---|
84990 |
We find that the result of getting the distinct set of names, words or whatever we should call this is less then the sum of all found words from all tables that contain names or descriptions for substance, product or drug. Can we find intersections and relations?
At least we now have one long list to search through. Let's do a little experiment.
Mentions of a product
select * from analyses.name where words ilike '%comirnaty%';
words |
---|
COMIRNATY 10 MICROGRAMS/DOSE CONCENTRATE FOR DISPERSION FOR INJECTION COVID-19 MRNA VACCINE |
COMIRNATY 10 MICROGRAMS/DOSE CONCENTRATE FOR DISPERSION FOR INJECTION [TOZINAMERAN] |
COMIRNATY 10 MICROGRAMS/DOSE [TOZINAMERAN] |
COMIRNATY 30 MICROGRAMS/DOSE CONCENTRATE FOR DISPERSION FOR INJECTION COVID-19 MRNA VACCINE |
COMIRNATY 30 MICROGRAMS/DOSE CONCENTRATE FOR DISPERSION FOR INJECTION [TOZINAMERAN] |
COMIRNATY 30 MICROGRAMS/DOSE DISPERSION FOR INJECTION COVID-19 MRNA VACCINE |
COMIRNATY 30 MICROGRAMS/DOSE DISPERSION FOR INJECTION [TOZINAMERAN] |
COMIRNATY 30 MICROGRAMS/DOSE [TOZINAMERAN] |
COMIRNATY 3 MICROGRAMS/DOSE CONCENTRATE FOR DISPERSION FOR INJECTION COVID-19 MRNA VACCINE |
COMIRNATY CONCENTRATE FOR DISPERSION FOR INJECTION 10 MICROGRAMS/DOSE [TOZINAMERAN] |
COMIRNATY CONCENTRATE FOR DISPERSION FOR INJECTION 30 MICROGRAMS/DOSE [TOZINAMERAN] |
COMIRNATY CONCENTRATE FOR DISPERSION FOR INJECTION COVID-19 MRNA VACCINE |
COMIRNATY CONCENTRATE FOR DISPERSION FOR INJECTION [TOZINAMERAN] |
COMIRNATY COVID-19 MRNA VACCINE |
COMIRNATY DISPERSION FOR INJECTION [TOZINAMERAN] |
COMIRNATY ORIGINAL/OMICRON BA.1 |
COMIRNATY ORIGINAL/OMICRON BA.1 DISPERSION FOR INJECTION |
COMIRNATY ORIGINAL/OMICRON BA.1 [TOZINAMERAN, RILTOZINAMERAN] |
COMIRNATY ORIGINAL/OMICRON BA.4-5 |
COMIRNATY ORIGINAL/OMICRON BA.4-5 COVID-19 MRNA VACCINE |
COMIRNATY ORIGINAL/OMICRON BA.4-5 [TOZINAMERAN, 5'-CAPPED MRNA ENCODING SARS-COV-2, OMICRON VARIANTS BA.4 AND BA.5, SPIKE PROTEIN, PRE-FUSION STABILISED |
COMIRNATY ORIGINAL/OMICRON BA.4-5 [TOZINAMERAN, FAMTOZINAMERAN] |
COMIRNATY [TOZINAMERAN] |
As seen, there are products and substances mentioned in the drug lists, but also information about the dose and sometimes even unrelated terms.
Language specific information
One of the things we do not seem to be able to extract from the EMA database is the country a report originates from. But looking at the created analyses.name
table shows something interesting. There is information in the words that are language specific. If we look for instance for the text per dag
which is Dutch for per day, we find 3 rows.
select * from analyses.name where words ilike '%per dag%';
words |
---|
1000MG 3X PER DAG [NOT AVAILABLE] |
DE PIL 30 MG PER DAG [NOT AVAILABLE] |
KOELZALF 3X PER DAG OP DE HUID AANBRENGEN; [NOT AVAILABLE] |
Another particular Dutch word is prik let's see if we can find that too:
select * from analyses.name where words ilike '%prik%';
words |
---|
6DE PRIK DT [NOT AVAILABLE] |
BOOSTRIX POLIO INJSUSP WWSP 0,5ML / 4DE PRIK DKTP/HIB/HEPATITIS B [NOT AVAILABLE] |
EEN PRIK IN MIJN SCHOUDER TEGEN DE SLIJMBEURSONTSTEKING [NOT AVAILABLE] |
RUGGENPRIK [NOT AVAILABLE] |
RUGGEPRIK [NOT AVAILABLE] |
TETANUS PRIK [NOT AVAILABLE] |
Here we also discover the Dutch word tegen meaning against. Let's look for that too.
select * from analyses.name where words ilike '%tegen%';
words |
---|
2 GANGBARE MEDICIJNEN TEGEN HOGE BLOEDDRUK. [NOT AVAILABLE] |
AANSTIPMIDDEL TEGEN ACNE [NOT AVAILABLE] |
DIVERSE PUFJES TEGEN MIJN COPD [NOT AVAILABLE] |
EEN PRIK IN MIJN SCHOUDER TEGEN DE SLIJMBEURSONTSTEKING [NOT AVAILABLE] |
GENEESMIDDEL TEGEN HOOIKOORTS EN ANDERE ALLERGIEKLACHTEN [NOT AVAILABLE] |
IETS TEGEN MISSELIJKHEID [NOT AVAILABLE] |
MIDDEL TEGEN BOEZEM FIBRILEREN [NOT AVAILABLE] |
MIDDEL TEGEN TRAGE SCHILDKLIER [NOT AVAILABLE] |
ONTSTEKINGSREMMER TEGEN ARTRITIS PSORIATICA [NOT AVAILABLE] |
PUFJE TEGEN ASTMA [NOT AVAILABLE] |
TEGEN DE MISSELIJKHEID EN TEGEN EEN ALLERGISCHE REACTIE [NOT AVAILABLE] |
ZALFJES TEGEN PSORIASIS [NOT AVAILABLE] |
Although we cannot discover in detail from which country a report originates, we can determine that the Dutch language is used in specific reports. Maybe this method can also be used to investigate other languages?
Word cloud
To get insight in the most used words, we generate a table that can be used to create a word cloud.
WITH words AS (
SELECT unnest(string_to_array(lower(words), ' ')) AS word
FROM analyses.name
)
SELECT word, count(*) as frequency FROM words
GROUP BY word order by count(*) desc limit 25;
word | frequency |
---|---|
[not | 16887 |
available] | 16871 |
acid, | 5725 |
sodium | 5111 |
not | 4488 |
hydrochloride, | 4425 |
available | 4292 |
for | 3875 |
mg | 3647 |
chloride, | 3021 |
calcium | 2874 |
hydrochloride] | 2603 |
2424 | |
magnesium | 2259 |
nos] | 1974 |
solution | 1913 |
potassium | 1850 |
unspecified | 1850 |
chloride | 1742 |
medication | 1633 |
nos, | 1558 |
Maybe we should clean up special characters like ]
and ,
but for now this will give us an idea.
We generate a file containing all words from the previous query.
COPY (SELECT unnest(string_to_array(lower(words), ' ')) AS word
FROM analyses.name) TO '/var/lib/postgres/wordcloud.csv' (format CSV);
Uploading this file to Free World Cloud Generator renders a nice word cloud.
If we ignore the word "available" for now, it is interesting to see words like (hydro)chloride, aluminium and acid and another Dutch word: "pollen".
Can you detect the most used medicine in the word cloud?
You now should have a general idea about what is available by gathering textual information and stitching it together. Can you think of more advanced uses?