I’m sure everybody is aware you can have PostgreSQL fetch data live across the internet (or locally of course) as part of your queries and use the data directly. In fact, there are a large number of foreign data wrapper drivers available to access different kinds of data for download. But in the simplest case, we can also just use the file_fdw wrapper that’s included in the standard PostgreSQL packages, together with everybody’s HTTP swiss-army-knife, curl.
In adapting this post for the public, what more timely dataset to work with in these pandemic-affected times than the open data provided by the ECDC, being both openly licensed and very current? In particular for this example, they provide public datasets with COVID numbers from across the world (actual public data, requiring no registration to read, and actual data, not just a limited API).
So, let’s see how we can access this data from PostgreSQL:
CREATE EXTENSION file_fdw;
CREATE SERVER curly FOREIGN DATA WRAPPER file_fdw;
CREATE FOREIGN TABLE _rawdata (
    daterep text not null,
    day int not null,
    month int not null,
    year int not null,
    cases int not null,
    deaths int not null,
    countries text not null,
    geoid text not null,
    countrycode text null,
    popdata int null,
    continent text not null,
    cumulative14days float null
)
SERVER curly
OPTIONS (
    PROGRAM 'curl -s https://opendata.ecdc.europa.eu/covid19/casedistribution/csv/',
    FORMAT 'csv',
    HEADER 'on'
);
And to use this we can simply query the foreign table, which will then trigger an HTTP GET of the data:
covid=# SELECT COUNT(*) FROM _rawdata;
 count
-------
 41837
(1 row)
However, there is an unfortunate interaction with LIMIT:
covid=# SELECT * FROM _rawdata LIMIT 10;
ERROR: program "curl -s https://opendata.ecdc.europa.eu/covid19/casedistribution/csv/" failed
DETAIL: child process exited with exit code 23
So what actually causes this problem? When the PostgreSQL query executes with LIMIT 10, it stops reading after 10 rows and closes the pipe from curl. At that point, curl gives us an error. If you run the command manually with a pipe to head, and without silent mode, this becomes clear:
$ curl https://opendata.ecdc.europa.eu/covid19/casedistribution/csv/ |head -1
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0dateRep,day,month,year,cases,deaths,countriesAndTerritories,geoId,countryterritoryCode,popData2019,continentExp,Cumulative_number_for_14_days_of_COVID-19_cases_per_100000
0 2797k 0 17109 0 0 556k 0 0:00:05 --:--:-- 0:00:05 556k
curl: (23) Failed writing body (723 != 1448)
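The same failure mode can be reproduced locally, without any network involved. In the sketch below, seq stands in for curl as a writer producing far more data than the reader wants, and bash’s PIPESTATUS shows us the writer’s exit status:

```shell
#!/usr/bin/env bash
# 'seq' plays the part of curl: it tries to write ~78 MB into the pipe,
# but 'head' closes its end after reading a single line. The next write
# by seq then fails with SIGPIPE, which bash reports as 128 + 13 = 141.
seq 1 10000000 | head -n 1
echo "writer exit status: ${PIPESTATUS[0]}"
```

The difference is that curl handles this condition itself rather than dying from the signal, and reports it as its own exit code 23 (“failed writing body”).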
It’s hard to complain about curl putting out an error here, because, well, it is an error. Unfortunately, while curl has a crazy amount of command-line parameters, it does not have one that would help with this scenario. However, a little bit of shell trickery can actually help us. Let’s instead create the foreign table like this:
CREATE FOREIGN TABLE _rawdata (
    daterep text not null,
    day int not null,
    month int not null,
    year int not null,
    cases int not null,
    deaths int not null,
    countries text not null,
    geoid text not null,
    countrycode text null,
    popdata int null,
    continent text not null,
    cumulative14days float null
)
SERVER curly
OPTIONS (
    PROGRAM 'curl -s https://opendata.ecdc.europa.eu/covid19/casedistribution/csv/ || exit $(( $? == 23 ? 0 : $? ))',
    FORMAT 'csv',
    HEADER 'on'
);
Since the parameter for PROGRAM in file_fdw can be an arbitrary shell expression, we can use this type of syntax to turn exit code 23 into exit code 0 (no error), while at the same time still detecting any other errors.
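To see what that shell expression does in isolation, here is a small stand-alone sketch; the commands built with sh -c are stand-ins for curl, so the names and exit codes here are purely illustrative:

```shell
#!/usr/bin/env bash
# Remap exit code 23 to 0 (success) but pass any other failure through
# unchanged, exactly like the PROGRAM string above. The subshell plays
# the role of the shell that file_fdw spawns to run the command.
run_remapped() {
    ( "$@" || exit $(( $? == 23 ? 0 : $? )) )
    echo "remapped exit status: $?"
}

run_remapped sh -c 'echo some,csv,rows; exit 23'   # 23 becomes 0
run_remapped sh -c 'exit 7'                        # 7 stays 7
```

Note that both occurrences of $? inside the arithmetic expansion are expanded before the expression is evaluated, so they refer to the same failing status.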
And once we have done this, we can query with LIMIT just fine (wrapped in a sub-query to make the result shorter for this post). We can tell it works correctly because it runs a lot faster than a scan of the whole table, so clearly the curl command aborted soon after it started:
covid=# SELECT count(*) FROM (SELECT * FROM _rawdata LIMIT 10) x;
 count
-------
    10
(1 row)
Time: 54.818 ms
covid=# SELECT count(*) FROM _rawdata;
 count
-------
 41837
(1 row)
Time: 6563.012 ms (00:06.563)
This little hack can be useful in a lot of cases when we really just need to peek at the beginning of the data (perhaps to validate that something we’re doing is done the right way, such as column name mapping, or just because that’s where the interesting data is). But in the example data picked for this case, we know that the data is only published once a day, and that we are likely to want to look at more than just the first n rows of it. In this case, the obvious thing we want to do is to materialize the data so we can query it repeatedly, and also to convert some data to a more reasonable format:
CREATE MATERIALIZED VIEW covid AS
SELECT to_date(daterep, 'dd/mm/yyyy') AS date,
cases,
deaths,
countries,
geoid,
countrycode,
popdata,
continent,
cumulative14days
FROM _rawdata;
And of course querying is much faster now that the data is local:
covid=# SELECT count(*) FROM covid;
 count
-------
 41837
(1 row)
Time: 34.267 ms
And for any actual analytics we might want to create a couple of local indexes:
CREATE UNIQUE INDEX idx_covid_datecountry ON covid(date, countrycode);
CREATE INDEX idx_covid_countrycode ON covid(countrycode);
CREATE INDEX idx_covid_countries ON covid(countries);
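With those indexes in place, a query like the following can be answered entirely from the local, indexed copy (the country code and the choice of columns here are just illustrative):

```
-- Hypothetical example: the latest two weeks of numbers for one
-- country, filtered via the index on countrycode.
SELECT date, cases, deaths, cumulative14days
FROM covid
WHERE countrycode = 'SWE'
ORDER BY date DESC
LIMIT 14;
```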
And then we schedule a job that refreshes our data once per day, so that we always have the latest:
covid=# REFRESH MATERIALIZED VIEW CONCURRENTLY covid;
REFRESH MATERIALIZED VIEW
Time: 6684.797 ms (00:06.685)
This obviously takes a bit of time as it re-downloads the data, but then we have all our data freshly indexed and available for whatever type of analytics we want to do.
In conclusion, using a MATERIALIZED VIEW for this is clearly the correct solution, and if we do that we can use LIMIT or any other query constructs without having to consider the underlying implementation at all. However, there are many cases where this might not make sense, such as when the freshness of the data is more important. And when that is the case, the exit-code shell hack can be quite useful for running a LIMIT query against the foreign table.