Pgloader Tutorial
Loading CSV Data with pgloader
CSV means comma separated values and is often found with quite varying specifications. pgloader allows you to describe those specs in its command.
The Command
To load data with pgloader you need to define in a command the operations in some details. Here’s our example for loading CSV data:
LOAD CSV
FROM 'path/to/file.csv' (x, y, a, b, c, d)
INTO postgresql:///pgloader?csv (a, b, d, c)
WITH truncate,
skip header = 1,
fields optionally enclosed by '"',
fields escaped by double-quote,
fields terminated by ','
SET client_encoding to 'latin1',
work_mem to '12MB',
standard_conforming_strings to 'on'
BEFORE LOAD DO
$$ drop table if exists csv; $$,
$$ create table csv (
a bigint,
b bigint,
c char(2),
d text
);
$$;
The Data
This command allows loading the following CSV file content:
Header, with a © sign
"2.6.190.56","2.6.190.63","33996344","33996351","GB","United Kingdom"
"3.0.0.0","4.17.135.31","50331648","68257567","US","United States"
"4.17.135.32","4.17.135.63","68257568","68257599","CA","Canada"
"4.17.135.64","4.17.142.255","68257600","68259583","US","United States"
"4.17.143.0","4.17.143.15","68259584","68259599","CA","Canada"
"4.17.143.16","4.18.32.71","68259600","68296775","US","United States"
Loading the data
Here’s how to start loading the data. Note that the ouput here has been edited so as to facilitate its browsing online:
$ pgloader csv.load
... LOG Starting pgloader, log system is ready.
... LOG Parsing commands from file "/Users/dim/dev/pgloader/test/csv.load"
table name read imported errors time
----------------- --------- --------- --------- --------------
before load 2 2 0 0.039s
----------------- --------- --------- --------- --------------
csv 6 6 0 0.019s
----------------- --------- --------- --------- --------------
Total import time 6 6 0 0.058s
The result
As you can see, the command described above is filtering the input and only importing some of the columns from the example data file. Here’s what gets loaded in the PostgreSQL database:
pgloader# table csv;
a | b | c | d
----------+----------+----+----------------
33996344 | 33996351 | GB | United Kingdom
50331648 | 68257567 | US | United States
68257568 | 68257599 | CA | Canada
68257600 | 68259583 | US | United States
68259584 | 68259599 | CA | Canada
68259600 | 68296775 | US | United States
(6 rows)
Loading Fixed Width Data File with pgloader
Some data providers still use a format where each column is specified with a starting index position and a given length. Usually the columns are blank-padded when the data is shorter than the full reserved range.
The Command
To load data with pgloader you need to define in a command the operations in some details. Here’s our example for loading Fixed Width Data, using a file provided by the US census.
You can find more files from them at the [Census 2000 Gazetteer Files](http://www.census.gov/geo/maps-data/data/gazetteer2000.html).
Here’s our command:
LOAD ARCHIVE
FROM http://www2.census.gov/geo/docs/maps-data/data/gazetteer/places2k.zip
INTO postgresql:///pgloader
BEFORE LOAD DO
$$ drop table if exists places; $$,
$$ create table places
(
usps char(2) not null,
fips char(2) not null,
fips_code char(5),
loc_name varchar(64)
);
$$
LOAD FIXED
FROM FILENAME MATCHING ~/places2k.txt/
WITH ENCODING latin1
(
usps from 0 for 2,
fips from 2 for 2,
fips_code from 4 for 5,
"LocationName" from 9 for 64 [trim right whitespace],
p from 73 for 9,
h from 82 for 9,
land from 91 for 14,
water from 105 for 14,
ldm from 119 for 14,
wtm from 131 for 14,
lat from 143 for 10,
long from 153 for 11
)
INTO postgresql:///pgloader?places
(
usps, fips, fips_code, "LocationName"
);
The Data
This command allows loading the following file content, where we are only showing the first couple of lines:
AL0100124Abbeville city 2987 1353 40301945 120383 15.560669 0.046480 31.566367 -85.251300
AL0100460Adamsville city 4965 2042 50779330 14126 19.606010 0.005454 33.590411 -86.949166
AL0100484Addison town 723 339 9101325 0 3.514041 0.000000 34.200042 -87.177851
AL0100676Akron town 521 239 1436797 0 0.554750 0.000000 32.876425 -87.740978
AL0100820Alabaster city 22619 8594 53023800 141711 20.472605 0.054715 33.231162 -86.823829
AL0100988Albertville city 17247 7090 67212867 258738 25.951034 0.099899 34.265362 -86.211261
AL0101132Alexander City city 15008 6855 100534344 433413 38.816529 0.167342 32.933157 -85.936008
Loading the data
Let’s start the pgloader command with our census-places.load command file:
$ pgloader census-places.load
... LOG Starting pgloader, log system is ready.
... LOG Parsing commands from file "/Users/dim/dev/pgloader/test/census-places.load"
... LOG Fetching 'http://www2.census.gov/geo/docs/maps-data/data/gazetteer/places2k.zip'
... LOG Extracting files from archive '//private/var/folders/w7/9n8v8pw54t1gngfff0lj16040000gn/T/pgloader//places2k.zip'
table name read imported errors time
----------------- --------- --------- --------- --------------
download 0 0 0 1.494s
extract 0 0 0 1.013s
before load 2 2 0 0.013s
----------------- --------- --------- --------- --------------
places 25375 25375 0 0.499s
----------------- --------- --------- --------- --------------
Total import time 25375 25375 0 3.019s
We can see that pgloader did download the file from its HTTP URL location then unziped it before the loading itself.
Note that the output of the command has been edited to facilitate its browsing online.
Loading MaxMind Geolite Data with pgloader
MaxMind provides a free dataset for geolocation, which is quite popular. Using pgloader you can download the lastest version of it, extract the CSV files from the archive and load their content into your database directly.
The Command
To load data with pgloader you need to define in a command the operations in some details. Here’s our example for loading the Geolite data:
/*
* Loading from a ZIP archive containing CSV files. The full test can be
* done with using the archive found at
* http://geolite.maxmind.com/download/geoip/database/GeoLiteCity_CSV/GeoLiteCity-latest.zip
*
* And a very light version of this data set is found at
* http://pgsql.tapoueh.org/temp/foo.zip for quick testing.
*/
LOAD ARCHIVE
FROM http://geolite.maxmind.com/download/geoip/database/GeoLiteCity_CSV/GeoLiteCity-latest.zip
INTO postgresql:///ip4r
BEFORE LOAD DO
$$ create extension if not exists ip4r; $$,
$$ create schema if not exists geolite; $$,
$$ create table if not exists geolite.location
(
locid integer primary key,
country text,
region text,
city text,
postalcode text,
location point,
metrocode text,
areacode text
);
$$,
$$ create table if not exists geolite.blocks
(
iprange ip4r,
locid integer
);
$$,
$$ drop index if exists geolite.blocks_ip4r_idx; $$,
$$ truncate table geolite.blocks, geolite.location cascade; $$
LOAD CSV
FROM FILENAME MATCHING ~/GeoLiteCity-Location.csv/
WITH ENCODING iso-8859-1
(
locId,
country,
region null if blanks,
city null if blanks,
postalCode null if blanks,
latitude,
longitude,
metroCode null if blanks,
areaCode null if blanks
)
INTO postgresql:///ip4r?geolite.location
(
locid,country,region,city,postalCode,
location point using (format nil "(~a,~a)" longitude latitude),
metroCode,areaCode
)
WITH skip header = 2,
fields optionally enclosed by '"',
fields escaped by double-quote,
fields terminated by ','
AND LOAD CSV
FROM FILENAME MATCHING ~/GeoLiteCity-Blocks.csv/
WITH ENCODING iso-8859-1
(
startIpNum, endIpNum, locId
)
INTO postgresql:///ip4r?geolite.blocks
(
iprange ip4r using (ip-range startIpNum endIpNum),
locId
)
WITH skip header = 2,
fields optionally enclosed by '"',
fields escaped by double-quote,
fields terminated by ','
FINALLY DO
$$ create index blocks_ip4r_idx on geolite.blocks using gist(iprange); $$;
Note that while the Geolite data is using a pair of integers (start, end) to represent ipv4 data, we use the very poweful ip4r PostgreSQL Extension instead.
The transformation from a pair of integers into an IP is done dynamically by the pgloader process.
Also, the location is given as a pair of float columns for the longitude and the latitude where PostgreSQL offers the point datatype, so the pgloader command here will actually transform the data on the fly to use the appropriate data type and its input representation.
Loading the data
Here’s how to start loading the data. Note that the ouput here has been edited so as to facilitate its browsing online:
$ pgloader archive.load
... LOG Starting pgloader, log system is ready.
... LOG Parsing commands from file "/Users/dim/dev/pgloader/test/archive.load"
... LOG Fetching 'http://geolite.maxmind.com/download/geoip/database/GeoLiteCity_CSV/GeoLiteCity-latest.zip'
... LOG Extracting files from archive '//private/var/folders/w7/9n8v8pw54t1gngfff0lj16040000gn/T/pgloader//GeoLiteCity-latest.zip'
table name read imported errors time
----------------- --------- --------- --------- --------------
download 0 0 0 11.592s
extract 0 0 0 1.012s
before load 6 6 0 0.019s
----------------- --------- --------- --------- --------------
geolite.location 470387 470387 0 7.743s
geolite.blocks 1903155 1903155 0 16.332s
----------------- --------- --------- --------- --------------
finally 1 1 0 31.692s
----------------- --------- --------- --------- --------------
Total import time 2373542 2373542 0 1m8.390s
The timing of course includes the transformation of the 1.9 million pairs of integer into a single ipv4 range each. The finally step consists of creating the GiST specialized index as given in the main command:
CREATE INDEX blocks_ip4r_idx ON geolite.blocks USING gist(iprange);
That index will then be used to speed up queries wanting to find which recorded geolocation contains a specific IP address:
ip4r> select *
from geolite.location l
join geolite.blocks b using(locid)
where iprange >>= '8.8.8.8';
-[ RECORD 1 ]------------------
locid | 223
country | US
region |
city |
postalcode |
location | (-97,38)
metrocode |
areacode |
iprange | 8.8.8.8-8.8.37.255
Time: 0.747 ms
Loading dBase files with pgloader
The dBase format is still in use in some places as modern tools such as Filemaker and Excel offer some level of support for it. Speaking of support in modern tools, pgloader is right there on the list too!
The Command
To load data with pgloader you need to define in a command the operations in some details. Here’s our example for loading a dBase file, using a file provided by the french administration.
You can find more files from them at the Insee website.
Here’s our command:
LOAD DBF
FROM http://www.insee.fr/fr/methodes/nomenclatures/cog/telechargement/2013/dbf/historiq2013.zip
INTO postgresql:///pgloader
WITH truncate, create table
SET client_encoding TO 'latin1';
Note that here pgloader will benefit from the meta-data information found in the dBase file to create a PostgreSQL table capable of hosting the data as described, then load the data.
Loading the data
Let’s start the pgloader command with our dbf-zip.load command file:
$ pgloader dbf-zip.load
... LOG Starting pgloader, log system is ready.
... LOG Parsing commands from file "/Users/dim/dev/pgloader/test/dbf-zip.load"
... LOG Fetching 'http://www.insee.fr/fr/methodes/nomenclatures/cog/telechargement/2013/dbf/historiq2013.zip'
... LOG Extracting files from archive '//private/var/folders/w7/9n8v8pw54t1gngfff0lj16040000gn/T/pgloader//historiq2013.zip'
table name read imported errors time
----------------- --------- --------- --------- --------------
download 0 0 0 0.167s
extract 0 0 0 1.010s
create, truncate 0 0 0 0.071s
----------------- --------- --------- --------- --------------
historiq2013 9181 9181 0 0.658s
----------------- --------- --------- --------- --------------
Total import time 9181 9181 0 1.906s
We can see that pgloader did download the file from its HTTP URL location then unziped it before the loading itself.
Note that the output of the command has been edited to facilitate its browsing online.
Loading SQLite files with pgloader
The SQLite database is a respected solution to manage your data with. Its embeded nature makes it a source of migrations when a projects now needs to handle more concurrency, which PostgreSQL is very good at. pgloader can help you there.
In a Single Command Line
You can
$ createdb chinook
$ pgloader https://github.com/lerocha/chinook-database/raw/master/ChinookDatabase/DataSources/Chinook_Sqlite_AutoIncrementPKs.sqlite pgsql:///chinook
Done! All with the schema, data, constraints, primary keys and foreign keys, etc. We also see an error with the Chinook schema that contains several primary key definitions against the same table, which is not accepted by PostgreSQL:
2017-06-20T16:18:59.019000+02:00 LOG Data errors in '/private/tmp/pgloader/'
2017-06-20T16:18:59.236000+02:00 LOG Fetching 'https://github.com/lerocha/chinook-database/raw/master/ChinookDatabase/DataSources/Chinook_Sqlite_AutoIncrementPKs.sqlite'
2017-06-20T16:19:00.664000+02:00 ERROR Database error 42P16: multiple primary keys for table "playlisttrack" are not allowed
QUERY: ALTER TABLE playlisttrack ADD PRIMARY KEY USING INDEX idx_66873_sqlite_autoindex_playlisttrack_1;
2017-06-20T16:19:00.665000+02:00 LOG report summary reset
table name read imported errors total time
----------------------- --------- --------- --------- --------------
fetch 0 0 0 0.877s
fetch meta data 33 33 0 0.033s
Create Schemas 0 0 0 0.003s
Create SQL Types 0 0 0 0.006s
Create tables 22 22 0 0.043s
Set Table OIDs 11 11 0 0.012s
----------------------- --------- --------- --------- --------------
album 347 347 0 0.023s
artist 275 275 0 0.023s
customer 59 59 0 0.021s
employee 8 8 0 0.018s
invoice 412 412 0 0.031s
genre 25 25 0 0.021s
invoiceline 2240 2240 0 0.034s
mediatype 5 5 0 0.025s
playlisttrack 8715 8715 0 0.040s
playlist 18 18 0 0.016s
track 3503 3503 0 0.111s
----------------------- --------- --------- --------- --------------
COPY Threads Completion 33 33 0 0.313s
Create Indexes 22 22 0 0.160s
Index Build Completion 22 22 0 0.027s
Reset Sequences 0 0 0 0.017s
Primary Keys 12 0 1 0.013s
Create Foreign Keys 11 11 0 0.040s
Create Triggers 0 0 0 0.000s
Install Comments 0 0 0 0.000s
----------------------- --------- --------- --------- --------------
Total import time 15607 15607 0 1.669s
You may need to have special cases to take care of tho. In advanced case you can use the pgloader command.
The Command
To load data with pgloader you need to define in a command the operations in some details. Here’s our command:
load database
from 'sqlite/Chinook_Sqlite_AutoIncrementPKs.sqlite'
into postgresql:///pgloader
with include drop, create tables, create indexes, reset sequences
set work_mem to '16MB', maintenance_work_mem to '512 MB';
Note that here pgloader will benefit from the meta-data information found in the SQLite file to create a PostgreSQL database capable of hosting the data as described, then load the data.
Loading the data
Let’s start the pgloader command with our sqlite.load command file:
$ pgloader sqlite.load
... LOG Starting pgloader, log system is ready.
... LOG Parsing commands from file "/Users/dim/dev/pgloader/test/sqlite.load"
... WARNING Postgres warning: table "album" does not exist, skipping
... WARNING Postgres warning: table "artist" does not exist, skipping
... WARNING Postgres warning: table "customer" does not exist, skipping
... WARNING Postgres warning: table "employee" does not exist, skipping
... WARNING Postgres warning: table "genre" does not exist, skipping
... WARNING Postgres warning: table "invoice" does not exist, skipping
... WARNING Postgres warning: table "invoiceline" does not exist, skipping
... WARNING Postgres warning: table "mediatype" does not exist, skipping
... WARNING Postgres warning: table "playlist" does not exist, skipping
... WARNING Postgres warning: table "playlisttrack" does not exist, skipping
... WARNING Postgres warning: table "track" does not exist, skipping
table name read imported errors time
---------------------- --------- --------- --------- --------------
create, truncate 0 0 0 0.052s
Album 347 347 0 0.070s
Artist 275 275 0 0.014s
Customer 59 59 0 0.014s
Employee 8 8 0 0.012s
Genre 25 25 0 0.018s
Invoice 412 412 0 0.032s
InvoiceLine 2240 2240 0 0.077s
MediaType 5 5 0 0.012s
Playlist 18 18 0 0.008s
PlaylistTrack 8715 8715 0 0.071s
Track 3503 3503 0 0.105s
index build completion 0 0 0 0.000s
---------------------- --------- --------- --------- --------------
Create Indexes 20 20 0 0.279s
reset sequences 0 0 0 0.043s
---------------------- --------- --------- --------- --------------
Total streaming time 15607 15607 0 0.476s
We can see that pgloader did download the file from its HTTP URL location then unziped it before loading it.
Also, the WARNING messages we see here are expected as the PostgreSQL database is empty when running the command, and pgloader is using the SQL commands DROP TABLE IF EXISTS when the given command uses the include drop option.
Note that the output of the command has been edited to facilitate its browsing online.
Migrating from MySQL to PostgreSQL
If you want to migrate your data over to PostgreSQL from MySQL then pgloader is the tool of choice!
Most tools around are skipping the main problem with migrating from MySQL, which is to do with the type casting and data sanitizing that needs to be done. pgloader will not leave you alone on those topics.
In a Single Command Line
As an example, we will use the f1db database from <http://ergast.com/mrd/> which which provides a historical record of motor racing data for non-commercial purposes. You can either use their API or download the whole database at http://ergast.com/downloads/f1db.sql.gz. Once you’ve done that load the database in MySQL:
$ mysql -u root
> create database f1db;
> source f1db.sql
Now let’s migrate this database into PostgreSQL in a single command line:
$ createdb f1db
$ pgloader mysql://root@localhost/f1db pgsql:///f1db
Done! All with schema, table definitions, constraints, indexes, primary keys, auto_increment columns turned into bigserial , foreign keys, comments, and if you had some MySQL default values such as ON UPDATE CURRENT_TIMESTAMP they would have been translated to a PostgreSQL before update trigger automatically.
$ pgloader mysql://root@localhost/f1db pgsql:///f1db
2017-06-16T08:56:14.064000+02:00 LOG Main logs in '/private/tmp/pgloader/pgloader.log'
2017-06-16T08:56:14.068000+02:00 LOG Data errors in '/private/tmp/pgloader/'
2017-06-16T08:56:19.542000+02:00 LOG report summary reset
table name read imported errors total time
------------------------- --------- --------- --------- --------------
fetch meta data 33 33 0 0.365s
Create Schemas 0 0 0 0.007s
Create SQL Types 0 0 0 0.006s
Create tables 26 26 0 0.068s
Set Table OIDs 13 13 0 0.012s
------------------------- --------- --------- --------- --------------
f1db.constructorresults 11011 11011 0 0.205s
f1db.circuits 73 73 0 0.150s
f1db.constructors 208 208 0 0.059s
f1db.constructorstandings 11766 11766 0 0.365s
f1db.drivers 841 841 0 0.268s
f1db.laptimes 413578 413578 0 2.892s
f1db.driverstandings 31420 31420 0 0.583s
f1db.pitstops 5796 5796 0 2.154s
f1db.races 976 976 0 0.227s
f1db.qualifying 7257 7257 0 0.228s
f1db.seasons 68 68 0 0.527s
f1db.results 23514 23514 0 0.658s
f1db.status 133 133 0 0.130s
------------------------- --------- --------- --------- --------------
COPY Threads Completion 39 39 0 4.303s
Create Indexes 20 20 0 1.497s
Index Build Completion 20 20 0 0.214s
Reset Sequences 0 10 0 0.058s
Primary Keys 13 13 0 0.012s
Create Foreign Keys 0 0 0 0.000s
Create Triggers 0 0 0 0.001s
Install Comments 0 0 0 0.000s
------------------------- --------- --------- --------- --------------
Total import time 506641 506641 0 5.547s
You may need to have special cases to take care of tho, or views that you want to materialize while doing the migration. In advanced case you can use the pgloader command.
The Command
To load data with pgloader you need to define in a command the operations in some details. Here’s our example for loading the MySQL Sakila Sample Database.
Here’s our command:
load database
from mysql://root@localhost/sakila
into postgresql:///sakila
WITH include drop, create tables, no truncate,
create indexes, reset sequences, foreign keys
SET maintenance_work_mem to '128MB', work_mem to '12MB', search_path to 'sakila'
CAST type datetime to timestamptz
drop default drop not null using zero-dates-to-null,
type date drop not null drop default using zero-dates-to-null
MATERIALIZE VIEWS film_list, staff_list
-- INCLUDING ONLY TABLE NAMES MATCHING ~/film/, 'actor'
-- EXCLUDING TABLE NAMES MATCHING ~<ory>
BEFORE LOAD DO
$$ create schema if not exists sakila; $$;
Note that here pgloader will benefit from the meta-data information found in the MySQL database to create a PostgreSQL database capable of hosting the data as described, then load the data.
In particular, some specific casting rules are given here, to cope with date values such as 0000-00-00 that MySQL allows and PostgreSQL rejects for not existing in our calendar. It’s possible to add per-column casting rules too, which is useful is some of your tinyint are in fact smallint while some others are in fact boolean values.
Finaly note that we are using the MATERIALIZE VIEWS clause of pgloader: the selected views here will be migrated over to PostgreSQL with their contents.
It’s possible to use the MATERIALIZE VIEWS clause and give both the name and the SQL (in MySQL dialect) definition of view, then pgloader creates the view before loading the data, then drops it again at the end.
Loading the data
Let’s start the pgloader command with our sakila.load command file:
$ pgloader sakila.load
... LOG Starting pgloader, log system is ready.
... LOG Parsing commands from file "/Users/dim/dev/pgloader/test/sakila.load"
<WARNING: table "xxx" does not exists have been edited away>
table name read imported errors time
---------------------- --------- --------- --------- --------------
before load 1 1 0 0.007s
fetch meta data 45 45 0 0.402s
create, drop 0 36 0 0.208s
---------------------- --------- --------- --------- --------------
actor 200 200 0 0.071s
address 603 603 0 0.035s
category 16 16 0 0.018s
city 600 600 0 0.037s
country 109 109 0 0.023s
customer 599 599 0 0.073s
film 1000 1000 0 0.135s
film_actor 5462 5462 0 0.236s
film_category 1000 1000 0 0.070s
film_text 1000 1000 0 0.080s
inventory 4581 4581 0 0.136s
language 6 6 0 0.036s
payment 16049 16049 0 0.539s
rental 16044 16044 0 0.648s
staff 2 2 0 0.041s
store 2 2 0 0.036s
film_list 997 997 0 0.247s
staff_list 2 2 0 0.135s
Index Build Completion 0 0 0 0.000s
---------------------- --------- --------- --------- --------------
Create Indexes 41 41 0 0.964s
Reset Sequences 0 1 0 0.035s
Foreign Keys 22 22 0 0.254s
---------------------- --------- --------- --------- --------------
Total import time 48272 48272 0 3.502s
The WARNING messages we see here are expected as the PostgreSQL database is empty when running the command, and pgloader is using the SQL commands DROP TABLE IF EXISTS when the given command uses the include drop option.
Note that the output of the command has been edited to facilitate its browsing online.