Pgloader Tutorial

Loading CSV Data with pgloader

CSV means comma separated values and is often found with quite varying specifications. pgloader allows you to describe those specs in its command.

The Command

To load data with pgloader you need to define in a command the operations in some details. Here’s our example for loading CSV data:

LOAD CSV
     FROM 'path/to/file.csv' (x, y, a, b, c, d)
     INTO postgresql:///pgloader?csv (a, b, d, c)

     WITH truncate,
          skip header = 1,
          fields optionally enclosed by '"',
          fields escaped by double-quote,
          fields terminated by ','

      SET client_encoding to 'latin1',
          work_mem to '12MB',
          standard_conforming_strings to 'on'

   BEFORE LOAD DO
    $$ drop table if exists csv; $$,
    $$ create table csv (
        a bigint,
        b bigint,
        c char(2),
        d text
       );
  $$;

The Data

This command allows loading the following CSV file content:

Header, with a © sign
"2.6.190.56","2.6.190.63","33996344","33996351","GB","United Kingdom"
"3.0.0.0","4.17.135.31","50331648","68257567","US","United States"
"4.17.135.32","4.17.135.63","68257568","68257599","CA","Canada"
"4.17.135.64","4.17.142.255","68257600","68259583","US","United States"
"4.17.143.0","4.17.143.15","68259584","68259599","CA","Canada"
"4.17.143.16","4.18.32.71","68259600","68296775","US","United States"

Loading the data

Here’s how to start loading the data. Note that the ouput here has been edited so as to facilitate its browsing online:

$ pgloader csv.load
... LOG Starting pgloader, log system is ready.
... LOG Parsing commands from file "/Users/dim/dev/pgloader/test/csv.load"

       table name       read   imported     errors            time
-----------------  ---------  ---------  ---------  --------------
      before load          2          2          0          0.039s
-----------------  ---------  ---------  ---------  --------------
              csv          6          6          0          0.019s
-----------------  ---------  ---------  ---------  --------------
Total import time          6          6          0          0.058s

The result

As you can see, the command described above is filtering the input and only importing some of the columns from the example data file. Here’s what gets loaded in the PostgreSQL database:

pgloader# table csv;
    a     |    b     | c  |       d
----------+----------+----+----------------
 33996344 | 33996351 | GB | United Kingdom
 50331648 | 68257567 | US | United States
 68257568 | 68257599 | CA | Canada
 68257600 | 68259583 | US | United States
 68259584 | 68259599 | CA | Canada
 68259600 | 68296775 | US | United States
(6 rows)

Loading Fixed Width Data File with pgloader

Some data providers still use a format where each column is specified with a starting index position and a given length. Usually the columns are blank-padded when the data is shorter than the full reserved range.

The Command

To load data with pgloader you need to define in a command the operations in some details. Here’s our example for loading Fixed Width Data, using a file provided by the US census.

You can find more files from them at the [Census 2000 Gazetteer Files](http://www.census.gov/geo/maps-data/data/gazetteer2000.html).

Here’s our command:

LOAD ARCHIVE
   FROM http://www2.census.gov/geo/docs/maps-data/data/gazetteer/places2k.zip
   INTO postgresql:///pgloader

   BEFORE LOAD DO
     $$ drop table if exists places; $$,
     $$ create table places
       (
          usps      char(2)  not null,
          fips      char(2)  not null,
          fips_code char(5),
          loc_name  varchar(64)
       );
     $$

   LOAD FIXED
        FROM FILENAME MATCHING ~/places2k.txt/
             WITH ENCODING latin1
             (
                usps           from   0 for  2,
                fips           from   2 for  2,
                fips_code      from   4 for  5,
                "LocationName" from   9 for 64 [trim right whitespace],
                p              from  73 for  9,
                h              from  82 for  9,
                land           from  91 for 14,
                water          from 105 for 14,
                ldm            from 119 for 14,
                wtm            from 131 for 14,
                lat            from 143 for 10,
                long           from 153 for 11
             )
        INTO postgresql:///pgloader?places
             (
            usps, fips, fips_code, "LocationName"
             );

The Data

This command allows loading the following file content, where we are only showing the first couple of lines:

AL0100124Abbeville city                                                       2987     1353      40301945        120383   15.560669    0.046480 31.566367 -85.251300
AL0100460Adamsville city                                                      4965     2042      50779330         14126   19.606010    0.005454 33.590411 -86.949166
AL0100484Addison town                                                          723      339       9101325             0    3.514041    0.000000 34.200042 -87.177851
AL0100676Akron town                                                            521      239       1436797             0    0.554750    0.000000 32.876425 -87.740978
AL0100820Alabaster city                                                      22619     8594      53023800        141711   20.472605    0.054715 33.231162 -86.823829
AL0100988Albertville city                                                    17247     7090      67212867        258738   25.951034    0.099899 34.265362 -86.211261
AL0101132Alexander City city                                                 15008     6855     100534344        433413   38.816529    0.167342 32.933157 -85.936008

Loading the data

Let’s start the pgloader command with our census-places.load command file:

$ pgloader census-places.load
... LOG Starting pgloader, log system is ready.
... LOG Parsing commands from file "/Users/dim/dev/pgloader/test/census-places.load"
... LOG Fetching 'http://www2.census.gov/geo/docs/maps-data/data/gazetteer/places2k.zip'
... LOG Extracting files from archive '//private/var/folders/w7/9n8v8pw54t1gngfff0lj16040000gn/T/pgloader//places2k.zip'

       table name       read   imported     errors            time
-----------------  ---------  ---------  ---------  --------------
         download          0          0          0          1.494s
          extract          0          0          0          1.013s
      before load          2          2          0          0.013s
-----------------  ---------  ---------  ---------  --------------
           places      25375      25375          0          0.499s
-----------------  ---------  ---------  ---------  --------------
Total import time      25375      25375          0          3.019s

We can see that pgloader did download the file from its HTTP URL location then unziped it before the loading itself.

Note that the output of the command has been edited to facilitate its browsing online.

Loading MaxMind Geolite Data with pgloader

MaxMind provides a free dataset for geolocation, which is quite popular. Using pgloader you can download the lastest version of it, extract the CSV files from the archive and load their content into your database directly.

The Command

To load data with pgloader you need to define in a command the operations in some details. Here’s our example for loading the Geolite data:

/*
 * Loading from a ZIP archive containing CSV files. The full test can be
 * done with using the archive found at
 * http://geolite.maxmind.com/download/geoip/database/GeoLiteCity_CSV/GeoLiteCity-latest.zip
 *
 * And a very light version of this data set is found at
 * http://pgsql.tapoueh.org/temp/foo.zip for quick testing.
 */

LOAD ARCHIVE
   FROM http://geolite.maxmind.com/download/geoip/database/GeoLiteCity_CSV/GeoLiteCity-latest.zip
   INTO postgresql:///ip4r

   BEFORE LOAD DO
     $$ create extension if not exists ip4r; $$,
     $$ create schema if not exists geolite; $$,
     $$ create table if not exists geolite.location
       (
          locid      integer primary key,
          country    text,
          region     text,
          city       text,
          postalcode text,
          location   point,
          metrocode  text,
          areacode   text
       );
     $$,
     $$ create table if not exists geolite.blocks
       (
          iprange    ip4r,
          locid      integer
       );
     $$,
     $$ drop index if exists geolite.blocks_ip4r_idx; $$,
     $$ truncate table geolite.blocks, geolite.location cascade; $$

   LOAD CSV
        FROM FILENAME MATCHING ~/GeoLiteCity-Location.csv/
             WITH ENCODING iso-8859-1
             (
                locId,
                country,
                region     null if blanks,
                city       null if blanks,
                postalCode null if blanks,
                latitude,
                longitude,
                metroCode  null if blanks,
                areaCode   null if blanks
             )
        INTO postgresql:///ip4r?geolite.location
             (
                locid,country,region,city,postalCode,
                location point using (format nil "(~a,~a)" longitude latitude),
                metroCode,areaCode
             )
        WITH skip header = 2,
             fields optionally enclosed by '"',
             fields escaped by double-quote,
             fields terminated by ','

  AND LOAD CSV
        FROM FILENAME MATCHING ~/GeoLiteCity-Blocks.csv/
             WITH ENCODING iso-8859-1
             (
                startIpNum, endIpNum, locId
             )
        INTO postgresql:///ip4r?geolite.blocks
             (
                iprange ip4r using (ip-range startIpNum endIpNum),
                locId
             )
        WITH skip header = 2,
             fields optionally enclosed by '"',
             fields escaped by double-quote,
             fields terminated by ','

   FINALLY DO
     $$ create index blocks_ip4r_idx on geolite.blocks using gist(iprange); $$;

Note that while the Geolite data is using a pair of integers (start, end) to represent ipv4 data, we use the very poweful ip4r PostgreSQL Extension instead.

The transformation from a pair of integers into an IP is done dynamically by the pgloader process.

Also, the location is given as a pair of float columns for the longitude and the latitude where PostgreSQL offers the point datatype, so the pgloader command here will actually transform the data on the fly to use the appropriate data type and its input representation.

Loading the data

Here’s how to start loading the data. Note that the ouput here has been edited so as to facilitate its browsing online:

$ pgloader archive.load
... LOG Starting pgloader, log system is ready.
... LOG Parsing commands from file "/Users/dim/dev/pgloader/test/archive.load"
... LOG Fetching 'http://geolite.maxmind.com/download/geoip/database/GeoLiteCity_CSV/GeoLiteCity-latest.zip'
... LOG Extracting files from archive '//private/var/folders/w7/9n8v8pw54t1gngfff0lj16040000gn/T/pgloader//GeoLiteCity-latest.zip'

       table name       read   imported     errors            time
-----------------  ---------  ---------  ---------  --------------
         download          0          0          0         11.592s
          extract          0          0          0          1.012s
      before load          6          6          0          0.019s
-----------------  ---------  ---------  ---------  --------------
 geolite.location     470387     470387          0          7.743s
   geolite.blocks    1903155    1903155          0         16.332s
-----------------  ---------  ---------  ---------  --------------
          finally          1          1          0         31.692s
-----------------  ---------  ---------  ---------  --------------
Total import time    2373542    2373542          0        1m8.390s

The timing of course includes the transformation of the 1.9 million pairs of integer into a single ipv4 range each. The finally step consists of creating the GiST specialized index as given in the main command:

CREATE INDEX blocks_ip4r_idx ON geolite.blocks USING gist(iprange);

That index will then be used to speed up queries wanting to find which recorded geolocation contains a specific IP address:

ip4r> select *
        from      geolite.location l
             join geolite.blocks b using(locid)
       where iprange >>= '8.8.8.8';

-[ RECORD 1 ]------------------
locid      | 223
country    | US
region     |
city       |
postalcode |
location   | (-97,38)
metrocode  |
areacode   |
iprange    | 8.8.8.8-8.8.37.255

Time: 0.747 ms

Loading dBase files with pgloader

The dBase format is still in use in some places as modern tools such as Filemaker and Excel offer some level of support for it. Speaking of support in modern tools, pgloader is right there on the list too!

The Command

To load data with pgloader you need to define in a command the operations in some details. Here’s our example for loading a dBase file, using a file provided by the french administration.

You can find more files from them at the Insee website.

Here’s our command:

LOAD DBF
    FROM http://www.insee.fr/fr/methodes/nomenclatures/cog/telechargement/2013/dbf/historiq2013.zip
    INTO postgresql:///pgloader
    WITH truncate, create table
     SET client_encoding TO 'latin1';

Note that here pgloader will benefit from the meta-data information found in the dBase file to create a PostgreSQL table capable of hosting the data as described, then load the data.

Loading the data

Let’s start the pgloader command with our dbf-zip.load command file:

$ pgloader dbf-zip.load
... LOG Starting pgloader, log system is ready.
... LOG Parsing commands from file "/Users/dim/dev/pgloader/test/dbf-zip.load"
... LOG Fetching 'http://www.insee.fr/fr/methodes/nomenclatures/cog/telechargement/2013/dbf/historiq2013.zip'
... LOG Extracting files from archive '//private/var/folders/w7/9n8v8pw54t1gngfff0lj16040000gn/T/pgloader//historiq2013.zip'

       table name       read   imported     errors            time
-----------------  ---------  ---------  ---------  --------------
         download          0          0          0          0.167s
          extract          0          0          0          1.010s
 create, truncate          0          0          0          0.071s
-----------------  ---------  ---------  ---------  --------------
     historiq2013       9181       9181          0          0.658s
-----------------  ---------  ---------  ---------  --------------
Total import time       9181       9181          0          1.906s

We can see that pgloader did download the file from its HTTP URL location then unziped it before the loading itself.

Note that the output of the command has been edited to facilitate its browsing online.

Loading SQLite files with pgloader

The SQLite database is a respected solution to manage your data with. Its embeded nature makes it a source of migrations when a projects now needs to handle more concurrency, which PostgreSQL is very good at. pgloader can help you there.

In a Single Command Line

You can

$ createdb chinook
$ pgloader https://github.com/lerocha/chinook-database/raw/master/ChinookDatabase/DataSources/Chinook_Sqlite_AutoIncrementPKs.sqlite pgsql:///chinook

Done! All with the schema, data, constraints, primary keys and foreign keys, etc. We also see an error with the Chinook schema that contains several primary key definitions against the same table, which is not accepted by PostgreSQL:

2017-06-20T16:18:59.019000+02:00 LOG Data errors in '/private/tmp/pgloader/'
2017-06-20T16:18:59.236000+02:00 LOG Fetching 'https://github.com/lerocha/chinook-database/raw/master/ChinookDatabase/DataSources/Chinook_Sqlite_AutoIncrementPKs.sqlite'
2017-06-20T16:19:00.664000+02:00 ERROR Database error 42P16: multiple primary keys for table "playlisttrack" are not allowed
QUERY: ALTER TABLE playlisttrack ADD PRIMARY KEY USING INDEX idx_66873_sqlite_autoindex_playlisttrack_1;
2017-06-20T16:19:00.665000+02:00 LOG report summary reset
             table name       read   imported     errors      total time
-----------------------  ---------  ---------  ---------  --------------
                  fetch          0          0          0          0.877s
        fetch meta data         33         33          0          0.033s
         Create Schemas          0          0          0          0.003s
       Create SQL Types          0          0          0          0.006s
          Create tables         22         22          0          0.043s
         Set Table OIDs         11         11          0          0.012s
-----------------------  ---------  ---------  ---------  --------------
                  album        347        347          0          0.023s
                 artist        275        275          0          0.023s
               customer         59         59          0          0.021s
               employee          8          8          0          0.018s
                invoice        412        412          0          0.031s
                  genre         25         25          0          0.021s
            invoiceline       2240       2240          0          0.034s
              mediatype          5          5          0          0.025s
          playlisttrack       8715       8715          0          0.040s
               playlist         18         18          0          0.016s
                  track       3503       3503          0          0.111s
-----------------------  ---------  ---------  ---------  --------------
COPY Threads Completion         33         33          0          0.313s
         Create Indexes         22         22          0          0.160s
 Index Build Completion         22         22          0          0.027s
        Reset Sequences          0          0          0          0.017s
           Primary Keys         12          0          1          0.013s
    Create Foreign Keys         11         11          0          0.040s
        Create Triggers          0          0          0          0.000s
       Install Comments          0          0          0          0.000s
-----------------------  ---------  ---------  ---------  --------------
      Total import time      15607      15607          0          1.669s

You may need to have special cases to take care of tho. In advanced case you can use the pgloader command.

The Command

To load data with pgloader you need to define in a command the operations in some details. Here’s our command:

load database
     from 'sqlite/Chinook_Sqlite_AutoIncrementPKs.sqlite'
     into postgresql:///pgloader

 with include drop, create tables, create indexes, reset sequences

  set work_mem to '16MB', maintenance_work_mem to '512 MB';

Note that here pgloader will benefit from the meta-data information found in the SQLite file to create a PostgreSQL database capable of hosting the data as described, then load the data.

Loading the data

Let’s start the pgloader command with our sqlite.load command file:

$ pgloader sqlite.load
... LOG Starting pgloader, log system is ready.
... LOG Parsing commands from file "/Users/dim/dev/pgloader/test/sqlite.load"
... WARNING Postgres warning: table "album" does not exist, skipping
... WARNING Postgres warning: table "artist" does not exist, skipping
... WARNING Postgres warning: table "customer" does not exist, skipping
... WARNING Postgres warning: table "employee" does not exist, skipping
... WARNING Postgres warning: table "genre" does not exist, skipping
... WARNING Postgres warning: table "invoice" does not exist, skipping
... WARNING Postgres warning: table "invoiceline" does not exist, skipping
... WARNING Postgres warning: table "mediatype" does not exist, skipping
... WARNING Postgres warning: table "playlist" does not exist, skipping
... WARNING Postgres warning: table "playlisttrack" does not exist, skipping
... WARNING Postgres warning: table "track" does not exist, skipping
            table name       read   imported     errors            time
----------------------  ---------  ---------  ---------  --------------
      create, truncate          0          0          0          0.052s
                 Album        347        347          0          0.070s
                Artist        275        275          0          0.014s
              Customer         59         59          0          0.014s
              Employee          8          8          0          0.012s
                 Genre         25         25          0          0.018s
               Invoice        412        412          0          0.032s
           InvoiceLine       2240       2240          0          0.077s
             MediaType          5          5          0          0.012s
              Playlist         18         18          0          0.008s
         PlaylistTrack       8715       8715          0          0.071s
                 Track       3503       3503          0          0.105s
index build completion          0          0          0          0.000s
----------------------  ---------  ---------  ---------  --------------
        Create Indexes         20         20          0          0.279s
       reset sequences          0          0          0          0.043s
----------------------  ---------  ---------  ---------  --------------
  Total streaming time      15607      15607          0          0.476s

We can see that pgloader did download the file from its HTTP URL location then unziped it before loading it.

Also, the WARNING messages we see here are expected as the PostgreSQL database is empty when running the command, and pgloader is using the SQL commands DROP TABLE IF EXISTS when the given command uses the include drop option.

Note that the output of the command has been edited to facilitate its browsing online.

Migrating from MySQL to PostgreSQL

If you want to migrate your data over to PostgreSQL from MySQL then pgloader is the tool of choice!

Most tools around are skipping the main problem with migrating from MySQL, which is to do with the type casting and data sanitizing that needs to be done. pgloader will not leave you alone on those topics.

In a Single Command Line

As an example, we will use the f1db database from <http://ergast.com/mrd/> which which provides a historical record of motor racing data for non-commercial purposes. You can either use their API or download the whole database at http://ergast.com/downloads/f1db.sql.gz. Once you’ve done that load the database in MySQL:

$ mysql -u root
> create database f1db;
> source f1db.sql

Now let’s migrate this database into PostgreSQL in a single command line:

$ createdb f1db
$ pgloader mysql://root@localhost/f1db pgsql:///f1db

Done! All with schema, table definitions, constraints, indexes, primary keys, auto_increment columns turned into bigserial , foreign keys, comments, and if you had some MySQL default values such as ON UPDATE CURRENT_TIMESTAMP they would have been translated to a PostgreSQL before update trigger automatically.

$ pgloader mysql://root@localhost/f1db pgsql:///f1db
2017-06-16T08:56:14.064000+02:00 LOG Main logs in '/private/tmp/pgloader/pgloader.log'
2017-06-16T08:56:14.068000+02:00 LOG Data errors in '/private/tmp/pgloader/'
2017-06-16T08:56:19.542000+02:00 LOG report summary reset
               table name       read   imported     errors      total time
-------------------------  ---------  ---------  ---------  --------------
          fetch meta data         33         33          0          0.365s
           Create Schemas          0          0          0          0.007s
         Create SQL Types          0          0          0          0.006s
            Create tables         26         26          0          0.068s
           Set Table OIDs         13         13          0          0.012s
-------------------------  ---------  ---------  ---------  --------------
  f1db.constructorresults      11011      11011          0          0.205s
            f1db.circuits         73         73          0          0.150s
        f1db.constructors        208        208          0          0.059s
f1db.constructorstandings      11766      11766          0          0.365s
             f1db.drivers        841        841          0          0.268s
            f1db.laptimes     413578     413578          0          2.892s
     f1db.driverstandings      31420      31420          0          0.583s
            f1db.pitstops       5796       5796          0          2.154s
               f1db.races        976        976          0          0.227s
          f1db.qualifying       7257       7257          0          0.228s
             f1db.seasons         68         68          0          0.527s
             f1db.results      23514      23514          0          0.658s
              f1db.status        133        133          0          0.130s
-------------------------  ---------  ---------  ---------  --------------
  COPY Threads Completion         39         39          0          4.303s
           Create Indexes         20         20          0          1.497s
   Index Build Completion         20         20          0          0.214s
          Reset Sequences          0         10          0          0.058s
             Primary Keys         13         13          0          0.012s
      Create Foreign Keys          0          0          0          0.000s
          Create Triggers          0          0          0          0.001s
         Install Comments          0          0          0          0.000s
-------------------------  ---------  ---------  ---------  --------------
        Total import time     506641     506641          0          5.547s

You may need to have special cases to take care of tho, or views that you want to materialize while doing the migration. In advanced case you can use the pgloader command.

The Command

To load data with pgloader you need to define in a command the operations in some details. Here’s our example for loading the MySQL Sakila Sample Database.

Here’s our command:

load database
     from      mysql://root@localhost/sakila
     into postgresql:///sakila

 WITH include drop, create tables, no truncate,
      create indexes, reset sequences, foreign keys

  SET maintenance_work_mem to '128MB', work_mem to '12MB', search_path to 'sakila'

 CAST type datetime to timestamptz
                drop default drop not null using zero-dates-to-null,
      type date drop not null drop default using zero-dates-to-null

 MATERIALIZE VIEWS film_list, staff_list

 -- INCLUDING ONLY TABLE NAMES MATCHING ~/film/, 'actor'
 -- EXCLUDING TABLE NAMES MATCHING ~<ory>

 BEFORE LOAD DO
 $$ create schema if not exists sakila; $$;

Note that here pgloader will benefit from the meta-data information found in the MySQL database to create a PostgreSQL database capable of hosting the data as described, then load the data.

In particular, some specific casting rules are given here, to cope with date values such as 0000-00-00 that MySQL allows and PostgreSQL rejects for not existing in our calendar. It’s possible to add per-column casting rules too, which is useful is some of your tinyint are in fact smallint while some others are in fact boolean values.

Finaly note that we are using the MATERIALIZE VIEWS clause of pgloader: the selected views here will be migrated over to PostgreSQL with their contents.

It’s possible to use the MATERIALIZE VIEWS clause and give both the name and the SQL (in MySQL dialect) definition of view, then pgloader creates the view before loading the data, then drops it again at the end.

Loading the data

Let’s start the pgloader command with our sakila.load command file:

$ pgloader sakila.load
... LOG Starting pgloader, log system is ready.
... LOG Parsing commands from file "/Users/dim/dev/pgloader/test/sakila.load"
   <WARNING: table "xxx" does not exists have been edited away>

            table name       read   imported     errors            time
----------------------  ---------  ---------  ---------  --------------
           before load          1          1          0          0.007s
       fetch meta data         45         45          0          0.402s
          create, drop          0         36          0          0.208s
----------------------  ---------  ---------  ---------  --------------
                 actor        200        200          0          0.071s
               address        603        603          0          0.035s
              category         16         16          0          0.018s
                  city        600        600          0          0.037s
               country        109        109          0          0.023s
              customer        599        599          0          0.073s
                  film       1000       1000          0          0.135s
            film_actor       5462       5462          0          0.236s
         film_category       1000       1000          0          0.070s
             film_text       1000       1000          0          0.080s
             inventory       4581       4581          0          0.136s
              language          6          6          0          0.036s
               payment      16049      16049          0          0.539s
                rental      16044      16044          0          0.648s
                 staff          2          2          0          0.041s
                 store          2          2          0          0.036s
             film_list        997        997          0          0.247s
            staff_list          2          2          0          0.135s
Index Build Completion          0          0          0          0.000s
----------------------  ---------  ---------  ---------  --------------
        Create Indexes         41         41          0          0.964s
       Reset Sequences          0          1          0          0.035s
          Foreign Keys         22         22          0          0.254s
----------------------  ---------  ---------  ---------  --------------
     Total import time      48272      48272          0          3.502s

The WARNING messages we see here are expected as the PostgreSQL database is empty when running the command, and pgloader is using the SQL commands DROP TABLE IF EXISTS when the given command uses the include drop option.

Note that the output of the command has been edited to facilitate its browsing online.