Downloading big dumps to your virtual machines with wget is slow! Use aria2 instead.
Today I had to download a 49GB Heroku Postgres backup to an EC2 server to import it into an RDS instance. The de facto way to download files is using curl or wget. The problem with these tools is that you download the file over a single connection, and connections are often throttled.
You can see that by looking at my first attempt:
curl -o mydump.dump "https://..."

  % Total    % Received % Xferd  Average Speed   Time     Time     Time    Current
                                 Dload  Upload   Total    Spent    Left    Speed
  0 49.9G    0  144M    0     0  12.7M      0   1:07:00  0:00:11  1:06:49  16.5M
16.5M?!? One hour and seven minutes!? This is silly; we are on a cloud-hosted machine that is probably in the same data center as the source.
Luckily I've experienced the olden days of the internet where we'd wait for hours on one mp3, and so I have some experience in ways to download things a bit quicker. One such method was splitting up downloads and downloading through multiple connections simultaneously. I used FlashGet and other tools like it for this.
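Under the hood, those tools split the file into byte ranges and fetch each range over its own connection. A minimal sketch of the idea, as a dry run that only prints the commands it would launch (the URL and file size are hypothetical placeholders; a real script would read the size from the Content-Length header, run the curl commands in the background with `&`, `wait` for them, then `cat` the parts back together):

```shell
#!/bin/sh
# Sketch: split a download into N byte ranges for parallel fetching.
# url and size are placeholders, not from the original article.
url="https://example.com/mydump.dump"
size=1000000        # pretend this came from the Content-Length header
parts=4
chunk=$(( size / parts ))

i=0
while [ "$i" -lt "$parts" ]; do
  start=$(( i * chunk ))
  if [ "$i" -eq $(( parts - 1 )) ]; then
    end=$(( size - 1 ))              # last part absorbs any remainder
  else
    end=$(( start + chunk - 1 ))
  fi
  # A real script would append '&' to run these concurrently,
  # then 'wait' and 'cat part0 part1 ... > mydump.dump'.
  echo "curl -r $start-$end -o part$i $url"
  i=$(( i + 1 ))
done
```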
On servers, I've used a tool called aget in the past, but I couldn't install it on my Ubuntu EC2 machine, so I looked around and found aria2. I installed it, looked at the instructions, and came up with this command:
aria2c -o mydump.dump --max-connection-per-server=10 -s 10 "https://..."
[#.... 36GiB/49GiB(74%) CN:10 DL:130MiB ETA:1m41s]
130MiB per second! That is more like it!
The command above lets me download ten chunks at the same time. For some reason, the limit with aria2 is 16 connections per server, so let's try that:
With 16 parts:
aria2c -o mydump.dump --max-connection-per-server=16 -s 16 "https://..."
[#... 3.0GiB/49GiB(6%) CN:16 DL:222MiB ETA:3m36s]
222MiB per second, which means about 4 minutes instead of 1 hour and 7 minutes!
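A quick sanity check on that estimate, using shell integer arithmetic and 1 GiB = 1024 MiB:

```shell
# 49.9 GiB expressed in MiB, divided by 222 MiB/s, gives the transfer
# time in seconds: roughly 230s, i.e. just under 4 minutes.
size_mib=$(( 499 * 1024 / 10 ))   # 49.9 GiB -> ~51097 MiB
secs=$(( size_mib / 222 ))
echo "$secs seconds"
```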
Another advantage of using aria2 is that you can continue where you left off if the connection breaks.
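For example, if the transfer dies partway through, re-running the same command with the `-c` (`--continue`) flag resumes from the partial file (shown here against the same elided URL as above):

```shell
# Resume an interrupted download: aria2 reads the .aria2 control file
# it wrote alongside mydump.dump and continues from where it stopped.
aria2c -c -o mydump.dump --max-connection-per-server=16 -s 16 "https://..."
```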