Mirror a Website with Wget

TL;DR: to mirror a site that requires authentication, do it in two steps: log in and save the cookies, then run the mirror with those cookies.

While testing my app, I needed a local mirror of the website it works with. Because the site authenticates through a login form, the mirroring process has two parts.

The first part is to log in and save the cookies:

wget \
  --keep-session-cookies \
  --save-cookies cookies.txt \
  --post-data "login=[LOGIN]&password=[PASSWORD]&module=admission&controller=login&action=logindo&auth_act=1" \
  https://europa.eu/epso/application/base/index.cfm

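Now the cookies are saved in the cookies.txt file. Note that you need to specify --keep-session-cookies: by default wget does not write session cookies (those without an expiry date) to the file, and login cookies are usually session cookies. Without the flag, the file will contain no usable cookies.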
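
A quick way to confirm that the login step actually stored something is to count the entries in cookies.txt. This is only a sketch: count_cookies is a hypothetical helper, and it assumes the Netscape-style layout that wget writes (one cookie per line, comment lines starting with #).

```shell
# Count cookie entries in a Netscape-format cookie file.
# Lines starting with '#' are comments; blank lines are ignored.
count_cookies() {
  awk '!/^#/ && NF > 0 { n++ } END { print n + 0 }' "$1"
}

# Only check when the file from the login step is present.
if [ -f cookies.txt ] && [ "$(count_cookies cookies.txt)" -eq 0 ]; then
  echo "cookies.txt has no cookies; the login probably failed" >&2
fi
```

A non-zero count does not prove the login succeeded, only that the server set at least one cookie.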

The second part is to actually perform the mirror:

wget --load-cookies cookies.txt \
  -r \
  -l 2 \
  -k \
  -nc \
  -R css,js,gif -R "*lang=*" -R "*srln=DE" -R "*srln=FR" \
  -I /epso/application/account,/epso/application/cv_new \
  -Deuropa.eu \
  https://europa.eu/epso/application/cv_new/index.cfm

I'll explain each flag:

  • -r recursive (d'oh!)
  • -l 2 limit the recursion depth to 2 levels
  • -k convert the links in the saved pages so they work for local viewing
  • -nc don't re-download files that already exist locally
  • -R... reject files whose names match the listed suffixes or patterns (e.g. don't download css files)
  • -I... follow only the listed directories
  • -D... follow links only to the listed domains
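
For readability, the same mirror step can be spelled with wget's long option names. The sketch below wraps it in a function; mirror_site and the WGET override are my own conventions, not part of wget. Setting WGET=echo prints the arguments instead of contacting the site.

```shell
# Mirror step with long option names.
# Usage: mirror_site START_URL [COOKIE_FILE]
mirror_site() {
  start_url=$1
  cookie_file=${2:-cookies.txt}
  # WGET is an assumed override hook; it defaults to the real wget.
  ${WGET:-wget} \
    --load-cookies "$cookie_file" \
    --recursive \
    --level=2 \
    --convert-links \
    --no-clobber \
    --reject 'css,js,gif' \
    --reject '*lang=*' --reject '*srln=DE' --reject '*srln=FR' \
    --include-directories '/epso/application/account,/epso/application/cv_new' \
    --domains 'europa.eu' \
    "$start_url"
}

# Dry run: print the arguments rather than downloading anything.
WGET=echo mirror_site https://europa.eu/epso/application/cv_new/index.cfm
```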

The result is a mirror that contains only the items relevant to me.
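
To check that the reject rules did what was intended, it can help to count the downloaded files by extension. By default wget -r stores the mirror in a directory named after the host (europa.eu here). A rough sketch, where summarize_mirror is a hypothetical helper:

```shell
# Print "<count> <extension>" for every file below the given directory.
summarize_mirror() {
  find "$1" -type f |
    awk -F/ '{
      n = split($NF, parts, "[.]")            # split the file name on dots
      ext = (n > 1) ? parts[n] : "(none)"     # last part is the extension
      count[ext]++
    }
    END { for (e in count) print count[e], e }' |
    LC_ALL=C sort -k2
}

# Only summarize when a mirror directory is present.
if [ -d europa.eu ]; then
  summarize_mirror europa.eu
fi
```

Any css, js or gif entries in the output would mean a reject rule did not match.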

Note: Please read the wget manual for limitations on form access.
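
One detail worth checking for form logins: as far as I know, wget sends the --post-data string exactly as written, so a login or password containing characters such as & or = has to be percent-encoded first. A minimal sketch follows. urlencode is a hypothetical helper (not a wget feature), it assumes ASCII-only values, and the credentials are made up.

```shell
# Percent-encode everything except RFC 3986 unreserved characters.
# The body runs in a subshell so its variables do not leak.
urlencode() (
  s=$1
  out=
  while [ -n "$s" ]; do
    rest=${s#?}          # string without its first character
    c=${s%"$rest"}       # the first character itself
    case $c in
      [a-zA-Z0-9.~_-]) out=$out$c ;;
      *) out=$out$(printf '%%%02X' "'$c") ;;
    esac
    s=$rest
  done
  printf '%s\n' "$out"
)

# Made-up credentials, using the field names from the login command.
post_data="login=$(urlencode 'my.user')&password=$(urlencode 'p&ss w=rd')"
echo "$post_data"
# prints: login=my.user&password=p%26ss%20w%3Drd
```

The resulting string can be passed to --post-data in place of the hand-written one.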

HTH,