Cloning A Website

The great thing about the internet is how open it is. If you come across a website that just opens your eyes in amazement, you can usually clone it very easily. There are, however, times when cloning a website has its difficulties. For example, if a site has files that don't have their permissions set to 777 or 755, getting at those files can be difficult because you need access to the server itself. And simply right clicking and copying the source code from the page may not work either, because there are usually other files you need to get a hold of as well. Either way, there is always going to be a way to clone a website, and most of the time it is extremely easy.

Save As From Your Web Browser

Back in the day, simply going to File > Save As in your web browser wasn't even considered worth the effort; it would pretty much just download the page you were viewing. These days browsers have gotten a little better and will also download the CSS and JS files. However, there are still times when the browser will make some mistakes. When this happens you usually need to open the files up in a text editor and start reviewing the code.

Usually mistakes of this kind will be found in the <head></head> tags.


<head>
<link rel="stylesheet" href="./scripts/style.css" type="text/css">
<script type="text/javascript" src="./scripts/audio.js"></script>
</head>

In the above snippet of code we see that our webpage uses a CSS file and a JS file. We now check whether those files were downloaded to our computer. Look in the directory our downloaded website is in for another directory called scripts. If the scripts directory is there, go into it and check for the two files listed above. If only one of them is there, we need to get the one that is missing; if neither is there, we need to get both. If the scripts directory itself is missing, we need to create it and add both files to it.




Getting The Files We Need

Finding the files we need isn't hard, because we know exactly where they are located. For example, the tag <link rel="stylesheet" href="./scripts/style.css" type="text/css"> tells us that if we copied the site http://www.example.com, then the style.css file will be found at http://www.example.com/scripts/style.css, and the same goes for the JS file. Once we get those files, we place them in the correct locations on our computer and test our site again. We should see a big difference.
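
As a quick sketch, assuming the site really is http://www.example.com and the missing files live under ./scripts/ as in the snippet above, you could pull them down from the command line with curl if you have it handy (wget, which we cover below, works just as well):


# create the scripts directory and fetch the missing files into it
mkdir -p scripts
curl -o scripts/style.css http://www.example.com/scripts/style.css
curl -o scripts/audio.js http://www.example.com/scripts/audio.js

Reload your local copy of the page afterwards and the styling and scripts should kick in.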

Other issues may occur inside the <body></body> tags. However, this is rare, and fixing them is as easy as following the same process as above.




Using WGet

Wget is another option, one that is famous in the Linux communities. However, there are ports of the program for other systems such as Windows, Mac, BSD and Solaris. In order to use wget you need to have it installed on your computer.

Installing WGet on Windows

Unfortunately for you Windows users, unless you go and build wget yourself, you will be forced to use an older version of wget. You can get compiled versions of wget at http://sourceforge.net/projects/gnuwin32/files/wget/. The installation is pretty straightforward. However, the installer doesn't set any environment variables for you, which may come in handy if you will be using wget often.
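
If you do want wget available everywhere on the command line, you can append its folder to your PATH from a Command Prompt. This is just a sketch and assumes the default GnuWin32 install location, so adjust the path to wherever you actually installed it.


setx PATH "%PATH%;C:\Program Files (x86)\GnuWin32\bin"

Open a new Command Prompt afterwards so the change is picked up, then run wget --version to make sure it is found.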




Installing WGet on Mac

For you Mac OS X users, you will first need to install the Mac OS X developer tools. This will give you the GNU C compiler, GCC. Once you have the developer tools installed you will need to go to the GNU website, where you can get the source files for wget. The exact location of the source files is http://www.gnu.org/software/wget/. Or you can fetch them using Mac OS X's built-in FTP utility with the following command.


ftp ftp://ftp.gnu.org/gnu/wget/wget-latest.tar.gz

Now that we have the source files we need to extract them. To do this we type the following command.


tar -xvzf wget-latest.tar.gz

You should now have a directory called wget-XX.XX, where XX.XX is the version of wget. At the time of writing wget is at version 1.14, so the directory would be called wget-1.14. Now we need to change into that directory.


cd wget-1.14

Now that we are in that directory we need to build wget. To do this we issue the following commands.


./configure
make
sudo make install
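
Once that finishes, wget should be installed (the default location is under /usr/local). A quick way to confirm everything worked is to ask wget for its version.


wget --version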

Installing WGet on Ubuntu and other Debian Based Linux

By default, Ubuntu should already have wget installed. However, if it isn't already installed you can simply install it from the Synaptic Package Manager or the Ubuntu Software Center. You can also install it by typing the following command.





sudo apt-get install wget -y

Installing WGet on Red Hat Based Linux

By default most Red Hat based Linux systems should already have wget. But if for some reason your system doesn’t already have wget you can easily install it using YUM.


sudo yum install wget -y

Installing WGet on BSD

Once again, BSD should already have wget. But if it doesn't, you can easily install it in one of two ways. The easiest way is to use pkg_add like so.


su -
pkg_add -r -v wget
rehash

For those of you using Portsnap and the BSD ports tree, you can install wget with the following commands.

su -
portsnap fetch update
cd /usr/ports/ftp/wget
make install clean
rehash

Installing WGet on Solaris

Solaris users will need to do a little extra work to install wget, but here is how you go about doing it. First, get wget from ftp://ftp.sunfreeware.com/pub/freeware/. So if you're using an Intel-based computer and you have Solaris 10, you will be going to ftp://ftp.sunfreeware.com/pub/freeware/intel/10/wget-1.12-sol10-x86-local.gz

Now we need to gunzip and install the sucker.


gunzip wget-1.12-sol10-x86-local.gz
pkgadd -d wget-1.12-sol10-x86-local
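
Sunfreeware packages normally install under /usr/local, so assuming this one did too, a quick sanity check looks like this.


/usr/local/bin/wget --version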

Using Wget

Using wget is extremely easy. You just need to type the following command.


wget http://www.example.com

The above command will download the index file of http://www.example.com, and we can do this for any file on the interwebs.
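
Wget can also act as a full-blown site cloner rather than a single-file downloader. As a rough sketch, again using http://www.example.com as a stand-in, something like the following tells wget to crawl the whole site, grab the CSS, JS and images each page needs, rewrite the links so the copy works offline, and stay out of parent directories.


wget --mirror --page-requisites --convert-links --no-parent http://www.example.com/

The short forms -m, -p, -k and -np do exactly the same thing.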

Using HTTrack

In my opinion, HTTrack is a monster. Its power and abilities are incredible. However, because of the way it works it can take quite some time. But the great part about HTTrack is that it will literally download entire websites whole. And not only that, it will rewrite the code so that references to external files such as CSS and JS files point to the correct locations, and it will download those files as well. In short, if you wanted to store the entire YouTube website on your computer you could easily do it with HTTrack. The only things that would get in your way are time and disk space.

To use HTTrack to download, say, Google you would type the following command.


httrack "http://www.google.com/" -O "/somelocation/www.google.com" "+*.google.com/*" -v

Pretty much what the above command is doing is telling HTTrack to download http://www.google.com to a location on your computer called /somelocation/www.google.com, and to also include all other files in subdomains such as http://drive.google.com and http://plus.google.com, as well as all pages of those domains and subdomains, such as http://www.google.com/adsense and http://drive.google.com/Your-Drive.

If we wanted just http://www.google.com and all of the content for that particular domain, we wouldn't use as many wildcards in our command. So our command would look like the following.





httrack "http://www.google.com/" -O "/somelocation/www.google.com" "http://www.google.com/*" -v

There really is so much you can do with HTTrack that I could write several blog posts about it, and because of that I had debated whether or not to include it in this post.