Quick Command Line Tip - Recursively Delete Files of a Certain Type

Applications can create a lot of temporary files sometimes, and these files aren't always cleaned up automatically.

Python is a good example. Particularly if you're a Python developer, your source code directories fill up with a .pyc file for each script, which is the cached compiled copy of that script.

To clean up files with a certain extension (especially before a source commit or an upload somewhere, to extend that example), you can use this command line snippet:

$ find . -name "*.ext" -exec rm '{}' ';'

Obviously, replace *.ext with the pattern that you want to delete.

I shouldn't need to say this, but use this with caution. Make sure you're not accidentally going to delete something useful that matches the pattern you enter, and always keep backups yada yada. Tread carefully when batch deleting.
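
One way to act on that caution is to list the matches before deleting anything. Here's a minimal sketch that runs inside a throwaway directory; the demo_dir variable and the file names are made up for illustration, and -type f is an extra test I've added so that only regular files match.

```shell
# Hypothetical scratch directory so the demo never touches real files.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/pkg"
touch "$demo_dir/app.py" "$demo_dir/app.pyc" \
      "$demo_dir/pkg/mod.py" "$demo_dir/pkg/mod.pyc"

# Step 1: print what would be deleted, without deleting anything.
find "$demo_dir" -type f -name "*.pyc" -print

# Step 2: once the list looks right, run the same expression with rm.
# The '+' terminator hands many file names to each rm call instead of one.
find "$demo_dir" -type f -name "*.pyc" -exec rm -- '{}' +

# Only the .py sources should be left now.
find "$demo_dir" -type f
```

Swap "$demo_dir" for . and "*.pyc" for your own pattern once you're happy with what step 1 prints.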

Quick Command Line Tip - Whois from the Command Line

A very quick command line tip today, for users of pretty much any Unix-based operating system, including Linux distributions.

When you're looking up information on a certain web site or domain name, you might be used to the whois lookups on websites such as DNS Tools to see who owns a domain.

However, in most cases there is a much quicker way to get the same information, which is through your command line.

As you might guess, it's simply:

$ whois domain.com

If you also want to hide the legal information that gets returned on a whois request, for brevity, you can easily do so with:

$ whois -H domain.com

This won't always strip everything and leave you with purely the results, but it usually cuts down the amount of output.
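
If you want to trim the output further, you can pipe it through grep and keep only the fields you care about. This is just a sketch: whois_fields is a hypothetical helper name, the field names differ between registries, and the sample text is made up so the example runs without a network connection.

```shell
# Hypothetical helper: keep a few common "Key: value" lines from
# whois-style text on stdin. Adjust the field names for your registry.
whois_fields() {
    grep -iE '^[[:space:]]*(domain name|registrar|creation date|name server):'
}

# Real use (needs network access):
#   whois -H domain.com | whois_fields

# Offline demonstration with made-up sample text, not real registry output.
sample='Domain Name: EXAMPLE.TEST
Registrar: Hypothetical Registrar Ltd
Creation Date: 2000-01-01T00:00:00Z
NOTICE: This sample record is purely illustrative.
TERMS OF USE: You agree that this is only a sample.
Name Server: NS1.EXAMPLE.TEST'

filtered=$(printf '%s\n' "$sample" | whois_fields)
printf '%s\n' "$filtered"
```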

More advanced wget usage

I recently covered how to make a mirror of a website with GNU's wget command line program and in the comments of that post there were several suggestions for more advanced options which allow you to control your downloading further.

So I've decided to follow up on that post and look at some of the more advanced options that wget offers the user.

No parent option

If you are doing a mirror, but you only want to mirror a subdirectory of the main site (for example, just /news/), you might run into a problem. Because many of the pages at /news/ link back to /, you'll inadvertently end up downloading the whole site.

The solution to this, pointed out by Todd in the comments, is to use the no parent option, -np.

In our example, we'd do:

$ wget -mk -w 20 -np http://example.com/news/
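
To get a feel for what -np is deciding, here's a rough illustration of the rule in plain shell. It is not wget's actual code: stays_below is a hypothetical helper and the simple prefix match is my simplification, but it shows why links back to / get skipped.

```shell
# Hypothetical helper, not part of wget: succeeds when a link stays at or
# below the starting directory, which is roughly the test -np applies.
stays_below() {
    start_dir=$1
    link=$2
    case "$link" in
        "$start_dir"*) return 0 ;;
        *) return 1 ;;
    esac
}

start='http://example.com/news/'
for link in \
    'http://example.com/news/2008/story.html' \
    'http://example.com/' \
    'http://example.com/about.html'
do
    if stays_below "$start" "$link"; then
        echo "follow: $link"
    else
        echo "skip:   $link"
    fi
done
```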

Update only changed files

Continuing in our mirroring scenario, another extremely useful option for preserving bandwidth on both sides is to update only the files that the server reports as changed.

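This option is -N (timestamping). wget compares the timestamp and size of your local copy with what the server reports, and skips files that haven't changed. Note that -m already turns on -N, so adding it to a mirror command is harmless but redundant; it earns its keep with plain recursive or single-file downloads.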

$ wget -mk -w 20 -N http://example.com/

Thanks to Paul William Tenny in the comments for that tip.

Random delay on mirror

And finally for our mirror-specific tips, you can also randomise the delay between downloads. There are several reasons you might want to do this, including sites that don't take kindly to being mirrored, even considerately, and block clients that they suspect of doing it (some bots can be pretty nasty, and you might be categorised as one of 'them').

Randomising the wait time - and combining with the user agent option below - can be steps to circumvent this automatic protection.

If you do find yourself using this feature for that reason, please continue to be considerate and follow any rules regarding the content you've been given. Mirror responsibly.

$ wget -w 20 --random-wait -mk http://example.com/

The wait value - 20 in this case - is used as a base value to calculate the random wait times. According to the manual for recent wget versions, they vary between 0.5 and 1.5 times that value (in this case, 10-30 seconds); older releases documented a range of 0 to 2 times the value.

Custom user agent

Some sites might have some strange restrictions on which browsers can access them, or perhaps serve different versions of the site depending on the browser used. I can't say I agree with sites that do this, unless there's a really good reason, but it shouldn't stop you from using wget for access.

Using wget, you can set a fake user agent string so that the program reports itself as a different browser.

$ wget -U "user agent" http://example.com/

Combine the -U option with any others you want, obviously. Here are a few user agents you can use to get you started:


IE6 on Windows XP: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
Firefox on Windows XP: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.14) Gecko/20080404 Firefox/2.0.0.14
Firefox on Ubuntu Gutsy: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.14) Gecko/20080418 Ubuntu/7.10 (gutsy) Firefox/2.0.0.14
Safari on Mac OS X Leopard: Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en) AppleWebKit/523.12.2 (KHTML, like Gecko) Version/3.0.4 Safari/523.12.2
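
Those strings are long and full of spaces, so it can help to keep one in a variable and quote it when you pass it to -U. Below is a small sketch of that idea combined with the random wait option. The run wrapper, the DRY_RUN switch and the variable names are all hypothetical; by default the script only prints the command rather than downloading anything.

```shell
# Set to 0 to really run the command; 1 just prints it.
DRY_RUN=1

# Hypothetical dry-run wrapper: print the command, or execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Firefox on Ubuntu Gutsy, taken from the list above.
ua_firefox='Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.14) Gecko/20080418 Ubuntu/7.10 (gutsy) Firefox/2.0.0.14'
mirror_url='http://example.com/'

# Quoting "$ua_firefox" keeps the whole string as a single argument to -U.
cmd_preview=$(run wget -mk -w 20 --random-wait -U "$ua_firefox" "$mirror_url")
printf '%s\n' "$cmd_preview"
```

The printed preview drops the quotes around the user agent, so treat it as a sanity check rather than something to copy and paste.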

That's it for now. If you have any more useful wget tips and tricks, share them in the comments!
