Remove Duplicate Files in Linux via Bash Scripting

When dealing with large collections of automated or aggregated files, you can end up with duplicate file names differing only in strings after delimiters like underscores.


When dealing with large numbers of files generated programmatically or collected from disparate sources, it’s common to end up with filename duplicates that differ only by appended strings after a delimiter such as an underscore. For example, you may have files like:

[email protected]_Arizona
[email protected]_Arizona_State
[email protected]_Washington

This presents challenges when processing the files further, as scripts may unintentionally overwrite files or skip ones that should be distinct. To tackle this, we can use a short bash script to deduplicate the files by their base name.

The core premise is to:

  1. Iterate through each file
  2. Extract the base name by removing the first underscore and everything after it
  3. Check if a file already exists with that base name
  4. If so, delete the "duplicate" file with the longer name
  5. If not, rename the current file to the base name
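Before deleting anything, it can help to preview what these five steps would do. The following is a minimal dry-run sketch, not part of the original script: the function name plan_dedupe and the sample file names in the comments are hypothetical, and it assumes bash 4 or newer for the associative array.

```shell
#!/bin/bash

# plan_dedupe DIR (hypothetical helper): print the action the five steps
# would take for each file in DIR, without renaming or deleting anything.
plan_dedupe() {
  local dir=${1:-.} path file base
  local -A claimed=()            # base names already taken during this dry run

  for path in "$dir"/*; do
    [ -f "$path" ] || continue   # step 1: regular files only
    file=${path##*/}
    [[ $file == *_* ]] || continue

    base=${file%%_*}             # step 2: text before the first underscore
    [ -n "$base" ] || continue   # skip names that start with an underscore

    # steps 3-5: delete if the base name is taken, otherwise rename
    if [ -e "$dir/$base" ] || [[ -n ${claimed[$base]:-} ]]; then
      printf 'delete %s\n' "$file"
    else
      printf 'rename %s -> %s\n' "$file" "$base"
      claimed[$base]=1
    fi
  done
}

# Example: with alice_Arizona, alice_Arizona_State and bob_Washington in DIR,
# plan_dedupe DIR prints:
#   rename alice_Arizona -> alice
#   delete alice_Arizona_State
#   rename bob_Washington -> bob
```

Because the glob expands in sorted order, the shortest name for each base tends to come first and is the one kept; the content of the deleted files is never compared.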

By the end, we condense the files down to one per base name (the script shown later also appends a .txt extension to each):

[email protected]
[email protected]

To implement this in bash, we:

  1. Set up a for loop to process each file in the current directory
  2. Use a conditional to check if the filename contains an underscore
  3. If so, use the cut utility (or bash parameter expansion) to extract the base name
  4. Append the .txt extension, since cutting at the underscore also drops any extension that followed it
  5. Check if a file with the cleaned name already exists
  6. Delete the current file if it does, or rename it to the cleaned name if it does not

The key bash capabilities that enable this workflow are:

  • File globbing (*) to loop through all files
  • The cut utility to split a name on a delimiter
  • Conditional logic with if/then statements
  • String concatenation and parameter expansion
  • Pattern matching inside [[ ... ]] to test whether a name contains an underscore
  • Filesystem commands like mv and rm
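As a small illustration of two of these capabilities, the sketch below extracts the base name once with cut and once with parameter expansion. The function names and the sample file name are hypothetical; both approaches give the same result for ordinary single-line file names.

```shell
#!/bin/bash

# Two ways to take the text before the first underscore.

base_with_cut() {
  # External utility: split on "_" and keep field 1
  printf '%s\n' "$1" | cut -d'_' -f1
}

base_with_expansion() {
  # Pure bash: strip the longest suffix matching "_*"
  printf '%s\n' "${1%%_*}"
}

# Example (hypothetical file name):
#   base_with_cut report_Arizona_State         prints "report"
#   base_with_expansion report_Arizona_State   prints "report"
```

Parameter expansion avoids starting an extra process per file, which matters mainly when a directory holds many thousands of entries.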

Here is what the full script looks like:

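#!/bin/bash

# Run from the directory that holds the files to deduplicate.
for file in *; do

  # Skip directories and anything else that is not a regular file
  [ -f "$file" ] || continue

  if [[ $file == *"_"* ]]; then

    # Keep the text before the first underscore, then add the extension.
    # printf is used instead of echo so names like "-n_x" are handled safely.
    new_name=$(printf '%s\n' "$file" | cut -d'_' -f1)
    new_name="$new_name.txt"

    if [ -f "$new_name" ]; then
      # A file with this base name already exists: remove the duplicate
      rm -- "$file"
    else
      mv -- "$file" "$new_name"
    fi

  fi

done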

In this way, by combining just a few basic bash scripting capabilities you can easily deduplicate file collections to better organize your filesystem. The same approach could be adapted to other filename delimiters or file types as well. Bash makes easy work of tasks like this that would otherwise require tedious manual effort.
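As one possible adaptation, the sketch below wraps the same logic in a function that takes the directory, delimiter, and extension as arguments. The name dedupe_by_prefix, its argument order, and its defaults are assumptions made for illustration; it uses parameter expansion rather than cut, and it deletes files, so it is worth trying on a copy of the data first.

```shell
#!/bin/bash

# dedupe_by_prefix DIR DELIM EXT (hypothetical helper)
# Collapse files named <base><DELIM><suffix> in DIR into one <base><EXT> each.
dedupe_by_prefix() {
  local dir=${1:-.} delim=${2:-_} ext=${3:-.txt}
  local path file base target

  for path in "$dir"/*; do
    [ -f "$path" ] || continue               # regular files only
    file=${path##*/}
    [[ $file == *"$delim"* ]] || continue    # no delimiter: leave untouched

    base=${file%%"$delim"*}                  # text before the first delimiter
    [ -n "$base" ] || continue               # name starts with the delimiter
    target="$dir/$base$ext"
    [ "$path" = "$target" ] && continue      # never delete the kept file itself

    if [ -e "$target" ]; then
      rm -- "$path"                          # base name already taken
    else
      mv -- "$path" "$target"
    fi
  done
}

# Example: dedupe_by_prefix ./exports - .csv
#   alice-2021-a.csv, alice-2022.csv, bob-2021.csv  ->  alice.csv, bob.csv
```

The quoted "$delim" inside the expansion makes the delimiter match literally, so characters such as * or ? are not treated as wildcards.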