Remove Duplicate Files in Linux via Bash Scripting

Large collections of automated or aggregated files often contain near-duplicate file names that differ only in the text after a delimiter such as an underscore.

Kode

November 13, 2023

When dealing with large numbers of files generated programmatically or collected from disparate sources, it’s common to end up with filename duplicates that differ only by appended strings after a delimiter such as an underscore. For example, you may have files like:

[email protected]_Arizona
[email protected]_Arizona_State
[email protected]_Washington

This presents challenges when processing the files further, as scripts may unintentionally overwrite files or skip ones that should be distinct. To tackle this, we can use some handy Linux bash scripting to deduplicate the files by their base name.

The core premise is to:

Iterate through each file
Extract the base name by removing text after the first underscore
Check if a file already exists with that base name
If so, delete the "duplicate" file with the longer name
If not, rename the current file to the base name

By the end, we condense the files down to:

[email protected]
[email protected]

To implement this in bash, we:

Set up a for loop to process each file in the current directory
Use a conditional to check if the filename contains an underscore
If so, use utilities like cut and parameter expansion to extract the base name (the text before the first underscore)
Append the .txt extension to form the cleaned filename
Check whether a file with the cleaned name already exists
Rename or delete the current file accordingly
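The extraction step can be done in more than one way. The sketch below compares the cut utility with plain parameter expansion; the function names and the sample file name are made up for illustration and are not part of the original script.

```shell
# Variant 1: the external cut utility. printf is used instead of echo
# so that names beginning with "-" are printed literally.
base_with_cut() {
  printf '%s\n' "$1" | cut -d'_' -f1
}

# Variant 2: parameter expansion. %%_* removes the longest suffix that
# starts with "_", so no subprocess is needed.
base_with_expansion() {
  printf '%s\n' "${1%%_*}"
}

# Hypothetical file name, for illustration only.
name="alice_Arizona_State"
base_with_cut "$name"         # prints: alice
base_with_expansion "$name"   # prints: alice
```

Both variants return the name unchanged when it contains no underscore. The expansion form avoids starting a pipeline for every file, which may matter in directories with many entries.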

The key bash capabilities that enable this workflow are:

File globbing (*) to loop through all files
The cut utility to split on a delimiter
Conditional logic with if/then statements
String concatenation and parameter expansion
Bash pattern matching inside [[ ... ]] to test for the delimiter
Filesystem commands like mv and rm
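Globbing and pattern matching have a couple of edge cases worth seeing in isolation. The following is a minimal sketch, assuming a hypothetical helper named list_underscore_files and sample names invented for the demo. It uses a case statement, which is the portable equivalent of the [[ $file == *"_"* ]] test.

```shell
# Print the names of regular files in a directory that contain "_".
list_underscore_files() {
  local dir="$1" path name
  for path in "$dir"/*; do
    # If nothing matches, the glob stays literal ("dir/*"), and "*"
    # also matches directories; the -f test filters out both cases.
    [ -f "$path" ] || continue
    name=${path##*/}          # strip the directory part
    case $name in
      *_*) printf '%s\n' "$name" ;;
    esac
  done
}

# Hypothetical sample data in a throwaway directory.
demo=$(mktemp -d)
touch "$demo/alice_Arizona" "$demo/notes"
mkdir "$demo/sub_dir"
list_underscore_files "$demo"   # prints: alice_Arizona
rm -r -- "$demo"
```

Note that * does not match hidden files (names starting with a dot) by default, so those are never visited by the loop.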

Here is what the full script looks like:

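#!/bin/bash

for file in *; do
  # Skip directories and anything else that is not a regular file
  [ -f "$file" ] || continue

  # Only process names that contain an underscore
  if [[ $file == *"_"* ]]; then
    # Keep the text before the first underscore, then append .txt
    new_name=$(printf '%s\n' "$file" | cut -d'_' -f1)
    new_name="$new_name.txt"

    if [ -f "$new_name" ]; then
      # A file with the cleaned name already exists: remove this one
      rm -- "$file"
    else
      mv -- "$file" "$new_name"
    fi
  fi
done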
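Because the script deletes files, it may be worth trying the logic on disposable data first. The sketch below wraps the same steps in a hypothetical function, dedupe_dir, that takes a directory argument, and runs it in a temporary directory. The file names are invented, and parameter expansion and case are swapped in for cut and [[ ... ]] to keep the example portable.

```shell
# Apply the rename-or-delete logic to every regular file in a directory.
dedupe_dir() {
  local dir="$1" path file new_name
  for path in "$dir"/*; do
    [ -f "$path" ] || continue
    file=${path##*/}
    case $file in
      *_*) ;;              # contains an underscore: process it
      *) continue ;;       # nothing to clean
    esac
    new_name="${file%%_*}.txt"
    if [ -f "$dir/$new_name" ]; then
      rm -- "$path"                      # cleaned name already taken
    else
      mv -- "$path" "$dir/$new_name"     # first file seen claims the name
    fi
  done
}

# Hypothetical sample files in a throwaway directory.
sandbox=$(mktemp -d)
touch "$sandbox/alice_Arizona" "$sandbox/alice_Arizona_State" "$sandbox/bob_Washington"
dedupe_dir "$sandbox"
ls "$sandbox"              # alice.txt and bob.txt remain
rm -r -- "$sandbox"
```

The file that survives is whichever sorts first in the glob expansion; the contents of the others are never compared before deletion, so this approach only suits cases where the names alone identify duplicates.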

In this way, by combining just a few basic bash scripting capabilities you can easily deduplicate file collections to better organize your filesystem. The same approach could be adapted to other filename delimiters or file types as well. Bash makes easy work of tasks like this that would otherwise require tedious manual effort.
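As a rough illustration of adapting the approach, the name-cleaning step can take the delimiter and extension as arguments. The helper clean_name and the sample names below are assumptions made for this sketch, not part of the original script.

```shell
# clean_name NAME DELIM EXT
# Print NAME cut at the first DELIM, followed by EXT (which may be empty).
clean_name() {
  local name="$1" delim="$2" ext="$3" base
  # Quoting "$delim" makes it match literally, even if it is "*" or "?".
  base=${name%%"$delim"*}
  printf '%s%s\n' "$base" "$ext"
}

clean_name "alice-Arizona-State" "-" ".csv"   # prints: alice.csv
clean_name "bob.Washington" "." ""            # prints: bob
```

Inside the loop, the two lines that build new_name could be replaced by a call such as new_name=$(clean_name "$file" "-" ".csv").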