Discussion:
program to remove duplicates
fir
2024-09-21 18:53:47 UTC
I'm thinking of writing a simple command-line program
that removes duplicates in a given folder.

The idea is that you copy the program into a given folder,
run it, and all duplicates and multiplicates (where a
duplicate means a file with a different name but exactly
the same binary size and byte content) get removed,
leaving only one file from each set of copies.

This should work for a big batch of files -
I need it because, for example, I once recovered an HDD,
and since I already had copies of some files on that disk,
the recovered files are largely multiplicated
and consume a lot of disk space.

So is there some approach I should take to make this
process faster?

Probably I would need to read the list of files and their sizes in
the current directory, then sort it or just go through it, and when two
entries have exactly the same size, read them into RAM and compare them
byte by byte.

I'm not sure whether to do the sorting, as I also need to write this
quickly, and maybe sorting complicates things a bit without gaining much.

some thoughts?
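One way to see whether sorting buys anything: if the list is sorted by size,
files that could possibly be duplicates end up adjacent, so the scan only
looks at neighbours instead of all pairs. A minimal sketch of that idea,
assuming an illustrative FileEntry record roughly like the list described
above (the names here are made up for the example, not from the actual program):

#include <stdlib.h>

typedef struct { char name[500]; unsigned int size; } FileEntry;

/* qsort comparator: ascending by file size */
static int cmp_by_size(const void* a, const void* b)
{
    const FileEntry* fa = (const FileEntry*)a;
    const FileEntry* fb = (const FileEntry*)b;
    if (fa->size < fb->size) return -1;
    if (fa->size > fb->size) return  1;
    return 0;
}

/* after sorting, equal-size files form contiguous runs;
   only files inside the same run need a byte-by-byte check */
void group_candidates(FileEntry* list, int count)
{
    qsort(list, count, sizeof(FileEntry), cmp_by_size);
    for (int i = 0; i + 1 < count; i++)
        if (list[i].size == list[i + 1].size)
        {
            /* list[i] and list[i+1] are duplicate candidates */
        }
}

Sorting is O(N log N) and the scan afterwards is linear, so for tens of
thousands of files it is cheap; the dominant cost stays in the byte-by-byte
comparisons either way.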
fir
2024-09-21 18:56:18 UTC
Post by fir
I'm thinking of writing a simple command-line program
that removes duplicates in a given folder.
[...]
Curiously, I could add that I once searched for a program to remove duplicates,
but the ones I found didn't look good.. so such a command-line program
(or in fact command-line-less, as I may not even want to add command-line
options) is quite practically needed.
fir
2024-09-21 19:27:08 UTC
Post by fir
Post by fir
I'm thinking of writing a simple command-line program
that removes duplicates in a given folder.
[...]
Assuming I have code to read the list of filenames in a given directory
(which I found), what would you suggest I add to remove such duplicates?
Here is the code that reads those filenames into a list
(tested to work, but not tested for being 100% error-free):

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

void StrCopyMaxNBytes(char* dest, char* src, int n)
{
    for(int i=0; i<n; i++) { dest[i]=src[i]; if(!src[i]) break; }
    dest[n-1] = 0;  /* make sure the copy is always terminated */
}

/* list of file names */
enum { FileNameListEntry_name_max = 500 };
typedef struct FileNameListEntry { char name[FileNameListEntry_name_max]; } FileNameListEntry;

FileNameListEntry* FileNameList = NULL;
int FileNameList_Size = 0;

void FileNameList_AddOne(char* name)
{
    FileNameList_Size++;
    FileNameList = (FileNameListEntry*) realloc(FileNameList,
                        FileNameList_Size * sizeof(FileNameListEntry));
    StrCopyMaxNBytes(FileNameList[FileNameList_Size-1].name,
                     name, FileNameListEntry_name_max);
}

/* collect list of filenames (plain files only, directories are skipped) */
WIN32_FIND_DATA ffd;

void ReadDIrectoryFileNamesToList(char* dir)
{
    HANDLE h = FindFirstFile(dir, &ffd);

    if(h == INVALID_HANDLE_VALUE) { printf("error reading directory"); exit(-1); }

    do {
        if (!(ffd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY))
            FileNameList_AddOne(ffd.cFileName);
    } while (FindNextFile(h, &ffd));

    FindClose(h);
}

int main()
{
    ReadDIrectoryFileNamesToList("*");

    for(int i=0; i< FileNameList_Size; i++)
        printf("\n %d %s", i, FileNameList[i].name);

    return 0;
}
fir
2024-09-21 20:12:04 UTC
Post by fir
Assuming I have code to read the list of filenames in a given directory
(which I found), what would you suggest I add to remove such duplicates?
[...]
OK, I sketched some code, only I don't know how to move a given file
(given by filename) into some subfolder... is there such a function in C?
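For the record, plain rename() from <stdio.h> does this on Windows as long as
the target stays on the same volume: a rename whose new path points into
another directory is effectively a move (the WinAPI MoveFile() is the
alternative). A minimal sketch, with the folder name "duplicates" just as an
example:

#include <stdio.h>
#include <direct.h>   /* _mkdir */

/* move "name" into the subfolder "duplicates"; returns 0 on success */
int MoveIntoSubfolder(const char* name)
{
    _mkdir("duplicates");   /* fails harmlessly if the folder already exists */
    char target[1000];
    snprintf(target, sizeof(target), "duplicates\\%s", name);
    return rename(name, target);   /* a cross-directory rename is a move */
}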
fir
2024-09-21 21:13:50 UTC
OK, I wrote this duplicates remover, but I don't know whether it is free of
errors etc.

Here's the code; you may comment if you see errors, alternatives or
improvements (note I wrote it in the time between my previous post and this
one, so it's a kind of speedy draft; I reused old routines for loading
files etc.):

#include<windows.h>
#include<stdio.h>

void StrCopyMaxNBytes(char* dest, char* src, int n)
{
for(int i=0; i<n; i++) { dest[i]=src[i]; if(!src[i]) break; }
}

//list of file names
const int FileNameListEntry_name_max = 500;
struct FileNameListEntry { char name[FileNameListEntry_name_max];
unsigned int file_size; };

FileNameListEntry* FileNameList = NULL;
int FileNameList_Size = 0;

void FileNameList_AddOne(char* name, unsigned int file_size)
{
FileNameList_Size++;
FileNameList = (FileNameListEntry*) realloc(FileNameList,
FileNameList_Size * sizeof(FileNameListEntry) );
StrCopyMaxNBytes((char*)&FileNameList[FileNameList_Size-1].name,
name, FileNameListEntry_name_max);
FileNameList[FileNameList_Size-1].file_size = file_size;
return ;
}


// collect list of filenames
WIN32_FIND_DATA ffd;

void ReadDIrectoryFileNamesToList(char* dir)
{
HANDLE h = FindFirstFile(dir, &ffd);

if(!h) { printf("error reading directory"); exit(-1);}

do {
if (!(ffd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY))
{
FileNameList_AddOne(ffd.cFileName, ffd.nFileSizeLow);
if(ffd.nFileSizeHigh!=0) { printf("this program only work for files up to 4GB"); exit(-1);}
}
}
while (FindNextFile(h, &ffd));

}

#include <sys/stat.h>

int GetFileSize2(char *filename)
{
struct stat st;
if (stat(filename, &st)==0) return (int) st.st_size;

printf("error obtaining file size for %s", filename); exit(-1);
return -1;
}

int FolderExist(char *name)
{
static struct stat st;
if(stat(name, &st) == 0 && S_ISDIR(st.st_mode)) return 1;
return 0;
}


//////////

unsigned char* bytes2 = NULL;
int bytes2_size = 0;
int bytes2_allocked = 0;

unsigned char* bytes2_resize(int size)
{
bytes2_size=size;
if((bytes2_size+100)*2<bytes2_allocked | bytes2_size>bytes2_allocked)
return bytes2=(unsigned char*)realloc(bytes2,
(bytes2_allocked=(bytes2_size+100)*2)*sizeof(unsigned char));
}

void bytes2_load(unsigned char* name)
{
int flen = GetFileSize2(name);
FILE *f = fopen(name, "rb");
if(!f) { printf( "errot: cannot open file %s for load ", name);
exit(-1); }
int loaded = fread(bytes2_resize(flen), 1, flen, f);
fclose(f);
}

/////////////////


unsigned char* bytes1 = NULL;
int bytes1_size = 0;
int bytes1_allocked = 0;

unsigned char* bytes1_resize(int size)
{
bytes1_size=size;
if((bytes1_size+100)*2<bytes1_allocked | bytes1_size>bytes1_allocked)
return bytes1=(unsigned char*)realloc(bytes1,
(bytes1_allocked=(bytes1_size+100)*2)*sizeof(unsigned char));
}

void bytes1_load(unsigned char* name)
{
int flen = GetFileSize2(name);
FILE *f = fopen(name, "rb");
if(!f) { printf( "errot: cannot open file %s for load ", name);
exit(-1); }
int loaded = fread(bytes1_resize(flen), 1, flen, f);
fclose(f);
}

/////////////



int CompareTwoFilesByContentsAndSayIfEqual(char* file_a, char* file_b)
{
bytes1_load(file_a);
bytes2_load(file_b);
if(bytes1_size!=bytes2_size) { printf("\n something is wrong, compared files assumed to be same size"); exit(-1); }

for(unsigned int i=0; i<=bytes1_size;i++)
if(bytes1[i]!=bytes2[i]) return 0;

return 1;

}

#include<direct.h>
#include <dirent.h>
#include <errno.h>

int duplicates_moved = 0;
void MoveDuplicateToSubdirectory(char*name)
{

if(!FolderExist("duplicates"))
{
int n = _mkdir("duplicates");
if(n) { printf ("\n i cannot create subfolder"); exit(-1); }
}

static char renamed[1000];
int n = snprintf(renamed, sizeof(renamed), "duplicates\\%s", name);

if(rename(name, renamed))
{printf("\n rename %s %s failed", name, renamed); exit(-1);}

duplicates_moved++;

}

int main()
{
printf("\n (RE)MOVE FILE DUPLICATES");
printf("\n ");

printf("\n this program searches for binaric (comparec byute to
byte)");
printf("\n duplicates/multiplicates of files in its own");
printf("\n folder (no search in subdirectories, just flat)");
printf("\n and if found it copies it into 'duplicates'");
printf("\n subfolder it creates If you want to remove that");
printf("\n duplicates you may delete the subfolder then,");
printf("\n if you decided to not remove just move the contents");
printf("\n of 'duplicates' subfolder back");
printf("\n ");
printf("\n note this program not work on files larger than 4GB ");
printf("\n and no warranty at all youre responsible for any dameges ");
printf("\n if use of this program would eventually do - i just
wrote ");
printf("\n the code and it work for me but not tested it to much
besides");
printf("\n ");
printf("\n september 2024");

printf("\n ");
printf("\n starting.. ");

ReadDIrectoryFileNamesToList("*");

// for(int i=0; i< FileNameList_Size; i++)
// printf("\n %d %s %d", i, FileNameList[i].name, FileNameList[i].file_size );


for(int i=0; i< FileNameList_Size; i++)
{
for(int j=i+1; j< FileNameList_Size; j++)
{
if(FileNameList[i].file_size!=FileNameList[j].file_size) continue;
if( CompareTwoFilesByContentsAndSayIfEqual(FileNameList[i].name,
FileNameList[j].name))
{
// printf("\nduplicate found (%s) ", FileNameList[j].name);
MoveDuplicateToSubdirectory(FileNameList[j].name);
}

}

}

printf(" \n\n %d duplicates moved \n\n\n", duplicates_moved);

return 'ok';
}
fir
2024-09-21 22:48:05 UTC
Okay, that previous code had some errors, but I made changes and this one
seems to work.

I ran it on about 50 GB of files from Recuva and it moved about 22 GB out as
duplicates... by an eyeball test it seems to work.

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

void StrCopyMaxNBytes(char* dest, char* src, int n)
{
    for(int i=0; i<n; i++) { dest[i]=src[i]; if(!src[i]) break; }
    dest[n-1] = 0;  /* make sure the copy is always terminated */
}

/* list of file names, with size and a duplicate flag per entry */
enum { FileNameListEntry_name_max = 500 };
typedef struct FileNameListEntry {
    char name[FileNameListEntry_name_max];
    unsigned int file_size;
    int is_duplicate;
} FileNameListEntry;

FileNameListEntry* FileNameList = NULL;
int FileNameList_Size = 0;

void FileNameList_AddOne(char* name, unsigned int file_size)
{
    FileNameList_Size++;
    FileNameList = (FileNameListEntry*) realloc(FileNameList,
                        FileNameList_Size * sizeof(FileNameListEntry));
    StrCopyMaxNBytes(FileNameList[FileNameList_Size-1].name,
                     name, FileNameListEntry_name_max);
    FileNameList[FileNameList_Size-1].file_size = file_size;
    FileNameList[FileNameList_Size-1].is_duplicate = 0;
}

/* collect list of filenames (plain files only, directories are skipped) */
WIN32_FIND_DATA ffd;

void ReadDIrectoryFileNamesToList(char* dir)
{
    HANDLE h = FindFirstFile(dir, &ffd);

    if(h == INVALID_HANDLE_VALUE) { printf("error reading directory"); exit(-1); }

    do {
        if (!(ffd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY))
        {
            if(ffd.nFileSizeHigh != 0)
            { printf("this program only works for files up to 4GB"); exit(-1); }
            FileNameList_AddOne(ffd.cFileName, ffd.nFileSizeLow);
        }
    } while (FindNextFile(h, &ffd));

    FindClose(h);
}

#include <sys/stat.h>

int GetFileSize2(char *filename)
{
    struct stat st;
    if (stat(filename, &st) == 0) return (int) st.st_size;

    printf("\n *** error obtaining file size for %s", filename); exit(-1);
    return -1;
}

int FolderExist(char *name)
{
    static struct stat st;
    if(stat(name, &st) == 0 && S_ISDIR(st.st_mode)) return 1;
    return 0;
}

////////// two growable byte buffers, one per file being compared

unsigned char* bytes2 = NULL;
int bytes2_size = 0;

unsigned char* bytes2_resize(int size)
{
    bytes2_size = size;
    return bytes2 = (unsigned char*) realloc(bytes2, bytes2_size);
}

void bytes2_load(char* name)
{
    int flen = GetFileSize2(name);
    FILE *f = fopen(name, "rb");
    if(!f) { printf("error: cannot open file %s for load ", name); exit(-1); }
    fread(bytes2_resize(flen), 1, flen, f);
    fclose(f);
}

/////////////////

unsigned char* bytes1 = NULL;
int bytes1_size = 0;

unsigned char* bytes1_resize(int size)
{
    bytes1_size = size;
    return bytes1 = (unsigned char*) realloc(bytes1, bytes1_size);
}

void bytes1_load(char* name)
{
    int flen = GetFileSize2(name);
    FILE *f = fopen(name, "rb");
    if(!f) { printf("error: cannot open file %s for load ", name); exit(-1); }
    fread(bytes1_resize(flen), 1, flen, f);
    fclose(f);
}

/////////////

/* load both files into ram and compare them byte by byte */
int CompareTwoFilesByContentsAndSayIfEqual(char* file_a, char* file_b)
{
    bytes1_load(file_a);
    bytes2_load(file_b);
    if(bytes1_size != bytes2_size)
    { printf("\n something is wrong, compared files assumed to be same size"); exit(-1); }

    for(int i=0; i<bytes1_size; i++)
        if(bytes1[i] != bytes2[i]) return 0;

    return 1;
}

#include <direct.h>

int duplicates_moved = 0;

void MoveDuplicateToSubdirectory(char* name)
{
    if(!FolderExist("duplicates"))
    {
        int n = _mkdir("duplicates");
        if(n) { printf("\n i cannot create subfolder"); exit(-1); }
    }

    static char renamed[1000];
    snprintf(renamed, sizeof(renamed), "duplicates\\%s", name);

    if(rename(name, renamed))
    { printf("\n rename %s %s failed", name, renamed); exit(-1); }

    duplicates_moved++;
}

int main()
{
    printf("\n (RE)MOVE FILE DUPLICATES");
    printf("\n ");

    printf("\n this program searches for binary (compared byte by byte)");
    printf("\n duplicates/multiplicates of files in its own");
    printf("\n folder (no search in subdirectories, just flat)");
    printf("\n and if found it moves them into a 'duplicates'");
    printf("\n subfolder it creates. If you want to remove those");
    printf("\n duplicates you may then delete the subfolder;");
    printf("\n if you decide not to remove them, just move the contents");
    printf("\n of the 'duplicates' subfolder back");
    printf("\n ");
    printf("\n note this program does not work on files larger than 4GB");
    printf("\n and there is no warranty at all - you are responsible for any damages");
    printf("\n that use of this program might eventually do - i just wrote");
    printf("\n the code and it works for me but i have not tested it much besides");
    printf("\n ");
    printf("\n september 2024");

    printf("\n ");
    printf("\n starting.. ");

    ReadDIrectoryFileNamesToList("*");

    printf("\n\n found %d files in current directory", FileNameList_Size);
    for(int i=0; i< FileNameList_Size; i++)
        printf("\n #%d %s %u", i, FileNameList[i].name, FileNameList[i].file_size);

    /* half-square scan: only files of equal size are compared byte by byte */
    for(int i=0; i< FileNameList_Size; i++)
    {
        if(FileNameList[i].is_duplicate) continue;

        for(int j=i+1; j< FileNameList_Size; j++)
        {
            if(FileNameList[j].is_duplicate) continue;
            if(FileNameList[i].file_size != FileNameList[j].file_size) continue;

            if(CompareTwoFilesByContentsAndSayIfEqual(FileNameList[i].name, FileNameList[j].name))
            {
                printf("\n#%d %s (%u) has duplicate #%d %s (%u) ",
                       i, FileNameList[i].name, FileNameList[i].file_size,
                       j, FileNameList[j].name, FileNameList[j].file_size);
                FileNameList[j].is_duplicate = 1;
            }
        }
    }

    printf("\n moving duplicates to subfolder...");

    for(int i=0; i< FileNameList_Size; i++)
        if(FileNameList[i].is_duplicate)
            MoveDuplicateToSubdirectory(FileNameList[i].name);

    printf(" \n\n %d duplicates moved \n\n\n", duplicates_moved);

    return 0;
}
Chris M. Thomasson
2024-09-21 21:54:58 UTC
Post by fir
i think if to write a simple comandline program
that remove duplicates in a given folder
[...]

Not sure if this will help you or not... ;^o

Fwiw, I have to sort and remove duplicates in this experimental locking
system that I called the multex. Here is the C++ code I used to do it. I
sort and then remove any duplicates, so say a thread's local lock set was:

31, 59, 69, 31, 4, 1, 1, 5

would become:

1, 4, 5, 31, 59, 69

this ensures no deadlocks. As for the algorithm for removing duplicates,
well, there is more than one. Actually, I don't know which one my C++
impl is using right now.

https://groups.google.com/g/comp.lang.c++/c/sV4WC_cBb9Q/m/Ti8LFyH4CgAJ

// Deadlock free baby!
void ensure_locking_order()
{
// sort and remove duplicates

std::sort(m_lock_idxs.begin(), m_lock_idxs.end());

m_lock_idxs.erase(std::unique(m_lock_idxs.begin(),
m_lock_idxs.end()), m_lock_idxs.end());
}

Using the std C++ template lib.
fir
2024-09-21 22:18:09 UTC
Post by Chris M. Thomasson
Fwiw, I have to sort and remove duplicates in this experimental locking
system that I called the multex. Here is the C++ code I used to do it.
[...]
I'm not sure what you're talking about, but I'm writing about finding file
duplicates (by binary contents, not by name).. it's a disk thing and I
don't think mutexes are needed - you just need to read all the files in the
folder and compare each one byte by byte with the other files in the folder
that have the same size.
Chris M. Thomasson
2024-09-21 23:46:09 UTC
Post by fir
Post by Chris M. Thomasson
[...]
I'm not sure what you're talking about, but I'm writing about finding file
duplicates (by binary contents, not by name).. it's a disk thing and I
don't think mutexes are needed - you just need to read all the files in the
folder and compare each one byte by byte with the other files in the folder
that have the same size.
It's just that there are many different ways to sort and remove
duplicates. And sometimes it is required...
Lawrence D'Oliveiro
2024-09-22 02:06:49 UTC
... you just need to read all files in
folder and compare it byte by byte to other files in folder of the same size
For N files, that requires N × (N - 1) ÷ 2 byte-by-byte comparisons.
That’s an O(N²) algorithm.

There is a faster way.
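One common faster scheme (a sketch of the general idea, not necessarily the
method Lawrence has in mind) is to read every file once, store a content hash
next to its size, and byte-compare only the files whose size and hash both
match. The helper below uses FNV-1a purely as a cheap illustrative hash, and
its name is made up for the example:

#include <stdio.h>
#include <stdint.h>

/* 64-bit FNV-1a over the whole file; a cheap stand-in, not a cryptographic hash */
uint64_t HashFileContents(const char* name)
{
    FILE* f = fopen(name, "rb");
    if (!f) return 0;

    uint64_t h = 0xcbf29ce484222325ULL;            /* FNV offset basis */
    unsigned char buf[65536];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0)
        for (size_t i = 0; i < n; i++)
            h = (h ^ buf[i]) * 0x100000001b3ULL;   /* FNV prime */

    fclose(f);
    return h;
}

Each file is then read only once for hashing; the full byte-by-byte comparison
is kept as a final confirmation for the (rare) entries that collide on both
size and hash, so the quadratic part of the work shrinks to the colliding
groups.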
fir
2024-09-22 02:36:03 UTC
Post by Lawrence D'Oliveiro
... you just need to read all files in
folder and compare it byte by byte to other files in folder of the same size
For N files, that requires N × (N - 1) ÷ 2 byte-by-byte comparisons.
That’s an O(N²) algorithm.
There is a faster way.
Not quite, since most files have different sizes, so most binary comparisons
are discarded because the file sizes differ (and those sizes I read
linearly while building the list of filenames).

What I posted seems to work OK; it doesn't work fast, but it's hard to say
whether it can be optimised or whether it simply takes as long as it has to..
hard to say.
Chris M. Thomasson
2024-09-22 04:18:34 UTC
Post by fir
Post by Lawrence D'Oliveiro
... you just need to read all files in
folder and compare it byte by byte to other files in folder of the same size
For N files, that requires N × (N - 1) ÷ 2 byte-by-byte comparisons.
That’s an O(N²) algorithm.
There is a faster way.
not quite as most files have different sizes so most binary comparsions
are discarded becouse size of files differ (and those sizes i read
linearly when bulding lidt of filenames)
what i posted seem to work ok, it odesnt work fast but hard to say if it
can be optimised or it takes as long as it should..hard to say
Are you suggesting that two files in the same directory can have the
exact same file name? Hidden files aside for a moment...
Lawrence D'Oliveiro
2024-09-22 07:09:45 UTC
Post by Lawrence D'Oliveiro
There is a faster way.
not quite ...
Yes there is. See how those other programs do it.
Paul
2024-09-22 07:29:08 UTC
Post by fir
Post by Lawrence D'Oliveiro
... you just need to read all files in
folder and compare it byte by byte to other files in folder of the same size
For N files, that requires N × (N - 1) ÷ 2 byte-by-byte comparisons.
That’s an O(N²) algorithm.
There is a faster way.
not quite as most files have different sizes so most binary comparsions
are discarded becouse size of files differ (and those sizes i read linearly when bulding lidt of filenames)
what i posted seem to work ok, it odesnt work fast but hard to say if it can be optimised or it takes as long as it should..hard to say
The normal way to do this is to do a hash check on the
files and compare the hashes. You can use MD5SUM, SHA1SUM or SHA256SUM
as a means to compare two files. If you want to be picky about
it, stick with SHA256SUM.
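If you want to compute such a hash from inside a C program on Windows rather
than shelling out to a tool, one option is the legacy CryptoAPI, which ships
with the system; a minimal sketch (error handling trimmed, SHA-256 chosen to
match the advice above):

#include <windows.h>
#include <wincrypt.h>   /* link with -ladvapi32 */
#include <stdio.h>

/* writes the 32-byte SHA-256 of a file into out[32]; returns 1 on success */
int Sha256OfFile(const char* name, BYTE out[32])
{
    FILE* f = fopen(name, "rb");
    if (!f) return 0;

    HCRYPTPROV prov = 0;
    HCRYPTHASH hash = 0;
    int ok = CryptAcquireContext(&prov, NULL, NULL, PROV_RSA_AES, CRYPT_VERIFYCONTEXT)
          && CryptCreateHash(prov, CALG_SHA_256, 0, 0, &hash);

    unsigned char buf[65536];
    size_t n;
    while (ok && (n = fread(buf, 1, sizeof buf, f)) > 0)
        ok = CryptHashData(hash, buf, (DWORD)n, 0);

    DWORD len = 32;
    ok = ok && CryptGetHashParam(hash, HP_HASHVAL, out, &len, 0);

    if (hash) CryptDestroyHash(hash);
    if (prov) CryptReleaseContext(prov, 0);
    fclose(f);
    return ok;
}

Two files can then be treated as probable duplicates when their digests match,
with an optional byte-by-byte check on top for the truly paranoid.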

hashdeep64 -c MD5 -j 1 -r H: > H_sums.txt # Took about two minutes to run this on an SSD
# Hard drive, use -j 1 . For an SSD, use a higher thread count for -j .

Size MD5SUM Path

Same size, same hash value. The size is zero. The MD5SUM in this case, is always the same (the initialization value of MD5).

0, d41d8cd98f00b204e9800998ecf8427e, H:\Users\Bullwinkle\AppData\Local\.IdentityService\AadConfigurations\AadConfiguration.lock
0, d41d8cd98f00b204e9800998ecf8427e, H:\Users\Bullwinkle\AppData\Local\.IdentityService\V2AccountStore.lock

Same size, different hash value. These are not the same file.

65536, a8113cfdf0227ddf1c25367ecccc894b, H:\Users\Bullwinkle\AppData\Local\AMD\DxCache\5213954f4433d4fbe45ed37ffc67d43fc43b54584bfd3a8d.bin
65536, 5e91acf90e90be408b6549e11865009d, H:\Users\Bullwinkle\AppData\Local\AMD\DxCache\bf7b3ea78a361dc533a9344051255c035491d960f2bc7f31.bin

You can use the "sort" command, to sort by the first and second fields if you want.
Sorting the output lines, places the identical files next to one another, in the output.

The output of data recovery software is full of "fragments". Using
the "file" command (a Linux command; a Windows port is available)
can allow ignoring files which have no value (listed as "Data").
Recognizable files will be listed as "PNG" or "JPG" and so on.

A utility such as Photorec, can attempt to glue together files. Your mileage may vary.
That is a scan based file recovery method. I have not used it.

https://en.wikipedia.org/wiki/PhotoRec

Paul
fir
2024-09-22 10:24:06 UTC
Post by Paul
The normal way to do this is to do a hash check on the
files and compare the hashes. You can use MD5SUM, SHA1SUM or SHA256SUM
as a means to compare two files. If you want to be picky about
it, stick with SHA256SUM.
[...]
I'm not doing recovery - this removes duplicates.

I mean, programs such as Recuva, when they recover files, recover tens of
thousands of files and gigabytes with lost names and some common types
(.mp3, .jpg, .txt and so on), and many of those files are binary duplicates.

The code I posted last just finds the files that are duplicates and moves
them to a 'duplicates' subdirectory, and it can turn out that half of those
files or more (heavy gigabytes) are pure duplicates, so you can then remove
the subfolder and recover the space.

The code I posted works OK, and anyone who has Windows and mingw/tdm can
compile it and check the application if they want.

Hashing is not necessary imo, though it probably could speed things up - I'm
not strongly convinced that the probability of a mistake in such hashing is
strictly zero (as I have never used it and would probably need to produce my
own hashing).. it's probably mathematically proven to be almost zero, but for
now at least it is more interesting to me whether the code I posted is OK.

You may look at its main procedure:

First it builds the list of files with their sizes
using the WinAPI function

HANDLE h = FindFirstFile(dir, &ffd);

(this is linear, say 12k calls for 12k files in the folder).

Then it runs a square loop (12k * 12k / 2 - 12k iterations)

and binary-compares those entries that have the same size:


int GetFileSize2(char *filename)
{
    struct stat st;
    if (stat(filename, &st) == 0) return (int) st.st_size;

    printf("\n *** error obtaining file size for %s", filename); exit(-1);
    return -1;
}

void bytes1_load(char* name)
{
    int flen = GetFileSize2(name);
    FILE *f = fopen(name, "rb");
    if(!f) { printf("error: cannot open file %s for load ", name); exit(-1); }
    fread(bytes1_resize(flen), 1, flen, f);
    fclose(f);
}

int CompareTwoFilesByContentsAndSayIfEqual(char* file_a, char* file_b)
{
    bytes1_load(file_a);
    bytes2_load(file_b);
    if(bytes1_size != bytes2_size)
    { printf("\n something is wrong, compared files assumed to be same size"); exit(-1); }

    for(int i=0; i<bytes1_size; i++)
        if(bytes1[i] != bytes2[i]) return 0;

    return 1;
}


This has two elements: the file loads into RAM and then the comparisons.

(Reading the file sizes here is redundant, as I already got that info from the
FindFirstFile(dir, &ffd) WinAPI function, but maybe to be sure I read it
again from this stat() function.)

And then finally there is a linear part that moves the entries on the list
marked as duplicates to the subfolder:

int FolderExist(char *name)
{
static struct stat st;
if(stat(name, &st) == 0 && S_ISDIR(st.st_mode)) return 1;
return 0;
}


int duplicates_moved = 0;
void MoveDuplicateToSubdirectory(char*name)
{

if(!FolderExist("duplicates"))
{
int n = _mkdir("duplicates");
if(n) { printf ("\n i cannot create subfolder"); exit(-1); }
}

static char renamed[1000];
int n = snprintf(renamed, sizeof(renamed), "duplicates\\%s", name);

if(rename(name, renamed))
{printf("\n rename %s %s failed", name, renamed); exit(-1);}

duplicates_moved++;

}


I'm not sure whether some of these functions are slow, and there is an
element of redundancy in calling if(!FolderExist("duplicates")) many times,
as if it were a normal "RAM based" and not a disk-related function - but it's
probably okay I guess (and this disk-related function I hope doesn't really
hit the disk but only reads some cached information about it).
Bart
2024-09-22 10:38:17 UTC
Post by fir
Post by Paul
The normal way to do this is to do a hash check on the
files and compare the hashes.
[...]
hashing is not necessary imo, though it probably could speed things up - I'm
not strongly convinced that the probability of a mistake in such hashing is
strictly zero
[...]
I was going to post similar ideas: do a linear pass working out a checksum
for each file, sort the list by checksum and size, and then the candidates
for a byte-by-byte comparison (if you want to do that) will be grouped
together.
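A compact sketch of that grouping step, reusing the idea of the illustrative
FileEntry record and HashFileContents() helper from the earlier sketches to
fill in a checksum field (all of these names are assumptions for the example,
not part of fir's program): sort on (size, checksum) and duplicates land next
to each other.

#include <stdlib.h>

typedef struct { char name[500]; unsigned int size; unsigned long long sum; } Entry;

/* qsort comparator: by size first, then by checksum */
static int cmp_size_then_sum(const void* a, const void* b)
{
    const Entry* x = (const Entry*)a;
    const Entry* y = (const Entry*)b;
    if (x->size != y->size) return x->size < y->size ? -1 : 1;
    if (x->sum  != y->sum)  return x->sum  < y->sum  ? -1 : 1;
    return 0;
}

void sort_candidates(Entry* list, int count)
{
    /* adjacent entries with equal size and sum are the byte-compare candidates */
    qsort(list, count, sizeof(Entry), cmp_size_then_sum);
}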

But if you're going to reject everyone's suggestions in favour of your
own already working solution, then I wonder why you bothered posting.

(I didn't post after all because I knew it would be futile.)
fir
2024-09-22 12:46:12 UTC
Post by Bart
[...]
But if you're going to reject everyone's suggestions in favour of your
own already working solution, then I wonder why you bothered posting.
(I didn't post after all because I knew it would be futile.)
I want to discuss it, not to do everything that gets mentioned.. is that hard
to understand? So I may read about the options, but I literally have no time
to implement even good ideas - this program I wrote turned out to work and
I'm now using it.
fir
2024-09-22 12:48:11 UTC
Post by fir
[...]
this program I wrote turned out to work and I'm now using it
Also note I posted a whole working program, while some others just say what
could be done... working code was my main goal, not so much entering a
contest over what is fastest (that is also an interesting topic, but not the
main goal).
fir
2024-09-22 14:06:39 UTC
Post by fir
[...]
working code was my main goal, not so much entering a contest over what is
fastest (that is also an interesting topic, but not the main goal)
An interesting thing is how it behaves in the system...
I'm used to writing CPU-intensive applications and to controlling frame
times and CPU usage, but I have generally never written disk-based apps.

This one is the first.. I use Sysinternals on Windows, and when I run this
program it has something like 3 stages:
1) read the directory info (if it's big, like 30k files, this may take some time)
2) the square part that reads file contents, compares them and sets the
duplicate flags on the list
3) the rename part - I mean I call the "rename" function on the duplicates

Most of the time is taken by the square part; the disk usage indicator is
full and CPU usage is 50%, which probably means one core is fully used.

The disk indicator in the tray shows (in the square phase) something like
R: 1.6 GB
O: 635 KB
W: 198 B

I don't know exactly what that is; R is for read for sure and W is for write,
but what is it exactly?

There is also the question whether closing or killing the program in those
phases could cause some disk damage? - as for most of the time the square
phase only does reads, I'm fairly sure that closing it in the read phase
should not cause any errors - but I'm not sure about the renaming phase.
fir
2024-09-22 14:22:00 UTC
Post by Bart
[...]
But if you're going to reject everyone's suggestions in favour of your
own already working solution, then I wonder why you bothered posting.
(I didn't post after all because I knew it would be futile.)
Yet to say something about the efficiency:

When I observe how it works - this program is square in the sense that it has
a half-square loop over the directory file list, so it may be something like
20k*20k/2 - 20k comparisons, but mostly it only compares sizes, so I'm not
sure how serious this kind of squareness is.. are 200M int comparisons a
problem? - maybe they become one for larger sets.

In terms of real binary comparisons it is not fully square but more like sets
of smaller squares on the diagonal of this large square, if you know what I
mean... and that may be a problem, because if among those 20k files 100 have
the same size, it makes about 100x100 full loads and 100x100 full byte-by-byte
binary compares, which is practically the full square if there really are 100
duplicates (maybe less than 100x100, as on the first finding of a duplicate I
mark it as a duplicate and skip it in the loop afterwards).

But indeed it shows in practice that for folders bigger than about 3k files it
slows down, probably disproportionately, so optimisation is needed for large
folders.

That's just from observing how it works on the disk.
fir
2024-09-22 14:26:49 UTC
Post by fir
[...]
But indeed it shows in practice that for folders bigger than about 3k files it
slows down, probably disproportionately, so optimisation is needed for large
folders.
But as I said, I mainly wanted this done to free some space from these
recovered, somewhat junk files.. and having it work, even in the partially
square way, is more important than having it optimised.

It works, and if I see it slow down on large folders I can divide those big
folders into a few of about 3k files each and run this duplicate mover in
each one.

More hand work, but it can be done by hand.
fir
2024-09-22 14:32:05 UTC
Post by fir
[...]
It works, and if I see it slow down on large folders I can divide those big
folders into a few of about 3k files each and run this duplicate mover in
each one.
However, having said that, the checksumming/hashing idea is kinda good of
course (sorting probably less so, maybe because it's a bit harder to write,
as I'm never sure whether my old hand-written quicksort is error-free - I once
tested something like 30 quicksort versions in my life trying to rewrite it,
and once I got a mistake into that code, and later I was never strictly sure
whether the version I finally ended up with is good - it's probably good, but
I'm not sure).

But I would need to be convinced that my own way of hashing has practically no
chance of generating the same hash for different files.. and I have never done
those things, so I haven't thought it through.. and right now it's a side
thing, possibly not worth studying.
fir
2024-09-22 14:51:24 UTC
fir wrote:

This program has yet one pleasant thing as it works:

#5 f1795800624.bmp (589878) has duplicate #216 f1840569816.bmp (589878)
#6 f1795801784.bmp (589878) has duplicate #217 f1840570976.bmp (589878)
#7 f1795802944.bmp (589878) has duplicate #218 f1840572136.bmp (589878)
#8 f1795804112.bmp (589878) has duplicate #219 f1840573296.bmp (589878)
#9 f1795805272.bmp (589878) has duplicate #220 f1840574456.bmp (589878)

and those numbers on the left go from #1 to, say, #3000 (the last one).
As it marks duplicates "forward" - I mean if #8 has duplicate #218, then
#218 is marked as a duplicate and excluded from both loops (the outside/row
one and the inside/element one) - the scanning speeds up the further it goes,
and it's not linear like when copying files.

It's nicely pleasant.
Chris M. Thomasson
2024-09-22 18:47:21 UTC
Post by Paul
The normal way to do this is to do a hash check on the
files and compare the hashes. You can use MD5SUM, SHA1SUM or SHA256SUM
as a means to compare two files. If you want to be picky about
it, stick with SHA256SUM.
[...]

That's fine.

file_0.bin
file_1.png
file_2.jpg

Say they all were identical wrt their actual bytes. The hash for them
would all be the same. As long as they did not hash the file name in
there for some reason... ;^)
DFS
2024-09-22 21:11:02 UTC
Post by Lawrence D'Oliveiro
... you just need to read all files in
folder and compare it byte by byte to other files in folder of the same size
For N files, that requires N × (N - 1) ÷ 2 byte-by-byte comparisons.
That’s an O(N²) algorithm.
for (i = 0; i < N; i++) {
    for (j = i+1; j < N; j++) {
        ... byte-byte compare file i to file j
    }
}


For N = 10, 45 byte-byte comparisons would be made (assuming all files
are the same size)
Post by Lawrence D'Oliveiro
There is a faster way.
Calc the checksum of each file once, then compare the checksums as above?

Which is still an O(N^2) algorithm, but I would assume it's faster than
45 byte-byte comparisons.
Lawrence D'Oliveiro
2024-09-22 01:28:09 UTC
Post by fir
i think if to write a simple comandline program
that remove duplicates in a given folder
<https://packages.debian.org/bookworm/duff>
<https://packages.debian.org/bookworm/dupeguru>
<https://packages.debian.org/trixie/backdown>
Josef Möllers
2024-10-01 14:34:47 UTC
Post by fir
i think if to write a simple comandline program
that remove duplicates in a given folder
[...]

I have had the same problem. My solution was to use extended file
attributes and some file checksum, e.g. sha512sum; also, I wrote this in
Perl (see code below). Using the file attributes, I can re-run the
program after a while without having to re-calculate the checksums.
So this solution only works for filesystems that have extended file
attributes, but you could also use some simple database (sqlite3?) to
map checksums to pathnames.

What I did was to walk through the directory tree and check if the file
being considered already has a checksum in an extended attribute. If
not, I calculate the checksum and store it in the extended
attribute. Also, I store the pathname in a hash (remember, this is
Perl), keyed by the checksum.
If there is a collision (checksum already in the hash), I remove the new
file (and link the new filename to the old file). One could be paranoid
and do a byte-by-byte file comparison then.

If I needed to do this in a C program, I'd probably use a GList to store
the hash, but otherwise the code logic would be the same.
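For the C variant he mentions, one concrete option (an illustrative sketch
only, assuming GLib is available; Josef talks about a GList, while a
GHashTable is used here because it gives the checksum-to-pathname lookup
directly) could look like this:

#include <glib.h>
#include <stdio.h>

/* maps checksum string -> first pathname seen with that checksum */
int main(void)
{
    GHashTable* seen = g_hash_table_new_full(g_str_hash, g_str_equal, g_free, g_free);

    /* in a real program these would come from walking the tree and hashing */
    const char* checksum = "d41d8cd98f00b204e9800998ecf8427e";
    const char* path     = "./some/file";

    const char* first = g_hash_table_lookup(seen, checksum);
    if (first)
        printf("%s is a duplicate of %s\n", path, first);   /* collision: handle duplicate */
    else
        g_hash_table_insert(seen, g_strdup(checksum), g_strdup(path));

    g_hash_table_destroy(seen);
    return 0;
}

(Built against GLib, e.g. with the flags from pkg-config --cflags --libs glib-2.0.)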

HTH,

Josef

#! /usr/bin/perl

use warnings;
use strict;
use File::ExtAttr ':all'; # In case of problems, maybe insert "use Scalar:Utils;" in /usr/lib/x86_64-linux-gnu/perl5/5.22/File/ExtAttr.pm
use Digest::SHA;
use File::Find;
use Getopt::Std;

# OPTIONS:
# s: force symlink
# n: don't do the actual removing/linking
# v: be more verbose
# h: print short help
my %opt = (
s => undef,
n => undef,
v => undef,
h => undef,
);
getopts('hnsv', \%opt);

if ($opt{h}) {
print STDERR "usage: lndup [-snvh] [dirname..]\n";
print STDERR "\t-s: use symlink rather than hard link\n";
print STDERR "\t-n: don't remove/link, just show what would be done\n";
print STDERR "\t-v: be more verbose (show pathname and SHA512 sum\n";
print STDERR "\t-h: show this text\n";
exit(0);
}

my %file;

if (@ARGV == 0) {
find({ wanted => \&lndup, no_chdir => 1 }, '.');
} else {
find({ wanted => \&lndup, no_chdir => 1 }, @ARGV);
}

# NAME: lndup
# PURPOSE: To handle a single file
# ARGUMENTS: None, pathname is taken from $File::Find::name
# RETURNS: Nothing
# NOTE: The SHA512 sum of a file is calculated.
# IF a file with the same sum was already found earlier, AND
# iF both files are NOT the same (hard link) AND
# iF both files reside on the same disk
# THEN the second occurrence is removed and
# replaced by a link to the first occurrence
sub lndup {
my $pathname = $File::Find::name;

return if ! -f $pathname;
if (-s $pathname) {
my $sha512sum = getfattr($pathname, 'SHA512');
if (!defined $sha512sum) {
my $ctx = Digest::SHA->new(512);
$ctx->addfile($pathname);
$sha512sum = $ctx->hexdigest;
print STDERR "$pathname $sha512sum\n" if $opt{v};
setfattr($pathname, "SHA512", $sha512sum);
} elsif ($opt{v}) {
print STDERR "Using sha512sum from attributes\n";
}

if (exists $file{$sha512sum}) {
if (!same_file($pathname, $file{$sha512sum})) {
my $links1 = (stat($pathname))[3];
my $links2 = (stat($file{$sha512sum}))[3];
# If one of them is a symbolic link, make sure it's $pathname
if (is_symlink($file{$sha512sum})) {
print STDERR "Swapping $pathname and $file{$sha512sum}\n" if $opt{v};
swap($file{$sha512sum}, $pathname);
}
# If $pathname has more links than $file{$sha512sum},
# exchange the two names.
# This ensures that $file{$sha512sum} has the most links.
elsif ($links1 > $links2) {
print STDERR "Swapping $pathname and
$file{$sha512sum}\n" if $opt{v};
swap($file{$sha512sum}, $pathname);
}

print "rm \"$pathname\"; ln \"$file{$sha512sum}\"
\"$pathname\"\n";
if (! $opt{n}) {
my $same_disk = same_disk($pathname,
$file{$sha512sum});
if (unlink($pathname)) {
if (! $same_disk || $opt{s}) {
symlink($file{$sha512sum}, $pathname) ||
print STDERR "Failed to symlink($file{$sha512sum}, $pathname): $!\n";
} else {
link($file{$sha512sum}, $pathname) || print
STDERR "Failed to link($file{$sha512sum}, $pathname): $!\n";
}
} else {
print STDERR "Failed to unlink $pathname: $!\n";
}
}
# print "Removing $pathname\n";
# unlink $pathname or warn "$0: Cannot remove $_: $!\n";

}
} else {
$file{$sha512sum} = $pathname;
}
}
}

# NAME: same_disk
# PURPOSE: To check if two files are on the same disk
# ARGUMENTS: pn1, pn2: pathnames of files
# RETURNS: true if files are on the same disk, else false
# NOTE: The check is made by comparing the device numbers of the
# filesystems of the two files.
sub same_disk {
my ($pn1, $pn2) = @_;

my @s1 = stat($pn1);
my @s2 = stat($pn2);

return $s1[0] == $s2[0];
}

# NAME: same_file
# PURPOSE: To check if two files are the same
# ARGUMENTS: pn1, pn2: pathnames of files
# RETURNS: true if files are the same, else false
# NOTE: files are the same if device number AND inode number
# are identical
sub same_file {
my ($pn1, $pn2) = @_;

my @s1 = stat($pn1);
my @s2 = stat($pn2);

return ($s1[0] == $s2[0]) && ($s1[1] == $s2[1]);
}

sub is_symlink {
my ($path) = @_;

return -l $path;
}

sub swap {
my $tmp;
$tmp = $_[0];
$_[0] = $_[1];
$_[1] = $tmp;
}
Kenny McCormack
2024-10-01 20:38:23 UTC
In article <***@mid.individual.net>,
Josef Möllers <***@invalid.invalid> wrote:
...
Post by Josef Möllers
I have had the same problem. My solution was to use extended file
attributes and some file checksum, eg sha512sum, also, I wrote this in
PERL (see code below). Using the file attributes, I can re-run the
And is thus entirely OT here. Keith will tell you the same.
--
"You can safely assume that you have created God in your own image when
it turns out that God hates all the same people you do." -- Anne Lamott