
Removing duplicates from large text files (Performance needed)



C# Programming
04-30-2009, 04:17 AM
Dear all,
I need a way to read a large (500 MB) text file containing one URL per line, remove the duplicates, and then write the data back to the text file. I know it sounds simple, but the algorithm needs to be as fast as possible. I don't want to design a special data structure for this purpose; I just want to use pure C# to do it.
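
For the overall approach, this is roughly what I have in mind, just a minimal sketch (it assumes HashSet<string> from .NET 3.5, and the file paths are placeholders), streaming the input and keeping only the URLs seen so far in memory:

using System;
using System.Collections.Generic;
using System.IO;

class Dedup
{
    static void Main()
    {
        // Placeholder paths - adjust to the real input/output files.
        string inputPath = "urls.txt";
        string outputPath = "urls_unique.txt";

        // HashSet<string> gives O(1) average lookups, so every line is
        // checked against the URLs seen so far without rescanning anything.
        HashSet<string> seen = new HashSet<string>(StringComparer.Ordinal);

        using (StreamReader reader = new StreamReader(inputPath))
        using (StreamWriter writer = new StreamWriter(outputPath))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Add returns false when the URL is already in the set,
                // so only the first occurrence gets written out.
                if (seen.Add(line))
                    writer.WriteLine(line);
            }
        }
    }
}

The set of unique URLs still has to fit in memory, which is one of the things I am not sure about with a 500 MB input.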

Here are some questions I have in mind that I never found a good answer to before:

1. When you use string s = File.ReadAllText(filename);, an 'out of memory' exception is thrown (because the amount of data is too big), so we should read the file line by line. Does reading line by line affect the speed? Isn't it better to read the whole file and then do the processing? (There is a rough timing sketch for this after the questions.)

2. What is the fastest data structure in C# for looking up data? I mean among Hashtable, List, a string[] array... (A small lookup comparison is sketched below as well.)

3. Because speed is what matters, should I think about writing the data to a database after reading it from the file? Is that faster than writing directly back to the file? (For example, inserting the data into a MySQL server with a UNIQUE index on my url field, so the insert itself takes care of the duplicates; the last sketch below shows this idea.)
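
For question 1, this is the kind of rough timing test I was thinking of (the path is a placeholder, and on the real 500 MB file ReadAllText may still throw OutOfMemoryException, which is part of what I want to see):

using System;
using System.Diagnostics;
using System.IO;

class ReadBenchmark
{
    static void Main()
    {
        string path = "urls.txt"; // placeholder path

        // Approach A: load the whole file at once. This needs roughly the
        // file size in RAM and can throw OutOfMemoryException on huge files.
        Stopwatch sw = Stopwatch.StartNew();
        try
        {
            string all = File.ReadAllText(path);
            Console.WriteLine("ReadAllText: {0} ms, {1} chars", sw.ElapsedMilliseconds, all.Length);
        }
        catch (OutOfMemoryException)
        {
            Console.WriteLine("ReadAllText failed: out of memory");
        }

        // Approach B: stream line by line. Memory use stays small because
        // only the current line is held at any time.
        sw = Stopwatch.StartNew();
        int lines = 0;
        using (StreamReader reader = new StreamReader(path))
        {
            while (reader.ReadLine() != null)
                lines++;
        }
        Console.WriteLine("StreamReader: {0} ms, {1} lines", sw.ElapsedMilliseconds, lines);
    }
}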
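
For question 2, my understanding is that Hashtable and HashSet<string> do hashed lookups (roughly constant time) while List<string> and string[] need a linear scan, so I would compare them with something like this (the sample URLs are made up):

using System;
using System.Collections.Generic;
using System.Diagnostics;

class LookupComparison
{
    static void Main()
    {
        // Made-up sample data: one million fake URLs.
        int n = 1000000;
        List<string> list = new List<string>(n);
        HashSet<string> set = new HashSet<string>();
        for (int i = 0; i < n; i++)
        {
            string url = "http://example.com/page/" + i;
            list.Add(url);
            set.Add(url);
        }

        string needle = "http://example.com/page/" + (n - 1);

        // List<string>.Contains scans every element: O(n) per lookup.
        Stopwatch sw = Stopwatch.StartNew();
        bool inList = list.Contains(needle);
        Console.WriteLine("List.Contains:    {0} ticks ({1})", sw.ElapsedTicks, inList);

        // HashSet<string>.Contains hashes once: O(1) on average per lookup.
        sw = Stopwatch.StartNew();
        bool inSet = set.Contains(needle);
        Console.WriteLine("HashSet.Contains: {0} ticks ({1})", sw.ElapsedTicks, inSet);
    }
}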
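
And for question 3, the database route I was imagining looks roughly like this, assuming MySQL Connector/NET (MySql.Data); the connection string, table name, and column size are placeholders, and INSERT IGNORE simply skips rows that hit the UNIQUE index:

using System;
using System.IO;
using MySql.Data.MySqlClient;

class DbDedup
{
    static void Main()
    {
        // Placeholder connection string and input file.
        string connStr = "server=localhost;database=test;uid=root;pwd=secret";
        string inputPath = "urls.txt";

        using (MySqlConnection conn = new MySqlConnection(connStr))
        {
            conn.Open();

            // A UNIQUE index on the url column makes MySQL reject duplicates;
            // VARCHAR(255) keeps the key inside the index length limit.
            MySqlCommand create = new MySqlCommand(
                "CREATE TABLE IF NOT EXISTS urls (url VARCHAR(255) NOT NULL, UNIQUE KEY ux_url (url))",
                conn);
            create.ExecuteNonQuery();

            // INSERT IGNORE silently drops rows that violate the UNIQUE constraint.
            MySqlCommand insert = new MySqlCommand(
                "INSERT IGNORE INTO urls (url) VALUES (@url)", conn);
            insert.Parameters.Add("@url", MySqlDbType.VarChar);

            using (StreamReader reader = new StreamReader(inputPath))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    insert.Parameters["@url"].Value = line;
                    insert.ExecuteNonQuery();
                }
            }
        }
    }
}

My worry is that one INSERT per line means one round trip per URL, so this could easily be slower than keeping a HashSet in memory unless the data simply does not fit in RAM.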

I know my question is a bit big, but if you have any ideas about any part of it, I will be glad to hear your opinions.
Thank you.