Edit

Need help

Last Edited By Krjb Donovan

Last Updated: Mar 11, 2014 07:52 PM GMT

Question

QUESTION: Hi,

I need to implement an algorithm using c++ data structures.

I have got two files file1 and file2.

The way lines are stored in file1 as follows:

=================================

Letter/word patterns1 I SJ DL KP live VB KL BL in IJ DL ML London OJ KL BL ...... .. .. .. ...... .. .. .. etc.....

file2 is rule file which contains the input patterns and their corresponding output patterns

file2

input pattern output pattern DL KP MN OP KL BL QR QP SJ DL KP AB DF VB DL BL FG KE IJ DL ML KQ PT OJ KL BL AK BM DL ML BT -- -- -- --- --

My requirement is to parse the file1 for each word and identify the corresponding pattern in patterns1 column. Once the pattern is identified then search this pattern in the file2 (under input pattern column). Then match the pattern identified in "input pattern" column with the "output pattern" column. Then I have to generate file3 with new pattern selected from "output pattern" column of file2.

I need to implement an efficient (low memory,less iterations, more speed) algorithm using c++. Can any one from here help me with proper design, data structures and iterators

ANSWER: Hello mhd.

I can help you but I'm not sure about how much help your asking for or how much experience you have programming. I'll give you a general outline of what you need. From you description, you need to map the word in file1 to the output pattern in file2, then write the word and the output pattern to file3. For that to be done quickly, you need to read file2 into some kind of data structure that offers quick searching.

So we must determine if you have to implement the data structure, or if you can you use the standard C++ containers? If this is a school project that answer depends on what your instructor wants. If this is professional, you will want to use the STL map container. A map is an associative array that maps keys to values. Here the input pattern of file2 is the key, and the output pattern is the value. A STL map is implemented as a balanced binary tree so it is quite fast even if file2 is thousands of lines long.

Your description does not state how you can tell where the input pattern of file2 ends and the output patterns begins. Is it based on some position in the line ? I ask because it seems that the input pattern can be 2 or three sets of characters. You will need to answer that question before you can read in file2.

I generally write code for people that show me code first. I can give you a general outline of your program now. The pseudo code is something like this:

function readFile2

open file2 for each line

   split line into input pattern and output pattern
   insert output pattern into map with input pattern as a key

end for close file2

function main

call readFile2 open file1 for input open file3 for output for each line in file1

   split line into word and pattern1
   search map for pattern1 (use the map::find function, it will return an iterator)
   if iterator is not map.end() then
       write word and map value to file3
   endif

end for close file1 close file3

You will need to learn how to use ofsteam, ifstream, map, and string. The problem is not too difficult. Show me some effort in coding, and also give me more details on the file formats and I'll be happy to help you more.

---------- FOLLOW-UP ----------

QUESTION: Hi Zlatko,

I'm really grateful to you and thanks for your quick response. Your solution exactly matches my problem statement. This is my college project and usage of STL is highly recommended.

I discussed with my mentor today and showed the pseudo code sent by you. He has agreed with the solution proposed. He feels that searching algorithm in the file2 can be improved. He feels that proposed searching method works efficiently with small files (file2 with two thousand or three thousand lines) but not with the files having 500 or 600 thousand lines

Both files are in txt format and file1 is of 2000 lines and file2 is having 600 thousand lines.

Can you please suggest me an efficient search algorithm (in file2)?

I'm planning to start with coding once you help me efficient search algorithm.

Regards mhd

Answer

Hello mhd.

Your mentor may be correct that a map will be too slow for searching 600,000 entries. As you probably know, the map is usually implemented as a balanced binary tree, and the worst case performance for finding an item in 500,000 is 19 searches. For finding an item in 1,000,000, it is 20 searches. (I am using rounded figures here) Regardless of the implementation, I believe that the map must be implemented with O(log n) performance for searching.

You should be brainstorming now about alternatives, including maybe the benefits of not using STL and using a trie and perhaps thinking about how to organize your program to allow easy testing of alternatives.

Another choice is a hash_map. The hash_map is not actually part of the C++ standard, but I believe it is widely available. Certainly the gcc and Microsoft libraries have it. The hash map might give you better performance, but you will need to make a good hash function. For that reason, I think you should use the std::map, as a baseline against which to compare your hash_map performance. A good hash function will give you constant time performance, meaning search time will not increase as you add items to the hash. A really bad hash function will be like searching a linked list. I don't know how the hash_map consumes memory. That is something you would have to research.

Perhaps you could distribute file2 over many maps. For example, if the first letters of the file2 input pattern are uniformly distributed (more or less), you could have an array of 26 maps, and choose the map to search based on the first character. How would that cut down your search time? Actually, I just tried some numbers and I don't think it makes a big difference, but It's worth a try.

Your write-up will probably need to discuss design trade offs, and alternatives, and you will probably need to justify the final decision with some hard numbers.

What I can suggest to you now is that you should be ready to try many different algorithms. To do so easily, you should write an interface (pure virtual class) to the algorithm, so that the rest of your code will not change as you test different algorithms. The interface can be quite simple; a method to insert keys and values, and a method to return a value based on a key. Something like this:

class IDataStructure {

   virtual void insert(const string& key, const string& value) = 0;
   virtual bool find(const string& key, string& value) = 0;

};

Then all your different data structures go into classes that inherit this interface. At runtime, you can create the concrete data structure that you want, but the rest of the program just sees the interface. This is an example of the strategy pattern. See http://en.wikipedia.org/wiki/Strategy_pattern

Your first concrete data structure should use the map, because it will be easy and will allow you to develop the rest of your program without getting bogged down in the data structure details.

Something like this

class MapDataStructure : public IDataStructure { public: // insert and find methods go here and forward to the contained myMap private

  std::map<std::string, std::string> myMap;

};

Now, if you are writing on windows, you can use QueryPerformanceCounter to get CPU ticks at the start and end of your processing of file1. You can use the tick numbers to see how long the processing took for each data structure. You can compare different hashing functions against each other and the std::map. I don't know how to get cpu tick on linux, but if you are using linux, maybe your mentor knows how.

Finally, to be completely fair, your file1 should be completely in memory, so that disk access does not skew your results.

There, I gave you lots of words, but no answer. I should go into politics. Let me know how it goes.

GETTING STARTED
Invite Others
Create