C++ files

Last Edited By Krjb Donovan

Last Updated: Mar 11, 2014 07:51 PM GMT

Question

I have two files, data1.txt and data2.txt, with the column names country, language, region, zip, latitude and longitude. How can I compare only specific columns (country, language, zip and longitude, ignoring the remaining columns) of these two files and print the lines that differ? I should compare the first line of data1.txt with the first line of data2.txt, the second line of data1.txt with the second line of data2.txt, and so on. Help me to write the comparison code in C++.

data1.txt:

   country;language;region;zip;latitude;longitude
   "ca";"en";"alberta";"tom 0a0";"51.49";"23.45"
   "ca";"en";"alberta";"tom 0a1";"51.56";"23.67"

data2.txt:

   country;language;region;zip;latitude;longitude
   "ca";"en";"";"tom 0a0";"";"23.45"
   "ca";"en";"";"tom 0a1";"";"23.67"

Answer

Well first we need to ignore C++ minutiae for the moment and sort out a strategy.

I can see two main ways to go about this (excluding the first line of each file, as these lines are not record data).

The first:

   1/ read each line from the two files as described, so the lines are available to display if they are not equivalent
   2/ split each line into fields, comparing those you are interested in and ignoring those you are not.

The second:

   1/ create a record class for the data
   2/ provide the record with facilities to read data from a stream
   3/ provide a function to compare records on the field members required
   4/ read records from each file and compare each pair of records - again it would probably be best to read the lines into a string first, then process the string into a record with individual fields.

The two approaches are not really that dissimilar from each other. I shall concentrate on the first method here (as I am finding explaining all this with examples is taking too long as it is).

You can either read and ignore the first (column name) line of each file, or read it and compare it to ensure the two files contain the expected data.

Now getting back to C++ specifics.

We can read lines from a file into a std::string using the stand-alone std::getline functions (include <string>):

   #include <string>
   #include <fstream>
   
   // ...

   std::ifstream in("somefile.txt");

   // ...

   std::string line;

   // ...

   std::getline( in, line );

I have elided all error checking and shown only the bare bones of the code needed to demonstrate using std::getline to read a sequence of characters terminated by a newline ('\n') character from a file into a std::string. Like the std::istream member function getline, the stand-alone version reads but does not store the delimiter character (i.e. the newline).

We can split this string into fields using one of several techniques - of which using a std::istringstream (include <sstream>) is possibly the easiest. All but the last field of your records are delimited by semicolons rather than a newline, so in this case we can extract character sequences delimited by semicolons rather than newlines:

   #include <sstream>
   #include <iostream>

   // ...

   char const  FieldDelimiter(';');

   // ...

   line += FieldDelimiter;  // append delimiter for final field
   std::istringstream rec_in(line);
   std::string field;
   int fieldNumber(0);
   while ( std::getline(rec_in, field, FieldDelimiter) )
   {
       std::cout << "Field #" << ++fieldNumber << ": " << field << '\n';
   }

If we put the code together then, for any line read from data1.txt or data2.txt _except_ the first, the output for that line would be something like:

   Field #1: "ca"
   Field #2: "en"
   Field #3: "alberta"
   Field #4: "tom 0a0"
   Field #5: "51.49"
   Field #6: "23.45"

That is, we have extracted the field data as quoted strings. If this is OK for your purposes - that is, a field with a numeric value of "51.49" is considered different from, say, "051.49" or "5.149E1" - then you can expand on the above to compare the fields of interest and ignore the rest. The fields of interest in this case are country, language, zip and longitude, i.e. fields 1, 2, 4 and 6 or, if zero based (as is more the C/C++ way!), 0, 1, 3 and 5. We could write a function that takes two std::strings and an array of field numbers (or field ids), in increasing order, to compare:

   #include <vector>

   // ...

   typedef std::vector<int> CompareVector;
   
   enum FieldIdType
   { FldIdCountry
   , FldIdLanguage
   , FldIdRegion
   , FldIdZip
   , FldIdLatitude
   , FldIdLongitude
   };

   // ...

   bool RecordsAreEquivalent
   ( std::string const & rec1
   , std::string const & rec2
   , CompareVector const & fldsToCompare
   )
   {
       std::string record1(rec1);
       record1 += FieldDelimiter; // append delimiter for final field
       std::string record2(rec2);
       record2 += FieldDelimiter; // append delimiter for final field

       std::istringstream rec1_in(record1);
       std::istringstream rec2_in(record2);
       std::string field1;
       std::string field2;
       CompareVector::const_iterator fieldToComparePos( fldsToCompare.begin() );
       CompareVector::const_iterator endOfFieldsToCompare( fldsToCompare.end() );
       int fieldId(0);
       while ( fieldToComparePos!=endOfFieldsToCompare
            && std::getline(rec1_in, field1, FieldDelimiter) 
            && std::getline(rec2_in, field2, FieldDelimiter) 
           )
       {
           if ( fieldId==*fieldToComparePos )
           { // need to compare these fields
               ++fieldToComparePos; // position to next field ids to compare
               if ( field1 != field2 )
               {
                   return false;
               }              
           }
           ++fieldId;
       }
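       // an empty fldsToCompare has no back(); treat "nothing to compare" as equivalent
       return fldsToCompare.empty()
           || fieldId==fldsToCompare.back()+1; // assumes properly setup fldsToCompare vector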
   }

OK the RecordsAreEquivalent function expands on the previous example. It takes, by reference to constant objects, two record strings to compare and a vector of integer field id values, where a field id is the zero based column of the field: 0 for country through to 5 for longitude.

The first part of RecordsAreEquivalent is basically similar to the previous example although some things are doubled up to take into account the two record strings.

First a copy of each record string is taken and the terminating delimiter appended to each copy. Next each record string copy is used as the stream source for a std::istringstream, and two local field strings are defined - one for each record, to hold field values as they are extracted from that record's std::istringstream.

An iterator to the current field id to compare is defined and initialised to point to the first (beginning) field id we wish to compare. A current field id integer is defined and initialised to zero, the first field id in a record.

The while loop continues while we still have fields to compare AND BOTH record string streams have a field string successfully extracted from them.

Within the loop if the field id of the fields just extracted from each record is the next one to compare then:

   - the iterator 'position' pointing to the fields' id to compare is incremented to point to
      the next one in the vector
   - the extracted field values from each record's string stream are compared and if not equal
      false returned

Before the next while loop iteration the extracted fields' id value is incremented.

If the while loop terminates without the function having returned, then the returned value is true only if the current field id is one more than the last (i.e. back) value in the fldsToCompare vector, meaning the last field to be compared was reached (the vector is assumed to be non-empty and in increasing order). Otherwise the while loop terminated because one (or both) record string streams were not good _before_ all the fields that need to be compared were compared (most likely because one or both records were missing some fields).

For ease of use an enumeration, FieldIdType, gives names to the raw field id values, and a typedef type alias, CompareVector, is used to name the std::vector<int> type used to store field ids to be compared.

One caveat of the above approach is that, because the while loop terminates as soon as all the fields it needs to compare have been compared, short records (i.e. those missing columns after the last compared field) can be accepted as a match. It is up to you to decide if this behaviour is acceptable.
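If accepting short records is not acceptable, one possible alternative (a sketch only, not the function discussed above) is to split both records completely first and insist on the expected number of fields before comparing anything. The names splitFields and RecordsAreStrictlyEquivalent, and the expectedFieldCount parameter, are hypothetical ones introduced for this illustration:

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits a record into all of its fields.
std::vector<std::string> splitFields(std::string const & record, char delimiter)
{
    std::vector<std::string> fields;
    std::istringstream in(record + delimiter); // append delimiter for final field
    std::string field;
    while ( std::getline(in, field, delimiter) )
    {
        fields.push_back(field);
    }
    return fields;
}

// Hypothetical stricter variant: both records must have exactly
// expectedFieldCount fields before any field values are compared.
bool RecordsAreStrictlyEquivalent
( std::string const & rec1
, std::string const & rec2
, std::vector<int> const & fldsToCompare
, std::size_t expectedFieldCount
, char delimiter = ';'
)
{
    std::vector<std::string> const fields1( splitFields(rec1, delimiter) );
    std::vector<std::string> const fields2( splitFields(rec2, delimiter) );
    if ( fields1.size()!=expectedFieldCount || fields2.size()!=expectedFieldCount )
    {
        return false; // a short (or over long) record is never a match
    }
    for ( std::size_t i=0; i!=fldsToCompare.size(); ++i )
    {
        if ( fldsToCompare[i] < 0 )
        {
            return false; // negative field ids are invalid
        }
        std::size_t const id( static_cast<std::size_t>(fldsToCompare[i]) );
        if ( id >= expectedFieldCount )
        {
            return false; // field id lies outside the record
        }
        if ( fields1[id] != fields2[id] )
        {
            return false; // a field of interest differs
        }
    }
    return true;
}
```

The trade-off is that every field is extracted and stored even when only a few are compared; in return the field ids no longer need to be in increasing order.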

We can use the RecordsAreEquivalent function and the CompareVector and FieldIdType types like so, which shows some simple testing of the function:

   void checkRecords
   ( std::string const & rec1
   , std::string const & rec2
   , CompareVector const & fldsToCompare
   )
   {
       std::cout << rec1 << '\n' << rec2 << '\n'
                 << ( RecordsAreEquivalent(rec1,rec2,fldsToCompare) 
                       ? "are" 
                       : "are not" 
                    )
                 << " equivalent\n\n";
   }

   int main()
   {
       CompareVector compareFields;
       compareFields.push_back(FldIdCountry);
       compareFields.push_back(FldIdLanguage);
       compareFields.push_back(FldIdZip);
       compareFields.push_back(FldIdLongitude);

       std::cout << "Comparing fields 0,1,3,5 (Country,Language,Zip,Longitude):\n"
                    "==========================================================\n"
                    ;

       checkRecords( "\"ca\";\"en\";\"alberta\";\"tom 0a0\";\"51.49\";\"23.45\""
                   , "\"ca\";\"en\";\"alberta\";\"tom 0a0\";\"51.49\";\"23.45\""
                   , compareFields
                   );
       checkRecords( "\"ca\";\"en\";\"alberta\";\"tom 0a0\";\"51.49\";\"23.45\""
                   , "\"ca\";\"en\";\"xxxxxcc\";\"tom 0a0\";\"xx.xx\";\"23.45\""
                   , compareFields
                   );
       checkRecords( "\"ca\";\"en\";\"alberta\";\"tom 0a0\";\"51.49\";\"23.45\""
                   , "\"ca\";\"en\";\"alberta\";\"tom 0a0\";\"51.49\";\"23.4X\""
                   , compareFields
                   );
       checkRecords( "\"ca\";\"en\";\"alberta\";\"tom 0a0\";\"51.49\";\"23.45\""
                   , "\"cX\";\"en\";\"alberta\";\"tom 0a0\";\"51.49\";\"23.45\""
                   , compareFields
                   );
       checkRecords( "\"ca\";\"en\";\"alberta\";\"tom 0a0\";\"51.49\";\"23.45\""
                   , "\"ca\";\"eX\";\"alberta\";\"tom 0a0\";\"51.49\";\"23.45\""
                   , compareFields
                   );
       checkRecords( "\"ca\";\"en\";\"alberta\";\"tom 0a0\";\"51.49\";\"23.45\""
                   , "\"ca\";\"en\";\"albertX\";\"tom 0a0\";\"51.49\";\"23.45\""
                   , compareFields
                   );
       checkRecords( "\"ca\";\"en\";\"alberta\";\"tom 0a0\";\"51.49\";\"23.45\""
                   , "\"ca\";\"en\";\"alberta\";\"tom 0aX\";\"51.49\";\"23.45\""
                   , compareFields
                   );
       checkRecords( "\"ca\";\"en\";\"alberta\";\"tom 0a0\";\"51.49\";\"23.45\""
                   , "\"ca\";\"en\";\"alberta\";\"tom 0a0\";\"51.49\""
                   , compareFields
                   );
       CompareVector compareFields2;
       compareFields2.push_back(FldIdCountry);
       std::cout << "Comparing field 0 (Country) only:\n"
                    "==========================================================\n"
                    ;
       checkRecords( "\"ca\";\"en\";\"alberta\";\"tom 0a0\";\"51.49\";\"23.45\""
                   , "\"ca\""
                   , compareFields2
                   );

       return 0;
   }

In your problem's case I would expect you to start by expanding on the code for reading lines from a file that I showed initially. Open the two files and check they opened OK (using the streams' is_open member function). Then read a line from each file until one read fails or reaches the end of its file; if the other file is not also at its end then this is probably a difference (i.e. one file has records the other does not). Pass each pair of lines to RecordsAreEquivalent together with the CompareVector (which you can set up before reading any lines from the files). If RecordsAreEquivalent returns false (if ( ! RecordsAreEquivalent... )) then you can print out the lines in question, maybe along with the line number (which you could keep track of in a similar fashion to the fieldId value in RecordsAreEquivalent).
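The steps just described might be sketched along the following lines. This is only one way to arrange it: compareStreams is a hypothetical name, it assumes the first line of each input holds column names and simply skips it, and it treats one input ending before the other as a single difference. A condensed version of RecordsAreEquivalent (with a guard for an empty field list) is included so the sketch compiles on its own:

```cpp
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

typedef std::vector<int> CompareVector;
char const FieldDelimiter(';');

// Condensed version of the RecordsAreEquivalent function from this answer.
bool RecordsAreEquivalent
( std::string const & rec1
, std::string const & rec2
, CompareVector const & fldsToCompare
)
{
    std::istringstream rec1_in(rec1 + FieldDelimiter);
    std::istringstream rec2_in(rec2 + FieldDelimiter);
    std::string field1;
    std::string field2;
    CompareVector::const_iterator pos( fldsToCompare.begin() );
    int fieldId(0);
    while ( pos!=fldsToCompare.end()
         && std::getline(rec1_in, field1, FieldDelimiter)
         && std::getline(rec2_in, field2, FieldDelimiter)
          )
    {
        if ( fieldId==*pos )
        {
            ++pos;
            if ( field1 != field2 )
            {
                return false;
            }
        }
        ++fieldId;
    }
    return fldsToCompare.empty() || fieldId==fldsToCompare.back()+1;
}

// Hypothetical driver: compares two inputs line by line, writes each
// differing pair to out and returns the number of differences found.
std::size_t compareStreams
( std::istream & in1
, std::istream & in2
, CompareVector const & fldsToCompare
, std::ostream & out
)
{
    std::string line1;
    std::string line2;
    std::size_t lineNumber(0);
    std::size_t differences(0);
    for (;;)
    {
        bool const got1( !std::getline(in1, line1).fail() );
        bool const got2( !std::getline(in2, line2).fail() );
        if ( !got1 && !got2 )
        {
            break; // both inputs ended together
        }
        ++lineNumber;
        if ( got1 != got2 )
        {
            out << "Line " << lineNumber << ": only present in file "
                << (got1 ? 1 : 2) << '\n';
            ++differences;
            break; // nothing left to pair up
        }
        if ( lineNumber==1 )
        {
            continue; // assumption: first line holds column names, ignore it
        }
        if ( !RecordsAreEquivalent(line1, line2, fldsToCompare) )
        {
            out << "Line " << lineNumber << ":\n  " << line1
                << "\n  " << line2 << '\n';
            ++differences;
        }
    }
    return differences;
}
```

In a real program in1 and in2 would be std::ifstream objects opened on data1.txt and data2.txt and checked with is_open, and out would typically be std::cout.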

You will have to decide what to do about the initial column name line in each file, some possibilities would be:

   - read and ignore them
   - read and compare as per record data
   - read and compare all column names using RecordsAreEquivalent and a different CompareVector.

Note: the code here is not meant to be of production quality - you will probably have to add additional error checking and the like, and maybe change the names of things.
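One such change, should you want quoted numeric fields to compare by value rather than as text (the "51.49" versus "5.149E1" case mentioned earlier), might look something like the sketch below. All three function names are hypothetical, the tolerance is an assumed value you would need to choose for your data, and note that the enclosing quotes are stripped before comparing, so a quoted and an unquoted field with the same content are treated as equal:

```cpp
#include <cmath>
#include <cstdlib>
#include <string>

// Hypothetical helper: removes one pair of enclosing double quotes, if present.
std::string stripQuotes(std::string const & field)
{
    if ( field.size()>=2 && field[0]=='"' && field[field.size()-1]=='"' )
    {
        return field.substr(1, field.size()-2);
    }
    return field;
}

// Hypothetical helper: converts text to a double using std::strtod and
// returns false unless the whole of the text was consumed as a number.
bool toNumber(std::string const & text, double & value)
{
    if ( text.empty() )
    {
        return false;
    }
    char * end(0);
    value = std::strtod(text.c_str(), &end);
    return end==text.c_str()+text.size();
}

// Hypothetical comparison: numeric if both fields are numbers, textual otherwise.
bool FieldsAreEquivalent
( std::string const & field1
, std::string const & field2
, double tolerance = 1e-9 // assumed tolerance, choose one to suit the data
)
{
    std::string const text1( stripQuotes(field1) );
    std::string const text2( stripQuotes(field2) );
    double value1(0.0);
    double value2(0.0);
    if ( toNumber(text1, value1) && toNumber(text2, value2) )
    {
        return std::fabs(value1-value2) <= tolerance;
    }
    return text1==text2;
}
```

A function like this could be called in place of the field1 != field2 test, although for a field such as zip you would probably still want the plain text comparison.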

Hope this has given you some ideas.
