Fix text-formatted numbers

Beside the suggestion from the following Microsoft page, re-assigning the value with Macro will also resolve.

Selection.Value = Selection.Value

http://office.microsoft.com/en-us/excel-help/fix-text-formatted-numbers-by-applying-a-number-format-HA102749043.aspx

Importing data from Excel is not easy. If the formatting of a column changed (the first 100 cells is number and the second 100 cells is text), the SSIS may convert the second 100 cells to NULL values. Cleaning the formats must be done before the import.

sample code to find unsupported character conversions

/*
    http://en.wikipedia.org/wiki/Code_page_1252
    http://en.wikipedia.org/wiki/EBCDIC_037
  
    http://stackoverflow.com/questions/13389182/in-net-how-do-i-determine-what-characters-existing-in-the-windows-1252-codepag
 
 
    Notes:
    (1) There is a problem in the code.  The .Net conversion from Windows 1252 to IBM 37 has problem.  It converted the of Ÿ (159 in windows 12562) to "Y" (232 IBM 37) instead of "Ÿ" (255 IBM 37).
    (2) The Console cannot display all the characters on the terminal.
 */

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

namespace test_code_page
{
    class find_unsupported_conversions
    {
        static void Main(string[] args)
        {
            FileStream stream = null;
            StreamWriter streamwriter = null;
            TextWriter console_streamwriter = Console.Out;
            string path;

            if (args.Length == 1)
                path = args[0];
            else
            {
                Console.Write("Enter path: ");
                path = Console.ReadLine();
            }

            try
            {
                stream = new FileStream(path, FileMode.CreateNew, FileAccess.Write);
                streamwriter = new StreamWriter(stream);
                Console.SetOut(streamwriter);
            }
            catch (Exception e)
            {
                Console.WriteLine("Cannot open '" + path + "' for writing");
                Console.WriteLine(e.Message);
            }

            test_code_pages();

            Console.SetOut(console_streamwriter);
            if (streamwriter != null)
                streamwriter.Close();
            if (stream != null)
                stream.Close();
        }

        private static void test_code_pages()
        {
            Encoding source_encoding = Encoding.GetEncoding("Windows-1252");
            Encoding target_encoding = Encoding.GetEncoding("IBM037");
            
            byte source_question_mark = source_encoding.GetBytes("?")[0];
            byte target_question_mark = target_encoding.GetBytes("?")[0];

            int count = 0;
            byte[] b = new byte[1];
            for (int i = 0; i < 256; i++)
            {
                b[0] = (byte)i;
                if (b[0] != source_question_mark && Encoding.Convert(source_encoding, target_encoding, b)[0] == target_question_mark)
                {
                    count++;
                    Console.WriteLine(source_encoding.GetString(b) + " " + i.ToString());
                }
            }

            Console.WriteLine(count.ToString() + " unsupported.");
        }
    }
}

notes on character encoding conversion between SQL Server and AS400

Character encoding is a difficult topic because different companies have developed different terminologies and different standard.  For the Microsoft products, the default character encoding is Windows-1252.  It is being used in the SQL Servers.  For IBM servers, it is EBCDIC encoding schema.  In some AS400 system, both CCSID 37 and CCSID 65535 are being used.  The CCSID 37 seems to be mostly used.

The IBM iSeries OLE DB provider translates the character encoding when data are being sent from SQL Server to the AS400.  In most cases, the translation of Windows 1252 to CCSID 37 works.  But, there are 26 Windows 1252 characters cannot be translated, see the table below.  The CCSID 37 does not have the matching characters.  So, the program (like SSIS package) has to ignore the error and let the iSeries driver change the unsupported characters to question mark “?”.  Or, the program has to manually convert them, for example changing the character Ž to the character Z.

 

Windows 1252

character

code

128

130

ƒ

131

132

133

134

135

ˆ

136

137

Š

138

139

Œ

140

Ž

142

145

146

147

148

149

150

151

˜

152

153

š

154

155

œ

156

ž

158

Sticky Spaces

The Sticky Space is the non-breakable space CHAR(160).

In word processing, like MS Word, by applying the sticky spaces between words can avoid wrapping a phrase or a name into two lines. The sticky space looks like a regular space. The MS Word will handle the phrase connected by them as a single word. To put in the sticky space, just press the Ctrl+Shift+Space.

But, it may be a problem during ETL.

HTML uses non-breaking spaces CHAR(160) for multiple spaces. Copying string contained CHAR(160) from web page, or from HTML email, to Excel spreadsheet may cause problem. For example, the VBA Trim function will not remove the non-breaking spaces.

Beware of the 194, 160 pattern.
https://stackoverflow.com/questions/1431034/can-anyone-tell-me-what-this-ascii-character-is

Upserting

“Upserting is when you insert a row if it doesn’t exist or update it if itdoes.” Kirk Haselden, Microsoft SQL Server 2005 Integration Services

http://en.wikipedia.org/wiki/Upsert

The SSIS “Slowly Changing Dimensions” is kind of a upsert process. But, it is very slow. The following link provide a better solution. (Slowly Changing Dimensions is more then upsert.)

http://blog.stevienova.com/2008/11/22/ssis-slowly-changing-dimensions-with-checksum/