Simple PERL script to remove the extra new line character from delimited flat files

When you use BTEQ, SQL*Plus, or Informatica to generate delimited flat file from RDBMS, the new line character embedded within a string/varchar column is quite annoying. You can add REPLACE() function to each string column in your SQL to remove these offending \n, but that’s still painful.

Let’s pipe the output of SQL into the following PERL script, and it will try its best to reconstruct the column by removing the extra new line characters. The assumption is – either the 1st row is in good shape which does not contain extra new line character, or the 1st row is the column header.

#!/usr/bin/perl -w

my $file_name = shift @ARGV;

my ($line_count, $line_fixed, $expected_delimiter_count, $count, $n, $del) = (0,0,0,0,0, '');

my @possible_delimiter = (',',"\t",';','\|', '\cA', '\cZ');

# open STDIN or file
if (defined $file_name) { open INPUTPIPE, "<" , $file_name or die $!; } 
else { open INPUTPIPE, "<&STDIN"; }

# count delimiter in the 1st line
my $line_buffer = <INPUTPIPE>;
$n++;
print $line_buffer; $line_count++; # the 1st line must be in good shape

foreach $del (@possible_delimiter){
	chomp($line_buffer);
	$count = () = ($line_buffer =~ /$del/g);
	print STDERR $count . " ($del) delimiters found in 1st row.\n";
	if ( $count > $expected_delimiter_count ){
		$expected_delimiter_count = $count;
		$delimiter = $del;
	}
}

die "Can't detect delimiter!\n" if $expected_delimiter_count <= 0;

print STDERR $expected_delimiter_count . " ($delimiter) delimiters expected.\n\n";

# read through the rest 
while( $line_buffer = <INPUTPIPE> ){
	$n++;
	$count = () = ($line_buffer =~ /$delimiter/g);
	if ( $count >= $expected_delimiter_count ){
		print $line_buffer;
		$line_count++;
		print STDERR "$n : $count : output \n";
		next;
	} else {
		while( $next_line = <INPUTPIPE> ){
			$n++;
			chomp($line_buffer);
			$line_buffer .= $next_line;
			$count = () = ($line_buffer =~ /$delimiter/g);
			# print STDERR "$n : $count : $line_buffer";
			if ( $count >= $expected_delimiter_count ){
				print $line_buffer;
				$line_count++;
				$line_fixed++;
				
				last;
			}
		} # read the next line
	}
}

# finish up
print STDERR $line_count . " rows output into STDOUT and " . $line_fixed . " rows fixed.\n";
close INPUTPIPE;
Advertisements

, , ,

  1. Leave a comment

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: