sort and searching a particular pattern in a huge file


krishnendu das
I have an input file that can be up to 1 GB in size. It is a delimited text file with "|" as the field separator. First, I need to sort the file on the first field (subscriber ID). Second, I need to find a particular pattern throughout the file and replace it with a fixed string.

The whole code should be in perl.

Time complexity is a big concern here.

Any suggestions?
Re: sort and searching a particular pattern in a huge file

Guru
Administrator
It depends on the system you are running on. If it has a good amount of RAM and enough disk space (sort stores its temporary files on disk), sort by itself will do the job.

$ ls -lh e
-rw-rw-r-- 1 user1 group1 1.1G Jun 26 05:27 e

$ time sort -n e > /dev/null

real    1m33.53s
user    1m31.28s
sys     0m2.22s

As shown above, sort took about 93 seconds to sort the 1.1 GB file, which is reasonable.

If there is not enough memory, you can split the file into smaller chunks, sort each chunk individually, and then merge the sorted chunks.

The script below splits, sorts, and merges:

$ cat scr.sh
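#!/bin/bash

# Split the input into chunks of 1,000,000 lines each: faaa, faab, ...
split -l 1000000 e fa
for file in fa*
do
         # Sort each chunk numerically
         sort -n "$file" > "${file}_"
done
# Merge the sorted chunks; -n must match the ordering used for the chunks
sort -n -m fa*_ > d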

$ time ./scr.sh

real    0m19.77s
user    0m16.65s
sys     0m3.10s

This finished in about 19 seconds, which is much faster than the single sort above.
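
Since the file in the question is "|"-separated and has to be ordered by its first field, it may help to tell sort exactly which field is the key. The sketch below assumes the subscriber IDs are numeric; the file names subs_in.csv and subs_sorted.csv and the three sample rows are made up for illustration only.

```shell
#!/bin/bash
# Hypothetical sample rows: subscriber id | value | code
printf '%s\n' '30|245|YY' '4|445|XX' '100|665|AA' > subs_in.csv

# -t'|'   use "|" as the field separator
# -k1,1n  the sort key is field 1 only, compared numerically
sort -t'|' -k1,1n subs_in.csv > subs_sorted.csv

cat subs_sorted.csv
```

If the IDs are not purely numeric, dropping the trailing n from -k1,1n gives a plain string comparison on the first field instead.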
 
For the find-and-replace step, read the file line by line (don't store the entire file in an array or hash), apply the substitution to each line, and write the result to a new file.

$ cat file
ab|245|YY
ef|445|XX
gh|665|AA


$ cat scr2.pl
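#!/usr/bin/perl
use strict;
use warnings;
$\="\n";    # print appends a newline, replacing the one chomp removes

open my $fh, '<', 'file' or die $!;
open my $fho, '>', 'file1' or die $!;
# Process one line at a time so memory use stays constant
while(<$fh>){
        chomp;
        s/\|445\|/||/g;    # replace every occurrence on the line
        print $fho $_;
}
close $fh;
close $fho or die $!;    # catch write errors such as a full disk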