100分问读取2G的大文件并进行比较的问题

wang_wei2007 2007-11-19 11:07:19

有两个文件,A文件和B文件,每个文件在2G左右大小,文件存储的内容是用户信息格式如下
1|张三|23|南京市
2|...........以下略
以每一行一条记录的格式,每个字段用 | 进行分隔,记录数在100万至500万之间,不会超过500万条.
A和B的文件的格式都是一样,现在要比较两个文件之间不同的记录数,要求把结果输出到新的文件,以A文件为基础,如果A文件里有的记录而B文件里没有,则把此记录记到新文件中,并在最后增加标志位设为3(表示删除);如果A文件里有的记录而B文件里也有,但记录的内容不一样,则也把B文件中的此记录内容记录到新文件中,并在最后增加标志位设为2(表示修改);如果A文件里没有的记录而B文件里有,则把B文件中的此记录内容记录到新文件中,并在最后增加标志位设为1(表示新增)

(注:同样的记录在两个文件中可能位置不一样,并且记录的顺序也可能不一样.不过两个文件的差异性不会超过10%)
最后生成的新文件则是A和B两个文件的差异.
问,用什么样的思路解决此问题?方法不限,要求速度越快越好,内存占用越少越好.

...全文

1709 63 打赏收藏转发到动态举报

写回复

用AI写文章

63 条回复

切换为时间正序

请发表友善的回复…

发表回复

gate1001 2011-12-23

打赏
举报

好贴学习中

mymtom 2009-08-24

打赏
举报

[Quote=引用 5 楼 guoxyj 的回复:]
跟数据库似的
[/Quote]
就是数据库嘛！！！

renxiaoyao868 2008-09-18

打赏
举报

好贴

xiyuan1999 2007-12-11

打赏
举报

使用缓存给缓冲区不要都放在内存里面

timbear 2007-12-11

打赏
举报

学习中～～

bukebuhao 2007-11-26

打赏
举报

学习,真的学了很多呀,

火龙果被占用了 2007-11-26

打赏
举报

to: hbwhwang

我的方案在代码中并未使用MD5或者SHA-1来压缩数据长度，因为我试
过使用字符串数组来存储数据的话500万行需要占用的内存超过300M，
如果再使用Java的集合类来存储的话，那内存占用还需要增加。

由于我的机器配置不是很高（P4:2.8G;480M内存）我在41～43楼的代码
中已经全部舍弃了使用对象类型来存储数据了，直接采用基本的int类
型来存储，没有任何浪费，一行仅占用8个字节左右的内存。

对于排序和查找都使用了Arrays.sort()、Arrays.binarySearch()方法，
这是个高度优化过的快速排序算法和拆半查找，对于大容量的排序特别
有效率。对于500万行的数据，执行的时间跟你的时间差不多，但是内存
占用量仅需要100M左右。

wang_wei2007 2007-11-26

打赏
举报

怎么结贴?是不是点无满意结贴?

wang_wei2007 2007-11-26

打赏
举报

经过测试,随机文件的存储在存取万级以上的记录时速度并不快,所以存储起止位置的方案可行性不高.
不过,采用多线程,提高的时间很有限(由于只有一个IO通道),而且程序出错的可能性大大提高,所以去掉多线程的操作.
而火龙果的方案的确不错,和我的最终方案非常一致,关键点都是要把很大的字符串的内容转化成特征码,不过你这里用的是hasCode,不安全,需要调整.因为存在重码的可能.

所以,最终方案,先建立索引文件,再用索引文件对关键字进行比较,把不同的地方保存到新的集合中,按行号进行排序,再读取原文件的记录,具体的实现可参考火龙果的代码(稍微优化一下就可用了).

对于大文件的比较,困难点在于不能够一次读入内存中(如果能够全部在内存中排序,就不会有问题了,java自带的排序就能工作的很好),而整个程序的瓶颈在于IO性能,读取磁盘的次数和读取量的多少决定了程序的性能.

谢谢大家这么多天的讨论.希望大家都从中学到了东西(至少我学了不少东西,呵呵).

maquan 2007-11-25

打赏
举报

要结贴了？hehe，俺来晚了……

看来制作一个索引是个不错的办法。

不过，我还是有点疑问，为什么就不能用数据库呢？你这是个实际的项目，又不是考试题，干嘛要把自己限制那么死呢？

搞个 MySQL，又不需要钱，又不占用很高的系统资源，又能很高效地解决问题，何乐不为呢……

wang_wei2007 2007-11-25

打赏
举报

谢谢hbwhwang的提醒，不过想请教一个问题，您的方案中每比较一次记录就要读一次文件，是不是会慢了点？
不过不记录行号而记录起止位置是个好方法，可以一用。
但如果不把记录转换成特征码，而是每次比较都要读取文件，是不是太慢了？
而转换字符串的时间和读取文件的时间肯定不在一个数量组上的。
另外想问一下，为什么用多线程是画蛇添足呢？还请点明一下，谢谢了

hbwhwang 2007-11-25

打赏
举报

我的程序还未优化，优化之后，代码运行时间应该在5-9分钟（真实环境）

hbwhwang 2007-11-25

打赏
举报

针对你的最终方案提3个意见：
1、做索引时不要用行号。看看我的代码，直接用记录的起止位置做索引，这样找起文件记录更快。毫无疑问，随机访问比顺序访问更快。
2、没必要用特征码！MD5,SHA-1算法都比较费时间，每个字符串都算一遍耗费不少时间，直接比较字符串虽然多读了次硬盘，但也慢不了多少。
3、这种程序用多线程纯属画蛇添足！

wang_wei2007 2007-11-25

打赏
举报

晕,如何结贴给分啊?

wang_wei2007 2007-11-25

打赏
举报

最终方案已确定,以关键字,行号,特征码建一个索引文件,对索引文件以关键字进行排序,然后对索引文件进行比较,找出不同处后再读取原文件把原记录进行输出.采用多线程操作后,时间能够控制在十五分钟以内,真正部署以后应该会更快,感谢大家的讨论,特别谢谢火龙果的方案,给了我不少的启发

火龙果被占用了 2007-11-22

打赏
举报

以上的代码，对于两个10行的文件，测试是OK的。

火龙果被占用了 2007-11-22

打赏
举报

FileGenerator.java（用于生成两个文件）

import java.io.BufferedWriter;

import java.io.File;

import java.io.FileWriter;

import java.io.IOException;

import java.security.MessageDigest;

import java.security.NoSuchAlgorithmException;

import java.util.HashSet;

import java.util.Random;

import java.util.Set;



public class FileGenerator {



    private static MessageDigest md;

    public final static String FIELD_SEPARATOR = "|";

    public final static String LINE_SEPARATOR = System.getProperty("line.separator");

    

    public final static int FILE_LINE = 10;

    

    public static void main(String[] args) {

        generateFile("f:/diff1.txt");        

        System.out.println("ok");

        generateFile("f:/diff2.txt");

        System.out.println("ok");

    }    

    

    static {

        try {

            md = MessageDigest.getInstance("sha-1");

        } catch (NoSuchAlgorithmException e) {

            e.printStackTrace();

        }

    }

    

    private static void init(String algorithm) {

        try {

            md = MessageDigest.getInstance(algorithm);

        } catch (NoSuchAlgorithmException e) {

            e.printStackTrace();

        }

    }

    

    public static String encrypte(String plainText) {

        md.update(plainText.getBytes());

        byte[] b = md.digest();

        StringBuffer output = new StringBuffer(32);

        for (int i = 0; i < b.length; i++) {

            String temp = Integer.toHexString(b[i] & 0xff);

            if (temp.length() < 2) {

                output.append("0");

            }

            output.append(temp);

        }

        return output.toString();

    }

    

    public static void generateFile(String filename) {

        Set<Integer> set = new HashSet<Integer>();

        File file = new File(filename);

        Random ran = new Random();

        BufferedWriter bw = null;

        

        String[] content = new String[4];

        init("sha-512");

        for(int i = 0; i < 4; i++) {

            content[i] = encrypte((i + System.nanoTime()) + "") + FIELD_SEPARATOR + encrypte(i+"");            

        }        

        

        try {

            bw = new BufferedWriter(new FileWriter(file));

            int key = -1;

            set.add(key);            

            int ranNum = FILE_LINE + (FILE_LINE * 2) / 10;  

            for(int i = 0; i < FILE_LINE; i++) {

                while(set.contains(key)) {

                    key = ran.nextInt(ranNum);                

                }

                set.add(key);

                String str = key + FIELD_SEPARATOR + content[(int)(System.nanoTime()%4)] + LINE_SEPARATOR;

                bw.write(str);

            }

            set.clear();

            bw.close();

        }catch(IOException e) {

            e.printStackTrace();

        }

    }

}

火龙果被占用了 2007-11-22

打赏
举报

FileComparison.java（第二部分）

    /************** 

     * 续，承上帖 *

     **************/

    

    /**

     * 从文件中读取 key

     * @param file      文件

     * @param size      文件的最大行数

     * @return int[]    已排序的 key 数组

     */

    private int[] readKey(File file, int size) {

        int[] keys = new int[size];

        BufferedReader br = null;

        try {

            br = new BufferedReader(new FileReader(file));

            String str = "";

            int i = 0;

            while ((str = br.readLine()) != null) {

                keys[i++] = getKey(str);

            }

            br.close();

        } catch (IOException e) {

            e.printStackTrace();

        }        

        Arrays.sort(keys);

        br = null;

        System.out.printf("  %s 的键读取完成，大小 %d%n", file.getName(), size);

        System.gc();

        return keys;

    }

    

    /**

     * 映射源文件

     * @param file          源文件

     * @param fileLine      源文件的行数

     * @param excludeFile   可以被排除的文件（删除/新增的记录集）

     * @param excludeSize   被排除文件的行数

     * @return int[][]      int[0][] 源文件的 key 数组，按照 int[0][] 值的大小排序

     *                      int[1][] 源文件的记录数组（已映射成为int数据）

     */

    private int[][] readFile(File file, int fileLine, File excludeFile, int excludeSize) {        

        int[][] map = new int[2][fileLine - excludeSize];

        BufferedReader br = null;

        int[] excludeKeys = readKey(excludeFile, excludeSize);

        int i = 0;

        try {

            br = new BufferedReader(new FileReader(oldFile));

            String oldFileLine = "";            

            int key;

            while((oldFileLine = br.readLine()) != null) {

                key = getKey(oldFileLine);

                if(Arrays.binarySearch(excludeKeys, key) < 0) {

                    map[0][i] = key;

                    map[1][i] = getValue(oldFileLine);

                    i++;

                }

            }

            br.close();

        }catch(IOException e) {

            e.printStackTrace();

        }

        sort(map, 0, map[0].length);

        excludeKeys = null;

        System.out.printf("  %s 文件映射完成，记录数 %d%n", file.getName(), i);

        System.gc();

        return map;

    }

    

    /******************************************************

     * 这个方法是核心，比较的成败取决于这个方法，当前采用的

     * 是 hashcode，但有待于改进

     *

     * 为了节省存储空间，需要将字符串转换成一个int类型的值

     * 期望能做到：

     * 两个字符串相同时，其int值也相同；

     * 两个字符串不同时，其int值也不同。

     *

     * @param str 需要转换的字符串

     * @return 字符串的“int值”

     ******************************************************/

    private int getValue(String str) {

        return str.hashCode();

    }

    

    /**

     * 从一行字符串中获得键

     * @param str

     * @return

     */

    private int getKey(String str) {

        char[] chars = str.toCharArray();

        int i = 0;

        int num = 0;

        while(chars[i] != '|') {

            num = num * 10 + (chars[i++] - '0');            

        }

        return num;

    }

    

    

    /**

     * 从JDK源代码中抄的Arrays.sort()排序算法，

     * 改进一下，以x[0]排序，同时交换a[1]的值，即同步交换

     * 排序前：

     * [0]  [1]

     * 63   9

     * 51   12

     * 7    33

     * 48   15

     * 45   82

     * 55   76

     * 排序后（[0]作为键，[1]作为值，交换键的同时交换值）：

     * 7    33

     * 45   82

     * 48   15

     * 51   12

     * 55   76

     * 63   9

     */

    private void sort(int x[][], int off, int len) {

        if (len < 7) {

            for (int i = off; i < len + off; i++)

                for (int j = i; j > off && x[0][j - 1] > x[0][j]; j--)

                    swap(x, j, j - 1);

            return;

        }

        // Choose a partition element, v

        int m = off + (len >> 1); // Small arrays, middle element

        if (len > 7) {

            int l = off;

            int n = off + len - 1;

            if (len > 40) { // Big arrays, pseudomedian of 9

                int s = len / 8;

                l = med3(x, l, l + s, l + 2 * s);

                m = med3(x, m - s, m, m + s);

                n = med3(x, n - 2 * s, n - s, n);

            }

            m = med3(x, l, m, n); // Mid-size, med of 3

        }

        int v = x[0][m];



        int a = off, b = a, c = off + len - 1, d = c;

        while (true) {

            while (b <= c && x[0][b] <= v) {

                if (x[0][b] == v)

                    swap(x, a++, b);

                b++;

            }

            while (c >= b && x[0][c] >= v) {

                if (x[0][c] == v)

                    swap(x, c, d--);

                c--;

            }

            if (b > c)

                break;

            swap(x, b++, c--);

        }



        int s, n = off + len;

        s = Math.min(a - off, b - a);

        vecswap(x, off, b - s, s);

        s = Math.min(d - c, n - d - 1);

        vecswap(x, b, n - s, s);



        if ((s = b - a) > 1)

            sort(x, off, s);

        if ((s = d - c) > 1)

            sort(x, n - s, s);

    }

    private void swap(int x[][], int a, int b) {

        int t = x[0][a];

        x[0][a] = x[0][b];

        x[0][b] = t;

        t = x[1][a];

        x[1][a] = x[1][b];

        x[1][b] = t;        

    }

    private void vecswap(int x[][], int a, int b, int n) {

        for (int i = 0; i < n; i++, a++, b++)

            swap(x, a, b);

    }

    private int med3(int x[][], int a, int b, int c) {

        return (x[0][a] < x[0][b] ? (x[0][b] < x[0][c] ? b : x[0][a] < x[0][c] ? c : a)

                : (x[0][b] > x[0][c] ? b : x[0][a] > x[0][c] ? c : a));

    }

Test.java（测试类）

import java.io.File;



public class Test {



    public static void main(String[] args) {

         File oldFile = new File("f:/diff3.txt");

         File newFile = new File("f:/diff4.txt");

         File addFile = new File("f:/diff_add.txt");

         File deleteFile = new File("f:/diff_delete.txt");

         File modifyFile = new File("f:/diff_modify.txt");

         FileComparison fc = new FileComparison(oldFile, newFile, addFile, deleteFile, modifyFile);

         // 这两个需要在recordModified前执行

         fc.recordAdded();

         fc.recordDeleted();

         fc.recordModified();

    }

}

火龙果被占用了 2007-11-22

打赏
举报

这道题目不能使用内置的集合类（内置的集合类太占内存了），
改进了一下存储结构，采用int数组，可以大大地节省内存。

测试数据为随机两个随机生成的100万行文件，key为（0～120万）的随机值，记录为4个常量值，
根据系统的纳秒数随机选择，文件大小253MB，采用这个算法的话，文件不在于多，而是在于行数
的多少，因为已经将记录映射为一个int类型的数据。在使用
[code=BatchFile]java -Xmx20m Test[/code]
设最高heap堆内存为20MB，完成比较，耗时150秒左右，如果你有500万行的话，可以提高堆内存占
用，比如说增加5倍到100MB（-Xmx100m），这样不过是存在三个文件当中的，为了加快执行速度，
源代码都放在了一个文件当中，比较大，将分多次贴出，可以参考一下：

FileComparison.java（第一部分）

import java.io.BufferedReader;

import java.io.BufferedWriter;

import java.io.File;

import java.io.FileReader;

import java.io.FileWriter;

import java.io.IOException;

import java.util.Arrays;



public class FileComparison {  

    

    // 文件中最大的行数，不会超过500万的话，就设为500万

    private final int LINE = 1000000;    

    // 字段分隔符

    private final String FIELD_SEPARATOR = "|";

    // 行结束符

    private final String LINE_SEPARATOR = System.getProperty("line.separator");

    // 新增加条目所加的后缀

    private final String ADD = FIELD_SEPARATOR + "1" + LINE_SEPARATOR;

    // 被删除条目所加的后缀

    private final String MODIFY = FIELD_SEPARATOR + "2" + LINE_SEPARATOR;

    // 被修改过条目所加的后缀

    private final String DELETE = FIELD_SEPARATOR + "3" + LINE_SEPARATOR;  

    

    // 旧文件

    private File oldFile;

    // 新文件

    private File newFile;

    // 记录增加的文件

    private File addedFile;

    // 记录删除的文件

    private File deletedFile;

    // 记录被修改的文件

    private File modifyFile;    

    // 增加记录的数量

    private int addedNumber = 0;

    // 删除记录的数量

    private int deletedNumber = 0;

    

    public FileComparison(File oldFile, File newFile, File addedFile, File deletedFile, File modifyFile) {

        this.oldFile = oldFile;

        this.newFile = newFile;

        this.addedFile = addedFile;

        this.deletedFile = deletedFile;

        this.modifyFile = modifyFile;

    }



    /**

     * 比较被修改的条目，以新文件为依据，若有不同则存入新文件的条目，

     * 并在最后标记为“2”

     */

    public void recordModified() {

        long t0, t1;

        System.out.println("开始比较修改的记录...");

        t0 = System.currentTimeMillis();

        int[][] oldFileMap = readFile(oldFile, LINE, deletedFile, deletedNumber);

        BufferedReader br = null;

        BufferedWriter bw = null;

        int[] addedKeys = readKey(addedFile, addedNumber);

        int count = 0;

        try {

            br = new BufferedReader(new FileReader(newFile));

            bw = new BufferedWriter(new FileWriter(modifyFile));

            String newFileLine = "";

            int key;

            int value;

            while((newFileLine = br.readLine()) != null) {

                key = getKey(newFileLine);

                if(Arrays.binarySearch(addedKeys, key) < 0) {

                    int index = Arrays.binarySearch(oldFileMap[0], key);

                    if(index >= 0) {

                        value = getValue(newFileLine);

                        if(oldFileMap[1][index] != value) {

                            bw.write(newFileLine + MODIFY);

                            count++;

                        }

                    }

                }

            }

            bw.close();

            br.close();

        }catch(IOException e) {

            e.printStackTrace();

        }

        t1 = System.currentTimeMillis();

        System.out.printf("被修改的记录比较完成，有 %d 条被修，耗时 %.1f 秒，存放于：%s%n", count, (t1-t0)/1000F, modifyFile.getName());

    }

    

    /**

     * 比较新文件中，新增的条目（新文件中有的，旧文件中没有的），

     * 记录到文件中，并在最后标记为“1”

     */

    public void recordAdded() {

        long t0, t1;

        System.out.println("开始比较新增加的记录...");

        t0 = System.currentTimeMillis();

        int[] oldFileKey = readKey(oldFile, LINE);        

        BufferedReader br = null;

        BufferedWriter bw = null;

        try {

            br = new BufferedReader(new FileReader(newFile));

            bw = new BufferedWriter(new FileWriter(addedFile));

            String newFileLine = "";            

            while((newFileLine = br.readLine()) != null) {

                int key = getKey(newFileLine);

                if(Arrays.binarySearch(oldFileKey, key) < 0){

                    addedNumber++;

                    bw.write(newFileLine + ADD);

                }

            }

            br.close();

            bw.close();            

        }catch(IOException e) {

            e.printStackTrace();

        }

        oldFileKey = null;

        br = null;

        bw = null;

        t1 = System.currentTimeMillis();

        System.out.printf("新增加的记录比较完成，有 %d 条新增，耗时 %.1f 秒，存放于：%s%n", addedNumber, (t1-t0)/1000F, addedFile.getName());

        System.gc();

    }

    

    /**

     * 比较新文件中，删除的条目（旧文件中有的，新文件中没有的），

     * 记录到文件中，并在最后标记为“3”

     */

    public void recordDeleted() {

        long t0, t1;

        System.out.println("开始比较删除的记录...");

        t0 = System.currentTimeMillis();

        int[] newFileKey = readKey(newFile, LINE);

        BufferedReader br = null;

        BufferedWriter bw = null;

        try {

            br = new BufferedReader(new FileReader(oldFile));

            bw = new BufferedWriter(new FileWriter(deletedFile));

            String newFileLine = "";

            int key;

            while((newFileLine = br.readLine()) != null) {

                key = getKey(newFileLine);

                if(Arrays.binarySearch(newFileKey, key) < 0) {

                    deletedNumber++;

                    bw.write(newFileLine + DELETE);                    

                }

            }

            br.close();

            bw.close();

        }catch(IOException e) {

            e.printStackTrace();

        }

        newFileKey = null;

        bw = null;

        br = null;

        t1 = System.currentTimeMillis();

        System.out.printf("被删除的记录比较完成，有 %d 条删除，耗时 %.1f 秒，存放于：%s%n", deletedNumber, (t1-t0)/1000F, deletedFile.getName());

        System.gc();

    }

    

    /**************** 

     * 未完，接下帖 *

     ****************/

}