Reading large data from a CSV file in PHP

一尘不染

php

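I am reading a CSV file in PHP and checking against MySQL whether each record exists in my table.

The CSV has about 25,000 records. When I run my code, it shows a "Service Unavailable" error after 2m 10s (onload: 2m 10s).

Here is my code: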

// for set memory limit & execution time
ini_set('memory_limit', '512M');
ini_set('max_execution_time', '180');

//function to read csv file
function readCSV($csvFile)
{
    $file_handle = fopen($csvFile, 'r');
    while (!feof($file_handle) ) {

       set_time_limit(60); // you can enable this if you have lot of data

       $line_of_text[] = fgetcsv($file_handle, 1024);
   }
   fclose($file_handle);
   return $line_of_text;
 }

// Set path to CSV file
$csvFile = 'my_records.csv';

$csv = readCSV($csvFile);

for($i=1;$i<count($csv);$i++)
{
   $user_email= $csv[$i][1];

   $qry = "SELECT u.user_id, u.user_email_id FROM tbl_user as u WHERE u.user_email_id = '".$user_email."'";

   $result = @mysql_query($qry) or die("Couldn't execute query:".mysql_error().''.mysql_errno());

   $rec = @mysql_fetch_row($result);

   if($rec)
   {
      echo "Record exist";
   }
   else
   {
      echo "Record not exist"; 
   }
}

Note: I only want to list the records that do not exist in the table.

Please suggest a solution.

2020-05-29

1 answer

一尘不染

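First, you should understand that when you use file_get_contents you fetch the entire string of data into a variable, and that variable is stored in the host's memory. The readCSV function in the question has the same effect, because it collects every row into a single array before any processing starts.

If that data is larger than the memory allotted to the PHP process, PHP will halt and display the error message above.

The way around this is to open the file as a pointer and then take one chunk at a time. That way, if you have a 500MB file, you can read the first 1MB of data, process it, release that 1MB from memory, and then replace it with the next 1MB. This lets you control how much data is held in memory at any one time.

An example of this can be seen below. I wrote a function that works in a way similar to node.js: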

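function file_get_contents_chunked($file,$chunk_size,$callback)
{
    // fopen() returns false on failure rather than throwing,
    // so check the handle before entering the read loop.
    $handle = @fopen($file, "r");
    if ($handle === false)
    {
        trigger_error("file_get_contents_chunked::could not open " . $file,E_USER_NOTICE);
        return false;
    }

    try
    {
        $i = 0;
        while (!feof($handle))
        {
            // Pass the chunk, the handle (by reference) and the chunk index to the callback
            call_user_func_array($callback,array(fread($handle,$chunk_size),&$handle,$i));
            $i++;
        }
    }
    catch(Exception $e)
    {
        // An exception thrown by the callback must not leak the open handle
        fclose($handle);
        trigger_error("file_get_contents_chunked::" . $e->getMessage(),E_USER_NOTICE);
        return false;
    }

    fclose($handle);
    return true;
}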

Then use it like so:

$success = file_get_contents_chunked("my/large/file",4096,function($chunk,&$handle,$iteration){
    /*
        * Do what you will with the {$chunk} here
        * {$handle} is passed in case you want to seek
        ** to different parts of the file
        * {$iteration} is the index of the chunk that has just been read, so
        * ($iteration * 4096) is the offset of this chunk within the file.
    */

});

if(!$success)
{
    //It Failed
}

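One of the problems you will run into is that you are trying to run a regex several times over an extremely large piece of data. Not only that, but your regex is built to match the entire file.

With the method above your regex could become useless, because a chunk may contain only half of the data you want to match. What you should do is fall back to the native string functions such as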

strpos
substr
trim
explode

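for matching the strings. I have also added support for passing the handle and the current iteration to the callback. This lets you work with the file directly inside the callback, using functions such as fseek, ftruncate and fwrite, for instance.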
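Because a fixed-size chunk can end in the middle of a line, one way to keep matches whole is to carry the unfinished tail over into the next chunk. The following is only a minimal sketch of that idea using strrpos, substr and explode. The function name each_line_chunked and its parameters are hypothetical and not part of the function above, and it assumes lines end in "\n" or "\r\n".

```php
<?php
// Hypothetical helper (an illustration, not the original answer's code):
// reads $file in $chunkSize-byte blocks and passes each complete line to $onLine.
function each_line_chunked($file, $chunkSize, callable $onLine)
{
    $handle = fopen($file, 'r');
    if ($handle === false) {
        return false;
    }

    $carry = ''; // unfinished line left over from the previous chunk
    while (!feof($handle)) {
        $chunk = fread($handle, $chunkSize);
        if ($chunk === false || $chunk === '') {
            break;
        }
        $buffer = $carry . $chunk;

        // Everything up to the last newline is complete; the rest is carried over.
        $lastNewline = strrpos($buffer, "\n");
        if ($lastNewline === false) {
            $carry = $buffer;
            continue;
        }
        // The cast guards against substr() returning false on older PHP versions.
        $carry = (string) substr($buffer, $lastNewline + 1);

        foreach (explode("\n", substr($buffer, 0, $lastNewline)) as $line) {
            $onLine(rtrim($line, "\r"));
        }
    }
    fclose($handle);

    // A final line without a trailing newline is still a line.
    if ($carry !== '') {
        $onLine(rtrim($carry, "\r"));
    }
    return true;
}
```

Only one chunk plus one partial line is held in memory at a time, so the chunk size still bounds memory use as long as individual lines are short.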

The way you are building your string manipulation is not efficient at all, and the method proposed above is by far the better approach.
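Applied to the CSV in the question, the same idea could look roughly like the sketch below: rows are read one at a time with fgetcsv, and e-mails are checked in batches rather than with one query per row. This is an illustration under assumptions, not a definitive implementation. The names read_csv_rows, find_missing_emails and $findExisting are hypothetical, the e-mail is assumed to be in column index 1, and the first row is assumed to be a header (the question's loop starts at index 1). The database lookup is left as a callback so the sketch does not depend on a particular MySQL extension.

```php
<?php
// Yields one parsed CSV row at a time instead of building one big array.
function read_csv_rows($csvFile)
{
    $handle = fopen($csvFile, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $csvFile");
    }
    try {
        while (($row = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
            if ($row === array(null)) {
                continue; // fgetcsv() reports a blank line as array(null)
            }
            yield $row;
        }
    } finally {
        fclose($handle);
    }
}

// Hypothetical helper: $findExisting stands in for the database lookup. It
// receives a batch of e-mails and must return those that exist in the table.
function find_missing_emails($csvFile, callable $findExisting, $batchSize = 500)
{
    $missing = array();
    $batch = array();
    $isHeader = true;

    foreach (read_csv_rows($csvFile) as $row) {
        if ($isHeader) {        // assumption: the first row is a header
            $isHeader = false;
            continue;
        }
        if (!isset($row[1])) {  // assumption: e-mail is in column index 1
            continue;
        }
        $batch[] = trim($row[1]);

        if (count($batch) >= $batchSize) {
            $missing = array_merge($missing, array_diff($batch, $findExisting($batch)));
            $batch = array();
        }
    }
    if ($batch) { // check whatever is left after the last full batch
        $missing = array_merge($missing, array_diff($batch, $findExisting($batch)));
    }
    return $missing;
}
```

With 25,000 rows and a batch size of 500, the callback runs about 50 times instead of 25,000. In a real script the callback would typically run one prepared statement per batch with a WHERE user_email_id IN (...) clause and return the matching addresses.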

Hope this helps.

2020-05-29