我正在寻找一种方法来计算非常大的文件的SHA-1校验和,而不必一次将它们完全加载到内存中。
我不知道SHA-1实施的细节,因此想知道是否有可能这样做。
如果您知道SAX XML解析器,那么我将寻找类似的东西:通过始终始终仅将一小部分加载到内存中来计算SHA-1校验和。
我发现的所有示例(至少在Java中)始终依赖于将文件/字节数组/字符串完全加载到内存中。
如果您甚至知道实现(任何语言),请告诉我!
是的,这完全有可能。SHA-1是一种块算法,因此它一次只能处理一个块。确切的块大小随生成的哈希大小而变化,但始终很小(例如20-50个字节左右)。当然,那是假设您要包括SHA-1 256、384和/或512(又名SHA-256,SHA-384和SHA-512)。如果仅包含原始的160位版本,则块大小始终为20字节(160位)。
编辑:哎呀-重新读取规格,SHA-1,SHA-224,SHA-256的块大小为512位,SHA-384和SHA-512的块大小为1024位。谢谢查尔斯!
Edit2:我几乎忘记了他在寻找源代码的部分,而不仅仅是建议。这是一些代码。首先是标题:
// Sha.h: #ifndef SHA_1_H_INCLUDED #define SHA_1_H_INCLUDED // This is a relatively straightforward implementation of SHA-1. It makes no particular // attempt at optimization, instead aiming toward easy verification against the standard. // To that end, many of the variable names are identical to those used in FIPS 180-2 and // FIPS 180-3. // // The code should be fairly portable, within a few limitations: // 1. It requires that 'char' have 8 bits. In theory this is avoidable, but I don't think // it's worth the bother. // 2. It only deals with inputs in (8-bit) bytes. In theory, SHA-1 can deal with a number of // bits that's not a multiple of 8, but I've never needed it. Since the padding always results // in a byte-sized stream, the only parts that would need changing would be reading and padding // the input. The main hashing portion would be unaffected. // // Compiles cleanly with: // MS VC++ 9.0SP1 (x86 or x64): -W4 -Za // gc++ 3.4: -ansi -pedantic -Wall // comeau 4.3.3: --vc71 // Appears to work corectly in all cases. // You can't use maximum warnings with Comeau though -- this code itself doesn't give problems // (that I know of) but Microsoft's headers give it *major* heartburn. // // // Written by Jerry Coffin, February 2008 // // You can use this software any way you want to, with following limitations // (shamelessly stolen from the Boost software license): // // THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR // IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, // FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT. IN NO EVENT // SHALL THE COPYRIGHT HOLDERS OR ANYONE DISTRIBUTING THE SOFTWARE BE LIABLE // FOR ANY DAMAGES OR OTHER LIABILITY, WHETHER IN CONTRACT, TORT OR OTHERWISE, // ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER // DEALINGS IN THE SOFTWARE. // // If you put this to real use, I'd be happy to hear about it. If you find a bug, // I'd be interested in hearing about that too. There's even a pretty good chance // that I'll try to fix it, though I certainly can't guarantee that. // #include <algorithm> #include <vector> #include <string> #include <assert.h> #include <iostream> #include <sstream> #include <iomanip> #if defined(_MSC_VER) && _MSC_VER < 1600 typedef unsigned int uint32_t; typedef unsigned __int64 uint64_t; #else #include <stdint.h> #endif namespace crypto { namespace { struct ternary_operator { virtual uint32_t operator()(uint32_t x, uint32_t y, uint32_t z) = 0; }; } class sha1 { static const size_t hash_size = 5; static const size_t min_pad = 64; static const size_t block_bits = 512; static const size_t block_bytes = block_bits / 8; static const size_t block_words = block_bytes / 4; std::vector<uint32_t> K; std::vector<uint32_t> H; std::vector<uint32_t> W; std::vector<ternary_operator *> fs; uint32_t a, b, c, d, e, T; static const size_t block_size = 16; static const size_t bytes_per_word = 4; size_t total_size; // hash a 512-bit block of input. // void hash_block(std::vector<uint32_t> const &block); // Pad the input to a multiple of 512 bits, and add the length // in binary to the end. static std::string pad(std::string const &input, size_t size); // Turn 64 bytes into a block of 16 uint32_t's. std::vector<uint32_t> make_block(std::string const &in); public: // Construct a SHA-1 object. More expensive that typical // ctor, but not expected to be copied a lot or anything // like that, so it should be fairly harmless. sha1(); // The two ways to provide input for hashing: as a stream or a string. // Either way, you get the result as a vector<uint32_t>. It's a fairly // small vector, so even if your compiler doesn't do return-value // optimization, the time taken for copying it isn't like to be // significant. // std::vector<uint32_t> operator()(std::istream &in); std::vector<uint32_t> operator()(std::string const &input); friend std::ostream &operator<<(std::ostream &os, sha1 const &s); }; } #endif
并执行:
// Sha1.cpp: #include "sha.h" // Please see comments in sha.h for licensing information, etc. // // Many people don't like the names I usually use for namespaces, so I've kept this one // short and simple. // namespace crypto { namespace { uint32_t ROTL(uint32_t const &value, unsigned bits) { uint32_t mask = (1 << bits) - 1; return value << bits | (value >> (32-bits))&mask; } struct f1 : ternary_operator { uint32_t operator()(uint32_t x, uint32_t y, uint32_t z) { return (x & y) ^ (~x&z); } }; struct f2 : ternary_operator { uint32_t operator()(uint32_t x, uint32_t y, uint32_t z) { return x ^ y ^ z; } }; struct f3 : ternary_operator { uint32_t operator()(uint32_t x, uint32_t y, uint32_t z) { return (x&y) ^ (x&z) ^ (y&z); } }; uint32_t word(int a, int b, int c, int d) { a &= 0xff; b &= 0xff; c &= 0xff; d &= 0xff; int val = a << 24 | b << 16 | c << 8 | d; return val; } } // hash a 512-bit block of input. // void sha1::hash_block(std::vector<uint32_t> const &block) { assert(block.size() == block_words); int t; std::copy(block.begin(), block.end(), W.begin()); for (t=16; t<80; t++) { W[t] = ROTL(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16], 1); } a = H[0]; b = H[1]; c = H[2]; d = H[3]; e = H[4]; for (t=0; t<80; t++) { T = ROTL(a, 5) + (*fs[t])(b, c, d) + e + K[t] + W[t]; e = d; d = c; c = ROTL(b, 30); b = a; a = T; } H[0] += a; H[1] += b; H[2] += c; H[3] += d; H[4] += e; } // Pad the input to a multiple of 512 bits, and put the length // in binary at the end. std::string sha1::pad(std::string const &input, size_t size) { size_t length = size * 8 + 1; size_t remainder = length % block_bits; size_t pad_len = block_bits-remainder; if (pad_len < min_pad) pad_len += block_bits; ++pad_len; pad_len &= ~7; std::string padding(pad_len/8, '\0'); for (size_t i=0; i<sizeof(padding.size()); i++) padding[padding.size()-i-1] = (length-1) >> (i*8) & 0xff; padding[0] |= (unsigned char)0x80; std::string ret(input+padding); return ret; } // Turn 64 bytes into a block of 16 uint32_t's. std::vector<uint32_t> sha1::make_block(std::string const &in) { assert(in.size() >= block_bytes); std::vector<uint32_t> ret(block_words); for (size_t i=0; i<block_words; i++) { size_t s = i*4; ret[i] = word(in[s], in[s+1], in[s+2], in[s+3]); } return ret; } // Construct a SHA-1 object. More expensive that typical // ctor, but not expected to be copied a lot or anything // like that, so it should be fairly harmless. sha1::sha1() : K(80), H(5), W(80), fs(80), total_size(0) { static const uint32_t H0[] = { 0x67452301, 0xefcdab89, 0x98badcfe, 0x10325476, 0xc3d2e1f0 }; static const uint32_t Ks[] = { 0x5a827999, 0x6ed9eba1, 0x8f1bbcdc, 0xca62c1d6 }; std::copy(H0, H0+hash_size, H.begin()); std::fill_n(K.begin()+00, 20, Ks[0]); std::fill_n(K.begin()+20, 20, Ks[1]); std::fill_n(K.begin()+40, 20, Ks[2]); std::fill_n(K.begin()+60, 20, Ks[3]); static f1 sf1; static f2 sf2; static f3 sf3; std::fill_n(fs.begin()+00, 20, &sf1); std::fill_n(fs.begin()+20, 20, &sf2); std::fill_n(fs.begin()+40, 20, &sf3); std::fill_n(fs.begin()+60, 20, &sf2); } // The two ways to provide input for hashing: as a stream or a string. // Either way, you get the result as a vector<uint32_t>. It's a fairly // small vector, so even if your compiler doesn't do return-value // optimization, the time taken for copying it isn't like to be // significant. // std::vector<uint32_t> sha1::operator()(std::string const &input) { std::string temp(pad(input, total_size + input.size())); std::vector<uint32_t> block(block_size); size_t num = temp.size()/block_bytes; for (unsigned block_num=0; block_num<num; block_num++) { size_t s; for (size_t i=0; i<block_size; i++) { s = block_num*block_bytes+i*4; block[i] = word(temp[s], temp[s+1], temp[s+2], temp[s+3]); } hash_block(block); } return H; } std::vector<uint32_t> sha1::operator()(std::istream &in) { char raw_block[65]; while (in.read(raw_block, block_bytes)) { total_size += block_bytes; std::string b(raw_block, in.gcount()); hash_block(make_block(b)); } std::string x(raw_block, in.gcount()); return operator()(x); } std::ostream &operator<<(std::ostream &os, sha1 const &s) { // Display a SHA-1 result in hex. for (size_t i=0; i<(s.H).size(); i++) os << std::fixed << std::setprecision(8) << std::hex << std::setfill('0') << (s.H)[i] << " "; return os << std::dec << std::setfill(' ') << "\n"; } } #ifdef TEST #include <iostream> #include <iomanip> #include <string> #include <sstream> // A minimal test harness to check that it's working correctly. Strictly black-box // testing, with no attempt at things like coverage analysis. Nonetheless, I believe // it should cover most of the code -- the core hashing code all gets used for every // possible value. The padding code should be tested fairly thoroughly as well -- the // first test is a fairly simple case, and the second the more complex one (where the // padding requires adding another block). class tester { bool verify(uint32_t *test_val, std::vector<uint32_t> const &hash, std::ostream &os) { // Verify that a result matches a test value and report result. for (size_t i=0; i<hash.size(); i++) if (hash[i] != test_val[i]) { os << "Mismatch. Expected: " << test_val[i] << ", but found: " << hash[i] << "\n"; return false; } os << "Message digest Verified.\n\n"; return true; } public: bool operator()(uint32_t *test_val, std::string const &input) { std::cout << "Testing hashing from string:\n\"" << input << "\"\n"; crypto::sha1 hasher1; std::vector<uint32_t> hash = hasher1(input); std::cout << "Message digest is:\n\t" << hasher1; bool verified = verify(test_val, hash, std::cerr); crypto::sha1 hasher2; std::cout << "Testing hashing from Stream:\n"; std::istringstream buf(input); hash = hasher2(buf); std::cout << "Message digest is:\n\t" << hasher2; return verified & verify(test_val, hash, std::cerr); } }; int main() { // These test values and results come directly from the FIPS pub. // char const *input1 = "abc"; char const *input2 = "abcdbcdecdefdefgefghfghighijhijkijkljklmklmnlmnomnopnopq"; uint32_t result1[] = {0xA9993E36, 0x4706816A, 0xBA3E2571, 0x7850C26C, 0x9CD0D89D}; uint32_t result2[] = {0x84983E44, 0x1C3BD26E, 0xBAAE4AA1, 0xF95129E5, 0xE54670F1 }; bool correct = tester()(result1, input1); correct &= tester()(result2, input2); if (correct) std::cerr << "All Tests passed!\n"; else std::cerr << "Test Failed!\n"; } #elif defined(MAIN) #include <sstream> #include <fstream> #include <iostream> int main(int argc, char **argv) { if (argc < 2) { std::cerr << "Usage: sha1 [filename]\n"; return EXIT_FAILURE; } for (int i=1; i<argc; i++) { crypto::sha1 hash; std::ifstream in(argv[i], std::ios_base::binary); if (in.good()) { hash(in); std::cout << "SHA-1(" << argv[i] << ") = " << hash << "\n"; } } return 0; } #endif