How do I clone only a subdirectory of a Git repository?

一尘不染

I have a Git repository with two subdirectories at its root:

/finisht
/static

When it was in SVN, /finisht was checked out in one place and /static was checked out elsewhere, like so:

svn co svn+ssh://admin@domain.com/home/admin/repos/finisht/static static

Is there a way to do this with Git?

2022-02-18

2 answers

What you are trying to do is called a sparse checkout, and that feature was added in Git 1.7.0 (February 2010). The steps to do a sparse clone are as follows:

mkdir <repo>
cd <repo>
git init
git remote add -f origin <url>

This creates an empty repository with your remote configured, and fetches all objects but does not check them out. Then do:

git config core.sparseCheckout true

Now you need to define which files/folders you actually want to check out. This is done by listing them in .git/info/sparse-checkout, for example:

echo "some/dir/" >> .git/info/sparse-checkout
echo "another/sub/tree" >> .git/info/sparse-checkout

Last but not least, update your empty repo with the state from the remote:

git pull origin master

You will now have the files for some/dir and another/sub/tree "checked out" on your file system (with those paths preserved), and no other paths present.
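
The steps above can be tried end to end without a network connection. The following is only a sketch: the scratch directory, the upstream repository and its two directories are made up for the demonstration, and a local path stands in for the real remote URL.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical scratch area; nothing here comes from the original answer.
work="$(mktemp -d)"
cd "$work"

# Build a tiny stand-in "remote" with two top-level directories.
git init --quiet upstream
git -C upstream symbolic-ref HEAD refs/heads/master  # force the branch name used below
mkdir -p upstream/some/dir upstream/other
echo keep > upstream/some/dir/file
echo skip > upstream/other/file
git -C upstream add .
git -C upstream -c user.name=demo -c user.email=demo@example.com \
  commit --quiet -m 'init'

# The steps from the answer, pointed at the local stand-in remote.
mkdir sparse
cd sparse
git init --quiet
git remote add -f origin "$work/upstream"
git config core.sparseCheckout true
mkdir -p .git/info  # normally created by git init; kept for safety
echo "some/dir/" >> .git/info/sparse-checkout
git pull --quiet origin master

# Only the listed path should be present in the working tree.
ls
```

If this behaves as described, the working tree contains some/dir/file while other/ is never written to disk, even though its objects were fetched.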

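You might want to have a look at the extended tutorial, and you should probably read the official documentation for sparse checkout and read-tree.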

As a function:

function git_sparse_clone() (
  rurl="$1" localdir="$2" && shift 2

  mkdir -p "$localdir"
  cd "$localdir"

  git init
  git remote add -f origin "$rurl"

  git config core.sparseCheckout true

  # Loops over remaining args
  for i; do
    echo "$i" >> .git/info/sparse-checkout
  done

  git pull origin master
)

Usage:

git_sparse_clone "http://github.com/tj/n" "./local/location" "/bin"
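
The function hard-codes master in its final git pull, which would not match a repository whose default branch has another name. One possible way to look the name up first is sketched below. The helper remote_default_branch is hypothetical (it is not part of the answer), and the sketch assumes a Git version that supports git ls-remote --symref and a remote that advertises HEAD as a symbolic ref.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical helper (not part of the answer): print the branch that the
# remote's HEAD points to, e.g. "master" or "main".
remote_default_branch() {
  git ls-remote --symref "$1" HEAD |
    awk '$1 == "ref:" { sub("^refs/heads/", "", $2); print $2 }'
}

# Demo against a throwaway local repository standing in for the remote.
demo="$(mktemp -d)"
git init --quiet "$demo"
git -C "$demo" symbolic-ref HEAD refs/heads/trunk
git -C "$demo" -c user.name=demo -c user.email=demo@example.com \
  commit --quiet --allow-empty -m 'init'

branch="$(remote_default_branch "$demo")"
echo "default branch: $branch"
```

The printed name could then replace master in the git pull origin master line of the function.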

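Note that this still downloads the whole repository from the server; only the checkout is reduced in size. At the moment it is not possible to clone only a single directory. But if you don't need the history of the repository, you can at least save bandwidth by creating a shallow clone.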
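
As a rough illustration of the shallow clone mentioned above, the sketch below builds a throwaway three-commit repository (hypothetical content) and clones it with --depth 1. A file:// URL is used because Git ignores --depth for plain local paths.

```shell
#!/usr/bin/env bash
set -eu

work="$(mktemp -d)"
cd "$work"

# Throwaway "remote" with three commits (hypothetical content).
git init --quiet upstream
for n in 1 2 3; do
  echo "$n" > upstream/file
  git -C upstream add file
  git -C upstream -c user.name=demo -c user.email=demo@example.com \
    commit --quiet -m "commit $n"
done

# Fetch only the most recent commit of the default branch.
git clone --quiet --depth 1 "file://$work/upstream" shallow

full="$(git -C upstream rev-list --count HEAD)"
shallow="$(git -C shallow rev-list --count HEAD)"
echo "upstream commits: $full, shallow clone commits: $shallow"
```

The shallow clone still contains every file of the latest commit; only the older history is left out.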

As of Git 2.25.0 (January 2020), an experimental sparse-checkout command has been added to Git:

git sparse-checkout init
# same as: 
# git config core.sparseCheckout true

git sparse-checkout set "A/B"
# same as:
# echo "A/B" >> .git/info/sparse-checkout

git sparse-checkout list
# same as:
# cat .git/info/sparse-checkout
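
To see these three subcommands in action, here is a small sketch against a throwaway local repository. The directory names d1 and d2 and the scratch paths are made up for the demonstration, and it assumes Git 2.25 or later; whether init and set use cone or non-cone patterns depends on the Git version, but the resulting working tree should be the same for this layout.

```shell
#!/usr/bin/env bash
set -eu

work="$(mktemp -d)"
cd "$work"

# Throwaway "remote" with two directories (hypothetical content).
git init --quiet upstream
mkdir upstream/d1 upstream/d2
echo a > upstream/d1/a
echo b > upstream/d2/b
git -C upstream add .
git -C upstream -c user.name=demo -c user.email=demo@example.com \
  commit --quiet -m 'init'

# Ordinary full clone, then narrow the working tree to d1.
git clone --quiet "$work/upstream" clone
cd clone
git sparse-checkout init
git sparse-checkout set d1

# Show the active patterns and what is left in the working tree.
git sparse-checkout list
ls
```

After set, d2/ should disappear from the working tree while its objects stay in .git.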

git clone --filter from Git 2.19 now works on GitHub (tested 2021-01-14, Git 2.30.0)

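This option was added together with an update to the remote protocol, and it truly prevents objects from being downloaded from the server.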

For example, to clone only the objects required for d1 of this minimal test repository: https://github.com/cirosantilli/test-git-partial-clone I can do:

git clone \
  --depth 1  \
  --filter=blob:none  \
  --sparse \
  https://github.com/cirosantilli/test-git-partial-clone \
;
cd test-git-partial-clone
git sparse-checkout set d1

Here is a less minimal and more realistic version at https://github.com/cirosantilli/test-git-partial-clone-big-small

git clone \
  --depth 1  \
  --filter=blob:none  \
  --sparse \
  https://github.com/cirosantilli/test-git-partial-clone-big-small \
;
cd test-git-partial-clone-big-small
git sparse-checkout set small

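That repository contains:

a big directory with 10 files of 10 MB each
a small directory with 1000 files of 1 byte each

All contents are pseudo-random and therefore incompressible.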

Clone times on my 36.4 Mbps internet connection:

full: 24s
partial: "instantaneous"

Unfortunately, the sparse-checkout part is also needed. You can also download only certain files with the much more understandable:

git clone \
  --depth 1  \
  --filter=blob:none  \
  --no-checkout \
  https://github.com/cirosantilli/test-git-partial-clone \
;
cd test-git-partial-clone
git checkout master -- d1

but for some reason that method downloads the files one by one very slowly, making it unusable unless the directory contains very few files.

Analysis of the objects in the minimal repository

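The clone command obtains only:

a single commit object with the tip of the master branch
all 4 tree objects of the repository:
  the top-level directory of the commit
  the three directories d1, d2, master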

Then, the git sparse-checkout set command fetches only the missing blobs (files) from the server:

d1/a
d1/b

Even better, later on GitHub would likely start supporting:

  --filter=blob:none \
  --filter=tree:0 \

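Here --filter=tree:0, available from Git 2.20, would prevent clone from unnecessarily fetching all tree objects and allow that to be deferred to checkout. But in my 2020-09-18 test it fails with: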

fatal: invalid filter-spec 'combine:blob:none+tree:0'

presumably because the --filter=combine: composite filter (added in Git 2.24, and implied by passing multiple --filter options) is not yet implemented.

I observed which objects were fetched with:

git verify-pack -v .git/objects/pack/*.pack

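as mentioned at: How to list ALL git objects in the database? It does not give a very clear indication of what each object is exactly, but it does say the type of each object (commit, tree, blob), and since there are so few objects in that minimal repo, I can unambiguously deduce what each object is.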

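git rev-list --objects --all did produce clearer output with paths for trees/blobs, but unfortunately it fetches some objects when I run it, which makes it hard to determine what was fetched when. Let me know if anyone has a better command.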
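
One candidate worth trying is git cat-file --batch-check --batch-all-objects. As far as I understand, it only enumerates objects already present in the local object database, so it should not trigger on-demand fetches; it prints the type and size of each object but no paths. The sketch below tries it on a throwaway local partial clone (the repository layout and scratch paths are made up for the demonstration).

```shell
#!/usr/bin/env bash
set -eu

work="$(mktemp -d)"
cd "$work"

# Throwaway "server" repository that allows filtered fetches.
git init --quiet upstream
git -C upstream config uploadpack.allowfilter 1
mkdir upstream/d1
echo a > upstream/d1/a
git -C upstream add .
git -C upstream -c user.name=demo -c user.email=demo@example.com \
  commit --quiet -m 'init'

# Blobless partial clone; nothing is checked out, so no blob is requested.
git clone --quiet --no-checkout --filter=blob:none \
  "file://$work/upstream" partial

# One line per locally available object: <sha> <type> <size>
objects="$(git -C partial cat-file --batch-check --batch-all-objects)"
echo "$objects"
```

In this setup the listing should contain the commit and the tree objects only, which matches the idea that blob:none leaves blobs on the server until they are needed.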

TODO find the GitHub announcement that says when they started supporting it. https://github.blog/2020-01-17-bring-your-monorepo-down-to-size-with-sparse-checkout/ from 2020-01-17 already mentions --filter blob:none.

git sparse-checkout

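I think this command is meant to manage a settings file that says "I only care about these subtrees" so that future commands will only affect those subtrees. But it is a bit hard to be sure because the current documentation is a bit... sparse ;-)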

It does not, by itself, prevent the fetching of blobs.

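If this understanding is correct, then it would be a good complement to git clone --filter described above, as it would prevent unintentionally fetching more objects if you intend to do Git operations in the partially cloned repo.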

When I tried on Git 2.25.1:

git clone \
  --depth 1 \
  --filter=blob:none \
  --no-checkout \
  https://github.com/cirosantilli/test-git-partial-clone \
;
cd test-git-partial-clone
git sparse-checkout init

it didn't work because the init actually fetched all objects.

However, in Git 2.28 it did not fetch the objects, as desired. But then if I do:

git sparse-checkout set d1

d1 is not fetched and checked out, even though this explicitly says it should: https://github.blog/2020-01-17-bring-your-monorepo-down-to-size-with-sparse-checkout/#sparse-checkout-and-partial-clones with the disclaimer:

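Keep an eye out for the partial clone feature becoming generally available[1].

[1]: GitHub is still evaluating this feature internally while it's enabled on a select few repositories (including the example used in this post). As the feature stabilizes and matures, we'll keep you updated with its progress.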

So yes, it is just too hard to be certain at the moment, thanks in part to the joys of GitHub being closed source. But let's keep an eye on it.

Command breakdown

The server should be configured with:

git config --local uploadpack.allowfilter 1
git config --local uploadpack.allowanysha1inwant 1

Command breakdown:

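--filter=blob:none skips all blobs, but still fetches all tree objects
--filter=tree:0 skips the unneeded trees: https://www.spinics.net/lists/git/msg342006.html
--depth 1 already implies --single-branch
file://$(path) is required to work around how git clone treats local paths: How to shallow clone a local git repository with a relative path?
--filter=combine:FILTER1+FILTER2 is the syntax to use multiple filters at once; passing --filter several times fails for some reason with: "multiple filter-specs cannot be combined". This was added in Git 2.24 at e987df5fe62b8b29be4cdcdeb3704681ada2b29e "list-objects-filter: implement composite filters"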

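Edit: on Git 2.28, I experimentally see that --filter=FILTER1 --filter FILTER2 also has the same effect, since GitHub had not implemented combine: as of 2020-09-18 and complains with fatal: invalid filter-spec 'combine:blob:none+tree:0'. TODO in which version was this introduced?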

The format of --filter is documented in man git-rev-list.
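
To make the difference between the two filters concrete, the sketch below clones a throwaway local repository twice and counts the objects that end up in each clone. The repository layout, the scratch paths and the count_objects helper are made up for the demonstration, and it assumes a Git version with --filter=tree:0 support (2.20 or later according to the text above).

```shell
#!/usr/bin/env bash
set -eu

work="$(mktemp -d)"
cd "$work"

# Throwaway "server" configured as described above.
git init --quiet upstream
git -C upstream config uploadpack.allowfilter 1
git -C upstream config uploadpack.allowanysha1inwant 1
mkdir upstream/d1 upstream/d2
echo a > upstream/d1/a
echo b > upstream/d2/b
git -C upstream add .
git -C upstream -c user.name=demo -c user.email=demo@example.com \
  commit --quiet -m 'init'

# Hypothetical helper: number of objects stored locally in a clone.
count_objects() {
  git -C "$1" cat-file --batch-check --batch-all-objects | wc -l | tr -d ' '
}

git clone --quiet --depth 1 --no-checkout --filter=blob:none \
  "file://$work/upstream" blobless
git clone --quiet --depth 1 --no-checkout --filter=tree:0 \
  "file://$work/upstream" treeless

blobless_count="$(count_objects blobless)"
treeless_count="$(count_objects treeless)"
echo "blob:none objects: $blobless_count"
echo "tree:0 objects:    $treeless_count"
```

The blob:none clone should hold the commit plus every tree, while the tree:0 clone should hold fewer objects because the trees are deferred as well.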

Test it out locally

The following script reproducibly generates the https://github.com/cirosantilli/test-git-partial-clone repository locally, does a local clone, and observes what was cloned:

#!/usr/bin/env bash
set -eu

list-objects() (
  git rev-list --all --objects
  echo "master commit SHA: $(git log -1 --format="%H")"
  echo "mybranch commit SHA: $(git log -1 --format="%H")"
  git ls-tree master
  git ls-tree mybranch | grep mybranch
  git ls-tree master~ | grep root
)

# Reproducibility.
export GIT_COMMITTER_NAME='a'
export GIT_COMMITTER_EMAIL='a'
export GIT_AUTHOR_NAME='a'
export GIT_AUTHOR_EMAIL='a'
export GIT_COMMITTER_DATE='2000-01-01T00:00:00+0000'
export GIT_AUTHOR_DATE='2000-01-01T00:00:00+0000'

rm -rf server_repo local_repo
mkdir server_repo
cd server_repo

# Create repo.
git init --quiet
git config --local uploadpack.allowfilter 1
git config --local uploadpack.allowanysha1inwant 1

# First commit.
# Directories present in all branches.
mkdir d1 d2
printf 'd1/a' > ./d1/a
printf 'd1/b' > ./d1/b
printf 'd2/a' > ./d2/a
printf 'd2/b' > ./d2/b
# Present only in root.
mkdir 'root'
printf 'root' > ./root/root
git add .
git commit -m 'root' --quiet

# Second commit only on master.
git rm --quiet -r ./root
mkdir 'master'
printf 'master' > ./master/master
git add .
git commit -m 'master commit' --quiet

# Second commit only on mybranch.
git checkout -b mybranch --quiet master~
git rm --quiet -r ./root
mkdir 'mybranch'
printf 'mybranch' > ./mybranch/mybranch
git add .
git commit -m 'mybranch commit' --quiet

echo "# List and identify all objects"
list-objects
echo

# Restore master.
git checkout --quiet master
cd ..

# Clone. Don't checkout for now, only .git/ dir.
git clone --depth 1 --quiet --no-checkout --filter=blob:none "file://$(pwd)/server_repo" local_repo
cd local_repo

# List missing objects from master.
echo "# Missing objects after --no-checkout"
git rev-list --all --quiet --objects --missing=print
echo

echo "# Git checkout fails without internet"
mv ../server_repo ../server_repo.off
! git checkout master
echo

echo "# Git checkout fetches the missing directory from internet"
mv ../server_repo.off ../server_repo
git checkout master -- d1/
echo

echo "# Missing objects after checking out d1"
git rev-list --all --quiet --objects --missing=print

GitHub upstream.

Output in Git v2.19.0:

# List and identify all objects
c6fcdfaf2b1462f809aecdad83a186eeec00f9c1
fc5e97944480982cfc180a6d6634699921ee63ec
7251a83be9a03161acde7b71a8fda9be19f47128
62d67bce3c672fe2b9065f372726a11e57bade7e
b64bf435a3e54c5208a1b70b7bcb0fc627463a75 d1
308150e8fddde043f3dbbb8573abb6af1df96e63 d1/a
f70a17f51b7b30fec48a32e4f19ac15e261fd1a4 d1/b
84de03c312dc741d0f2a66df7b2f168d823e122a d2
0975df9b39e23c15f63db194df7f45c76528bccb d2/a
41484c13520fcbb6e7243a26fdb1fc9405c08520 d2/b
7d5230379e4652f1b1da7ed1e78e0b8253e03ba3 master
8b25206ff90e9432f6f1a8600f87a7bd695a24af master/master
ef29f15c9a7c5417944cc09711b6a9ee51b01d89
19f7a4ca4a038aff89d803f017f76d2b66063043 mybranch
1b671b190e293aa091239b8b5e8c149411d00523 mybranch/mybranch
c3760bb1a0ece87cdbaf9a563c77a45e30a4e30e
a0234da53ec608b54813b4271fbf00ba5318b99f root
93ca1422a8da0a9effc465eccbcb17e23015542d root/root
master commit SHA: fc5e97944480982cfc180a6d6634699921ee63ec
mybranch commit SHA: fc5e97944480982cfc180a6d6634699921ee63ec
040000 tree b64bf435a3e54c5208a1b70b7bcb0fc627463a75    d1
040000 tree 84de03c312dc741d0f2a66df7b2f168d823e122a    d2
040000 tree 7d5230379e4652f1b1da7ed1e78e0b8253e03ba3    master
040000 tree 19f7a4ca4a038aff89d803f017f76d2b66063043    mybranch
040000 tree a0234da53ec608b54813b4271fbf00ba5318b99f    root

# Missing objects after --no-checkout
?f70a17f51b7b30fec48a32e4f19ac15e261fd1a4
?8b25206ff90e9432f6f1a8600f87a7bd695a24af
?41484c13520fcbb6e7243a26fdb1fc9405c08520
?0975df9b39e23c15f63db194df7f45c76528bccb
?308150e8fddde043f3dbbb8573abb6af1df96e63

# Git checkout fails without internet
fatal: '/home/ciro/bak/git/test-git-web-interface/other-test-repos/partial-clone.tmp/server_repo' does not appear to be a git repository
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

# Git checkout fetches the missing directory from internet
remote: Enumerating objects: 1, done.
remote: Counting objects: 100% (1/1), done.
remote: Total 1 (delta 0), reused 0 (delta 0)
Receiving objects: 100% (1/1), 45 bytes | 45.00 KiB/s, done.
remote: Enumerating objects: 1, done.
remote: Counting objects: 100% (1/1), done.
remote: Total 1 (delta 0), reused 0 (delta 0)
Receiving objects: 100% (1/1), 45 bytes | 45.00 KiB/s, done.

# Missing objects after checking out d1
?8b25206ff90e9432f6f1a8600f87a7bd695a24af
?41484c13520fcbb6e7243a26fdb1fc9405c08520
?0975df9b39e23c15f63db194df7f45c76528bccb

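Conclusion: all blobs from outside d1/ are missing. For example, 0975df9b39e23c15f63db194df7f45c76528bccb, which is d2/a in the object listing above, is not present after checking out d1/.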

Note that root/root and mybranch/mybranch are also missing, but --depth 1 hides that from the list of missing files. If you remove --depth 1, they show up in the list of missing files.

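I have a dream

This feature could revolutionize Git.

Imagine having all the code base of your enterprise in a single repo without ugly third-party tools like repo.

Imagine storing huge blobs directly in the repo without any ugly third-party extensions.

Imagine if GitHub allowed per-file/directory metadata such as stars and permissions, so you could store all your personal stuff under a single repo.

Imagine if submodules were treated exactly like regular directories: just request a tree SHA, and a DNS-like mechanism resolves your request, looking first in your local ~/.git, then at closer servers (your enterprise's mirror/cache) and ending up on GitHub.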
