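By SuKai, August 4, 2021
During model development, training and test data matter a great deal to developers. How can data be versioned so that training is reproducible and datasets can be shared among developers? This post introduces Data Version Control (DVC), an open-source tool for versioning data.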
Data Version Control (DVC)
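DVC is usually used together with Git. Git stores the machine learning code and the DVC metadata files, while DVC stores the data files and model files on remote storage such as S3. Pushing and pulling data files with dvc feels as smooth as handling code files with git.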
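The split between Git and DVC can be illustrated with a small sketch. This is not DVC's implementation; it only mimics the idea: a data file is hashed, a copy is stored under a path derived from that hash, and a small pointer record (the kind of file Git would track) remembers the hash. The helper name `track_file`, the `cache` directory, and the JSON pointer format are assumptions made for illustration; real DVC writes YAML `.dvc` files.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def track_file(data_file: Path, cache_dir: Path) -> Path:
    """Copy data_file into a content-addressed cache and write a pointer file.

    Hypothetical helper: it mimics the idea behind DVC, not its real format.
    """
    digest = hashlib.md5(data_file.read_bytes()).hexdigest()
    # Store the content under <cache>/<first 2 hex chars>/<remaining chars>.
    target = cache_dir / digest[:2] / digest[2:]
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(data_file, target)
    # The small pointer file is what would be committed to Git.
    pointer = data_file.parent / (data_file.name + ".pointer.json")
    pointer.write_text(json.dumps({"md5": digest, "path": data_file.name}))
    return pointer


with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    data = workspace / "train.csv"
    data.write_text("id,label\n1,cat\n2,dog\n")  # made-up sample rows
    pointer_file = track_file(data, workspace / "cache")
    record = json.loads(pointer_file.read_text())
    cached_copy = workspace / "cache" / record["md5"][:2] / record["md5"][2:]
    print(record["path"], cached_copy.exists())
```

Because the pointer holds only a hash and a path, it stays tiny no matter how large the data file is, which is what makes it practical to keep in Git.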
Install DVC
pip install dvc -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install 'dvc[s3]' -i https://pypi.tuna.tsinghua.edu.cn/simple
Initialize DVC in the project's code directory
dvc init
Initialized DVC repository.
You can now commit the changes to git.
+---------------------------------------------------------------------+
| |
| DVC has enabled anonymous aggregate usage analytics. |
| Read the analytics documentation (and how to opt-out) here: |
| <https://dvc.org/doc/user-guide/analytics> |
| |
+---------------------------------------------------------------------+
What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <https://github.com/iterative/dvc>
Configure the DVC remote storage
dvc remote add -d minio s3://dataset-dvc/
dvc remote modify minio endpointurl http://s3.platform.sukai.com/
dvc remote modify minio access_key_id user
dvc remote modify minio secret_access_key password
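One caveat with the commands above: `dvc remote modify` writes to `.dvc/config`, which is committed to Git, so the access key and secret would end up in the repository history. DVC also accepts a `--local` flag that writes to `.dvc/config.local`, which Git does not track. The sketch below builds the same four commands but routes the two credential options through `--local` and reads them from environment variables. The helper name `remote_setup_commands` and the variable names `MINIO_ACCESS_KEY` and `MINIO_SECRET_KEY` are assumptions for illustration.

```python
import os
import shlex


def remote_setup_commands(name, url, endpoint, access_key, secret_key):
    """Return the dvc commands that configure an S3-compatible remote.

    Credentials use --local so they land in .dvc/config.local rather
    than in the Git-tracked .dvc/config.
    """
    return [
        ["dvc", "remote", "add", "-d", name, url],
        ["dvc", "remote", "modify", name, "endpointurl", endpoint],
        ["dvc", "remote", "modify", "--local", name, "access_key_id", access_key],
        ["dvc", "remote", "modify", "--local", name, "secret_access_key", secret_key],
    ]


commands = remote_setup_commands(
    name="minio",
    url="s3://dataset-dvc/",
    endpoint="http://s3.platform.sukai.com/",
    # Hypothetical environment variable names; fall back to placeholders.
    access_key=os.environ.get("MINIO_ACCESS_KEY", "user"),
    secret_key=os.environ.get("MINIO_SECRET_KEY", "password"),
)
for cmd in commands:
    # Dry run: show each command, hiding the credential values.
    shown = cmd[:-1] + ["***"] if "--local" in cmd else cmd
    print(shlex.join(shown))
```

To apply the commands instead of printing them, each list could be passed to `subprocess.run(cmd, check=True)` from inside the repository.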
Add the data files
dvc add datasets-dogs-cats
Adding...
Computing file/dir hashes (only done once)
100% Adding...|████████████████████████████████████████|1/1 [00:00, 6.12file/s]
To track the changes with git, run:
git add datasets-dogs-cats.dvc .gitignore
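When a whole directory is added, DVC records a single hash for it and keeps a manifest in its cache listing each file with its own hash. The following sketch builds such a manifest for a stand-in directory. It is a simplified illustration under stated assumptions: the function name `directory_manifest` is hypothetical, the file contents are placeholders, and the exact format DVC writes may differ.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def directory_manifest(root: Path) -> list:
    """List every file under root with its MD5, sorted by relative path.

    Simplified illustration of a per-directory manifest; the exact
    format DVC stores in its cache may differ.
    """
    entries = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        entries.append({
            "md5": hashlib.md5(path.read_bytes()).hexdigest(),
            "relpath": path.relative_to(root).as_posix(),
        })
    return entries


with tempfile.TemporaryDirectory() as tmp:
    dataset = Path(tmp) / "datasets-dogs-cats"
    dataset.mkdir()
    # Placeholder contents; the real CSV files are not shown in this post.
    (dataset / "dogs_vs_cats-train.csv").write_text("placeholder train rows\n")
    (dataset / "dogs_vs_cats-val.csv").write_text("placeholder val rows\n")
    manifest = directory_manifest(dataset)
    print(json.dumps(manifest, indent=2))
```

Changing any file changes its hash and therefore the manifest, which is how a modified dataset shows up as a small change to the `.dvc` file tracked by Git.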
Commit and push the DVC meta-files with Git
git add --all
git commit -m "control data with DVC"
git push
Push the data files with DVC
dvc push
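Conceptually, `dvc push` uploads the objects in the local cache that the remote does not have yet, so repeating it with unchanged data transfers nothing. The sketch below imitates that behaviour with two local directories standing in for the cache and the S3 bucket. The function name `push_missing` and the object names are hypothetical, and real DVC talks to the remote through its storage backends rather than copying files on disk.

```python
import shutil
import tempfile
from pathlib import Path


def push_missing(local_cache: Path, remote: Path) -> list:
    """Copy every object present in local_cache but absent from remote."""
    uploaded = []
    for obj in sorted(p for p in local_cache.rglob("*") if p.is_file()):
        rel = obj.relative_to(local_cache)
        dest = remote / rel
        if not dest.exists():
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copyfile(obj, dest)
            uploaded.append(rel.as_posix())
    return uploaded


with tempfile.TemporaryDirectory() as tmp:
    cache, remote = Path(tmp) / "cache", Path(tmp) / "remote"
    # Two made-up objects laid out as <2 chars>/<rest>.
    (cache / "ab").mkdir(parents=True)
    (cache / "ab" / "cdef").write_text("object one")
    (cache / "12").mkdir()
    (cache / "12" / "3456").write_text("object two")
    first = push_missing(cache, remote)
    second = push_missing(cache, remote)  # nothing new to send
    print(first, second)
```

The first call copies both objects and the second copies none, which mirrors why a repeated push of an unchanged dataset is cheap.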
List the files tracked by DVC
(base) jovyan@jupyter-0:~/ai-demo$ dvc list . -R --dvc-only
datasets-dogs-cats/dogs_vs_cats-train.csv
datasets-dogs-cats/dogs_vs_cats-val.csv
Use the dataset directly on another machine
sukai@sukai:~/ai/a$ dvc get https://git.dev.sukai.com/sukai/ai-demo.git datasets-dogs-cats
Cloning
Username for 'https://git.dev.sukai.com': sukai
Password for 'https://sukai@git.dev.sukai.com':
sukai@sukai:~/ai/a$ ls
datasets-dogs-cats
sukai@sukai:~/ai/a$ ls -al
total 12
drwxrwxr-x 3 sukai sukai 4096 Nov 16 11:34 .
drwxrwxr-x 9 sukai sukai 4096 Nov 16 11:31 ..
drwxrwxr-x 2 sukai sukai 4096 Nov 16 11:34 datasets-dogs-cats
Share the dataset among developers
sukai@sukai:~/ai$ git clone https://git.dev.sukai.com/sukai/ai-demo.git
Cloning into 'ai-demo'...
Username for 'https://git.dev.sukai.com': sukai
Password for 'https://sukai@git.dev.sukai.com':
remote: Enumerating objects: 178, done.
remote: Counting objects: 100% (175/175), done.
remote: Compressing objects: 100% (118/118), done.
remote: Total 178 (delta 76), reused 119 (delta 52), pack-reused 3
Receiving objects: 100% (178/178), 93.06 MiB | 40.86 MiB/s, done.
Resolving deltas: 100% (76/76), done.
sukai@sukai:~/ai$ cd ai-demo/
sukai@sukai:~/ai/ai-demo$ dvc pull
A datasets-dogs-cats/
1 file added and 2 files fetched
sukai@sukai:~/ai/ai-demo$