Making Machine Learning Datasets Reproducible and Shareable

SuKai August 4, 2021

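During model development, training data and test data matter a great deal to developers. How do you version the data so that training is reproducible and developers can share data with each other? This post introduces Data Version Control (DVC), an open-source tool for data version management.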

Data Version Control(DVC)

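DVC is usually used together with Git. Git stores the machine learning code and the DVC metadata files, while DVC stores the data files and model files in remote storage such as S3. Uploading and pulling data files with dvc feels as smooth as handling code files with git.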

flow.gif
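
The division of labor described above can be illustrated with a toy content-addressed store. This is a minimal sketch, not DVC's implementation: the function names `track` and `restore`, the `.ptr` pointer suffix, and the two-level cache layout are all assumptions made for illustration. The data file is copied into a cache keyed by its MD5 hash, and only a tiny pointer file (the kind of thing you would commit to Git) stays next to the code.

```python
import hashlib
import json
import shutil
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def track(data_file: Path, cache_dir: Path) -> Path:
    """Copy data_file into the cache under its hash and write a pointer file."""
    md5 = file_md5(data_file)
    # Hypothetical layout: first two hex characters become a subdirectory.
    cached = cache_dir / md5[:2] / md5[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():
        shutil.copyfile(data_file, cached)
    pointer = data_file.with_name(data_file.name + ".ptr")
    pointer.write_text(json.dumps({"md5": md5, "path": data_file.name}))
    return pointer


def restore(pointer: Path, cache_dir: Path) -> Path:
    """Recreate the data file next to its pointer from the cache."""
    meta = json.loads(pointer.read_text())
    cached = cache_dir / meta["md5"][:2] / meta["md5"][2:]
    target = pointer.with_name(meta["path"])
    shutil.copyfile(cached, target)
    return target
```

Because the pointer contains only a hash and a file name, it stays small no matter how large the dataset is, which is why it is suitable for Git while the data itself goes to separate storage.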

Install DVC

pip install dvc -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install 'dvc[s3]' -i https://pypi.tuna.tsinghua.edu.cn/simple

Initialize DVC in the project code directory

dvc init

Initialized DVC repository.

You can now commit the changes to git.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|             <https://dvc.org/doc/user-guide/analytics>              |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <https://github.com/iterative/dvc>

Configure DVC storage

dvc remote add -d minio s3://dataset-dvc/
dvc remote modify minio endpointurl http://s3.platform.sukai.com/
dvc remote modify minio access_key_id user
dvc remote modify minio secret_access_key password

Add data files

dvc add datasets-dogs-cats

100% Adding...|████████████████████████████████████████|1/1 [00:00,  6.12file/s]

To track the changes with git, run:

	git add datasets-dogs-cats.dvc .gitignore
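
When `dvc add` is given a directory, it records hashes for the files inside it, so a change to any file results in a different recorded version. The idea can be sketched as follows. This is an illustration under assumptions, not DVC's actual on-disk format, and `dir_manifest` and `diff_manifests` are hypothetical helper names.

```python
import hashlib
from pathlib import Path
from typing import Dict, List, Tuple


def dir_manifest(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to its hex MD5."""
    manifest = {}
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        rel = path.relative_to(root).as_posix()
        manifest[rel] = hashlib.md5(path.read_bytes()).hexdigest()
    return manifest


def diff_manifests(
    old: Dict[str, str], new: Dict[str, str]
) -> Tuple[List[str], List[str], List[str]]:
    """Return (added, removed, modified) relative paths, each sorted."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    modified = sorted(k for k in set(old) & set(new) if old[k] != new[k])
    return added, removed, modified
```

Comparing the manifest of a working directory with one recorded earlier shows which files were added, removed, or changed between two versions of a dataset.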

Push the DVC meta-files with Git

git add --all
git commit -m "control data with DVC"
git push

Push the data files with DVC

dvc push

List files with DVC

(base) jovyan@jupyter-0:~/ai-demo$ dvc list . -R --dvc-only
datasets-dogs-cats/dogs_vs_cats-train.csv
datasets-dogs-cats/dogs_vs_cats-val.csv

Use the dataset directly on another machine

sukai@sukai:~/ai/a$ dvc get https://git.dev.sukai.com/sukai/ai-demo.git datasets-dogs-cats
Cloning                                               |0.00 [00:00,      ?obj/s]
Username for 'https://git.dev.sukai.com': sukai
Password for 'https://sukai@git.dev.sukai.com':
sukai@sukai:~/ai/a$ ls
datasets-dogs-cats
sukai@sukai:~/ai/a$ ls -al
total 12
drwxrwxr-x 3 sukai sukai 4096 Nov 16 11:34 .
drwxrwxr-x 9 sukai sukai 4096 Nov 16 11:31 ..
drwxrwxr-x 2 sukai sukai 4096 Nov 16 11:34 datasets-dogs-cats
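
Besides `dvc get`, a training script can read a tracked file through DVC's Python API. The sketch below assumes that `dvc` (with the S3 extra) is installed, that the repository and remote used in this post are reachable with valid credentials, and that the CSV file starts with a header line. `count_rows` and `count_rows_from_dvc` are hypothetical helper names. The `rev` argument accepts a Git commit, tag, or branch, so pinning it makes every run read the same version of the data.

```python
import csv
from typing import Optional, TextIO


def count_rows(fobj: TextIO) -> int:
    """Count data rows in a CSV stream, excluding the header line."""
    reader = csv.reader(fobj)
    next(reader, None)  # skip the header row (assumed to be present)
    return sum(1 for _ in reader)


def count_rows_from_dvc(path: str, repo: str, rev: Optional[str] = None) -> int:
    """Open a DVC-tracked CSV from the given Git repo and count its rows."""
    import dvc.api  # imported lazily so count_rows works without dvc installed

    with dvc.api.open(path, repo=repo, rev=rev) as fobj:
        return count_rows(fobj)
```

For example, `count_rows_from_dvc("datasets-dogs-cats/dogs_vs_cats-train.csv", repo="https://git.dev.sukai.com/sukai/ai-demo.git")` would read the training CSV without first running `dvc get` or `dvc pull` by hand.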

Share the dataset among developers

sukai@sukai:~/ai$ git clone https://git.dev.sukai.com/sukai/ai-demo.git
Cloning into 'ai-demo'...
Username for 'https://git.dev.sukai.com': sukai
Password for 'https://sukai@git.dev.sukai.com':
remote: Enumerating objects: 178, done.
remote: Counting objects: 100% (175/175), done.
remote: Compressing objects: 100% (118/118), done.
remote: Total 178 (delta 76), reused 119 (delta 52), pack-reused 3
Receiving objects: 100% (178/178), 93.06 MiB | 40.86 MiB/s, done.
Resolving deltas: 100% (76/76), done.
sukai@sukai:~/ai$ cd ai-demo/
sukai@sukai:~/ai/ai-demo$ dvc pull
A       datasets-dogs-cats/
1 file added and 2 files fetched
sukai@sukai:~/ai/ai-demo$