Add dawnbench demo #14

Merged on Jul 28, 2020 (4 commits)
6 changes: 3 additions & 3 deletions config/samples/data_v1alpha1_alluxioruntime.yaml
@@ -24,22 +24,22 @@ spec:
high: "0.95"
low: "0.7"
storageType: Disk
- properteies:
+ properties:
alluxio.user.file.writetype.default: MUST_CACHE
alluxio.master.journal.folder: /journal
alluxio.master.journal.type: UFS
master:
replicas: 1
jvmOptions:
- "-Xmx4G"
- properteies: {}
+ properties: {}
ports: {}
resources: {}
worker:
replicas: 2
jvmOptions:
- "-Xmx4G"
- properteies: {}
+ properties: {}
ports: {}
resources: {}
fuse:
21 changes: 21 additions & 0 deletions dawnbench/imagenet/dataset.yaml
@@ -0,0 +1,21 @@
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
name: imagenet
#namespace: fluid-system
spec:
mounts:
- mountPoint: oss://<OSS_BUCKET>/<OSS_DIRECTORY>/
name: imagenet
options:
fs.oss.accessKeyId: <OSS_ACCESS_KEY_ID>
fs.oss.accessKeySecret: <OSS_ACCESS_KEY_SECRET>
fs.oss.endpoint: <OSS_ENDPOINT>
nodeAffinity:
required:
nodeSelectorTerms:
- matchExpressions:
- key: aliyun.accelerator/nvidia_name
operator: In
values:
- Tesla-V100-SXM2-16GB
114 changes: 114 additions & 0 deletions dawnbench/imagenet/runtime.yaml
@@ -0,0 +1,114 @@
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
name: imagenet
#namespace: fluid-system
spec:
# Add fields here
dataCopies: 3
alluxioVersion:
image: registry.cn-huhehaote.aliyuncs.com/alluxio/alluxio
imageTag: "2.3.0-SNAPSHOT-b7629dc"
imagePullPolicy: Always
tieredstore:
levels:
- mediumtype: MEM
path: /dev/shm
quota: 150Gi
high: "0.99"
low: "0.8"
storageType: Memory
properties:
# jni-fuse related configurations
alluxio.fuse.jnifuse.enabled: "true"
alluxio.user.client.cache.enabled: "false"
alluxio.user.client.cache.store.type: MEMORY
alluxio.user.client.cache.dir: /alluxio/ram
alluxio.user.client.cache.page.size: 512KB
alluxio.user.client.cache.size: 1800MB
# alluxio configurations
alluxio.user.block.worker.client.pool.min: "512"
alluxio.fuse.debug.enabled: "false"
alluxio.web.ui.enabled: "false"
alluxio.user.file.writetype.default: MUST_CACHE
alluxio.user.ufs.block.read.location.policy: alluxio.client.block.policy.LocalFirstAvoidEvictionPolicy
alluxio.user.block.write.location.policy.class: alluxio.client.block.policy.LocalFirstAvoidEvictionPolicy
alluxio.worker.allocator.class: alluxio.worker.block.allocator.GreedyAllocator
alluxio.user.block.size.bytes.default: 16MB
alluxio.user.streaming.reader.chunk.size.bytes: 32MB
alluxio.user.local.reader.chunk.size.bytes: 32MB
alluxio.worker.network.reader.buffer.size: 32MB
alluxio.worker.file.buffer.size: 320MB
alluxio.user.metrics.collection.enabled: "false"
alluxio.master.rpc.executor.max.pool.size: "10240"
alluxio.master.rpc.executor.core.pool.size: "128"
#alluxio.master.mount.table.root.readonly: "true"
alluxio.user.update.file.accesstime.disabled: "true"
alluxio.user.file.passive.cache.enabled: "false"
alluxio.user.block.avoid.eviction.policy.reserved.size.bytes: 2GB
alluxio.master.journal.folder: /journal
alluxio.master.journal.type: UFS
alluxio.user.block.master.client.pool.gc.threshold: 2day
alluxio.user.file.master.client.threads: "1024"
alluxio.user.block.master.client.threads: "1024"
alluxio.user.file.readtype.default: CACHE
alluxio.security.stale.channel.purge.interval: 365d
alluxio.user.metadata.cache.enabled: "true"
alluxio.user.metadata.cache.expiration.time: 2day
alluxio.user.metadata.cache.max.size: "1000000"
alluxio.user.direct.memory.io.enabled: "true"
alluxio.fuse.cached.paths.max: "1000000"
alluxio.job.worker.threadpool.size: "164"
alluxio.user.worker.list.refresh.interval: 2min
alluxio.user.logging.threshold: 1000ms
alluxio.fuse.logging.threshold: 1000ms
alluxio.worker.block.master.client.pool.size: "1024"
master:
replicas: 1
resources:
limits:
cpu: "50"
memory: "150G"
jvmOptions:
- "-Xmx6G"
- "-XX:+UnlockExperimentalVMOptions"
- "-XX:ActiveProcessorCount=8"
worker:
replicas: 4
resources:
limits:
cpu: "50"
memory: "150G"
Review comment (Collaborator): Why are the CPU and memory limits set so high?

Reply (Member, Author): The default CPU and memory limits are fairly low, which made training slow, so the limits were simply raised to high values. The parameters for each type of container will be tuned more finely later.

jvmOptions:
- "-Xmx12G"
- "-XX:+UnlockExperimentalVMOptions"
- "-XX:MaxDirectMemorySize=32g"
- "-XX:ActiveProcessorCount=8"
fuse:
image: registry.cn-huhehaote.aliyuncs.com/alluxio/alluxio-fuse
imageTag: "2.3.0-SNAPSHOT-b7629dc"
imagePullPolicy: Always
env:
MAX_IDLE_THREADS: "64"
Review comment (Collaborator): Is this setting still needed?

SPENT_TIME: "1000"
resources:
limits:
cpu: "50"
memory: "150G"
jvmOptions:
- "-Xmx16G"
- "-Xms16G"
- "-XX:+UseG1GC"
- "-XX:MaxDirectMemorySize=32g"
- "-XX:+UnlockExperimentalVMOptions"
- "-XX:ActiveProcessorCount=24"
- "-XX:+PrintGC"
- "-XX:+PrintGCDateStamps"
- "-XX:+PrintGCDetails"
- "-XX:+PrintGCTimeStamps"
# For now, only support local
shortCircuitPolicy: local
args:
- fuse
- --fuse-opts=kernel_cache,ro,max_read=131072,attr_timeout=7200,entry_timeout=7200
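
The review thread on the worker limits asks why every component is capped at 50 CPUs and 150G, and the author replies that the limits were raised wholesale and would be tuned per container later. A minimal sketch of what such tuning might look like follows. Every number here is a hypothetical placeholder derived only from the `jvmOptions` in this file, not a measured value, and the sketch assumes the `resources` field accepts the standard Kubernetes `requests`/`limits` pair.

```yaml
# Hypothetical per-component sizing: illustrative only, not benchmarked.
# Assumes `resources` follows the usual Kubernetes requests/limits shape.
spec:
  master:
    resources:
      requests:
        cpu: "4"
        memory: "8G"
      limits:
        cpu: "8"        # in line with -XX:ActiveProcessorCount=8
        memory: "8G"    # 6G heap (-Xmx6G) plus headroom
  worker:
    resources:
      requests:
        cpu: "4"
        memory: "48G"
      limits:
        cpu: "8"        # in line with -XX:ActiveProcessorCount=8
        memory: "48G"   # 12G heap + 32g direct memory + headroom;
                        # does NOT account for the 150Gi /dev/shm tier
  fuse:
    resources:
      requests:
        cpu: "12"
        memory: "52G"
      limits:
        cpu: "24"       # in line with -XX:ActiveProcessorCount=24
        memory: "52G"   # 16G heap + 32g direct memory + headroom
```

Whether the memory-backed tiered store counts against the worker container's memory limit depends on how `/dev/shm` is mounted, so the worker figure in particular would need to be checked against the actual deployment before use.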