
Parameter descriptions:

--model

--task

--tokenizer

--skip-tokenizer-init

--revision

--code-revision

--tokenizer-revision

--tokenizer-mode

--trust-remote-code

--allowed-local-media-path

--download-dir

--load-format

--config-format

--dtype

--kv-cache-dtype

--max-model-len

--guided-decoding-backend

--logits-processor-pattern

--model-impl

--distributed-executor-backend

--pipeline-parallel-size, -pp

--tensor-parallel-size, -tp

--max-parallel-loading-workers

--ray-workers-use-nsight

--block-size

--enable-prefix-caching, --no-enable-prefix-caching

--disable-sliding-window

--num-lookahead-slots

--seed

--swap-space

--cpu-offload-gb

--gpu-memory-utilization

--num-gpu-blocks-override

--max-num-batched-tokens

--max-num-partial-prefills

--max-long-partial-prefills

--long-prefill-token-threshold

--max-num-seqs

--max-logprobs

--disable-log-stats

--quantization, -q

--rope-scaling

--rope-theta

--hf-overrides

--enforce-eager

--max-seq-len-to-capture

--disable-custom-all-reduce

--tokenizer-pool-size

--tokenizer-pool-type

--tokenizer-pool-extra-config

--limit-mm-per-prompt

--mm-processor-kwargs

--disable-mm-preprocessor-cache

--enable-lora

--enable-lora-bias

--max-loras

--max-lora-rank

--lora-extra-vocab-size

--lora-dtype

--long-lora-scaling-factors

--max-cpu-loras

--fully-sharded-loras

--enable-prompt-adapter

--max-prompt-adapters

--max-prompt-adapter-token

--device

--num-scheduler-steps

--multi-step-stream-outputs

--scheduler-delay-factor

--enable-chunked-prefill

--speculative-model

--speculative-model-quantization

--num-speculative-tokens

--speculative-disable-mqa-scorer

--speculative-draft-tensor-parallel-size, -spec-draft-tp

--speculative-max-model-len

--speculative-disable-by-batch-size

--ngram-prompt-lookup-max

--ngram-prompt-lookup-min

--spec-decoding-acceptance-method

--typical-acceptance-sampler-posterior-threshold

--typical-acceptance-sampler-posterior-alpha

--disable-logprobs-during-spec-decoding

--model-loader-extra-config

--ignore-patterns

--preemption-mode

--served-model-name

--qlora-adapter-name-or-path

--show-hidden-metrics-for-version

--otlp-traces-endpoint

--collect-detailed-traces

--disable-async-output-proc

--scheduling-policy

--scheduler-cls

--override-neuron-config

--override-pooler-config

--compilation-config, -O

--kv-transfer-config

--worker-cls

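--generation-config - Description: Path to the folder containing the generation config.

--override-generation-config - Description: Overrides or sets the generation config, given in JSON format.

--enable-sleep-mode - Description: Enables sleep mode for the engine (supported only on the CUDA platform).

--calculate-kv-scales - Description: Enables dynamic calculation of k_scale and v_scale for the KV cache.

--additional-config - Description: Additional configuration for the specified platform, passed in JSON format.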
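
Several of the flags above take a JSON string as their value, which is easy to get wrong when quoting by hand. The following is a minimal sketch of one way to assemble such a command line in Python. The helper `build_server_command`, the model name `my-org/my-model`, and the override `{"temperature": 0.7}` are illustrative assumptions, not part of vLLM; the sketch only builds and prints the command, it does not launch a server.

```python
import json
import shlex


def build_server_command(model, generation_overrides, extra_flags=None):
    """Assemble an argv list for the OpenAI-compatible server (hypothetical helper)."""
    argv = ["python", "-m", "vllm.entrypoints.openai.api_server", "--model", model]
    # JSON-valued flags take one string argument, so serialize the dict first.
    argv += ["--override-generation-config", json.dumps(generation_overrides)]
    # Remaining flags: True means a bare boolean switch; other values are stringified.
    for flag, value in (extra_flags or {}).items():
        argv.append(flag)
        if value is not True:
            argv.append(str(value))
    return argv


command = build_server_command(
    "my-org/my-model",          # placeholder model name (assumption)
    {"temperature": 0.7},       # illustrative override (assumption)
    {"--max-model-len": 4096, "--enable-prefix-caching": True},
)

# shlex.join quotes the JSON argument so the line can be pasted into a shell.
print(shlex.join(command))
```

Passing the list to `subprocess.run(command)` instead of printing it would avoid shell quoting altogether, since each argument is delivered as-is.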