Commit 5ba07ba

Implemented MLTransform generate vocab Dataflow benchmark

1 parent 0471bb5
12 files changed: 861 additions & 2 deletions

File tree

.github/workflows/beam_Inference_Python_Benchmarks_Dataflow.yml

Lines changed: 12 additions & 0 deletions
```diff
@@ -94,6 +94,7 @@ jobs:
       ${{ github.workspace }}/.github/workflows/load-tests-pipeline-options/beam_Inference_Python_Benchmarks_Dataflow_VLLM_Gemma_Batch.txt
       ${{ github.workspace }}/.github/workflows/load-tests-pipeline-options/beam_Inference_Python_Benchmarks_Dataflow_Table_Row_Inference_Batch.txt
       ${{ github.workspace }}/.github/workflows/load-tests-pipeline-options/beam_Inference_Python_Benchmarks_Dataflow_Table_Row_Inference_Stream.txt
+      ${{ github.workspace }}/.github/workflows/load-tests-pipeline-options/beam_Inference_Python_Benchmarks_Dataflow_MLTransform_Generate_Vocab_Batch.txt
       # The env variables are created and populated in the test-arguments-action as "<github.job>_test_arguments_<argument_file_paths_index>"
     - name: get current time
       run: echo "NOW_UTC=$(date '+%m%d%H%M%S' --utc)" >> $GITHUB_ENV
```
```diff
@@ -214,3 +215,14 @@ jobs:
           -PpythonVersion=3.10 \
           -PloadTest.requirementsTxtFile=apache_beam/ml/inference/table_row_inference_requirements.txt \
           '-PloadTest.args=${{ env.beam_Inference_Python_Benchmarks_Dataflow_test_arguments_10 }} --autoscaling_algorithm=THROUGHPUT_BASED --max_num_workers=20 --metrics_table=result_table_row_inference_stream --influx_measurement=result_table_row_inference_stream --mode=streaming --input_subscription=projects/apache-beam-testing/subscriptions/table_row_inference_benchmark --window_size_sec=60 --trigger_interval_sec=30 --timeout_ms=900000 --output_table=apache-beam-testing:beam_run_inference.result_table_row_inference_stream_outputs --job_name=benchmark-tests-table-row-inference-stream-${{env.NOW_UTC}}'
+    - name: run MLTransform Generate Vocab Batch
+      uses: ./.github/actions/gradle-command-self-hosted-action
+      timeout-minutes: 180
+      with:
+        gradle-command: :sdks:python:apache_beam:testing:load_tests:run
+        arguments: |
+          -PloadTest.mainClass=apache_beam.testing.benchmarks.inference.mltransform_generate_vocab_benchmark \
+          -Prunner=DataflowRunner \
+          -PpythonVersion=3.10 \
+          -PloadTest.requirementsTxtFile=apache_beam/ml/transforms/mltransform_tests_requirements.txt \
+          '-PloadTest.args=${{ env.beam_Inference_Python_Benchmarks_Dataflow_test_arguments_11 }} --job_name=benchmark-tests-mltransform-generate-vocab-batch-${{env.NOW_UTC}}'
```
New file, 44 additions:

```
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

--project=apache-beam-testing
--region=us-central1
--runner=DataflowRunner
--temp_location=gs://temp-storage-for-perf-tests/loadtests
--staging_location=gs://temp-storage-for-perf-tests/loadtests
--machine_type=n1-standard-4
--disk_size_gb=100
--num_workers=8
--max_num_workers=16
--autoscaling_algorithm=THROUGHPUT_BASED
--worker_zone=us-central1-b
--sdk_location=container
--requirements_file=apache_beam/ml/transforms/mltransform_tests_requirements.txt
--input_options={}
--publish_to_big_query=true
--metrics_dataset=beam_run_inference
--metrics_table=mltransform_generate_vocab_batch
--influx_measurement=mltransform_generate_vocab_batch
--input_file=gs://apache-beam-ml/testing/inputs/sentences_50k.txt
--output_vocab=gs://temp-storage-for-perf-tests/mltransform/vocab_outputs/mltransform_generate_vocab_batch
--columns=text
--vocab_size=50000
--min_frequency=1
--lowercase=true
--tokenizer=whitespace
--oov_token=<UNK>
--input_expand_factor=1
```

.test-infra/tools/refresh_looker_metrics.py

Lines changed: 2 additions & 1 deletion
```diff
@@ -44,7 +44,8 @@
     ("85", ["268", "269", "270", "271", "272"]),  # PyTorch Sentiment Batch DistilBERT base uncased
     ("86", ["284", "285", "286", "287", "288"]),  # VLLM Batch Gemma
     ("96", ["270", "304", "305", "353", "354"]),  # Table Row Inference Sklearn Batch
-    ("106", ["355", "356", "357", "358", "359"])  # Table Row Inference Sklearn Streaming
+    ("106", ["355", "356", "357", "358", "359"]),  # Table Row Inference Sklearn Streaming
+    ("107", ["360", "361", "362", "363", "364"]),  # MLTransform Generate Vocab Batch
 ]

 def get_look(id: str) -> models.Look:
```
Lines changed: 130 additions & 0 deletions
New file (markdown), rendered below:

<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# MLTransform Examples

This directory contains Apache Beam examples for MLTransform pipelines.

## MLTransform - Generate Vocab (Batch only)

`mltransform_generate_vocab.py` builds a vocabulary artifact from batch input
rows using `MLTransform` + `ComputeAndApplyVocabulary`.

### What it does

1. Reads input rows from JSONL (`--input_file`) or BigQuery (`--input_table`).
2. Extracts the specified columns (`--columns`).
3. Normalizes text (trims whitespace, optional lowercasing).
4. Tokenizes text (`whitespace` or `regex` tokenizer).
5. Runs `ComputeAndApplyVocabulary` with top-k and min-frequency constraints.
6. Ensures `--oov_token` is included first.
7. Writes the vocabulary as one token per line.
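Although the real example relies on `MLTransform` with `ComputeAndApplyVocabulary`, the counting logic of the steps above can be sketched in plain Python. This is a simplified, hypothetical stand-in (whitespace tokenization only, `collections.Counter` instead of the Beam/TFT machinery; `build_vocab` is not part of the example):

```python
from collections import Counter

def build_vocab(rows, columns, vocab_size=50000, min_frequency=1,
                lowercase=True, oov_token="<UNK>"):
    # Count whitespace tokens across the selected columns,
    # skipping null/empty values (steps 2-4).
    counts = Counter()
    for row in rows:
        for col in columns:
            text = row.get(col)
            if not text:
                continue
            text = text.strip()
            if lowercase:
                text = text.lower()
            counts.update(text.split())
    # Apply min-frequency and top-k constraints (step 5),
    # most frequent tokens first.
    kept = [t for t, c in counts.most_common() if c >= min_frequency]
    # Reserve the OOV token as the first entry (steps 6-7).
    return [oov_token] + kept[:vocab_size]

rows = [
    {"id": "1", "text": "Beam beam ML pipeline"},
    {"id": "2", "text": "Beam pipeline dataflow"},
    {"id": "3", "text": "ML transform beam"},
    {"id": "4", "text": "vocab vocab vocab test"},
    {"id": "5", "text": "rare_token_once"},
    {"id": "6", "text": ""},
    {"id": "7", "text": None},
]
print(build_vocab(rows, ["text"], vocab_size=3, min_frequency=2))
# ['<UNK>', 'beam', 'vocab', 'ml']
```

On the sample rows, `min_frequency=2` with `vocab_size=3` yields the same four-line vocabulary the README documents under its output-format section.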
### Required arguments

- `--output_vocab`
- `--columns`
- and one of:
  - `--input_file`
  - `--input_table`

### Optional arguments

- `--vocab_size` (default: `50000`)
- `--min_frequency` (default: `1`)
- `--lowercase` (default: `true`)
- `--tokenizer` (`whitespace` or `regex`, default: `whitespace`)
- `--oov_token` (default: `<UNK>`)
- `--input_expand_factor` (default: `1`, useful for perf/load testing)
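The flags above could be wired up with `argparse` along these lines. This is a hypothetical sketch mirroring the documented defaults; the actual example parses its own Beam pipeline options, and `parse_args` here is not its real entry point:

```python
import argparse

def parse_args(argv=None):
    # Hypothetical parser mirroring the documented flags and defaults.
    p = argparse.ArgumentParser()
    src = p.add_mutually_exclusive_group(required=True)
    src.add_argument("--input_file")
    src.add_argument("--input_table")
    p.add_argument("--output_vocab", required=True)
    # Comma-separated column names become a list.
    p.add_argument("--columns", required=True, type=lambda s: s.split(","))
    p.add_argument("--vocab_size", type=int, default=50000)
    p.add_argument("--min_frequency", type=int, default=1)
    p.add_argument("--lowercase", type=lambda s: s.lower() == "true", default=True)
    p.add_argument("--tokenizer", choices=["whitespace", "regex"], default="whitespace")
    p.add_argument("--oov_token", default="<UNK>")
    p.add_argument("--input_expand_factor", type=int, default=1)
    return p.parse_args(argv)

args = parse_args([
    "--input_file=/tmp/input.jsonl",
    "--output_vocab=/tmp/vocab.txt",
    "--columns=text,category",
    "--vocab_size=5",
])
print(args.columns)    # ['text', 'category']
print(args.tokenizer)  # 'whitespace'
```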
### Local batch example

```sh
python -m apache_beam.examples.ml_transform.mltransform_generate_vocab \
  --input_file=/tmp/input.jsonl \
  --output_vocab=/tmp/vocab.txt \
  --columns=text,category \
  --vocab_size=5 \
  --min_frequency=1 \
  --lowercase=true \
  --tokenizer=whitespace \
  --oov_token='<UNK>' \
  --input_expand_factor=1 \
  --runner=DirectRunner
```

The `--oov_token` value is quoted so the shell does not treat `<` and `>` as redirections.
### Input format

JSONL input with object rows, for example:

```json
{"id":"1","text":"Beam beam ML pipeline"}
{"id":"2","text":"Beam pipeline dataflow"}
{"id":"3","text":"ML transform beam"}
{"id":"4","text":"vocab vocab vocab test"}
{"id":"5","text":"rare_token_once"}
{"id":"6","text":""}
{"id":"7","text":null}
```

The integration tests in `mltransform_generate_vocab_test.py` generate this
sample data programmatically.
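A plain-Python reader for rows like these might look as follows. This is a sketch: the actual pipeline reads input via Beam IO, and treating missing or `null` columns as empty strings is an assumption based on the sample data above:

```python
import json

def read_jsonl(lines, columns):
    # Parse each JSONL line and keep only the requested columns;
    # missing or null values become empty strings.
    for line in lines:
        if not line.strip():
            continue
        row = json.loads(line)
        yield {c: (row.get(c) or "") for c in columns}

lines = [
    '{"id":"1","text":"Beam beam ML pipeline"}',
    '{"id":"7","text":null}',
]
print(list(read_jsonl(lines, ["text"])))
# [{'text': 'Beam beam ML pipeline'}, {'text': ''}]
```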
### Output format

One token per line:

1. `oov_token` first
2. remaining tokens follow the vocabulary order produced by
   `ComputeAndApplyVocabulary`.

Example output:

```txt
<UNK>
beam
ml
```

For this sample and config:

```sh
--columns=text --min_frequency=2 --vocab_size=3
```

the expected output is:

```txt
<UNK>
beam
vocab
ml
```
### Empty vocabulary behavior

If all tokens are filtered out by `--min_frequency`, the pipeline writes only
the reserved `--oov_token` and logs a warning.
### Additional test datasets

Test data for the happy path and for null/empty/missing columns is generated
inline in `mltransform_generate_vocab_test.py`.
### Performance testing pattern

- Small local files: functional correctness and output-stability tests.
- Large GCS files (or a moderate file + `--input_expand_factor`): throughput/cost
  benchmarking on Dataflow.
130+
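The `--input_expand_factor` idea, replicating each row so a moderate file approximates a large dataset, can be sketched with a tiny generator. These are assumed semantics for illustration, not the benchmark's actual implementation; replication scales token frequencies uniformly, so relative vocabulary order is preserved:

```python
def expand_rows(rows, factor):
    # Yield each input row `factor` times, multiplying the
    # effective dataset size for load testing.
    for row in rows:
        for _ in range(factor):
            yield row

rows = [{"text": "beam ml"}, {"text": "vocab"}]
print(len(list(expand_rows(rows, 4))))  # 8
```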
