Commit 66cfd4c

Refactor vocab pipeline to use MLTransform ComputeAndApplyVocabulary
1 parent 4c13508 commit 66cfd4c

3 files changed

Lines changed: 14 additions & 3 deletions

.github/workflows/load-tests-pipeline-options/beam_Inference_Python_Benchmarks_Dataflow_MLTransform_Generate_Vocab_Batch.txt

Lines changed: 1 addition & 0 deletions
@@ -26,6 +26,7 @@
 --autoscaling_algorithm=THROUGHPUT_BASED
 --worker_zone=us-central1-b
 --sdk_location=container
+--requirements_file=apache_beam/ml/transforms/mltransform_tests_requirements.txt
 --input_options={}
 --publish_to_big_query=true
 --metrics_dataset=beam_run_inference

sdks/python/apache_beam/ml/transforms/mltransform_tests_requirements.txt

Lines changed: 12 additions & 3 deletions
@@ -14,8 +14,17 @@
 # limitations under the License.
 #

-# Keep this benchmark requirements minimal. The vocab benchmark implementation
-# does not depend on TensorFlow/TensorFlow Transform, and those packages can
-# force incompatible apache-beam constraints during CI resolution.
+# Keep this benchmark requirements focused and deterministic for Dataflow
+# workers. MLTransform TFT operations require a consistent TensorFlow Transform
+# dependency set; otherwise workers can crash-loop with pandas/numpy ABI
+# mismatches during SDK harness startup.
 google-cloud-monitoring>=2.27.0
+tensorflow_transform>=1.14.0,<1.15.0
+tensorflow-metadata>=1.14.0,<1.15.0
+tfx-bsl>=1.14.0,<1.15.0
+# tfx-bsl / tensorflow-transform rely on pandas 1.x with numpy 1.x.
+numpy<2
+pandas<2
+# tensorflow-transform expects dill but does not hard-pin it.
+dill
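A minimal sketch of how the pins above could be checked in a local environment before launching the Dataflow job. It is not part of this commit. The helper names (`satisfies`, `check_installed`) are hypothetical, and the version comparison is a deliberately simplified numeric one that ignores pre-release and local-version suffixes, so it is not a full PEP 440 implementation.

```python
import operator
from importlib import metadata

# Pins copied from mltransform_tests_requirements.txt as changed in this commit.
PINS = {
    "tensorflow_transform": ">=1.14.0,<1.15.0",
    "tensorflow-metadata": ">=1.14.0,<1.15.0",
    "tfx-bsl": ">=1.14.0,<1.15.0",
    "numpy": "<2",
    "pandas": "<2",
}

# Two-character operators come first so ">=" is not mistaken for ">".
_OPS = [
    (">=", operator.ge),
    ("<=", operator.le),
    ("==", operator.eq),
    (">", operator.gt),
    ("<", operator.lt),
]


def _key(version):
    """Turn '1.14.0' into (1, 14, 0), keeping only leading digits per part."""
    parts = []
    for piece in version.strip().split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def satisfies(version, spec):
    """Return True if `version` meets every comma-separated clause in `spec`."""
    for clause in spec.split(","):
        clause = clause.strip()
        for symbol, compare in _OPS:
            if clause.startswith(symbol):
                have, want = _key(version), _key(clause[len(symbol):])
                # Pad with zeros so that '1.14' and '1.14.0' compare equal.
                width = max(len(have), len(want))
                have += (0,) * (width - len(have))
                want += (0,) * (width - len(want))
                if not compare(have, want):
                    return False
                break
        else:
            raise ValueError(f"unsupported clause: {clause!r}")
    return True


def check_installed(pins):
    """Map each package name to 'ok', 'missing', or 'mismatch: <version>'."""
    report = {}
    for name, spec in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if satisfies(installed, spec) else f"mismatch: {installed}"
    return report
```

Calling `check_installed(PINS)` in the environment used to build the worker dependencies would flag, for example, a numpy 2.x install before it reaches the workers.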

sdks/python/apache_beam/testing/benchmarks/inference/mltransform_generate_vocab_benchmark.py

Lines changed: 1 addition & 0 deletions
@@ -39,6 +39,7 @@ def test(self):
     extra_opts = {
         'input_file': self.pipeline.get_option('input_file'),
         'output_vocab': self.pipeline.get_option('output_vocab'),
+        'artifact_location': self.pipeline.get_option('artifact_location'),
         'columns': self.pipeline.get_option('columns'),
         'vocab_size': self.pipeline.get_option('vocab_size'),
         'min_frequency': self.pipeline.get_option('min_frequency'),
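For readers unfamiliar with the transform named in the commit title, the following is a plain-Python sketch of what a compute-and-apply-vocabulary step does conceptually. It is not the Beam or TensorFlow Transform implementation. It assumes that the benchmark's `vocab_size` and `min_frequency` options act as a top-k limit and a frequency threshold, and the tie-breaking order (alphabetical) is an arbitrary choice made for this example. The function names are hypothetical.

```python
from collections import Counter


def compute_vocab(tokens, vocab_size=None, min_frequency=None):
    """Build a token -> index mapping, most frequent token first."""
    counts = Counter(tokens)
    # Sort by descending count; break ties alphabetically (an assumption).
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    if min_frequency is not None:
        ranked = [(tok, n) for tok, n in ranked if n >= min_frequency]
    if vocab_size is not None:
        ranked = ranked[:vocab_size]
    return {tok: index for index, (tok, _) in enumerate(ranked)}


def apply_vocab(tokens, vocab, default_value=-1):
    """Replace each token with its index; unknown tokens get default_value."""
    return [vocab.get(tok, default_value) for tok in tokens]


tokens = ["a", "b", "a", "c", "b", "a"]
vocab = compute_vocab(tokens, vocab_size=2, min_frequency=2)
indices = apply_vocab(tokens, vocab)
```

In the real pipeline the computed vocabulary is a persisted artifact, which is presumably why the benchmark now forwards an `artifact_location` option alongside the existing ones.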
