parquetFile = spark.read.parquet('traj_pred_bc_train_data_sampled/dt=2021-09-30/city_id=88/')
parquetFile.count()
parquetFile.take(2)
2. Read a Parquet file with pyarrow.parquet
import pyarrow.parquet as pq
pfile = pq.read_table(file_list[0])
print("Column names: {}".format(pfile.column_names))
print("Schema: {}".format(pfile.schema))
3. Parquet can also be read with Spark SQL
spark.sql('SELECT count(id) FROM parquet.`file:///tmp/hello_world_dataset`').collect()
spark.conf.set('spark.sql.sources.partitionOverwriteMode', 'dynamic')
train_data.coalesce(1).write.partitionBy('dt', 'city_id').mode('overwrite').parquet('./traj_pred_bc_train_data_sampled/')
Here train_data is a Spark DataFrame. Setting partitionOverwriteMode to 'dynamic' makes mode('overwrite') replace only the dt/city_id partitions present in train_data, instead of wiping the whole output directory.