Record spark two large tables join data tilt optimization

Updated to 24 days ago

aTable 700 million 100 million items on table Table a aid c1 c2 c3 Table b bid bvalue Requirements: Use c1 c2 c3 of table a to obtain the bid join (left join) of table b to expand table a 1. Write three left directlyjoinDiscover data tilt 2.c1 c2 c3 is only associated once as follows, and check it, and it is found that there will be no data tilt. df1=("select * from b") df2=("select * from a") df3=(df1,df2.c1=,left) ()3. Consider why there will be no data skew in one join, and there will be multiple joins. 4. Because()It will not use all the data to join, or show will only return part of the results, spark will not use all the data to calculate. Join multiple times, join the two tables first (the full amount of data), and then go to the next join. 5. Check the data situation in table a and find that there are many cases where c1 is null.select c1,count(1) as cnt from a group by c1 order by cnt desc;6. Solution. Although c1 c2 c3 is null, this row of data cannot be removed. Both tables are big tables. Therefore, split join Since logic c1 is null c2 c3 must be null, there is the following sql df1=("select * from a where c1 is not null") df2=(select * from b) df3=(df2,xx,xx)Three joins df4=("select * from a where c1 is null") df4=('xx',())Since bvalue is an array, here is an array Last df3 df4 union df5=(df4)7. Run the job for 16 minutes before tuning, and run 58 seconds after tuning.