Table a: 700 million rows
Table b: 100 million rows
Table a
aid c1 c2 c3
Table b
bid bvalue
Requirement: use c1, c2 and c3 of table a to look up the matching bid in table b (left join), so that table a is extended with the values from table b.
1. Writing the three left joins directly revealed data skew.
2. Joining only once (on c1), as follows, and checking the result with show() revealed no data skew.
df1 = spark.sql("select * from b")
df2 = spark.sql("select * from a")
df3 = df2.join(df1, df2.c1 == df1.bid, "left")
df3.show()
3. Consider why a single join shows no data skew while multiple joins do.
4. Because show() does not join all of the data: show only returns part of the result, so Spark does not compute over the full data set. With multiple joins, the first two tables are joined in full before the result goes into the next join.
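To make the effect of joining the full data concrete, here is a minimal pure-Python sketch of how a hash-partitioned shuffle assigns rows to partitions by join key. It is a simulation, not Spark code: the partition count, the key distribution and the helper name partition_of are all made up for illustration, and Python's built-in hash stands in for Spark's partitioner.

```python
from collections import Counter

NUM_PARTITIONS = 8  # hypothetical number of shuffle partitions


def partition_of(key, num_partitions=NUM_PARTITIONS):
    """Map a join key to a partition; equal keys always map to the same one."""
    return hash(key) % num_partitions


# Made-up data: 1,000 distinct keys plus 9,000 rows that share one key (None).
keys = list(range(1000)) + [None] * 9000

# Count how many rows each partition receives once all rows are shuffled.
sizes = Counter(partition_of(k) for k in keys)
largest = max(sizes.values())
smallest = min(sizes.values())
print(f"largest partition: {largest} rows, smallest: {smallest} rows")
```

Every row carrying the repeated key ends up in the same partition, so one task processes far more rows than the others. A partial computation that never touches that partition would not reveal the imbalance.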
5. Checking the data in table a shows that c1 is null in a large number of rows.
select c1, count(1) as cnt from a group by c1 order by cnt desc;
6. Solution. Even though c1, c2 and c3 are null, these rows cannot be dropped.
Both tables are large, so the join has to be split.
Logically, whenever c1 is null, c2 and c3 are null as well, which leads to the following SQL:
from pyspark.sql.functions import array
df1 = spark.sql("select * from a where c1 is not null")
df2 = spark.sql("select * from b")
df3 = df1.join(df2, xx, xx)  # three joins
df4 = spark.sql("select * from a where c1 is null")
df4 = df4.withColumn('xx', array())  # bvalue is an array, so fill in an array here
Finally, union df3 and df4:
df5 = df3.union(df4)
7. Before tuning the job ran for 16 minutes; after tuning it ran in 58 seconds.
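The check in step 5 and the split in step 6 can be tried on a tiny data set. The sketch below uses SQLite from Python's standard library in place of Spark so that it runs anywhere. The rows, the aliases b1, b2, b3 and n, and the output column names v1, v2, v3 are assumptions made for illustration. SQLite has no array type, so the null-key branch is padded with NULL (what an unmatched left join produces) rather than with an empty array as in the steps above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (aid INTEGER, c1 INTEGER, c2 INTEGER, c3 INTEGER);
    CREATE TABLE b (bid INTEGER, bvalue TEXT);
    -- Made-up rows; whenever c1 is NULL, c2 and c3 are NULL too.
    INSERT INTO a VALUES
        (1, 10, 20, 30), (2, 11, 40, NULL),
        (3, NULL, NULL, NULL), (4, NULL, NULL, NULL), (5, 99, NULL, NULL);
    INSERT INTO b VALUES (10, 'x'), (11, 'w'), (20, 'y'), (30, 'z');
""")

# Step 5: count rows per key; the most frequent key comes first.
key_counts = conn.execute(
    "SELECT c1, COUNT(1) AS cnt FROM a GROUP BY c1 ORDER BY cnt DESC"
).fetchall()

# Baseline: the three left joins written directly against all of table a.
DIRECT = """
    SELECT a.aid, b1.bvalue AS v1, b2.bvalue AS v2, b3.bvalue AS v3
    FROM a
    LEFT JOIN b AS b1 ON a.c1 = b1.bid
    LEFT JOIN b AS b2 ON a.c2 = b2.bid
    LEFT JOIN b AS b3 ON a.c3 = b3.bid
    ORDER BY 1
"""

# Step 6: join only the rows with a non-null c1, pad the null-key rows
# with placeholder columns, then union the two parts.
SPLIT = """
    SELECT n.aid, b1.bvalue AS v1, b2.bvalue AS v2, b3.bvalue AS v3
    FROM (SELECT * FROM a WHERE c1 IS NOT NULL) AS n
    LEFT JOIN b AS b1 ON n.c1 = b1.bid
    LEFT JOIN b AS b2 ON n.c2 = b2.bid
    LEFT JOIN b AS b3 ON n.c3 = b3.bid
    UNION ALL
    SELECT aid, NULL, NULL, NULL FROM a WHERE c1 IS NULL
    ORDER BY 1
"""

direct_rows = conn.execute(DIRECT).fetchall()
split_rows = conn.execute(SPLIT).fetchall()
print(key_counts[0])
print(direct_rows == split_rows)
```

On this data both queries return the same rows, which is the property the split relies on: rows with a null key can never match anything in table b, so they can bypass the join entirely and be appended afterwards.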