Pandas merge gives two different results with the same code and input data

by xpan   Last Updated September 14, 2018 18:26 PM

I have two dataframes to merge. When I run the program with the same input data and code, one of two things happens: either the merge succeeds, or the columns that come from 'annotate' are NaN in the merged data.

raw_df2 = pd.merge(annotate,raw_df,on='gene',how='right').fillna("unkown")

Then I have a test:

# Repeat the merge until at most 10000 rows are left unannotated
count = 10001
while count > 10000:
    raw_df2 = pd.merge(annotate, raw_df, on='gene', how='right').fillna("unkown")
    count = len(raw_df2[raw_df2["type"] == "unkown"])
    print(count)

If the merge fails, it keeps failing on every iteration for the rest of that run. I have to resubmit the script, and then the result may be successful.

[The first two columns are from 'annotate'; the others are from 'raw_df'.]
The failed result:

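+--------+---------------+--------------------------+----------+----------+--------+---------+----------+
|  type  |     gene      |          locus           | sample_1 | sample_2 | status | value_1 | value_2  |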
+--------+---------------+--------------------------+----------+----------+--------+---------+----------+
| unknow | 0610040J01Rik | chr5:63812494-63899619   | Ctrl     | SPION10  | OK     | 2.02125 | 0.652688 |
| unknow | 1110008F13Rik | chr2:156863121-156887078 | Ctrl     | SPION10  | OK     | 87.7115 |  49.8795 |
+--------+---------------+--------------------------+----------+----------+--------+---------+----------+

The successful result:

+--------+----------+------------------------+----------+----------+--------+----------+---------+
|  gene  |   type   |         locus          | sample_1 | sample_2 | status | value_1  | value_2 |
+--------+----------+------------------------+----------+----------+--------+----------+---------+
| St18   | misc_RNA | chr1:6487230-6860940   | Ctrl     | SPION10  | OK     |  1.90988 | 3.91643 |
| Arid5a | misc_RNA | chr1:36307732-36324029 | Ctrl     | SPION10  | OK     |  1.33796 | 2.21057 |
| Carf   | misc_RNA | chr1:60076867-60153953 | Ctrl     | SPION10  | OK     | 0.846988 | 1.47619 |
+--------+----------+------------------------+----------+----------+--------+----------+---------+
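One way to see which rows fail to match is the `indicator` parameter of `pd.merge`, which records for each row whether it was found in the left frame, the right frame, or both. The following is only a minimal sketch: the two frames are small made-up stand-ins that reuse the names `annotate` and `raw_df`, and the values are hypothetical, not the real data.

```python
import pandas as pd

# Hypothetical stand-ins for the two frames; the values are made up.
annotate = pd.DataFrame({
    "gene": ["St18", "Arid5a"],
    "type": ["misc_RNA", "misc_RNA"],
})
raw_df = pd.DataFrame({
    "gene": ["St18", "Arid5a", "Carf"],
    "value_1": [1.9, 1.3, 0.8],
})

# indicator=True adds a "_merge" column: "both", "left_only" or "right_only"
checked = pd.merge(annotate, raw_df, on="gene", how="right", indicator=True)

# Rows tagged "right_only" found no matching gene in annotate
unmatched = checked[checked["_merge"] == "right_only"]
print(unmatched["gene"].tolist())
```

If a failed run tags every row as "right_only", the 'gene' keys of the two frames do not match at all in that run, which points at how the frames were built rather than at the merge itself.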


Answers 1


I have a workaround, but I still don't know what caused the original problem. Set the column that I want to merge on as the index in both dataframes, then merge the two dataframes on the index. I ran the script more than 10 times and the result was no longer wrong.

import pandas as pd

# 'args' is assumed to hold the parsed command-line arguments
# the first dataframe
DataQiime = pd.read_csv(args.FileTranseq, header=None, sep=',')
# use a list here: a set has no guaranteed order, so the names could be swapped
DataQiime.columns = ['Feature.ID', 'Frequency']
DataQiime_index = DataQiime.set_index('Feature.ID', inplace=False, drop=True)
# the second dataframe
DataTranseq = pd.read_table(args.FileQiime, header=0, sep='\t', encoding='utf-8')
DataTranseq_index = DataTranseq.set_index('Feature.ID', inplace=False, drop=True)
# merge by index, using the frames that actually have 'Feature.ID' as their index
DataMerge = pd.merge(DataQiime_index, DataTranseq_index, left_index=True, right_index=True, how="inner")
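The same index-based merge can be tried on toy data without any input files. This is just a sketch under assumptions: only the column name 'Feature.ID' is taken from the code above, while the frame names, the 'Taxon' column and all values are hypothetical.

```python
import pandas as pd

# Hypothetical toy frames; only the column name 'Feature.ID' comes from the answer.
left = pd.DataFrame({"Feature.ID": ["f1", "f2", "f3"], "Frequency": [10, 20, 30]})
right = pd.DataFrame({"Feature.ID": ["f2", "f3", "f4"], "Taxon": ["a", "b", "c"]})

# Move the key column into the index of each frame
left_indexed = left.set_index("Feature.ID")
right_indexed = right.set_index("Feature.ID")

# An inner merge on the index keeps only the IDs present in both frames
merged = pd.merge(left_indexed, right_indexed,
                  left_index=True, right_index=True, how="inner")
print(merged)
```

Note that the merge has to be given the frames returned by `set_index`; the original frames still carry the default integer index.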
xpan
September 14, 2018 16:05 PM
