How to convert an XML file into a list of data frames in R
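I have a dataset from the HMDB Saliva Metabolites data, delivered as an XML file. I want to convert this XML file into a list of data frames in R, but I do not want every node to end up in the list.
Importing the file and converting it to a list: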
require(XML)
library("methods")
data <- xmlParse("D:/rout/to/my/downloaded/file/saliva_metabolites/saliva_metabolites.xml")
xml_data <- xmlToList(data)
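Now I am not sure how to select specific nodes. My goal is a list of metabolites in which each metabolite holds its own list of values: for every <metabolite>, the <accession> as a string, the <name> as a string, and all <synonym> entries as a data frame.
I looked at the question "More direct way to create a list of data frames from XML file?", but the link to the data in that question no longer works, and I do not know how to apply its code to my file.
I also tried the code from the question "xml to R dataframe" to select specific nodes, but it did not work: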
x <- lapply(data["//metabolite"],XML:::xmlAttrsToDataFrame)
But this gives me an empty list:
> x
list()
Any hints, references, or help would be greatly appreciated.
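Solution
I am not sure this is exactly what you are looking for, but here is a code example that handles the first three metabolites and two of their child nodes.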
library( xml2 )
library( magrittr ) #for pipe operator %>%
doc <- read_xml( "./temp/saliva_metabolites.xml" )
#get metabolite nodes (only first three used in this sample)
met.nodes <- xml_find_all( doc,".//d1:metabolite" )[1:3]
#XPath expressions for the child nodes that become data.frames
# (only two in this sample: secondary accessions and synonyms)
xpath_child.v <- c( "./d1:secondary_accessions/d1:accession","./d1:synonyms/d1:synonym" )
#what names should they get in the list?
child.names.v <- c( "secondary_accessions","synonyms" )
#first, loop over the met.nodes
L.sec_acc <- lapply( met.nodes, function(x) {
  #second, loop over the XPath expressions of the desired child nodes
  temp <- lapply( xpath_child.v, function(y) {
    xml_find_all( x, y ) %>% xml_text() %>% data.frame( value = . )
  })
  #set their names
  names(temp) <- child.names.v
  return(temp)
})
#set names of metabolites (the direct <name> child of each metabolite node)
names(L.sec_acc) <- xml_find_first( met.nodes, "./d1:name" ) %>% xml_text()
Output
# $`1-Methylhistidine`
# $`1-Methylhistidine`$secondary_accessions
# value
# 1 HMDB00001
# 2 HMDB0004935
# 3 HMDB0006703
# 4 HMDB0006704
# 5 HMDB04935
# 6 HMDB06703
# 7 HMDB06704
#
# $`1-Methylhistidine`$synonyms
# value
# 1 (2S)-2-amino-3-(1-Methyl-1H-imidazol-4-yl)propanoic acid
# 2 1-Methylhistidine
# 3 Pi-methylhistidine
# 4 (2S)-2-amino-3-(1-Methyl-1H-imidazol-4-yl)propanoate
# 5 1 Methylhistidine
# 6 1-Methyl histidine
# 7 1-Methyl-histidine
# 8 1-Methyl-L-histidine
# 9 1-MHis
# 10 1-N-Methyl-L-histidine
# 11 L-1-Methylhistidine
# 12 N1-Methyl-L-histidine
# 13 1-Methylhistidine dihydrochloride
#
#
# $`2-Ketobutyric acid`
# $`2-Ketobutyric acid`$secondary_accessions
# value
# 1 HMDB00005
# 2 HMDB0006544
# 3 HMDB06544
#
# $`2-Ketobutyric acid`$synonyms
# value
# 1 2-Ketobutanoic acid
# 2 2-Oxobutyric acid
# 3 3-Methyl pyruvic acid
# 4 alpha-Ketobutyrate
# 5 alpha-Ketobutyric acid
# 6 alpha-oxo-N-Butyric acid
# 7 2-Ketobutanoate
# 8 2-Ketobutyrate
# 9 2-Oxobutyrate
# 10 3-Methyl pyruvate
# 11 a-Ketobutyrate
# 12 a-Ketobutyric acid
# 13 a-ketobutyrate
# 14 a-ketobutyric acid
# 15 a-oxo-N-Butyrate
# 16 a-oxo-N-Butyric acid
# 17 alpha-oxo-N-Butyrate
# 18 a-oxo-N-butyrate
# 19 a-oxo-N-butyric acid
# 20 2-oxo-Butanoate
# 21 2-oxo-Butanoic acid
# 22 2-oxo-Butyrate
# 23 2-oxo-Butyric acid
# 24 2-oxo-N-Butyrate
# 25 2-oxo-N-Butyric acid
# 26 2-Oxobutanoate
# 27 2-Oxobutanoic acid
# 28 3-Methylpyruvate
# 29 3-Methylpyruvic acid
# 30 a-keto-N-Butyrate
# 31 a-keto-N-Butyric acid
# 32 a-Oxobutyrate
# 33 a-Oxobutyric acid
# 34 alpha-keto-N-Butyrate
# 35 alpha-keto-N-Butyric acid
# 36 alpha-Ketobutric acid
# 37 alpha-Oxobutyrate
# 38 alpha-Oxobutyric acid
# 39 Methyl-pyruvate
# 40 Methyl-pyruvic acid
# 41 Propionyl-formate
# 42 Propionyl-formic acid
# 43 alpha-Ketobutyric acid,sodium salt
#
#
# $`2-Hydroxybutyric acid`
# $`2-Hydroxybutyric acid`$secondary_accessions
# value
# 1 HMDB00008
#
# $`2-Hydroxybutyric acid`$synonyms
# value
# 1 2-Hydroxybutanoic acid
# 2 alpha-Hydroxybutanoic acid
# 3 alpha-Hydroxybutyric acid
# 4 2-Hydroxybutanoate
# 5 2-Hydroxybutyrate
# 6 a-Hydroxybutanoate
# 7 a-Hydroxybutanoic acid
# 8 alpha-Hydroxybutanoate
# 9 a-hydroxybutanoate
# 10 a-hydroxybutanoic acid
# 11 a-Hydroxybutyrate
# 12 a-Hydroxybutyric acid
# 13 alpha-Hydroxybutyrate
# 14 a-hydroxybutyrate
# 15 a-hydroxybutyric acid
# 16 (RS)-2-Hydroxybutyrate
# 17 (RS)-2-Hydroxybutyric acid
# 18 2-Hydroxy-butanoate
# 19 2-Hydroxy-butanoic acid
# 20 2-Hydroxy-DL-butyrate
# 21 2-Hydroxy-DL-butyric acid
# 22 2-Hydroxy-N-butyrate
# 23 2-Hydroxy-N-butyric acid
# 24 a-Hydroxy-N-butyrate
# 25 a-Hydroxy-N-butyric acid
# 26 alpha-Hydroxy-N-butyrate
# 27 alpha-Hydroxy-N-butyric acid
# 28 DL-2-Hydroxybutanoate
# 29 DL-2-Hydroxybutanoic acid
# 30 DL-a-Hydroxybutyrate
# 31 DL-a-Hydroxybutyric acid
# 32 DL-alpha-Hydroxybutyrate
# 33 DL-alpha-Hydroxybutyric acid
# 34 2-Hydroxybutyric acid,(R)-isomer
# 35 2-Hydroxybutyric acid,monosodium salt
# 36 2-Hydroxybutyric acid,(+-)-isomer
# 37 2-Hydroxybutyric acid,monosodium salt,(+-)-isomer
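A note on why the unprefixed query `//metabolite` in the question returns nothing while `d1:metabolite` works: the file declares a default XML namespace, and an unprefixed XPath name matches only elements that are in no namespace. The sketch below illustrates this on a small inline document. The namespace URI and the two sample records are assumptions made for the illustration, not an excerpt from the real file.

```r
library(xml2)

# Toy document with a default namespace (URI assumed for illustration)
doc <- read_xml('<hmdb xmlns="http://www.hmdb.ca">
  <metabolite><accession>HMDB0000001</accession><name>1-Methylhistidine</name></metabolite>
  <metabolite><accession>HMDB0000005</accession><name>2-Ketobutyric acid</name></metabolite>
</hmdb>')

# Unprefixed names only match elements without a namespace: nothing is found
no_prefix <- xml_find_all(doc, "//metabolite")
length(no_prefix)

# xml2 registers the default namespace under the prefix d1
xml_ns(doc)
with_prefix <- xml_find_all(doc, "//d1:metabolite")
length(with_prefix)

# Alternative: strip the default namespace, after which plain names match
xml_ns_strip(doc)
stripped <- xml_find_all(doc, "//metabolite")
length(stripped)
```

Calling `xml_ns_strip()` modifies the document in place, so it is probably best done once, right after reading the file, if you prefer to write XPath without the `d1:` prefix.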
Another option:
### packages
library(XML)
library(data.table)
library(dplyr)
### parse the XML into an internal document so that XPath can be used
xml <- xmlTreeParse("C://Users/.../saliva_metabolites/saliva_metabolites.xml", useInternalNodes = TRUE)
### get the context nodes; local-name() sidesteps the default namespace
ns <- getNodeSet(xml, "//*[local-name()='metabolite']")
### extract one row per metabolite and rbind the rows into a single table
df <- rbindlist(lapply(ns, function(x) {
  nm  <- xpathSApply(x, "(.//*[local-name()='name'])[1]", xmlValue)
  acc <- xpathSApply(x, "(.//*[local-name()='accession'])[1]", xmlValue)
  syn <- xpathSApply(x, "(.//*[local-name()='synonyms'])[1]/*", xmlValue)
  data.frame(name = nm, accession = acc, synonyms = paste(syn, collapse = '¤'))
}), fill = TRUE)
### put the synonyms of each row in a list column (not mandatory)
df$synonyms <- lapply(strsplit(as.character(df$synonyms), split = '¤'), trimws)
### use NA where a metabolite has no synonyms
df$synonyms <- lapply(df$synonyms, function(s) if (length(s) == 0) NA_character_ else s)
### export the data frames (one for each metabolite)
outp <- df %>% group_split(xml_pos = row_number())
Output (first 3 results):
[[1]]
# A tibble: 1 x 4
name accession synonyms xml_pos
<chr> <chr> <list> <int>
1 1-Methylhistidine HMDB0000001 <chr [13]> 1
[[2]]
# A tibble: 1 x 4
name accession synonyms xml_pos
<chr> <chr> <list> <int>
1 2-Ketobutyric acid HMDB0000005 <chr [43]> 2
[[3]]
# A tibble: 1 x 4
name accession synonyms xml_pos
<chr> <chr> <list> <int>
1 2-Hydroxybutyric acid HMDB0000008 <chr [37]> 3
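Neither answer produces exactly the shape the question describes: per metabolite, the accession as a string, the name as a string, and the synonyms as a data frame. A minimal sketch of that structure with xml2 might look like the following. The inline XML, the namespace URI, and the helper name `extract_metabolite` are all hypothetical and exist only to make the example self-contained; with the real file you would call `read_xml()` on its path instead.

```r
library(xml2)

# Small stand-in for the real file (structure and namespace URI are assumed)
doc <- read_xml('<hmdb xmlns="http://www.hmdb.ca">
  <metabolite>
    <accession>HMDB0000001</accession>
    <name>1-Methylhistidine</name>
    <synonyms><synonym>1-MHis</synonym><synonym>Pi-methylhistidine</synonym></synonyms>
  </metabolite>
  <metabolite>
    <accession>HMDB0000008</accession>
    <name>2-Hydroxybutyric acid</name>
    <synonyms/>
  </metabolite>
</hmdb>')

met_nodes <- xml_find_all(doc, "//d1:metabolite")

# Hypothetical helper: turn one <metabolite> node into a named list
extract_metabolite <- function(node) {
  list(
    accession = xml_text(xml_find_first(node, "./d1:accession")),
    name      = xml_text(xml_find_first(node, "./d1:name")),
    # a metabolite without synonyms yields a data frame with zero rows
    synonyms  = data.frame(
      value = xml_text(xml_find_all(node, "./d1:synonyms/d1:synonym")),
      stringsAsFactors = FALSE
    )
  )
}

metabolites <- lapply(met_nodes, extract_metabolite)
names(metabolites) <- vapply(metabolites, function(m) m$name, character(1))

str(metabolites[["1-Methylhistidine"]])
```

Using `./d1:accession` rather than `.//d1:accession` restricts the match to direct children, which keeps the primary accession from being confused with entries under `secondary_accessions`.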