# Charlop-Powers et al. 2014
Global Biogeography of Bacterial Secondary Metabolism

## Results + Discussion

### Summary

• First thing I noticed is they didn’t do 16S rRNA analysis, which I think they did for their older papers - maybe it doesn’t matter? (It doesn’t matter to me, it’s all Streptomyces sp. followed by a serial number….)
• I like the citizen science effort http://www.drugsfromdirt.org, seems like a neat way to gather dirt, but I wonder what that means for the samples collected from other countries? Rio accord etc.
• the sampling isn’t particularly deep - seems that some areas (biomes?) are sampled more heavily than others
• I guess they only selected 96 because that fits on a plate, and they used a combination of 8X12 barcodes for their amplification… lack of ambition
• I missed the first time that they combined reads from a previous publication, from biomes where they had already sequenced. Seems a bit of a cheat, because they already knew what was in those samples. Must be how they ended up with 185 biomes from 96 barcodes?
• amplicons clustered at 95% (first 97%, then representatives from each of these clusters were re-clustered at 95%, see Methods) -> Definition of OTU in this case, 95% distinct amplicons for each domain
• rarefaction curve goes up to 350,000 - I assume this is where they predict the curve hits an asymptote - > conclude they haven’t sampled enough.

### similarity metric of OTUs

• Jaccard distance on OTU composition
• Most samples had unique sets of OTUs = infer unique clusters

### Jaccard Distance

given
OTU set A = $\mathcal{A}$
OTU set B = $\mathcal{B}$

Jaccard index:

$$J(\mathcal{A}, \mathcal{B}) = \frac{|\mathcal{A}\bigcap\mathcal{B}|} {|\mathcal{A}\bigcup\mathcal{B}|}$$

has the nice property that:

$$0 \leq J(\mathcal{A},\mathcal{B}) \leq 1$$

the distance is then:

$$d_{J} = 1 - J(\mathcal{A}, \mathcal{B})$$
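
To make the definition concrete for myself, here is a minimal sketch of the index and distance on two sets of OTU identifiers. The OTU names and the two example samples are made up for illustration, and treating two empty samples as identical is my own convention, not something taken from the paper.

```python
def jaccard_index(a, b):
    """Jaccard index of two collections of OTU identifiers."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumption: two empty samples count as identical.
        return 1.0
    return len(a & b) / len(union)


def jaccard_distance(a, b):
    """Jaccard distance, d_J = 1 - J(A, B)."""
    return 1.0 - jaccard_index(a, b)


# Hypothetical samples: 2 shared OTUs out of 5 distinct OTUs in total.
sample_a = {"otu1", "otu2", "otu3", "otu4"}
sample_b = {"otu3", "otu4", "otu5"}

print(jaccard_index(sample_a, sample_b))
print(jaccard_distance(sample_a, sample_b))
```

Note that this only uses presence/absence of OTUs; read abundance plays no role in the metric.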

### Effect of Geographic Distance

• even using as little as 3% shared OTUs (clusters) between samples, could only find a link for the geographically closest samples
• reflects that they need more samples
• page 5, line 5 - how to justify that geographic distance drives differentiation? did they sample enough to make this claim?
• this paragraph is conflicting with previous statements. When they find unique OTU composition, it ends up being most similar to samples from other biomes (hotspring vs hotspring) and not nearby biomes (hotspring vs nearby dry soil)
• sometimes, similarity is reflected by biome, sometimes by geographic location?

### Estimates of Diversity

• how to measure OTUs?

### Chao1 metric

http://palaeo-electronica.org/2011_1/238/estimate.htm

• estimates number of OTUs
• unequal sampling density prevents comparisons
• ‘rarefaction’ helps interpolate
• choose a value (# of reads per sample) to rarefy to, paper chose 5000?
• can be by formula, or by resampling randomly
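
The "resampling randomly" option can be sketched as below: draw a fixed number of reads without replacement from a sample's OTU counts, so that every sample is compared at the same depth. This is my own illustration, not the paper's pipeline; the function name `rarefy`, the example counts, and the depth used are all hypothetical.

```python
import random
from collections import Counter


def rarefy(otu_counts, depth, seed=None):
    """Subsample `depth` reads without replacement from a dict of OTU -> read count."""
    # Expand the counts into one entry per read.
    pool = [otu for otu, n in otu_counts.items() for _ in range(n)]
    if depth > len(pool):
        raise ValueError("sample has fewer reads than the rarefaction depth")
    rng = random.Random(seed)
    return Counter(rng.sample(pool, depth))


# Hypothetical sample with 100 reads spread over 4 OTUs.
sample = {"otu1": 60, "otu2": 30, "otu3": 9, "otu4": 1}
subsample = rarefy(sample, depth=20, seed=0)

print(sum(subsample.values()))  # number of reads kept
print(len(subsample))           # number of OTUs still observed at this depth
```

Rare OTUs tend to drop out at low depth, which is why the chosen depth affects how many OTUs each sample appears to contain.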

“The typical way these estimators operate is by using the number of rare species that are found in a sample as a way of calculating how likely it is there are more undiscovered species.”

The estimate of the true number of species is the observed number of species plus the squared number of singletons ($F_{1}$, species seen exactly once) divided by twice the number of doubletons ($F_{2}$, species seen exactly twice)

$$S_{1} = S_{obs} + \frac{F_{1}^{2}}{2F_{2}}$$

The rationale is that as long as singletons are still being observed, there remain more unobserved species in the sample.
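
A minimal sketch of the estimator, to check my reading of it: observed OTUs plus the squared singleton count over twice the doubleton count. The fallback used when there are no doubletons is the commonly used bias-corrected form; that choice is my assumption and not something stated in the paper. The example counts are made up.

```python
def chao1(otu_counts):
    """Chao1 richness estimate from a list of per-OTU read counts."""
    counts = [c for c in otu_counts if c > 0]
    s_obs = len(counts)                      # OTUs actually observed
    f1 = sum(1 for c in counts if c == 1)    # singletons
    f2 = sum(1 for c in counts if c == 2)    # doubletons
    if f2 == 0:
        # Assumption: bias-corrected form avoids dividing by zero.
        return s_obs + f1 * (f1 - 1) / 2.0
    return s_obs + f1 ** 2 / (2.0 * f2)


# Hypothetical sample: 8 observed OTUs, 3 singletons, 2 doubletons.
counts = [10, 5, 1, 1, 1, 2, 2, 3]
print(chao1(counts))  # 8 + 3**2 / (2 * 2) = 10.25
```

With no singletons the estimate collapses to the observed count, which matches the rationale above: no rare OTUs, no evidence of unseen ones.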

### scaffold hotspots

This is interesting, and I wish they did more

particular soil samples can be shown to enrich for OTUs (domains) that map precisely back to specific scaffold clusters (Ansamycin, glycopeptides, etc)

• they’ve shown that they can get OTUs that map to congeners of different scaffolds from a single sample
• in what order does this proceed?
(Original producer -> naive resistant community -> recruitment of more diverse producers of different congeners -> diverse resistance in community?)
(Original producer -> diverse resistant community -> recruitment of diverse production?)

It seems stupid to me to map the hotspots of geographic latitudes. If they performed the same study but moved every sample site 100m in another direction, they would get a different set of hotspots. I think the distribution of hotspots was essentially random. Doesn’t take away from the idea that hotspots exist, I just don’t think it’s predictive.

### Diversity hotspots

I can’t say whether I think it’s true or not that different environments are better or worse for exploring biosynthetic diversity. Neither this paper nor their previous ones have convinced me of this.

Atlantic forest and Desert samples contain more biosynthetic diversity - I speculate that this claim would fall apart if they sampled more broadly. It’s certainly true of this set of samples, but I don’t think it’s a recipe for looking for diversity, it just happened to be true for lack of more samples.