What was the underlying training data distribution?

#2
by treehugg3 - opened

Many base models have been released, but few of them describe in depth what type of content the model saw in training or what filtering was applied to the dataset, either of which may have biased the model. Without that insight, we have to guess what a model might be good at by benchmarking its outputs alone. It would be really helpful if the model card, or a technical report about this project, clarified what the training data distribution is.

And of course, I appreciate this model was released without synthetic data.

yeah I'm looking forward to the tech report and hope there is info in there

+1

It doesn't have to be specific if that's an issue; even vague hints about what was picked or filtered would help.

First off, kudos to this base model. It will do coherent raw completions at 120K context. In 24GB VRAM! Most base models, even those that claim such long context, fall apart well before that.

Second, this is how it completed part of the last chapter of Journey to the Center of the Earth, at that long context:

So much honor necessarily created for him many envious enemies. Of
course they existed, and as his theories, supported by certain facts,
contradicted the system of science upon the question of central heat, he
was attacked by a number of learned men.

But the Professor was not a man to be easily silenced. He defended his
views with the greatest energy, and the question was warmly discussed.

It is quite unnecessary to say that the Professor was victorious. The
new theory was accepted, and the old exploded.

As for myself, I have become a professor of geology, and have married
Gretchen.

Hans, the guide, has returned to his native land, and has become a
prosperous farmer.

As for the compass, it is still in my possession. I have tried it
several times, and it invariably points to the north.

Nevertheless, I am convinced that it is wrong, and that it is a
strange, an extraordinary, and an inexplicable phenomenon.

But, after all, what matters it?

The journey is over.




End of Project Gutenberg's Journey to the Center of the Earth

"Project Gutenberg" was nowhere in the context! So it must be an emphasized part of the pretraining data, which is neat.

Hello guys, we will release the technical report ASAP.

The model seems to be trained mostly on math or physics questions, some including reasoning. It's clear that the model is substantially contaminated with synthetic data, too. If you prompt it with the start of a story, the continuation shows well-known early ChatGPT characteristics. When you prompt the model with the single character "A", here is what you get. Each sample below is the continuation only, so the leading "A" from the prompt is not repeated.

Math problem
man is 24 years older than his son. In two years, his age will be twice the age of his son. The present age of his son is:
A. 14 years
B. 18 years
C. 20 years
D. 22 years

Let the son's present age be $x$ years. Then, man's present age =$(x + 24)$ years.

$(x + 24) + 2 = 2(x + 2)$

$x + 26 = 2x + 4$

$x = 22$.

Answer

D
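For anyone who wants to check whether the arithmetic in this sample holds up, here is a minimal sketch. The helper `solve_linear` is a hypothetical name introduced only for this check; it is not part of the model output.

```python
from fractions import Fraction


def solve_linear(a_lhs, b_lhs, a_rhs, b_rhs):
    """Solve a_lhs*x + b_lhs = a_rhs*x + b_rhs for x (hypothetical helper)."""
    return Fraction(b_lhs - b_rhs, a_rhs - a_lhs)


# The sample's equation: (x + 24) + 2 = 2(x + 2), i.e. 1*x + 26 = 2*x + 4
son_age = solve_linear(1, 26, 2, 4)
father_age = son_age + 24

print("son:", son_age, "father:", father_age)
# Check the original condition: in two years the father is twice the son's age.
print("condition holds:", father_age + 2 == 2 * (son_age + 2))
```

The result agrees with option D in the sample, so this particular completion is at least internally consistent.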

Chinese multiple choice with some reasoning
、⾃汗盗汗

B、⾻蒸潮热

C、内热消渴

D、须发早⽩

E、肠燥便秘

本题考查补阴药⽯斛的主治病证。⽯斛⽢,微寒。归胃、肾经。功能益胃⽣津,滋阴除热,明⽬,强腰。其主治病证有:(1)热病伤津或胃阴不⾜的⾆⼲⼝燥,内热消渴(C对)。(2)阴虚虚热不退。(3)肾虚视物不清、腰膝软弱。⾃汗盗汗是补⽓药⽩术的主治病证(A错)。⾻蒸潮热是补阴药鳖甲的主治病证(B错)。须发早⽩是补⾎药何⾸乌的主治病证(D错)。肠燥便秘是补⾎药当归的主治病证(E错)。

结果如下:C

Multiple choice biology
.  The number of chromosomes in each cell doubles

B.  The number of chromosomes in each cell decreases by half

C.  The number of cells increases by half

D.  The number of cells doubles

Correct Answer:

B.  The number of chromosomes in each cell decreases by half

Math reasoning
: No, the number 6,532,114 is not a perfect square.

Why is 6,532,114 not a perfect square?

A perfect square is a number that can be expressed as the product of two equal integers.

The only way to accurately calculate if a number is a perfect square is to find the factors. Before we go through the trouble of finding the factors, there is a quick trick you can use to help determine if you need even need to do the extra work.

Try these steps first:

A number that is a perfect square never ends in 2, 3, 7 or 8. If your number ends in any of those numbers, you can stop here because your number is not a perfect square.

Obtain the digital root of the number. The digital root essentially is the sum of all of the digits. If you're lost, don't worry, we'll go over each step in more detail below.

All possible numbers that are a perfect square have a digital root of 1, 4, 7, 9.

Let's try it...

Step 1:

What is the last number of 6,532,114? It is this number: 6532114. The answer is 4. Is 4 in the list of numbers that are never perfect squares (2, 3, 7 or 8)?

Answer: NO, 4 is not in the list of numbers that are never perfect squares. Let's continue to the next step.

Step 2:

We now need to obtain the digital root of the number. Here's how you do it:

Split the number up and add each digit together:

6 + 5 + 3 + 2 + 1 + 1 + 4 = 22

If the answer is more than one digit, you would add each digit of the answer together again:

2 + 2 = 4

What is the digital root of number 6,532,114?

Answer: 4

Step 3:

So now we know the digital root of 6,532,114 is 4. Is 4 in the list of digital roots that are always a square root (1, 4, 7 or 9)?

Answer: YES, 4 is in the list of digital roots that are always perfect squares. We can conclude that 6,532,114 could be a perfect square!

Factoring

OK, so now we know that 6,532,114 could be a perfect square. We have to find the factors of the number to be sure.

Here are all of the factors of 6,532,114:

1 x 6,532,114 2 x 3,266,057 19 x 343,795 38 x 171,897

We're looking for a factor combination with equal numbers for X and Y (like 3x3) above. Notice there isn't a equal factor combination, that when multiplied together, produce the number 6,532,114. That means 6,532,114 is NOT a perfect square.
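A quick way to sanity-check this sample is to run the same three tests it describes (last digit, digital root, and an exact square test) in code. This is only a sketch; `digital_root` and `is_perfect_square` are names made up for this check. It also tests whether 19 really divides the number, since the factor list in the sample includes 19 and 38.

```python
import math


def digital_root(n):
    """Repeatedly sum the decimal digits until a single digit remains."""
    while n >= 10:
        n = sum(int(d) for d in str(n))
    return n


def is_perfect_square(n):
    """Exact integer test: compare n with the square of its integer square root."""
    r = math.isqrt(n)
    return r * r == n


n = 6_532_114
print("last digit:", n % 10)
print("digital root:", digital_root(n))
print("perfect square:", is_perfect_square(n))
print("remainder mod 19:", n % 19)
```

The conclusion in the sample (not a perfect square) is right, and the digit checks match, but the remainder mod 19 is nonzero, so the factor pairs with 19 and 38 look made up.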

Math multiple choice
) $\frac{31}{14}$

B) $\frac{25}{14}$

C) $\frac{15}{14}$

D) $\frac{13}{14}$

Solution :

(b): ${{\left( 0.85 \right)}^{2}}=\left( 0.85+0.15 \right)\left( 0.85-0.15 \right)$ $=\frac{15}{100}\left( 1 \right)=\frac{3}{20}$ $\therefore$ $1-\frac{3}{20}=\frac{17}{20}$

Math multiple choice
200 m long train crosses a platform double its length in 36 s. What is the speed of the train in km/h?
  1. 60 km/h

  2. 48 km/h

  3. 64 km/h

  4. 66 km/h

  5. None of these

Let the length of the train be $L = 200$ m. Then the length of the platform is $2L = 2×200 = 400$ m. The total distance covered by the train to cross the platform is the sum of the length of the train and the platform, i.e., $200 + 400 = 600$ m. Time taken is $36$ s. Speed = Distance / Time = $600 / 36$ m/s. To convert this speed from m/s to km/h, we multiply by $18/5$ (since $1$ m/s = $3.6$ km/h = $18/5$ km/h). So, speed = $600 / 36 × 18 / 5 = 60$ km/h.

Answer

  1. 60 km/h
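The unit conversion in this sample can be checked the same way. A minimal sketch, with variable names chosen here for illustration:

```python
from fractions import Fraction

train_m = 200                 # train length from the question
platform_m = 2 * train_m      # platform is double the train's length
seconds = 36                  # time to cross the platform

# The train must cover its own length plus the platform length.
speed_ms = Fraction(train_m + platform_m, seconds)

# 1 m/s = 3.6 km/h = 18/5 km/h
speed_kmh = speed_ms * Fraction(18, 5)

print("speed in m/s:", speed_ms)
print("speed in km/h:", speed_kmh)
```

This one agrees with the answer given in the sample.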

Oops, some of my own rationale is a little wrong. I can get other types of outputs using other starting prompts. I still see a lot of math, finance and physics questions and LLM-like instructions though. Definitely contaminated.
