Why are you making a PS3 emulator?
#31
Just FYI, Nekotekina's SPU recompiler has been merged now, and when my PR is merged there will be an option to switch between that and the interpreter in the GUI.
#32
(04-23-2014, 01:51 AM)derpf Wrote:
(04-22-2014, 04:11 PM)ssshadow Wrote: and according to AlexAltea, the rpcs3 interpreter is not just horribly slow, it's even slower than it could be.

It's not even the interpreter ATM, it's the memory manager, which is horrendously slow. Given that you can have many memory accesses per PPU instruction, it's not really a surprise that it tanks speed so much. It will be optimized in the future, though.

(The PPU interpreter just kind of sucks in general because there's a lot of spaghetti and indirection going on. It's not very cache-friendly at all, either.)

You have a memory manager which "interprets" memory accesses? :P

I'm not sure, but it seems you use a lot of dynamic polymorphism for PPU and SPU threads, with virtual methods. If that is the case, no wonder the interpreter is slow. Unless you are trying to mix two different concepts into one (a win32 thread to represent a PPU/SPU thread, and the PPU/SPU interpreter itself), I see no point in such polymorphism. Even then, you might not need that kind of polymorphism...

Try the curiously recurring template pattern (CRTP) as much as possible instead (http://en.wikipedia.org/wiki/Curiously_r...te_pattern).
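
For illustration, here is a rough sketch of the difference; the class and method names are made up for the example, not rpcs3's actual ones:

Code:
#include <cstdint>

// Virtual dispatch: every step() goes through the vtable, so the hot loop
// pays an indirect call per interpreted instruction.
struct CpuThreadVirtual
{
    virtual ~CpuThreadVirtual() = default;
    virtual void step() = 0;   // interpret one instruction
    void run(std::uint64_t n) { while (n--) step(); }
};

// CRTP: the base class knows the derived type at compile time, so step()
// is resolved statically and can be inlined -- no vtable lookup per call.
template <typename Derived>
struct CpuThread
{
    void run(std::uint64_t n)
    {
        while (n--) static_cast<Derived*>(this)->step();
    }
};

struct PpuThread : CpuThread<PpuThread>
{
    void step() { /* decode and interpret one PPU instruction */ }
};

struct SpuThread : CpuThread<SpuThread>
{
    void step() { /* decode and interpret one SPU instruction */ }
};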
#33
Hah, well, keep up the good work, guys! I'd use rpcs3 whether it has a crappy interpreter, a dynarec, a virtualizer, or whatever.

Btw, I am not a fan of FF13 anyway... I'd be content if rpcs3 could get in-game with something like Rockband or Little Big Planet. That will happen much sooner than FF13, almost guaranteed.

A question though: why don't you guys cache the interpreter? Wouldn't that speed it up for now, until the recompilers are finished for everything else? From using a few other emulators, I can say that a cached interpreter runs quite a bit faster than a pure interpreter (one that just interprets and doesn't cache any decoded instructions). The SuperN64 emulator for Android has three options for its execution core:

1. Pure interpreter.
2. Cached interpreter.
3. Dynamic recompiler.

Pure gets 8 FPS in Super Mario 64 on a Galaxy S5,
Cached gets 13-26 FPS,
and Dynamic gets a steady 60 or so.

Basically, cached is noticeably faster, so wouldn't that be an option (last I knew, RPCS3 doesn't have a cached one)?
#34
(04-23-2014, 06:38 PM)androidlover Wrote: A question though: why don't you guys cache the interpreter? [...] Basically, cached is noticeably faster, so wouldn't that be an option (last I knew, RPCS3 doesn't have a cached one)?
This is a great idea, mate.
i7 3930K overclocked to 4.7 GHz
2x R9 295X2
32 GB DDR3 RAM overclocked to 3000 MHz
Asus Sabertooth X79
XSPC Raystorm EX560 Extreme water cooling kit
4K monitor
Corsair RGB mouse and keyboard
Best sound system by Sony

Late 2015: Updated my specs :)
#35
How do you even make a cached interpreter? Any part of the game code that doesn't have the same input and output every time would have to be recalculated anyway, and that is probably 90 percent of the code, I would guess. The performance gains would be minimal, and you would have to check every variable value every time, etc...
Asus N55SF, i7-2670QM (~2.8 GHz under typical load), GeForce GT 555M (OpenGL only)
#36
(04-24-2014, 08:20 AM)ssshadow Wrote: How do you even make a cached interpreter? Any part of the game code that doesn't have the same input and output every time would have to be recalculated anyway, and that is probably 90 percent of the code, I would guess. The performance gains would be minimal, and you would have to check every variable value every time, etc...

If only those damn scientists would just solve the halting problem, so the rest of us could have GTA V at >9000 FPS on the PS3 on PC.

But seriously, do explain what you mean by "caching interpreter", will you?
#37
I'm not sure what a caching interpreter is. But it may be a form of JIT (like the one in Xenia) which reduces the overhead of decoding a contiguous run of instructions (usually called a basic block) by decoding it only once, then executing it every time the address is hit by the instruction fetcher. On some architectures a special instruction must be used to flush the icache, which also gives you a hook to discard a basic block from the instruction cache this way (e.g., for self-modifying code).

The simplest way is probably something like this:

Code:
#include <cstdint>
#include <unordered_map>
#include <vector>

using u32 = std::uint32_t;

struct Context
{
    u32 pc;                  // program counter
    bool no_external_event;  // cleared when an interrupt/event arrives
    // ... registers, etc.
};

// One pre-decoded instruction: a handler the interpreter calls directly.
struct insn { void (*interpret_insn)(Context& context); };

// Basic blocks that have already been decoded, keyed by start address.
std::unordered_map<u32, std::vector<insn>> insn_cache;

// Decodes from pc up to (and including) the next branch-like instruction.
std::vector<insn> decode_insn(u32 pc);

void run(Context& context)
{
    do
    {
        auto bb = insn_cache.find(context.pc);
        if (bb == insn_cache.end())
        {
            // First visit: decode the whole block once and cache it.
            bb = insn_cache.emplace(context.pc, decode_insn(context.pc)).first;
        }
        // Execute the cached, already-decoded block.
        for (auto& i : bb->second)
            i.interpret_insn(context);
    }
    while (context.no_external_event);
}

P.S.: decode_insn returns a vector of decoded insns representing a basic block (there is no branch-like instruction in the block except for the last one).
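
For completeness, decode_insn could look roughly like this; fetch_opcode, is_branch and lookup_handler are hypothetical helpers for the sketch, not actual rpcs3 functions:

Code:
// Hypothetical helpers, not a real ISA decoder:
u32 fetch_opcode(u32 address);             // read one opcode from guest memory
bool is_branch(u32 opcode);                // branch-like instruction?
using handler_fn = void(*)(Context&);
handler_fn lookup_handler(u32 opcode);     // map an opcode to its handler

std::vector<insn> decode_insn(u32 pc)
{
    std::vector<insn> block;
    for (;;)
    {
        const u32 opcode = fetch_opcode(pc);
        block.push_back(insn{ lookup_handler(opcode) });
        if (is_branch(opcode))
            break;       // a basic block ends at the first branch
        pc += 4;         // fixed-width instructions (true for PPU/SPU)
    }
    return block;
}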
#38
(04-24-2014, 08:42 PM)hlide Wrote: I'm not sure what a caching interpreter is. But it may be a form of JIT (like the one in Xenia) which reduces the overhead of decoding a contiguous run of instructions (usually called a basic block) by decoding it only once, then executing it every time the address is hit by the instruction fetcher. [...]

It looks like all you're doing is caching the decoded instructions and storing them as function objects or something. At least, that's the only thing I could make out of it. I doubt the benefit would be worth the memory it uses. :P

A JIT, instead, would take a basic block or an entire procedure, recompile it to the target ISA, and cache that code so it can simply be run. (And indeed that is a great goal to have -- one rpcs3 will pursue in the future. :D)
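
To make the contrast concrete, here is a minimal sketch of that dispatch loop, reusing the Context and u32 types from the cached-interpreter example above; compile_block is hypothetical and stands in for a real code emitter:

Code:
#include <unordered_map>

// Hypothetical: compile the guest block starting at pc into native x86-64
// code in executable memory, and return a pointer to call it through.
using native_fn = void(*)(Context&);
native_fn compile_block(u32 pc);

std::unordered_map<u32, native_fn> jit_cache;

void run_jit(Context& context)
{
    do
    {
        auto it = jit_cache.find(context.pc);
        if (it == jit_cache.end())
        {
            // Compile once; the emitted code updates context.pc itself.
            it = jit_cache.emplace(context.pc, compile_block(context.pc)).first;
        }
        it->second(context);   // jump straight into cached native code
    }
    while (context.no_external_event);
}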
#39
Don't know exactly what a "cached interpreter" is, but someone mentioned it here in an old thread: http://www.emunewz.net/forum/showthread.php?tid=158608

In post #6, "Chalking Fenterbyte" mentions the same "cached interpreter" idea, but AlexAltea says there's no need to improve the interpreter and that he would rather work on the dynamic recompiler.

I also wonder what caching an interpreter means and how it is supposed to improve speed.
#40
(04-25-2014, 01:18 AM)derpf Wrote: It looks like all you're doing is caching the decoded instructions and storing them as function objects or something. At least, that's the only thing I could make out of it. I doubt the benefit would be worth the memory it uses. :P

A JIT, instead, would take a basic block or an entire procedure, recompile it to the target ISA, and cache that code so it can simply be run. (And indeed that is a great goal to have -- one rpcs3 will pursue in the future. :D)

That's an oversimplified example, and indeed it lacks at least pre-decoded arguments (register indexes, immediates) so the handler doesn't have to re-decode the opcode when interpreting an instruction. You can go further by forming superblocks instead of basic blocks. It will be faster than a plain interpreter, at the cost of more memory. The same principle applies to a JIT, where the first backend may be an interpreter (useful for designing and debugging the JIT) and new backends are added later to produce blocks of native instructions that run directly.
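
For instance, the insn struct could carry its operands decoded up front; the field names and the gpr register file below are assumptions for illustration, not actual rpcs3 structures:

Code:
#include <cstdint>

using u8  = std::uint8_t;
using u32 = std::uint32_t;

struct Context { u32 pc; u32 gpr[32]; };   // minimal guest state for the example

// An insn that carries its operands, decoded once when the block is built.
struct insn
{
    void (*interpret_insn)(Context&, const insn&);
    u8  rd, ra, rb;   // pre-decoded register indexes
    u32 imm;          // pre-decoded immediate, if any
};

// Example handler: no opcode bit-twiddling left at execution time.
void interpret_add(Context& context, const insn& i)
{
    context.gpr[i.rd] = context.gpr[i.ra] + context.gpr[i.rb];
}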

Xenia has both backends: an architecture-independent one similar to what I described above, and an x64 one. The first mostly helps with designing and debugging the JIT (there are several passes which try to optimize the "produced code"). But I was told by Vanik that the interpreter backend was faster than what AsmJit produced; for that reason he simply ditched AsmJit and wrote his own JIT on top of xbyak (x64).

